In a previous post I introduced the concept of cross-validation as a resampling technique. In particular, cross-validation is useful for estimating the test error of a particular model fit, either to evaluate its performance or to decide on an optimal level of flexibility. Cross-validation can also be used to select among many candidate models by comparing their estimated test errors.
Depending on the problem and data at hand, it is sometimes ideal to first split the available data into training and test sets, perform cross-validation on only the training set, and reserve the test set for reporting, since the test set then contains data that has not influenced model training at all. Typically, the mean and standard deviation of the test error are reported.
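As a minimal sketch of that workflow (using the Iris data and an arbitrary classifier purely for illustration; the split size, estimator, and number of folds here are my own assumptions), it might look something like the following.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, train_test_split
# Holding out a test set that never influences model training
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Cross-validating on the training set only
clf = LinearDiscriminantAnalysis()
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)
cv_scores.mean(), cv_scores.std()  # mean and standard deviation of the CV scores
# Fitting on the full training set, then reporting performance on the held-out test set
clf.fit(X_train, y_train)
clf.score(X_test, y_test)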
In the below examples, I am only interested in the various cross-validation procedures, and will thus not first split my data into training and test sets. Instead, I will treat my entire dataset as a training set, and will draw various training and validation sets, where the validation set is drawn from the training set.
In a future post, I will explore nested cross-validation, a technique used to simultaneously perform model hyperparameter tuning and evaluation.
Iris dataset
For the below examples, I will train a number of classifiers on the well known Iris dataset. The Iris dataset presents a simple and easy classification task, and is ubiquitous across machine-learning examples.
Let’s first create a simple scatter plot of the Iris data to visualize the three target classes.
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris
# Loading Iris as pandas DataFrame
iris = load_iris(as_frame=True)
# Creating DataFrame to match up target label with target_name
iris_target_names = pd.DataFrame(data=dict(target=[0, 1, 2],
target_name=iris.target_names))
# Merging predictor and response DataFrames, via left join
data = (iris["frame"]
.merge(right=iris_target_names,
on="target",
how="left"))
# Creating scatter plot
sns.scatterplot(x="sepal length (cm)",
y="sepal width (cm)",
hue="target_name",
data=data)
Note the Iris dataset contains four predictor fields in total; petal length (cm) and petal width (cm) are not shown on the above two-dimensional scatter plot.
Validation set approach
First, let’s explore the simple validation set approach. For detail on this approach, please see my earlier post.
Using scikit-learn, we can split our data into training and test sets using the train_test_split
function. Note, under this simple approach, the terms validation set and test set are often used interchangeably. We’ll train our classifier using the method of linear discriminant analysis.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
# Loading Iris again, returning X and y as NumPy arrays
X, y = load_iris(return_X_y=True)
# Simple validation set split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.25,
random_state=1234,
shuffle=True)
# Initializing LDA estimator and fitting on training observations
lda = LinearDiscriminantAnalysis()
lda.fit(X=X_train, y=y_train)
Note, under the hood, train_test_split
is simply a wrapper around the scikit-learn ShuffleSplit
and StratifiedShuffleSplit
classes.
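For illustration, here is a rough sketch of how a single ShuffleSplit split can produce the same kind of train/test partition directly (the parameters mirror the train_test_split call above, though the exact indices drawn are not guaranteed to be identical; the variable names are my own).

from sklearn.model_selection import ShuffleSplit
# One shuffled split with 25% of the observations held out
shuffle_split = ShuffleSplit(n_splits=1, test_size=0.25, random_state=1234)
train_idx, test_idx = next(shuffle_split.split(X, y))
X_train_ss, X_test_ss = X[train_idx], X[test_idx]
y_train_ss, y_test_ss = y[train_idx], y[test_idx]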
Now that we’ve trained our model, we use the predict
and score
methods to see the class predictions and calculate the correct classification rate, or accuracy, respectively.
lda.predict(X_train) # predictions on training set
lda.predict(X_test) # predictions on test set
lda.score(X=X_train, y=y_train) # accuracy on training set
lda.score(X=X_test, y=y_test) # accuracy on test set
On the training set, our accuracy is approximately 98.214%, while on the test set it is approximately 97.368%. Since the Iris classification task is so easy, these high accuracy rates (and corresponding low error rates) aren’t surprising. On more challenging classification tasks in practice, we would expect much higher error rates.
Note as well, under a different random validation set split, we would expect different training and test error rates.
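To see this variability concretely, we could repeat the split-and-fit procedure over a handful of arbitrary random seeds (a quick illustration reusing X and y from above and fitting a fresh LDA model for each split).

# Test accuracy under several different random validation set splits
for seed in [0, 1, 2, 3, 4]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed, shuffle=True)
    lda_seed = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
    print(seed, lda_seed.score(X_te, y_te))  # accuracy varies from split to split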
$k$-fold and stratified $k$-fold cross-validation
Next, let’s explore $k$-fold cross-validation. $k$-fold cross-validation and its variants are probably the most popular techniques used in practice for model assessment and selection. For a detailed introduction to $k$-fold cross-validation, please refer to my earlier post on the concept.
A related technique is stratified $k$-fold cross-validation, which is similar to $k$-fold cross-validation except that the observations in each of the $k$ folds are selected so that the class proportions of a qualitative response roughly match those of the full dataset. This is particularly useful for classification tasks where the class labels are highly imbalanced.
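As a quick sketch of what stratification does, we can compare the class counts that land in each validation fold under plain KFold versus StratifiedKFold (reusing the Iris X and y from above; because the Iris target is ordered by class, unshuffled KFold makes the contrast especially stark).

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold
# Class counts in each validation fold: plain k-fold vs stratified k-fold
for cv in (KFold(n_splits=5), StratifiedKFold(n_splits=5)):
    print(type(cv).__name__)
    for _, val_idx in cv.split(X, y):
        print(np.bincount(y[val_idx]))  # counts per class in the validation fold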
scikit-learn’s cross_val_score
We can use scikit-learn’s cross_val_score
function to perform these types of cross-validation techniques. cross_val_score
will perform $k$-fold cross-validation, unless the estimator
is a classifier and y
is binary or multiclass, in which case stratified $k$-fold cross-validation is used.
Using our same lda
estimator from above, we can use cross_val_score
to perform stratified 10-fold cross-validation.
>>> from sklearn.model_selection import cross_val_score
>>> scores = cross_val_score(estimator=lda, X=X, y=y, cv=10)
>>> scores
array([1. , 1. , 1. , 1. , 0.93333333,
1. , 0.86666667, 1. , 1. , 1. ])
Note, each of the ten returned values is the accuracy of a model trained on the other nine folds and evaluated on the corresponding validation fold. By default, cross_val_score
performs 5-fold cross-validation using cv=5
.
We can also easily calculate the mean and standard deviation of these ten cross-validation scores.
scores.mean() # 0.98
scores.std() # 0.0356
Under the hood, cross_val_score
calls scikit-learn’s KFold
or StratifiedKFold
classes to perform the cross-validation. We could have performed the same cross-validation procedure as above using the below code.
from sklearn.model_selection import StratifiedKFold
cv_stratified_k_fold = StratifiedKFold(n_splits=10, shuffle=False)
cross_val_score(estimator=lda, X=X, y=y, cv=cv_stratified_k_fold)
In the regression setting, we can use the KFold
class in an analogous manner.
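As a minimal sketch of that regression analogue (the synthetic dataset and scoring choice below are my own illustrative assumptions):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
# Synthetic regression data, purely for illustration
X_reg, y_reg = make_regression(n_samples=150, n_features=4, noise=10.0, random_state=1234)
# 10-fold cross-validation with shuffling, scored by negative mean squared error
cv_k_fold = KFold(n_splits=10, shuffle=True, random_state=1234)
reg_scores = cross_val_score(LinearRegression(), X_reg, y_reg,
                             cv=cv_k_fold, scoring="neg_mean_squared_error")
reg_scores.mean()  # average (negative) MSE across the ten folds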
scikit-learn’s cross_validate
In addition to cross_val_score
, scikit-learn also offers the cross_validate
function. This function supports evaluating multiple specified scoring metrics, and it returns a dictionary containing fit times, score times, test scores, and optionally training scores. The time measures are returned in units of seconds.
from sklearn.model_selection import cross_validate
scores = cross_validate(estimator=lda,
X=X,
y=y,
cv=10,
scoring=("roc_auc_ovo", "accuracy"),
return_train_score=True)
A list of supported scoring
metrics can be found here.
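Depending on your scikit-learn version, the available scorer names can also be listed programmatically; in recent releases this is exposed via sklearn.metrics.get_scorer_names (shown here only as a convenience, not something used elsewhere in this post).

from sklearn.metrics import get_scorer_names
get_scorer_names()  # names accepted by the scoring parameter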
The Python standard library’s pprint module is useful for viewing dictionary objects like scores.
>>> from pprint import pprint
>>> pprint(scores)
{'fit_time': array([0.00120902, 0.04706788, 0.00307202, 0.00163078, 0.00100708,
0.00134993, 0.00214291, 0.00090027, 0.00096679, 0.00087285]),
'score_time': array([0.00728393, 0.07550716, 0.005898 , 0.00772309, 0.00524998,
0.00416422, 0.00359702, 0.00362277, 0.00341129, 0.00327492]),
'test_accuracy': array([1. , 1. , 1. , 1. , 0.93333333,
1. , 0.86666667, 1. , 1. , 1. ]),
'test_roc_auc_ovo': array([1. , 1. , 1. , 1. , 1. ,
1. , 0.98666667, 1. , 1. , 1. ]),
'train_accuracy': array([0.97777778, 0.97777778, 0.97777778, 0.97777778, 0.98518519,
0.97777778, 0.99259259, 0.97777778, 0.97777778, 0.97777778]),
'train_roc_auc_ovo': array([0.99884774, 0.99884774, 0.99884774, 0.99901235, 0.99917695,
0.99934156, 0.99983539, 0.99917695, 0.99901235, 0.99901235])}
Comments on repeated $k$-fold cross-validation
At this point, it would be useful to briefly discuss the repeated $k$-fold cross-validation technique, and why it provides little practical benefit.
Repeated $k$-fold cross-validation refers to a technique in which the $k$-fold cross-validation method is repeated multiple times. Since $k$-fold cross-validation creates random folds of the data and returns an estimate of the test error, repeated $k$-fold cross-validation returns multiple estimates of the test error, each created with different random folds. After all the repeated cross-validation test error rates are returned, summary statistics such as the mean and standard deviation of all test error estimates are often reported.
This technique and the stratified variant of it are supported in scikit-learn via the RepeatedKFold
and RepeatedStratifiedKFold
classes.
While repeated $k$-fold cross-validation may seem like an attractive way to produce a more accurate test error estimate, it has been argued to be of little use in practice and potentially misleading.
In a 2012 paper by Belgian researchers Gitte Vanwinckelen and Hendrik Blockeel, it was argued that when performing cross-validation for model assessment and selection, we are interested in the predictive accuracy of our model on a sample $S$ taken from a population $P$. The authors denote this value as $\sigma_2$ and contrast it with $\sigma_1$, the mean predictive accuracy of our model taken over all data sets $S’$ of the same size as $S$ taken from $P$.
The $k$-fold cross-validation estimate of $\sigma_2$ has both bias (because $S$ is a subset taken from $P$; a new subset $S_1$ would yield a different estimate) and high variance (because of the randomly selected $k$ folds used to estimate the test error rate).
The authors argue that repeating the $k$-fold cross-validation process can reduce the variance, but not the bias. Instead, the repeated $k$-fold cross-validation process produces an accurate estimate of the mean of all $k$-fold cross-validation test error estimates, across all possible $k$-fold cross-validations, denoted $\mu_k$. However, we are interested in $\sigma_2$, and $\mu_k$ is not necessarily an accurate estimate of $\sigma_2$.
A (not particularly useful) example of repeated $k$-fold cross-validation
After reading the above paper, I was still interested in applying the method of repeated $k$-fold cross-validation for my own knowledge and exploration. With the understanding that this method does not produce particularly meaningful results in practice, below is an example demonstrating how to apply repeated stratified $k$-fold cross-validation to compare four classifiers.
First, let’s initialize the RepeatedStratifiedKFold
object. We’ll perform 10-fold cross-validation across 5 repeats.
from sklearn.model_selection import RepeatedStratifiedKFold
cv_repeated_stratified_k_fold = RepeatedStratifiedKFold(n_splits=10,
n_repeats=5,
random_state=1234)
Next, let’s perform this cross-validation technique to compare the classification results of linear discriminant analysis, quadratic discriminant analysis, logistic regression, and quadratic logistic regression.
I’ll create a separate scikit-learn Pipeline
for each model first.
For the particular models chosen below, scaling our features will not affect the results. However, for demonstration purposes, I decided to scale the features in each pipeline. I also found that the LBFGS solver did not converge for the quadratic logistic regression model without first scaling the features.
Note, the logistic regression and quadratic logistic regression models below are multi-class classifiers, since our target in the Iris dataset has three distinct levels; depending on the scikit-learn version and solver, the multi-class problem is handled via either a one-vs-rest or a multinomial scheme.
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
# Creating individual pipelines for each model
pipeline_lda = \
    ("LDA", Pipeline([
        ("standard_scaler", StandardScaler()),
        ("lda", LinearDiscriminantAnalysis())
    ]))
pipeline_qda = \
    ("QDA", Pipeline([
        ("standard_scaler", StandardScaler()),
        ("qda", QuadraticDiscriminantAnalysis())
    ]))
pipeline_log_reg = \
    ("Logistic Regression", Pipeline([
        ("standard_scaler", StandardScaler()),
        ("log_reg", LogisticRegression(penalty="none"))
    ]))
pipeline_quadratic_log_reg = \
    ("Quadratic Logistic Regression", Pipeline([
        ("standard_scaler", StandardScaler()),
        ("polynomial_features", PolynomialFeatures(degree=2, interaction_only=False)),
        ("logistic_regression", LogisticRegression(penalty="none"))
    ]))
Next, we’ll create a list of our Pipeline
s, and loop through them to calculate the repeated stratified $k$-fold cross-validation test error rates. Within the loop, the results will be organized into a single pandas DataFrame.
# Creating a list of the (name, pipeline) tuples defined above
pipelines = [pipeline_lda, pipeline_qda, pipeline_log_reg, pipeline_quadratic_log_reg]
# Initializing DataFrame for results
results = pd.DataFrame()
# Looping through pipelines, storing results
for pipe, model in pipelines:
    # Getting cross-validation scores (10 folds x 5 repeats = 50 scores, ordered repeat by repeat)
    cv_scores = cross_val_score(model, X, y, cv=cv_repeated_stratified_k_fold)
    # Calculating the mean CV score per repeat (approximation for mu_1 in the paper's notation)
    cv_scores_mean = pd.DataFrame(data=cv_scores.reshape(5, 10)).mean(axis=1)
    # Organizing results into DataFrame
    results_per_model = pd.DataFrame(data=dict(mean_cv_score=cv_scores_mean,
                                               model=pipe,
                                               repeat=["Repeat " + str(i) for i in range(1, 6)]))
    # Concatenating results
    results = pd.concat([results, results_per_model])
Our results
object contains five averaged cross-validation scores (one for each repeat) for each of the four models.
>>> results
mean_cv_score model repeat
0 0.973333 LDA Repeat 1
1 0.966667 LDA Repeat 2
2 0.973333 LDA Repeat 3
3 0.986667 LDA Repeat 4
4 1.000000 LDA Repeat 5
0 0.960000 QDA Repeat 1
1 0.966667 QDA Repeat 2
2 0.953333 QDA Repeat 3
3 0.986667 QDA Repeat 4
4 1.000000 QDA Repeat 5
0 0.953333 Logistic Regression Repeat 1
1 0.953333 Logistic Regression Repeat 2
2 0.980000 Logistic Regression Repeat 3
3 0.980000 Logistic Regression Repeat 4
4 1.000000 Logistic Regression Repeat 5
0 0.946667 Quadratic Logistic Regression Repeat 1
1 0.906667 Quadratic Logistic Regression Repeat 2
2 0.926667 Quadratic Logistic Regression Repeat 3
3 0.940000 Quadratic Logistic Regression Repeat 4
4 0.960000 Quadratic Logistic Regression Repeat 5
Lastly, we can plot our results via a boxplot.
(sns
.boxplot(data=results, x="model", y="mean_cv_score")
.set(title='Model Comparison via \n Repeated Stratified k-Fold Cross Validation',
xlabel="Model",
ylabel="Mean Cross-Validation Score"))
Due to the easy nature of this classification task, all of our accuracy values are quite high. It does appear that the quadratic logistic regression model is (not surprisingly) overfitting the data, and is performing worse on the validation sets.
Leave-one-out cross-validation
To implement leave-one-out cross-validation (LOOCV), we can use scikit-learn’s LeaveOneOut
class. For a more detailed introduction to LOOCV, please refer to my previous post on cross-validation.
from sklearn.model_selection import LeaveOneOut
cv_loo = LeaveOneOut()
scores = cross_val_score(lda, X, y, cv=cv_loo)
scores.mean() # 0.98
Other cross-validation techniques
Lastly, I wanted to briefly discuss some other cross-validation techniques supported by scikit-learn.
GroupKFold
The GroupKFold
class is similar to the regular KFold
class, except that all observations belonging to a particular group are guaranteed never to appear in both the training and validation sets. That is, in each cross-validation iteration, all observations for a given group appear in either the training set or the validation set, but not both.
This technique is useful when observations within a group are related and should stay together during model training, so that the model is assessed on groups it has never seen.
For example, if we have data on various customers (with multiple observations per customer) and we are hoping to predict whether a customer will churn, it might be useful to train the model on all observations from one set of customers and then predict churn for customers the model has not seen, rather than splitting a single customer’s observations across training and validation sets.
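Here is a minimal, hypothetical sketch of GroupKFold along those lines, using a tiny synthetic dataset in which a customer_ids array plays the role of the group labels (all names and values below are made up for illustration).

import numpy as np
from sklearn.model_selection import GroupKFold
# Toy data: 8 observations belonging to 4 hypothetical customers
X_grouped = np.arange(16).reshape(8, 2)
y_grouped = np.array([0, 0, 1, 1, 0, 1, 0, 1])
customer_ids = np.array([1, 1, 2, 2, 3, 3, 4, 4])
# No customer's observations ever appear in both the training and validation sets
group_k_fold = GroupKFold(n_splits=4)
for train_idx, val_idx in group_k_fold.split(X_grouped, y_grouped, groups=customer_ids):
    print("train customers:", customer_ids[train_idx], "validation customers:", customer_ids[val_idx])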
TimeSeriesSplit
The TimeSeriesSplit
class can be used to perform cross-validation on time series data observed at fixed time intervals.
Time series data is characterized by correlation between observations that are near in time (known as autocorrelation). However, standard cross-validation techniques such as $k$-fold cross-validation assume the observations are independent and identically distributed. Thus, using these standard techniques on time series data would introduce unrealistic correlation between the training and validation observations, resulting in overly optimistic estimates of the generalization error.
Therefore, with time series data, we must evaluate our models using “future” observations that were not used to train the model.
The TimeSeriesSplit
class achieves this by returning the first $k$ folds as a training set and the $(k+1)$th fold as a validation set. Then, for each cross-validation iteration, successive training sets are supersets of earlier training sets, and the new validation set becomes the next fold in temporal order.
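A short sketch of that expanding-window behaviour on an arbitrary sequence of ten time-ordered observations (the data here is purely illustrative):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit
# Ten time-ordered observations
X_time = np.arange(10).reshape(10, 1)
# Each successive training set grows; each validation set is the next block in time
ts_split = TimeSeriesSplit(n_splits=4)
for train_idx, val_idx in ts_split.split(X_time):
    print("train:", train_idx, "validate:", val_idx)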
Other techniques
In addition to the techniques discussed above, a variety of other scikit-learn cross-validation methods can be found here.