I recently read through this excellent Medium article about the ColumnTransformer estimator in scikit-learn and how it can be used in tandem with Pipelines and the OneHotEncoder estimator. To strengthen my own understanding of the concept, I decided to follow the post with my own working example, and summarize the concepts along the way.
Introduction
scikit-learn’s 0.20 release introduced the ColumnTransformer estimator. This estimator allows different transformers to be applied to different fields of the data in parallel, before concatenating the results together. In particular, this estimator is attractive when processing a pandas DataFrame with categorical fields.
For this working example, I’ll be using the same slimmed down Titanic dataset from my previous logistic regression post.
Below, I’ll make use of the ColumnTransformer estimator to encode two categorical fields via scikit-learn’s OneHotEncoder. Because scikit-learn machine learning models require their input to be two-dimensional numerical arrays, an encoding preprocessing step is required.
I’ll also use the SimpleImputer and StandardScaler estimators to perform some preprocessing of numerical fields. Lastly, I’ll perform a regularized logistic regression and wrap all of these steps into a reusable and convenient Pipeline.
For example purposes further along in this post, I’ll take the last 10 rows of titanic as a test dataset, and allocate the remaining rows as my training data.
titanic_train = titanic.iloc[:-10]
titanic_test = titanic.iloc[-10:]
OneHotEncoder
Let’s first encode sex to demonstrate some functionality of OneHotEncoder.
from sklearn.preprocessing import OneHotEncoder
# Initializing one-hot encoder
# Forcing dense matrix to be returned
encoder = OneHotEncoder(sparse=False)
# Encoding categorical field
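# Double brackets keep a 2-D DataFrame (rather than a 1-D Series), as the encoder expects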
sex_train = titanic_train[["sex"]]
sex_train_encoded = encoder.fit_transform(sex_train)
>>> sex_train_encoded
array([[0., 1.],
[1., 0.],
[1., 0.],
...,
[0., 1.],
[0., 1.],
[1., 0.]])
From the output, we can see that the male and female values of sex have been encoded into two binary columns.
Notice that a NumPy array was returned. We can access column names indicating which feature of sex is represented by each column using the get_feature_names() method.
>>> feature_names = encoder.get_feature_names()
>>> feature_names
array(['x0_female', 'x0_male'], dtype=object)
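The x0 prefix simply refers to the position of the input column. As a side note, get_feature_names() also accepts a list of input feature names, which should produce more descriptive output:
# Passing the original column name for readable feature names
encoder.get_feature_names(["sex"])
# Expected output: array(['sex_female', 'sex_male'], dtype=object)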
We can also use the inverse_transform() method to recover the original categorical labels from the sex column. Notice the brackets around sex_train_encoded[0], which wrap the single row in a list so that inverse_transform() receives a two-dimensional input.
import numpy as np
# Inverse transforming the first row
>>> encoder.inverse_transform([sex_train_encoded[0]])
array([['male']], dtype=object)
# Inverse transforming all rows
>>> encoder.inverse_transform(sex_train_encoded)
array([['male'],
['female'],
['female'],
...,
['male'],
['male'],
['female']], dtype=object)
# Verifying arrays are equivalent after inverse transforming
>>> np.array_equal(encoder.inverse_transform(sex_train_encoded),
sex_train)
True
Applying transformations to training & test sets
Whenever we transform a training field, we must also transform the corresponding field in the test set. We must do this after splitting up the training and test datasets, instead of fitting the transformation on the full dataset and then splitting. With the latter approach, some information from our test set would leak over into our training procedure. This mistake is referred to as data leakage.
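A minimal sketch of the distinction, reusing the encoder from above (the harm is subtle for a one-hot encoder, but the same ordering applies to any fitted transformer):
# Leaky: the transformer sees the test rows while fitting
# encoder.fit(titanic[["sex"]])
# Correct: fit on the training rows only, then apply to both sets
encoder.fit(titanic_train[["sex"]])
sex_train_encoded = encoder.transform(titanic_train[["sex"]])
sex_test_encoded = encoder.transform(titanic_test[["sex"]])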
# Encoding the sex column of our test dataset
>>> sex_test = titanic_test[["sex"]].copy()
>>> encoder.transform(sex_test)
array([[1., 0.],
[0., 1.],
[1., 0.],
[0., 1.],
[0., 1.],
[1., 0.],
[0., 1.],
[1., 0.],
[0., 1.],
[0., 1.]])
The above transformation works great. Because we have already fit our one-hot encoder on the training data, we do not need to fit it again.
Although the above example worked well, occasionally we’ll run into problems when transforming our test set.
Problem #1: ValueError: Found unknown categories
Occasionally, a value will be present in our test set that is not present in our training set. This can present a problem, as our fitted one-hot encoder expects the same unique label values as the training set.
Let’s suppose the first value of sex_test was misspelled fmale instead of female.
>>> sex_test.iloc[0, 0] = "fmale"
>>> sex_test.head()
sex
704 fmale
705 male
706 female
707 male
708 male
If we attempt the transformation on this new DataFrame, we’ll get an error indicating an unknown category was found.
>>> encoder.transform(sex_test)
ValueError: Found unknown categories ['fmale'] in column 0 during transform
In practice, we should investigate this issue further. However, for this example, let’s initialize a new OneHotEncoder with the handle_unknown parameter set to "ignore", and fit it on the training observations again. Then, when we attempt to transform the test observations, the unknown values will be encoded as a row of all 0’s.
# Initializing new encoder
>>> encoder = OneHotEncoder(sparse=False, handle_unknown="ignore")
# Fitting on training observations
>>> encoder.fit(sex_train)
# Transforming test observations
# Notice first row is all 0's
>>> encoder.transform(sex_test)
array([[0., 0.],
[0., 1.],
[1., 0.],
[0., 1.],
[0., 1.],
[1., 0.],
[0., 1.],
[1., 0.],
[0., 1.],
[0., 1.]])
Problem #2: Missing values
Handling missing values in our test set is similar to handling unknown values. If we initialize our encoder with handle_unknown="ignore", these missing observations will be gracefully handled and encoded as rows of all 0’s.
Note that the 0.23.2 release of scikit-learn appears to accept None values here, but not NaN values.
# Assigning None to some elements of sex_test
>>> sex_test.iloc[1, 0] = None
>>> sex_test.iloc[2, 0] = None
>>> sex_test.head()
# Encoding
# Notice first three rows are all 0's
>>> encoder.transform(sex_test)
array([[0., 0.],
[0., 0.],
[0., 0.],
[0., 1.],
[0., 1.],
[1., 0.],
[0., 1.],
[1., 0.],
[0., 1.],
[0., 1.]])
Imputing missing values
In the event where we do have missing data, it can be useful to impute the missing values. We can use scikit-learn’s SimpleImputer transformer from the impute module for this.
Let’s artificially create a NaN value in sex_train_copy below, and impute this missing value. We can use the strategy parameter to control how the imputation is done. For numerical data, strategy can be set to either mean or median. For categorical data, we can set strategy to constant, which allows us to specify a constant string value to convert missing values to. We can also set strategy="most_frequent" for both numerical and categorical observations, which will replace missing values with the most frequent observation in that column, as sketched below.
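For instance, here is a quick sketch of most_frequent imputation on a made-up toy column (the data is purely illustrative):
import numpy as np
from sklearn.impute import SimpleImputer
# Toy column with one missing value; "b" is the most frequent label
toy = np.array([["a"], ["b"], ["b"], [np.nan]], dtype=object)
SimpleImputer(strategy="most_frequent").fit_transform(toy)
# Expected result: the NaN is replaced by "b"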
When strategy is equal to "constant", we can optionally use the fill_value parameter to set the constant string value. Below we’ll just use the default value of missing_value.
from sklearn.impute import SimpleImputer
# Assigning first element as NaN
sex_train_copy = sex_train.copy()
sex_train_copy.iloc[0, 0] = np.nan
# Initializing SimpleImputer
# fill_value="missing_value" by default
simple_imputer = SimpleImputer(strategy="constant")
# Fitting and transforming
sex_train_copy_imputed = simple_imputer.fit_transform(sex_train_copy)
sex_train_copy_imputed
Now, we can use the fit_transform() method as before for encoding.
>>> encoder.fit_transform(sex_train_copy_imputed)
array([[0., 0., 1.],
[1., 0., 0.],
[1., 0., 0.],
...,
[0., 1., 0.],
[0., 1., 0.],
[1., 0., 0.]])
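Checking the feature names confirms where the third column came from: the constant fill value now appears as its own category.
# The imputed placeholder becomes a third encoded category
encoder.get_feature_names()
# Expected output: array(['x0_female', 'x0_male', 'x0_missing_value'], dtype=object)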
scikit-learn’s Pipeline
Instead of manually applying multiple fitting and transformation steps, we can use a Pipeline. A Pipeline allows a list of transformations to be run successively, and a model can also be trained as the last estimator. Pipelines are especially useful for reproducible workflows, such as applying the same transformations to training and test sets, or to different subsets of a dataset during cross-validation.
Each step in the Pipeline consists of a two-item tuple. The first element of the tuple is a string that labels the step, and the second element is an initialized estimator. The output of each step is the input to the next step.
from sklearn.pipeline import Pipeline
# Creating steps
step_simple_imputer = ("simple_imputer", SimpleImputer(strategy="constant"))
step_encoder = ("encoder", OneHotEncoder(sparse=False, handle_unknown="ignore"))
# Creating pipeline
pipeline = Pipeline([step_simple_imputer, step_encoder])
# Fitting and transforming
pipeline.fit_transform(sex_train_copy)
After fitting the Pipeline on the training data, it is easy to transform the test data. Notice that because the pipeline has already been fit, we do not need to refit it below, and can instead just call transform().
pipeline.transform(sex_test)
Transforming multiple categorical columns
It is just as simple to use our Pipeline on multiple categorical columns. Just refit the Pipeline and run the transformation.
multiple_fields = titanic_train[["sex", "ticket_class"]]
pipeline.fit_transform(multiple_fields)
Accessing steps in our Pipeline
We can also access the individual steps of our Pipeline via the named_steps attribute. For example, we can retrieve the encoder step and then call its get_feature_names() method.
encoder = pipeline.named_steps["encoder"]
encoder.get_feature_names()
scikit-learn’s ColumnTransformer
The ColumnTransformer estimator from the compose module allows the user to control which columns get which transformations. This is especially useful when we consider the different transformations categorical and numerical fields will need.
Each transformer passed to the ColumnTransformer is a three-item tuple of the following structure:
("name_of_transformer", SomeTransformer(parameters), columns_to_transform)
The columns_to_transform can be a list of column names or integer indices, a boolean array, or a function that resolves to a selection of columns.
Passing a Pipeline to a ColumnTransformer
We can also pass a Pipeline as the SomeTransformer(parameters) element above. Notice the below Pipeline is equivalent to the one described above, but _categorical has been appended throughout. We’ll build a numerical equivalent shortly.
from sklearn.compose import ColumnTransformer
# Creating steps
step_simple_imputer_categorical = ("simple_imputer", SimpleImputer(strategy="constant"))
step_encoder_categorical = ("encoder", OneHotEncoder(sparse=False, handle_unknown="ignore"))
# Creating pipeline
pipeline_categorical = Pipeline([step_simple_imputer_categorical, step_encoder_categorical])
# Defining categorical columns
columns_categorical = ["sex", "ticket_class"]
# Defining column transformer
transformers_categorical = [("transformers_categorical",
pipeline_categorical,
columns_categorical)]
# Creating column transformer
column_transformer = ColumnTransformer(transformers=transformers_categorical)
Because our ColumnTransformer selects columns, we can pass the entire titanic_train DataFrame to it. The defined columns will be selected and transformed as appropriate. We could of course pass titanic_test to our ColumnTransformer as well.
>>> column_transformer.fit_transform(titanic_train)
array([[0., 1., 0., 0., 1.],
[1., 0., 1., 0., 0.],
[1., 0., 0., 0., 1.],
...,
[0., 1., 0., 0., 1.],
[0., 1., 0., 0., 1.],
[1., 0., 1., 0., 0.]])
To get the feature names of our encoded variables as we have done before, we need to use the named_transformers_ attribute of our ColumnTransformer. After doing so, we can use the named_steps attribute of our pipeline as before.
(column_transformer
.named_transformers_["transformers_categorical"]
.named_steps["encoder"]
.get_feature_names())
Transforming numeric columns
Next, let’s transform our numeric columns. We’ll impute missing numeric values using the median of that column, and then standardize the values.
from sklearn.preprocessing import StandardScaler
# Creating steps
step_simple_imputer_numeric = ("simple_imputer", SimpleImputer(strategy="median"))
step_standard_scaler_numeric = ("standard_scaler", StandardScaler())
# Creating pipeline
pipeline_numeric = Pipeline([step_simple_imputer_numeric,
step_standard_scaler_numeric])
# Defining numeric columns
columns_numeric = ["age", "fare"]
# Defining column transformer
transformers_numeric = [("transformers_numeric",
pipeline_numeric,
columns_numeric)]
# Creating column transformer
column_transformer = ColumnTransformer(transformers=transformers_numeric)
Just as before, we can fit the ColumnTransformer directly on our DataFrame, and then transform it appropriately.
>>> column_transformer.fit_transform(titanic_train)
array([[-0.52929637, -0.52052854],
[ 0.56642281, 0.68305589],
[-0.25536658, -0.50784109],
...,
[-0.66626127, -0.47173729],
[-0.73474371, -0.50838994],
[ 1.79910688, 0.90626108]])
Combining categorical and numeric column transformers
Next, let’s modify our ColumnTransformer structure so that we can perform both the categorical and numeric transformations in parallel. The two resulting transformed NumPy arrays will be concatenated together into one array.
# Defining column transformer
>>> transformers = \
[("transformers_categorical", pipeline_categorical, columns_categorical),
("transformers_numeric", pipeline_numeric, columns_numeric)]
# Creating column transformer
>>> column_transformer = ColumnTransformer(transformers=transformers)
# Transforming both categorical and numeric columns
>>> column_transformer.fit_transform(titanic_train)
array([[ 0.        ,  1.        ,  0.        , ...,  1.        , -0.52929637, -0.52052854],
       [ 1.        ,  0.        ,  1.        , ...,  0.        ,  0.56642281,  0.68305589],
       [ 1.        ,  0.        ,  0.        , ...,  1.        , -0.25536658, -0.50784109],
       ...,
       [ 0.        ,  1.        ,  0.        , ...,  1.        , -0.66626127, -0.47173729],
       [ 0.        ,  1.        ,  0.        , ...,  1.        , -0.73474371, -0.50838994],
       [ 1.        ,  0.        ,  1.        , ...,  0.        ,  1.79910688,  0.90626108]])
Training a machine learning model
Next, let’s update our Pipeline to feed our transformed data into a machine learning model. We’ll train a logistic regression model below. Unlike in my prior posts, this time I will make use of LogisticRegression()’s default regularization, since we standardized our numeric predictor variables.
Below, we’ll use the fit() method instead of fit_transform(), because the final step in our Pipeline is to actually fit the model.
from sklearn.linear_model import LogisticRegression
# Creating steps
step_column_transformers = ("column_transformers", column_transformer)
step_logistic_regression = ("logistic_regression", LogisticRegression())
# Creating pipeline
log_reg_pipeline = Pipeline([step_column_transformers, step_logistic_regression])
# Assigning y
y = titanic_train["survived"]
# Transforming data and fitting model
log_reg_pipeline.fit(titanic_train, y)
We can use the score() method to return the correct classification rate.
>>> log_reg_pipeline.score(titanic_train, y)
0.7911931818181818
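Because the Pipeline bundles every preprocessing step together with the model, we can also generate predictions for the raw held-out rows directly (output omitted here):
# Predicting survival for the 10 held-out test rows
log_reg_pipeline.predict(titanic_test)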
Cross-validation
Of course, the above correct classification rate was computed on the training set. To get a better idea of how our model might perform on unseen data, let’s perform a 10-fold cross-validation.
>>> from sklearn.model_selection import KFold, cross_val_score
# Initializing k-fold cross-validator
>>> k_fold = KFold(n_splits=10, shuffle=True, random_state=123)
# Getting cross-validation scores
>>> cross_val_scores = cross_val_score(estimator=log_reg_pipeline,
X=titanic_train,
y=y,
cv=k_fold)
# Getting average cross-validation score
>>> cross_val_scores.mean()
0.785513078470825
Grid search
Lastly, let’s perform a grid search to determine the optimal values of our transformation and fitting procedures. We’ll pass a dictionary object to scikit-learn’s GridSearchCV. In the dictionary keys, we separate the name of each layer of the Pipeline with double underscores, ending with the actual parameter name.
from sklearn.model_selection import GridSearchCV
# Defining parameter grid
param_grid = {
"column_transformers__transformers_numeric__simple_imputer__strategy":
["mean", "median"],
"logistic_regression__C":
[.0001, .001, .01, .1, 1, 10, 100, 1000]
}
# Initializing grid search
grid_search = GridSearchCV(estimator=log_reg_pipeline,
param_grid=param_grid,
cv=k_fold)
# Fitting grid search
grid_search.fit(titanic_train, y)
We can also view the best parameter combination and the best score.
>>> grid_search.best_params_
{'column_transformers__transformers_numeric__simple_imputer__strategy': 'mean', 'logistic_regression__C': 10}
>>> grid_search.best_score_
0.7869215291750503
We can view detailed results as a pandas DataFrame as well.
import pandas as pd
>>> pd.DataFrame(grid_search.cv_results_)
mean_fit_time std_fit_time ... std_test_score rank_test_score
0 0.201999 0.415343 ... 0.066751 15
1 0.014747 0.000764 ... 0.060855 13
2 0.016023 0.001118 ... 0.038972 11
3 0.019522 0.003494 ... 0.055859 7
4 0.020603 0.002647 ... 0.059064 9
5 0.018984 0.003322 ... 0.060287 1
6 0.016421 0.000717 ... 0.060287 1
7 0.016278 0.000662 ... 0.060287 1
8 0.015751 0.000653 ... 0.066751 15
9 0.016661 0.002365 ... 0.060855 13
10 0.017589 0.001195 ... 0.038972 11
11 0.018600 0.001226 ... 0.055859 7
12 0.018348 0.000565 ... 0.059064 9
13 0.018073 0.001442 ... 0.060287 1
14 0.016373 0.000261 ... 0.060287 1
15 0.017112 0.001278 ... 0.060287 1
[16 rows x 20 columns]
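Because GridSearchCV refits the best parameter combination on the full training data by default (refit=True), the tuned pipeline can be used for prediction right away:
# grid_search delegates predict() to the refit best estimator
grid_search.predict(titanic_test)
# The refit pipeline itself is available as well
best_pipeline = grid_search.best_estimator_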