This is a second post detailing 11 more takeaways I had after recently reading Chip Huyen’s Designing Machine Learning Systems. You can find my first post here.
[Read More]
I recently read through Chip Huyen’s Designing Machine Learning Systems, first published in 2022. I found it a highly useful overview of a rapidly evolving field. As I expected from her other writings, Chip provides clear explanations and strikes a nice balance between technical depth and high-level summary. Machine learning engineering...
[Read More]
Streaming analytics refers to processing and analyzing data continuously as it arrives, as opposed to in regular batches. Stream processing is triggered by specific events that result from an action or set of actions. Examples of these triggering events include financial transactions, thermostat readings, student responses, or website purchases. Streaming analytics...
[Read More]
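To make the event-driven idea concrete, here is a minimal Python sketch (not code from the post itself): each event updates a running aggregate as soon as it arrives rather than waiting for a batch job. The event type and field names are hypothetical placeholders.

```python
# Minimal sketch of event-driven stream processing: each incoming event
# updates a running aggregate immediately, rather than waiting for a batch.
# The Transaction event type and its fields are illustrative assumptions.
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Transaction:
    user_id: str
    amount: float

def running_totals(events: Iterable[Transaction]) -> dict[str, float]:
    """Update per-user spending totals as each transaction event arrives."""
    totals: dict[str, float] = {}
    for event in events:
        totals[event.user_id] = totals.get(event.user_id, 0.0) + event.amount
        # In a real system, this is where each event might trigger an alert,
        # update a dashboard, or be written to a downstream sink.
    return totals

if __name__ == "__main__":
    stream = [Transaction("alice", 12.5), Transaction("bob", 3.0), Transaction("alice", 7.5)]
    print(running_totals(stream))  # {'alice': 20.0, 'bob': 3.0}
```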
Principal components analysis (PCA) is a popular technique for deriving a low-dimensional set of features from a large set of variables. For more information on PCA, please refer to my earlier post on the technique. In this post, I’ll explore using PCA as a dimension reduction technique for...
[Read More]
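As a rough illustration of the pattern this post covers (PCA followed by a downstream model), here is a minimal sketch using scikit-learn on synthetic data; the dataset, sizes, and number of components are assumptions for illustration only.

```python
# Minimal sketch of PCA as a dimension reduction step before a downstream
# model, using scikit-learn on synthetic data (sizes are illustrative).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                               # 200 observations, 30 variables
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=200)   # response driven by a few variables

# Project the 30 original variables onto 5 principal components,
# then fit the regression on those components instead.
model = make_pipeline(PCA(n_components=5), LinearRegression())
model.fit(X, y)
print(model.score(X, y))  # R^2 on the training data
```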
Principal components analysis (PCA) is a technique that computes the principal components of a dataset and then uses those components to understand the data. PCA is an unsupervised approach. In a future post, I’ll explore principal components regression, a related supervised technique that makes use of the principal components...
[Read More]
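For a quick sense of what the unsupervised output looks like, here is a minimal scikit-learn sketch (not from the post): no response variable is involved; we only inspect the components and the share of variance each one explains. The iris data is used purely as a stand-in example.

```python
# Minimal sketch of PCA as an unsupervised technique: fit on the features
# alone, then inspect the loadings and explained variance. Illustrative only.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)  # scale variables before PCA
pca = PCA(n_components=2).fit(X)

print(pca.explained_variance_ratio_)  # share of total variance per component
print(pca.components_)                # loadings: how each variable contributes
```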
This is the second post in a short series discussing the common regularization methods of ridge regression and the lasso. In an earlier post, I introduced much of the theory surrounding these methods; for a more detailed overview of regularization, please see that post.
[Read More]
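For readers who just want the applied starting point, a minimal scikit-learn sketch of fitting both methods might look like the following; the synthetic data and the alpha values are assumptions, not settings from the post.

```python
# Minimal sketch of fitting ridge regression and the lasso with scikit-learn.
# Data and penalty strengths (alpha) are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can set some coefficients exactly to zero

print(ridge.coef_)
print(lasso.coef_)
```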
Regularization is a method of fitting a model containing all $p$ predictors while shrinking the coefficient estimates towards zero. Also known as constraining or shrinking the coefficient estimates, regularization can significantly reduce the model’s variance and thus improve test error and model performance. The two most commonly used...
[Read More]
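For concreteness, the two penalized least squares objectives (in the standard notation, with $\lambda \ge 0$ as the tuning parameter) are:

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2,
\qquad
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|.
$$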
The bootstrap is a widely used resampling technique, first introduced by Bradley Efron in 1979, that quantifies the uncertainty associated with a given estimator or statistical learning method. The bootstrap can be applied to many problems and methods, and is commonly used to estimate the standard errors of...
[Read More]
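The core mechanic is simple enough to sketch in a few lines of Python: resample the data with replacement many times, recompute the statistic on each resample, and use the spread of those replicates as an estimate of its standard error. The data below are synthetic and purely illustrative.

```python
# Minimal sketch of the bootstrap: estimate the standard error of the sample
# mean by resampling the observed data with replacement. Illustrative only.
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=100)   # the observed sample
B = 1000                                      # number of bootstrap replicates

boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()  # statistic on one resample
    for _ in range(B)
])

print(boot_means.std(ddof=1))   # bootstrap estimate of the standard error of the mean
```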