Introduction to model selection




Introduction

Having trained several predictive models on the available data, how do we know which of them performs best? Choosing the best-performing model among the candidates is known as model selection in machine learning.

Model selection helps in choosing good hyperparameters within a single model family. For example, choosing the number of neighbors, $K$, in $K$-nearest neighbors is essential for getting good predictive performance.

Model selection is also useful in comparing across multiple model families. For example, whether a support vector machine or a decision tree is a better predictive model for a task can be addressed using model selection strategies.

In this article, we will explore the recommended strategies for performing model selection in supervised learning settings.

Prerequisites

To understand model selection strategies, we recommend familiarity with the concepts in

It would also help to have some familiarity with some machine learning models for classification and regression, such as

Follow the above links to first get acquainted with the corresponding concepts.

The model selection recipe

Model selection is straightforward.

To choose suitable hyperparameter settings, select those that help the model achieve the best predictive performance.
When choosing among several model families, again, select the family that achieves the best predictive performance after hyperparameter tuning.

It seems, then, that the most important step in model selection is estimating predictive performance. We have explained an elaborate list of evaluation metrics for classification and, similarly, for regression. So we have metrics for predictive performance. But on what data should we compute them?

A naive approach would be evaluating predictive performance on training data. The problem with this approach is that the model has already seen all examples from the training set. The model may have just memorized a direct mapping from input instance to its output target variable, without learning a general signature or pattern for this mapping. Such a model will have superb predictive performance on the training data, but miserable performance on future unseen examples.
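The memorization problem can be made concrete with a small sketch. The toy task, the `Memorizer` class, and the helper names below are hypothetical and chosen only for illustration; the model is deliberately extreme, storing every training pair and guessing 0 for anything it has not seen.

```python
import random

rng = random.Random(0)

def make_data(n):
    """Hypothetical toy task: the label is 1 exactly when x > 0.5."""
    return [(x, int(x > 0.5)) for x in (rng.random() for _ in range(n))]

class Memorizer:
    """Stores every training pair and guesses 0 for any input it has not seen."""
    def fit(self, data):
        self.table = dict(data)
        return self

    def predict(self, x):
        return self.table.get(x, 0)

def accuracy(model, data):
    return sum(model.predict(x) == y for x, y in data) / len(data)

train, unseen = make_data(200), make_data(200)
model = Memorizer().fit(train)
train_accuracy = accuracy(model, train)    # perfect: every answer is looked up
unseen_accuracy = accuracy(model, unseen)  # roughly chance level on this task
print(train_accuracy, unseen_accuracy)
```

Evaluated on its own training data, such a model looks flawless, which is exactly why the training-set score says little about future examples.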

An alternative strategy might involve splitting the dataset into two parts — a training set and a testing set. As the name implies, we train the model on the training set and evaluate its predictive performance on the testing set. Although better than the previous naive approach, this train-test splitting strategy has a problem — the predictive performance is specific to the testing set. If the test set is not big enough, it may not represent the variety of data the model may encounter in the future.
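As a rough illustration of how much a single train-test split can depend on the particular test set, the sketch below repeats the split with different random seeds on the same small dataset. The data generator, the nearest-centroid classifier, and all function names are assumptions made for this example, not part of the original article.

```python
import random

def make_data(n, rng):
    """Hypothetical 1-D task: class y is drawn from a Gaussian centred at 2*y."""
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]

def fit_centroids(train):
    """Nearest-centroid classifier: remember the mean input of each class."""
    means = {}
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        means[label] = sum(xs) / len(xs)
    return means

def predict(means, x):
    return min(means, key=lambda label: abs(x - means[label]))

def accuracy(means, data):
    return sum(predict(means, x) == y for x, y in data) / len(data)

def holdout_accuracy(data, test_fraction, seed):
    """Shuffle, hold out a fraction for testing, train on the rest."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    return accuracy(fit_centroids(train), test)

data = make_data(40, random.Random(0))
estimates = [holdout_accuracy(data, 0.25, seed) for seed in range(5)]
print(estimates)  # one estimate per split, each based on only 10 test examples
```

With only 10 test examples per split, each estimate can move in steps of 0.1, so different splits will typically report noticeably different accuracies for the same model family.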

We need a strategy that estimates predictive performance over multiple test sets instead of a single one. Cross-validation is such a strategy.

Cross-validation

Cross-validation generalizes the idea of training/testing splits in a principled way.

The available supervised data is shuffled and split into $K$ folds, each containing approximately the same number of observations. This splitting automatically leads to $K$ training/testing splits: for the $i$-th split, the $i$-th fold serves as the testing set and the remaining $K - 1$ folds together form the training set. The model's predictive performance is then estimated as the average of the evaluations from these $K$ training/testing splits. This strategy is known as $K$-fold cross-validation.
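The procedure can be sketched in a few lines of plain Python. The function names (`k_fold_indices`, `cross_validate`) and the majority-class model are hypothetical stand-ins used to keep the example self-contained; any model with a fit step and a scoring function could be substituted.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(data, k, fit, score, seed=0):
    """Average the score over k splits; fold i is the test set of split i."""
    folds = k_fold_indices(len(data), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train = [data[j] for f, fold in enumerate(folds) if f != i for j in fold]
        test = [data[j] for j in test_idx]
        scores.append(score(fit(train), test))
    return sum(scores) / k

def fit_majority(train):
    """Toy model: always predict the most frequent training label."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def score_accuracy(model, test):
    return sum(model == y for _, y in test) / len(test)

data = [(i, int(i % 4 == 0)) for i in range(20)]  # 5 positives, 15 negatives
cv_accuracy = cross_validate(data, k=5, fit=fit_majority, score=score_accuracy)
print(cv_accuracy)  # 0.75
```

Here every training split contains more negatives than positives, so the toy model always predicts 0, and the averaged accuracy equals the overall share of negatives, 15 out of 20.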

Why is this better than simply repeating a random training/testing split several times? Theory aside, $K$-fold cross-validation guarantees that every observation in the available labeled set appears in exactly one test set. Repeated random splits offer no such guarantee: some observations may never be tested, while others may be tested several times.
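The coverage difference is easy to check by counting how often each observation lands in a test set. The following is a minimal sketch with hypothetical sizes (20 observations, 5 splits) and variable names of our own choosing.

```python
import random
from collections import Counter

n, k = 20, 5
rng = random.Random(0)

# K-fold: shuffle once, then deal the indices into k disjoint folds.
indices = list(range(n))
rng.shuffle(indices)
folds = [indices[i::k] for i in range(k)]
kfold_counts = Counter(j for fold in folds for j in fold)

# Alternative: k independent random splits, each holding out n // k observations.
random_counts = Counter()
for _ in range(k):
    random_counts.update(rng.sample(range(n), n // k))

never_tested = [j for j in range(n) if random_counts[j] == 0]
print(sorted(kfold_counts.values()))  # every observation is tested exactly once
print(never_tested)                   # observations the random splits never tested
```

Both schemes evaluate the same total number of test examples, but only the $K$-fold folds are guaranteed to be disjoint and to cover the whole dataset.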

Leave-one-out cross-validation

An extreme case of $K$-fold cross-validation uses $K = N$, where $N$ is the number of labeled observations. With only one observation in each fold, this approach to evaluation is known as leave-one-out cross-validation (LOOCV). The benefit of LOOCV over smaller values of $K$ is the availability of a relatively larger dataset for training the model. For example, with 100 labeled examples, LOOCV will offer 99 examples for training, while $5$-fold cross-validation will offer only 80 examples for training. On the flip side, each performance evaluation in LOOCV depends on a single test example, while each evaluation in $5$-fold cross-validation has less variance because it is computed over 20 examples. Therefore, the default recommendation is to avoid LOOCV unless labeled data is very scarce.
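The arithmetic behind the 100-example comparison can be written out directly. The helper `split_sizes` is a hypothetical name, and the sketch assumes the number of examples divides evenly into the folds; LOOCV corresponds to using as many folds as there are examples.

```python
def split_sizes(n_examples, n_folds):
    """Return (train_size, test_size) per split, assuming folds of equal size."""
    test_size = n_examples // n_folds
    return n_examples - test_size, test_size

five_fold = split_sizes(100, 5)        # (80, 20)
leave_one_out = split_sizes(100, 100)  # (99, 1)
print(five_fold, leave_one_out)
```

With a single test example per split, each LOOCV evaluation of a classifier is either entirely right or entirely wrong, which is the source of the per-split variability discussed above.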

Repeated cross-validation

The default recommendation is to use repeated $K$-fold cross-validation (Kohavi, 1995). For example, repeat a $5$-fold cross-validation experiment a total of 10 times, each time creating the folds randomly. Such an experiment would offer two benefits:

The predictive performance is estimated over multiple sets of $K$ folds, which reduces its dependence on any particular split, since the folds were randomized several times.
Each example appears in multiple test sets, once in each of the $K$-fold runs, and is likely to be evaluated against a different training set each time. Compare that with a single $K$-fold cross-validation, which may by chance create splits that are bad for evaluation, for instance with all the difficult examples grouped in a single test set and all the easy examples in the corresponding training set. Repeated cross-validation makes the measured performance on any example less dependent on one particular training set.
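A minimal sketch of how the splits for repeated $K$-fold cross-validation could be generated is given below, using the 5-fold, 10-repeat setting from the example. The generator name `repeated_k_fold` and its parameters are assumptions for illustration; a model would be trained and scored on each yielded split and the scores averaged.

```python
import random
from collections import Counter

def k_fold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def repeated_k_fold(n, k, repeats, seed=0):
    """Yield (repeat, train_idx, test_idx) for freshly shuffled K-fold runs."""
    rng = random.Random(seed)
    for r in range(repeats):
        folds = k_fold_indices(n, k, rng)
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield r, train_idx, test_idx

splits = list(repeated_k_fold(n=100, k=5, repeats=10))
times_tested = Counter(j for _, _, test_idx in splits for j in test_idx)
print(len(splits))                 # 50 train/test splits in total
print(set(times_tested.values()))  # {10}: each example is tested once per repeat
```

Because the folds are reshuffled in every repeat, an example is usually paired with a different training set each time it is tested.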

Stratified cross-validation

In classification scenarios, it is crucial that the relative proportion of categories in the training set is similar to that in the test set. When applying cross-validation, a stratified approach to creating the folds ensures that the relative proportion of examples from different classes is approximately the same across the folds. Such a strategy is known as stratified cross-validation, and it may be applied in conjunction with the repeated cross-validation that we studied earlier.
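One simple way to build stratified folds is to shuffle the examples of each class separately and deal them round-robin into the folds. The sketch below takes that approach; the function name `stratified_folds` and the spam/ham labels are hypothetical, and other stratification schemes are possible.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    """Deal the shuffled indices of each class round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0  # keeps running across classes so fold sizes stay balanced
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for idx in members:
            folds[position % k].append(idx)
            position += 1
    return folds

labels = ["spam"] * 10 + ["ham"] * 40  # a 20% / 80% class mix
folds = stratified_folds(labels, k=5)
class_counts = [Counter(labels[i] for i in fold) for fold in folds]
print(class_counts)  # each fold holds 8 "ham" and 2 "spam" examples
```

Every fold reproduces the 20% / 80% class mix of the full dataset, so each training and test split sees the same proportions.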

Nested cross-validation

At the outset of this article, we described two uses of model selection: hyperparameter tuning and comparing model families.

A typical strategy for achieving both of these goals is the so-called nested cross-validation. In nested cross-validation, there are two levels of cross-validation, one within the other.

In the outer level, the training and testing splits are used to estimate model predictive performance for comparison across model families.
In the inner level, each outer training set is itself split into training and testing splits, which are used to estimate predictive performance for different hyperparameter settings of the same model family. In this case, the testing set is known as the validation set. Using randomized search or grid search, the best-performing hyperparameter settings are identified.
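The two levels can be sketched as follows for a single model family, a one-dimensional nearest-neighbors classifier whose hyperparameter is the number of neighbors. The dataset, the candidate values, the fold counts, and all function names are assumptions for illustration; the inner loop is a plain grid search over the candidates.

```python
import random

def splits(data, k, rng):
    """Return a list of (train, test) pairs, one per fold."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    pairs = []
    for i, fold in enumerate(folds):
        test = [data[j] for j in fold]
        train = [data[j] for f, other in enumerate(folds) if f != i for j in other]
        pairs.append((train, test))
    return pairs

def knn_predict(train, x, n_neighbors):
    """Majority vote among the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:n_neighbors]
    votes = sum(y for _, y in nearest)
    return int(2 * votes > len(nearest))

def accuracy(train, test, n_neighbors):
    return sum(knn_predict(train, x, n_neighbors) == y for x, y in test) / len(test)

def tune(train, candidates, k, rng):
    """Inner level: grid search using validation splits of the outer training set."""
    inner = splits(train, k, rng)  # the same splits are reused for every candidate
    def mean_validation_accuracy(n_neighbors):
        return sum(accuracy(tr, val, n_neighbors) for tr, val in inner) / len(inner)
    return max(candidates, key=mean_validation_accuracy)

def nested_cv(data, candidates, outer_k=5, inner_k=3, seed=0):
    """Outer level: score the tuned model on test folds the tuning never saw."""
    rng = random.Random(seed)
    scores, chosen = [], []
    for train, test in splits(data, outer_k, rng):
        best = tune(train, candidates, inner_k, rng)
        chosen.append(best)
        scores.append(accuracy(train, test, best))
    return sum(scores) / len(scores), chosen

data_rng = random.Random(1)
# Hypothetical 1-D task: class y is drawn from a Gaussian centred at 2*y.
data = [(data_rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(60)]
estimate, chosen = nested_cv(data, candidates=(1, 3, 5, 7))
print(estimate, chosen)
```

The outer test folds play no part in choosing the hyperparameter, so the outer average can be compared across model families without favoring the family that was tuned more aggressively.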
