Resampling Techniques

Resampling is the process of repeatedly drawing samples from a training data set and refitting a model on each sample in order to obtain additional information about the fitted model. Two of the most widely used resampling techniques are cross-validation and the bootstrap.

1. Cross-Validation

Cross-validation is used to estimate the test error, either to evaluate a model's performance (model assessment) or to select the proper level of flexibility (model selection).

Test error is the average error that results from using a model to predict the response on a new observation, one that was not used in training. Cross-validation estimates the test error by holding out a subset of the training observations from the fitting process and then applying the model to those held-out observations.

Approaches to Estimating the Test Error

  1. Validation Set
  2. Leave-One-Out Cross-Validation
  3. k-Fold Cross-Validation

Validation Set

Randomly divide the available observations into two parts: a training set and a validation set, also called a hold-out set (Figure 1). The model is fit on the training set, and the test error is estimated by computing the error on the validation set.

Figure 1: Validation Set Approach
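
As a minimal sketch, the validation set approach might look like the following in Python with scikit-learn (the library choice, the linear model, and the synthetic data are assumptions for illustration, not part of the original text):

```python
# Validation set approach: fit on one random half, estimate test error on the other.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))                      # assumed synthetic data
y = 1.5 * X.ravel() + rng.normal(scale=1.0, size=200)      # linear signal + noise

# Randomly split the observations into a training set and a validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# Fit on the training set, then estimate the test error on the validation set.
model = LinearRegression().fit(X_train, y_train)
val_mse = mean_squared_error(y_val, model.predict(X_val))
print(f"Validation-set MSE estimate: {val_mse:.3f}")
```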

Leave-One-Out Cross-Validation (LOOCV)

Similar to the validation set approach, we divide the observations into two parts. However, this time the validation set contains only a single observation, and the model is fit on the remaining n − 1 observations (Figure 2). The procedure is repeated n times, holding out each observation once, and the n resulting errors are averaged to produce the LOOCV estimate.

Figure 2: LOOCV Approach
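
A minimal LOOCV sketch, reusing the X and y from the validation set example above (scikit-learn's LeaveOneOut splitter is one convenient way to generate the n splits):

```python
# LOOCV: each of the n splits holds out exactly one observation as the validation set.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")

# Average the n held-out squared errors to get the LOOCV estimate.
loocv_mse = -scores.mean()
print(f"LOOCV MSE estimate: {loocv_mse:.3f}")
```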

k-Fold Cross-Validation (k-Fold CV)

In this approach we randomly divide the observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the model is fit on the remaining k − 1 folds (Figure 3). The mean squared error (MSE) is computed on the held-out fold. This procedure is repeated k times, with a different fold serving as the validation set each time. The k-fold CV estimate is then computed by averaging the k resulting MSEs:

CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \text{MSE}_i

Figure 3: k-Fold CV Approach
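
A minimal k-fold CV sketch with k = 10 (a common but here assumed choice), again reusing X and y:

```python
# k-fold CV: one MSE per held-out fold; the CV estimate is their average,
# exactly the CV_(k) formula above.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=10, shuffle=True, random_state=0)  # randomly assign folds
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kf, scoring="neg_mean_squared_error")

cv_mse = -scores.mean()
print(f"10-fold CV MSE estimate: {cv_mse:.3f}")
```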

2. Bootstrap

The bootstrap quantifies the uncertainty associated with a given estimator; for example, it can estimate the standard errors of the coefficients in a linear regression fit. Rather than repeatedly drawing independent samples from the population, it emulates that process on a computer: new data sets are generated by repeatedly sampling observations from the original data set with replacement, so we never need to pull new samples from the population.
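
A minimal sketch of using the bootstrap to estimate the standard error of a regression slope, reusing X and y from above (B = 1000 resamples is an assumed choice):

```python
# Bootstrap SE of a regression coefficient: refit on resampled data sets and
# take the standard deviation of the fitted slopes.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, B = X.shape[0], 1000
slopes = np.empty(B)

for b in range(B):
    # Sample n observations from the original data set *with replacement*.
    idx = rng.integers(0, n, size=n)
    fit = LinearRegression().fit(X[idx], y[idx])
    slopes[b] = fit.coef_[0]

# The standard deviation across the B bootstrap fits estimates the slope's SE.
print(f"Bootstrap SE of slope: {slopes.std(ddof=1):.3f}")
```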