An Overview of Statistical Learning

Statistical learning methods are useful for two main purposes: prediction and inference. Prediction refers to predicting an output (also called response or dependent) variable, whereas inference refers to understanding how an output variable is affected by the input (also called predictor or independent) variables. There are many different linear and non-linear methods that can be used for both prediction and inference. However, there is a trade-off between linear and non-linear models: linear models are simple and highly interpretable, whereas non-linear models are flexible yet lack interpretability. Non-linear models are therefore better suited to prediction than to inference, since prediction does not require interpreting the effects of the input variables on the output variable.

Overview of Classical Statistical Learning Methods

Here’s a summary of existing classical statistical learning methods. Problems with a quantitative output are generally referred to as ‘regression problems,’ whereas problems with a qualitative output are generally referred to as ‘classification problems.’

Extracted from the book ‘An Introduction to Statistical Learning’

Techniques for Assessing Model Accuracy

We need to evaluate how good a model is prior to using it for actual prediction or inference; that is, we need to measure how well its predictions match the observed data. The following are a few common techniques used to evaluate a model in either the regression setting or the classification setting.

Mean Squared Error (MSE): This is mainly used in the regression setting. It is the average squared difference between the observed and predicted values: MSE = (1/n) Σ (yᵢ − ŷᵢ)². Bias and variance are two properties to watch when using MSE. The training MSE, computed on the training data set, decreases as flexibility increases (i.e., as more flexible, non-linear models are used). The testing MSE, computed on a test data set that was not used to train the model, typically follows a U shape. One should judge a model by its testing MSE rather than its training MSE.

Variance – The amount by which the estimated output would change if it were estimated using a different training set

Bias – The error introduced by approximating a complicated real-life problem with a simpler model, such as linear regression
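The training-versus-testing MSE behavior described above can be sketched with simulated data: fitting polynomials of increasing degree (increasing flexibility) drives the training MSE down, while the testing MSE eventually rises again. The data, noise level, and degrees below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-linear ground truth plus noise.
def f(x):
    return np.sin(2 * x)

x_train = rng.uniform(0, 3, 60)
y_train = f(x_train) + rng.normal(0, 0.3, 60)
x_test = rng.uniform(0, 3, 200)
y_test = f(x_test) + rng.normal(0, 0.3, 200)

def mse(y, y_hat):
    # Average squared difference between observed and predicted values.
    return np.mean((y - y_hat) ** 2)

# Higher polynomial degree = more flexible model.
for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = mse(y_train, np.polyval(coeffs, x_train))
    test_mse = mse(y_test, np.polyval(coeffs, x_test))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The training MSE shrinks monotonically with degree, while the testing MSE traces the U shape: it improves as the model captures the true curve, then worsens once the model starts fitting noise.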

Bayes Classifier: This is used in the classification setting. It assigns each observation to the class with the highest conditional probability given its predictor values; it serves as an unattainable gold standard, since the true conditional distribution of the classes is rarely known in practice.
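A minimal sketch of the Bayes classifier, assuming a hypothetical setting where the true class-conditional densities are known (two 1-D Gaussian classes with equal priors). The posterior for each class is proportional to prior times likelihood, and the classifier picks the larger one:

```python
from math import pi, sqrt, exp

def normal_pdf(x, mean, sd):
    # Density of a 1-D Gaussian with the given mean and standard deviation.
    return exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))

def bayes_classify(x, priors=(0.5, 0.5), means=(0.0, 2.0), sds=(1.0, 1.0)):
    # Hypothetical known distributions: the Bayes classifier requires the
    # true conditional distribution, which real data never provides.
    posteriors = [p * normal_pdf(x, m, s)
                  for p, m, s in zip(priors, means, sds)]
    return posteriors.index(max(posteriors))

# With equal priors, the decision boundary sits midway between the means.
print(bayes_classify(-0.5))  # -> 0
print(bayes_classify(2.5))   # -> 1
```

Because the two classes here have equal priors and equal variance, the Bayes decision boundary is simply the midpoint of the two means (x = 1.0).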

K-Nearest Neighbors (KNN): This is used in the classification setting. It estimates the conditional probability of each class from the K training observations closest to a given point, and classifies the point by majority vote among those neighbors.
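The KNN rule can be sketched in a few lines: compute the distance from a new point to every training observation, take the K closest, and vote. The toy data below is hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Labels of the k nearest neighbours; majority vote decides the class.
    nearest = y_train[np.argsort(dists)[:k]]
    return int(Counter(nearest).most_common(1)[0][0])

# Hypothetical toy data: two clusters, one per class.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.05, 0.1]), k=3))  # -> 0
```

The choice of K controls flexibility, mirroring the bias–variance trade-off above: a small K gives a flexible, low-bias but high-variance classifier, while a large K smooths the decision boundary.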