Extending Linear Regression Models

Linear regression models assume the relationship between the predictor variables and the output variable is linear. Although linear models are simple and easy to interpret, they often lack predictive power because the true relationship is rarely linear. Ridge regression, the lasso, and principal component regression improve on ordinary least squares, but they still fit a linear model. Moving beyond the linearity assumption, there are extensions of the linear model that accommodate non-linear relationships in the data.

  1. Polynomial Regression
  2. Step Function
  3. Regression Splines
  4. Local Regression
  5. Generalized Additive Models

In this post, I only briefly cover polynomial regression and generalized additive models.

Polynomial Regression

This is a simple extension of the linear regression model in which the predictor is replaced by a polynomial function of it.

y_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 + \beta_3x_i^3 + ... + \beta_dx_i^d + \epsilon_i

Typically d \leq 4; it is not common to use a degree greater than 3 or 4, because higher-degree polynomials become overly flexible and can take on strange shapes.

Computing Regression Coefficients

The coefficients are estimated using least squares, since the polynomial model is still linear in the coefficients \beta_0, ..., \beta_d.
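As a minimal sketch (toy data and degree chosen for illustration), a polynomial fit is just least squares on a design matrix whose columns are powers of x:

```python
import numpy as np

# Toy data: a cubic relationship with noise (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = 1.0 + 2.0 * x - 0.5 * x**3 + rng.normal(scale=1.0, size=x.shape)

# Build the polynomial design matrix [1, x, x^2, x^3] and
# estimate the coefficients by ordinary least squares.
d = 3
X = np.vander(x, N=d + 1, increasing=True)  # columns: x^0, x^1, ..., x^d
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta  # fitted values
```

Because the model is linear in the coefficients, `np.linalg.lstsq` solves it directly; no non-linear optimization is needed.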

Generalized Additive Models

Generalized additive models (GAMs) can be used in both regression and classification settings. They can be seen as an extension of multiple linear regression that supports both quantitative and qualitative output variables.

They also provide a framework to extend the standard linear model by allowing a non-linear function of each variable while maintaining additivity.

GAM in Regression Setting

y_i = \beta_0 + \sum_{j=1}^{p}{f_j(x_{ij})} + \epsilon_i

       = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + ... + f_p(x_{ip}) + \epsilon_i

This is an additive model because it includes a separate function f_j for each predictor X_j and then sums their contributions.
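A very simple sketch of this idea (basis choice, degrees, and data are all illustrative assumptions): give each f_j its own polynomial basis, stack the bases into one design matrix with a shared intercept, and fit the whole additive model by least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Two predictors with different non-linear effects (toy data).
x1 = rng.uniform(-2, 2, n)
x2 = rng.uniform(-2, 2, n)
y = 3.0 + np.sin(2 * x1) + 0.5 * x2**2 + rng.normal(scale=0.2, size=n)


def poly_basis(x, degree):
    # Columns x, x^2, ..., x^degree (no intercept column, so the
    # model keeps a single shared beta_0).
    return np.vander(x, N=degree + 1, increasing=True)[:, 1:]


X = np.column_stack([
    np.ones(n),          # shared intercept beta_0
    poly_basis(x1, 5),   # basis representing f_1(x_1)
    poly_basis(x2, 5),   # basis representing f_2(x_2)
])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
```

In practice each f_j is usually a smoothing spline or natural spline rather than a raw polynomial, but the additive structure of the design matrix is the same.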

Pros

  • Can automatically model non-linear relationships. No need to try out different transformations on each predictor variable individually.
  • Can potentially make more accurate predictions than a standard linear model.
  • Can examine the effect of each predictor variable on the output variable while holding other predictor variables fixed, because the model is additive.
  • Provides useful representations and interpretations for inference problems.

Cons

  • The model is restricted because it is additive: when there are many predictor variables, important interactions between them can be missed.

GAM in Classification Setting

GAMs can also be used when the output variable is qualitative. This extends the logit function used in the logistic regression model.

\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + f_1(X_1) + f_2(X_2) + ... + f_p(X_p)