Multiple Linear Regression

Multiple regression is used when there is more than one predictor (input variable). It extends the simple linear regression model by giving each predictor its own slope coefficient within a single model.

Given p predictors, the model equation is

Y =  \beta_0 +  \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \epsilon 

\beta_i is the average effect on Y of a one-unit increase in X_i, holding all the other predictors fixed.

\beta_0, \beta_1, ..., \beta_p are called the regression coefficients.
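
As a concrete illustration, the sketch below fits a model with two predictors on a small synthetic data set (Python with numpy and statsmodels is assumed here; the variable names, true coefficients, and noise level are invented for the example). Each estimated coefficient can then be read as the average change in Y per unit increase in the corresponding predictor, with the other predictor held fixed.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Two hypothetical predictors and a response generated from known coefficients
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
y = 1.0 + 2.0 * X1 - 0.5 * X2 + rng.normal(scale=0.3, size=n)

# Design matrix with an intercept column, then ordinary least squares
X = sm.add_constant(np.column_stack([X1, X2]))
model = sm.OLS(y, X).fit()
print(model.params)   # estimates of beta_0, beta_1, beta_2
```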

Computing Regression Coefficients

Regression coefficients are computed using the same least squares method as for simple linear regression models.

Choose \beta_0, \beta_1, ..., \beta_p to minimize the residual sum of squares (RSS):

RSS = \sum_{i=1}^{n}{(y_i - \hat{y}_i)}^2

where \hat{y}_i is the fitted value for the i-th observation.
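
A minimal sketch of this idea, assuming Python with numpy on synthetic data: the least squares coefficients can be obtained with np.linalg.lstsq, and the RSS evaluated directly from the definition above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
beta_true = np.array([1.0, 2.0, -0.5, 0.8])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Least squares solution: beta_hat minimizes RSS = sum (y_i - y_hat_i)^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_hat) ** 2)
print(beta_hat, rss)
```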

Assessing Accuracy of Regression Coefficients

This is assessed by answering a basic question: is there a relationship between the outcome variable and the predictors?

To answer it, we test whether all the regression coefficients are zero.

H_0 : \beta_1 = \beta_2 = ... = \beta_p = 0 (null hypothesis)

H_a : at least one \beta_i is non-zero (alternative hypothesis)

The hypothesis test is performed by computing the F-statistic:

F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}
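
A sketch of how this statistic can be computed, assuming Python with numpy and statsmodels on synthetic data; the manual formula above should agree with the F-statistic reported by a standard OLS fit.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, p = 200, 3
X = sm.add_constant(rng.normal(size=(n, p)))
y = X @ np.array([1.0, 2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=n)

fit = sm.OLS(y, X).fit()
rss = np.sum(fit.resid ** 2)
tss = np.sum((y - y.mean()) ** 2)
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
print(f_stat, fit.fvalue)  # the manual value should match statsmodels' F-statistic
```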


Assessing Accuracy of Model (Model Fit)

As in simple linear regression, the RSE and R^2 are used to assess model fit.

  1. RSE = \sqrt{\frac{1}{n - p - 1}RSS}
  2. R^2 = Cor(Y, \hat{Y})^2 (the squared correlation between the response and the fitted values)
  3. In addition to RSE and R^2, it is recommended to plot the data and see if there are problems.
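
A small sketch of these two quantities, again assuming Python with statsmodels and synthetic data; the manually computed R^2 should agree with the value reported by the library.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, p = 200, 3
X = sm.add_constant(rng.normal(size=(n, p)))
y = X @ np.array([1.0, 2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=n)
fit = sm.OLS(y, X).fit()

rss = np.sum(fit.resid ** 2)
rse = np.sqrt(rss / (n - p - 1))                  # residual standard error
r2 = np.corrcoef(y, fit.fittedvalues)[0, 1] ** 2  # squared correlation of Y and Y_hat
print(rse, r2, fit.rsquared)                      # r2 should match statsmodels' R^2
```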

Assumptions in Multiple Linear Regression

Two main assumptions underlie the relationship between the predictor variables and the output variable:

  1. Additive – the effect of a change in one predictor variable on the output variable is independent of the values of the other predictor variables
  2. Linear – the change in the output variable per one-unit change in a predictor variable is constant, regardless of the value of that predictor

Other Assumptions

  • Error terms, \epsilon_1, \epsilon_2, ... \epsilon_n are uncorrelated
  • Error terms have constant variance, Var(\epsilon_i)  = \sigma^2

Extensions of Multiple Linear Regression

To accommodate non-linearity, a simple extension is polynomial regression.
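
For instance, a quadratic relationship can be captured by adding X^2 as an extra predictor and fitting the same linear model. The sketch below assumes Python with statsmodels and uses invented data with a genuinely quadratic trend.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x + 1.5 * x**2 + rng.normal(scale=0.3, size=200)  # truly quadratic

# Polynomial regression: treat x and x^2 as two predictors in a linear model
X_poly = sm.add_constant(np.column_stack([x, x**2]))
fit = sm.OLS(y, X_poly).fit()
print(fit.params)  # estimates of the intercept, linear, and quadratic terms
```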

Potential Problems and Remedies

1 Non-linearity of the relationships between the output and input variables

Remedy:

  • Residual Plots – Draw residual plots to identify non-linearity
  • If the relationship is non-linear, transform the predictors using, for example, \log X, \sqrt{X}, or X^2 (see the sketch below)
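
A possible sketch of this workflow, assuming Python with matplotlib and statsmodels on invented non-linear data: plot residuals against fitted values to spot curvature, then refit after a log transform of the predictor.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=200)
y = np.log(x) + rng.normal(scale=0.1, size=200)   # truly non-linear relationship

# Fit a plain linear model and inspect the residual plot for curvature
fit = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Remedy: transform the predictor (here log(x)) and refit
fit_log = sm.OLS(y, sm.add_constant(np.log(x))).fit()
print(fit.rsquared, fit_log.rsquared)  # the transformed model should fit better
```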

2 Correlation of error terms

This occurs often in time series data.

Remedy:

  • To identify this, plot the residuals as a function of time. If the error terms are correlated, we will see tracking in the residuals, i.e., adjacent residuals having similar values (see the sketch below).
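
A sketch of such a diagnostic plot, assuming Python with matplotlib and statsmodels; the errors here are generated with an AR(1)-style dependence (an invented example) so that the tracking is visible.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 200
t = np.arange(n)

# AR(1)-style errors so that adjacent residuals track each other
eps = np.zeros(n)
for i in range(1, n):
    eps[i] = 0.9 * eps[i - 1] + rng.normal(scale=0.3)
y = 1.0 + 0.05 * t + eps

fit = sm.OLS(y, sm.add_constant(t.astype(float))).fit()
plt.plot(t, fit.resid)      # long runs above/below zero suggest correlated errors
plt.xlabel("Time index")
plt.ylabel("Residual")
plt.show()
```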

3 Non-constant variance in error terms

Remedy: 

  • To identify this, draw the residual plot and look for a funnel shape
  • Transform the response/output Y using a concave function such as \log Y or \sqrt{Y} (see the sketch below)
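
A sketch of both the diagnostic and the remedy, assuming Python with matplotlib and statsmodels on invented data whose error spread grows with the mean.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, size=300)
# Error spread grows with the mean, producing a funnel-shaped residual plot
y = np.exp(0.3 * x + rng.normal(scale=0.2, size=300))

fit = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(fit.fittedvalues, fit.resid)   # look for a funnel shape here
plt.show()

# Remedy: fit on log(Y) instead, which stabilises the variance
fit_log = sm.OLS(np.log(y), sm.add_constant(x)).fit()
plt.scatter(fit_log.fittedvalues, fit_log.resid)
plt.show()
```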

4 Outliers

  • Draw residual plots to identify
  • A better approach is to plot the studentized residuals, computed by dividing each residual e_i by its estimated standard error (see the sketch below).
  • Remedy: Remove such observations
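
A sketch of computing studentized residuals, assuming Python with statsmodels (its OLSInfluence helper provides them directly); a common rule of thumb flags observations whose studentized residual exceeds 3 in absolute value, and the artificially injected outlier below should be caught.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

rng = np.random.default_rng(8)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=100)
y[0] += 5.0                                   # inject one artificial outlier

fit = sm.OLS(y, sm.add_constant(x)).fit()
student = OLSInfluence(fit).resid_studentized_internal
print(np.where(np.abs(student) > 3)[0])       # flag |studentized residual| > 3
```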

5 Collinearity

This means two or more predictor variables are closely related to each other.

Issues of collinearity

  • Reduces the power of the hypothesis test
  • Reduces the accuracy of the regression coefficient estimates
  • Causes the standard errors to grow (a consequence of the reduced accuracy of the coefficient estimates)
  • Results in a decline in the t-statistic

How to detect?

  • Look at the correlation matrix of the predictors
  • Compute variance inflation factor (VIF)

VIF(\hat\beta_j) = \frac{1}{1 - R^2_{X_j \mid X_{-j}}}

where R^2_{X_j \mid X_{-j}} is the R^2 from regressing X_j onto all the other predictors.

Rule of thumb: VIF > 5 or VIF > 10 indicates collinearity
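
A sketch of both detection approaches, assuming Python with numpy and statsmodels (variance_inflation_factor computes the VIF for one column of the design matrix); x2 is constructed to be nearly collinear with x1.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(9)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)       # nearly collinear with x1
x3 = rng.normal(size=n)
predictors = np.column_stack([x1, x2, x3])

# Detection 1: correlation matrix of the predictors
print(np.corrcoef(predictors, rowvar=False))

# Detection 2: VIF for each predictor (column 0 of X is the intercept, so skip it)
X = sm.add_constant(predictors)
vifs = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
print(vifs)   # x1 and x2 should show very large VIFs, x3 a VIF near 1
```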

Remedy:

  • Drop one of the problematic variables from the model
  • Combine collinear variables together into a single predictor variable