Multiple Linear Regression

Multiple regression is used when there is more than one predictor (input variable). It extends the simple linear regression model by giving each predictor its own slope coefficient within a single model.

Given p predictors, the model equation is

Y =  \beta_0 +  \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \epsilon 

\beta_i is the average effect on Y of a one-unit increase in X_i, holding all the other predictors fixed.

\beta_0, \beta_1, ..., \beta_p are called the regression coefficients.
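
As a concrete illustration, the sketch below fits a model with two predictors on a small synthetic data set (Python with numpy and statsmodels is assumed here; the variable names, true coefficients, and noise level are invented for the example). Each estimated coefficient can then be read as the average change in Y per unit increase in the corresponding predictor, with the other predictor held fixed.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Two hypothetical predictors and a response generated from known coefficients
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
y = 1.0 + 2.0 * X1 - 0.5 * X2 + rng.normal(scale=0.3, size=n)

# Design matrix with an intercept column, then ordinary least squares
X = sm.add_constant(np.column_stack([X1, X2]))
model = sm.OLS(y, X).fit()
print(model.params)   # estimates of beta_0, beta_1, beta_2
```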

Computing Regression Coefficients

Regression coefficients are computed using the same least squares method as for simple linear regression models.

Choose \beta_0, \beta_1, ..., \beta_p to minimize the residual sum of squares (RSS):

RSS = \sum_{i=1}^{n}{(y_i - \hat{y}_i)}^2

where \hat{y}_i is the fitted value for the i-th observation.
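
A minimal sketch of this idea, assuming Python with numpy on synthetic data: the least squares coefficients can be obtained with np.linalg.lstsq, and the RSS evaluated directly from the definition above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
beta_true = np.array([1.0, 2.0, -0.5, 0.8])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Least squares solution: beta_hat minimizes RSS = sum (y_i - y_hat_i)^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_hat) ** 2)
print(beta_hat, rss)
```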

Assessing Accuracy of Regression Coefficients

This is assessed by answering a basic question: is there a relationship between the outcome variable and the predictors?

To answer it, we test whether all the regression coefficients are zero.

H_0 : \beta_1 = \beta_2 = ... = \beta_p = 0 (null hypothesis)

H_a : at least one \beta_i is non-zero (alternative hypothesis)

The hypothesis test is performed by computing the F-statistic:

F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}
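
A sketch of how this statistic can be computed, assuming Python with numpy and statsmodels on synthetic data; the manual formula above should agree with the F-statistic reported by a standard OLS fit.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, p = 200, 3
X = sm.add_constant(rng.normal(size=(n, p)))
y = X @ np.array([1.0, 2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=n)

fit = sm.OLS(y, X).fit()
rss = np.sum(fit.resid ** 2)
tss = np.sum((y - y.mean()) ** 2)
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
print(f_stat, fit.fvalue)  # the manual value should match statsmodels' F-statistic
```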


Assessing Accuracy of Model (Model Fit)

As in simple linear regression, the RSE and R^2 are used to assess model fit.

  1. RSE = \sqrt{\frac{1}{n - p - 1}RSS}
  2. R^2 = Cor(Y, \hat{Y})^2 (the squared correlation between the response and the fitted values)
  3. In addition to RSE and R^2, it is recommended to plot the data and see if there are problems.
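
A small sketch of these two quantities, again assuming Python with statsmodels and synthetic data; the manually computed R^2 should agree with the value reported by the library.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, p = 200, 3
X = sm.add_constant(rng.normal(size=(n, p)))
y = X @ np.array([1.0, 2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=n)
fit = sm.OLS(y, X).fit()

rss = np.sum(fit.resid ** 2)
rse = np.sqrt(rss / (n - p - 1))                  # residual standard error
r2 = np.corrcoef(y, fit.fittedvalues)[0, 1] ** 2  # squared correlation of Y and Y_hat
print(rse, r2, fit.rsquared)                      # r2 should match statsmodels' R^2
```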

Assumptions in Multiple Linear Regression

Two main assumptions underlie the relationship between the predictor variables and the output variable:

  1. Additive – the effect of a change in one predictor variable on the output variable is independent of the values of the other predictor variables
  2. Linear – the change in the output variable per one-unit change in a predictor variable is constant, regardless of the value of that predictor

Other Assumptions

  • Error terms, \epsilon_1, \epsilon_2, ... \epsilon_n are uncorrelated
  • Error terms have constant variance, Var(\epsilon_i)  = \sigma^2

Extensions of Multiple Linear Regression

To accommodate non-linearity, a simple extension is polynomial regression.
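
For instance, a quadratic relationship can be captured by adding X^2 as an extra predictor and fitting the same linear model. The sketch below assumes Python with statsmodels and uses invented data with a genuinely quadratic trend.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x + 1.5 * x**2 + rng.normal(scale=0.3, size=200)  # truly quadratic

# Polynomial regression: treat x and x^2 as two predictors in a linear model
X_poly = sm.add_constant(np.column_stack([x, x**2]))
fit = sm.OLS(y, X_poly).fit()
print(fit.params)  # estimates of the intercept, linear, and quadratic terms
```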

Potential Problems and Remedies

1 Non-linearity of the relationships between the output and input variables

Remedy:

  • Residual Plots – Draw residual plots to identify non-linearity
  • If the relationship is non-linear, transform the predictors using, for example, \log X, \sqrt{X}, or X^2 (see the sketch below)
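
A possible sketch of this workflow, assuming Python with matplotlib and statsmodels on invented non-linear data: plot residuals against fitted values to spot curvature, then refit after a log transform of the predictor.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=200)
y = np.log(x) + rng.normal(scale=0.1, size=200)   # truly non-linear relationship

# Fit a plain linear model and inspect the residual plot for curvature
fit = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Remedy: transform the predictor (here log(x)) and refit
fit_log = sm.OLS(y, sm.add_constant(np.log(x))).fit()
print(fit.rsquared, fit_log.rsquared)  # the transformed model should fit better
```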

2 Correlation of error terms

This occurs often in time series data.

Remedy:

  • To identify this, plot the residuals as a function of time. If the error terms are correlated, we will see tracking in the residuals, i.e., adjacent residuals having similar values (see the sketch below).
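
A sketch of such a diagnostic plot, assuming Python with matplotlib and statsmodels; the errors here are generated with an AR(1)-style dependence (an invented example) so that the tracking is visible.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 200
t = np.arange(n)

# AR(1)-style errors so that adjacent residuals track each other
eps = np.zeros(n)
for i in range(1, n):
    eps[i] = 0.9 * eps[i - 1] + rng.normal(scale=0.3)
y = 1.0 + 0.05 * t + eps

fit = sm.OLS(y, sm.add_constant(t.astype(float))).fit()
plt.plot(t, fit.resid)      # long runs above/below zero suggest correlated errors
plt.xlabel("Time index")
plt.ylabel("Residual")
plt.show()
```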

3 Non-constant variance in error terms

Remedy: 

  • To identify this, draw the residual plot and look for a funnel shape
  • Transform the response/output Y using a concave function such as \log Y or \sqrt{Y} (see the sketch below)
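
A sketch of both the diagnostic and the remedy, assuming Python with matplotlib and statsmodels on invented data whose error spread grows with the mean.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, size=300)
# Error spread grows with the mean, producing a funnel-shaped residual plot
y = np.exp(0.3 * x + rng.normal(scale=0.2, size=300))

fit = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(fit.fittedvalues, fit.resid)   # look for a funnel shape here
plt.show()

# Remedy: fit on log(Y) instead, which stabilises the variance
fit_log = sm.OLS(np.log(y), sm.add_constant(x)).fit()
plt.scatter(fit_log.fittedvalues, fit_log.resid)
plt.show()
```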

4 Outliers

  • Draw residual plots to identify
  • A better approach is to plot the studentized residuals, computed by dividing each residual e_i by its estimated standard error (see the sketch below).
  • Remedy: Remove such observations
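
A sketch of computing studentized residuals, assuming Python with statsmodels (its OLSInfluence helper provides them directly); a common rule of thumb flags observations whose studentized residual exceeds 3 in absolute value, and the artificially injected outlier below should be caught.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

rng = np.random.default_rng(8)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=100)
y[0] += 5.0                                   # inject one artificial outlier

fit = sm.OLS(y, sm.add_constant(x)).fit()
student = OLSInfluence(fit).resid_studentized_internal
print(np.where(np.abs(student) > 3)[0])       # flag |studentized residual| > 3
```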

5 Collinearity

This means two or more predictor variables are closely related to each other.

Issues of collinearity

  • Reduces the power of the hypothesis test
  • Reduces the accuracy of the regression coefficient estimates
  • Causes the standard errors to grow (a consequence of the reduced accuracy of the coefficient estimates)
  • Results in a decline in the t-statistic

How to detect?

  • Look at the correlation matrix of the predictors
  • Compute variance inflation factor (VIF)

VIF(\hat\beta_j) = \frac{1}{1 - R^2_{X_j \mid X_{-j}}}

where R^2_{X_j \mid X_{-j}} is the R^2 from regressing X_j onto all the other predictors.

Rule of thumb: VIF > 5 or VIF > 10 indicates collinearity
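
A sketch of both detection approaches, assuming Python with numpy and statsmodels (variance_inflation_factor computes the VIF for one column of the design matrix); x2 is constructed to be nearly collinear with x1.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(9)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)       # nearly collinear with x1
x3 = rng.normal(size=n)
predictors = np.column_stack([x1, x2, x3])

# Detection 1: correlation matrix of the predictors
print(np.corrcoef(predictors, rowvar=False))

# Detection 2: VIF for each predictor (column 0 of X is the intercept, so skip it)
X = sm.add_constant(predictors)
vifs = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
print(vifs)   # x1 and x2 should show very large VIFs, x3 a VIF near 1
```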

Remedy:

  • Drop one of the problematic variables from the model
  • Combine collinear variables together into a single predictor variable