Simple Linear Regression

Linear regression is a simple supervised learning approach for predicting a quantitative outcome variable. Mathematically, simple linear regression, which includes only a single input or predictor variable (X), is represented as:


Y \approx \beta_0 + \beta_1 X

Making the error term explicit, the true model is written as:

Y = \beta_0 + \beta_1 X + \epsilon

\beta_0 and \beta_1 are called the model coefficients or model parameters: \beta_0 represents the intercept and \beta_1 represents the slope.

\epsilon is a mean-zero random error term. It captures everything the model misses: there may be other variables that cause Y to deviate from the linear approximation, or there may be measurement error.
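
To make this concrete, here is a minimal Python sketch that simulates data from the model; the coefficient values (2 and 3) and the noise level are hypothetical, chosen purely for illustration. The later snippets reuse this setup.

```python
import numpy as np

# Hypothetical true coefficients, for illustration only
beta0, beta1 = 2.0, 3.0

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)        # single predictor X
eps = rng.normal(0, 1.5, n)      # mean-zero random error term epsilon
y = beta0 + beta1 * x + eps      # Y = beta_0 + beta_1 * X + epsilon
```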

Computing Model Coefficients

By far the most common approach is computing the least squares coefficient estimates, which minimize the residual sum of squares (RSS). A residual is the difference between an observed output value and the output value predicted by the model, and RSS is the sum of the squared residuals:

RSS = e_1^2 + e_2^2 + \dots + e_n^2

where the residual e_i = y_i - \hat{y}_i is the difference between the i-th observed and predicted response values.

The least squares coefficient estimates are given by:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n}{(x_i - \bar{x}) (y_i - \bar{y})}} {\sum_{i=1}^{n}{(x_i - \bar{x})}^2}

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}

where \bar{x} and \bar{y} are the sample means.
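
Applying these formulas to the hypothetical data simulated earlier, a minimal sketch of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, 50)   # hypothetical data as above

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares estimates
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)   # should land near the true values 2 and 3
```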

Assessing Accuracy of Model Coefficients

The best linear approximation to the true relationship between X and Y is called the population regression line, whereas the least squares coefficient estimates define the least squares line.

Assessing the accuracy of the coefficient estimates is analogous to a simpler question: how accurate is the sample mean \hat{\mu} as an estimate of the population mean \mu?

To answer this, we compute the standard error of \hat{\mu}, written SE(\hat{\mu}), as follows:

Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}

This tells us the average amount by which the estimate \hat{\mu} differs from the actual value of \mu (here \sigma is the standard deviation of each observation).
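
A small numeric sketch, with a hypothetical mean, standard deviation, and sample size:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(5.0, 2.0, 100)             # 100 draws with true mu = 5, sigma = 2

mu_hat = y.mean()                         # estimate of mu
se_mu = y.std(ddof=1) / np.sqrt(len(y))   # SE(mu_hat) = sigma / sqrt(n), sigma estimated
print(mu_hat, se_mu)                      # SE should be close to 2 / sqrt(100) = 0.2
```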

The standard error associated with \hat{\beta}_0:

SE(\hat{\beta}_0)^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}{(x_i - \bar{x})}^2}\right]

The standard error associated with \hat{\beta}_1:

SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}{(x_i - \bar{x})}^2}

In general \sigma is not known, but it can be estimated from the data; this estimate is known as the residual standard error (RSE):

RSE = \sqrt{RSS/(n-2)}
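
Continuing with the hypothetical simulated data, a sketch that estimates \sigma via the RSE and then plugs it into both standard error formulas:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, 50)   # hypothetical data as above
n = len(x)

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x_bar

# RSE estimates sigma from the residuals
rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
rse = np.sqrt(rss / (n - 2))

# Standard errors of the coefficient estimates
se_beta0 = rse * np.sqrt(1.0 / n + x_bar ** 2 / sxx)
se_beta1 = rse / np.sqrt(sxx)
```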

Applications of Standard Errors

  • To compute confidence intervals: a 95% confidence interval for \beta_1 is approximately \hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1), as sketched after this list
  • To perform hypothesis tests on the coefficients
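
A minimal sketch of that interval on the same hypothetical data (the \pm 2 multiplier is a rough rule of thumb; an exact interval would use a quantile of the t distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, 50)   # hypothetical data as above
n = len(x)

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x_bar
rse = np.sqrt(np.sum((y - beta0_hat - beta1_hat * x) ** 2) / (n - 2))
se_beta1 = rse / np.sqrt(sxx)

# Approximate 95% confidence interval for beta_1
print(beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)
```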

Hypothesis Testing

If \beta_1 = 0, there is no relationship between X and Y; this gives the null hypothesis H_0 : \beta_1 = 0.

If \beta_1 \neq 0, there is some relationship between X and Y; this gives the alternative hypothesis H_a : \beta_1 \neq 0.

To test this, we compute a t-statistic:

t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}

This measures the number of standard deviations that \hat{\beta}_1 is away from 0. A large |t| (equivalently, a small p-value) lets us reject the null hypothesis.
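
A sketch of the test on the same hypothetical data; the two-sided p-value comes from a t distribution with n - 2 degrees of freedom (using SciPy here is a choice for this sketch, not something from the text):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, 50)   # hypothetical data as above
n = len(x)

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x_bar
rse = np.sqrt(np.sum((y - beta0_hat - beta1_hat * x) ** 2) / (n - 2))
se_beta1 = rse / np.sqrt(sxx)

t_stat = (beta1_hat - 0) / se_beta1              # t-statistic for H0: beta_1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
print(t_stat, p_value)                           # a tiny p-value rejects H0 here
```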

Assessing Accuracy of Model (Model Fit)

Once we have rejected the null hypothesis (previous section) and concluded that there is a relationship between X and Y, we need to assess the accuracy of the model, that is, to quantify the extent to which the model fits the data.

This is assessed by two measures: Residual Standard Error (RSE) and the R^2 statistic.

  1. Residual Standard Error (RSE)

This measures the average amount by which the response deviates from the true regression line. It is considered a measure of the lack of fit of the model.

RSE = \sqrt{\frac{1}{n-2}RSS}

RSS = \sum_{i=1}^{n}{(y_i - \hat{y}_i)}^2

  2. R^2

R^2 measures the proportion of variability in Y that can be explained using X. It always lies between 0 and 1, and is independent of the scale of Y.

R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}

where TSS = \sum_{i=1}^{n}{(y_i - \bar{y})}^2 is the total sum of squares, the total variability in the response before the regression is performed.

The fit can also be assessed using the correlation between X and Y, r = Cor(X,Y): in simple linear regression, R^2 = r^2, so r^2 can be used in place of R^2 to measure the fit of the linear model.
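
Finally, a sketch on the same hypothetical data confirming both identities, R^2 = 1 - RSS/TSS and R^2 = r^2:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, 50)   # hypothetical data as above

x_bar = x.mean()
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
beta0_hat = y.mean() - beta1_hat * x_bar
y_hat = beta0_hat + beta1_hat * x

rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - rss / tss

r = np.corrcoef(x, y)[0, 1]           # r = Cor(X, Y)
print(r2, r ** 2)                     # identical in simple linear regression
```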