Linear regression is a simple supervised learning approach for predicting a quantitative outcome variable. Mathematically, simple linear regression, which includes only a single input or predictor variable X, is represented as:

Y = β₀ + β₁X + ε

β₀ and β₁ are called the model coefficients or model parameters. β₀ represents the intercept and β₁ represents the slope.

ε is the mean-zero random error term. That is, there may be other variables that cause the model to deviate from the actual relationship, or there may be measurement errors.
Computing Model Coefficients
By far the most common approach is calculating the "least squares coefficient estimates". It involves minimizing the residual sum of squares (RSS):

RSS = e₁² + e₂² + ⋯ + eₙ²

where the i-th residual eᵢ = yᵢ − ŷᵢ is the difference between the i-th observed output value and the output value predicted by the model.
The least squares coefficient estimates are defined by:

β̂₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
β̂₀ = ȳ − β̂₁x̄

where x̄ and ȳ are the sample means.
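The closed-form estimates above can be sketched directly in NumPy. The synthetic data below (true intercept 2.0, slope 3.0, unit-variance noise) is an assumption for illustration:

```python
# Least squares coefficient estimates from the closed-form formulas.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=100)  # assumed: Y = 2 + 3X + noise

x_bar, y_bar = x.mean(), y.mean()
# Slope: covariance of x and y divided by variance of x (up to scaling)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: line passes through the point of sample means
beta0_hat = y_bar - beta1_hat * x_bar

print(f"intercept ~ {beta0_hat:.3f}, slope ~ {beta1_hat:.3f}")
```

With 100 observations and mild noise, the estimates land close to the true values used to generate the data.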
Assessing Accuracy of Model Coefficients
The best linear approximation to the true relationship between X and Y is called the population regression line, whereas the least squares coefficient estimates define the least squares line.
When assessing the model coefficients, we answer a question analogous to: how accurate is the sample mean as an estimate of the population mean?
To answer this, we compute the standard error of the estimate μ̂ of the population mean μ, written SE(μ̂), as follows:

Var(μ̂) = SE(μ̂)² = σ²/n

This tells us the average amount the estimate μ̂ differs from the actual value of μ.

The standard errors associated with the regression coefficients are:

SE(β̂₀)² = σ² [1/n + x̄² / Σᵢ (xᵢ − x̄)²]
SE(β̂₁)² = σ² / Σᵢ (xᵢ − x̄)²

where σ² = Var(ε). In general σ is not known, but it can be estimated from the data; this estimate is known as the residual standard error (RSE):

RSE = √(RSS / (n − 2))
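A minimal sketch of these estimates, again on assumed synthetic data with true error standard deviation 1:

```python
# Estimate RSE and the coefficient standard errors from fitted residuals.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=n)  # assumed: true sigma = 1

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x_bar

residuals = y - (beta0 + beta1 * x)
rse = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # estimate of sigma

se_beta1 = rse / np.sqrt(sxx)
se_beta0 = rse * np.sqrt(1.0 / n + x_bar ** 2 / sxx)

print(f"RSE ~ {rse:.3f}, SE(b1) ~ {se_beta1:.4f}, SE(b0) ~ {se_beta0:.4f}")
```

The RSE comes out near the true σ = 1, and SE(β̂₀) is larger than SE(β̂₁) here because the x values sit well away from zero.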
Applications of Standard Errors
- To compute confidence intervals
- To perform hypothesis tests on the coefficients
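For the first application, a rough 95% confidence interval for the slope can be formed as β̂₁ ± 2·SE(β̂₁) (the factor 2 approximates the relevant t-quantile). A sketch on assumed synthetic data with true slope 3.0:

```python
# Approximate 95% confidence interval for the slope: beta1 +/- 2 * SE(beta1).
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=n)  # assumed: true slope = 3.0

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x_bar

rse = np.sqrt(np.sum((y - beta0 - beta1 * x) ** 2) / (n - 2))
se_beta1 = rse / np.sqrt(sxx)

lo, hi = beta1 - 2 * se_beta1, beta1 + 2 * se_beta1
print(f"95% CI for slope: [{lo:.3f}, {hi:.3f}]")
```

With probability roughly 95%, an interval constructed this way contains the true slope.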
Hypothesis Testing
If there is no relationship between X and Y, then β₁ = 0 (the null hypothesis, H₀).
If there is a relationship between X and Y, then β₁ ≠ 0 (the alternative hypothesis, Hₐ).

To test this we compute a t-statistic:

t = (β̂₁ − 0) / SE(β̂₁)

This measures the number of standard deviations that β̂₁ is away from 0.
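A sketch of the test, with the two-sided p-value computed from the t-distribution with n − 2 degrees of freedom (data and true slope are assumptions for illustration):

```python
# t-test for H0: beta1 = 0 in simple linear regression.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=n)  # assumed: strong true relationship

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x_bar

rse = np.sqrt(np.sum((y - beta0 - beta1 * x) ** 2) / (n - 2))
t_stat = (beta1 - 0) / (rse / np.sqrt(sxx))
# Two-sided p-value: probability of a |t| this large under H0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"t = {t_stat:.1f}, p = {p_value:.2e}")
```

A tiny p-value lets us reject H₀ and conclude there is a relationship between X and Y.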
Assessing Accuracy of Model (Model Fit)
Once we have rejected the null hypothesis (previous section) and concluded there is a relationship between X and Y, we need to assess the accuracy of the model, that is, to quantify the extent to which the model fits the data.

This is assessed by two measures: the residual standard error (RSE) and the R² statistic.
- Residual Standard Error (RSE)

RSE = √(RSS / (n − 2))

This measures the average amount by which the response deviates from the true regression line. It is considered a measure of the lack of fit of the model.

- The R² statistic

R² = 1 − RSS/TSS, where TSS = Σᵢ (yᵢ − ȳ)² is the total sum of squares.

R² measures the proportion of variability in Y that can be explained by X in the model. It is always between 0 and 1, and is independent of the scale of Y.
This can also be calculated using Cor(X,Y), the correlation between X and Y. That is, we can use r = Cor(X,Y) instead of R² to measure the fit of the linear model, because in simple linear regression R² = r².
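The identity R² = Cor(X,Y)² is easy to check numerically; the synthetic data below is an assumption for illustration:

```python
# Compute R^2 from RSS/TSS and compare it to the squared correlation.
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=n)  # assumed synthetic data

x_bar = x.mean()
beta1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
beta0 = y.mean() - beta1 * x_bar

rss = np.sum((y - beta0 - beta1 * x) ** 2)   # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)            # total sum of squares
r_squared = 1 - rss / tss
r = np.corrcoef(x, y)[0, 1]                  # Cor(X, Y)

print(f"R^2 ~ {r_squared:.4f}, Cor(X,Y)^2 ~ {r ** 2:.4f}")
```

The two values agree to floating-point precision, confirming that in the simple (single-predictor) setting the two measures of fit are equivalent.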