The first, or “Linearity,” assumption implies that \(\operatorname{E}(\epsilon_i) = \operatorname{E}[y_i - (\beta_0 + \beta_1 x_i)] = 0\).
It concerns the systematic form of the model.
If you get this seriously wrong, then predictions will be inaccurate and any explanation of the relationship between the variables may be misleading.
Independence assumption.
The presence of strong dependence among the errors means that there is less information in the data than the sample size may suggest.
Furthermore, there is a risk that the analyst will mistakenly introduce systematic components to the model in an attempt to deal with an unsuspected dependence in the errors.
Unfortunately, it is difficult to detect dependence in errors using regression diagnostics except in special situations such as temporal data. (For other types of data, the analyst will need to rely on less testable assumptions about independence based on contextual knowledge.)
Equal variance assumption.
A failure to address this violation of the linear model assumptions may result in inaccurate inferences. In particular, prediction uncertainty may not be properly quantified. (Even so, excepting serious violations, the adequacy of the inference may not be seriously compromised.)
Normality is the least important assumption.
For large datasets, the inference will be quite robust to a lack of normality, as the central limit theorem (CLT) means that the approximations will tend to be adequate. Unless the sample size is quite small or the errors very strongly non-normal, this assumption is not crucial to success.
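In practice, a quick and informal way to examine these assumptions is to look at residual diagnostic plots. Below is a minimal sketch, assuming a fitted model object named fit.students (the lm fit of weight on height used in the R examples later in these notes):

# assumes fit.students <- lm(weight ~ height, data = students), as fit later in these notes
plot(fit.students, which = 1)  # residuals vs fitted values: assess linearity and equal variance
plot(fit.students, which = 2)  # normal Q-Q plot of residuals: assess normality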
Assuming the LINE hypotheses, the Gauss-Markov theorem holds, which means that the least squares estimators of the regression coefficients are unbiased and have the smallest variance among all linear unbiased estimators.
Least squares (LS) estimation does not require the random errors to be Gaussian. Exact inference (confidence intervals and hypothesis tests), however, does require the random errors to be Gaussian.
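For reference, the LINE assumptions can be summarized by writing the simple linear regression model with normal errors in its standard form:
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \quad i = 1, \ldots, n,\]
so the errors have mean zero (Linearity), are Independent, have common variance \(\sigma^2\) (Equal variance), and are Normally distributed.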
The Residuals, Residual Sum of Squares (RSS), and Mean Squared Error (MSE)
In practice, we are not able to observe the errors \(\epsilon_i, i = 1, \ldots, n\), in any direct way, only the \(y_i\). However, we would like to use the error values to support the validity of our model, as pointed out in the previous section.
Recall the definition of residuals (or prediction errors):
\[e_i = y_i - \hat{y_i},\]
where \(\hat{y_i} = \hat{\beta_0} + \hat{\beta_1} x_i\) for \(i = 1, \ldots, n\).
We will use the residuals \(e_i\) in place of the errors \(\epsilon_i\), in essence as proxies for the errors whose values we cannot observe, to help justify the validity of our linear regression models, as we will see in the following sections.
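As a small illustration of this idea (assuming the same fit.students object defined in the next subsection), the residuals can be computed directly from the definition and compared with those stored by R:

e <- students$weight - fitted(fit.students)             # e_i = y_i - y_hat_i
all.equal(unname(e), unname(residuals(fit.students)))   # TRUE: matches R's stored residuals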
RSS and MSE as an estimate of \(\sigma^2\)
The residual sum of squares (RSS, also denoted \(SS_{error}\), and called the error sum of squares, SSE, in some references) is defined as
\[RSS = \sum_{i=1}^n e_i^2\]
An unbiased estimator of \(\sigma^2\) is
\[\hat\sigma^2 = RSS/(n-2),\]
which is called the mean squared error (MSE), and its square root \(\hat\sigma\) is known as the residual standard error (RSE).
Comments on the MSE:
The numerator adds up, in squared units, how far each response \(y_i\) is from its estimated mean \(\hat y_i\).
Degrees of freedom (d.f.): the denominator divides the sum by \(n-2\) because we effectively estimate two parameters, the population intercept \(\beta_0\) and the population slope \(\beta_1\). That is, we lose two degrees of freedom.
Obtain \(\hat\sigma^2\) in R
fit.students = lm(weight ~ height, data = students)
fit.students$residuals  # residuals
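A sketch of how the residuals can then be turned into \(\hat\sigma^2\) and the RSE (the manual computation is shown alongside R's built-in summary value for comparison):

rss <- sum(fit.students$residuals^2)   # RSS
mse <- rss / fit.students$df.residual  # MSE = RSS/(n - 2), estimate of sigma^2
sqrt(mse)                              # RSE; should match summary(fit.students)$sigma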
The coefficient of determination, \(R^2\), gives the proportion of the variation in the response explained by the model.
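For completeness, the standard definition is
\[R^2 = 1 - \frac{RSS}{TSS}, \qquad TSS = \sum_{i=1}^n (y_i - \bar{y})^2,\]
where \(TSS\) denotes the total sum of squares of the response.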
\(R^2\) is the square of the multiple correlation coefficient (i.e., the Pearson correlation coefficient), which is defined as the sample correlation coefficient between \(\mathbf{y}\) and \(\mathbf{\hat{y}}\). Convince yourself by checking out the values in R:
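A minimal check along these lines, using the skin cancer fit (fit.skcan) introduced later in these notes; the squared correlation between the observed and fitted responses should equal the Multiple R-squared reported by summary():

cor(skcan$Mort, fitted(fit.skcan))^2   # squared correlation between y and y_hat
summary(fit.skcan)$r.squared           # R^2 from summary(); the two values agree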
What is a good value of \(R^2\)? It depends on the application.
If \(R^2 = 1\), all of the data points fall on the regression line. The predictor \(x\) accounts for all of the variation in \(y\).
If \(R^2 = 0\), the estimated regression line is perfectly horizontal. The predictor \(x\) accounts for none of the variation in \(y\).
Caution: when \(R^2\) is close to zero, it does not necessarily mean that \(x\) and \(y\) are not related. For example, the appropriate relationship between \(x\) and \(y\) may be “quadratic” rather than “linear,” as the sketch below illustrates.
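A minimal simulated sketch of this caution (hypothetical data, not one of the course data sets): here \(y\) depends strongly on \(x\), but quadratically, so the linear fit yields an \(R^2\) near zero.

set.seed(1)
x <- seq(-3, 3, length.out = 100)
y <- x^2 + rnorm(100, sd = 0.5)   # strong quadratic relationship
summary(lm(y ~ x))$r.squared      # close to 0 despite the clear relationship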
Adjusted \(R^2\) is a modification of \(R^2\) that adjusts for the number of independent variables in a linear regression model. We will introduce adjusted \(R^2\) when we cover the topic of multiple linear regression (MLR).
Sampling Distributions for \(\hat\beta_0\) and \(\hat\beta_1\)
Recall the properties of LSE \(\hat\beta_0\) and \(\hat\beta_1\):
By the Gauss-Markov theorem, we know \(\hat\beta_0\) and \(\hat\beta_1\) are the best linear unbiased estimators of \(\beta_0\) and \(\beta_1\), respectively.
Normality
Since we know
\[\hat{\beta}_{1}=\frac{S_{xy}}{S_{xx}}=\sum_{i=1}^{n} c_{i} y_{i},\] where \(c_{i}=\frac{x_{i}-\bar{x}}{S_{xx}}\), \(S_{xx}=\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}\), and \(S_{xy} = \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)=\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right) y_{i}\); and
\[\hat{\beta}_{0}=\bar{y}-\hat{\beta}_{1}\bar{x}=\sum_{i=1}^{n} d_{i} y_{i},\] where \(d_{i}=\frac{1}{n}-c_{i}\bar{x}\).
Key Property: Any linear combination of independent Normal random variables is also Normal.
Since \(\hat\beta_0\) and \(\hat\beta_1\) are both finite linear combinations of the \(y_i\), which are independent and normally distributed, both \(\hat\beta_0\) and \(\hat\beta_1\) are normally distributed as well.
Homework Question: Please convince yourselves of the sampling distributions of \(\hat\beta_0\) and \(\hat\beta_1\) by deriving them on your own. The normality has been shown in the lecture; you need to find the expectation and the variance of \(\hat\beta_0\) and of \(\hat\beta_1\), respectively.
When \(\sigma^2\) is unknown, since \(Var(\hat\beta_0)=\sigma^2(\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}})\) and \(Var(\hat\beta_1)=\frac{\sigma^2}{S_{xx}}\), we estimate \(\sigma^2\) with the MSE, \(\hat\sigma^2\); that is,
\[\widehat{Var}(\hat\beta_0)=\hat\sigma^2\left(\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}}\right), \qquad \widehat{Var}(\hat\beta_1)=\frac{\hat\sigma^2}{S_{xx}}.\]
When this is done, the resulting standardized estimators follow a \(t\)-distribution with \(n-2\) degrees of freedom instead of a normal distribution.
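Concretely, this standard result can be written as
\[\frac{\hat\beta_1-\beta_1}{SE(\hat\beta_1)} \sim t_{n-2}, \qquad \frac{\hat\beta_0-\beta_0}{SE(\hat\beta_0)} \sim t_{n-2},\]
where \(SE(\hat\beta_j)=\sqrt{\widehat{Var}(\hat\beta_j)}\) is the estimated standard error. This is the distribution used for the confidence intervals and hypothesis tests below.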
Confidence Intervals (CI) for \(\beta_0\) and \(\beta_1\)
Recall that we are ultimately always interested in drawing conclusions about the population, not the particular sample we observed.
In the SLR setting, we are often interested in learning about the population intercept \(\beta_0\) and the population slope \(\beta_1\).
With the estimated variances of \(\hat\beta_0\) and \(\hat\beta_1\), we in turn obtain estimates of the respective standard errors, \(SE(\hat\beta_0)\) and \(SE(\hat\beta_1)\), by taking square roots, and hence confidence intervals (CI) for the true values of the intercept and slope as well:
\[\hat\beta_j \pm t(\tfrac{\alpha}{2}, n-2)\, SE(\hat\beta_j), \quad j = 0, 1,\]
with \((1-\alpha)\times 100\%\) confidence. Here, \(t(\frac{\alpha}{2}, n-2)\) is the value that cuts off \(\alpha/2 \times 100\%\) in the upper tail of the \(t\)-distribution for \(n-2\) degrees of freedom.
Obtain CIs for \(\beta_0\) and \(\beta_1\) in R:
fit.skcan = lm(Mort ~ Lat, data = skcan)  # skin cancer data set
summary(fit.skcan)
Call:
lm(formula = Mort ~ Lat, data = skcan)
Residuals:
Min 1Q Median 3Q Max
-38.972 -13.185 0.972 12.006 43.938
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 389.1894 23.8123 16.34 < 2e-16 ***
Lat -5.9776 0.5984 -9.99 3.31e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.12 on 47 degrees of freedom
Multiple R-squared: 0.6798, Adjusted R-squared: 0.673
F-statistic: 99.8 on 1 and 47 DF, p-value: 3.309e-13
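To obtain the confidence intervals themselves, one can apply the formula above by hand using the estimates and standard errors reported in the output, or use R's confint(); a sketch of both:

# 95% CI for the slope, by hand, from the summary output above
-5.9776 + c(-1, 1) * qt(0.975, df = 47) * 0.5984
# equivalently, CIs for both coefficients at once
confint(fit.skcan, level = 0.95)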