Lecture 2

Author

Jiaye Xu

Published

February 17, 2022

Lecture 2 Overview

  • LINE Assumptions of SLR Model

  • The Residuals, Residual Sum of Squares (RSS), and Mean Squared Error (MSE)

    • RSS and the MSE as estimate of \(\sigma^2\)

    • Decomposition of Sum of Squares

  • ANOVA Table

  • Goodness-of-fit

  • Inference

    • Sampling Distributions for \(\hat\beta_0\) and \(\hat\beta_1\)

    • Confidence Intervals (CI) for \(\beta_0\) and \(\beta_1\)


“All models are wrong, but some are useful.” – George E. P. Box, British statistician.

All models are right conditional on all of the underlying assumptions.

LINE Assumptions of SLR Model

  • Linearity: The mean of the response, \(\operatorname{E}(y_i) = \beta_0 + \beta_1 x_i\), is a linear function of \(x_i\).
  • Independence: The errors, \(\epsilon_i\), are independent.
  • Normality: The errors, \(\epsilon_i\), follow a normal distribution.
  • Equal Variance: The errors, \(\epsilon_i\), have equal variance (denoted \(\sigma^2\)). This property is also called homoscedasticity.

Recap the simple linear regression (SLR) model: Given \(n\) fixed values \(x_1, \ldots, x_n\), we have

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n,\]

where

\[\epsilon_i \stackrel{iid}{\sim} \mathcal{N}(0,\sigma^2)\]

Comments:

  • The first or “Linearity” assumption implies that \(\operatorname{E}(\epsilon_i) = \operatorname{E}[y_i -(\beta_0 + \beta_1 x_i)]=0\).

  • The systematic form of the model.

    If you get this seriously wrong, then predictions will be inaccurate and any explanation of the relationship between the variables may be misleading.

  • Independence assumption.

    • The presence of strong dependence of errors means that there is less information in the data than the sample size may suggest.

    • Furthermore, there is a risk that the analyst will mistakenly introduce systematic components into the model in an attempt to deal with an unsuspected dependence in the errors.

    • Unfortunately, it is difficult to detect dependence in errors using regression diagnostics except in special situations such as temporal data. (For other types of data, the analyst will need to rely on less testable assumptions about independence based on contextual knowledge.)

  • Equal variance assumption.

    A failure to address this violation of the linear model assumptions may result in inaccurate inferences. In particular, prediction uncertainty may not be properly quantified. (Even so, excepting serious violations, the adequacy of the inference may not be seriously compromised.)

  • Normality is the least important assumption.

    For large datasets, the inference will be quite robust to a lack of normality, since by the central limit theorem (CLT) the normal approximations will tend to be adequate. Unless the sample size is quite small or the errors are very strongly non-normal, this assumption is not crucial to success.

  • Assuming the LINE hypotheses guarantees that the Gauss-Markov Theorem holds (normality is in fact not needed for it), which means that the least squares estimators of the regression coefficients are unbiased and have the smallest variance among all linear unbiased estimators.

  • The LS estimation does not require the random errors to be Gaussian. Inference (confidence intervals and hypothesis tests), however, does rely on the errors being Gaussian. A small simulation of the model under these assumptions is sketched below.
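To make the model and the LINE assumptions concrete, here is a minimal simulation sketch in R; the parameter values (beta0, beta1, sigma) and the grid of x values are arbitrary choices for illustration, not part of the lecture data.

# Simulate one data set satisfying all four LINE assumptions
set.seed(1)
n.sim <- 50
x.sim <- seq(1, 10, length.out = n.sim)        # fixed predictor values
beta0 <- 2; beta1 <- 0.5; sigma <- 1           # assumed "true" parameters
eps   <- rnorm(n.sim, mean = 0, sd = sigma)    # iid Normal errors: Independence, Normality, Equal variance
y.sim <- beta0 + beta1*x.sim + eps             # Linearity of the mean function
coef(lm(y.sim ~ x.sim))                        # estimates should be close to (2, 0.5)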

The Residuals, Residual Sum of Squares (RSS), and Mean Squared Error (MSE)

Practically, we cannot observe the errors \(\epsilon_i, i = 1, \ldots, n\), directly; we only observe the \(y_i\). However, we would like to use the error values to support the validity of our model, as pointed out in the previous section.

Recall the definition of residuals (or prediction errors):

\[e_i = y_i - \hat{y_i},\]

where \(\hat{y_i} = \hat{\beta_0} + \hat{\beta_1} x_i\) for \(i = 1, \ldots, n\).

We will use the residuals \(e_i\) as proxies for the errors \(\epsilon_i\), whose values we do not have access to, to help justify the validity of our linear regression models, as we will see in the following sections.
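As a quick sanity check of this definition, the sketch below (on a small simulated data set with arbitrary parameter values) verifies that the residuals stored in an lm object equal \(y_i - \hat{y}_i\).

# Residuals from lm() agree with y minus the fitted values
set.seed(2)
x.tmp <- rnorm(20)
y.tmp <- 1 + 2*x.tmp + rnorm(20)
fit.tmp <- lm(y.tmp ~ x.tmp)
all.equal(residuals(fit.tmp), y.tmp - fitted(fit.tmp)) # should be TRUE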

RSS and MSE as estimate of \(\sigma^2\)

  • Residual sum of squares (RSS, also denoted as \(SS_{error}\)), which is also called error sum of squares (SSE) in some references

\[RSS = \sum_{i=1}^n e_i^2\]

  • An unbiased estimator of \(\sigma^2\) is

\[\hat\sigma^2 = RSS/(n-2),\]

which is called the mean squared error (MSE), and its square root \(\hat\sigma\) is known as the Residual Standard Error (RSE).

Comments on the MSE:

  • The numerator adds up, in squared units, how far each response \(y_i\) is from its estimated mean \(\hat y_i\).

  • The degrees of freedom (d.f.)

    The denominator divides the sum by \(n-2\), because we effectively estimate two parameters - the population intercept \(\beta_0\) and the population slope \(\beta_1\). That is, we lose two degrees of freedom.

Obtain \(\hat\sigma^2\) in R

fit.students = lm(weight~height, data = students)

fit.students$residuals # residuals
           1            2            3            4            5            6 
  6.86676322  -5.27081825   3.45401883   0.04127444   5.04127444 -13.23388849 
           7            8            9           10 
 -0.23388849 -10.37146995  -0.50905141  14.21578566 
summary(fit.students) # summary in overview

Call:
lm(formula = weight ~ height, data = students)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.2339  -4.0804  -0.0963   4.6445  14.2158 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -266.5344    51.0320  -5.223    8e-04 ***
height         6.1376     0.7353   8.347 3.21e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.641 on 8 degrees of freedom
Multiple R-squared:  0.897, Adjusted R-squared:  0.8841 
F-statistic: 69.67 on 1 and 8 DF,  p-value: 3.214e-05
summary(fit.students)$sigma # RSE via R function
[1] 8.641368
(summary(fit.students)$sigma)^2 # MSE via R function
[1] 74.67324
sum((students$weight - fit.students$fitted.values)^2)/(nrow(students) - 2) # MSE via formula
[1] 74.67324

Decomposition of Sum of Squares

Key Notations

  • Total sum of squares (SST or \(SS_{total}\))

\[SST = \sum_{i=1}^n (y_i -\bar{y})^2\]

  • Regression sum of squares (SSR, a.k.a., \(SS_{regression}\), \(SS_{model}\)) is the explained sum of squares by the model

\[SSR = \sum_{i=1}^n (\hat{y}_i -\bar{y})^2\]

  • Error sum of squares (SSE or \(SS_{error}\)) is the unexplained sum of squares

\[SSE = \sum_{i=1}^n (y_i -\hat{y}_i)^2\]

Decomposition of SST

\[SST = SSR + SSE\]

or, in equivalent notation,

\[SS_{total}=SS_{regression}+SS_{error}\]
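The decomposition can be verified numerically. The sketch below reuses the fit.students model from the previous section (so it assumes the students data are still loaded).

y.obs <- students$weight
y.hat <- fit.students$fitted.values
SST <- sum((y.obs - mean(y.obs))^2)   # total sum of squares
SSR <- sum((y.hat - mean(y.obs))^2)   # regression (explained) sum of squares
SSE <- sum((y.obs - y.hat)^2)         # error (unexplained) sum of squares
all.equal(SST, SSR + SSE)             # should be TRUE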

ANOVA Table of SLR

Source   df     SS     MS
Model    2-1    SSR    MSR = SSR/(2-1)
Error    n-2    SSE    MSE = SSE/(n-2)
Total    n-1    SST
anova(fit.students)
Analysis of Variance Table

Response: weight
          Df Sum Sq Mean Sq F value    Pr(>F)    
height     1 5202.2  5202.2  69.666 3.214e-05 ***
Residuals  8  597.4    74.7                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we can also obtain the value of \(\hat\sigma^2\), i.e., the MSE, as shown below.
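For example, the MSE can be extracted from the ANOVA table in R and compared with the value obtained earlier from summary(); a minimal sketch using fit.students:

anova.tab <- anova(fit.students)
anova.tab["Residuals", "Mean Sq"]   # MSE read off the ANOVA table
(summary(fit.students)$sigma)^2     # same value, obtained from the RSE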

Goodness-of-fit

It is useful to know how well an SLR model fits the data. A measure of goodness-of-fit is the so-called coefficient of determination, or \(R^2\):

\[R^2 = \frac{SS_{regression}}{SS_{total}} = 1- \frac{SS_{error}}{SS_{total}}\]
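Both forms of the definition can be checked directly from the sums of squares. The sketch below uses fit.students and the students data from earlier; SSE, SST, and SSR are recomputed so the snippet stands on its own.

SSE <- sum(residuals(fit.students)^2)
SST <- sum((students$weight - mean(students$weight))^2)
SSR <- SST - SSE
SSR/SST                          # R-squared as SSR/SST
1 - SSE/SST                      # R-squared as 1 - SSE/SST
summary(fit.students)$r.squared  # matches the built-in value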

Understanding \(R^2\)

  • \(R^2\) is a number between 0 and 1.

  • It gives the proportion of the variation in the response explained by the model.

  • \(R^2\) is the square of the multiple correlation coefficient, which is defined as the sample (Pearson) correlation coefficient between \(\mathbf{y}\) and \(\mathbf{\hat{y}}\). Convince yourself by checking out the values in R:

mcc <- cor(students$weight, fit.students$fitted.values)
mcc^2
[1] 0.8969953
summary(fit.students)$r.squared # R-squared
[1] 0.8969953
  • What is a good value of \(R^2\)? It depends on the application.
    • If \(R^2 = 1\), all of the data points fall on the regression line. The predictor \(x\) accounts for all of the variation in \(y\).
    • If \(R^2 = 0\), the estimated regression line is perfectly horizontal. The predictor \(x\) accounts for none of the variation in \(y\).
    • Caution: when \(R^2\) is close to zero, it does not necessarily mean that \(x\) and \(y\) are not related. See the example below, in which the appropriate relationship between \(x\) and \(y\) is “quadratic”, not “linear”.
temp.x = seq(-1, 1, by=0.1)
temp.y = temp.x^2 + rnorm(length(temp.x), 0, 0.1)
plot(temp.x, temp.y)
lines(temp.x, lm(temp.y~ I(temp.x^2))$fitted, col = "blue")

cor(temp.x, temp.y)
[1] -0.01608401
  • Adjusted \(R^2\) is a modification of \(R^2\) that adjusts for the number of independent variables in a linear regression model. We are going to introduce adjusted \(R^2\) when we cover the topic of multiple linear regression (MLR).

Sampling Distributions for \(\hat\beta_0\) and \(\hat\beta_1\)

Recall the properties of LSE \(\hat\beta_0\) and \(\hat\beta_1\):

  • By the Gauss-Markov theorem, we know \(\hat\beta_0\) and \(\hat\beta_1\) are the best linear unbiased estimators of \(\beta_0\) and \(\beta_1\), respectively.

  • Normality

Since we know

  • \[\hat{\beta}_{1}=\frac{S_{xy}}{S_{xx}}=\sum_{i=1}^{n} c_{i} y_{i},\] where \(c_{i}=\frac{x_{i}-\bar{x}}{S_{xx}},\) \(S_{xx}=\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2},\) and \(S_{xy} = \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)=\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right) y_{i}\)

  • \[\hat{\beta}_{0}=\bar{y}-\hat{\beta}_{1}\bar{x}=\sum_{i=1}^{n} d_{i} y_{i},\] where \(d_{i}=\frac{1}{n}-c_{i}\bar{x}\)

  • Key Property: Any linear combination of independent Normal random variables is also Normal.

Since \(\hat\beta_0\) and \(\hat\beta_1\) are both finite linear combinations of the \(y_i\), which are independent and each normally distributed, both \(\hat\beta_0\) and \(\hat\beta_1\) are normally distributed as well.

Homework Question: Please convince yourselves of the sampling distributions of \(\hat\beta_0\) and \(\hat\beta_1\) by your own derivation. The normality has been shown in the lecture. You need to find the expectation and the variance of \(\hat\beta_0\) and \(\hat\beta_1\), respectively.

It can be shown that when \(\sigma^2\) is known

\[\hat\beta_0 \sim \mathcal{N}\left(\beta_0, \sigma^2(\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}})\right)\]

\[\hat\beta_1 \sim \mathcal{N}\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right)\]

When \(\sigma^2\) is unknown, since \(Var(\hat\beta_0)=\sigma^2(\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}})\) and \(Var(\hat\beta_1)=\frac{\sigma^2}{S_{xx}}\), we estimate \(\sigma^2\) with MSE, \(\hat\sigma^2\), that is,

\[\widehat{Var}(\hat\beta_0)=\hat\sigma^2\left(\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}}\right)\] \[\widehat{Var}(\hat\beta_1)=\frac{\hat\sigma^2}{S_{xx}}\] When this is done, the resulting standardized statistic, e.g., \(\frac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)}\), follows a \(t\)-distribution with \(n-2\) degrees of freedom instead of a normal distribution.
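The sampling distributions stated above can also be checked empirically. Below is a minimal Monte Carlo sketch with hypothetical parameter values: the design \(x\) is held fixed, many response vectors are generated from the model, and the empirical mean and variance of \(\hat\beta_1\) are compared with \(\beta_1\) and \(\sigma^2/S_{xx}\).

# Monte Carlo check of the sampling distribution of the slope estimator
set.seed(3)
n.sim <- 30
x.sim <- runif(n.sim, 0, 10)                 # fixed design, reused in every replication
beta0 <- 1; beta1 <- 2; sigma <- 3           # assumed "true" parameters
Sxx   <- sum((x.sim - mean(x.sim))^2)
b1.hat <- replicate(5000, {
  y.sim <- beta0 + beta1*x.sim + rnorm(n.sim, 0, sigma)
  coef(lm(y.sim ~ x.sim))[2]                 # slope estimate for this replication
})
mean(b1.hat)   # should be close to beta1 = 2 (unbiasedness)
var(b1.hat)    # should be close to the theoretical variance below
sigma^2/Sxx    # theoretical Var(beta1_hat)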

Confidence Intervals (CI) for \(\beta_0\) and \(\beta_1\)

  • Recall that we are ultimately always interested in drawing conclusions about the population, not the particular sample we observed.

  • In the SLR setting, we are often interested in learning about the population intercept \(\beta_0\) and the population slope \(\beta_1\).

With the estimated variances of \(\hat\beta_0\) and \(\hat\beta_1\), we in turn obtain the respective standard errors, \(SE(\hat\beta_0)\) and \(SE(\hat\beta_1)\), by taking square roots, and hence confidence intervals (CI) for the true values of the intercept and slope:

\[\hat\beta_0 - t(\frac{\alpha}{2}, n-2)SE(\hat\beta_0)\le \beta_0 \le \hat\beta_0 + t(\frac{\alpha}{2}, n-2)SE(\hat\beta_0)\]

\[\hat\beta_1 - t(\frac{\alpha}{2}, n-2)SE(\hat\beta_1)\le \beta_1 \le \hat\beta_1 + t(\frac{\alpha}{2}, n-2)SE(\hat\beta_1)\]

with \((1-\alpha)\times 100\%\) confidence. Here, \(t(\frac{\alpha}{2}, n-2)\) is the value that cuts off \(\alpha/2 \times 100\%\) in the upper tail of the \(t\)-distribution for \(n-2\) degrees of freedom.

Obtain CIs of \(\beta_0\) and \(\beta_1\) in R

fit.skcan = lm(Mort~ Lat, data = skcan) # skin cancer data set
summary(fit.skcan)

Call:
lm(formula = Mort ~ Lat, data = skcan)

Residuals:
    Min      1Q  Median      3Q     Max 
-38.972 -13.185   0.972  12.006  43.938 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 389.1894    23.8123   16.34  < 2e-16 ***
Lat          -5.9776     0.5984   -9.99 3.31e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.12 on 47 degrees of freedom
Multiple R-squared:  0.6798,    Adjusted R-squared:  0.673 
F-statistic:  99.8 on 1 and 47 DF,  p-value: 3.309e-13
# 1. 95% CI by formula
b0 = summary(fit.skcan)$coefficients[1] # beta0_hat
b1 = summary(fit.skcan)$coefficients[2] # beta1_hat
n = nrow(skcan) # sample size
t_pct = qt(0.975, df = n - 2) # t-percentile, alpha = 0.05
Sxx = sum((skcan$Lat - mean(skcan$Lat))^2)
se_b0 = summary(fit.skcan)$sigma*sqrt(1/n + mean(skcan$Lat)^2/Sxx)
se_b1 = summary(fit.skcan)$sigma/sqrt(Sxx)
(ci_b0 = b0 + c(-1, 1)*t_pct*se_b0)
[1] 341.2852 437.0936
(ci_b1 = b1 + c(-1, 1)*t_pct*se_b1)
[1] -7.181404 -4.773867
# 2. by built-in R function
confint(fit.skcan) # 95% CI
                 2.5 %     97.5 %
(Intercept) 341.285151 437.093552
Lat          -7.181404  -4.773867
confint(fit.skcan, level = 0.9) # 90% CI
                  5 %       95 %
(Intercept) 349.23403 429.144672
Lat          -6.98166  -4.973612
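As a side check, the standard errors computed by formula above can also be read directly off the coefficient table returned by summary():

summary(fit.skcan)$coefficients[, "Std. Error"] # SE(beta0_hat), SE(beta1_hat)
c(se_b0, se_b1)                                 # match the values computed by formula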

Factors Affecting the Width of C.I. for \(\beta_1\)

  • Confidence level: as the confidence level decreases, the width of the interval decreases. In practice, confidence levels are rarely set below 90%.

  • MSE or \(\hat\sigma^2\): as MSE decreases, the width of the interval decreases.

  • The spread of the predictor \(x\) values: the more spread out the predictor \(x\) values, i.e., the larger \(S_{xx}\), the narrower the interval (see the simulation sketch after this list).

  • Sample size \(n\): as the sample size \(n\) increases, the width of the interval decreases.
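The following simulation sketch, on hypothetical data with arbitrarily chosen parameters, illustrates two of these factors: a more spread-out predictor (larger \(S_{xx}\)) and a lower confidence level both shorten the interval for \(\beta_1\).

set.seed(4)
x.narrow <- runif(30, 4, 6)                    # predictor with small spread (small Sxx)
x.wide   <- runif(30, 0, 10)                   # predictor with large spread (large Sxx)
y.narrow <- 1 + 2*x.narrow + rnorm(30, 0, 2)
y.wide   <- 1 + 2*x.wide   + rnorm(30, 0, 2)
diff(confint(lm(y.narrow ~ x.narrow))[2, ])    # 95% CI width for the slope; typically wider
diff(confint(lm(y.wide ~ x.wide))[2, ])        # narrower, thanks to the larger Sxx
diff(confint(lm(y.wide ~ x.wide), level = 0.90)[2, ]) # narrower still at 90% confidence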