Lecture 3

Author

Jiaye Xu

Published

February 26, 2022

Lecture 3 Overview

  • Inference of SLR (continued)

    • Confidence Intervals (CI) for Mean Response

    • Prediction Intervals (PI) for New Observations

    • Hypothesis Tests

      • t-test for coefficients

      • F-test for significance of SLR


In practice, we could ask two different scientific research questions concerning the response, for example:

  1. What is the mean (expected) mortality from skin cancer for all the locations at 40 degrees north latitude?

  2. What mortality from skin cancer can we predict for an individual location where the latitude is 40 degrees north?

More generally, the two statistical questions can be asked as

  1. What is the mean response \(E(Y_i)\) when the predictor value is \(x_i\)?

  2. What value will a new observation \(Y_{i(new)}\) be when the predictor value is \(x_{i}\)?

Note that \(x_{i}\) is a value within the range of \(x\) values in the data set but \(x_{i}\) does not have to be one of the actual \(x\) values.

Besides the estimated (or predicted) value, we are also interested in the quantification of uncertainty.

“Best Guess” for (Mean) Response

To answer the two scientific research questions on mortality rate due to skin cancer, we calculate the fitted value (or predicted value) when the predictor is \(x_i\): \[\hat{y}_i = \hat \beta_0 + \hat \beta_1 x_i.\]

Mark up the point of \((x_i, \hat y_i)\) in the plot:

fit.skcan = lm(Mort~ Lat, data = skcan)
(coef(fit.skcan)[1] + coef(fit.skcan)[2]*40) # estimated mean response
(Intercept) 
   150.0839 
plot(x = skcan$Lat, y = skcan$Mort, xlab = "Latitude (at center of state)", ylab = "Mortality (Deaths per 10 million)", main = "Skin cancer mortality vs. Latitude")
abline(lm(Mort~ Lat, data = skcan), col = "blue")
segments(0,  coef(fit.skcan)[1] + coef(fit.skcan)[2]*40, 40,  coef(fit.skcan)[1] + coef(fit.skcan)[2]*40, col = "grey", lwd=3, lty = "dotted")
segments(40, 0, 40, coef(fit.skcan)[1] + coef(fit.skcan)[2]*40, col = "grey", lwd=3, lty = "dotted")

For both questions, our “best guess” of mortality rate is 150 deaths per 10 million people.

Comments:

  • As always, to be confident in the answer to our research questions, we should put an interval around our best guesses.

  • We need to construct a confidence interval (CI) for the mean response and a prediction interval (PI) for a new response.

Confidence Intervals (CI) for Mean Response

The concept of confidence interval:

A confidence interval is an interval associated with a parameter and is a frequentist concept. The parameter is assumed to be non-random but unknown, and the confidence interval is computed from data. Because the data are random, the interval is random.

A 95% confidence interval will contain the true parameter with probability 0.95. That is, with a large number of repeated samples, 95% of the intervals would contain the true parameter.
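This repeated-sampling interpretation can be checked by simulation. The sketch below does not use the skin cancer data; it draws many samples from an assumed "true" SLR model (the parameter values are assumptions for illustration), refits the model each time, and counts how often the 95% CI for the mean response at a fixed \(x_0\) covers the true mean:

```r
# Coverage check for a 95% CI of the mean response at x0 = 40
# (simulated data; the "true" parameters below are assumptions for illustration)
set.seed(1)
beta0 <- 389; beta1 <- -6; sigma <- 19   # assumed true model
x <- runif(49, 25, 50)                   # fixed design, n = 49
x0 <- 40
true_mean <- beta0 + beta1 * x0
covered <- replicate(1000, {
  y <- beta0 + beta1 * x + rnorm(length(x), 0, sigma)
  fit <- lm(y ~ x)
  ci <- predict(fit, data.frame(x = x0), interval = "confidence")[1, ]
  ci["lwr"] <= true_mean && true_mean <= ci["upr"]
})
mean(covered)  # close to 0.95
```

The empirical coverage should be close to the nominal 95%: roughly 95% of the 1000 random intervals contain the fixed, non-random true mean.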

Q: When is it okay to use the formula for the confidence interval for \(E(Y_i)\) ?

A: When \(x_i\) is a value within the range of the \(x\) values in the data set — that is, when \(x_i\) is a value within the “scope of the model.” But, note that \(x_i\) does not have to be one of the actual \(x\) values in the data set.

Consider the mean response \(E(Y_i)\) when the predictor value is \(x_i\). The general formula for the confidence interval in words is \[\text{sample estimate} \pm \text{t-multiplier} \times \text{standard error}\] and the formula in notation is \[\hat y_i \pm t(\frac{\alpha}{2}, n-2)SE(\hat y_i)\] where

  • \(\hat y_i\) is the fitted value (or predicted value) of the response when the predictor is \(x_i\),

  • \(t(\frac{\alpha}{2}, n-2)\) is the t-multiplier,

  • \(SE(\hat y_i) = \hat \sigma\sqrt{\frac{1}{n}+\frac{(x_i -\bar{x})^2}{S_{xx}}}\) is the standard error of \(\hat y_i\).

Implement in R

# 95% CI for the mean response with latitude = 40
x_0= 40
(y_0 = unname(fit.skcan$coefficients[1] + fit.skcan$coefficients[2]*x_0))
[1] 150.0839
# 1. using formula
n = nrow(skcan) # sample size
CI =  y_0 + 
  c(-1, 1) * qt(1-.05/2, n - 2) * summary(fit.skcan)$sigma * sqrt(1/n + (x_0 - mean(skcan$Lat))^2/sum((skcan$Lat-mean(skcan$Lat))^2))
CI
[1] 144.5617 155.6061
# The estimate of mean response is 150.0839, and 95% CI is (144.5617, 155.6061) when latitude is 40.

# 2. by built-in R function
new = data.frame(Lat = 40)
CI1 = predict(fit.skcan, new, se.fit = TRUE, interval = "confidence", level = 0.95, type = "response") 
CI1$fit
       fit      lwr      upr
1 150.0839 144.5617 155.6061

Factors Affecting the Width of the Confidence Interval for the Mean Response

The width of the confidence interval is \[\text{Width}=2\times t(\frac{\alpha}{2}, n-2)SE(\hat y_i)\] where \(SE(\hat y_i) = \hat \sigma\sqrt{(\frac{1}{n}+\frac{(x_i -\bar{x})^2}{S_{xx}})}\)

Factors affecting the width

  1. Confidence level: as the confidence level decreases, the width of the interval decreases.

  2. MSE (\(\hat \sigma^2\)): as MSE decreases, the width of the interval decreases.

  3. The spread of the predictor \(x\) values: the more spread out the predictor \(x\) values, i.e., the larger \(S_{xx}\), the narrower the interval.

  4. Sample size \(n\): as the sample size \(n\) increases, the width of the interval decreases.

  5. The distance between \(x_i\) and the average of the sample’s predictor values, i.e., \(\bar x\): when \(x_i\) is closer to \(\bar x\), the width of the interval decreases.
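Factor 5 can be seen numerically. The sketch below (simulated data standing in for the skin cancer data; the coefficients are assumptions for illustration) computes the CI width at \(\bar x\) and at points progressively farther away:

```r
# CI width vs. distance of x_i from mean(x) (simulated data for illustration)
set.seed(2)
x <- runif(49, 25, 50)
y <- 389 - 6 * x + rnorm(49, 0, 19)
fit <- lm(y ~ x)
grid <- data.frame(x = c(mean(x), mean(x) + 5, mean(x) + 10))
ci <- predict(fit, grid, interval = "confidence", level = 0.95)
widths <- ci[, "upr"] - ci[, "lwr"]
widths  # increasing: the interval is narrowest at mean(x)
```

The widths increase with the distance from \(\bar x\), exactly as the \((x_i - \bar x)^2 / S_{xx}\) term in \(SE(\hat y_i)\) predicts.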

Prediction Intervals (PI) for New Observations

A prediction interval is an interval associated with a random variable yet to be observed, with a specified probability of the random variable lying within the interval.

Q: When is it okay to use the formula for the prediction interval for \(y_{i(new)}\)?

A: When \(x_i\) is a value within the scope of the model. Again, \(x_i\) does not have to be one of the actual \(x\) values in the data set.

Distribution of New Observations

For a new response variable at \(x_i\), \(y_{i(new)} = \beta_0 + \beta_1 x_i + \epsilon_i\) is an unobserved true value. Assuming the fitted model is correct, the prediction of a future value \(y_{i(new)}\), denoted as \(\tilde y_{i(new)}\), is \[\tilde y_{i(new)} = \hat y_{i(new)} + \epsilon_i= \hat\beta_0 + \hat\beta_1 x_i + \epsilon_i.\] We do not know the future \(\epsilon_i\), but we expect it to have mean zero (by assumption), so the point prediction is just \(\hat y_{i(new)}=\hat\beta_0 + \hat\beta_1 x_i\), with variation from two sources: \[\operatorname{Var}(\tilde y_{i(new)}) = \operatorname{Var}(\hat\beta_0 + \hat\beta_1 x_i) + \operatorname{Var}(\epsilon_i),\] reasonably assuming that the future \(\epsilon_i\) is independent of \(\boldsymbol{\hat\beta}=(\hat\beta_0, \hat\beta_1)^\top\). Therefore, \[\operatorname{Var}(\tilde y_{i(new)}) = \sigma^2\left(\frac{1}{n}+\frac{(x_i -\bar{x})^2}{S_{xx}}\right) + \sigma^2,\] and the distribution of a new response is

\[\tilde y_{i(new)}\sim \mathcal{N}\left(\beta_0+\beta_1 x_i, \sigma^2{(\frac{1}{n}+\frac{(x_i -\bar{x})^2}{S_{xx}}+1)}\right)\]
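This distribution can also be checked by simulation: generate a fresh future observation along with each sample, and count how often the 95% PI covers it. (Simulated data; the true model is an assumption for illustration.)

```r
# Coverage check for a 95% PI of a new observation at x0 = 40
# (simulated data; the "true" parameters are assumptions for illustration)
set.seed(6)
beta0 <- 389; beta1 <- -6; sigma <- 19
x <- runif(49, 25, 50)
x0 <- 40
covered <- replicate(1000, {
  y <- beta0 + beta1 * x + rnorm(length(x), 0, sigma)
  y_new <- beta0 + beta1 * x0 + rnorm(1, 0, sigma)  # future observation
  fit <- lm(y ~ x)
  pint <- predict(fit, data.frame(x = x0), interval = "prediction")[1, ]
  pint["lwr"] <= y_new && y_new <= pint["upr"]
})
mean(covered)  # close to 0.95
```

Unlike the CI coverage check, here the covered target \(y_{i(new)}\) is itself random; the extra \(\sigma^2\) term is exactly what keeps the coverage at the nominal level.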

Prediction Interval for a New Observation

The general formula for the prediction interval (PI) in words is \[\text{sample estimate} \pm \text{t-multiplier} \times \text{standard error}\] and the formula in notation is \[\hat y_{i(new)} \pm t(\frac{\alpha}{2}, n-2)SE(\tilde y_{i(new)})\] where

  • \(\hat y_{i(new)}\) is the fitted value of the response when the predictor is \(x_i\),

  • the standard error of prediction \[SE(\tilde y_{i(new)})=\hat\sigma \sqrt{(\frac{1}{n}+\frac{(x_i -\bar{x})^2}{S_{xx}}+1)}\] where RSE (\(\hat \sigma\)) is used as an estimate for \(\sigma\).

Comments:

  • Given the same confidence level \(1-\alpha\), the prediction interval (PI) is always wider than the confidence interval (CI) due to the extra variance term \(\hat \sigma^2\) (the MSE).

  • The factors affecting the width of this prediction interval are the same as those affecting the width of the confidence interval for the mean response.

Implement in R

# a 95% PI for a new observation with latitude = 40
# 1. use formula
n = nrow(skcan) # sample size
PI <-  y_0 + 
  c(-1, 1) * qt(1-.05/2, n - 2) * summary(fit.skcan)$sigma * sqrt(1+1/n + (x_0 - mean(skcan$Lat))^2/sum((skcan$Lat-mean(skcan$Lat))^2))
PI
[1] 111.2350 188.9329
# The predicted value is 150.0839, and 95% PI is (111.235, 188.9329) when latitude is 40.

# 2. by built-in R function
ans2 = predict(fit.skcan, new, se.fit = TRUE, interval = "prediction", level = 0.95, type = "response")
ans2$fit
       fit     lwr      upr
1 150.0839 111.235 188.9329

Interpretation: we are 95% confident that the skin cancer mortality rate at an individual location at 40 degrees north will be between 111.24 and 188.93 deaths per 10 million people.
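The claim that the PI is always wider than the CI can be verified numerically from a single fit. A minimal sketch with simulated data (standing in for the skin cancer data, so the block is self-contained):

```r
# PI vs. CI width at the same x0 and the same confidence level
# (simulated data; the coefficients are assumptions for illustration)
set.seed(3)
x <- runif(49, 25, 50)
y <- 389 - 6 * x + rnorm(49, 0, 19)
fit <- lm(y ~ x)
new_x <- data.frame(x = 40)
conf_int <- predict(fit, new_x, interval = "confidence", level = 0.95)
pred_int <- predict(fit, new_x, interval = "prediction", level = 0.95)
c(CI_width = conf_int[1, "upr"] - conf_int[1, "lwr"],
  PI_width = pred_int[1, "upr"] - pred_int[1, "lwr"])  # PI width is larger
```

Both intervals are centered at the same fitted value; only the standard error differs, by the extra \(\hat\sigma^2\) inside the square root.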

Hypothesis Tests

Recall General Steps to Conduct a Hypothesis Test

  • Set up two competing hypotheses: Null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_1\) or \(H_a\)).

  • Set some significance level called \(\alpha\): the most common \(\alpha\) is 0.05 or 5%.

  • Calculate a test statistic (\(t\star\)): a function of the data whose distribution depends only on the parameter(s) being tested.

  • Calculate probability value (p-value). The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct. (In our context, a p-value is the probability of seeing data at least as extreme as \(t\star\) under the assumption of \(H_0\).) A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.

  • The rejection region is found by using \(\alpha\) to find a critical value. The rejection region is the area that is more extreme than the critical value \(c_\alpha\).

  • Make a test decision about the null hypothesis. Reject \(H_0\) if the p-value \(< \alpha\). (Or, for a two-sided test, equivalently reject \(H_0\) if \(|t\star| > c_\alpha\).)

Illustration of the rejection region and critical value.

t-test for coefficients

Let’s revisit the example concerning the relationship between skin cancer mortality rates (\(y\)) and latitude at center of state (\(x\)).

skcan = read.table("skincancer.txt", header = TRUE)
plot(x = skcan$Lat, y = skcan$Mort, xlab = "Latitude (at center of state)", ylab = "Mortality (Deaths per 10 million)", main = "Skin cancer mortality vs. Latitude")
abline(lm(Mort~ Lat, data = skcan), col = "blue")

Scientific Question: Is there a relationship between the population of all skin cancer mortality rates and latitudes? Or, is \(\beta_1 \neq 0\)?

Statistical Question: Should we reject \(H_0: \beta_1 = 0\)?

Hypothesis tests for \(\beta_1\)

  1. Hypotheses \[H_0: \beta_1= {\beta_1}_0 \quad \text{and}\quad H_1: \beta_1 \neq{\beta_1}_0,\] where \({\beta_1}_0\) is a fixed value.

  2. Test statistic \[t\star = \frac{\hat\beta_1-{\beta_1}_0}{SE(\hat\beta_1)}\]

  3. Null distribution \[t\star\sim t_{n-2}\]

  4. p-value \[p= 2P(t_{n-2}>|t\star|)\]

  5. Critical value \[c_\alpha = t_{n-2, 1-\alpha/2}\]

For a two-sided test, we reject \(H_0\) if \(|t\star|> c_\alpha\), or equivalently, we reject \(H_0\) if \(p<\alpha\).

Note that when \(|t\star|\leq c_\alpha\) or \(p\geq\alpha\) we do not say that we accept \(H_1\). Instead, we put it as we fail to reject \(H_0\).

Implementation in R

# 1. use formulae
b1 = summary(fit.skcan)$coefficients[2] # beta1_hat
se_b1 = summary(fit.skcan)$coefficients[2,2]# SE(b1)
# equivalently
# se_b1 = coef(summary(fit.skcan))["Lat", "Std. Error"] 
## (1) comparing quantile
(t_star = b1/se_b1) # test statistic
[1] -9.989836
(ca_t = qt(0.975, df = nrow(skcan)-2)) # critical value, i.e., t-percentile with alpha = 0.05
[1] 2.011741
## (2) comparing probability
(p_b1 = 2*pt(abs(t_star), nrow(skcan)-2, lower.tail = F)) # p-value, compared to alpha = 0.05
[1] 3.309456e-13
# 2. use summary output
summary(fit.skcan)

Call:
lm(formula = Mort ~ Lat, data = skcan)

Residuals:
    Min      1Q  Median      3Q     Max 
-38.972 -13.185   0.972  12.006  43.938 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 389.1894    23.8123   16.34  < 2e-16 ***
Lat          -5.9776     0.5984   -9.99 3.31e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.12 on 47 degrees of freedom
Multiple R-squared:  0.6798,    Adjusted R-squared:  0.673 
F-statistic:  99.8 on 1 and 47 DF,  p-value: 3.309e-13

F-test for significance of SLR

For SLR, the F-test concerns the following statistical hypothesis test: \[H_0: \beta_1= 0 \quad \text{and}\quad H_1: \beta_1 \neq 0\] In the case of SLR, this test tells us whether the predictor adds any explanatory value to the model at all; that is, whether including the predictor is worthwhile, or whether the intercept-only model (with \(\beta_0\) alone) would suffice.

The expected values of MSR and MSE suggest how to test the null hypothesis against the alternative hypothesis.

Review of ANOVA table and Decomposition of SST

Source   df     SS     MS
Model    2-1    SSR    MSR = SSR/(2-1)
Error    n-2    SSE    MSE = SSE/(n-2)
Total    n-1    SST

It can be shown that \[E(MSR)=\sigma^{2}+\beta_{1}^{2} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2},\] and note that we already know \[E(MSE)=\sigma^{2}.\] (Your Turn: derive \(E(MSE)\) and \(E(MSR)\).)
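As a companion to the "Your Turn" derivation, these two expectations can be checked by Monte Carlo simulation (simulated data; the "true" parameter values are assumptions for illustration):

```r
# Monte Carlo check of E(MSE) = sigma^2 and E(MSR) = sigma^2 + beta1^2 * Sxx
# (simulated data; the "true" parameters are assumptions for illustration)
set.seed(4)
beta1 <- -6; sigma <- 19
x <- runif(49, 25, 50)          # fixed design
Sxx <- sum((x - mean(x))^2)
ms <- replicate(2000, {
  y <- 389 + beta1 * x + rnorm(49, 0, sigma)
  a <- anova(lm(y ~ x))
  c(MSR = a$`Mean Sq`[1], MSE = a$`Mean Sq`[2])
})
rowMeans(ms)                          # empirical averages of MSR and MSE
c(sigma^2 + beta1^2 * Sxx, sigma^2)   # theoretical values for comparison
```

With 2000 replications the empirical averages should land within a few percent of the theoretical values.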

  • If \(\beta_1 = 0\), then we would expect the ratio \(\frac{MSR}{MSE}\) to be close to \(1\).

  • If \(\beta_1 \neq 0\), then we would expect the ratio \(\frac{MSR}{MSE}\) to be greater than \(1\).

These two facts suggest that we should use the ratio \(\frac{MSR}{MSE}\) to determine whether or not \(\beta_1 = 0\). This gives us the clue to construct a test statistic and conduct the F-test for the slope \(\beta_1\).

  1. Hypotheses \[H_0: \beta_1= 0 \quad \text{and}\quad H_1: \beta_1 \neq 0\]

  2. Test statistic \[F\star =\frac{SSR/(2-1)}{SSE/(n-2)}=\frac{MSR}{MSE}\]

  3. Null distribution \[F\star \sim F(1, n-2)\]

  4. p-value \[p= P(F(1, n-2)>F\star)\]

  5. Critical value \[c_\alpha = F(1-\alpha, 1, n-2)\]

  6. Draw a conclusion: We reject \(H_0\) if p-value \(<\alpha\) (or equivalently, if \(F\star>c_\alpha\)).

Comments:

  • For the simple linear regression (SLR), F-test and t-test are equivalent, \(F \star = (t \star)^2\).

Illustration of F and t distributions
  • Under \(H_0\), the F-statistic follows an F distribution with parameters \((d_1-d_2,\, n-d_1)\), where \(d_1\) and \(d_2\) are the numbers of regression function parameters under \(H_1\) and \(H_0\), respectively. The first parameter, \(d_1-d_2\), is the difference in the number of parameters between \(H_1\) and \(H_0\) (\(2-1=1\) in the SLR case), and \(n-d_1\) is the number of degrees of freedom under \(H_1\) (\(n-2\) in the SLR case). We are going to learn the general formula of the F-statistic in multiple linear regression (MLR).

  • F-test in the case of SLR tells us whether the predictor adds any explanatory value to the model. For SLR, a rejection of \(H_0: \beta_1 = 0\) implies the significance of the whole SLR model, and the significance of the slope parameter \(\beta_1\).
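As a preview of the general F-test, the same hypothesis can be phrased as a comparison of two nested models, which anova() accepts directly. A minimal sketch with simulated data (the coefficients are assumptions for illustration):

```r
# General F-test as a comparison of nested models via anova(reduced, full)
# (simulated data standing in for the skin cancer data)
set.seed(5)
x <- runif(49, 25, 50)
y <- 389 - 6 * x + rnorm(49, 0, 19)
reduced <- lm(y ~ 1)  # model under H0: intercept only (d2 = 1 parameter)
full    <- lm(y ~ x)  # model under H1: SLR (d1 = 2 parameters)
anova(reduced, full)  # F on (d1 - d2, n - d1) = (1, 47) degrees of freedom
```

In the SLR case this nested-model F-statistic is identical to the one reported by summary(full); the same anova(reduced, full) form carries over unchanged to MLR.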

Implementation in R

# 1. F-test by formula
## (1) comparing quantile
# names(anova(fit.skcan))
(fstar = anova(fit.skcan)$`F value`[1]) # F-statistic
[1] 99.79683
(ca_f = qf(0.95, df1 = anova(fit.skcan)$Df[1], df2 = anova(fit.skcan)$Df[2])) # critical value, i.e., f-percentile with alpha = 0.05. d.f. = (2-1, n-2) = (1, 47)
[1] 4.0471
## (2) comparing probability
(p_f = pf(fstar, df1 = anova(fit.skcan)$Df[1],df2 = anova(fit.skcan)$Df[2], lower.tail = F)) # p-value, compared to significance level alpha = 0.05
[1] 3.309456e-13
# 2. F-test using anova output 
anova(fit.skcan)
Analysis of Variance Table

Response: Mort
          Df Sum Sq Mean Sq F value    Pr(>F)    
Lat        1  36464   36464  99.797 3.309e-13 ***
Residuals 47  17173     365                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 3. Equivalence of F-test and t-test for SLR
summary(fit.skcan)

Call:
lm(formula = Mort ~ Lat, data = skcan)

Residuals:
    Min      1Q  Median      3Q     Max 
-38.972 -13.185   0.972  12.006  43.938 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 389.1894    23.8123   16.34  < 2e-16 ***
Lat          -5.9776     0.5984   -9.99 3.31e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.12 on 47 degrees of freedom
Multiple R-squared:  0.6798,    Adjusted R-squared:  0.673 
F-statistic:  99.8 on 1 and 47 DF,  p-value: 3.309e-13
# names(summary(fit.skcan))
(summary(fit.skcan)$coefficients[2,3])^2 # (t*)^2
[1] 99.79683
fstar # F-statistic
[1] 99.79683
## p-values for the two tests are identical except for minor numerical difference.