In practice, we could ask two different scientific research questions concerning the response, for example:
What is the mean (expected) mortality from skin cancer for all locations at 40 degrees north latitude?
What mortality from skin cancer can we predict for an individual location at 40 degrees north latitude?
More generally, the two statistical questions can be asked as
What is the mean response \(E(Y_i)\) when the predictor value is \(x_i\)?
What value will a new observation \(Y_{i(new)}\) take when the predictor value is \(x_{i}\)?
Note that \(x_{i}\) is a value within the range of \(x\) values in the data set but \(x_{i}\) does not have to be one of the actual \(x\) values.
Besides the estimated (or predicted) value itself, we are also interested in quantifying its uncertainty.
“Best Guess” for (Mean) Response
To answer the two scientific research questions on mortality due to skin cancer, we calculate the fitted value (or predicted value) when the predictor is \(x_i\): \[\hat{y}_i = \hat \beta_0 + \hat \beta_1 x_i.\]
Mark the point \((x_i, \hat y_i)\) on the plot:
fit.skcan = lm(Mort ~ Lat, data = skcan)
(coef(fit.skcan)[1] + coef(fit.skcan)[2] * 40) # estimated mean response
(Intercept)
150.0839
plot(x = skcan$Lat, y = skcan$Mort,
     xlab = "Latitude (at center of state)",
     ylab = "Mortality (Deaths per 10 million)",
     main = "Skin cancer mortality vs. Latitude")
abline(lm(Mort ~ Lat, data = skcan), col = "blue")
segments(0, coef(fit.skcan)[1] + coef(fit.skcan)[2] * 40,
         40, coef(fit.skcan)[1] + coef(fit.skcan)[2] * 40,
         col = "grey", lwd = 3, lty = "dotted")
segments(40, 0, 40, coef(fit.skcan)[1] + coef(fit.skcan)[2] * 40,
         col = "grey", lwd = 3, lty = "dotted")
For both questions, our “best guess” of the mortality rate is about 150 deaths per 10 million people.
Comments:
As always, to be confident in the answer to our research questions, we should put an interval around our best guesses.
We need to construct a confidence interval (CI) for the mean response and a prediction interval (PI) for a new response.
Confidence Intervals (CI) for Mean Response
The concept of confidence interval:
A confidence interval is an interval associated with a parameter and is a frequentist concept. The parameter is assumed to be non-random but unknown, and the confidence interval is computed from data. Because the data are random, the interval is random.
A 95% confidence interval will contain the true parameter with probability 0.95. That is, over a large number of repeated samples, about 95% of the computed intervals would contain the true parameter.
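To see this frequentist interpretation concretely, here is a minimal simulation sketch. The “true” parameter values below are assumed for illustration only (loosely mimicking the skin cancer fit; they are not from the data): we repeatedly generate data, form the 95% CI for the mean response at latitude 40, and record how often it covers the true mean.

# Coverage simulation (sketch): the "true" parameter values are assumed
set.seed(1)
beta0 = 389; beta1 = -6; sigma = 19
x = seq(25, 48, length.out = 49)     # latitudes similar to the data
true_mean = beta0 + beta1 * 40       # true E(Y) at latitude 40
covered = replicate(2000, {
  y = beta0 + beta1 * x + rnorm(length(x), sd = sigma)
  ci = predict(lm(y ~ x), data.frame(x = 40), interval = "confidence")
  ci[, "lwr"] <= true_mean && true_mean <= ci[, "upr"]
})
mean(covered)  # should be close to 0.95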
Q: When is it okay to use the formula for the confidence interval for \(E(Y_i)\) ?
A: When \(x_i\) is a value within the range of the \(x\) values in the data set — that is, when \(x_i\) is a value within the “scope of the model.” But, note that \(x_i\) does not have to be one of the actual \(x\) values in the data set.
For the mean response \(E(Y_i)\) when the predictor value is \(x_i\), the general formula for the confidence interval in words is \[\text{sample estimate} \pm \text{t-multiplier} \times \text{standard error}\] and the formula in notation is \[\hat y_i \pm t(\frac{\alpha}{2}, n-2)SE(\hat y_i)\] where
\(\hat y_i\) is the fitted value (or predicted value) of the response when the predictor is \(x_i\),
\(t(\frac{\alpha}{2}, n-2)\) is the t-multiplier,
\(SE(\hat y_i) = \hat \sigma\sqrt{\frac{1}{n}+\frac{(x_i -\bar{x})^2}{S_{xx}}}\) is the standard error of \(\hat y_i\).
Implement in R
# 95% CI for the mean response with latitude = 40
x_0 = 40
(y_0 = unname(fit.skcan$coefficients[1] + fit.skcan$coefficients[2] * x_0))
[1] 150.0839
# 1. using the formula
n = nrow(skcan) # sample size
CI = y_0 + c(-1, 1) * qt(1 - .05/2, n - 2) * summary(fit.skcan)$sigma *
  sqrt(1/n + (x_0 - mean(skcan$Lat))^2 / sum((skcan$Lat - mean(skcan$Lat))^2))
CI
[1] 144.5617 155.6061
# The estimate of the mean response is 150.0839, and the 95% CI is
# (144.5617, 155.6061) when latitude is 40.
# 2. using the built-in R function
new = data.frame(Lat = 40)
CI1 = predict(fit.skcan, new, se.fit = TRUE, interval = "confidence",
              level = 0.95, type = "response")
CI1$fit
fit lwr upr
1 150.0839 144.5617 155.6061
Factors Affecting the Width of the Confidence Interval for the Mean Response
The width of the confidence interval is \[\text{Width}=2\times t(\frac{\alpha}{2}, n-2)SE(\hat y_i)\] where \(SE(\hat y_i) = \hat \sigma\sqrt{\frac{1}{n}+\frac{(x_i -\bar{x})^2}{S_{xx}}}\)
Factors affecting the width (a short R sketch follows this list):
Confidence level: as the confidence level decreases, the width of the interval decreases.
MSE (\(\hat \sigma^2\)): as MSE decreases, the width of the interval decreases.
The spread of the predictor \(x\) values: the more spread out the predictor \(x\) values, i.e., the larger \(S_{xx}\), the narrower the interval.
Sample size \(n\): as the sample size \(n\) increases, the width of the interval decreases.
The distance between \(x_i\) and the average of the sample’s predictor values, i.e., \(\bar x\): when \(x_i\) is closer to \(\bar x\), the width of the interval decreases.
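To illustrate the last factor with the skin cancer fit, here is a sketch: compute the 95% CI at several latitudes (the grid values are arbitrary choices) and compare widths, which are smallest near \(\bar x\).

# CI width grows as x_i moves away from the mean latitude (sketch)
lat_grid = data.frame(Lat = c(30, 35, 39, 45))
ci_grid = predict(fit.skcan, lat_grid, interval = "confidence", level = 0.95)
cbind(lat_grid, width = ci_grid[, "upr"] - ci_grid[, "lwr"])
mean(skcan$Lat)  # the widths are smallest near this value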
Prediction Intervals (PI) for New Observations
A prediction interval is an interval associated with a random variable yet to be observed, with a specified probability of the random variable lying within the interval.
Q: When is it okay to use the formula for the prediction interval for \(y_{i(new)}\)?
A: When \(x_i\) is a value within the scope of the model. Again, \(x_i\) does not have to be one of the actual \(x\) values in the data set.
Distribution of New Observations
For a new response at \(x_i\), \(y_{i(new)} = \beta_0 + \beta_1 x_i + \epsilon_i\) is an unobserved true value. Assuming the fitted model is correct, the prediction of a future value \(y_{i(new)}\), denoted \(\tilde y_{i(new)}\), is \[\tilde y_{i(new)} = \hat y_{i(new)} + \epsilon_i= \hat\beta_0 + \hat\beta_1 x_i + \epsilon_i.\] We do not know the future \(\epsilon_i\), but by assumption it has mean zero, so the point prediction is just \(\hat y_{i(new)}=\hat\beta_0 + \hat\beta_1 x_i\), with variation from two sources: \[\operatorname{Var}(\tilde y_{i(new)}) = \operatorname{Var}(\hat\beta_0 + \hat\beta_1 x_i) + \operatorname{Var}(\epsilon_i),\] where we reasonably assume that the future \(\epsilon_i\) is independent of \(\boldsymbol{\hat\beta}=(\hat\beta_0, \hat\beta_1)\). Therefore, \[\operatorname{Var}(\tilde y_{i(new)}) = \sigma^2\left(\frac{1}{n}+\frac{(x_i -\bar{x})^2}{S_{xx}}\right) + \sigma^2,\] and the distribution of a new response is \[\tilde y_{i(new)} \sim N\left(\beta_0 + \beta_1 x_i,\; \sigma^2\left(1+\frac{1}{n}+\frac{(x_i -\bar{x})^2}{S_{xx}}\right)\right).\]
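A quick numerical check of this decomposition (a sketch): predict() with se.fit = TRUE returns both \(SE(\hat y_i)\) (component se.fit) and \(\hat\sigma\) (component residual.scale), so the prediction standard error can be assembled directly.

# Check: Var(prediction) = Var(fitted value) + sigma^2 (sketch)
pr = predict(fit.skcan, data.frame(Lat = 40), se.fit = TRUE)
se_fit  = pr$se.fit                                # SE of the estimated mean
se_pred = sqrt(pr$se.fit^2 + pr$residual.scale^2)  # SE for a new observation
c(se_fit = se_fit, se_pred = se_pred)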
The general formula for the prediction interval (PI) in words is \[\text{sample estimate} \pm \text{t-multiplier} \times \text{standard error}\] and the formula in notation is \[\hat y_{i(new)} \pm t(\frac{\alpha}{2}, n-2)SE(\tilde y_{i(new)})\] where
\(\hat y_{i(new)}\) is the fitted value of the response when the predictor is \(x_i\),
the standard error of prediction is \[SE(\tilde y_{i(new)})=\hat\sigma \sqrt{1+\frac{1}{n}+\frac{(x_i -\bar{x})^2}{S_{xx}}},\] where the residual standard error RSE (\(\hat \sigma\)) is used as an estimate of \(\sigma\).
Comments:
Given the same confidence level \(1-\alpha\), the prediction interval (PI) is always wider than the confidence interval (CI) due to the extra variance term \(\hat \sigma^2\) (the MSE).
Factors affecting the width of this prediction Interval are the same as the factors affecting the width of the confidence interval for mean response.
Implement in R
# a 95% PI for a new observation with latitude = 40
# 1. using the formula
PI <- y_0 + c(-1, 1) * qt(1 - .05/2, n - 2) * summary(fit.skcan)$sigma *
  sqrt(1 + 1/n + (x_0 - mean(skcan$Lat))^2 / sum((skcan$Lat - mean(skcan$Lat))^2))
PI
[1] 111.2350 188.9329
# The predicted value is 150.0839, and the 95% PI is (111.235, 188.9329)
# when latitude is 40.
# 2. using the built-in R function
ans2 = predict(fit.skcan, new, se.fit = TRUE, interval = "prediction",
               level = 0.95, type = "response")
ans2$fit
fit lwr upr
1 150.0839 111.235 188.9329
Interpretation: we are 95% confident that the skin cancer mortality rate at an individual location at 40 degrees north will be between 111.24 and 188.93 deaths per 10 million people.
Hypothesis Tests
Recall General Steps to Conduct a Hypothesis Test
Set up two competing hypotheses: Null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_1\) or \(H_a\)).
Set a significance level \(\alpha\): the most common choice is 0.05 (5%).
Calculate a test statistic (\(t^\star\)): a function of the data whose distribution depends only on the parameter(s) being tested.
Calculate the probability value (p-value). The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct. (In our context, the p-value is the probability of seeing data at least as extreme as \(t^\star\) under the assumption of \(H_0\).) A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.
The rejection region is found by using \(\alpha\) to find a critical value. The rejection region is the area that is more extreme than the critical value \(c_\alpha\).
Make a test decision about the null hypothesis. Reject \(H_0\) if the p-value \(< \alpha\). (Or, for a two-sided test, equivalently reject \(H_0\) if \(|t^\star| > c_\alpha\).)
Illustration of the rejection region and critical value.
t-test for coefficients
Let’s revisit the example concerning the relationship between skin cancer mortality rates (\(y\)) and latitude at center of state (\(x\)).
skcan = read.table("skincancer.txt", header = TRUE)
plot(x = skcan$Lat, y = skcan$Mort,
     xlab = "Latitude (at center of state)",
     ylab = "Mortality (Deaths per 10 million)",
     main = "Skin cancer mortality vs. Latitude")
abline(lm(Mort ~ Lat, data = skcan), col = "blue")
Scientific Question: Is there a relationship between the population of all skin cancer mortality rates and latitudes? Or, is \(\beta_1 \neq 0\)?
Statistical Question: Should we reject \(H_0: \beta_1 = 0\)?
Hypothesis tests for \(\beta_1\)
Hypotheses \[H_0: \beta_1= {\beta_1}_0 \quad \text{and}\quad H_1: \beta_1 \neq{\beta_1}_0,\] where \({\beta_1}_0\) is a fixed value.
Test statistic \[t^\star = \frac{\hat\beta_1-{\beta_1}_0}{SE(\hat\beta_1)}\]
Null distribution \[t^\star\sim t_{n-2}\]
p-value \[p= 2P(t_{n-2}>|t^\star|)\]
Critical value \[c_\alpha = t_{n-2, 1-\alpha/2}\]
For a two-sided test, we reject \(H_0\) if \(|t^\star|> c_\alpha\), or equivalently, we reject \(H_0\) if \(p<\alpha\).
Note that when \(|t^\star|\leq c_\alpha\) or \(p\geq\alpha\), we do not say that we accept \(H_0\). Instead, we say that we fail to reject \(H_0\).
summary(fit.skcan)

Call:
lm(formula = Mort ~ Lat, data = skcan)

Residuals:
    Min      1Q  Median      3Q     Max 
-38.972 -13.185   0.972  12.006  43.938 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 389.1894    23.8123   16.34  < 2e-16 ***
Lat          -5.9776     0.5984   -9.99 3.31e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.12 on 47 degrees of freedom
Multiple R-squared:  0.6798,    Adjusted R-squared:  0.673 
F-statistic:  99.8 on 1 and 47 DF,  p-value: 3.309e-13
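We can reproduce the t value and p-value for Lat directly from the formulas above; here is a sketch that pulls \(\hat\beta_1\) and \(SE(\hat\beta_1)\) out of the coefficient table.

# Reproduce the t-test for the slope by hand (sketch)
est = coef(summary(fit.skcan))["Lat", ]
t_star = (est["Estimate"] - 0) / est["Std. Error"]   # H_0: beta_1 = 0
p_val = 2 * pt(abs(t_star), df = nrow(skcan) - 2, lower.tail = FALSE)
c(t_star = unname(t_star), p_value = unname(p_val))  # about -9.99 and 3.31e-13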
F-test for significance of SLR
For SLR, the F-test concerns the following statistical hypothesis test: \[H_0: \beta_1= 0 \quad \text{and}\quad H_1: \beta_1 \neq 0.\] In the case of SLR, this test tells us whether the predictor adds any explanatory value to the model at all, that is, whether including the predictor is worthwhile or whether the intercept-only model (with the parameter \(\beta_0\) alone) would suffice.
The expected values of MSR and MSE suggest how to test the null hypothesis against the alternative hypothesis.
Review of ANOVA table and Decomposition of SST
Source    df     SS     MS
Model     2-1    SSR    MSR = SSR/(2-1)
Error     n-2    SSE    MSE = SSE/(n-2)
Total     n-1    SST
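In R, the anova() function prints this table for a fitted model; a one-line sketch for the skin cancer fit:

# ANOVA table for the SLR fit: Model row (df = 1) and Error row (df = n - 2)
anova(fit.skcan)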
It can be shown that \[E(MSR)=\sigma^{2}+\beta_{1}^{2} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2},\] and recall that we already know \[E(MSE)=\sigma^{2}.\] (Your Turn: derive \(E(MSE)\) and \(E(MSR)\).)
If \(\beta_1 = 0\), then we would expect the ratio \(\frac{MSR}{MSE}\) to be close to \(1\).
If \(\beta_1 \neq 0\), then we would expect the ratio \(\frac{MSR}{MSE}\) to be greater than \(1\).
These two facts suggest using the ratio \(\frac{MSR}{MSE}\) to determine whether or not \(\beta_1 = 0\). This is the clue for constructing a test statistic and conducting the F-test for the slope \(\beta_1\).
Test statistic \[F^\star =\frac{SSR/(2-1)}{SSE/(n-2)}=\frac{MSR}{MSE}\]
Null distribution \[F^\star \sim F(1, n-2)\]
p-value \[p= P\big(F(1, n-2)> F^\star\big)\]
Critical value \[c_\alpha = F(1-\alpha, 1, n-2),\] the \((1-\alpha)\) quantile of the \(F(1, n-2)\) distribution
Draw a conclusion: We reject \(H_0\) if the p-value \(<\alpha\) (or equivalently, if \(F^\star>c_\alpha\)).
Comments:
For simple linear regression (SLR), the F-test and the t-test are equivalent: \(F^\star = (t^\star)^2\).
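A quick check of this equivalence on the skin cancer fit (a sketch):

# F-statistic from anova() vs. squared t-statistic from summary() (sketch)
F_star = anova(fit.skcan)["Lat", "F value"]
t_star = coef(summary(fit.skcan))["Lat", "t value"]
c(F_star = F_star, t_squared = t_star^2)  # both about 99.8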
Illustration of F and t distributions
Under \(H_0\), the F-statistic follows an F distribution with parameters \((d_1-d_0,\; n-d_1)\), where \(d_1\) and \(d_0\) are the numbers of regression parameters under \(H_1\) and \(H_0\), respectively: \(d_1-d_0\) is the difference in the number of parameters (\(2-1=1\) in the SLR case), and \(n-d_1\) is the number of degrees of freedom under \(H_1\) (\(n-2\) in the SLR case). We will learn the general formula of the F-statistic in multiple linear regression (MLR); a sketch of this comparison in R follows.
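This general comparison can be carried out in R by passing the reduced and full models to anova(); a sketch for the SLR case, where the reduced model is intercept-only:

# General F-test as a nested-model comparison (sketch)
fit.null = lm(Mort ~ 1, data = skcan)  # reduced model under H_0
anova(fit.null, fit.skcan)             # F on (2 - 1) and (n - 2) df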
The F-test in the case of SLR tells us whether the predictor adds any explanatory value to the model. For SLR, rejecting \(H_0: \beta_1 = 0\) implies the significance of the whole SLR model and, equivalently, of the slope parameter \(\beta_1\).