Inference for one linear combination of parameters
Distributional theorem for inference on a single linear combination of parameters
Inference for each \(\beta_j\)
Confidence region
General F-test for Nested Model Comparison
Sequential (or Extra) Sum of Squares
Inference for MLR
Distribution Theory under Gaussian Assumption
The LS estimation does not require the Gaussian assumption on the random errors, but inference does.
The assumption \(\epsilon_i \stackrel{iid}{\sim} \mathcal{N}(0, \sigma^2)\) can be rewritten in matrix notation as \(\boldsymbol\epsilon \sim \mathcal{N}(\mathbf 0, \sigma^2 \mathbf I_n)\), and hence \(\mathbf y \sim \mathcal{N}(\mathbf X \boldsymbol\beta, \sigma^2 \mathbf I_n)\). A number of distributional results then follow.
Theorem
If \(\mathbf y \sim \mathcal{N}(\mathbf X \boldsymbol\beta, \sigma^2 \mathbf I_n)\), where \(\mathbf X\) is \(n \times p\) of rank \(p\), then:
1. \(\hat{\boldsymbol\beta} \sim \mathcal{N}_p(\boldsymbol\beta, \sigma^2(\mathbf X^T \mathbf X)^{-1})\)
2. \((\hat{\boldsymbol\beta}-\boldsymbol\beta)^T\mathbf X^T \mathbf X(\hat{\boldsymbol\beta}-\boldsymbol\beta)/\sigma^2 \sim \chi^2_p\)
3. \(\hat{\boldsymbol\beta}\) is independent of \(\hat\sigma^2\)
4. \(RSS/\sigma^2 \sim \chi^2_{n-p}\)
Proof.
Since \(\hat{\boldsymbol{\beta}} = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \mathbf y = \mathbf C \mathbf y\), where \(\mathbf C = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T\) is a \(p \times n\) matrix with \(\operatorname{rank}(\mathbf C) = \operatorname{rank}(\mathbf X^T) = \operatorname{rank}(\mathbf X) = p\) (by properties of matrix rank), \(\hat{\boldsymbol{\beta}}\) has a multivariate normal distribution by the following theorem:
\(\quad\)Let \(\mathbf Y \sim \mathcal{N}_n(\boldsymbol\mu, \boldsymbol\Sigma)\), let \(\mathbf C\) be an \(m \times n\) matrix of rank \(m\), and let \(\mathbf d\) be an \(m \times 1\) vector. Then \(\mathbf C\mathbf Y + \mathbf d \sim \mathcal{N}_m(\mathbf C\boldsymbol\mu+\mathbf d, \mathbf C\boldsymbol\Sigma\mathbf C^T)\).
In particular from \(\operatorname{E}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}\) and \(\operatorname{Var}(\hat{\boldsymbol{\beta}})=\sigma^2 (\mathbf X^T \mathbf X)^{-1}\) (proved in previous lecture), we have \(\hat{\boldsymbol\beta} \sim \mathcal{N}_p(\boldsymbol\beta, \sigma^2(\mathbf X^T \mathbf X)^{-1})\).
\((\hat{\boldsymbol\beta}-\boldsymbol\beta)^T\mathbf X^T \mathbf X(\hat{\boldsymbol\beta}-\boldsymbol\beta)/\sigma^2=(\hat{\boldsymbol\beta}-\boldsymbol\beta)^T \left(\operatorname{Var}(\hat{\boldsymbol{\beta}})\right)^{-1}(\hat{\boldsymbol\beta}-\boldsymbol\beta)\), which is distributed as \(\chi^2_p\) by [1.] and the following theorem on the distribution of quadratic forms:
\(\quad\)Suppose that \(\mathbf Y \sim \mathcal{N}_n(\boldsymbol\mu, \boldsymbol\Sigma)\), where \(\boldsymbol\Sigma\) is positive-definite. Then, \(Q= (\mathbf Y-\boldsymbol\mu)^T\boldsymbol\Sigma^{-1}(\mathbf Y-\boldsymbol\mu)\sim \chi^2_n\).
\(\mathbf H\) is symmetric and idempotent (\(\mathbf H = \mathbf H^2\)), and so is \(\mathbf I-\mathbf H\). Since \(\mathbf H = \mathbf X(\mathbf X^T \mathbf X)^{-1}\mathbf X^T\), we obtain \(\mathbf X^T(\mathbf I-\mathbf H) = \mathbf X^T - \mathbf X^T = \mathbf 0\). By the following theorem:
\(\quad\)Let \(\mathbf Y \sim \mathcal{N}_n(\boldsymbol\mu, \boldsymbol\Sigma)\) and define \(\mathbf U = \mathbf A\mathbf Y\), \(\mathbf V = \mathbf B \mathbf Y.\) Then \(\mathbf U\) and \(\mathbf V\) are independent if and only if \(\operatorname{Cov}(\mathbf U, \mathbf V) =\mathbf A \boldsymbol\Sigma\mathbf B^T= \mathbf 0\)
In our case, \(\mathbf U=\hat{\boldsymbol\beta} = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \mathbf y\) and \(\mathbf V= \mathbf e = (\mathbf I-\mathbf H)\mathbf y\), so \(\operatorname{Cov}(\hat{\boldsymbol\beta}, \mathbf e) = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \,\sigma^2\mathbf I_n\,(\mathbf I-\mathbf H)^T = \sigma^2(\mathbf X^T \mathbf X)^{-1}\mathbf X^T(\mathbf I-\mathbf H) = \mathbf 0\); hence
\(\hat{\boldsymbol\beta}\) is independent of \(\mathbf e\) and therefore of \(\hat\sigma^2\). (Note that \(\hat\sigma^2 = RSS/(n-p)\) and \(RSS =\mathbf e^T \mathbf e\).)
Since \(RSS = \mathbf e^T \mathbf e=\mathbf y^T (\mathbf I- \mathbf H)\mathbf y\) and \((\mathbf I-\mathbf H)\mathbf X = \mathbf 0\), we can write \(RSS = \boldsymbol\epsilon^T(\mathbf I-\mathbf H)\boldsymbol\epsilon\), where \(\boldsymbol\epsilon/\sigma \sim \mathcal{N}(\mathbf 0, \mathbf I_n)\) and \(\mathbf I-\mathbf H\) is symmetric and idempotent of rank \(n-p\).
Based on the following theorem on the distribution of quadratic forms,
\(\quad\)Let \(\mathbf Y \sim \mathcal{N}_n(\mathbf 0,\mathbf I_n )\) and let \(\mathbf A\) be a symmetric matrix. Then \(\mathbf Y^T \mathbf A \mathbf Y\) is \(\chi^2_r\) if and only if \(\mathbf A\) is idempotent of rank \(r\)
we conclude \(RSS/\sigma^2 \sim \chi^2_{n-p}\).
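To make this last result concrete, here is a small simulation sketch (all settings are illustrative and not part of the lecture) checking that \(RSS/\sigma^2\) behaves like a \(\chi^2_{n-p}\) random variable:

set.seed(1)
n <- 50; p <- 3; sigma <- 2
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # design matrix with an intercept column
beta <- c(1, 2, -1)
rss <- replicate(5000, {
  y <- drop(X %*% beta) + rnorm(n, sd = sigma)        # generate y under the Gaussian model
  sum(resid(lm(y ~ X - 1))^2)                         # RSS of the fitted model
})
mean(rss / sigma^2)   # should be close to n - p = 47
var(rss / sigma^2)    # should be close to 2 * (n - p) = 94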
Extra sum of squares principle
Consider the following general hypothesis, which may involve more than one linear combination of parameters:
\[H_0: \mathbf A \boldsymbol\beta = \mathbf c\]
where \(\mathbf A\) is a known \(q \times p\) matrix of rank \(q\), \(\mathbf c\) is a known \(q \times 1\) vector.
Since we want to test \(H_0: \mathbf A \boldsymbol\beta = \mathbf c\), a natural statistic for testing this is \(\mathbf A \hat{\boldsymbol\beta} - \mathbf c\); \(H_0\) will be rejected if \(\mathbf A \hat{\boldsymbol\beta}\) is sufficiently different from \(\mathbf c\).
However, not every element in \(\mathbf A \hat{\boldsymbol\beta}\) should be treated the same, as they have different precisions. Therefore, the distance measure incorporating the precision is
\[(\mathbf A \hat{\boldsymbol{\beta}}-\mathbf{c})^{T}\left[\operatorname{Var}(\mathbf A \hat{\boldsymbol{\beta}})\right]^{-1}(\mathbf A \hat{\boldsymbol{\beta}}-\mathbf{c}),\]
where \(\operatorname{Var}(\mathbf A \hat{\boldsymbol{\beta}})=\sigma^2\mathbf A\left(\mathbf X^{T}\mathbf X\right)^{-1} \mathbf A^{T}\), and \(\sigma^2\) is estimated by \(\hat\sigma^2\). Now we arrive at
\[(\mathbf A \hat{\boldsymbol{\beta}}-\mathbf{c})^{T}\left[\mathbf A\left(\mathbf X^{T}\mathbf X\right)^{-1} \mathbf A^{T}\right]^{-1}(\mathbf A \hat{\boldsymbol{\beta}}-\mathbf{c})/\hat\sigma^2.\]
The next step is to find a test statistic that is a constant times the quadratic measure above.
(The derivation and proof are beyond the scope of this course; we jump directly to the theorem.)
Theorem
Under \(H_0\) we fit the reduced model and denote \(RSS_{H_0}\) as the RSS for the reduced model. Then
\[
RSS_{H_{0}}-RSS=(\mathbf A \hat{\boldsymbol{\beta}}-\mathbf{c})^{T}\left[\mathbf A\left(\mathbf X^{T}\mathbf X\right)^{-1} \mathbf A^{T}\right]^{-1}(\mathbf A \hat{\boldsymbol{\beta}}-\mathbf{c})
\]
is the extra sum of squares due to the hypothesis \(H_0\) (which should be small when \(H_0\) is true).
The F-test statistic
\[F^\star = \frac{(RSS_{H_{0}}-RSS)/q}{RSS/(n-p)} \stackrel{H_0}{\sim} F_{q,\, n-p}\]
The extra sum of squares principle is a general approach: fit the full model and the reduced model under \(H_0\), compute the increase in \(RSS\) as the extra sum of squares, and scale it by the estimate of \(\sigma^2\) under the full model and the appropriate d.f.
Data set: gala in package faraway, on species diversity on the Galapagos Islands.
Data set description: There are 30 Galapagos islands and 7 variables in the dataset. The relationship between the number of plant species and several geographic variables is of interest. (The original dataset contained several missing values which have been filled for convenience.)
Species: the number of plant species found on the island
Endemics: the number of endemic species
Area: the area of the island (km\(^2\))
Elevation: the highest elevation of the island (m)
Nearest: the distance from the nearest island (km)
Scruz: the distance from Santa Cruz island (km)
Adjacent: the area of the adjacent island (km\(^2\))
data(gala, package="faraway") # Loads data set "gala" in package "faraway"
head(gala)
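As a sketch of how the extra-sum-of-squares F-test could be computed directly from the quadratic form above, take the hypothesis \(H_0:\beta_{Area}=\beta_{Adjacent}=0\) examined later (the object names lmod, A, cvec, etc. are illustrative and not from the original notes):

lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)  # full model
X    <- model.matrix(lmod)
bhat <- coef(lmod)
s2   <- summary(lmod)$sigma^2                    # hat sigma^2 = RSS / (n - p)
A <- matrix(0, nrow = 2, ncol = length(bhat))    # q = 2 constraints
colnames(A) <- names(bhat)
A[1, "Area"] <- 1; A[2, "Adjacent"] <- 1
cvec <- c(0, 0)
d <- A %*% bhat - cvec
Fstat <- drop(t(d) %*% solve(A %*% solve(crossprod(X)) %*% t(A), d)) / (2 * s2)
Fstat                                                             # should agree with the anova() comparison shown later
pf(Fstat, df1 = 2, df2 = df.residual(lmod), lower.tail = FALSE)   # p-value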
The advantage of the confidence interval relative to the corresponding hypothesis test is that we get information about plausible ranges for the parameters.
When we are interested in more than one parameter jointly, we set \(\mathbf A = \mathbf I_p\) and \(\mathbf c =\boldsymbol \beta\) in the theorem above, which gives the \(100(1-\alpha)\%\) confidence region
\[
\left\{\boldsymbol\beta : (\hat{\boldsymbol\beta}-\boldsymbol\beta)^{T}\mathbf X^{T}\mathbf X(\hat{\boldsymbol\beta}-\boldsymbol\beta) \le p\,\hat\sigma^{2}\, F^{(\alpha)}_{p,\, n-p}\right\},
\]
where \(F^{(\alpha)}_{p,\, n-p}\) denotes the upper-\(\alpha\) quantile of the \(F_{p,\, n-p}\) distribution.
These regions are ellipsoidally shaped. Because these ellipsoids lie in higher dimensions, they cannot easily be visualized except for the two-dimensional case. (When \(p=2\), we can draw confidence ellipses.)
Confidence region takes into account the correlation between the estimates of coefficients.
Let’s see how these compare to the univariate confidence intervals. For example, we can construct the joint \(95\%\) confidence region for \(\beta_{Area}\) and \(\beta_{Adjacent}\).
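A minimal sketch of how this confidence ellipse could be drawn, assuming the ellipse package is available (the object name lmod is illustrative):

library(ellipse)
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)
plot(ellipse(lmod, c("Area", "Adjacent")), type = "l")          # joint 95% confidence region
points(coef(lmod)["Area"], coef(lmod)["Adjacent"], pch = 19)    # LS estimate at the center
abline(v = confint(lmod)["Area", ], lty = 2)                    # univariate 95% CI for beta_Area
abline(h = confint(lmod)["Adjacent", ], lty = 2)                # univariate 95% CI for beta_Adjacent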
In the figure above, we have added the point of the least squares estimates, which lies at the center of the ellipse. We also added the univariate confidence intervals for both dimensions, shown as dashed lines.
What outcome can we determine from this plot?
The joint hypothesis \(H_0 :\beta_{Area}=\beta_{Adjacent} = 0\) is rejected because the origin does not lie inside the ellipse.
The hypothesis \(H_0:\beta_{Area} = 0\) is not rejected because zero does lie within the vertical dashed lines.
The hypothesis \(H_0:\beta_{Adjacent} = 0\) is rejected because the horizontal dashed lines do not encompass zero.
Comments:
If you want to test multiple parameters, you need to use a joint testing procedure and not try to combine several univariate tests. (Multiple comparison and simultaneous confidence intervals are the topics beyond the scope of this course.)
In higher dimensions, confidence ellipses are not easily visualized so our example here is more of educational than practical value.
Nevertheless, it should serve as a caution in interpreting a collection of univariate hypothesis tests or confidence intervals.
General F-test for Nested Model Comparison
Nested Model Comparison: Motivation
Given several predictors for a response, we might wonder whether all are needed.
Consider a larger model, \(\Omega\), and a smaller model, \(\omega\), which consists of a subset of the predictors that are in \(\Omega\). We call \(\omega\) the reduced model, nested in the full model \(\Omega\).
We will take \(\omega\) to represent the null hypothesis \(H_0\) and \(\Omega\) to represent the alternative \(H_1\).
If \(RSS_\omega - RSS_\Omega\) is small, the fit of the smaller model is almost as good as the larger model and so we would prefer the smaller model on the grounds of simplicity.
If the difference is large, then the superior fit of the larger model would be preferred.
This suggests that the test statistic would potentially be
\[\frac{RSS_\omega - RSS_\Omega}{RSS_\Omega}\]
In addition, we should account for the dimensions: suppose that the dimension (number of parameters) of \(\Omega\) is \(p\) and the dimension of \(\omega\) is \(q\); then we scale to get an F-statistic,
\[
F^\star = \frac{(RSS_\omega - RSS_\Omega)/(p-q)}{RSS_\Omega/(n-p)} \stackrel{H_0}{\sim} F_{p-q,\, n-p}.
\]
Testing the significance of the MLR is a special case of nested model comparison, where the full model is the fitted MLR, and the reduced model is the intercept-only model.
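The ANOVA table below could be produced by a call along the following lines (the object names lmods and lmod are assumptions, not part of the original output):

lmods <- lm(Species ~ Elevation + Nearest + Scruz, data = gala)                    # reduced model
lmod  <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)  # full model
anova(lmods, lmod)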
Analysis of Variance Table
Model 1: Species ~ Elevation + Nearest + Scruz
Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
Res.Df RSS Df Sum of Sq F Pr(>F)
1 26 158292
2 24 89231 2 69060 9.2874 0.00103 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparing these two models corresponds to testing the null hypothesis \(H_0:\beta_{Area}=\beta_{Adjacent} = 0\). We reject \(H_0\) and hence reject the reduced model. (This is the same conclusion as we obtained above using the confidence region.)
Sequential Sum of Squares and Partial Sum of Squares
Two common types of sum of squares:
sequential
partial
To introduce the concepts, we consider the simple example of the following linear model:
\[
y_i = \beta_1 + \beta_2 x_{i1} + \beta_3 x_{i2} + \beta_4 x_{i3} + \epsilon_i, \qquad i = 1, \dots, n.
\]
Adding \(x_1\), \(x_2\) and \(x_3\) to the model sequentially decomposes the total sum of squares as
\[
SS_{total} = SS_{x_{1}} + SS_{x_{2} \mid x_{1}} + SS_{x_{3} \mid x_{1}, x_{2}} + SS_{error}.
\]
Comparing with \(SS_{total} = SS_{model}+SS_{error}\), we see that \(SS_{model}\) has been split into \(SS_{x_{1}}\), \(SS_{x_{2} \mid x_{1}}\) and \(SS_{x_{3} \mid x_{1}, x_{2}}\).
Interpretations:
\(S S_{x_{1}}\) : \(SS\) explained by \(x_1\)
\(S S_{x_{2} \mid x_{1}}\) : extra \(SS\) due to the addition of \(x_2\) to the model that already includes \(x_1\)
\(S S_{x_{3} \mid x_{1}, x_{2}}\) : extra \(SS\) due to the addition of \(x_3\) to the model that already includes \(x_1\) and \(x_2\)
Type I SS Table
\[
\begin{array}{ccccc}
\text { Source } & S S & \text { d.f. } & M S & F \\
\hline x_{1} & S S_{x_{1}} & 1 & M S_{x_{1}} & M S_{x_{1}} / M S E \\
x_{2} \mid x_{1} & S S_{x_{2} \mid x_{1}} & 1 & M S_{x_{2} \mid x_{1}} & M S_{x_{2} \mid x_{1}} / M S E \\
x_{3} \mid x_{1}, x_{2} & S S_{x_{3} \mid x_{1}, x_{2}} & 1 & M S_{x_{3} \mid x_{1}, x_{2}} & M S_{x_{3} \mid x_{1}, x_{2}} / M S E \\
\text { error } & S S_{\text {error }} & \mathrm{n}-4 & M S E & \\
\hline \text { total } & S S_{\text {total }} & \mathrm{n}-1 & &
\end{array}
\]
Comments:
One can calculate \(SS_{x_{1}}\), \(SS_{x_{2} \mid x_{1}}\) and \(SS_{x_{3} \mid x_{1}, x_{2}}\) by fitting a sequence of nested models and taking differences in \(RSS\), but the sequential (type I) SS table computes them all at once.
In R, the anova() function gives the type I SS.
Type I SS depends on the order in which the variables are entered into the model (see the sketch below).
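As a quick illustration of this order dependence (using the airquality data from the example below; the choice of variables is purely illustrative):

air <- na.omit(airquality)
anova(lm(Ozone ~ Solar.R + Wind, data = air))   # the "Wind" row gives SS_{Wind | Solar.R}
anova(lm(Ozone ~ Wind + Solar.R, data = air))   # the "Wind" row now gives SS_{Wind}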
Partial (type III) sum of squares
Testing with type III SS is equivalent to the usual t-tests for the individual coefficients (for a single-d.f. term, the F-statistic equals the square of the t-statistic).
Let’s look at the example:
To test the hypothesis \(H_0 : \beta_2 = 0\), we can fit the full model and the reduced model without \(x_1\), and compute \(SS_{x_{1} \mid x_{2}, x_{3}}\), which is the extra sum of squares explained by \(x_1\) when \(x_2\) and \(x_3\) are already in the model.
Similarly, to test \(H_0 : \beta_3 = 0\) and \(H_0 : \beta_4 = 0\), we compute \(SS_{x_{2} \mid x_{1}, x_{3}}\) and \(SS_{x_{3} \mid x_{1}, x_{2}}\), respectively.
Type III SS Table
\[
\begin{array}{ccc}
\text { Source } & S S & H_{0} \\
\hline x_{1} & S S_{x_{1} \mid x_{2}, x_{3}} & \beta_{2}=0 \\
x_{2} & S S_{x_{2} \mid x_{1}, x_{3}} & \beta_{3}=0 \\
x_{3} & S S_{x_{3} \mid x_{1}, x_{2}} & \beta_{4}=0
\end{array}
\]
Comments:
One can calculate these SS’s by fitting multiple models. Partial (type III) sum of squares computes them simultaneously.
In R, the Anova() function in the car package gives the type III SS (with type = "III"); a sketch appears after the type I example below.
air <- na.omit(airquality)                             # remove rows with missing values
fit0 <- lm(Ozone ~ 1, data = air)                      # intercept-only model
fit1 <- lm(Ozone ~ Solar.R, data = air)
fit2 <- lm(Ozone ~ Solar.R + Wind, data = air)
fit3 <- lm(Ozone ~ Solar.R + Wind + Temp, data = air)
anova(fit0, fit1, fit2, fit3)                          # Type I SS
Analysis of Variance Table
Model 1: Ozone ~ 1
Model 2: Ozone ~ Solar.R
Model 3: Ozone ~ Solar.R + Wind
Model 4: Ozone ~ Solar.R + Wind + Temp
Res.Df RSS Df Sum of Sq F Pr(>F)
1 110 121802
2 109 107022 1 14780 32.944 8.946e-08 ***
3 108 67053 1 39969 89.094 9.509e-16 ***
4 107 48003 1 19050 42.463 2.424e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
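For comparison, a sketch of how the partial (type III) SS for the same full model fit3 could be obtained via the car package (assuming car is installed; this output is not part of the original notes):

library(car)                # provides the Anova() function
Anova(fit3, type = "III")   # each term's SS is adjusted for all the other terms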