Lecture 6

Author

Jiaye Xu

Published

March 21, 2022

Lecture 6 Overview

Multiple Linear Regression (MLR)

  • Inference for MLR

    • Distribution Theory under Gaussian Assumption

    • Extra sum of squares principle

    • Test all covariate effects equal to zero

    • Inference for one linear combination of parameters

      • Distributional theorem for the inference for a single linear combination of parameters

      • Inference for each \(\beta_j\)

    • Confidence region

    • General F-test for Nested Model Comparison

    • Sequential (or Extra) Sum of Squares


Inference for MLR

Distribution Theory under Gaussian Assumption

LS estimation does not require a Gaussian assumption on the random errors, but inference does.

The assumption \(\epsilon_i \stackrel{iid}{\sim} \mathcal{N}(0, \sigma^2)\) can be rewritten in matrix notation as \(\boldsymbol\epsilon \sim \mathcal{N}(\mathbf 0, \sigma^2 \mathbf I_n)\), and hence \(\mathbf y \sim \mathcal{N}(\mathbf X \boldsymbol\beta, \sigma^2 \mathbf I_n)\). A number of distributional results then follow.

Theorem

If \(\mathbf y \sim \mathcal{N}(\mathbf X \boldsymbol\beta, \sigma^2 \mathbf I_n)\), where \(\mathbf X\) is \(n \times p\) of rank \(p\), then:

  1. \(\hat{\boldsymbol\beta} \sim \mathcal{N}_p(\boldsymbol\beta, \sigma^2(\mathbf X^T \mathbf X)^{-1})\)

  2. \((\hat{\boldsymbol\beta}-\boldsymbol\beta)^T\mathbf X^T \mathbf X(\hat{\boldsymbol\beta}-\boldsymbol\beta)/\sigma^2 \sim \chi^2_p\)

  3. \(\hat{\boldsymbol\beta}\) is independent of \(\hat\sigma^2\)

  4. \(RSS/\sigma^2 \sim \chi^2_{n-p}\)

Proof.

  1. Since \(\hat{\boldsymbol{\beta}} = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \mathbf y = \mathbf C \mathbf y\), where \(\mathbf C\) is a \(p \times n\) matrix with \(\operatorname{rank}(\mathbf C) = \operatorname{rank}(\mathbf X^T) = \operatorname{rank}(\mathbf X) = p\) (multiplying by the nonsingular matrix \((\mathbf X^T \mathbf X)^{-1}\) preserves rank), \(\hat{\boldsymbol{\beta}}\) has a multivariate normal distribution by the following theorem:

\(\quad\)let \(\mathbf Y \sim \mathcal{N}_n(\boldsymbol\mu, \boldsymbol\Sigma)\), \(\mathbf C\) be an \(m \times n\) matrix of rank \(m\), and \(\mathbf d\) be an \(m \times 1\) vector. Then \(\mathbf C\mathbf Y + \mathbf d \sim \mathcal{N}_m(\mathbf C\boldsymbol\mu+\mathbf d, \mathbf C\boldsymbol\Sigma\mathbf C^T)\).

In particular from \(\operatorname{E}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}\) and \(\operatorname{Var}(\hat{\boldsymbol{\beta}})=\sigma^2 (\mathbf X^T \mathbf X)^{-1}\) (proved in previous lecture), we have \(\hat{\boldsymbol\beta} \sim \mathcal{N}_p(\boldsymbol\beta, \sigma^2(\mathbf X^T \mathbf X)^{-1})\).

  2. \((\hat{\boldsymbol\beta}-\boldsymbol\beta)^T\mathbf X^T \mathbf X(\hat{\boldsymbol\beta}-\boldsymbol\beta)/\sigma^2=(\hat{\boldsymbol\beta}-\boldsymbol\beta)^T \left[\operatorname{Var}(\hat{\boldsymbol{\beta}})\right]^{-1}(\hat{\boldsymbol\beta}-\boldsymbol\beta)\), which is distributed as \(\chi^2_p\) by [1.] and the following theorem on the distribution of quadratic forms:

\(\quad\)Suppose that \(\mathbf Y \sim \mathcal{N}_n(\boldsymbol\mu, \boldsymbol\Sigma)\), where \(\boldsymbol\Sigma\) is positive-definite. Then, \(Q= (\mathbf Y-\boldsymbol\mu)^T\boldsymbol\Sigma^{-1}(\mathbf Y-\boldsymbol\mu)\sim \chi^2_n\).

  3. \[ \begin{aligned} \operatorname{Cov}(\hat{\boldsymbol\beta}, \mathbf e) &= \operatorname{Cov}\left((\mathbf X^T \mathbf X)^{-1}\mathbf X^T \mathbf y,\; (\mathbf I- \mathbf H)\mathbf y \right)\\ &= (\mathbf X^T \mathbf X)^{-1}\mathbf X^T\operatorname{Cov}\left( \mathbf y \right)(\mathbf I- \mathbf H)^T\\ &=\sigma^2(\mathbf X^T \mathbf X)^{-1}\mathbf X^T(\mathbf I- \mathbf H)^T \end{aligned} \]

\(\mathbf H\) is symmetric and idempotent (\(\mathbf H = \mathbf H^2\)), and so is \(\mathbf I-\mathbf H\). Since \(\mathbf H = \mathbf X(\mathbf X^T \mathbf X)^{-1}\mathbf X^T\), we have \(\mathbf X^T(\mathbf I- \mathbf H) = \mathbf X^T - \mathbf X^T\mathbf X(\mathbf X^T \mathbf X)^{-1}\mathbf X^T = \mathbf 0\), and we obtain

\[\operatorname{Cov}(\hat{\boldsymbol\beta}, \mathbf e) =\sigma^2(\mathbf X^T \mathbf X)^{-1}\mathbf X^T(\mathbf I- \mathbf H)= \mathbf 0\]

By the following statistical independence theorem:

\(\quad\)Let \(\mathbf Y \sim \mathcal{N}_n(\boldsymbol\mu, \boldsymbol\Sigma)\) and define \(\mathbf U = \mathbf A\mathbf Y\), \(\mathbf V = \mathbf B \mathbf Y.\) Then \(\mathbf U\) and \(\mathbf V\) are independent if and only if \(\operatorname{Cov}(\mathbf U, \mathbf V) =\mathbf A \boldsymbol\Sigma\mathbf B^T= \mathbf 0\)

In our case, \(\mathbf U=\hat{\boldsymbol\beta}\) and \(\mathbf V= \mathbf e\), so \(\hat{\boldsymbol\beta}\) is independent of \(\mathbf e\) and therefore of \(\hat\sigma^2\). (Note that \(\hat\sigma^2 = RSS/(n-p)\) and \(RSS =\mathbf e^T \mathbf e\).)

  4. Since \(RSS = \mathbf e^T \mathbf e=\mathbf y^T (\mathbf I- \mathbf H)\mathbf y\) and \(\mathbf I-\mathbf H\) is symmetric and idempotent of rank \(n-p\),

\[ \begin{aligned} RSS &=\mathbf y^T\left(\mathbf{I}_{n}-\mathbf{H}\right) \mathbf{y} \\ &=\mathbf y^T\left(\mathbf{I}_{n}-\mathbf{H}\right)^T\left(\mathbf{I}_{n}-\mathbf{H}\right)\left(\mathbf{I}_{n}-\mathbf{H}\right) \mathbf{y} \\ &=(\mathbf{y}-\mathbf{H}\mathbf{X} \boldsymbol\beta-\mathbf{H}\boldsymbol\epsilon)^{T}\left(\mathbf{I}_{n}-\mathbf{H}\right)(\mathbf{y}-\mathbf{H}\mathbf{X}\boldsymbol \beta-\mathbf{H}\boldsymbol\epsilon) \\ &=(\mathbf{y}-\mathbf{X} \boldsymbol\beta-\mathbf{H}\boldsymbol\epsilon)^{T}\left(\mathbf{I}_{n}-\mathbf{H}\right)(\mathbf{y}-\mathbf{X}\boldsymbol \beta-\mathbf{H}\boldsymbol\epsilon) \quad\text{by } \mathbf{H}\mathbf{X}=\mathbf{X}\\ &=\left[(\mathbf{I}_{n}-\mathbf{H})\boldsymbol\epsilon\right]^{T}\left(\mathbf{I}_{n}-\mathbf{H}\right)\left[(\mathbf{I}_{n}-\mathbf{H})\boldsymbol\epsilon\right] \quad\text{since } \mathbf{y}-\mathbf{X}\boldsymbol\beta=\boldsymbol\epsilon\\ &=\boldsymbol \epsilon^T\left(\mathbf{I}_{n}-\mathbf{H}\right) \boldsymbol \epsilon \quad\text{by idempotence of } \mathbf{I}_{n}-\mathbf{H} \end{aligned} \]

Based on the theorem of distribution of quadratic forms

\(\quad\)Let \(\mathbf Y \sim \mathcal{N}_n(\mathbf 0,\mathbf I_n )\) and let \(\mathbf A\) be a symmetric matrix. Then \(\mathbf Y^T \mathbf A \mathbf Y\) is \(\chi^2_r\) if and only if \(\mathbf A\) is idempotent of rank \(r\)

Applying this with \(\mathbf Y = \boldsymbol\epsilon/\sigma \sim \mathcal{N}_n(\mathbf 0, \mathbf I_n)\) and \(\mathbf A = \mathbf I_n - \mathbf H\), which is symmetric and idempotent of rank \(n-p\), we conclude \(RSS/\sigma^2 = (\boldsymbol\epsilon/\sigma)^T(\mathbf I_n - \mathbf H)(\boldsymbol\epsilon/\sigma) \sim \chi^2_{n-p}\).
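
These distributional results can be checked numerically. Below is a minimal simulation sketch (illustrative code, not part of the lecture's own examples): it generates data from a known MLR many times and compares the empirical behaviour of \(\hat{\boldsymbol\beta}\) and \(RSS/\sigma^2\) with the theory above.

set.seed(1)
n <- 50; p <- 3; sigma <- 2
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # fixed design with an intercept column
beta <- c(1, 2, -1)
nsim <- 5000
bhat <- matrix(NA, nsim, p); rss <- numeric(nsim)
for (s in 1:nsim) {
  y <- as.vector(X %*% beta) + rnorm(n, sd = sigma)
  fit <- lm(y ~ X - 1)                   # use the columns of X directly (no extra intercept)
  bhat[s, ] <- coef(fit)
  rss[s] <- deviance(fit)                # residual sum of squares
}
colMeans(bhat)                           # should be close to beta (unbiasedness)
diag(var(bhat)) / (sigma^2 * diag(solve(crossprod(X))))  # variance ratios, should be close to 1
c(mean(rss) / sigma^2, n - p)            # mean of RSS/sigma^2 should be close to n - p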

Extra sum of squares principle

Consider the following general hypothesis, which may involve more than one linear combination of the parameters:

\[H_0: \mathbf A \boldsymbol\beta = \mathbf c\]

where \(\mathbf A\) is a known \(q \times p\) matrix of rank \(q\) and \(\mathbf c\) is a known \(q \times 1\) vector.

Since we want to test \(H_0: \mathbf A \boldsymbol\beta = \mathbf c\), a natural statistic for testing this is \(\mathbf A \hat{\boldsymbol\beta} - \mathbf c\); \(H_0\) will be rejected if \(\mathbf A \hat{\boldsymbol\beta}\) is sufficiently different from \(\mathbf c\).

However, the elements of \(\mathbf A \hat{\boldsymbol\beta}\) should not all be treated the same, as they have different precisions. A distance measure that incorporates the precision is

\[(\mathbf A \hat{\boldsymbol{\beta}}-\mathbf{c})^{T}\left[\operatorname{Var}(\mathbf A \hat{\boldsymbol{\beta}})\right]^{-1}(\mathbf A \hat{\boldsymbol{\beta}}-\mathbf{c}),\]

where \(\operatorname{Var}(\mathbf A \hat{\boldsymbol{\beta}})=\sigma^2\mathbf A\left(\mathbf X^{T}\mathbf X\right)^{-1} \mathbf A^{T}\), and \(\sigma^2\) is estimated by \(\hat\sigma^2\). Now we arrive at

\[(\mathbf A \hat{\boldsymbol{\beta}}-\mathbf{c})^{T}\left[\mathbf A\left(\mathbf X^{T}\mathbf X\right)^{-1} \mathbf A^{T}\right]^{-1}(\mathbf A \hat{\boldsymbol{\beta}}-\mathbf{c})/\hat\sigma^2.\]

The next step is to find a test statistic that is a constant multiple of the quadratic measure above.

(The derivation and proof are skipped given the limited scope of this course; we jump directly to the theorem.)

Theorem

  • Under \(H_0\) we fit the reduced model and denote \(RSS_{H_0}\) as the RSS for the reduced model. Then

\[ RSS_{H_{0}}-RSS=(\mathbf A \hat{\boldsymbol{\beta}}-\mathbf{c})^{T}\left[\mathbf A\left(\mathbf X^{T}\mathbf X\right)^{-1} \mathbf A^{T}\right]^{-1}(\mathbf A \hat{\boldsymbol{\beta}}-\mathbf{c}) \]

is the extra sum of squares due to the hypothesis \(H_0\) (which should be small when \(H_0\) is true).

  • The F-test statistic

\[F^\star = \frac{(RSS_{H_{0}}-RSS)/q}{RSS/(n-p)} \stackrel{H_0}{\sim}F_{q, n-p}\]

  • The extra sum of squares principle is a general approach: fit the full model and the reduced model under \(H_0\), compute the increase in \(RSS\) as the extra sum of squares, and scale it by the estimate of \(\sigma^2\) under the full model with the appropriate degrees of freedom (see the sketch below).
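
As a sketch of the principle (illustrative code on simulated data, not from the lecture itself): the increase in \(RSS\) from fitting the reduced model equals the quadratic form in the theorem, and both lead to the same F-statistic as R's anova() comparison.

set.seed(2)
n <- 40
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)                  # data generated with x2, x3 having no effect
full <- lm(y ~ x1 + x2 + x3)                # full model, p = 4 parameters
reduced <- lm(y ~ x1)                       # reduced model under H0: beta_{x2} = beta_{x3} = 0
X <- model.matrix(full)
A <- rbind(c(0, 0, 1, 0),                   # rows of A pick out the x2 and x3 coefficients
           c(0, 0, 0, 1))
d <- A %*% coef(full)                       # A beta-hat - c, with c = 0
extraSS <- t(d) %*% solve(A %*% solve(crossprod(X)) %*% t(A)) %*% d
c(extraSS, deviance(reduced) - deviance(full))   # the two quantities should agree
Fstat <- (extraSS / 2) / (deviance(full) / df.residual(full))
c(Fstat, anova(reduced, full)$F[2])              # should match the F from anova()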

Test all covariate effects equal to zero

When we let

\[ \mathbf A=\left(\begin{array}{ccccc} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{array}\right), \]

a \((p-1) \times p\) matrix, the above hypothesis can be expressed as

\[H_0: \mathbf A \boldsymbol\beta = \mathbf 0\]

which is equivalent to the hypothesis concerned in the F-test for the significance of MLR, i.e.,

\[H_0: \beta_1=\ldots=\beta_{p-1}= 0\]

(supposing \(\mathbf x_1 = \mathbf 1\) for the intercept).

We can test other hypotheses with appropriate choices of \(\mathbf A\) (beyond the scope of this course).

Inference for one linear combination of parameters

We have introduced the theorem about F-test statistics. Now we discuss how t-statistics are defined.

  • Let \(\mathbf A^T =(\mathbf a_1, \ldots, \mathbf a_q)\) and \(\mathbf c^T =(c_1, \ldots, c_q)\)

  • The hypothesis \(H_0: \mathbf A \boldsymbol\beta = \mathbf c\) can be re-expressed as

\[\mathbf a_i^T \boldsymbol\beta = c_i, \quad i = 1, \ldots, q\]

  • If the F-test is significant, the next question is which of the hypotheses

\[\mathbf a_i^T \boldsymbol\beta = c_i, \quad i = 1, \ldots, q\]

may be violated.

  • We need inference methods for a single linear combination of parameters.

Distributional theorem for the inference for a single linear combination of parameters

We first drop the subscript \(i\). Under the normality assumption,

  • \(\mathbf a^T\hat{\boldsymbol\beta} \sim \mathcal{N}(\mathbf a^T\boldsymbol\beta, \sigma^2\mathbf a^T(\mathbf X^T \mathbf X)^{-1}\mathbf a)\)

  • \(U=\frac{\mathbf a^T\hat{\boldsymbol\beta}-\mathbf a^T{\boldsymbol\beta}}{\sigma \sqrt{\mathbf a^T(\mathbf X^T \mathbf X)^{-1}\mathbf a}} \sim \mathcal{N}(0, 1)\)

  • \(V= \frac{(n-p)\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-p}\)

  • \(U\) and \(V\) are independent

  • \(T = \frac{U}{\sqrt{V/(n-p)}}=\frac{\mathbf a^T\hat{\boldsymbol\beta}-\mathbf a^T{\boldsymbol\beta}}{\hat\sigma \sqrt{\mathbf a^T(\mathbf X^T \mathbf X)^{-1}\mathbf a}} \sim t_{n-p}\)

  • Reject \(H_0: \mathbf a^T \boldsymbol\beta = c\) at significance level \(\alpha\) if \(|t^\star| \geq t_{n-p}(1-\alpha/2)\),

where \[t^\star= \frac{\mathbf a^T\hat{\boldsymbol\beta}- c}{\hat\sigma \sqrt{\mathbf a^T(\mathbf X^T \mathbf X)^{-1}\mathbf a}}\] is the observed statistic obtained by replacing \(\mathbf a^T\boldsymbol\beta\) with its hypothesized value \(c\); under \(H_0\) it follows \(t_{n-p}\).

  • The \(100\times(1 -\alpha)\%\) confidence interval for \(\mathbf a^T\boldsymbol\beta\) is

\[\mathbf a^T\hat{\boldsymbol\beta} \pm t_{n-p}(1-\alpha/2) \hat\sigma \sqrt{\mathbf a^T(\mathbf X^T \mathbf X)^{-1}\mathbf a}\]

Inference for each \(\beta_j\)

In particular, letting \(\mathbf a\) be the vector whose \(j\)th element equals 1 and all other elements equal zero, to test the hypothesis

\[ H_0: \beta_j = c\]

we calculate the t-statistic

\[ t^\star=\frac{\hat{\beta}_{j}-c}{\hat{\sigma} \sqrt{\left[\left(\mathbf X^{T} \mathbf X\right)^{-1}\right]_{jj}}} \]

At significance level \(\alpha\), reject \(H_0: \beta_j = c\) if \(|t^\star| \geq t_{n-p}(1-\alpha/2)\).

  • The \(100\times(1 -\alpha)\%\) confidence interval for \(\beta_j\) is

\[\hat{\beta}_{j} \pm t_{n-p}(1-\alpha/2)\hat{\sigma} \sqrt{\left[\left(\mathbf X^{T} \mathbf X\right)^{-1}\right]_{jj}}\]

Implementation of MLR inferences in R

Data set: gala in package faraway – Species diversity on the Galapagos Islands.

Data set description: There are 30 Galapagos islands and 7 variables in the dataset. The relationship between the number of plant species and several geographic variables is of interest. (The original dataset contained several missing values which have been filled for convenience.)

Species: the number of plant species found on the island

Endemics: the number of endemic species

Area: the area of the island (km\(^2\))

Elevation: the highest elevation of the island (m)

Nearest: the distance from the nearest island (km)

Scruz: the distance from Santa Cruz island (km)

Adjacent: the area of the adjacent island (km\(^2\))

data(gala, package="faraway") # Loads data set "gala" in package "faraway"
head(gala)
             Species Endemics  Area Elevation Nearest Scruz Adjacent
Baltra            58       23 25.09       346     0.6   0.6     1.84
Bartolome         31       21  1.24       109     0.6  26.3   572.33
Caldwell           3        3  0.21       114     2.8  58.7     0.78
Champion          25        9  0.10        46     1.9  47.4     0.18
Coamano            2        1  0.05        77     1.9   1.9   903.82
Daphne.Major      18       11  0.34       119     8.0   8.0     1.84
names(gala)
[1] "Species"   "Endemics"  "Area"      "Elevation" "Nearest"   "Scruz"    
[7] "Adjacent" 
# fit MLR
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)
summary(lmod)

Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent, 
    data = gala)

Residuals:
     Min       1Q   Median       3Q      Max 
-111.679  -34.898   -7.862   33.460  182.584 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.068221  19.154198   0.369 0.715351    
Area        -0.023938   0.022422  -1.068 0.296318    
Elevation    0.319465   0.053663   5.953 3.82e-06 ***
Nearest      0.009144   1.054136   0.009 0.993151    
Scruz       -0.240524   0.215402  -1.117 0.275208    
Adjacent    -0.074805   0.017700  -4.226 0.000297 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 60.98 on 24 degrees of freedom
Multiple R-squared:  0.7658,    Adjusted R-squared:  0.7171 
F-statistic:  15.7 on 5 and 24 DF,  p-value: 6.838e-07
# LSE of coefficients with CI
(betas = coef(lmod))
 (Intercept)         Area    Elevation      Nearest        Scruz     Adjacent 
 7.068220709 -0.023938338  0.319464761  0.009143961 -0.240524230 -0.074804832 
confint(lmod, level = .95)
                  2.5 %      97.5 %
(Intercept) -32.4641006 46.60054205
Area         -0.0702158  0.02233912
Elevation     0.2087102  0.43021935
Nearest      -2.1664857  2.18477363
Scruz        -0.6850926  0.20404416
Adjacent     -0.1113362 -0.03827344
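
The interval formula for a single linear combination can also be applied directly. A small sketch (illustrative, reusing the lmod fit above): for a hypothetical \(\mathbf a\) that picks out the Elevation coefficient, the hand-computed interval should reproduce the corresponding row of confint(lmod); the same code works for any other choice of \(\mathbf a\).

X <- model.matrix(lmod)
a <- c(0, 0, 1, 0, 0, 0)                 # a^T beta = beta_Elevation (hypothetical choice of a)
est <- sum(a * coef(lmod))               # a^T beta-hat
se  <- summary(lmod)$sigma * sqrt(drop(t(a) %*% solve(crossprod(X)) %*% a))
tcrit <- qt(0.975, df.residual(lmod))    # t_{n-p}(1 - alpha/2) with alpha = 0.05
c(est - tcrit * se, est + tcrit * se)    # should match the Elevation row of confint(lmod)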

Confidence region

The advantage of the confidence interval relative to the corresponding hypothesis test is that we get information about plausible ranges for the parameters.

  • When we are interested in more than one parameter, we set \(\mathbf A = \mathbf I\) and \(\mathbf c =\boldsymbol \beta\) to obtain

\[ \frac{(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta})^{T}\mathbf X^{T} \mathbf X(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta})}{p \hat{\sigma}^{2}} \sim F_{p, n-p} \]

and construct a \(100\times(1-\alpha)\%\) confidence region for \(\boldsymbol \beta\) using:

\[(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta})^{T}\mathbf X^{T} \mathbf X(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}) \leq p\hat{\sigma}^{2} F_{p, n-p}(1-\alpha)\]

  • These regions are ellipsoidally shaped. Because these ellipsoids lie in higher dimensions, they cannot easily be visualized except for the two-dimensional case. (When \(p=2\), we can draw confidence ellipses.)

  • The confidence region takes into account the correlation between the coefficient estimates.

  • Let’s see how these compare to the univariate confidence intervals. For example, we can construct the joint \(95\%\) confidence region for \(\beta_{Area}\) and \(\beta_{Adjacent}\).

# install.packages("ellipse")
library(ellipse)
plot(ellipse(lmod,which = c(2,6)),type="l",ylim=c(-0.13,0))
points(coef(lmod)[2], coef(lmod)[6], pch=19)
abline(v=confint(lmod)[2,],lty=2)
abline(h=confint(lmod)[6,],lty=2)

In the figure above, we have added the point of the least squares estimates which lies at the center of the ellipse. We also added the univariate confidence intervals for both dimensions seen as dotted lines.

  • What can we conclude from this plot?

    • The joint hypothesis \(H_0 :\beta_{Area}=\beta_{Adjacent} = 0\) is rejected because the origin does not lie inside the ellipse (see the numeric check after this list).

    • The hypothesis \(H_0:\beta_{Area} = 0\) is not rejected because zero does lie within the vertical dashed lines.

    • The hypothesis \(H_0:\beta_{Adjacent} = 0\) is rejected because the horizontal dashed lines do not encompass zero.

  • Comments:

    • If you want to test multiple parameters, you need to use a joint testing procedure and not try to combine several univariate tests. (Multiple comparisons and simultaneous confidence intervals are topics beyond the scope of this course.)

    • In higher dimensions, confidence ellipsoids are not easily visualized, so our example here is of more educational than practical value.

    • Nevertheless, it should serve as a caution in interpreting a collection of univariate hypothesis tests or confidence intervals.
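
The plot-based conclusion about the joint hypothesis can also be checked numerically. A sketch (reusing lmod and the quadratic form from the extra sum of squares theorem with \(q = 2\)): the origin lies outside the joint \(95\%\) region exactly when the quadratic form exceeds \(q\hat\sigma^2 F_{q,\,n-p}(1-\alpha)\).

X <- model.matrix(lmod)
A <- rbind(c(0, 1, 0, 0, 0, 0),          # picks out beta_Area
           c(0, 0, 0, 0, 0, 1))          # picks out beta_Adjacent
d <- A %*% coef(lmod)                    # A beta-hat - c, with c = 0
Q <- t(d) %*% solve(A %*% solve(crossprod(X)) %*% t(A)) %*% d
sig2 <- summary(lmod)$sigma^2
Q > 2 * sig2 * qf(0.95, 2, df.residual(lmod))  # should be TRUE, matching the plot: origin outside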

General F-test for Nested Model Comparison

Nested Model Comparison: Motivation

Given several predictors for a response, we might wonder whether all are needed.

Consider a larger model, \(\Omega\), and a smaller model, \(\omega\), which consists of a subset of the predictors that are in \(\Omega\). We call \(\omega\) the reduced model nested in the full model \(\Omega\).

We will take \(\omega\) to represent the null hypothesis \(H_0\) and \(\Omega\) to represent the alternative \(H_1\).

  • If \(RSS_\omega - RSS_\Omega\) is small, the fit of the smaller model is almost as good as the larger model and so we would prefer the smaller model on the grounds of simplicity.

  • If the difference is large, then the superior fit of the larger model would be preferred.

This suggests a test statistic of the form

\[\frac{RSS_\omega - RSS_\Omega}{RSS_\Omega}\]

In addition, we should account for the model dimensions: suppose that the dimension (number of parameters) of \(\Omega\) is \(p\) and that of \(\omega\) is \(q\); then we scale to obtain an F-statistic,

\[F^\star = \frac{(RSS_\omega - RSS_\Omega)/(p-q)}{RSS_\Omega/(n-p)}\stackrel{H_0}{\sim} F_{p-q, n-p}\]

Comments:

Testing the significance of the MLR is a special case of nested model comparison, where the full model is the fitted MLR, and the reduced model is the intercept-only model.
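
For instance, a quick check (reusing lmod and the gala data): comparing the intercept-only model with lmod via anova() reproduces the overall F-statistic (15.7 on 5 and 24 df) reported by summary(lmod).

anova(lm(Species ~ 1, data = gala), lmod)  # overall F-test for the significance of the MLR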

Implementation in R:

lmod.reduced <- lm(Species ~ Elevation + Nearest + Scruz, gala)

anova(lmod.reduced, lmod)
Analysis of Variance Table

Model 1: Species ~ Elevation + Nearest + Scruz
Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)   
1     26 158292                               
2     24  89231  2     69060 9.2874 0.00103 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing these two models corresponds to testing the null hypothesis \(H_0:\beta_{Area}=\beta_{Adjacent} = 0\). We reject \(H_0\) and therefore reject the reduced model. (This matches the conclusion from the joint confidence region above.)
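
The F-statistic in this table can also be reproduced by hand from the two residual sums of squares, following the general formula above (a small sketch reusing lmod.reduced and lmod):

rss_reduced <- deviance(lmod.reduced)    # RSS under H0 (reduced model)
rss_full    <- deviance(lmod)            # RSS of the full model
Fstat <- ((rss_reduced - rss_full) / 2) / (rss_full / df.residual(lmod))
c(Fstat, pf(Fstat, 2, df.residual(lmod), lower.tail = FALSE))  # F and p-value, as in the anova table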

Sequential Sum of Squares and Partial Sum of Squares

Two common types of sum of squares:

  • sequential

  • partial

To introduce the concepts, we consider the simple example of the following linear model

\[y_{i}=\beta_{1}+\beta_{2} x_{i 1}+\beta_{3} x_{i 2}+\beta_{4} x_{i 3}+\epsilon_{i}\]

Sequential (type I) sum of squares

Consider fitting three nested models

Model 1: \(y_{i}=\beta_{1}+\beta_{2} x_{i 1}+\epsilon_{i}\)

Model 2: \(y_{i}=\beta_{1}+\beta_{2} x_{i 1}+\beta_{3} x_{i 2}+\epsilon_{i}\)

Model 3: \(y_{i}=\beta_{1}+\beta_{2} x_{i 1}+\beta_{3} x_{i 2}+\beta_{4} x_{i 3}+\epsilon_{i}\)

Let \(SS(x_1)\), \(SS(x_1, x_2)\) and \(SS(x_1, x_2, x_3)\) be the explained (model) SS of Models 1, 2, and 3, respectively.

  • Define notations

\[ \begin{aligned} S S_{x_{1}} &=S S\left(x_{1}\right) \\ S S_{x_{2} \mid x_{1}} &=S S\left(x_{1}, x_{2}\right)-S S\left(x_{1}\right) \\ &=S S_{\text {error }}\left(x_{1}\right)-S S_{\text {error }}\left(x_{1}, x_{2}\right) \\ S S_{x_{3} \mid x_{1}, x_{2}} &=S S\left(x_{1}, x_{2}, x_{3}\right)-S S\left(x_{1}, x_{2}\right) \\ &=S S_{\text {error }}\left(x_{1}, x_{2}\right)-S S_{\text {error }}\left(x_{1}, x_{2}, x_{3}\right) \end{aligned} \]

  • The \(SS_{total}\) may be decomposed into

\[SS_{total} = SS_{x_{1}} + SS_{x_{2} \mid x_{1}} + SS_{x_{3} \mid x_{1}, x_{2}} + SS_{error}\]

Comparing with \(SS_{total} = SS_{model}+SS_{error}\), we see that \(SS_{model}\) has been split into \(SS_{x_{1}}\), \(SS_{x_{2} \mid x_{1}}\) and \(SS_{x_{3} \mid x_{1}, x_{2}}\).

Interpretations:

  • \(S S_{x_{1}}\) : \(SS\) explained by \(x_1\)

  • \(S S_{x_{2} \mid x_{1}}\) : extra \(SS\) due to the addition of \(x_2\) to the model that already includes \(x_1\)

  • \(S S_{x_{3} \mid x_{1}, x_{2}}\) : extra \(SS\) due to the addition of \(x_3\) to the model that already includes \(x_1\) and \(x_2\)

Type I SS Table

\[ \begin{array}{ccccc} \text { Source } & S S & \text { d.f. } & M S & F \\ \hline x_{1} & S S_{x_{1}} & 1 & M S_{x_{1}} & M S_{x_{1}} / M S E \\ x_{2} \mid x_{1} & S S_{x_{2} \mid x_{1}} & 1 & M S_{x_{2} \mid x_{1}} & M S_{x_{2} \mid x_{1}} / M S E \\ x_{3} \mid x_{1}, x_{2} & S S_{x_{3} \mid x_{1}, x_{2}} & 1 & M S_{x_{3} \mid x_{1}, x_{2}} & M S_{x_{3} \mid x_{1}, x_{2}} / M S E \\ \text { error } & S S_{\text {error }} & \mathrm{n}-4 & M S E & \\ \hline \text { total } & S S_{\text {total }} & \mathrm{n}-1 & & \end{array} \]

Comments:

  • One can calculate \(S S_{x_{1}}\), \(S S_{x_{2} \mid x_{1}}\) and \(S S_{x_{3} \mid x_{1}, x_{2}}\) by fitting the three models separately, but the sequential (type I) SS computation produces them simultaneously.

  • In R, the anova() function gives the type I SS.

  • Type I SS depends on the order in which the variables are entered into the model.

Partial (type III) sum of squares

Using type III SS is equivalent to the individual t-tests in the regression summary (for a single-df term, the type III F-statistic is the square of the corresponding t-statistic).

Let’s look at the example:

  • To test the hypothesis \(H_0 : \beta_2 = 0\), we can fit the full model and the reduced model without \(x_1\), and compute \(S S_{x_{1} \mid x_{2}, x_{3}}\), the extra sum of squares explained by \(x_1\) when \(x_2\) and \(x_3\) are already included in the model.

  • Similarly, to test \(H_0 : \beta_3 = 0\) and \(H_0 : \beta_4 = 0\), we need to compute \(S S_{x_{2} \mid x_{1}, x_{3}}\) and \(S S_{x_{3} \mid x_{1}, x_{2}}\), respectively.

Type III SS Table

\[ \begin{array}{ccc} \text { Source } & S S & H_{0} \\ \hline x_{1} & S S_{x_{1} \mid x_{2}, x_{3}} & \beta_{2}=0 \\ x_{2} & S S_{x_{2} \mid x_{1}, x_{3}} & \beta_{3}=0 \\ x_{3} & S S_{x_{3} \mid x_{1}, x_{2}} & \beta_{4}=0 \end{array} \]

Comments:

One can calculate these SS’s by fitting multiple models. Partial (type III) sum of squares computes them simultaneously.

  • In R, the Anova() function from the car package gives the type III SS.

air <- na.omit(airquality)  # built-in airquality data; drop rows with missing values
fit0 <- lm(Ozone ~ 1, data=air)
fit1 <- lm(Ozone ~ Solar.R, data=air)
fit2 <- lm(Ozone ~ Solar.R + Wind, data=air)
fit3 <- lm(Ozone ~ Solar.R + Wind + Temp, data=air)
anova(fit0, fit1, fit2, fit3) # Type I SS
Analysis of Variance Table

Model 1: Ozone ~ 1
Model 2: Ozone ~ Solar.R
Model 3: Ozone ~ Solar.R + Wind
Model 4: Ozone ~ Solar.R + Wind + Temp
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    110 121802                                  
2    109 107022  1     14780 32.944 8.946e-08 ***
3    108  67053  1     39969 89.094 9.509e-16 ***
4    107  48003  1     19050 42.463 2.424e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
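
Equivalently, calling anova() on the full fit alone gives the same sequential (type I) breakdown, since terms are added in the order they appear in the model formula:

anova(fit3)  # sequential SS: Solar.R, then Wind | Solar.R, then Temp | Solar.R, Wind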
library(car)
Anova(fit3, type="III")
Anova Table (Type III tests)

Response: Ozone
            Sum Sq  Df F value    Pr(>F)    
(Intercept)   3494   1  7.7888  0.006227 ** 
Solar.R       2986   1  6.6563  0.011237 *  
Wind         11642   1 25.9495 1.516e-06 ***
Temp         19050   1 42.4630 2.424e-09 ***
Residuals    48003 107                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit3)

Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp, data = air)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.485 -14.219  -3.551  10.097  95.619 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
Solar.R       0.05982    0.02319   2.580  0.01124 *  
Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
Temp          1.65209    0.25353   6.516 2.42e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.18 on 107 degrees of freedom
Multiple R-squared:  0.6059,    Adjusted R-squared:  0.5948 
F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16