Lecture 5

Author

Jiaye Xu

Published

March 14, 2022

Lecture 5 Overview

Multiple Linear Regression (MLR)

  • Matrix Form of the Linear Model (LM)

  • Least Squares Estimate (LSE) in Matrix Notation

    • Fitted values and hat matrix

    • Residuals, Residual Sum of Squares (RSS) and estimate of \(\sigma^2\)

  • Properties of the LS estimates

  • Decomposition of Sum of Squares in Matrix Notation

  • Goodness-of-fit

  • Inferences for MLR (with R)

    • MLE under Gaussian assumption

    • Significance Testing of Each Variable

    • F-test for Significance of MLR


Multiple Linear Regression (MLR)

Now we move from the SLR model with one predictor to the MLR model with two or more predictors.

Everything we’ve learned about the SLR model extends to the MLR model with at most minor modification.

Matrix Form of the Linear Model (LM)

Simple Linear Regression in Matrix Form

It is more efficient to use matrices to define the LM and the subsequent analyses. Let us look at a heuristic example, the matrix formulation of SLR:

Previously, the SLR equation is written as

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n.\]

If we actually elaborate on the equations for \(i = 1, \ldots, n\), we obtain \(n\) equations \[\begin{array}{cc} y_{1} &=\beta_{0}+\beta_{1} x_{1}+\varepsilon_{1} \\ y_{2} &=\beta_{0}+\beta_{1} x_{2}+\varepsilon_{2} \\ \vdots & \\ y_{n} &=\beta_{0}+\beta_{1} x_{n}+\varepsilon_{n} \end{array}\]

We can formulate the above SLR equations in matrix notation: \[\left(\begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array}\right)=\left(\begin{array}{cc} 1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\ 1 & x_{n} \end{array}\right)\left(\begin{array}{c} \beta_{0} \\ \beta_{1} \end{array}\right)+\left(\begin{array}{c} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{array}\right). \]

For simplicity in notation, our SLR equations reduce to a short and simple statement \[\mathbf{y}=\mathbf{X} \boldsymbol{\beta}+\boldsymbol{\varepsilon},\] where \(\mathbf{X}\) is an \(n \times 2\) matrix, \(\mathbf{y}\) is an \(n \times 1\) column vector, \(\boldsymbol\beta\) is a \(2 \times 1\) column vector, and \(\boldsymbol\varepsilon\) is an \(n \times 1\) column vector.
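For instance, the \(n \times 2\) design matrix can be built directly in R. A minimal sketch with made-up predictor values (not data from the lecture):

# a minimal sketch with made-up x values: the SLR design matrix
# is simply a column of 1s bound to the predictor
x = c(1.2, 0.7, 3.4, 2.1, 5.0)
X = cbind(1, x)   # n x 2: intercept column and predictor column
X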

Matrix Representation of Multiple Linear Regression (MLR)

The MLR model takes the same matrix form \[\mathbf{y}=\mathbf{X} \boldsymbol{\beta}+\boldsymbol{\epsilon},\] where \[\mathbf{y}=\left(\begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array}\right),\quad \mathbf{x}_j=\left(\begin{array}{c} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{array}\right), \quad \boldsymbol{\epsilon}=\left(\begin{array}{c} \epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{n} \end{array}\right), \quad \boldsymbol{\beta}=\left(\begin{array}{c} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p-1} \end{array}\right)\] and \[\mathbf{X}= (\mathbf{x}_1, \ldots, \mathbf{x}_p)= \left(\begin{array}{ccc}x_{11}&\ldots & x_{1p}\\ \vdots & \ddots & \vdots \\ x_{n1} & \ldots & x_{np}\end{array}\right)\] is the \(n \times p\) design matrix; when the model contains an intercept, the first column \(\mathbf{x}_1\) is a column of ones.
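In R, model.matrix() constructs this \(n \times p\) design matrix, including the leading column of ones for the intercept. A minimal sketch on the built-in trees data set (used here only for illustration, not the lecture's example):

X = model.matrix(~ Girth + Height, data = trees)   # n x p design matrix with intercept column
head(X)
dim(X)   # here p = 3: intercept plus two predictors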

Least Squares Estimate (LSE) in Matrix Notation

Recall that for LSE in SLR, our goal is to find a line that fits the data “best”, i.e., such that the \(n\) prediction errors are as small as possible in some overall sense.

  • In MLR, we define the best estimate of \(\boldsymbol{\beta}\) as the one which minimizes the sum of the squared errors:

\[ \sum_{i=1}^n \varepsilon_{i}^{2}=\boldsymbol\varepsilon^{T} \boldsymbol\varepsilon=(\mathbf y-\mathbf X \boldsymbol\beta)^{T}(\mathbf y-\mathbf X \boldsymbol\beta) \]

  • Differentiating with respect to \(\boldsymbol{\beta}\) and setting the result to zero, we find that \(\hat{\boldsymbol{\beta}}\) satisfies \[\mathbf X^T \mathbf X\hat{\boldsymbol{\beta}} = \mathbf X^T \mathbf y,\] which are called the normal equations.

Supplemental derivation to obtain normal equations

Differentiating the sum of squared errors with respect to \(\boldsymbol{\beta}\) and setting it to zero, we obtain

\[\frac{\partial (\mathbf y-\mathbf X \boldsymbol\beta)^{T}(\mathbf y-\mathbf X \boldsymbol\beta)}{\partial \boldsymbol\beta}\\ =\frac{\partial (\mathbf y^T \mathbf y-\mathbf y^T\mathbf X \boldsymbol\beta - \boldsymbol\beta^T \mathbf X^T \mathbf y + \boldsymbol\beta^T \mathbf X^T \mathbf X \boldsymbol\beta)}{\partial \boldsymbol\beta}\\ = -\mathbf X^T \mathbf y - \mathbf X^T \mathbf y + 2\mathbf X^T \mathbf X \boldsymbol\beta=0\]

by the standard identities for derivatives with respect to a vector: the first-order identity

\[\frac{\partial \mathbf x^T \mathbf a}{\partial \mathbf x} = \frac{\partial \mathbf a^T \mathbf x}{\partial \mathbf x} = \mathbf a\] and the second-order identity \[\frac{\partial \mathbf x^T \mathbf A \mathbf x}{\partial \mathbf x} =(\mathbf A + \mathbf A^T)\mathbf x.\]

Therefore, we find that \(\hat{\boldsymbol{\beta}}\) satisfies the normal equations:

\[\mathbf X^T \mathbf X\hat{\boldsymbol{\beta}} = \mathbf X^T \mathbf y.\]

  • Now, provided \(\mathbf X^T \mathbf X\) is invertible (i.e., \(\mathbf X\) has full column rank), the LS estimate of \(\boldsymbol{\beta}\) is

\[\hat{\boldsymbol{\beta}} = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \mathbf y\]
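A minimal sketch, again using the built-in trees data purely for illustration, checking that solving the normal equations reproduces the coefficients from lm(). Solving the linear system is numerically preferable to explicitly forming \((\mathbf X^T \mathbf X)^{-1}\):

y = trees$Volume
X = model.matrix(~ Girth + Height, data = trees)
beta_hat = solve(t(X) %*% X, t(X) %*% y)          # solve X'X b = X'y (normal equations)
beta_hat
coef(lm(Volume ~ Girth + Height, data = trees))   # should agree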

Fitted values and hat matrix

  • Let \(\hat{\mathbf y} =(\hat y_1, \ldots, \hat y_n)^\top\) be the vector of fitted values. Then,

\[\hat{\mathbf y} = \mathbf X \hat{\boldsymbol{\beta}}=\mathbf X(\mathbf X^T \mathbf X)^{-1}\mathbf X^T \mathbf y \equiv \mathbf H \mathbf y\]

  • The matrix

\[\mathbf H = \mathbf X(\mathbf X^T \mathbf X)^{-1}\mathbf X^T\]

is the so-called hat (projection, influence) matrix.

  • \(\mathbf H\) is symmetric, idempotent (\(\mathbf H = \mathbf H^2\)) and

\[\operatorname{rank}(\mathbf H) = \operatorname{tr}(\mathbf H) = p\]
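A short sketch (built-in trees data, illustration only) verifying the stated properties of \(\mathbf H\):

X = model.matrix(~ Girth + Height, data = trees)
H = X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
all.equal(H, t(H))                     # symmetric
all.equal(H, H %*% H)                  # idempotent: H = H^2
sum(diag(H))                           # trace = p = 3 here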

Residuals, Residual Sum of Squares (RSS) and estimate of \(\sigma^2\)

  • Define residuals

\[e_i = y_i -\hat y_i\]

  • Let \(\mathbf e =(e_1, \ldots, e_n)^\top\). Then

\[\mathbf e= \mathbf y-\hat{\mathbf y} = (\mathbf I- \mathbf H)\mathbf y\]

  • Residual Sum of Squares (RSS) in matrix form

\[RSS = \sum_{i=1}^n e_i^2= \mathbf e^T \mathbf e=\mathbf y^T (\mathbf I- \mathbf H)\mathbf y\]

because \(\mathbf I- \mathbf H\) is also symmetric and idempotent.

  • An unbiased estimator of \(\sigma^2\) is

\[\hat\sigma^2 = RSS/(n-p)\]

\(n-p\) is called the residual degrees of freedom (the degrees of freedom for error).
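A sketch (trees data, illustration only) computing the residuals, RSS and \(\hat\sigma^2\) from \(\mathbf I - \mathbf H\) and comparing with lm():

fit = lm(Volume ~ Girth + Height, data = trees)
X = model.matrix(fit)
y = trees$Volume
H = X %*% solve(t(X) %*% X) %*% t(X)
e = (diag(nrow(X)) - H) %*% y          # e = (I - H) y
RSS = sum(e^2)                         # = e'e = y'(I - H)y
n = nrow(X); p = ncol(X)
sqrt(RSS / (n - p))                    # sigma_hat
sigma(fit)                             # lm()'s residual standard error, should agree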

Properties of the LS estimates

The LS estimate is unbiased

\[\operatorname{E}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}\]

with variance \[\operatorname{Var}(\hat{\boldsymbol{\beta}})=\sigma^2 (\mathbf X^T \mathbf X)^{-1}\]

Sometimes you need the standard error of a particular component of \(\hat{\boldsymbol{\beta}}\), which can be picked out from the diagonal of this matrix as \[\operatorname{SE}(\hat{\beta}_{i-1})= \sigma \sqrt{(\mathbf X^T \mathbf X)^{-1}_{ii}}, \quad i = 1, \ldots, p,\] with \(\sigma\) replaced by \(\hat\sigma\) in practice.
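A sketch (trees data, illustration only) that plugs \(\hat\sigma^2\) into the formula above and compares with the standard errors reported by summary():

fit = lm(Volume ~ Girth + Height, data = trees)
X = model.matrix(fit)
sigma2_hat = sum(resid(fit)^2) / (nrow(X) - ncol(X))
se_hand = sqrt(diag(sigma2_hat * solve(t(X) %*% X)))   # sqrt of diagonal of sigma2_hat (X'X)^{-1}
cbind(se_hand, summary(fit)$coefficients[, "Std. Error"])   # should agree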

Supplementary derivation details

  • Expectation of \(\hat{\boldsymbol{\beta}}\)

We define the expectation (or mean) of a random vector \(\mathbf v = (v_1,\ldots, v_d)\) of length \(d\), denoted \(\operatorname{E}(\mathbf v)\), to also be a vector of the same length, having \(\operatorname{E}(v_i)\) as its \(i\)th component. Also, with respect to such a random vector \(\mathbf v\), we have, for any matrix \(\mathbf A\) whose entries are fixed (non-random) and with \(d\) columns, the identity: \[\operatorname{E}(\mathbf A\mathbf v)= \mathbf A\operatorname{E}(\mathbf v)\]

Then, since \(\operatorname{E}(\boldsymbol\varepsilon)=\mathbf 0\) implies \(\operatorname{E}(\mathbf y)=\mathbf X\boldsymbol\beta\), \[\operatorname{E}(\hat{\boldsymbol{\beta}}) = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \operatorname{E}(\mathbf y )\\ = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \mathbf X \boldsymbol{\beta}=\boldsymbol{\beta}\]

  • Variance of \(\hat{\boldsymbol{\beta}}\)

We define the variance of a random vector \(\mathbf v = (v_1,\ldots, v_d)\) of length \(d\) to be a \(d \times d\) matrix, denoted \(\operatorname{Var}(\mathbf v)\), having \(\operatorname{Var}(v_i)\) as its \((i,i)\) entry and \(\operatorname{Cov}(v_i,v_j)=\operatorname{Cov}(v_j,v_i)\) as both its \((i,j)\) and \((j,i)\) entries. For any matrix \(\mathbf A\) whose entries are fixed (non-random) and with \(d\) columns, we have the identity: \[\operatorname{Var}(\mathbf A\mathbf v)= \mathbf A\operatorname{Var}(\mathbf v) \mathbf A^\top\]

Therefore, \[\operatorname{Var}(\hat{\boldsymbol{\beta}}) = (\mathbf X^T \mathbf X)^{-1}\operatorname{Var}(\mathbf X^T \mathbf y )\left((\mathbf X^T \mathbf X)^{-1}\right)^T\\ = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T\operatorname{Var}( \mathbf y )\mathbf X\left((\mathbf X^T \mathbf X)^{-1}\right)^T\\ = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \sigma^2 \mathbf{I}\, \mathbf X\left((\mathbf X^T \mathbf X)^{-1}\right)^T\\ =\sigma^2 (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \mathbf X (\mathbf X^T \mathbf X)^{-1} =\sigma^2 (\mathbf X^T \mathbf X)^{-1}\]

BLUE (Gauss-Markov theorem)

For any \(p\)-dimensional vector \(\mathbf c\), \(\mathbf c^T \hat{\boldsymbol{\beta}}\) is the unique best linear unbiased estimator (BLUE) of \(\mathbf c^T \boldsymbol \beta\), in the sense that it has the minimum variance (equivalently, the minimum mean squared error \(\operatorname{E}((\mathbf c^T \hat{\boldsymbol{\beta}}-\mathbf c^T \boldsymbol \beta)^2)\)) in the class of linear unbiased estimators of \(\mathbf c^T \boldsymbol \beta\).

Comments:

  1. It results from an orthogonal projection onto the model space. It makes sense geometrically.

     (Figure: geometrical representation of the estimation of \(\boldsymbol{\beta}\). The data vector \(\mathbf y\) is projected orthogonally onto the model space spanned by \(\mathbf X\). The fit is represented by the projection \(\hat{\mathbf y}=\mathbf X \hat{\boldsymbol{\beta}}\) (or \(\hat{\mathbf y}=\mathbf H \mathbf y\), where \(\mathbf H\) is an orthogonal projection matrix), with the difference between the fit and the data represented by the residual vector \(\hat{\boldsymbol \varepsilon}\), i.e., \(\mathbf e\).)

  2. If the errors are independent and identically normally distributed, it is the maximum likelihood estimator (MLE). Loosely put, the MLE is the value of \(\boldsymbol{\beta}\) that maximizes the probability of the data that was observed.

  3. The Gauss–Markov theorem states that \(\hat{\boldsymbol{\beta}}\) is the best linear unbiased estimate (BLUE).

Decomposition of Sum of Squares in Matrix Notation

  • Let \(\mathbf 1 = (1, \ldots, 1)^\top\) be a vector of all ones. When the LM contains the intercept, \(\mathbf x_1 =\mathbf 1\)

  • Let \(\bar y = \sum_{i=1}^n y_i/n\)

  • By geometrically representing the following equality

\[(\mathbf y-\bar y \mathbf 1) = (\hat{\mathbf y}-\bar y \mathbf 1) + (\mathbf y-\hat{\mathbf y})\]

we know that \(\bar y \mathbf 1\), as an estimator of \(\mathbf y\), also lies in the model space spanned by \(\mathbf X\), and that the vectors \((\hat{\mathbf y}-\bar y \mathbf 1)\) and \((\mathbf y-\hat{\mathbf y})\) are orthogonal. Therefore, by Pythagoras’ theorem, we have

\[\|\mathbf y-\bar y \mathbf 1\|^2 = \|\hat{\mathbf y}-\bar y \mathbf 1\|^2 + \|\mathbf y-\hat{\mathbf y}\|^2\]

  • The term \(\|\mathbf y-\bar y \mathbf 1\|^2 = \sum_{i=1}^n(y_i - \bar y)^2\) is the total sum of squares (\(SS_{total}\))

  • The term \(\|\hat{\mathbf y}-\bar y \mathbf 1\|^2=\sum_{i=1}^n(\hat y_i - \bar y)^2\) is the sum of squares explained by the model (\(SS_{model}\) or \(SS_{regression}\))

  • The term \(\|\mathbf y-\hat{\mathbf y}\|^2\) is the residual sum of squares (RSS) (also denoted as \(SS_{error}\))

  • The above equation decomposes \(SS_{total}\) into two parts: the part explained by the LM and the unexplained part:

\[SS_{total}=SS_{model}+SS_{error}\]
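A quick numerical check of this decomposition (trees data, illustration only):

fit = lm(Volume ~ Girth + Height, data = trees)
y = trees$Volume; yhat = fitted(fit); ybar = mean(y)
SS_total = sum((y - ybar)^2)
SS_model = sum((yhat - ybar)^2)
SS_error = sum((y - yhat)^2)
c(SS_total, SS_model + SS_error)   # the two numbers agree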

ANOVA Table of MLR

Source   df       SS               MS
Model    \(p-1\)    \(SS_{model}\)     \(MS_{model} = SS_{model}/(p-1)\)
Error    \(n-p\)    \(SS_{error}\)     \(MSE = SS_{error}/(n-p)\)
Total    \(n-1\)    \(SS_{total}\)
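For reference, R's anova() prints a sequential (per-term) breakdown of \(SS_{model}\); summing the term rows recovers the Model line of the table above (trees data, illustration only):

fit = lm(Volume ~ Girth + Height, data = trees)
a = anova(fit)
a                              # one SS row per term plus a Residuals row
sum(a[["Sum Sq"]][-nrow(a)])   # SS_model
a[["Sum Sq"]][nrow(a)]         # SS_error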

Goodness-of-fit

A brief review:

In SLR, we have introduced a measure of goodness-of-fit which is the so-called coefficient of determination, or \(R^2\)

\[R^2 = \frac{SS_{model}}{SS_{total}} = 1- \frac{SS_{error}}{SS_{total}}\]

  • It gives the proportion of the variation in the response explained by the model.

  • \(R^2\) is the square of the multiple correlation coefficient (i.e., Pearson correlation coefficient) which is defined as the sample correlation coefficient between \(\mathbf{y}\) and \(\mathbf{\hat{y}}\).

Adjusted \(R^2\)

One problem with standard \(R^2\) is that it always increases (or stays the same) as more predictors are added to a multiple linear regression model, even if the predictors added are unrelated to the response variable.

Hence, judging by \(R^2\) alone puts models with fewer independent variables at a disadvantage, even when they explain the behavior of the response relatively well.

Adjusted \(R^2\) (denoted \(\bar{R}^{2}\)) is a modification of \(R^2\) that adjusts for the number of independent variables in a model:

\[\bar{R}^{2}=1-\frac{n-1}{n-p}(1-R^2)\] Equivalently,

\[ \bar{R}^{2}=1-\frac{S S_{\text {error }} /(n-p)}{S S_{\text {total }} /(n-1)} \]

Comments:

The principle behind the \(\bar{R}^{2}\) can be seen by rewriting \(R^2\) as

\[ R^{2}=1-\frac{\operatorname{VAR}_{E}}{\operatorname{VAR}_{T}} \]

where \(\operatorname{VAR}_{E} = SS_{error}/n\) and \(\operatorname{VAR}_{T} =SS_{total}/n\) are estimates of the variances of the errors and of the observations, respectively. In the definition of \(\bar{R}^{2}\), these estimates are replaced by “unbiased” versions: \(\operatorname{VAR}_{E} = SS_{error}/(n-p)\) and \(\operatorname{VAR}_{T} =SS_{total}/(n-1)\)
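A sketch computing \(R^2\) and \(\bar{R}^{2}\) from these pieces and checking against summary() (trees data, illustration only):

fit = lm(Volume ~ Girth + Height, data = trees)
y = trees$Volume
n = length(y); p = length(coef(fit))
SS_error = sum(resid(fit)^2)
SS_total = sum((y - mean(y))^2)
R2  = 1 - SS_error / SS_total
aR2 = 1 - (n - 1) / (n - p) * (1 - R2)
c(R2, summary(fit)$r.squared)        # agree
c(aR2, summary(fit)$adj.r.squared)   # agree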

Comparison between \(\bar{R}^{2}\) and \(R^{2}\)

  • When a variable is added to the model, \(R^{2}\) always increases while \(\bar{R}^{2}\) can increase or decrease.

  • Unlike \(R^{2}\), \(\bar{R}^{2}\) increases only if the new term improves the model more than would be expected by chance. \(\bar{R}^{2}\) can be negative, and will always be less than or equal to \(R^{2}\).

(Your Turn: Can you explain why \(\bar{R}^{2}\) could be negative? You may give derivation and/or an example to describe this situation.)

  • \(\bar{R}^{2}\) is useful in the variable selection stage of model building. \(R^{2}\) is not useful for variable selection.

Inference for MLR

MLE under Gaussian assumption

LS estimation does not require the Gaussian assumption on the random errors, while inference (hypothesis tests and confidence intervals) does.

  • When \(\epsilon_i \stackrel{iid}{\sim} \mathcal{N}(0, \sigma^2)\), the LS estimates \(\hat{\boldsymbol\beta}\) are also the MLE.

  • \(\hat{\boldsymbol\beta} \sim \mathcal{N}(\boldsymbol\beta, \sigma^2(\mathbf X^T \mathbf X)^{-1})\)
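A small Monte Carlo sketch of this sampling distribution; the design, coefficients and error SD below are made up purely for illustration:

set.seed(1)
n = 50; beta = c(1, 2, -1); sigma = 0.5    # made-up values
X = cbind(1, rnorm(n), rnorm(n))           # fixed design with an intercept column
B = t(replicate(2000, {
  y = X %*% beta + rnorm(n, sd = sigma)    # generate y under iid Gaussian errors
  solve(t(X) %*% X, t(X) %*% y)[, 1]       # LS estimate for this replicate
}))
colMeans(B)                                # close to beta (unbiasedness)
round(cov(B), 4)                           # close to the theoretical covariance below
round(sigma^2 * solve(t(X) %*% X), 4)      # sigma^2 (X'X)^{-1}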

Significance Testing of Each Variable

Within a MLR model, we may want to know whether a particular predictor is making a useful contribution to the model.

To carry out each test, the t-statistic (\(t^\star\)) is still calculated as

\[t^\star = \frac{\hat\beta_j}{\operatorname{SE}(\hat\beta_j)}\]

When we cannot reject the null hypothesis (\(H_0: \beta_j= 0\)), we conclude that we do not need variable \(x_j\) in the model, given that the rest of the variables remain in the model.

Null distribution: \[t^\star\sim t_{n-p}\]
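A sketch reproducing the t-statistics and two-sided p-values in the summary() table by hand (trees data, illustration only):

fit = lm(Volume ~ Girth + Height, data = trees)
n = nrow(trees); p = length(coef(fit))
se = sqrt(diag(vcov(fit)))                 # SE(beta_hat_j)
tstat = coef(fit) / se
pval  = 2 * pt(abs(tstat), df = n - p, lower.tail = FALSE)
cbind(tstat, pval)
summary(fit)$coefficients[, c("t value", "Pr(>|t|)")]   # should agree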

F-test for Significance of MLR

For MLR, the F−test concerns the following statistical hypothesis test:

\[H_0: \beta_1=\ldots=\beta_{p-1}= 0 \\ H_1: \beta_j \neq 0 \quad\text{for at least one} \, j\in \{1,2,\ldots, p-1\}\]

  • Test statistic

\[F^\star =\frac{SS_{regression}/(p-1)}{SS_{error}/(n-p)}=\frac{MSR}{MSE}\]

  • Null distribution

\[F^\star \sim F(p-1, n-p)\]
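A sketch of the overall F-test computed from the sums of squares and compared with summary() (trees data, illustration only):

fit = lm(Volume ~ Girth + Height, data = trees)
y = trees$Volume; n = length(y); p = length(coef(fit))
SS_error = sum(resid(fit)^2)
SS_model = sum((fitted(fit) - mean(y))^2)
Fstat = (SS_model / (p - 1)) / (SS_error / (n - p))
c(Fstat, pf(Fstat, p - 1, n - p, lower.tail = FALSE))   # F* and its p-value
summary(fit)$fstatistic                                 # value, numdf, dendf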

Implementation in R, using a blood pressure data set with blood pressure (BP) as the response and age (Age) and weight (Weight) as predictors. The first six rows and the variable names of the data set are shown below, followed by the fitted MLR model.

  Pt  BP Age Weight  BSA Dur Pulse Stress
1  1 105  47   85.4 1.75 5.1    63     33
2  2 115  49   94.2 2.10 3.8    70     14
3  3 116  49   95.3 1.98 8.2    72     10
4  4 117  50   94.7 2.01 5.8    73     99
5  5 112  51   89.4 1.89 7.0    72     95
6  6 121  48   99.5 2.25 9.3    71     10
[1] "Pt"     "BP"     "Age"    "Weight" "BSA"    "Dur"    "Pulse"  "Stress"
# fit MLR; this assumes the data frame shown above has been read in and attach()ed,
# so that BP, Age and Weight are directly available
fit2 = lm(BP ~ Age + Weight)
summary(fit2)

Call:
lm(formula = BP ~ Age + Weight)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.89968 -0.35242  0.06979  0.35528  0.82781 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -16.57937    3.00746  -5.513 3.80e-05 ***
Age           0.70825    0.05351  13.235 2.22e-10 ***
Weight        1.03296    0.03116  33.154  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5327 on 17 degrees of freedom
Multiple R-squared:  0.9914,    Adjusted R-squared:  0.9904 
F-statistic: 978.2 on 2 and 17 DF,  p-value: < 2.2e-16
# LSE of coefficients with CI 
(betas = coef(fit2))
(Intercept)         Age      Weight 
-16.5793694   0.7082515   1.0329611 
confint(fit2, level = .95)
                  2.5 %      97.5 %
(Intercept) -22.9245526 -10.2341861
Age           0.5953468   0.8211561
Weight        0.9672272   1.0986950
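As a closing check (a sketch, not part of the lecture output), the confidence intervals above can be reproduced from \(\hat{\boldsymbol{\beta}}\), its standard errors, and the \(t_{n-p}\) quantile:

# by-hand 95% CIs: beta_hat_j +/- t_{0.975, n-p} * SE(beta_hat_j)
X = model.matrix(fit2)
n = nrow(X); p = ncol(X)
se = sqrt(diag(vcov(fit2)))
cbind(coef(fit2) - qt(0.975, n - p) * se,
      coef(fit2) + qt(0.975, n - p) * se)   # matches confint(fit2, level = .95)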