For simplicity in notation, our SLR equations reduce to a short and simple statement \[\mathbf{y}=\mathbf{X} \boldsymbol{\beta}+\boldsymbol{\varepsilon},\] where \(\mathbf{X}\) is an \(n \times 2\) matrix, \(\mathbf{y}\) is an \(n \times 1\) column vector, \(\boldsymbol\beta\) is a \(2 \times 1\) column vector, and \(\boldsymbol\varepsilon\) is an \(n \times 1\) column vector.
Matrix Representation of Multiple Linear Regression (MLR)
Recall that in SLR we sought a line that fits the data “best,” i.e., such that the \(n\) prediction errors are as small as possible in some overall sense.
In MLR, we define the best estimate of \(\boldsymbol{\beta}\) as the one which minimizes the sum of the squared errors:
\[
\sum_{i=1}^{n} \varepsilon_{i}^{2}=\boldsymbol\varepsilon^{T} \boldsymbol\varepsilon=(\mathbf y-\mathbf X \boldsymbol\beta)^{T}(\mathbf y-\mathbf X \boldsymbol\beta)
\]
Differentiating with respect to
\(\boldsymbol{\beta}\) and setting to zero, we find that \(\hat{\boldsymbol{\beta}}\) satisfies: \[\mathbf X^T \mathbf X\hat{\boldsymbol{\beta}} = \mathbf X^T \mathbf y,\]
which are called the normal equations.
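For a concrete illustration, here is a minimal numpy sketch (the simulated data and variable names are illustrative only, not from these notes) that builds the \(n \times 2\) design matrix for SLR and solves the normal equations:

```python
import numpy as np

# Simulate data from y = 2 + 3x + noise and solve the normal equations
# X^T X beta_hat = X^T y (SLR case, so X has two columns).
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2 + 3 * x + rng.normal(0, 1, size=n)

X = np.column_stack([np.ones(n), x])          # n x 2 design matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solve X^T X b = X^T y
print(beta_hat)                               # close to [2, 3]
```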
Supplemental derivation to obtain normal equations
Differentiating the sum of squared errors with respect to \(\boldsymbol{\beta}\) and setting it to zero, we obtain:
\[\frac{\partial (\mathbf y-\mathbf X \boldsymbol\beta)^{T}(\mathbf y-\mathbf X \boldsymbol\beta)}{\partial \boldsymbol\beta}\\
=\frac{\partial (\mathbf y^T \mathbf y-\mathbf y^T\mathbf X \boldsymbol\beta - \boldsymbol\beta^T \mathbf X^T \mathbf y + \boldsymbol\beta^T \mathbf X^T \mathbf X \boldsymbol\beta)}{\partial \boldsymbol\beta}\\
= -\mathbf X^T \mathbf y - \mathbf X^T \mathbf y + 2\mathbf X^T \mathbf X \boldsymbol\beta=0\] by the properties of derivatives of vectors and matrices, i.e., of first order
\[\frac{\partial \mathbf x^T \mathbf a}{\partial \mathbf x} = \frac{\partial \mathbf a^T \mathbf x}{\partial \mathbf x} = \mathbf a\] and of second order \[\frac{\partial \mathbf x^T \mathbf A \mathbf x}{\partial \mathbf x} =(\mathbf A + \mathbf A^T)\mathbf x.\]
Therefore, we find that \(\hat{\boldsymbol{\beta}}\) satisfies the normal equations \(\mathbf X^T \mathbf X\hat{\boldsymbol{\beta}} = \mathbf X^T \mathbf y\). Provided \(\mathbf X^T \mathbf X\) is invertible, the least squares estimator is \[\hat{\boldsymbol{\beta}} = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \mathbf y,\]
with variance \[\operatorname{Var}(\hat{\boldsymbol{\beta}})=\sigma^2 (\mathbf X^T \mathbf X)^{-1}\]
Sometimes you need the standard error of a particular component of \(\hat{\boldsymbol{\beta}}\), which can be picked out as \[\operatorname{SE}(\hat{\boldsymbol{\beta}}_{i-1})= \sigma \sqrt{(\mathbf X^T \mathbf X)^{-1}_{ii}}\] (the index shift reflects the convention that the intercept is \(\beta_0\), so \(\hat\beta_{i-1}\) corresponds to the \(i\)th diagonal entry of \((\mathbf X^T \mathbf X)^{-1}\)).
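Continuing in the same spirit, here is a short numpy sketch (again with simulated data; since \(\sigma^2\) is unknown in practice, the sketch plugs in \(\hat\sigma^2 = SS_{error}/(n-p)\)) that computes the estimated \(\operatorname{Var}(\hat{\boldsymbol{\beta}})\) and the standard errors:

```python
import numpy as np

# Estimate Var(beta_hat) = sigma^2 (X^T X)^{-1}, replacing the unknown
# sigma^2 by sigma_hat^2 = RSS / (n - p), and take square roots of the
# diagonal to obtain the standard errors.
rng = np.random.default_rng(0)
n, p = 50, 2
x = rng.uniform(0, 10, size=n)
y = 2 + 3 * x + rng.normal(0, 1, size=n)
X = np.column_stack([np.ones(n), x])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)      # unbiased estimate of sigma^2

var_beta = sigma2_hat * XtX_inv           # estimated Var(beta_hat)
se_beta = np.sqrt(np.diag(var_beta))      # SE(beta_hat_0), SE(beta_hat_1)
print(se_beta)
```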
Supplementary derivation details
Expectation of \(\hat{\boldsymbol{\beta}}\)
We define the expectation (or mean) of a random vector \(\mathbf v = (v_1,\ldots, v_d)\) of length \(d\), denoted \(\operatorname{E}(\mathbf v)\), to also be a vector of the same length, having \(\operatorname{E}(v_i)\) as its \(i\)th component. Also, for such a random vector \(\mathbf v\) and any matrix \(\mathbf A\) whose entries are fixed (non-random) and which has \(d\) columns, we have the identity: \[\operatorname{E}(\mathbf A\mathbf v)= \mathbf A\operatorname{E}(\mathbf v)\]
Then, \[\operatorname{E}(\hat{\boldsymbol{\beta}}) = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \operatorname{E}(\mathbf y )\\
= (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \mathbf X \boldsymbol{\beta}=\boldsymbol{\beta}\]
Variance of \(\hat{\boldsymbol{\beta}}\)
We define the variance of a random vector \(\mathbf v = (v_1,\ldots, v_d)\) of length \(d\) to be a \(d \times d\) matrix, denoted \(\operatorname{Var}(\mathbf v)\), having \(\operatorname{Var}(v_i)\) as its \((i,i)\) entry and \(\operatorname{Cov}(v_i,v_j)=\operatorname{Cov}(v_j,v_i)\) as both its \((i,j)\) and \((j,i)\) entries. For any matrix \(\mathbf A\) whose entries are fixed (non-random) and which has \(d\) columns, we have the identity: \[\operatorname{Var}(\mathbf A\mathbf v)= \mathbf A\operatorname{Var}(\mathbf v) \mathbf A^\top\]
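Applying this identity to \(\hat{\boldsymbol{\beta}} = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \mathbf y\), with \(\operatorname{Var}(\mathbf y)=\operatorname{Var}(\boldsymbol\varepsilon)=\sigma^2 \mathbf I\) under the usual assumption of uncorrelated, constant-variance errors, recovers the variance stated above:
\[\operatorname{Var}(\hat{\boldsymbol{\beta}}) = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \operatorname{Var}(\mathbf y)\, \mathbf X (\mathbf X^T \mathbf X)^{-1}\\
= (\mathbf X^T \mathbf X)^{-1}\mathbf X^T (\sigma^2 \mathbf I)\, \mathbf X (\mathbf X^T \mathbf X)^{-1}\\
= \sigma^2 (\mathbf X^T \mathbf X)^{-1}\]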
For any \(p\)-dimensional vector \(\mathbf c\), \(\mathbf c^T \hat{\boldsymbol{\beta}}\) is the unique best linear unbiased estimator (BLUE) of \(\mathbf c^T \boldsymbol \beta\), in the sense that it has the minimum variance (equivalently, the minimum mean squared error \(\operatorname{E}((\mathbf c^T \hat{\boldsymbol{\beta}}-\mathbf c^T \boldsymbol \beta)^2)\)) in the class of linear unbiased estimators of \(\mathbf c^T \boldsymbol \beta\).
Comments:
The least squares estimate results from an orthogonal projection of \(\mathbf y\) onto the model space, so it makes sense geometrically (a small numerical check follows these comments).
Geometrical representation of the estimation of \(\boldsymbol{\beta}\). The data vector \(\mathbf y\) is projected orthogonally onto the model space spanned by \(\mathbf X\). The fit is represented by the projection \(\hat{\mathbf y}=\mathbf X \hat{\boldsymbol{\beta}}\) (or \(\hat{\mathbf y}=\mathbf H \mathbf y\), where \(\mathbf H\) is an orthogonal projection matrix), with the difference between the fit and the data represented by the residual vector \(\hat{\boldsymbol \varepsilon}\) (i.e., \(\mathbf e\)).
If the errors are independent and identically normally distributed, it is the maximum likelihood estimator (MLE). Loosely put, the MLE is the value of \(\boldsymbol{\beta}\) that maximizes the probability of the data that was observed.
The Gauss–Markov theorem states that \(\hat{\boldsymbol{\beta}}\) is the best linear unbiased estimate (BLUE).
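As a quick numerical check of the projection interpretation, here is a numpy sketch (simulated data, illustrative only): the hat matrix is symmetric and idempotent, and the residual vector is orthogonal to the model space.

```python
import numpy as np

# The hat matrix H = X (X^T X)^{-1} X^T orthogonally projects y onto the
# column space of X; the residuals it leaves behind are orthogonal to
# every column of X.
rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 1, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat (projection) matrix
y_hat = H @ y                          # fitted values, same as X beta_hat
resid = y - y_hat                      # residual vector e

print(np.allclose(H @ H, H))           # idempotent
print(np.allclose(H, H.T))             # symmetric
print(np.allclose(X.T @ resid, 0))     # residuals orthogonal to model space
```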
Decomposition of Sum of Squares in Matrix Notation
Let \(\mathbf 1 = (1, \ldots, 1)^\top\) be a vector of all ones. When the LM contains the intercept, \(\mathbf x_1 =\mathbf 1\)
Let \(\bar y = \sum_{i=1}^n y_i/n\)
By geometrically representing the following equality
\[(\mathbf y-\bar y \mathbf 1) = (\hat{\mathbf y}-\bar y \mathbf 1) + (\mathbf y-\hat{\mathbf y})\]
we see that \(\bar y \mathbf 1\), as an estimator of \(\mathbf y\), also lies in the model space spanned by \(\mathbf X\) (since the model contains the intercept), and the vectors \((\hat{\mathbf y}-\bar y \mathbf 1)\) and \((\mathbf y-\hat{\mathbf y})\) are orthogonal. Therefore, by Pythagoras’ theorem, we have
\[\|\mathbf y-\bar y \mathbf 1\|^2 = \|\hat{\mathbf y}-\bar y \mathbf 1\|^2 + \|\mathbf y-\hat{\mathbf y}\|^2\]
The term \(\|\mathbf y-\bar y \mathbf 1\|^2 = \sum_{i=1}^n(y_i - \bar y)^2\) is the total sum of squares (\(SS_{total}\))
The term \(\|\hat{\mathbf y}-\bar y \mathbf 1\|^2=\sum_{i=1}^n(\hat y_i - \bar y)^2\) is the explained sum of squares by the model (\(SS_{model}\) or \(SS_{regression}\) )
The term \(\|\mathbf y-\hat{\mathbf y}\|^2\) is the residual sum of squares (RSS) (also denoted as \(SS_{error}\))
The above equation decomposes \(SS_{total}\) into two parts: the part explained by the LM and the unexplained part:
\[SS_{total}=SS_{model}+SS_{error}\]
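The decomposition is easy to verify numerically; here is a sketch with simulated data (the model includes an intercept, so the identity applies):

```python
import numpy as np

# Check SS_total = SS_model + SS_error for a least squares fit with intercept.
rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.0, 2.0]) + rng.normal(0, 1, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
y_bar = y.mean()

ss_total = np.sum((y - y_bar) ** 2)
ss_model = np.sum((y_hat - y_bar) ** 2)
ss_error = np.sum((y - y_hat) ** 2)
print(np.isclose(ss_total, ss_model + ss_error))  # True
```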
ANOVA Table of MLR
| Source | df | SS | MS |
|--------|----------|------------------|----------------------------------|
| Model  | \(p-1\)  | \(SS_{model}\)   | \(MS_{model} = SS_{model}/(p-1)\) |
| Error  | \(n-p\)  | \(SS_{error}\)   | \(MSE = SS_{error}/(n-p)\)        |
| Total  | \(n-1\)  | \(SS_{total}\)   |                                   |
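The same quantities fill in the ANOVA table; a numpy sketch (simulated data, with \(p = 3\) coefficients including the intercept):

```python
import numpy as np

# Compute the ANOVA table entries (df, SS, MS) for an MLR fit with p = 3
# coefficients (intercept plus two predictors).
rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.0, 2.0]) + rng.normal(0, 1, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
ss_model = np.sum((y_hat - y.mean()) ** 2)
ss_error = np.sum((y - y_hat) ** 2)
ss_total = np.sum((y - y.mean()) ** 2)

print("Model:", p - 1, ss_model, ss_model / (p - 1))   # df, SS, MS_model
print("Error:", n - p, ss_error, ss_error / (n - p))   # df, SS, MSE
print("Total:", n - 1, ss_total)                       # df, SS
```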
Goodness-of-fit
A brief review:
In SLR, we introduced a measure of goodness-of-fit, the so-called coefficient of determination, or \(R^2\).
It gives the proportion of the variation in the response explained by the model.
\(R^2\) is the square of the multiple correlation coefficient, which is defined as the sample (Pearson) correlation coefficient between \(\mathbf{y}\) and \(\mathbf{\hat{y}}\); a small numerical illustration follows.
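The following numpy sketch (simulated data, illustrative only) checks both characterizations of \(R^2\):

```python
import numpy as np

# R^2 as the proportion of variation explained, and as the squared sample
# correlation between y and the fitted values y_hat.
rng = np.random.default_rng(4)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.8, -1.2]) + rng.normal(0, 1, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

r2_from_ss = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r2_from_corr = np.corrcoef(y, y_hat)[0, 1] ** 2
print(np.isclose(r2_from_ss, r2_from_corr))  # True
```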
Adjusted \(R^2\)
One problem with standard \(R^2\) is that it always increases (or stays the same) as more predictors are added to a multiple linear regression model, even if the predictors added are unrelated to the response variable.
Hence, when models are compared using \(R^2\), those with fewer independent variables are effectively penalized, even when they explain the behavior of the response relatively well.
Adjusted \(R^2\) (denoted \(\bar{R}^{2}\)) is a modification of \(R^2\) that adjusts for the number of independent variables in a model. Note that \[R^{2} = 1-\frac{\operatorname{VAR}_{E}}{\operatorname{VAR}_{T}},\] where \(\operatorname{VAR}_{E} = SS_{error}/n\) and \(\operatorname{VAR}_{T} =SS_{total}/n\) are estimates of the variances of the errors and of the observations, respectively. In the definition of \(\bar{R}^{2}\), these estimates are replaced by “unbiased” versions, \(\operatorname{VAR}_{E} = SS_{error}/(n-p)\) and \(\operatorname{VAR}_{T} =SS_{total}/(n-1)\), giving \[\bar{R}^{2} = 1-\frac{SS_{error}/(n-p)}{SS_{total}/(n-1)}\]
Comparison between \(\bar{R}^{2}\) and \(R^{2}\)
When a variable is added to the model, \(R^{2}\) always increases while \(\bar{R}^{2}\) can increase or decrease.
Unlike \(R^{2}\), \(\bar{R}^{2}\) increases only if the new term improves the model more than would be expected by chance. \(\bar{R}^{2}\) can be negative, and will always be less than or equal to \(R^{2}\) (see the sketch after this list).
(Your Turn: Can you explain why \(\bar{R}^{2}\) could be negative? You may give derivation and/or an example to describe this situation.)
\(\bar{R}^{2}\) is useful in the variable selection stage of model building. \(R^{2}\) is not useful for variable selection.
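Here is a sketch of the contrast (simulated data; the second fit adds a pure-noise predictor that is unrelated to the response by construction):

```python
import numpy as np

# Adding a useless predictor never lowers R^2, but it typically lowers
# adjusted R^2 because of the (n - p) penalty in the error variance estimate.
rng = np.random.default_rng(5)
n = 60
x1 = rng.normal(size=n)
y = 1 + 2 * x1 + rng.normal(0, 1, size=n)

def r2_and_adj_r2(X, y):
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    ss_error = np.sum((y - X @ beta_hat) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_error / ss_total
    adj_r2 = 1 - (ss_error / (n - p)) / (ss_total / (n - 1))
    return r2, adj_r2

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, rng.normal(size=n)])  # add a noise predictor
print(r2_and_adj_r2(X_small, y))
print(r2_and_adj_r2(X_big, y))  # R^2 goes up; adjusted R^2 usually goes down
```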
Inference for MLR
MLE under Gaussian assumption
The LS estimation does not require a Gaussian assumption on the random errors, while inference (hypothesis tests and confidence intervals) does.
When \(\epsilon_i \stackrel{iid}{\sim} \mathcal{N}(0, \sigma^2)\), the LS estimates \(\hat{\boldsymbol\beta}\) are also the MLE.
Within a MLR model, we may want to know whether a particular predictor is making a useful contribution to the model.
To carry out each test, the t-statistic (\(t^\star\)) is still calculated as
\[t^\star = \frac{\hat\beta_j}{\operatorname{SE}(\hat\beta_j)}\]
When we cannot reject the null hypothesis (\(H_0: \beta_j= 0\)), we conclude that variable \(x_j\) is not needed in the model, given that all of the other variables remain in the model.
Null distribution: \[t^\star\sim t_{n-p}\]
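A sketch of the individual-coefficient t-test (numpy plus scipy.stats for the \(t_{n-p}\) tail probabilities; simulated data in which \(x_2\) has no true effect):

```python
import numpy as np
from scipy import stats

# t-test of H0: beta_j = 0 for each coefficient, using t* = beta_hat_j / SE
# and the null distribution t_{n-p}.
rng = np.random.default_rng(6)
n, p = 60, 3
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 2 * x1 + rng.normal(0, 1, size=n)   # x2 has no effect on y

X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

t_star = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_star), df=n - p)  # two-sided p-values
print(t_star)
print(p_values)   # the p-value for x2 is expected to be large
```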
F-test for Significance of MLR
For MLR, the F-test concerns the following statistical hypotheses:
\[H_0: \beta_1=\ldots=\beta_{p-1}= 0 \\
H_1: \beta_j \neq 0 \quad\text{for at least one} \, j\in \{1,2,\ldots, p-1\}\]
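The usual test statistic is the ratio \(F^\star = MS_{model}/MSE\) from the ANOVA table, compared against an \(F_{p-1,\,n-p}\) null distribution. A sketch (simulated data, illustrative only):

```python
import numpy as np
from scipy import stats

# Overall F-test of H0: beta_1 = ... = beta_{p-1} = 0 using
# F* = MS_model / MSE with null distribution F_{p-1, n-p}.
rng = np.random.default_rng(7)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.6, -0.4]) + rng.normal(0, 1, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
ss_model = np.sum((y_hat - y.mean()) ** 2)
ss_error = np.sum((y - y_hat) ** 2)

f_star = (ss_model / (p - 1)) / (ss_error / (n - p))
p_value = stats.f.sf(f_star, p - 1, n - p)   # P(F_{p-1, n-p} > F*)
print(f_star, p_value)
```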