dat = read.table('cement.txt', header = T)
# head(dat)
# names(dat)
# pairs(dat)
y = dat$y
x1 = dat$x1
x2 = dat$x2
x3 = dat$x3
x4 = dat$x4
Lecture 9
Lecture 9 Overview
Variable Selection
Occam’s Razor and Trade-offs
Compare Non-nested Models: criteria
Stepwise Procedures
Best Subset Selection: exhaustive search
Model Building and Variable Selection
Why variable selection?
In practice, a large number of independent variables are often investigated, and it is likely that not all of them are important.
Statistical inferences can be carried out more efficiently with smaller models.
Simple goodness-of-fit criteria such as RSS and \(R^2\) cannot be used for selection, since they always favor the largest model.
The goal of variable selection is to find a subset of the independent variables that fits the data well without unnecessary complexity.
Occam’s Razor and Trade-offs
Our objective here is to describe systematic algorithmic strategies that can optimize the trade-off between goodness-of-fit and excess model complexity.
Occam’s razor has been the guiding principle for the compromises:
the model that fits observations sufficiently well in the least complex way should be preferred.
To be precise on “fits observations sufficiently well”, we need a quantity that measures how well a model fits the data.
To be precise on “the least complex way”, we need a quantity that measures the complexity of a model.
Understanding the “trade-offs”:
The likelihood (or RSS) provides a measure of goodness-of-fit.
The degrees of freedom (df) (i.e. number of parameters) provides a measure of model complexity.
The negative likelihood (RSS) and df are two opposite aspects of a model: the negative likelihood (RSS) decreases as df increases.
The goal of model/variable selection is to strike a balance between these two conflicting aspects.
Occam’s razor suggests that simple models should be preferred to more complicated ones, other things being equal.
Compare Non-nested Models: information criteria
Nested model comparison via general F-test has been introduced in lecture 6. Here, we look at the method for comparing non-nested models.
A direct compromise between the goodness-of-fit and model complexity is
\[-2 \times \text{maximized log-likelihood} + \xi p\]
where \(p\) is the number of parameters in the model.
We seek models that minimize the above criterion.
For a linear model under the Gaussian assumption,
\[-2 \times \text{maximized log-likelihood} = n \log(RSS/n) + \text{constant}\]
In general, \(\xi\) is a penalty parameter on model complexity: the larger \(\xi\) is, the simpler the selected model.
when \(\xi= 2\): the above criterion is called Akaike’s Information Criterion (AIC)
when \(\xi= \log n\): the above criterion is called Bayesian Information Criterion (BIC)
Comments:
The only difference between AIC and BIC is the multiplier of \(p\), the number of parameters.
The BIC places a higher penalty ( \(\log(n) > 2\) when \(n > 7\)) on the number of parameters in the model so will tend to reward more parsimonious (smaller) models. (AIC performs well for “complex” true models and poorly for “simple” true models, while BIC does just the opposite.)
For regression models, the information criteria combine information about the RSS, number of parameters \(p\) in the model, and the sample size \(n\).
A small value, compared to values for other possible models, is good.
In R, obtained by AIC() and BIC().
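For instance, a minimal sketch using the cement data loaded above (the two candidate models are chosen purely for illustration):
# compare two non-nested candidate models by AIC and BIC
fit.a = lm(y ~ x1 + x2)   # candidate model A (illustrative choice)
fit.b = lm(y ~ x3 + x4)   # candidate model B (illustrative choice)
AIC(fit.a); AIC(fit.b)    # smaller is better
BIC(fit.a); BIC(fit.b)    # BIC penalizes model size more heavily when n > 7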
Stepwise Procedures
Stepwise regression builds a regression model from a set of candidate predictors by adding and removing predictors one at a time until there is no statistically valid reason to enter or remove any more.
The goal of stepwise regression is to build a regression model that includes all of the predictor variables that are statistically significant with respect to the response variable.
Stepwise Regression using F-Tests to Choose Variables
Overview of Stepwise Regression using F-Tests to Choose Variables:
First, we start with no predictors in our “stepwise model.”
Then, at each step along the way we either enter or remove a predictor based on the general (partial) F-tests – that is, the equivalent t-tests for the slope parameters – that are obtained.
We stop when no more predictors can be justifiably entered or removed from our stepwise model, thereby leading us to a “final model.”
The specific stepwise procedure:
Step 1
Fit each of the one-predictor models – that is, regress \(y\) on \(x_1\), regress \(y\) on \(x_2\),\(\dots\), and regress \(y\) on \(x_{p-1}\).
Of those predictors whose t-test p-value is less than some \(\alpha\) level, say, \(0.15\), the first predictor put in the stepwise model is the predictor that has the smallest t-test p-value.
If no predictor has a t-test p-value less than \(\alpha\), stop.
Note that in a stepwise regression procedure, \(\alpha\) can be larger than \(0.05\), in order to make it easier to enter or remove a predictor.
Step 2
Suppose \(x_1\) had the smallest t-test p-value below \(\alpha\) and therefore was deemed the “best” single predictor arising from the first step.
Now, fit each of the two-predictor models that include \(x_1\) as a predictor – that is, regress \(y\) on \(x_1\) and \(x_2\), regress \(y\) on \(x_1\) and \(x_3\),…,and regress \(y\) on \(x_1\) and \(x_{p-1}\).
Of those predictors whose t-test p-value is less than \(\alpha\), the second predictor put in the stepwise model is the predictor that has the smallest t-test p-value.
If no predictor has a t-test p-value less than \(\alpha\), stop. The model with the one predictor obtained from the first step is your final model.
But, suppose instead that \(x_2\) was deemed the “best” second predictor and it is therefore entered into the stepwise model.
Now, since \(x_1\) was the first predictor in the model, step back and see if entering \(x_2\) into the stepwise model somehow affected the significance of the \(x_1\) predictor.
That is, check the t-test p-value for testing \(\beta_1 = 0\). If the t-test p-value for \(\beta_1 = 0\) has become not significant, remove \(x_1\) from the stepwise model.
Step 3
Suppose both \(x_1\) and \(x_2\) made it into the two-predictor stepwise model and remained there.
Now, fit each of the three-predictor models that include \(x_1\) and \(x_2\) as predictors – that is, regress \(y\) on \(x_1\), \(x_2\), and \(x_3\); regress \(y\) on \(x_1\), \(x_2\), and \(x_4\); \(\dots\); and regress \(y\) on \(x_1\), \(x_2\), and \(x_{p-1}\).
Of those predictors whose t-test p-value is less than \(\alpha\), the third predictor put in the stepwise model is the predictor that has the smallest t-test p-value.
If no predictor has a t-test p-value less than \(\alpha\), stop. The model containing the two predictors obtained from the second step is your final model.
But, suppose instead that \(x_3\) was deemed the “best” third predictor and it is therefore entered into the stepwise model.
Now, since \(x_1\) and \(x_2\) were the first predictors in the model, step back and see if entering \(x_3\) into the stepwise model somehow affected the significance of the \(x_1\) and \(x_2\) predictors.
If the t-test p-value for either \(\beta_1 = 0\) or \(\beta_2 = 0\) is no longer significant, then remove that predictor from the stepwise model.
Stopping the procedure. Continue the steps as described above until adding an additional predictor does not yield a t-test p-value below \(\alpha\).
Implement in R
- Step 1: Using lm() four times to regress \(y\) on each of the four predictors, we obtain:
(In R, to conduct the first step, we can use the add1() function, which is equivalent to but easier than calling lm() four times.)
# stepwise regression
# step 1
mod0 = lm(y ~ 1)
add1(mod0, ~.+x1+x2+x3+x4, test = 'F') # add1: add all possible single terms to a model
Single term additions
Model:
y ~ 1
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 2715.76 71.444
x1 1 1450.08 1265.69 63.519 12.6025 0.0045520 **
x2 1 1809.43 906.34 59.178 21.9606 0.0006648 ***
x3 1 776.36 1939.40 69.067 4.4034 0.0597623 .
x4 1 1831.90 883.87 58.852 22.7985 0.0005762 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-statistic for \(x_4\) is the largest (equivalently, its t-statistic is largest in absolute value), and therefore the p-value for \(x_4\) is the smallest. As a result of the first step, we enter \(x_4\) into our stepwise model.
Comments:
Note that in add1(), the general linear (partial) F-test is used. It can be checked that the p-value for the F-test for each predictor is the same as the p-value for the t-test for each predictor obtained by lm().
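As a quick check of this equivalence (illustrative only), the t-test p-value for \(x_4\) from lm() should match the F-test p-value 0.0005762 reported by add1() above:
# t-test for x4 in the one-predictor model; t^2 equals the F-statistic above
summary(lm(y ~ x4))$coefficients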
- Step 2: choose the second predictor when \(x_4\) is in the model. We regress \(y\) on \(x_4\) and \(x_1\), regress \(y\) on \(x_4\) and \(x_2\), and regress \(y\) on \(x_4\) and \(x_3\). Instead of calling lm() three times, we use the add1() function:
#step 2
mod1 = update(mod0, ~.+x4)
add1(mod1, ~.+x1+x2+x3, test = 'F') # second argument is the 'scope' formula giving the terms to be considered for adding
Single term additions
Model:
y ~ x4
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 883.87 58.852
x1 1 809.10 74.76 28.742 108.2239 1.105e-06 ***
x2 1 14.99 868.88 60.629 0.1725 0.6867
x3 1 708.13 175.74 39.853 40.2946 8.375e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-statistic (p-value) for \(x_1\) is the largest (smallest). As a result of the second step, we enter \(x_1\) into our stepwise model.
The update() function adds or removes a predictor from a linear model. In this example, the model is updated by including \(x_4\); it is the same as regressing \(y\) on \(x_4\) with lm().
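A quick check of this equivalence (illustrative only):
# update(mod0, ~.+x4) and lm(y ~ x4) give the same fitted coefficients
all.equal(coef(update(mod0, ~.+x4)), coef(lm(y ~ x4)))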
Before we proceed to the next step, we need to check whether or not adding \(x_1\) in the model affects the significance of \(x_4\):
mod2 = update(mod1, ~.+x1)
summary(mod2)
Call:
lm(formula = y ~ x4 + x1)
Residuals:
Min 1Q Median 3Q Max
-5.0234 -1.4737 0.1371 1.7305 3.7701
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 103.09738 2.12398 48.54 3.32e-13 ***
x4 -0.61395 0.04864 -12.62 1.81e-07 ***
x1 1.43996 0.13842 10.40 1.11e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.734 on 10 degrees of freedom
Multiple R-squared: 0.9725, Adjusted R-squared: 0.967
F-statistic: 176.6 on 2 and 10 DF, p-value: 1.581e-08
The p-value for \(x_4\) shows that adding \(x_1\) in the model doesn’t affect the significance of \(x_4\).
- Step 3: choose the third predictor when \(x_4\) and \(x_1\) are in the model.
We regress \(y\) on \(x_4\), \(x_1\), and \(x_2\); and we regress \(y\) on \(x_4\), \(x_1\), and \(x_3\), obtaining:
#step 3
# mod2 = update(mod1, ~.+x1)
add1(mod2, ~.+x2+x3, test = 'F')
Single term additions
Model:
y ~ x4 + x1
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 74.762 28.742
x2 1 26.789 47.973 24.974 5.0259 0.05169 .
x3 1 23.926 50.836 25.728 4.2358 0.06969 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The predictor \(x_2\) has the smallest F-test p-value. Therefore, as a result of the third step, we enter \(x_2\) into our stepwise model.
Comments:
Recall that in a stepwise regression procedure, we don’t need to set the \(\alpha\) level as strict as \(0.05\). It could be larger.
Now, since \(x_1\) and \(x_4\) were the first two predictors in the model, we must step back and see if entering \(x_2\) into the stepwise model affected the significance of the \(x_1\) and \(x_4\) predictors.
#step 4
mod3 = update(mod2, ~.+x2)
summary(mod3)
Call:
lm(formula = y ~ x4 + x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-3.0919 -1.8016 0.2562 1.2818 3.8982
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.6483 14.1424 5.066 0.000675 ***
x4 -0.2365 0.1733 -1.365 0.205395
x1 1.4519 0.1170 12.410 5.78e-07 ***
x2 0.4161 0.1856 2.242 0.051687 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.309 on 9 degrees of freedom
Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
F-statistic: 166.8 on 3 and 9 DF, p-value: 3.323e-08
If we choose \(\alpha = 0.15\), then the p-value for \(x_4\) is larger than \(\alpha\). Therefore, we remove the predictor \(x_4\) from the stepwise model, leaving us with the predictors \(x_1\) and \(x_2\) in our stepwise model.
- Step 4: choose the third predictor when \(x_1\) and \(x_2\) are in the model.
We regress \(y\) on \(x_1\), \(x_2\), and \(x_3\).
(Note that we don’t need to regress \(y\) on \(x_1\), \(x_2\), and \(x_4\), since we just delete \(x_4\) from a model containing \(x_1\), \(x_2\), and \(x_4\).)
#remove x4 because the p-value is large
mod4 = update(mod3, ~.-x4)
#check x3 => p-value is large and stop the searching
add1(mod4, ~.+x3, test = 'F')
Single term additions
Model:
y ~ x1 + x2
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 57.904 25.420
x3 1 9.7939 48.111 25.011 1.8321 0.2089
- Stop the stepwise regression procedure
According to its p-value, \(x_3\) is not eligible for entry into our stepwise model. That is, we stop our stepwise regression procedure. Our final regression model, based on the stepwise procedure using F-tests to choose predictors, contains only the predictors \(x_1\) and \(x_2\).
#final model
mod.final = lm(y ~ x1 + x2)
summary(mod.final)
Call:
lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-2.893 -1.574 -1.302 1.363 4.048
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.57735 2.28617 23.00 5.46e-10 ***
x1 1.46831 0.12130 12.11 2.69e-07 ***
x2 0.66225 0.04585 14.44 5.03e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.406 on 10 degrees of freedom
Multiple R-squared: 0.9787, Adjusted R-squared: 0.9744
F-statistic: 229.5 on 2 and 10 DF, p-value: 4.407e-09
Stepwise Regression using AIC to Choose Variables
- Forward addition (bottom-up)
Start with no covariates (or a preselected set)
Add one variable according to some criterion (e.g. AIC)
Stop when no variables need to be added
- Backward elimination (top-down)
Start with a model containing all covariates
Eliminate one variable according to some criterion (e.g. AIC)
Stop when no variables need to be dropped
Stepwise regression is a combination of forward addition and backward elimination.
In R, we can use the step() function to carry out the stepwise regression procedure, instead of calling add1() or lm() multiple times. In this function, the default criterion for choosing predictors is AIC. For this example, we get the following:
#find the model by AIC using stepwise regression
mod0 = lm(y ~ 1)
mod.upper = lm(y ~ x1 + x2 + x3 + x4)
step(mod0, scope = list(lower = mod0, upper = mod.upper))
Start:  AIC=71.44
y ~ 1
Df Sum of Sq RSS AIC
+ x4 1 1831.90 883.87 58.852
+ x2 1 1809.43 906.34 59.178
+ x1 1 1450.08 1265.69 63.519
+ x3 1 776.36 1939.40 69.067
<none> 2715.76 71.444
Step: AIC=58.85
y ~ x4
Df Sum of Sq RSS AIC
+ x1 1 809.10 74.76 28.742
+ x3 1 708.13 175.74 39.853
<none> 883.87 58.852
+ x2 1 14.99 868.88 60.629
- x4 1 1831.90 2715.76 71.444
Step: AIC=28.74
y ~ x4 + x1
Df Sum of Sq RSS AIC
+ x2 1 26.79 47.97 24.974
+ x3 1 23.93 50.84 25.728
<none> 74.76 28.742
- x1 1 809.10 883.87 58.852
- x4 1 1190.92 1265.69 63.519
Step: AIC=24.97
y ~ x4 + x1 + x2
Df Sum of Sq RSS AIC
<none> 47.97 24.974
- x4 1 9.93 57.90 25.420
+ x3 1 0.11 47.86 26.944
- x2 1 26.79 74.76 28.742
- x1 1 820.91 868.88 60.629
Call:
lm(formula = y ~ x4 + x1 + x2)
Coefficients:
(Intercept) x4 x1 x2
71.6483 -0.2365 1.4519 0.4161
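In the call above, step() considers both adding and dropping terms at each step (note the '-' rows in the tables). Pure forward addition or backward elimination can be requested through the direction argument; a brief sketch using the objects defined above:
# forward addition only
step(mod0, scope = list(lower = mod0, upper = mod.upper), direction = 'forward')
# backward elimination only, starting from the full model
step(mod.upper, direction = 'backward')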
Some Cautions about Stepwise Regression Procedure:
The final model is not guaranteed to be optimal in any specified sense.
The procedure yields a single final model, although there are often several equally good models.
Stepwise regression does not take into account a researcher’s knowledge about the predictors. It may be necessary to force the procedure to include important predictors.
One should not over-interpret the order in which predictors are entered into the model.
Best Subset Selection: exhaustive search
Best subset selection (exhaustive search) Procedure:
For \(k = 1, \ldots, p - 1\), fit all models that contain the intercept and exactly \(k\) independent variables, and select the one, say \(\mathcal M_k\) , with the smallest RSS. Denote \(\mathcal M_0\) as the model with intercept only.
Select the best model among \(\mathcal M_0,\ldots,\mathcal M_{p-1}\) with the smallest AIC (or based on other criteria)
Comment:
Exhaustive Search checks all combinations of candidate predictors but consequently requires more computational processing time.
Since we need to fit \(2^{p-1}\) models (e.g., 20 candidate predictors already give \(2^{20} \approx\) one million models), the above procedure is impractical when \(p\) is large.
Details of the procedure:
Step 1
First, identify all of the possible regression models derived from all of the possible combinations of the candidate predictors. Unfortunately, this can be a huge number of possible models.
- Suppose we have one candidate predictor, \(x_1\). Then, there are two possible regression models we can consider:
one model with no predictors.
one model with the predictor \(x_1\).
- Suppose we have \(2\) candidate predictor – \(x_1\) and \(x_2\). Then, there are \(4\) possible regression models we can consider:
one model with no predictors.
two models with the only one predictor each – the model with \(x_1\) alone; the model with \(x_2\) alone.
one model with all two predictors – the model with \(x_1\) and \(x_2\).
In general, if there are \(p-1\) possible candidate predictors, then there are \(2^{p-1}\) possible regression models containing the predictors.
Step 2
- From the possible models identified in the first step, determine
the one-predictor models that do the “best”
the two-predictor models that do the “best”, and so on.
- Note that this step returns a separate best model for each size up to \(p-1\). Since the model selection criteria (e.g., AIC, BIC) differ only in how models of different sizes are compared, the “best” model of a given size (i.e., \(\mathcal M_k \in \{\mathcal M_0,\ldots,\mathcal M_{p-1}\}\)) does not depend on the choice of cost-complexity trade-off.
Step 3
- Select the best model among \(\mathcal M_0,\ldots,\mathcal M_{p-1}\) with some well-defined criteria. Some possible criteria:
the largest \(R^2\)
the largest adjusted \(R^2\)
the smallest MSE (or square root of MSE)
a small Mallows’ \(C_p\) statistic such that \(C_p \approx p\) (introduced below)
Note that the different criteria quantify different aspects of the regression model, and therefore often yield different choices for the best set of predictors.
Further evaluate and refine the handful of models identified in the last step. This might entail performing residual analyses, transforming the predictors and/or response, adding interaction terms, and so on.
Do this until you are satisfied that you have found a model that
meets the model conditions,
does a good job of summarizing the trend in the data,
and most importantly allows you to answer your research question.
Implement in R
# best subsets regression
# install.packages('leaps')
library(leaps)
mod = regsubsets(cbind(x1,x2,x3,x4), y)
summary.mod = summary(mod)
# "best" k-predictor models
summary.mod$which
  (Intercept)   x1    x2    x3    x4
1 TRUE FALSE FALSE FALSE TRUE
2 TRUE TRUE TRUE FALSE FALSE
3 TRUE TRUE TRUE FALSE TRUE
4 TRUE TRUE TRUE TRUE TRUE
summary.mod$which shows which predictors are included in the “best” model of each size \(k\). For example, the first row tells us that among all possible models with only one predictor, the model with \(x_4\) is the “best”.
# criterion: R^2
summary.mod$rsq
[1] 0.6745420 0.9786784 0.9823355 0.9823756
summary.mod$rsq shows the \(R^2\) for each “best” model. Going from the “best” one-predictor model to the “best” two-predictor model, the \(R^2\) value jumps from about 0.67 to 0.98, which is the largest increase.
Based on the \(R^2\)-value criterion, the “best” model is the model with the two predictors \(x_1\) and \(x_2\).
# criterion: adjusted R^2
summary.mod$adjr2
[1] 0.6449549 0.9744140 0.9764473 0.9735634
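To pick the model size with the largest adjusted \(R^2\) programmatically (a small convenience step, not part of the regsubsets output itself):
# index of the "best" size by adjusted R^2; here it is 3, i.e. the model with x1, x2, x4
which.max(summary.mod$adjr2)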
# criterion: compute MSE
n = length(y)
rss = summary.mod$rss
mses = c(rss[1]/(n-2), rss[2]/(n-3), rss[3]/(n-4), rss[4]/(n-5))
mses
[1] 80.351538  5.790448  5.330303  5.982955
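Equivalently, the MSE values can be computed in one vectorized line (same arithmetic as above):
# MSE_k = RSS_k / (n - (k + 1)) for the best model with k = 1, ..., 4 predictors
rss / (n - (2:5))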
Mallows’ Cp
Mallows’ \(C_p\) criterion is defined as
\[C_p = RSS/\sigma^2 +2p-n\]
When \(\sigma^2\) is unknown, replace it by an estimate
\[C_p = RSS/\hat\sigma^2 +2p-n\]
Identify models with small \(p\) and \(C_p\) such that \(C_p \approx p\)
A model with a poor fit will have \(C_p\) much bigger than \(p\). Plotting \(C_p\) versus \(p\) provides a way to check for large bias (points far away from the line \(C_p = p\)).
Comments:
The full model always yields \(C_p =p\), so don’t select the full model based on \(C_p\).
If all models, except the full model, yield a large \(C_p\) not near \(p\), it suggests some important predictor(s) are missing from the analysis.
In this case, we are well-advised to identify the predictors that are missing.
If a number of models have \(C_p\) near \(p\), choose the model with the smallest \(C_p\) value, thereby ensuring that the combination of the bias and the variance is at a minimum.
When more than one model has a \(C_p\) value near \(p\), in general, choose the simpler model or the model that meets your research needs.
# criterion: Mallows' Cp
(cp = summary.mod$cp)
[1] 138.730833   2.678242   3.018233   5.000000
# make a plot to find the best model(s): Cp is close to p
p = 2:5
plot(p, cp, type = 'o', cex = 1.5, main = 'Cp vs P (Number of Parameters)')
abline(0, 1, col="red")
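As a sanity check, the \(C_p\) values above can be reproduced by hand; the sketch below assumes \(\hat\sigma^2\) is the MSE of the full model (the conventional choice, consistent with the full model's \(C_p = 5\) above):
# estimate sigma^2 by the full-model MSE, then apply Cp = RSS/sigma2.hat + 2p - n
sigma2.hat = sum(resid(mod.upper)^2) / (n - 5)
rss / sigma2.hat + 2 * (2:5) - n   # should reproduce summary.mod$cp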