Lecture 9

Author

Jiaye Xu

Published

April 18, 2022

Lecture 9 Overview

  • Variable Selection

    • Occam’s Razor and Trade-offs

    • Compare Non-nested Models: criteria

    • Stepwise Procedures

    • Best Subset Selection: exhaustive search


Model Building and Variable Selection

Why variable selection?

  • In practice, a large number of independent variables are often investigated, and it is likely that not all of them are important.

  • Statistical inferences can be carried out more efficiently with smaller models.

  • Simple goodness-of-fit criteria such as RSS and \(R^2\) cannot be used since they will lead to the largest model.

The goal of variable selection is to find a subset of the independent variables that explains the response well without including unnecessary terms.
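
To illustrate the point about RSS and \(R^2\), here is a minimal sketch with simulated data (not from the lecture): adding a pure-noise predictor can never increase RSS or decrease \(R^2\), so these criteria always favor the largest model.

# simulated illustration: adding a pure-noise predictor never decreases R^2
set.seed(1)
x.sim = rnorm(50)
z.noise = rnorm(50)                              # unrelated to the response
y.sim = 2 + 3 * x.sim + rnorm(50)
summary(lm(y.sim ~ x.sim))$r.squared             # R^2 of the correct model
summary(lm(y.sim ~ x.sim + z.noise))$r.squared   # no smaller, although z.noise is useless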

Occam’s Razor and Trade-offs

Our objective here is to describe systematic algorithmic strategies that can optimize the trade-off between goodness-of-fit and excess model complexity.

Occam’s razor has been the guiding principle for the compromises:

the model that fits observations sufficiently well in the least complex way should be preferred.

  • To be precise on “fits observations sufficiently well”, we need a quantity that measures how well a model fits the data.

  • To be precise on “the least complex way”, we need a quantity that measures the complexity of a model.

Understanding the “trade-offs”:

  • The likelihood (or RSS) provides a measure of goodness-of-fit.

  • The degrees of freedom (df) (i.e. number of parameters) provides a measure of model complexity.

  • The negative log-likelihood (or RSS) and the df pull in opposite directions: the negative log-likelihood (RSS) decreases as the df increases.

  • The goal of model/variable selection is to strike a balance between these two conflicting aspects.

  • Occam’s razor suggests that simple models should be preferred to more complicated ones, other things being equal.

Compare Non-nested Models: information criteria

Nested model comparison via the general F-test was introduced in Lecture 6. Here, we look at criteria for comparing non-nested models.

A direct compromise between the goodness-of-fit and model complexity is

\[-2 \times \text{(maximized log-likelihood)} + \xi p\]

where

  • We seek the model that minimizes the above criterion.

  • for LM with Gaussian assumption,

\[-2 \times \text{(maximized log-likelihood)} = n \log(RSS/n) + \text{constant}\]

  • In general, \(\xi\) is a penalty parameter on model complexity: a larger \(\xi\) leads to a simpler selected model.

    • when \(\xi= 2\): the above criterion is called Akaike’s Information Criterion (AIC)

    • when \(\xi= \log n\): the above criterion is called Bayesian Information Criterion (BIC)

    • Comments:

      • The only difference between AIC and BIC is the multiplier of \(p\), the number of parameters.

      • The BIC places a higher penalty ( \(\log(n) > 2\) when \(n > 7\)) on the number of parameters in the model so will tend to reward more parsimonious (smaller) models. (AIC performs well for “complex” true models and poorly for “simple” true models, while BIC does just the opposite.)

      • For regression models, the information criteria combine information about the RSS, number of parameters \(p\) in the model, and the sample size \(n\).

      • A smaller value, compared with the values for other candidate models, indicates a better model.

      • In R, obtained by AIC() and BIC().
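
For a Gaussian linear model the maximized log-likelihood equals \(-\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(RSS/n) - \frac{n}{2}\), which is where the \(n \log(RSS/n) + \text{constant}\) expression above comes from. Below is a minimal sketch (assuming fit is any fitted lm object) relating the criterion to R’s built-in functions:

# sketch (assuming 'fit' is any fitted lm object): relate n*log(RSS/n) + xi*p
# to R's built-in functions
n = length(residuals(fit))
rss = sum(residuals(fit)^2)
p = length(coef(fit))
n * log(rss/n) + 2 * p       # AIC up to an additive constant (xi = 2)
n * log(rss/n) + log(n) * p  # BIC up to an additive constant (xi = log n)
extractAIC(fit)              # drops the constant; this is the version step() reports
AIC(fit); BIC(fit)           # include the full log-likelihood constant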

Stepwise Procedures

Stepwise regression is a procedure for building a regression model from a set of candidate predictors by entering and removing predictors one at a time, until there is no statistically valid reason to enter or remove any more.

The goal of stepwise regression is to build a regression model that includes all of the predictor variables that are statistically significantly related to the response variable.

Stepwise Regression using F-Tests to Choose Variables

Overview of Stepwise Regression using F-Tests to Choose Variables:

  • First, we start with no predictors in our “stepwise model.”

  • Then, at each step along the way, we either enter or remove a predictor based on the general (partial) F-tests (equivalently, the t-tests for the slope parameters) that are obtained.

  • We stop when no more predictors can be justifiably entered or removed from our stepwise model, thereby leading us to a “final model.”

The specific stepwise procedure:

Step 1

  1. Fit each of the one-predictor models – that is, regress \(y\) on \(x_1\), regress \(y\) on \(x_2\), \(\dots\), and regress \(y\) on \(x_{p-1}\).

  2. Of those predictors whose t-test p-value is less than some \(\alpha\) level, say, \(0.15\), the first predictor put in the stepwise model is the predictor that has the smallest t-test p-value.

  3. If no predictor has a t-test p-value less than \(\alpha\), stop.

Note that in a stepwise regression procedure, \(\alpha\) can be larger than \(0.05\), in order to make it easier to enter or remove a predictor.

Step 2

  1. Suppose \(x_1\) had the smallest t-test p-value below \(\alpha\) and therefore was deemed the “best” single predictor arising from the first step.

  2. Now, fit each of the two-predictor models that include \(x_1\) as a predictor – that is, regress \(y\) on \(x_1\) and \(x_2\), regress \(y\) on \(x_1\) and \(x_3\), \(\dots\), and regress \(y\) on \(x_1\) and \(x_{p-1}\).

  3. Of those predictors whose t-test p-value is less than \(\alpha\), the second predictor put in the stepwise model is the predictor that has the smallest t-test p-value.

  4. If no predictor has a t-test p-value less than \(\alpha\), stop. The model with the one predictor obtained from the first step is your final model.

  5. But, suppose instead that \(x_2\) was deemed the “best” second predictor and it is therefore entered into the stepwise model.

  6. Now, since \(x_1\) was the first predictor in the model, step back and see if entering \(x_2\) into the stepwise model somehow affected the significance of the \(x_1\) predictor.

That is, check the t-test p-value for testing \(\beta_1 = 0\). If this p-value is no longer significant (i.e. exceeds \(\alpha\)), remove \(x_1\) from the stepwise model.

Step 3

  1. Suppose both \(x_1\) and \(x_2\) made it into the two-predictor stepwise model and remained there.

  2. Now, fit each of the three-predictor models that include \(x_1\) and \(x_2\) as predictors – that is, regress \(y\) on \(x_1\), \(x_2\), and \(x_3\); regress \(y\) on \(x_1\), \(x_2\), and \(x_4\); \(\dots\); and regress \(y\) on \(x_1\), \(x_2\), and \(x_{p-1}\).

  3. Of those predictors whose t-test p-value is less than \(\alpha\), the third predictor put in the stepwise model is the predictor that has the smallest t-test p-value.

  4. If no predictor has a t-test p-value less than \(\alpha\), stop. The model containing the two predictors obtained from the second step is your final model.

  5. But, suppose instead that \(x_3\) was deemed the “best” third predictor and it is therefore entered into the stepwise model.

  6. Now, since \(x_1\) and \(x_2\) were the first two predictors in the model, step back and see if entering \(x_3\) into the stepwise model affected the significance of the \(x_1\) and \(x_2\) predictors.

If the t-test p-value for either \(\beta_1 = 0\) or \(\beta_2 = 0\) has become not significant, then remove that predictor from the stepwise model.

Stopping the procedure. Continue the steps as described above until adding an additional predictor does not yield a t-test p-value below \(\alpha\).

Implementation in R

# read the cement data (response y and predictors x1-x4)
dat = read.table('cement.txt', header = T)
# head(dat)
# names(dat)
# pairs(dat)
# extract the response and the four candidate predictors
y = dat$y
x1 = dat$x1
x2 = dat$x2
x3 = dat$x3
x4 = dat$x4
  • Step 1: Using lm() four times to regress \(y\) on each of the four predictors, we obtain:

(In R, to conduct the first step, we can use the add1() function, which is equivalent to but easier than calling lm() four times.)

# stepwise regression

# step 1
mod0 = lm(y ~ 1)
add1(mod0, ~.+x1+x2+x3+x4, test = 'F') # add1: add all possible single terms to a model
Single term additions

Model:
y ~ 1
       Df Sum of Sq     RSS    AIC F value    Pr(>F)    
<none>              2715.76 71.444                      
x1      1   1450.08 1265.69 63.519 12.6025 0.0045520 ** 
x2      1   1809.43  906.34 59.178 21.9606 0.0006648 ***
x3      1    776.36 1939.40 69.067  4.4034 0.0597623 .  
x4      1   1831.90  883.87 58.852 22.7985 0.0005762 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-statistic for \(x_4\) is the largest (equivalently, its t-statistic is largest in absolute value), and therefore the p-value for \(x_4\) is the smallest. As a result of the first step, we enter \(x_4\) into our stepwise model.

Comments:

Note that in add1(), the general linear (partial) F-test is used. It can be checked that the p-value for the F-test for each predictor is the same as the p-value for the t-test for each predictor obtained by lm().
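
A quick way to make this check, reusing dat from above (the values should match the Pr(>F) column of the add1() output):

# check: the t-test p-values from the four simple regressions equal the
# Pr(>F) column of add1(), since F = t^2 when a single term is added
sapply(c('x1', 'x2', 'x3', 'x4'), function(v) {
  fit.v = lm(reformulate(v, response = 'y'), data = dat)
  summary(fit.v)$coefficients[v, 'Pr(>|t|)']
})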

  • Step 2: choose the second predictor when \(x_4\) is in the model. We regress \(y\) on \(x_4\) and \(x_1\), regress \(y\) on \(x_4\) and \(x_2\), and regress \(y\) on \(x_4\) and \(x_3\). Instead of calling lm() three times, we use the add1() function:
#step 2
mod1 = update(mod0, ~.+x4)
add1(mod1, ~.+x1+x2+x3, test = 'F') # second argument is formula of 'scope' giving the terms to be considered for adding
Single term additions

Model:
y ~ x4
       Df Sum of Sq    RSS    AIC  F value    Pr(>F)    
<none>              883.87 58.852                       
x1      1    809.10  74.76 28.742 108.2239 1.105e-06 ***
x2      1     14.99 868.88 60.629   0.1725    0.6867    
x3      1    708.13 175.74 39.853  40.2946 8.375e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The F-statistic (p-value) for \(x_1\) is the largest (smallest). As a result of the second step, we enter \(x_1\) into our stepwise model.

  • The update() function adds predictors to, or removes predictors from, an existing model. In this example, the model is updated by including \(x_4\); it is the same as regressing \(y\) on \(x_4\) with lm().
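
A one-line check of this equivalence (reusing mod0 from above):

# update(mod0, ~ . + x4) fits the same model as calling lm(y ~ x4) directly
all.equal(coef(update(mod0, ~ . + x4)), coef(lm(y ~ x4)))   # should be TRUE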

Before we proceed to the next step, we need to check whether or not adding \(x_1\) in the model affects the significance of \(x_4\):

mod2 = update(mod1, ~.+x1)
summary(mod2)

Call:
lm(formula = y ~ x4 + x1)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0234 -1.4737  0.1371  1.7305  3.7701 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 103.09738    2.12398   48.54 3.32e-13 ***
x4           -0.61395    0.04864  -12.62 1.81e-07 ***
x1            1.43996    0.13842   10.40 1.11e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.734 on 10 degrees of freedom
Multiple R-squared:  0.9725,    Adjusted R-squared:  0.967 
F-statistic: 176.6 on 2 and 10 DF,  p-value: 1.581e-08

The p-value for \(x_4\) shows that adding \(x_1\) to the model doesn’t affect the significance of \(x_4\).

  • Step 3: choose the third predictor when \(x_4\) and \(x_1\) are in the model.

We regress \(y\) on \(x_4\), \(x_1\), and \(x_2\); and we regress \(y\) on \(x_4\), \(x_1\), and \(x_3\), obtaining:

#step 3
# mod2 = update(mod1, ~.+x1)
add1(mod2, ~.+x2+x3, test = 'F')
Single term additions

Model:
y ~ x4 + x1
       Df Sum of Sq    RSS    AIC F value  Pr(>F)  
<none>              74.762 28.742                  
x2      1    26.789 47.973 24.974  5.0259 0.05169 .
x3      1    23.926 50.836 25.728  4.2358 0.06969 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The predictor \(x_2\) has the smallest F-test p-value. Therefore, as a result of the third step, we enter \(x_2\) into our stepwise model.

Comments:

Recall that in a stepwise regression procedure, the \(\alpha\) level need not be as strict as \(0.05\); it can be larger.

Now, since \(x_1\) and \(x_4\) were the first two predictors in the model, we must step back and see if entering \(x_2\) into the stepwise model affected the significance of the \(x_1\) and \(x_4\) predictors.

#step 4: check whether adding x2 affects the significance of x1 and x4
mod3 = update(mod2, ~.+x2)
summary(mod3)

Call:
lm(formula = y ~ x4 + x1 + x2)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0919 -1.8016  0.2562  1.2818  3.8982 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  71.6483    14.1424   5.066 0.000675 ***
x4           -0.2365     0.1733  -1.365 0.205395    
x1            1.4519     0.1170  12.410 5.78e-07 ***
x2            0.4161     0.1856   2.242 0.051687 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.309 on 9 degrees of freedom
Multiple R-squared:  0.9823,    Adjusted R-squared:  0.9764 
F-statistic: 166.8 on 3 and 9 DF,  p-value: 3.323e-08

If we choose \(\alpha = 0.15\), then the p-value for \(x_4\) is larger than \(\alpha\). Therefore, we remove the predictor \(x_4\) from the stepwise model, leaving us with the predictors \(x_1\) and \(x_2\) in our stepwise model.

  • Step 4: choose the third predictor when \(x_1\) and \(x_2\) are in the model.

We regress \(y\) on \(x_1\), \(x_2\), and \(x_3\).

(Note that we don’t need to regress \(y\) on \(x_1\), \(x_2\), and \(x_4\), since we have just deleted \(x_4\) from the model containing \(x_1\), \(x_2\), and \(x_4\).)

#remove x4 because the p-value is large
mod4 = update(mod3, ~.-x4)

#check x3 => p-value is large and stop the searching
add1(mod4, ~.+x3, test = 'F')
Single term additions

Model:
y ~ x1 + x2
       Df Sum of Sq    RSS    AIC F value Pr(>F)
<none>              57.904 25.420               
x3      1    9.7939 48.111 25.011  1.8321 0.2089
  • Stop the stepwise regression procedure

According to its p-value, \(x_3\) is not eligible for entry into our stepwise model. That is, we stop our stepwise regression procedure. Our final regression model, based on the stepwise procedure using F-tests to choose predictors, contains only the predictors \(x_1\) and \(x_2\).

#final model
mod.final = lm(y ~ x1 + x2)
summary(mod.final)

Call:
lm(formula = y ~ x1 + x2)

Residuals:
   Min     1Q Median     3Q    Max 
-2.893 -1.574 -1.302  1.363  4.048 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 52.57735    2.28617   23.00 5.46e-10 ***
x1           1.46831    0.12130   12.11 2.69e-07 ***
x2           0.66225    0.04585   14.44 5.03e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.406 on 10 degrees of freedom
Multiple R-squared:  0.9787,    Adjusted R-squared:  0.9744 
F-statistic: 229.5 on 2 and 10 DF,  p-value: 4.407e-09

Stepwise Regression using AIC to Choose Variables

  • Forward addition (bottom-up)
  1. Start with no covariates (or some initial set of covariates)

  2. Add one variable according to some criterion (e.g. AIC)

  3. Stop when no variables need to be added

  • Backward elimination (top-down)
  1. Start with a model containing all covariates

  2. Eliminate one variable according to some criterion (e.g. AIC)

  3. Stop when no variables need to be dropped

Stepwise regression is a combination of forward addition and backward elimination.

In R, we can use the step() function to carry out the stepwise regression procedure, instead of calling add1() or lm() multiple times. In this function, the default criterion for choosing predictors is AIC. (Pure forward and backward searches are sketched after the example output below.) For this example, we obtain the following:

#find the model by AIC using stepwise regression
mod0 = lm(y ~ 1)
mod.upper = lm(y ~ x1 + x2 + x3 + x4)
step(mod0, scope = list(lower = mod0, upper = mod.upper))
Start:  AIC=71.44
y ~ 1

       Df Sum of Sq     RSS    AIC
+ x4    1   1831.90  883.87 58.852
+ x2    1   1809.43  906.34 59.178
+ x1    1   1450.08 1265.69 63.519
+ x3    1    776.36 1939.40 69.067
<none>              2715.76 71.444

Step:  AIC=58.85
y ~ x4

       Df Sum of Sq     RSS    AIC
+ x1    1    809.10   74.76 28.742
+ x3    1    708.13  175.74 39.853
<none>               883.87 58.852
+ x2    1     14.99  868.88 60.629
- x4    1   1831.90 2715.76 71.444

Step:  AIC=28.74
y ~ x4 + x1

       Df Sum of Sq     RSS    AIC
+ x2    1     26.79   47.97 24.974
+ x3    1     23.93   50.84 25.728
<none>                74.76 28.742
- x1    1    809.10  883.87 58.852
- x4    1   1190.92 1265.69 63.519

Step:  AIC=24.97
y ~ x4 + x1 + x2

       Df Sum of Sq    RSS    AIC
<none>               47.97 24.974
- x4    1      9.93  57.90 25.420
+ x3    1      0.11  47.86 26.944
- x2    1     26.79  74.76 28.742
- x1    1    820.91 868.88 60.629

Call:
lm(formula = y ~ x4 + x1 + x2)

Coefficients:
(Intercept)           x4           x1           x2  
    71.6483      -0.2365       1.4519       0.4161  
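
By default, step() searches in both directions within the given scope. Pure forward addition or backward elimination can be requested through the direction argument; a minimal sketch using the models defined above:

# forward addition only: start from the null model and only consider adding terms
step(mod0, scope = list(lower = mod0, upper = mod.upper), direction = 'forward')

# backward elimination only: start from the full model and only consider dropping terms
step(mod.upper, direction = 'backward')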

Some Cautions about the Stepwise Regression Procedure:

  • The final model is not guaranteed to be optimal in any specified sense.

  • The procedure yields a single final model, although there are often several equally good models.

  • Stepwise regression does not take into account a researcher’s knowledge about the predictors. It may be necessary to force the procedure to include important predictors (see the sketch after this list).

  • One should not over-interpret the order in which predictors are entered into the model.
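
Related to the third caution above: one way to force a predictor into every model that step() considers is to place it in the lower bound of the scope. A minimal sketch, supposing (hypothetically) that subject-matter knowledge requires \(x_1\) to stay in the model:

# force x1 into every candidate model by making it part of the lower scope
mod.x1 = lm(y ~ x1)
step(mod.x1, scope = list(lower = mod.x1, upper = mod.upper))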