Resampling Methods

Author

Jiaye Xu

Published

April 27, 2024

Why Bootstrapping?

  • To quantify the uncertainty associated with a given estimator or statistical learning method: e.g., to estimate the standard errors of the coefficients from a linear regression fit

  • When the familiar standard intervals, e.g., \(\hat\theta \pm 1.96 \hat{se}\) for \(95\%\) coverage, are not accurate.

  • When the distributional assumptions, e.g., normality, may not be met (allowing asymmetric intervals around \(\hat\theta\)).

A Heuristic Example: Blind-Box Toy Collection

Children (and some adults) are frequently enticed to buy blind boxes to collect all the action figures/toys. Assume there are 15 action figures and each blind box contains exactly one of them, with the probabilities in the following table:

Figure       A    B    C    D    E    F    G     H     I     J     K     L     M     N     O
Probability  0.2  0.1  0.1  0.1  0.1  0.1  0.05  0.05  0.05  0.05  0.02  0.02  0.02  0.02  0.02

Questions:

  • Estimate the expected number of boxes needed to collect all 15 action figures.
  • What is the uncertainty of your estimate?
  • What is the probability you bought more than 300 boxes?
prob.table <- c(.2, .1, .1, .1, .1, .1, .05, .05, .05, .05, .02, .02, .02, .02, .02)
boxes <- seq(1,15)
box.count <- function(prob=prob.table){
  check <- double(length(prob))  # indicator: which figures have been collected
  i <- 0                         # boxes bought so far
  while(sum(check) < length(prob)){
    x <- sample(boxes, 1, prob=prob)  # open one box
    check[x] <- 1
    i <- i + 1
  }
  return(i)
}
trials <- 1000
sim.boxes <- double(trials)
for(i in 1:trials){
  sim.boxes[i] <- box.count()
}
est <- mean(sim.boxes)
mcse <- sd(sim.boxes) / sqrt(trials)
interval <- est + c(-1,1)*1.96*mcse
est
[1] 114.128
interval
[1] 110.5308 117.7252
hist(sim.boxes, main="Histogram of Total Boxes", xlab="Boxes")
abline(v=300, col="red", lwd=2)

mean(sim.boxes > 300)
[1] 0.013
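The tail probability estimated above is itself a Monte Carlo estimate, so it carries its own uncertainty. A minimal sketch of attaching a standard error to it, reusing `sim.boxes` and `trials` from the simulation above:

```r
# The estimated probability is a sample proportion, so its Monte Carlo
# standard error is sqrt(p*(1-p)/trials)
p.hat <- mean(sim.boxes > 300)
p.se  <- sqrt(p.hat * (1 - p.hat) / trials)
p.hat + c(-1, 1) * 1.96 * p.se  # approximate 95% interval for the probability
```

With only about 13 exceedances in 1000 trials, this interval is wide relative to the estimate; increasing `trials` shrinks it.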

Plug-In Principle and the Bootstrap

  • The worst mistake one can make in statistics is to confuse the sample and the population or to confuse estimators and parameters. That is, \(\hat{\theta}\) is not \(\theta\).

  • Plug-in principle seems to say the opposite

    • Sometimes it is okay to just plug in an estimate for an unknown parameter
    • In particular, it is okay to plug in a consistent estimator of the asymptotic variance of a parameter in forming asymptotic confidence intervals for that parameter.
  • The “bootstrap” is a cute name for a vast generalization of the plug-in principle.
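The most familiar instance of the plug-in principle is the large-sample confidence interval for a mean, where the unknown standard deviation is replaced by its sample estimate. A minimal sketch on toy data (not from the notes):

```r
# Plug-in principle: the unknown sd(X) is replaced by the sample sd
# when estimating se(xbar) = sd(X)/sqrt(n)
set.seed(42)
x <- rexp(50, rate = 1)              # toy sample; the true mean is 1
se.hat <- sd(x) / sqrt(length(x))    # plug-in estimate of the standard error
mean(x) + c(-1, 1) * 1.96 * se.hat   # asymptotic 95% CI for the mean
```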

Nonparametric Bootstrap

  • The bootstrap comes in two flavors, parametric and nonparametric.
  • The nonparametric bootstrap, considered and explained non-theoretically, is just an analogy


World                   Real                                               Bootstrap
true distribution       \(F\)                                              \(\hat{F}_n\)
data                    \(X_1, \dots, X_n\) IID \(F\)                      \(X_1^*, \dots, X_n^*\) IID \(\hat{F}_n\)
empirical distribution  \(\hat{F}_n\)                                      \(F_n^*\)
parameter               \(\theta = t(F)\)                                  \(\hat{\theta}_n = t(\hat{F}_n)\)
estimator               \(\hat{\theta}_n = t(\hat{F}_n)\)                  \(\theta_n^* = t(F_n^*)\)
error                   \(\hat{\theta}_n - \theta\)                        \(\theta_n^* - \hat{\theta}_n\)
standardized error      \(\frac{\hat{\theta}_n - \theta}{s(\hat{F}_n)}\)   \(\frac{\theta_n^* - \hat{\theta}_n}{s(F_n^*)}\)

  • The notation \(\theta = t(F)\) means \(\theta\) is some function of the true unknown distribution \(F\)

  • The notation \(X_1^*, \dots, X_n^*\) IID \(\hat{F}_n\) means \(X_1^*, \dots, X_n^*\) are independent and identically distributed from the empirical distribution of the real data

  • Sampling from the empirical distribution is just like sampling from a finite population, where the “population” is the real data \(X_1, \dots, X_n\)

    • To be IID, the sampling must be with replacement
    • \(X_1^*, \dots, X_n^*\) are a sample with replacement from \(X_1, \dots, X_n\)
    • Called resampling
  • The sampling distribution of \(\hat{\theta}_n\) depends on the true unknown distribution \(F\) of the real data, so it may be very difficult or impossible to calculate theoretically

  • In the “bootstrap world” everything is known, \(\hat{F}_n\) plays the role of the true unknown distribution and \(\hat{\theta}_n\) plays the role of the true unknown parameter value

  • The sampling distribution of \(\theta_n^*\) or \(\theta_n^* - \hat{\theta}_n\) or \(\frac{\theta_n^* - \hat{\theta}_n}{s(F_n^*)}\) may still be difficult to calculate theoretically, but it can always be “calculated” by simulation.

  • The bootstrap is large sample, approximate, asymptotic

    • It is not an exact method
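The table's recipe can be sketched directly in code. Here is a minimal nonparametric bootstrap of the standard error of a sample median on toy data (the median stands in for a general \(t(\cdot)\); the data and choices below are illustrative, not from the notes):

```r
# Nonparametric bootstrap: resample the data with replacement, recompute
# the statistic, and use the spread of the replicates to estimate its se
set.seed(1)
x <- rexp(30)                        # toy data standing in for X_1, ..., X_n
B <- 2000
theta.star <- double(B)
for (b in 1:B) {
  x.star <- sample(x, length(x), replace = TRUE)  # X_1^*, ..., X_n^* IID F.hat_n
  theta.star[b] <- median(x.star)                 # theta_n^* = t(F_n^*)
}
sd(theta.star)                       # bootstrap estimate of se(median)
```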

Example: abnormal speed data – bootstrapping a statistic

  • In the units of the data, the currently accepted “true” speed is 33.02
  • Does the data support the current accepted speed of 33.02?
speed <- c(28, -44, 29, 30, 26, 27, 22, 23, 33, 16, 24, 29, 24, 40, 21, 31, 34, -2, 25, 19)

  • A \(t\)-test assumes the population of measurements is normally distributed
  • With this small sample size and a severe departure from normality, we can’t be guaranteed a good approximation
  • Instead, we can consider the bootstrap
  1. State null and alternative hypotheses \[ H_0: \mu = 33.02 \text{ versus } H_a: \mu \ne 33.02 \]
  2. Choose a significance level, in our case 0.05
  3. Choose a test statistic, since we wish to estimate the mean speed we can use the sample average
  4. Find the observed value of the test statistic
  5. Calculate a p-value.
  • We need a p-value, but we don’t have the sampling distribution of our test statistic when the null hypothesis is true
  • Shift the data so they represent a population consistent with the null hypothesis (at least, one with mean 33.02)
mean(speed)
[1] 21.75
newspeed <- speed - mean(speed) + 33.02
  • Now we have our fake population; draw 20 observations from it at random, with replacement

  • We calculate the average and save it, then repeat this process many, many times

  • Now we have a sampling distribution with mean 33.02. We can compare this to our observed sample average and obtain a p-value

n <- 1000
bstrap <- double(n)
for (i in 1:n){
  newsample <- sample(newspeed, 20, replace=TRUE)
  bstrap[i] <- mean(newsample)
}

  • The p-value is the probability of getting something more extreme than what we observed

  • Notice \(21.75\) is \(33.02 - 21.75 = 11.27\) units below the null value. So the p-value is the probability of being more than 11.27 units away from 33.02 in either direction, i.e., below \(21.75\) or above \(33.02 + 11.27 = 44.29\) (as in a two-sided test)

(sum(bstrap < 21.75) + sum(bstrap > 44.29))/1000
[1] 0.009
  • Since our significance level is 5%, we reject \(H_0\)
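As a side check (not part of the bootstrap procedure), one can run the classical one-sample \(t\)-test on the same data; its p-value relies on the normality assumption that the extreme observations make doubtful, so the bootstrap p-value above is the more trustworthy of the two:

```r
# Classical t-test of H0: mu = 33.02, for comparison only; its validity
# is questionable here because of the two extreme observations
t.test(speed, mu = 33.02)$p.value
```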

Example: Sleep study

  • The data set sleep {datasets} shows the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients.

    dim(sleep)
    [1] 20  3
    head(sleep)
      extra group ID
    1   0.7     1  1
    2  -1.6     1  2
    3  -0.2     1  3
    4  -1.2     1  4
    5  -0.1     1  5
    6   3.4     1  6
  • The two sample \(t\)-test checks for differences in means according to a known null distribution

  • Let’s resample and generate the sampling distribution under the bootstrap assumption

bootstrap.resample <- function(object) sample(object, length(object), replace=TRUE)
diff.in.means <- function(df) {
  # extra: numeric increase in hours of sleep
  mean(df[df$group==1, "extra"]) - mean(df[df$group==2, "extra"])
}
resample.diffs <- replicate(2000, diff.in.means(sleep[bootstrap.resample(1:nrow(sleep)),]))
hist(resample.diffs, main="Bootstrap Sampling Distribution")
abline(v=diff.in.means(sleep), col=2, lwd=3)

(sum(resample.diffs < diff.in.means(sleep)) + sum(resample.diffs > -diff.in.means(sleep)))/2000
[1] 0.51
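For comparison, the classical two-sample \(t\)-test mentioned above can be run directly on the same data; a quick side check (the bootstrap, by contrast, does not require the \(t\)-test's normality assumption):

```r
# Classical two-sample t-test comparing the two drug groups
t.test(extra ~ group, data = sleep)$p.value
```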

Bootstrapping with Built-In Functions

  • R has built-in bootstrapping functions; see the boot package
  • Example of the function boot
  • Use the bootstrap to generate a 95% confidence interval for R-squared
  • Linear regression of miles per gallon (mpg) on car weight (wt) and displacement (disp)
  • Data source is mtcars
  • The bootstrapped confidence interval is based on 1000 replications
library(boot)
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
rsq <- function(formula, data, indices) {
  d <- data[indices, ]           # boot supplies the resampled row indices
  fit <- lm(formula, data=d)
  return(summary(fit)$r.squared)
}
results <- boot(data=mtcars, statistic=rsq, 
    R=1000, formula=mpg~wt+disp)
results 

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = mtcars, statistic = rsq, R = 1000, formula = mpg ~ 
    wt + disp)


Bootstrap Statistics :
     original     bias    std. error
t1* 0.7809306 0.01263917  0.04908174
boot.ci(results, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = results, type = "bca")

Intervals : 
Level       BCa          
95%   ( 0.6352,  0.8480 )  
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
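boot.ci can also report other interval types alongside BCa; requesting several at once makes comparison easy (this reuses results from the chunk above):

```r
# Percentile, basic, and normal-approximation intervals next to BCa;
# differences among them hint at bias and skewness in the replicates
boot.ci(results, type = c("norm", "basic", "perc", "bca"))
```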