# probabilities for the 15 action figures (see the blind-box example below)
prob.table <- c(.2, .1, .1, .1, .1, .1, .05, .05, .05, .05, .02, .02, .02, .02, .02)
boxes <- seq(1, 15)

# simulate buying blind boxes until all 15 figures are collected;
# returns the total number of boxes bought
box.count <- function(prob = prob.table){
  check <- double(length(prob))          # indicator: has figure i been collected yet?
  i <- 0                                 # boxes bought so far
  while(sum(check) < length(prob)){
    x <- sample(boxes, 1, prob = prob)   # open one box
    check[x] <- 1
    i <- i + 1
  }
  return(i)
}

Resampling Methods
Why Bootstrapping?
- To quantify the uncertainty associated with a given estimator or statistical learning method: e.g., to estimate the standard errors of the coefficients from a linear regression fit.
- When the familiar standard intervals, e.g., \(\hat\theta \pm 1.96 \hat{se}\) for \(95\%\) coverage, are not accurate.
- When distributional assumptions, e.g., normality, may not be met (allowing asymmetric intervals around \(\hat\theta\)).
A Heuristic Example: Blind-Box Toy Collection
Children (and some adults) are frequently enticed to buy blind boxes to collect all the action figures/toys. Assume there are 15 action figures and each blind box contains exactly one of them, with the probabilities given in the following table:
| Figure | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Probability | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.05 | 0.05 | 0.05 | 0.05 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 |
Questions:
- Estimate the expected number of boxes needed to collect all 15 action figures.
- What is the uncertainty of your estimate?
- What is the probability that you need to buy more than 300 boxes?
trials <- 1000
sim.boxes <- double(trials)
for(i in 1:trials){
  sim.boxes[i] <- box.count()          # simulated number of boxes for one complete collection
}
est <- mean(sim.boxes)                 # Monte Carlo estimate of the expected number of boxes
mcse <- sd(sim.boxes) / sqrt(trials)   # Monte Carlo standard error
interval <- est + c(-1,1)*1.96*mcse    # approximate 95% confidence interval
est
[1] 114.128
interval
[1] 110.5308 117.7252
hist(sim.boxes, main="Histogram of Total Boxes", xlab="Boxes")
abline(v=300, col="red", lwd=2)
sum(sim.boxes>300)/1000
[1] 0.013
Plug-In Principle and the Bootstrap
The worst mistake one can make in statistics is to confuse the sample and the population or to confuse estimators and parameters. That is, \(\hat{\theta}\) is not \(\theta\).
The plug-in principle seems to say the opposite:
- Sometimes it is okay to just plug in an estimate for an unknown parameter
- In particular, it is okay to plug in a consistent estimator of the asymptotic variance of a parameter in forming asymptotic confidence intervals for that parameter.
The “bootstrap” is a cute name for a vast generalization of the plug-in principle.
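A minimal sketch of the second bullet above, with made-up data (the vector x, its size, and the normal model are purely illustrative): the asymptotic standard error of the sample mean, \(\sigma/\sqrt{n}\), depends on the unknown \(\sigma\), so we plug in the sample standard deviation when forming the asymptotic confidence interval.
set.seed(1)
x <- rnorm(50, mean = 10, sd = 3)      # hypothetical data
theta.hat <- mean(x)                   # estimator of the population mean
se.hat <- sd(x) / sqrt(length(x))      # plug-in estimate of the asymptotic standard error
theta.hat + c(-1, 1) * 1.96 * se.hat   # asymptotic 95% confidence interval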
Nonparametric Bootstrap
- The bootstrap comes in two flavors, parametric and nonparametric.
- The nonparametric bootstrap, considered and explained non-theoretically, is just an analogy between the real world and the bootstrap world:
| World | Real | Bootstrap |
|---|---|---|
| true distribution | \(F\) | \(\hat{F}_n\) |
| data | \(X_1, \dots, X_n\) IID \(F\) | \(X_1^*, \dots, X_n^*\) IID \(\hat{F}_n\) |
| empirical distribution | \(\hat{F}_n\) | \(F_n^*\) |
| parameter | \(\theta = t(F)\) | \(\hat{\theta}_n = t(\hat{F}_n)\) |
| estimator | \(\hat{\theta}_n = t(\hat{F}_n)\) | \(\theta_n^* = t(F_n^*)\) |
| error | \(\hat{\theta}_n - \theta\) | \(\theta_n^* - \hat{\theta}_n\) |
| standardized error | \(\frac{\hat{\theta}_n - \theta}{s(\hat{F}_n)}\) | \(\frac{\theta_n^* - \hat{\theta}_n}{s(F_n^*)}\) |
The notation \(\theta = t(F)\) means \(\theta\) is some function of the true unknown distribution \(F\)
The notation \(X_1^*, \dots, X_n^*\) IID \(\hat{F}_n\) means \(X_1^*, \dots, X_n^*\) are independent and identically distributed from the empirical distribution of the real data
Sampling from the empirical distribution is just like sampling from a finite population, where the “population” is the real data \(X_1, \dots, X_n\)
- To be IID, the sampling must be with replacement
- \(X_1^*, \dots, X_n^*\) are a sample with replacement from \(X_1, \dots, X_n\)
- Called resampling (a one-line sketch in R is shown below)
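A minimal sketch of resampling, assuming the real data sit in a hypothetical numeric vector x:
x <- c(4.1, 2.7, 3.3, 5.0, 2.9)                  # hypothetical real data X_1, ..., X_n
x.star <- sample(x, length(x), replace = TRUE)   # bootstrap resample X_1^*, ..., X_n^* from the empirical distribution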
The sampling distribution of the estimator \(\hat{\theta}_n\) depends on the true unknown distribution \(F\) of the real data, so it may be very difficult or impossible to calculate theoretically
In the “bootstrap world” everything is known, \(\hat{F}_n\) plays the role of the true unknown distribution and \(\hat{\theta}_n\) plays the role of the true unknown parameter value
The sampling distribution of \(\theta_n^*\) or \(\theta_n^* - \hat{\theta}_n\) or \(\frac{\theta_n^* - \hat{\theta}_n}{s(F_n^*)}\) may still be difficult to calculate theoretically, but it can always be “calculated” by simulation.
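A minimal sketch of that simulation, assuming the real data sit in a hypothetical vector x and taking the median as an illustrative parameter of interest:
set.seed(1)
x <- rnorm(30)                                   # hypothetical real data
theta.hat <- median(x)                           # estimator computed from the real data
B <- 1000
theta.star <- double(B)
for(b in 1:B){
  x.star <- sample(x, length(x), replace = TRUE) # resample from the empirical distribution
  theta.star[b] <- median(x.star)                # bootstrap replicate of the estimator
}
sd(theta.star)                                   # bootstrap estimate of the standard error
hist(theta.star - theta.hat)                     # approximates the sampling distribution of the error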
The bootstrap is large sample, approximate, asymptotic
- It is not an exact method
Example: abnormal speed data – bootstrapping a statistic
- In the units of the data, the currently accepted “true” speed is 33.02
- Does the data support the current accepted speed of 33.02?
speed <- c(28, -44, 29, 30, 26, 27, 22, 23, 33, 16, 24, 29, 24, 40, 21, 31, 34, -2, 25, 19)
- A \(t\)-test assumes the population of measurements is normally distributed
- With this small sample size and a severe departure from normality, we can’t be guaranteed a good approximation
- Instead, we can consider the bootstrap
- State null and alternative hypotheses \[ H_0: \mu = 33.02 \text{ versus } H_a: \mu \ne 33.02 \]
- Choose a significance level, in our case 0.05
- Choose a test statistic, since we wish to estimate the mean speed we can use the sample average
- Find the observed value of the test statistic
- Calculate a p-value.
- We need a p-value, but we don’t have the sampling distribution of our test statistic when the null hypothesis is true
- Shift the data so they represent the null population (at least, the shifted data have the mean specified by \(H_0\))
mean(speed)
[1] 21.75
newspeed <- speed - mean(speed) + 33.02
Now we have our fake population, and we draw 20 observations from it at random, with replacement
We calculate the average and save it, then repeat this process many, many times
Now we have an approximation to the sampling distribution of the sample mean under the null hypothesis (centered at 33.02). We can compare our observed sample average to it and obtain a p-value
n <- 1000
bstrap <- double(n)
for (i in 1:n){
  newsample <- sample(newspeed, 20, replace=T)   # resample from the shifted "population"
  bstrap[i] <- mean(newsample)                   # bootstrap replicate of the sample mean
}
The p-value is the probability of getting something more extreme than what we observed
Notice \(21.75\) is \(33.02 - 21.75 = 11.27\) units away from the hypothesized mean. So the p-value is the probability of being more than 11.27 units away from 33.02, as in a two-sided test, i.e., of falling below \(21.75\) or above \(33.02 + 11.27 = 44.29\)
(sum(bstrap < 21.75) + sum(bstrap > 44.29))/1000
[1] 0.009
- Since the p-value (0.009) is below our 5% significance level, we reject \(H_0\)
Example: Sleep study
Data set
sleep {datasets}: shows the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients.
dim(sleep)
[1] 20  3
head(sleep)
  extra group ID
1   0.7     1  1
2  -1.6     1  2
3  -0.2     1  3
4  -1.2     1  4
5  -0.1     1  5
6   3.4     1  6
The two-sample \(t\)-test checks for differences in means according to a known null distribution
Let’s resample and generate the sampling distribution under the bootstrap assumption
bootstrap.resample <- function(object) sample(object, length(object), replace=TRUE)
diff.in.means <- function(df) {
mean(df[df$group==1,"extra"]) - mean(df[df$group==2,"extra"]) # extra: numeric increase in hours of sleep
}
resample.diffs <- replicate(2000, diff.in.means(sleep[bootstrap.resample(1:nrow(sleep)),]))
hist(resample.diffs, main="Bootstrap Sampling Distribution")
abline(v=diff.in.means(sleep), col=2, lwd=3)
(sum(resample.diffs < diff.in.means(sleep)) + sum(resample.diffs > -diff.in.means(sleep)))/2000
[1] 0.51
Bootstrapping with Built-In Functions
- R has built-in bootstrapping functions, see the boot library
- Example of the function boot
- Use the bootstrap to generate a 95% confidence interval for R-squared
- Linear regression of miles per gallon (mpg) on car weight (wt) and displacement (disp)
- Data source is mtcars
- The bootstrapped confidence interval is based on 1000 replications
library(boot)
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# compute R-squared from a (resampled) data set;
# boot supplies the resampled row numbers through the indices argument
rsq <- function(formula, data, indices) {
  d <- data[indices,]                # select the bootstrap sample
  fit <- lm(formula, data=d)
  return(summary(fit)$r.square)
}
results <- boot(data=mtcars, statistic=rsq,
                R=1000, formula=mpg~wt+disp)
results
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = mtcars, statistic = rsq, R = 1000, formula = mpg ~
wt + disp)
Bootstrap Statistics :
original bias std. error
t1* 0.7809306 0.01263917 0.04908174
boot.ci(results, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = results, type = "bca")
Intervals :
Level BCa
95% ( 0.6352, 0.8480 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
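As a usage note (a sketch, not output from these notes): the boot package also has a plot method for the object returned by boot, and boot.ci can report several interval types at once for comparison.
plot(results)                                    # histogram and normal QQ-plot of the bootstrap replicates
boot.ci(results, type=c("norm", "perc", "bca"))  # normal, percentile, and BCa intervals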