# probabilities for the 15 action figures (see the blind-box example below)
prob.table <- c(.2, .1, .1, .1, .1, .1, .05, .05, .05, .05, .02, .02, .02, .02, .02)
boxes <- seq(1, 15)

# simulate buying blind boxes until all 15 figures are collected;
# returns the total number of boxes bought
box.count <- function(prob = prob.table){
  check <- double(length(prob))          # indicator: has figure i been collected yet?
  i <- 0                                 # boxes bought so far
  while(sum(check) < length(prob)){
    x <- sample(boxes, 1, prob = prob)   # open one box
    check[x] <- 1
    i <- i + 1
  }
  return(i)
}

Resampling Methods
Why Bootstrapping?
- To quantify the uncertainty associated with a given estimator or statistical learning method: e.g., to estimate the standard errors of the coefficients from a linear regression fit.
- When the familiar standard intervals, e.g., \(\hat\theta \pm 1.96 \hat{se}\) for \(95\%\) coverage, are not accurate.
- When distributional assumptions, e.g., normality, may not be met (allowing asymmetric intervals around \(\hat\theta\)).
A Heuristic Example: Blind-Box Toy Collection
Children (and some adults) are frequently enticed to buy blind boxes to collect all the action figures/toys. Assume there are 15 action figures and each blind box contains exactly one of them, with the probabilities given in the following table:
| Figure | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Probability | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.05 | 0.05 | 0.05 | 0.05 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 |
Questions:
- Estimate the expected number of boxes needed to collect all 15 action figures.
- What is the uncertainty of your estimate?
- What is the probability that you need to buy more than 300 boxes?
trials <- 1000
sim.boxes <- double(trials)
for(i in 1:trials){
  sim.boxes[i] <- box.count()          # simulated number of boxes for one complete collection
}
est <- mean(sim.boxes)                 # Monte Carlo estimate of the expected number of boxes
mcse <- sd(sim.boxes) / sqrt(trials)   # Monte Carlo standard error
interval <- est + c(-1,1)*1.96*mcse    # approximate 95% confidence interval
est
[1] 114.128
interval
[1] 110.5308 117.7252
hist(sim.boxes, main="Histogram of Total Boxes", xlab="Boxes")
abline(v=300, col="red", lwd=2)
sum(sim.boxes>300)/1000
[1] 0.013
Plug-In Principle and the Bootstrap
The worst mistake one can make in statistics is to confuse the sample and the population or to confuse estimators and parameters. That is, \(\hat{\theta}\) is not \(\theta\).
The plug-in principle seems to say the opposite:
- Sometimes it is okay to just plug in an estimate for an unknown parameter
- In particular, it is okay to plug in a consistent estimator of the asymptotic variance of a parameter in forming asymptotic confidence intervals for that parameter.
The “bootstrap” is a cute name for a vast generalization of the plug-in principle.
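A minimal sketch of the second bullet above, with made-up data (the vector x, its size, and the normal model are purely illustrative): the asymptotic standard error of the sample mean, \(\sigma/\sqrt{n}\), depends on the unknown \(\sigma\), so we plug in the sample standard deviation when forming the asymptotic confidence interval.
set.seed(1)
x <- rnorm(50, mean = 10, sd = 3)      # hypothetical data
theta.hat <- mean(x)                   # estimator of the population mean
se.hat <- sd(x) / sqrt(length(x))      # plug-in estimate of the asymptotic standard error
theta.hat + c(-1, 1) * 1.96 * se.hat   # asymptotic 95% confidence interval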
Nonparametric Bootstrap
- The bootstrap comes in two flavors, parametric and nonparametric.
- The nonparametric bootstrap, considered and explained non-theoretically, is just an analogy between the real world and the bootstrap world:
| World | Real | Bootstrap |
|---|---|---|
| true distribution | \(F\) | \(\hat{F}_n\) |
| data | \(X_1, \dots, X_n\) IID \(F\) | \(X_1^*, \dots, X_n^*\) IID \(\hat{F}_n\) |
| empirical distribution | \(\hat{F}_n\) | \(F_n^*\) |
| parameter | \(\theta = t(F)\) | \(\hat{\theta}_n = t(\hat{F}_n)\) |
| estimator | \(\hat{\theta}_n = t(\hat{F}_n)\) | \(\theta_n^* = t(F_n^*)\) |
| error | \(\hat{\theta}_n - \theta\) | \(\theta_n^* - \hat{\theta}_n\) |
| standardized error | \(\frac{\hat{\theta}_n - \theta}{s(\hat{F}_n)}\) | \(\frac{\theta_n^* - \hat{\theta}_n}{s(F_n^*)}\) |
The notation \(\theta = t(F)\) means \(\theta\) is some function of the true unknown distribution \(F\)
The notation \(X_1^*, \dots, X_n^*\) IID \(\hat{F}_n\) means \(X_1^*, \dots, X_n^*\) are independent and identically distributed from the empirical distribution of the real data
Sampling from the empirical distribution is just like sampling from a finite population, where the “population” is the real data \(X_1, \dots, X_n\)
- To be IID, the sampling must be with replacement
- \(X_1^*, \dots, X_n^*\) are a sample with replacement from \(X_1, \dots, X_n\)
- Called resampling (a one-line sketch in R is shown below)
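A minimal sketch of resampling, assuming the real data sit in a hypothetical numeric vector x:
x <- c(4.1, 2.7, 3.3, 5.0, 2.9)                  # hypothetical real data X_1, ..., X_n
x.star <- sample(x, length(x), replace = TRUE)   # bootstrap resample X_1^*, ..., X_n^* from the empirical distribution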
The sampling distribution of the estimator \(\hat{\theta}_n\) depends on the true unknown distribution \(F\) of the real data, so it may be very difficult or impossible to calculate theoretically
In the “bootstrap world” everything is known, \(\hat{F}_n\) plays the role of the true unknown distribution and \(\hat{\theta}_n\) plays the role of the true unknown parameter value
The sampling distribution of \(\theta_n^*\) or \(\theta_n^* - \hat{\theta}_n\) or \(\frac{\theta_n^* - \hat{\theta}_n}{s(F_n^*)}\) may still be difficult to calculate theoretically, but it can always be “calculated” by simulation.
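A minimal sketch of that simulation, assuming the real data sit in a hypothetical vector x and taking the median as an illustrative parameter of interest:
set.seed(1)
x <- rnorm(30)                                   # hypothetical real data
theta.hat <- median(x)                           # estimator computed from the real data
B <- 1000
theta.star <- double(B)
for(b in 1:B){
  x.star <- sample(x, length(x), replace = TRUE) # resample from the empirical distribution
  theta.star[b] <- median(x.star)                # bootstrap replicate of the estimator
}
sd(theta.star)                                   # bootstrap estimate of the standard error
hist(theta.star - theta.hat)                     # approximates the sampling distribution of the error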
The bootstrap is large sample, approximate, asymptotic
- It is not an exact method
Example: abnormal speed data – bootstrapping a statistic
- In the units of the data, the currently accepted “true” speed is 33.02
- Does the data support the current accepted speed of 33.02?
speed <- c(28, -44, 29, 30, 26, 27, 22, 23, 33, 16, 24, 29, 24, 40, 21, 31, 34, -2, 25, 19)
- A \(t\)-test assumes the population of measurements is normally distributed
- With this small sample size and a severe departure from normality, we can’t be guaranteed a good approximation
- Instead, we can consider the bootstrap
- State null and alternative hypotheses \[ H_0: \mu = 33.02 \text{ versus } H_a: \mu \ne 33.02 \]
- Choose a significance level, in our case 0.05
- Choose a test statistic, since we wish to estimate the mean speed we can use the sample average
- Find the observed value of the test statistic
- Calculate a p-value.
- We need a p-value, but we don’t have the sampling distribution of our test statistic when the null hypothesis is true
- Shift the data so they represent the null population (at least, the shifted data have the mean specified by \(H_0\))
mean(speed)
[1] 21.75
newspeed <- speed - mean(speed) + 33.02
Now we have our fake population, and we draw 20 observations from it at random, with replacement
We calculate the average and save it, then repeat this process many, many times
Now we have an approximation to the sampling distribution of the sample mean under the null hypothesis (centered at 33.02). We can compare our observed sample average to it and obtain a p-value
n <- 1000
bstrap <- double(n)
for (i in 1:n){
  newsample <- sample(newspeed, 20, replace=T)   # resample from the shifted "population"
  bstrap[i] <- mean(newsample)                   # bootstrap replicate of the sample mean
}
The p-value is the probability of getting something more extreme than what we observed
Notice \(21.75\) is \(33.02 - 21.75 = 11.27\) units away from the hypothesized mean. So the p-value is the probability of being more than 11.27 units away from 33.02, as in a two-sided test, i.e., of falling below \(21.75\) or above \(33.02 + 11.27 = 44.29\)
(sum(bstrap < 21.75) + sum(bstrap > 44.29))/1000
[1] 0.009
- Since the p-value (0.009) is below our 5% significance level, we reject \(H_0\)
Example: Sleep study
Data set
sleep {datasets}: shows the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients.
dim(sleep)
[1] 20  3
head(sleep)
  extra group ID
1   0.7     1  1
2  -1.6     1  2
3  -0.2     1  3
4  -1.2     1  4
5  -0.1     1  5
6   3.4     1  6
The two-sample \(t\)-test checks for differences in means according to a known null distribution
Let’s resample and generate the sampling distribution under the bootstrap assumption
bootstrap.resample <- function(object) sample(object, length(object), replace=TRUE)
diff.in.means <- function(df) {
mean(df[df$group==1,"extra"]) - mean(df[df$group==2,"extra"]) # extra: numeric increase in hours of sleep
}
resample.diffs <- replicate(2000, diff.in.means(sleep[bootstrap.resample(1:nrow(sleep)),]))
hist(resample.diffs, main="Bootstrap Sampling Distribution")
abline(v=diff.in.means(sleep), col=2, lwd=3)
(sum(resample.diffs < diff.in.means(sleep)) + sum(resample.diffs > -diff.in.means(sleep)))/2000
[1] 0.51
Bootstrapping with Built-In Functions
- R has built-in bootstrapping functions, see the boot library
- Example of the function boot
- Use the bootstrap to generate a 95% confidence interval for R-squared
- Linear regression of miles per gallon (mpg) on car weight (wt) and displacement (disp)
- Data source is mtcars
- The bootstrapped confidence interval is based on 1000 replications
library(boot)
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# compute R-squared from a (resampled) data set;
# boot supplies the resampled row numbers through the indices argument
rsq <- function(formula, data, indices) {
  d <- data[indices,]                # select the bootstrap sample
  fit <- lm(formula, data=d)
  return(summary(fit)$r.square)
}
results <- boot(data=mtcars, statistic=rsq,
                R=1000, formula=mpg~wt+disp)
results
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = mtcars, statistic = rsq, R = 1000, formula = mpg ~
wt + disp)
Bootstrap Statistics :
original bias std. error
t1* 0.7809306 0.01263917 0.04908174
boot.ci(results, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = results, type = "bca")
Intervals :
Level BCa
95% ( 0.6352, 0.8480 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
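As a usage note (a sketch, not output from these notes): the boot package also has a plot method for the object returned by boot, and boot.ci can report several interval types at once for comparison.
plot(results)                                    # histogram and normal QQ-plot of the bootstrap replicates
boot.ci(results, type=c("norm", "perc", "bca"))  # normal, percentile, and BCa intervals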