We will use the open source statistical software R, available at http://www.r-project.org, and the open source and productive integrated development environment (IDE) RStudio, which can be downloaded from https://posit.co/.
x <- 12
x
[1] 12
x+2
[1] 14
x-2
[1] 10
x/2
[1] 6
x*2
[1] 24
x^2
[1] 144
y <- x+2
y
[1] 14
(y <- x+2)
[1] 14
x <- c(1,2,3,4)
x
[1] 1 2 3 4
### element-wise operations
x+2
[1] 3 4 5 6
x-2
[1] -1 0 1 2
x/2
[1] 0.5 1.0 1.5 2.0
x*2
[1] 2 4 6 8
x^2
[1] 1 4 9 16
y <- c(1,2,3,4,5,6,7,8)
y
[1] 1 2 3 4 5 6 7 8
x*y
[1] 1 4 9 16 5 12 21 32
x %*% t(x) # (4 by 1) times (1 by 4)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 4 6 8
[3,] 3 6 9 12
[4,] 4 8 12 16
# equivalently, crossprod(t(x))
t(x) %*% x # (1 by 4) times (4 by 1)
[,1]
[1,] 30
# equivalently, crossprod(x)
x %*% t(y) # (4 by 1) times (1 by 8)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 2 3 4 5 6 7 8
[2,] 2 4 6 8 10 12 14 16
[3,] 3 6 9 12 15 18 21 24
[4,] 4 8 12 16 20 24 28 32
# equivalently, crossprod(t(x), t(y))
# Note: the function t(x) takes the transpose of x
z1 <- matrix(y, 2) # number of rows = 2; the matrix is filled by columns by default
z1
[,1] [,2] [,3] [,4]
[1,] 1 3 5 7
[2,] 2 4 6 8
z2 <- matrix(y, 2, byrow=TRUE)
z2
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
# Select a specific element
x[1]
[1] 1
x[2]
[1] 2
# Select a range of elements
x[2:4]
[1] 2 3 4
# Drop a specific element
x[-2]
[1] 1 3 4
# Drop a range of elements
x[-(2:3)]
[1] 1 4
# Use an index vector to select and drop elements
y <- c(1,3)
x[y]
[1] 1 3
x[-y]
[1] 2 4
# Combine two vectors
x <- c(x,y)
x
[1] 1 2 3 4 1 3
# Reverse a vector
rev(x)
[1] 3 1 4 3 2 1
# Test equality
1==5
[1] FALSE
1==1
[1] TRUE
# Test inequality
1!=5
[1] TRUE
1!=1
[1] FALSE
1:3
[1] 1 2 3
seq(from=0, to=1, by=0.12)
[1] 0.00 0.12 0.24 0.36 0.48 0.60 0.72 0.84 0.96
seq(from=0, to=1, length=7)
[1] 0.0000000 0.1666667 0.3333333 0.5000000 0.6666667 0.8333333 1.0000000
x <- c(1,2,3,4)
mean(x)
[1] 2.5
var(x)
[1] 1.666667
sd(x)
[1] 1.290994
sum(x)
[1] 10
prod(x)
[1] 24
length(x)
[1] 4
dnorm(0, mean=0, sd=1) # Normal distribution
[1] 0.3989423
dexp(1, rate=1) # Exponential distribution
[1] 0.3678794
dbinom(5, size=10, prob=0.5) # Binomial distribution
[1] 0.2460938
dpois(10, lambda=10) # Poisson distribution
[1] 0.12511
dgamma(1, shape=1, scale=1) # Gamma distribution
[1] 0.3678794
# NOTE: Here the density function is being evaluated at the given x with the specified parameters
pnorm(0, mean=0, sd=1) # P(X<=0)
[1] 0.5
qnorm(0.95, mean=0, sd=1) # the quantile corresponding to a lower-tail probability of 0.95
[1] 1.644854
rnorm(10, mean=0, sd=1) # generates 10 realizations of a standard normal RV
[1] 0.3126021 -0.5172903 2.0719703 0.5180661 0.7301469 -0.1118300
[7] 0.6443572 0.6318846 0.3926022 1.4656712
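Note that the realizations above are random and will differ from run to run; setting a seed beforehand makes them reproducible. A minimal illustration (the seed value 1 is arbitrary):
set.seed(1)
rnorm(3, mean=0, sd=1)
set.seed(1)
rnorm(3, mean=0, sd=1) # the same three values again, because the seed was reset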
x <- c(1,2,3,4)
y <- x^2
plot(x,y)
plot(x,y, xlab="x label here", ylab="y label here", pch=2,col="blue")
# NOTE: Please make sure to always use axis labels! Otherwise your plots are not helpful to other people (e.g., graders) interpreting them.
for(x in 1:10){
print(x+1)
}
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
x <- 1
while(x<10){
  print(x)
  x <- x+x
}
[1] 1
[1] 2
[1] 4
[1] 8
x <- 1
while(x<10){
  x <- x+x
  print(x)
}
[1] 2
[1] 4
[1] 8
[1] 16
# ex. factorial
Fact <- function(n) if (n == 1) 1 else n * Fact(n - 1)
Fact(5)
[1] 120
# NOTE: equivalently, use the built-in R function factorial(5), i.e., 5!
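One caveat (not in the original note): Fact() assumes n >= 1, so calling it with n = 0 recurses past the base case. A sketch of a variant (named Fact2 here, a hypothetical name) that also covers \(0! = 1\):
Fact2 <- function(n) if (n <= 1) 1 else n * Fact2(n - 1)
Fact2(0) # returns 1, matching factorial(0)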
Children (and some adults) are frequently enticed to buy blind boxes to collect all of a set of action figures/toys. Assume there are 15 action figures and that each blind box contains exactly one of them, with the probabilities given in the following table.
Figure | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Probability | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.05 | 0.05 | 0.05 | 0.05 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 |
Questions: (1) On average, how many boxes must one buy to collect all 15 figures? (2) What is the probability that more than 300 boxes are needed? We answer both by Monte Carlo simulation.
prob.table <- c(.2, .1, .1, .1, .1, .1, .05, .05, .05, .05, .02, .02, .02, .02, .02)
boxes <- seq(1,15)
box.count <- function(prob=prob.table){
  check <- double(length(prob))
  i <- 0
  while(sum(check) < length(prob)){
    x <- sample(boxes, 1, prob=prob) # draw one box
    check[x] <- 1                    # mark that figure as collected
    i <- i+1                         # count boxes purchased
  }
  return(i)
}
trials <- 1000
sim.boxes <- double(trials)
for(i in 1:trials){
  sim.boxes[i] <- box.count()
}
est <- mean(sim.boxes)               # Monte Carlo estimate of the expected number of boxes
mcse <- sd(sim.boxes) / sqrt(trials) # Monte Carlo standard error
interval <- est + c(-1,1)*1.96*mcse  # approximate 95% confidence interval
est
[1] 113.328
interval
[1] 109.7884 116.8676
hist(sim.boxes, main="Histogram of Total Boxes", xlab="Boxes")
abline(v=300, col="red", lwd=2)
sum(sim.boxes>300)/1000
[1] 0.01
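As a sanity check on the Monte Carlo estimate (this check is not part of the original notes), the expected number of boxes for a collection with unequal probabilities \(p_1,\dots,p_{15}\) admits the classical integral representation \(E(T)=\int_0^\infty\left(1-\prod_{i}\left(1-e^{-p_i t}\right)\right)dt\), which can be evaluated numerically; the helper name ev.integrand below is ours:
ev.integrand <- function(t) sapply(t, function(s) 1 - prod(1 - exp(-prob.table*s)))
integrate(ev.integrand, 0, Inf)$value # should be close to the simulated estimate of about 113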
The bisection method is an iterative root-finding procedure. Our goal is to find the root of \[g(x)=\frac{1+1/x-\log x}{(1+x)^2}=0.\] (In the code below this function is named g.prime, since it is the derivative of the objective \(\log(x)/(1+x)\); its root is therefore the maximizer of that objective.) The intermediate value theorem tells us that if \(g(x)\) is continuous on \([a_0, b_0]\) and \(g(a_0)g(b_0)\leq 0\), then there exists at least one \(x^* \in [a_0, b_0]\) for which \(g(x^*)=0\).
To find this \(x^*\), the bisection method systematically shrinks the interval from \([a_0, b_0]\) to \([a_1, b_1]\) to \([a_2, b_2]\) and so on, where \([a_0, b_0]\supset [a_1, b_1]\supset [a_2, b_2] \supset \cdots\).
Let the initial value \(x^{(0)} = (a_0 + b_0)/2\). The updating equations are
\[[a_{t+1},b_{t+1}]=\left\{\begin{aligned} \left[a_t, x^{(t)}\right] \quad\text{ if } g(a_t)g(x^{(t)})\leq 0 , \\ \left[ x^{(t)},b_t\right] \quad \text{ if }g(a_t)g(x^{(t)})> 0 \end{aligned}\right.\]
and \[x^{(t+1)}=\frac{1}{2}\left(a_{t+1}+b_{t+1}\right).\]
If \(g\) has more than one root in the starting interval, it is easy to see that bisection will find one of them, but will not find the rest.
## INITIAL VALUES
a = 1
b = 5
x = a+(b-a)/2
itr = 40

## FUNCTIONS
g = function(x){log(x)/(1+x)}                     # objective function
g.prime = function(x){(1+(1/x)-log(x))/((1+x)^2)} # its derivative, whose root we seek

## MAIN
for (i in 1:itr){
  if (g.prime(a)*g.prime(x) < 0) {b = x}
  else {a = x}
  x = a+(b-a)/2
}
## OUTPUT
x # FINAL ESTIMATE
[1] 3.591121
g(x) # OBJECTIVE FUNCTION AT ESTIMATE
[1] 0.2784645
g.prime(x) # GRADIENT AT ESTIMATE
[1] -1.121895e-14
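Note that after 40 bisection steps the bracketing interval has width \((b_0-a_0)/2^{40}=4/2^{40}\approx 3.6\times 10^{-12}\), which is why the derivative at the final estimate is numerically indistinguishable from zero.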
Newton's method, also known as Newton-Raphson iteration, is a general root-finding approach. Suppose that \(g^\prime\) is continuously differentiable and that \(g^{\prime\prime}(x^*)\neq 0\). At iteration \(t\), a Taylor series expansion of \(g^\prime\) about \(x^{(t)}\) yields the update \[x^{(t+1)}=x^{(t)}-\frac{g^\prime(x^{(t)})}{g^{\prime\prime}(x^{(t)})}=x^{(t)}+h^{(t)},\] where the increment \(h^{(t)}=-g^\prime(x^{(t)})/g^{\prime\prime}(x^{(t)})\) is called a refinement.
## 1. Initialization
x = 3 # initial value. In the case of MLE, theta.
it = 40 # number of iterations to run
x.vec <- rep(x, (it+1))

## 2. Define Functions
g = function(x){log(x)/(1+x)} # Objective function. In the case of MLE, g is the log-likelihood function of theta.
g.prime = function(x){(1+(1/x)-log(x))/((1+x)^2)} # first derivative of g
g.2prime = function(x){(-1/((x^2)+(x^3)))-2*(1+(1/x)-log(x))/((1+x)^3)} # second derivative of g

## 3. Algorithm
for(i in 1:it){
  x = x - g.prime(x)/g.2prime(x)
  x.vec[i+1] <- x.vec[i] - g.prime(x.vec[i])/g.2prime(x.vec[i]) # iteration tracking
}
## 4. Output
x # (Optimal) Estimate
[1] 3.591121
g(x) # Function evaluated at estimate
[1] 0.2784645
g.prime(x) # check gradient at estimate
[1] 1.053423e-17
## 5. Visualization of Iteration
x.vec
[1] 3.000000 3.417798 3.574045 3.590946 3.591121 3.591121 3.591121 3.591121
[9] 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121
[17] 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121
[25] 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121
[33] 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121
[41] 3.591121
plot(1:(it+1),g(x.vec))
plot(1:7,head(g(x.vec),7), type = "o")
When the optimization of \(g\) corresponds to an MLE problem, the objective (target) function \(g\) is the log-likelihood \(l\), and the root \(x^*\) of \(l^\prime\) is the MLE of \(\theta\).
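As a small illustration of this correspondence (this example is not part of the original notes; the vector x.data and its values are made up for demonstration), consider i.i.d. Poisson observations with log-likelihood \(l(\lambda)=\sum_i\left(x_i\log\lambda-\lambda\right)+\text{const}\). Newton's method applied to \(l^\prime(\lambda)\) converges to the closed-form MLE \(\hat\lambda=\bar x\):
x.data <- c(3, 5, 2, 4, 6, 3, 4)                           # hypothetical Poisson counts
l.prime  <- function(lam) sum(x.data)/lam - length(x.data) # l'(lambda)
l.2prime <- function(lam) -sum(x.data)/lam^2               # l''(lambda)
lam <- 1                                                   # initial value
for(i in 1:20){
  lam <- lam - l.prime(lam)/l.2prime(lam)                  # Newton update
}
lam          # Newton estimate of lambda
mean(x.data) # closed-form MLE, for comparison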
A fundamental problem: Minimize a function \(f(\theta_1, ... , \theta_n)\). Calculus teaches us that all the first derivatives \(\partial f/\partial \theta_i\) are zero at the minimum when \(f\) is smooth. (We have \(n\) equations \(\partial f/\partial \theta_i=0\).)
Gradient descent (GD) is a first-order method. It uses the derivatives \(\partial f/\partial \theta_i\) to find the steepest direction that reduces \(f(\boldsymbol \theta)\). The steepest direction, in which \(f(\boldsymbol \theta)\) decreases fastest, is given by \(-\nabla f\), where the symbol \(\nabla f\) represents the vector of \(n\) partial derivatives of \(f\): its gradient.
At \(t=0\), start with initial values for all parameters in the vector \(\boldsymbol \theta\), that is, \(\boldsymbol \theta_0\).
Iterate until the target \(f(\boldsymbol \theta)\) fails to decrease:
At the \(t\)th iteration, \[\theta_{t+1}=\theta_t-\rho_t \nabla f(\theta_t),\] where \(\rho_t\) is the step size or learning rate.
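To make the update rule concrete, here is a minimal one-dimensional sketch (the quadratic target \(f(\theta)=(\theta-3)^2\) and the step size 0.1 are illustrative choices, not part of the regression example that follows); its gradient is \(2(\theta-3)\), so the iterates approach the minimizer \(\theta=3\):
f.grad <- function(theta) 2*(theta - 3) # gradient of f(theta) = (theta - 3)^2
theta <- 0                              # initial value theta_0
rho <- 0.1                              # constant step size (learning rate)
for(t in 1:100){
  theta <- theta - rho*f.grad(theta)    # theta_{t+1} = theta_t - rho * grad f(theta_t)
}
theta                                   # close to the minimizer 3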
Suppose we use a simple supervised statistical learning method: a linear regression model based on the linear hypothesis \[y_i = \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\epsilon_i\] with \(\epsilon_i \overset{iid}{\sim} \mathcal N(0, \sigma^2)\).
The target \(f(\boldsymbol \theta)\) in this case is the mean squared error (MSE) \[\text{MSE}=\frac{1}{n}\sum_{i=1}^n\left(y_i-\hat y_i\right)^2\]
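For the linear model above, with parameter vector \(\boldsymbol\beta\) and model matrix \(X\), the gradient of the MSE is \[\nabla_{\boldsymbol\beta}\,\text{MSE}=-\frac{2}{n}X^\top\left(\mathbf y-X\boldsymbol\beta\right),\] and the constant factor \(2/n\) can be absorbed into the step size; this is why the gradient descent code below computes the gradient as t(X) %*% (LP - y), where LP denotes \(X\boldsymbol\beta\).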
In non-linear learning models, the target function is the loss function.
library(tidyverse)
The true regression line is \(E(y)= 1+0.2 x_{1}+0.5x_{2}\)
set.seed(1234)
n = 1000
x1 = rnorm(n)
x2 = rnorm(n)
y = 1 + 0.2*x1 + 0.5*x2 + rnorm(n)
X = cbind(Intercept = 1, x1, x2) # model matrix X
gd <- function(
  par,
  X,
  y,
  tolerance = 1e-3,
  maxit = 1000,
  stepsize = 1e-3,
  adapt = FALSE,
  verbose = TRUE,
  plotLoss = TRUE
) {
  # initialize
  beta = par; names(beta) = colnames(X)
  loss = crossprod(X %*% beta - y)   # residual sum of squares at the start
  tol  = 1
  iter = 1

  while(tol > tolerance && iter < maxit){
    LP   = X %*% beta                # linear predictor X beta
    grad = t(X) %*% (LP - y)         # gradient of the squared-error loss
    betaCurrent = beta - stepsize * grad
    tol  = max(abs(betaCurrent - beta))
    beta = betaCurrent
    loss = append(loss, crossprod(LP - y))
    iter = iter + 1

    if (adapt)                       # adaptive step size: grow if the loss decreased, shrink otherwise
      stepsize = ifelse(
        loss[iter] < loss[iter - 1],
        stepsize * 1.2,
        stepsize * .8
      )

    if (verbose && iter %% 10 == 0)
      message(paste('Iteration:', iter))
  }

  if (plotLoss)
    plot(loss, type = 'l', bty = 'n')

  list(
    par = beta,
    loss = loss,
    RSE = sqrt(crossprod(LP - y) / (nrow(X) - ncol(X))),
    iter = iter,
    fitted = LP
  )
}
init = rep(0, 3)

fit_gd = gd(
  init,
  X = X,
  y = y,
  tolerance = 1e-8,
  stepsize = 1e-4,
  adapt = TRUE
)
Iteration: 10
Iteration: 20
Iteration: 30
Iteration: 40
Iteration: 50
Iteration: 60
Iteration: 70
cbind.data.frame("gd"=fit_gd$par,"lm"=coef(lm(y ~ x1 + x2)))
gd lm
Intercept 1.0299573 1.0299573
x1 0.2177020 0.2177020
x2 0.4631026 0.4631026
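For reference (this check is not shown in the original output), the least-squares coefficients can also be obtained in closed form from the normal equations \(\hat{\boldsymbol\beta}=(X^\top X)^{-1}X^\top\mathbf y\), which should agree with both columns above up to the gradient descent tolerance:
solve(crossprod(X), crossprod(X, y)) # (X'X)^{-1} X'y, the closed-form OLS solution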
Michael Clark. Model Estimation by Example: Demonstrations with R. Online book.
Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. The MIT Press, 2022.
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning with Applications in R. 2nd ed. Springer, 2021.
Geof H. Givens and Jennifer A. Hoeting. Computational Statistics. 2nd ed. Wiley, 2012.