Optimization with R

Author

Jiaye Xu

Published

March 25, 2025

Introduction to R

We will use the open source statistical software R, available at http://www.r-project.org, and the open source integrated development environment (IDE) RStudio, which can be downloaded from https://posit.co/.

BASIC R OPERATIONS: Simple operations and vectors

1. Define a variable and look at it

x <- 12
x
[1] 12

2. Mathematical operations on a variable

x+2
[1] 14
x-2
[1] 10
x/2
[1] 6
x*2
[1] 24
x^2
[1] 144

3. Saving the result of mathematical operations in another variable

y <- x+2
y
[1] 14

3b. Showing the result of an assignment as you do it (sometimes handy for debugging)

(y <- x+2)
[1] 14

4. Define an (atomic) vector with the c() function

x <- c(1,2,3,4)
x
[1] 1 2 3 4

5. Mathematical operations on a vector

### element-wise operation
x+2
[1] 3 4 5 6
x-2
[1] -1  0  1  2
x/2
[1] 0.5 1.0 1.5 2.0
x*2
[1] 2 4 6 8
x^2
[1]  1  4  9 16

5b. Example of element-wise multiplication with recycling (not matrix multiplication)

y <- c(1,2,3,4,5,6,7,8)
y
[1] 1 2 3 4 5 6 7 8
x*y
[1]  1  4  9 16  5 12 21 32

5c. Example of matrix multiplication

x %*% t(x) # (4 by 1) times (1 by 4)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    2    4    6    8
[3,]    3    6    9   12
[4,]    4    8   12   16
# equivalently, crossprod(t(x))

t(x) %*% x # (1 by 4) times (4 by 1)
     [,1]
[1,]   30
# equivalently, crossprod(x)

x%*%t(y) # (4 by 1) times (1 by 8)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    1    2    3    4    5    6    7    8
[2,]    2    4    6    8   10   12   14   16
[3,]    3    6    9   12   15   18   21   24
[4,]    4    8   12   16   20   24   28   32
# equivalently, crossprod(t(x), t(y))

#   Note: the function t(x) takes the transpose of x

5d. Matrix function

z1 <- matrix(y, 2) # nrow = 2; the matrix is filled by columns by default.
z1
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
z2 <- matrix(y, 2, byrow=TRUE)
z2
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8

6. Subsetting and manipulating vectors

# Select a specific element
x[1]
[1] 1
x[2]
[1] 2
# Select a range of elements
x[2:4]
[1] 2 3 4
# Drop a specific element
x[-2]
[1] 1 3 4
# Drop a range of elements
x[-(2:3)]
[1] 1 4
# Use a vector of indices to select and drop elements
y<-c(1,3)
x[y]
[1] 1 3
x[-y]
[1] 2 4
# Combine two vectors
x<-c(x,y)
x
[1] 1 2 3 4 1 3
# Reverse a vector
rev(x)
[1] 3 1 4 3 2 1

7. Logical operations

# Test equality
1==5
[1] FALSE
1==1
[1] TRUE
# Test inequality
1!=5
[1] TRUE
1!=1
[1] FALSE

8. Ranges

1:3
[1] 1 2 3
seq(from=0, to=1, by=0.12)
[1] 0.00 0.12 0.24 0.36 0.48 0.60 0.72 0.84 0.96
seq(from=0, to=1, length=7)
[1] 0.0000000 0.1666667 0.3333333 0.5000000 0.6666667 0.8333333 1.0000000

USEFUL FUNCTIONS

1. Mean, variance, standard deviation

x <- c(1,2,3,4)
mean(x)
[1] 2.5
var(x)  
[1] 1.666667
sd(x)
[1] 1.290994

2. Summing a vector

sum(x)
[1] 10
prod(x)
[1] 24

3. Length of a vector

length(x)
[1] 4

4. Densities for various distributions

dnorm(0, mean=0, sd=1)      # Normal distribution
[1] 0.3989423
dexp(1, rate=1)         # Exponential distribution
[1] 0.3678794
dbinom(5, size=10, prob=0.5)    # Binomial distribution
[1] 0.2460938
dpois(10, lambda=10)        # Poisson distribution
[1] 0.12511
dgamma(1, shape=1, scale=1) # Gamma distribution
[1] 0.3678794
# NOTE: Here the density function is being evaluated at the given x with the specified parameters

5. CDFs, inverse CDFs, and random generation

pnorm(0, mean=0, sd=1)      # P(X<=0)
[1] 0.5
qnorm(0.95, mean=0, sd=1)   # the 0.95 quantile, i.e., the value q with P(X<=q)=0.95
[1] 1.644854
rnorm(10, mean=0, sd=1)     # generates 10 realizations of a standard normal RV
 [1]  0.3126021 -0.5172903  2.0719703  0.5180661  0.7301469 -0.1118300
 [7]  0.6443572  0.6318846  0.3926022  1.4656712

6. Basic plotting

x<-c(1,2,3,4)
y<-x^2
plot(x,y)

plot(x,y, xlab="x label here", ylab="y label here", pch=2,col="blue")

# NOTE: Please make sure to always use labels! Otherwise your plots are not helpful to other people (e.g., graders) trying to interpret them.

PROGRAMMING AND FLOW CONTROL

1. FOR loops

for(x in 1:10){
    print(x+1)
}
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11

2. WHILE loops

x<-1
while(x<10){
    print(x)
    x<-x+x
}
[1] 1
[1] 2
[1] 4
[1] 8
x<-1
while(x<10){
  x<-x+x
  print(x)
}
[1] 2
[1] 4
[1] 8
[1] 16

3. Defining functions

# ex. factorial 
Fact <- function(n) if (n == 1) 1 else n * Fact(n - 1)
Fact(5)
[1] 120
# NOTE: equivalently, use the built-in R function factorial(5), i.e., 5!

Example 1: Blind-Box Toy Collection

Children (and some adults) are frequently enticed to buy blind boxes to collect all the action figures/toys. Assume there are 15 action figures and each blind box contains exactly one of them, with the probabilities given in the following table.

Figure      A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
Probability 0.20 0.10 0.10 0.10 0.10 0.10 0.05 0.05 0.05 0.05 0.02 0.02 0.02 0.02 0.02

Questions:

  • Estimate the expected number of boxes needed to collect all 15 action figures.
  • What is the uncertainty of your estimate?
  • What is the probability you bought more than 300 boxes?

prob.table <- c(.2, .1, .1, .1, .1, .1, .05, .05, .05, .05, .02, .02, .02, .02, .02)
boxes <- seq(1,15)
box.count <- function(prob=prob.table){
  check <- double(length(prob))
  i <- 0
  while(sum(check)<length(prob)){
    x <- sample(boxes, 1, prob=prob)
    check[x] <- 1
    i <- i+1
  }
  return(i)
}
trials <- 1000
sim.boxes <- double(trials)
for(i in 1:trials){
  sim.boxes[i] <- box.count()
}
est <- mean(sim.boxes)
mcse <- sd(sim.boxes) / sqrt(trials)
interval <- est + c(-1,1)*1.96*mcse
est
[1] 113.328
interval
[1] 109.7884 116.8676
hist(sim.boxes, main="Histogram of Total Boxes", xlab="Boxes")
abline(v=300, col="red", lwd=2)

sum(sim.boxes>300)/1000
[1] 0.01
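
The tail-probability estimate above is itself a Monte Carlo estimate and carries its own uncertainty. A minimal sketch of one way to quantify it, using the usual binomial standard error for a proportion (sim.boxes and trials are the simulation objects created above):

p.hat <- mean(sim.boxes > 300)               # estimated P(more than 300 boxes)
p.se  <- sqrt(p.hat * (1 - p.hat) / trials)  # binomial (Monte Carlo) standard error
p.hat + c(-1, 1) * 1.96 * p.se               # approximate 95% confidence interval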

Example 2: Bisection for Root Finding

The bisection method is an iterative root-finding procedure. Our goal is to find the root of \[g(x)=\frac{1+1/x-\log x}{(1+x)^2}=0.\] The intermediate value theorem tells us that if \(g(x)\) is continuous on \([a_0, b_0]\) and \(g(a_0)g(b_0)\leq 0\), then there exists at least one \(x^* \in [a_0, b_0]\) for which \(g(x^*)=0\).

To find this \(x^*\), the bisection method systematically shrinks the interval from \([a_0, b_0]\) to \([a_1, b_1]\) to \([a_2, b_2]\) and so on, where \([a_0, b_0]\supset [a_1, b_1]\supset [a_2, b_2] \supset \cdots\).

Let the initial value \(x^{(0)} = (a_0 + b_0)/2\). The updating equations are

\[[a_{t+1},b_{t+1}]=\begin{cases} \left[a_t, x^{(t)}\right] & \text{if } g(a_t)\,g(x^{(t)})\leq 0, \\ \left[x^{(t)}, b_t\right] & \text{if } g(a_t)\,g(x^{(t)})> 0, \end{cases}\]

and \[x^{(t+1)}=\frac{1}{2}\left(a_{t+1}+b_{t+1}\right).\]

If \(g\) has more than one root in the starting interval, it is easy to see that bisection will find one of them, but will not find the rest.
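
Before specializing to the optimization example, here is a minimal generic sketch of the updating rule as an R function. The name bisect, the tolerance argument, and the use of the interval half-width as a stopping rule are illustrative choices, not part of the original example.

bisect <- function(g, a, b, tol = 1e-10, maxit = 1000) {
  if (g(a) * g(b) > 0) stop("g(a) and g(b) must have opposite signs")
  x <- (a + b) / 2
  for (i in 1:maxit) {
    if (g(a) * g(x) <= 0) b <- x else a <- x   # keep the half-interval containing a root
    x <- (a + b) / 2
    if ((b - a) / 2 < tol) break               # stop once the interval is small enough
  }
  x
}

For instance, bisect(function(x) (1 + 1/x - log(x)) / (1 + x)^2, 1, 5) should reproduce the root found below.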

Bisection for Optimization

To maximize \(g(x)=\log(x)/(1+x)\), we apply bisection to its derivative: the function displayed above is exactly \(g^\prime(x)\). In the code below, g is the objective and g.prime is the function whose root we seek.

## INITIAL VALUES
a = 1
b = 5
x = a+(b-a)/2
itr = 40

## FUNCTIONS
g = function(x){log(x)/(1+x)}
g.prime = function(x){(1+(1/x)-log(x))/((1+x)^2)}

## MAIN
for (i in 1:itr){
    if (g.prime(a)*g.prime(x) <= 0) {
        b = x
    } else {
        a = x
    }
    x = a+(b-a)/2
}

## OUTPUT
x       # FINAL ESTIMATE
[1] 3.591121
g(x)        # OBJECTIVE FUNCTION AT ESTIMATE
[1] 0.2784645
g.prime(x)  # GRADIENT AT ESTIMATE
[1] -1.121895e-14

Example 3: Newton-Raphson

Newton’s method, a.k.a. Newton-Raphson iteration, is a general root-finding approach. Suppose that \(g^\prime\) is continuously differentiable and \(g^{\prime\prime}(x^*)\neq 0\). At iteration \(t\), a Taylor series expansion of \(g^\prime\) about \(x^{(t)}\) yields the update \[x^{(t+1)}=x^{(t)}-\frac{g^\prime(x^{(t)})}{g^{\prime\prime}(x^{(t)})}=x^{(t)}+h^{(t)},\] where \(h^{(t)}\) is called the refinement.

## 1. Initialization
x = 3 # initial value. In the case of MLE, theta.
it = 40 # number of iterations to run
x.vec <- rep(x,(it+1))

## 2. Define Functions
g = function(x){log(x)/(1+x)} # Objective function. In the case of MLE, g is log-likelihood function of theta.
g.prime = function(x){(1+(1/x)-log(x))/((1+x)^2)} # first derivative of g
g.2prime = function(x){(-1/((x^2)+(x^3)))-2*(1+(1/x)-log(x))/((1+x)^3)} # second derivative of g

## 3. Algorithm
for(i in 1:it){
  x = x - g.prime(x)/g.2prime(x)
  x.vec[i+1] <- x.vec[i] - g.prime(x.vec[i])/g.2prime(x.vec[i]) # iteration tracking
  }

## 4. Output
x       # (Optimal) Estimate
[1] 3.591121
g(x)        # Function evaluated at estimate
[1] 0.2784645
g.prime(x)  # check gradient at estimate
[1] 1.053423e-17
## 5. Visualization of Iteration
x.vec
 [1] 3.000000 3.417798 3.574045 3.590946 3.591121 3.591121 3.591121 3.591121
 [9] 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121
[17] 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121
[25] 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121
[33] 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121
[41] 3.591121
plot(1:(it+1),g(x.vec))

plot(1:7,head(g(x.vec),7), type = "o")

When the optimization of \(g\) corresponds to an MLE problem, the objective/target function \(g\) is the log-likelihood \(l\), and the root \(x^*\) of \(l^\prime\) is the MLE of \(\theta\).
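
As a hedged illustration of this MLE connection (not part of the original example), the same update can be applied to the log-likelihood of a Cauchy location family, whose score equation has no closed-form solution. The simulated data, function names, and starting value below are illustrative choices.

set.seed(1)
obs <- rcauchy(100, location = 3)   # simulated Cauchy(theta, 1) sample; true theta is 3

l.prime  <- function(theta) sum(2 * (obs - theta) / (1 + (obs - theta)^2))            # l'(theta)
l.2prime <- function(theta) sum(2 * ((obs - theta)^2 - 1) / (1 + (obs - theta)^2)^2)  # l''(theta)

theta <- median(obs)   # starting value: the sample median is typically close to the MLE
for (i in 1:25) {
  theta <- theta - l.prime(theta) / l.2prime(theta)   # Newton-Raphson update
}
theta            # approximate MLE of the location parameter
l.prime(theta)   # should be numerically zero at the MLE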

Example 4: Gradient Descent

A fundamental problem: Minimize a function \(f(\theta_1, ... , \theta_n)\). Calculus teaches us that all the first derivatives \(\partial f/\partial \theta_i\) are zero at the minimum when \(f\) is smooth. (We have \(n\) equations \(\partial f/\partial \theta_i=0\).)

Gradient descent (GD) is a first-order method. It uses the derivatives \(\partial f/\partial \theta_i\) to find the steepest direction that reduces \(f(\boldsymbol \theta)\). The steepest direction, in which \(f(\boldsymbol \theta)\) decreases fastest, is given by \(-\nabla f\), where the symbol \(\nabla f\) represents the vector of \(n\) partial derivatives of \(f\): its gradient.

  1. At \(t=0\), start with initial values for all parameters in vector \(\boldsymbol \theta\), that is \(\boldsymbol \theta_0\)

  2. Iterate until the target \(f(\boldsymbol \theta)\) fails to decrease:

    At the \(t\)th iteration, \[\boldsymbol\theta_{t+1}=\boldsymbol\theta_t-\rho_t \nabla f(\boldsymbol\theta_t),\] where \(\rho_t\) is the step size, or learning rate (a small generic sketch follows this list).
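
A small generic sketch of these two steps on a toy target \(f(\theta_1,\theta_2)=(\theta_1-1)^2+(\theta_2+2)^2\); the target, its gradient, the fixed step size, and the iteration cap are illustrative choices.

f      <- function(theta) (theta[1] - 1)^2 + (theta[2] + 2)^2        # toy target
grad.f <- function(theta) c(2 * (theta[1] - 1), 2 * (theta[2] + 2))  # its gradient

theta <- c(0, 0)   # step 1: initial values
rho   <- 0.1       # fixed step size (learning rate)
for (t in 1:100) {
  theta.new <- theta - rho * grad.f(theta)   # step 2: move against the gradient
  if (f(theta.new) >= f(theta)) break        # stop once the target fails to decrease
  theta <- theta.new
}
theta   # should be close to the minimizer (1, -2)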

Demonstration for a Linear Regression

Suppose we use a simple supervised statistical learning method: a linear regression model based on the linear hypothesis \[y_i = \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\epsilon_i\] with \(\epsilon_i \overset{iid}{\sim} \mathcal N(0, \sigma^2)\).

The target \(f(\boldsymbol \theta)\) in this case is the mean squared error (MSE) \[\text{MSE}=\frac{1}{n}\sum_{i=1}^n\left(y_i-\hat y_i\right)^2\]

In more general (e.g., non-linear) learning models, the target function is whatever loss function the model minimizes.
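
As a small aside, the MSE target can be written directly as a function of the coefficients; the helper name mse below is hypothetical and not used by the demonstration code, which instead tracks the equivalent (unscaled) residual sum of squares crossprod(X %*% beta - y), whose gradient is proportional to that of the MSE.

mse <- function(beta, X, y) mean((y - X %*% beta)^2)   # MSE of candidate coefficients beta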

library(tidyverse)

Data Setup

The true regression line is \(E(y)= 1+0.2 x_{1}+0.5x_{2}\)

set.seed(1234)

n  = 1000
x1 = rnorm(n)
x2 = rnorm(n)
y  = 1 + 0.2*x1 + 0.5*x2 + rnorm(n)
X  = cbind(Intercept = 1, x1, x2)  # model matrix

GD Function

gd <- function(
  par,
  X,
  y,
  tolerance = 1e-3,
  maxit     = 1000,
  stepsize  = 1e-3,
  adapt     = FALSE,
  verbose   = TRUE,
  plotLoss  = TRUE
  ) {
  
  # initialize
  beta = par; names(beta) = colnames(X)
  loss = crossprod(X %*% beta - y)
  tol  = 1
  iter = 1
  
  while(tol > tolerance && iter < maxit){
    
    LP   = X %*% beta                       # current linear predictor (fitted values)
    grad = t(X) %*% (LP - y)                # gradient of 0.5 * residual sum of squares
    betaCurrent = beta - stepsize * grad    # gradient descent step
    tol  = max(abs(betaCurrent - beta))     # largest coefficient change this iteration
    beta = betaCurrent
    loss = append(loss, crossprod(LP - y))  # record the RSS at the pre-update beta
    iter = iter + 1
    
    if (adapt)
      stepsize = ifelse(
        loss[iter] < loss[iter - 1],  
        stepsize * 1.2,                     # loss decreased: grow the step size
        stepsize * .8                       # loss increased: shrink the step size
      )
    
    if (verbose && iter %% 10 == 0)
      message(paste('Iteration:', iter))
  }
  
  if (plotLoss)
    plot(loss, type = 'l', bty = 'n')
  
  list(
    par    = beta,
    loss   = loss,
    RSE    = sqrt(crossprod(LP - y) / (nrow(X) - ncol(X))), 
    iter   = iter,
    fitted = LP
  )
}

Initialization and Run the Algorithm

init = rep(0, 3)
fit_gd = gd(
  init,
  X = X,
  y = y,
  tolerance = 1e-8,
  stepsize  = 1e-4,
  adapt     = TRUE
)
Iteration: 10
Iteration: 20
Iteration: 30
Iteration: 40
Iteration: 50
Iteration: 60
Iteration: 70

Compare the Results

cbind.data.frame("gd"=fit_gd$par,"lm"=coef(lm(y ~ x1 + x2)))
                 gd        lm
Intercept 1.0299573 1.0299573
x1        0.2177020 0.2177020
x2        0.4631026 0.4631026

References

Michael Clark. Model Estimation by Example: Demonstrations with R. Online book resource.

Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. The MIT Press, 2022.

Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. 2nd ed. Springer, 2021.

Geof H. Givens, Jennifer A. Hoeting. Computational Statistics. 2nd ed. Wiley, 2012.