Optimization Methods

Author: Jiaye Xu

Published: April 11, 2024

Newton-Raphson

Newton’s method, a.k.a. Newton-Raphson iteration, is a general root-finding approach; to optimize \(g\), we apply it to the root of \(g^\prime\). Suppose that \(g^\prime\) is continuously differentiable and \(g^{\prime\prime}(x^*)\neq 0\). At iteration \(t\), a Taylor series expansion of \(g^\prime\) about \(x^{(t)}\), evaluated at the root \(x^*\), gives the update \[x^{(t+1)}=x^{(t)}-\frac{g^\prime(x^{(t)})}{g^{\prime\prime}(x^{(t)})}=x^{(t)}+h^{(t)},\] where \(h^{(t)}\) is called the refinement.

## 1. Initialization
x = 3 # initial value. In the case of MLE, theta.
it = 40 # number of iterations to run
x.vec <- rep(x,(it+1))

## 2. Define Functions
g = function(x){log(x)/(1+x)} # Objective function. In the case of MLE, g is log-likelihood function of theta.
g.prime = function(x){(1+(1/x)-log(x))/((1+x)^2)} # first derivative of g
g.2prime = function(x){(-1/((x^2)+(x^3)))-2*(1+(1/x)-log(x))/((1+x)^3)} # second derivative of g

## 3. Algorithm
for(i in 1:it){
  x = x - g.prime(x)/g.2prime(x)                                # Newton update
  x.vec[i+1] <- x.vec[i] - g.prime(x.vec[i])/g.2prime(x.vec[i]) # track the same iterates
}

## 4. Output
x       # (Optimal) Estimate
[1] 3.591121
g(x)        # Function evaluated at estimate
[1] 0.2784645
g.prime(x)  # check gradient at estimate
[1] 1.053423e-17
## 5. Visualization of Iteration
x.vec
 [1] 3.000000 3.417798 3.574045 3.590946 3.591121 3.591121 3.591121 3.591121
 [9] 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121
[17] 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121
[25] 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121
[33] 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121 3.591121
[41] 3.591121
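
The iterates stabilize after about four steps; this rapid convergence reflects the quadratic convergence of Newton’s method near a root with \(g^{\prime\prime}(x^*)\neq 0\).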
plot(1:(it+1),g(x.vec))

plot(1:7,head(g(x.vec),7), type = "o")

When the optimization of \(g\) corresponds to an MLE problem, the objective/target function \(g\) is the log-likelihood \(l\), and the root \(x^*\) of \(l^\prime\) is the MLE of \(\theta\).
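
As a minimal sketch of this connection (the exponential sample below is our own illustrative choice, since its MLE \(1/\bar y\) is available in closed form for comparison):

set.seed(1)
y <- rexp(50, rate = 2)                                 # simulated sample, true rate 2
l.prime  = function(lambda){length(y)/lambda - sum(y)}  # score function l'(lambda)
l.2prime = function(lambda){-length(y)/lambda^2}        # second derivative l''(lambda)
lambda = 1                                              # initial value
for(i in 1:20){lambda = lambda - l.prime(lambda)/l.2prime(lambda)}
c(newton = lambda, closed.form = 1/mean(y))             # the two estimates agree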

Gradient Descent

A fundamental problem: minimize a function \(f(\theta_1, \ldots , \theta_n)\). Calculus teaches us that when \(f\) is smooth, all the first derivatives \(\partial f/\partial \theta_i\) are zero at the minimum, giving the \(n\) equations \(\partial f/\partial \theta_i=0\).

Gradient descent (GD) is a first-order method. It uses the derivatives \(\partial f/\partial \theta_i\) to find the steepest direction that reduces \(f(\boldsymbol \theta)\). The steepest direction, in which \(f(\boldsymbol \theta)\) decreases fastest, is given by \(-\nabla f\), where the symbol \(\nabla f\) represents the vector of \(n\) partial derivatives of \(f\): its gradient.

  1. At \(t=0\), start with initial values for all parameters in vector \(\boldsymbol \theta\), that is \(\boldsymbol \theta_0\)

  2. Iterate until the target \(f(\boldsymbol \theta)\) fails to decrease:

    At the \(t\)th iteration, \[\boldsymbol\theta_{t+1}=\boldsymbol\theta_t-\rho_t \nabla f(\boldsymbol\theta_t),\] where \(\rho_t\) is the stepsize, or learning rate. (A one-dimensional sketch follows below.)
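
For instance, a one-dimensional sketch (our own toy example): minimizing \(f(\theta)=(\theta-2)^2\), whose gradient is \(2(\theta-2)\), with a fixed stepsize.

theta = 0                                          # initial value theta_0
rho   = 0.1                                        # fixed stepsize (learning rate)
for(t in 1:50){theta = theta - rho*2*(theta - 2)}  # theta - rho * f'(theta)
theta                                              # approaches the minimizer theta* = 2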

Demonstration for a Linear Regression

Suppose we use a simple supervised statistical learning method: a linear regression model based on the linear hypothesis \[y_i = \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\epsilon_i\] with \(\epsilon_i \overset{iid}{\sim} \mathcal N(0, \sigma^2)\).

The target \(f(\boldsymbol \theta)\) in this case is the mean squared error (MSE) \[\text{MSE}=\frac{1}{n}\sum_{i=1}^n\left(y_i-\hat y_i\right)^2\]
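
For this model the gradient has a closed form, \[\nabla_{\boldsymbol\beta}\,\text{MSE}=\frac{2}{n}X^\top(X\boldsymbol\beta-\mathbf y),\] and the gd function below computes \(X^\top(X\boldsymbol\beta-\mathbf y)\), folding the constant \(2/n\) into the stepsize.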

In non-linear learning models, the target function is the loss function.

library(tidyverse)

Data Setup

The true regression line is \(E(y)= 1+0.2 x_{1}+0.5x_{2}\)

set.seed(1234)

n  = 1000
x1 = rnorm(n)
x2 = rnorm(n)
y  = 1 + 0.2*x1 + 0.5*x2 + rnorm(n)
X  = cbind(Intercept = 1, x1, x2)  # model matrix

GD Function

gd <- function(
  par,
  X,
  y,
  tolerance = 1e-3,
  maxit     = 1000,
  stepsize  = 1e-3,
  adapt     = FALSE,
  verbose   = TRUE,
  plotLoss  = TRUE
  ) {
  
  # initialize
  beta = par; names(beta) = colnames(X)
  loss = crossprod(X %*% beta - y)
  tol  = 1
  iter = 1
  
  while(tol > tolerance && iter < maxit){
    
    LP   = X %*% beta
    grad = t(X) %*% (LP - y)
    betaCurrent = beta - stepsize * grad
    tol  = max(abs(betaCurrent - beta))
    beta = betaCurrent
    loss = append(loss, crossprod(LP - y))
    iter = iter + 1
    
    if (adapt)
      stepsize = ifelse(
        loss[iter] < loss[iter - 1],  
        stepsize * 1.2, 
        stepsize * .8
      )
    
    if (verbose && iter %% 10 == 0)
      message(paste('Iteration:', iter))
  }
  
  if (plotLoss)
    plot(loss, type = 'l', bty = 'n')
  
  list(
    par    = beta,
    loss   = loss,
    RSE    = sqrt(crossprod(LP - y) / (nrow(X) - ncol(X))), 
    iter   = iter,
    fitted = LP
  )
}

Initialization and Run the Algorithm

init = rep(0, 3)
fit_gd = gd(
  init,
  X = X,
  y = y,
  tolerance = 1e-8,
  stepsize  = 1e-4,
  adapt     = TRUE
)
Iteration: 10
Iteration: 20
Iteration: 30
Iteration: 40
Iteration: 50
Iteration: 60
Iteration: 70

Compare the Results

cbind.data.frame("gd"=fit_gd$par,"lm"=coef(lm(y ~ x1 + x2)))
                 gd        lm
Intercept 1.0299573 1.0299573
x1        0.2177020 0.2177020
x2        0.4631026 0.4631026
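
As an additional check (our addition), the normal-equations solution \((X^\top X)^{-1}X^\top y\) reproduces the lm coefficients:

solve(crossprod(X), crossprod(X, y))   # (X'X)^{-1} X'y, the least squares solution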

Stochastic Gradient Descent

To accelerate computation for large data sets, instead of summing over all \(n\) observations, we sample a fraction, or minibatch, of them each time we compute a gradient step. This process is known as stochastic gradient descent (SGD) and is the state of the art for training deep neural networks.

For a loss function \(\mathcal L(\boldsymbol \theta)=\frac{1}{n}\sum_{i=1}^{n}\mathcal L_i(\boldsymbol\theta)\), the gradient of this objective in GD has the form \[g_t=\frac{1}{n}\sum_{i=1}^{n}\nabla_{\boldsymbol\theta}\mathcal L_i(\boldsymbol\theta_t)\] This is slow when the training set is large. So in SGD, we approximate the gradient by sampling a minibatch of \(b\ll n\) examples: \[g_t\approx\frac{1}{|\mathcal B_t|}\sum_{i\in\mathcal B_t}\nabla_{\boldsymbol\theta}\mathcal L_i(\boldsymbol\theta_t)\] where \(\mathcal B_t\) is a set of \(b\) randomly chosen examples to use at iteration \(t\).
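
To make the minibatch gradient concrete, here is a sketch of a single gradient estimate on the regression data above (the batch size b = 32 and the zero starting value are our arbitrary choices; the sgd function below instead updates one observation at a time, i.e. b = 1):

b    = 32
beta = rep(0, ncol(X))                        # current parameter value (zeros, arbitrary)
idx  = sample(nrow(X), b)                     # randomly chosen minibatch B_t
Xb   = X[idx, , drop = FALSE]
g_t  = t(Xb) %*% (Xb %*% beta - y[idx])/b     # minibatch gradient estimate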

sgd <- function(
  par,                       # parameter estimates
  X,                         # model matrix
  y,                         # target variable
  stepsize = 1,              # the learning rate
  stepsize_tau = 0,          # if > 0, a check on the LR at early iterations
  average = FALSE            # a variation of the approach
){
  
  # initialize
  beta = par
  names(beta) = colnames(X)
  betamat = matrix(0, nrow(X), ncol = length(beta))      # Collect all estimates
  fits = NA                                      # Collect fitted values at each point
  loss = NA                                      # Collect loss at each point
  s = 0                                          # adagrad per parameter learning rate adjustment
  eps  = 1e-8                                    # a smoothing term to avoid division by zero
  
  for (i in 1:nrow(X)) {
    Xi   = X[i, , drop = FALSE]
    yi   = y[i]
    LP   = Xi %*% beta                           # matrix operations not necessary, 
    grad = t(Xi) %*% (LP - yi)                   # but makes consistent with standard gd func
    s    = s + grad^2                            # adagrad approach
    
    # update
    beta = beta - stepsize/(stepsize_tau + sqrt(s + eps)) * grad 

    if (average && i > 1) {
      beta = beta - 1/i * (betamat[i - 1, ] - beta)          # a variation
    }
    
    betamat[i,] = beta
    fits[i]     = LP
    loss[i]     = (LP - yi)^2
  }
  
  LP = X %*% beta
  lastloss = crossprod(LP - y)
  
  list(
    par    = beta,                               # final estimates
    par_chain = betamat,                         # estimates at each iteration
    RMSE   = sqrt(sum(lastloss)/nrow(X)),
    fitted = LP
  )
}
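
Two implementation details worth noting: the function makes a single pass through the data, updating after every observation, and the stepsize is adjusted per parameter in the AdaGrad style, so coordinates that have accumulated large squared gradients take smaller steps.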

Initialization and Run SGD

starting_values = rep(0, 3)
fit_sgd = sgd(
  starting_values,
  X = X,
  y = y,
  stepsize     = .1,
  stepsize_tau = .5,
  average = FALSE
)

str(fit_sgd)
List of 4
 $ par      : num [1:3, 1] 1.025 0.24 0.445
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "Intercept" "x1" "x2"
  .. ..$ : NULL
 $ par_chain: num [1:1000, 1:3] -0.06206 -0.00242 0.01715 0.08806 0.07569 ...
 $ RMSE     : num 1.01
 $ fitted   : num [1:1000, 1] 0.198 1.226 0.6 0.744 1.441 ...
fit_sgd$par
               [,1]
Intercept 1.0247599
x1        0.2402697
x2        0.4454395

Comparison

cbind.data.frame("sgd"=fit_sgd$par,"gd"=fit_gd$par,"lm"=coef(lm(y ~ x1 + x2)))
                sgd        gd        lm
Intercept 1.0247599 1.0299573 1.0299573
x1        0.2402697 0.2177020 0.2177020
x2        0.4454395 0.4631026 0.4631026

References

Michael Clark. Model Estimation by Example: Demonstrations with R. Online book.

Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. The MIT Press, 2022.

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning with Applications in R. 2nd ed. Springer, 2021.

Geof H. Givens and Jennifer A. Hoeting. Computational Statistics. 2nd ed. Wiley, 2012.