Newton’s method, a.k.a. Newton–Raphson iteration, is a general root-finding approach. Suppose that \(g^\prime\) is continuously differentiable and \(g^{\prime\prime}(x^*)\neq 0\). At iteration \(t\), a Taylor series expansion of \(g^\prime\) about \(x^{(t)}\) motivates the update \[x^{(t+1)}=x^{(t)}-\frac{g^\prime(x^{(t)})}{g^{\prime\prime}(x^{(t)})}=x^{(t)}+h^{(t)},\] where \(h^{(t)}\) is called the refinement.
```r
## 1. Initialization
x  = 3                    # initial value; in the MLE case, an initial theta
it = 40                   # number of iterations to run
x.vec <- rep(x, it + 1)

## 2. Define Functions
g = function(x) { log(x) / (1 + x) }  # objective function; in the MLE case, g is the log-likelihood of theta
g.prime  = function(x) { (1 + (1/x) - log(x)) / ((1 + x)^2) }                                # first derivative of g
g.2prime = function(x) { (-1 / ((x^2) + (x^3))) - 2 * (1 + (1/x) - log(x)) / ((1 + x)^3) }   # second derivative of g

## 3. Algorithm
for (i in 1:it) {
  x = x - g.prime(x) / g.2prime(x)
  x.vec[i + 1] <- x.vec[i] - g.prime(x.vec[i]) / g.2prime(x.vec[i])  # iteration tracking
}

## 4. Output
x  # (optimal) estimate
```
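Starting from \(x = 3\), the iterates converge within a few steps to the root of \(g^\prime\), \(x^* \approx 3.59\), which maximizes \(g(x)=\log(x)/(1+x)\).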
When the optimization of \(g\) corresponds to an MLE problem, the objective (target) function \(g\) is the log-likelihood \(l\), and the root \(x^*\) is the MLE of \(\theta\).
Gradient Descent
A fundamental problem: minimize a function \(f(\theta_1, \ldots, \theta_n)\). Calculus teaches us that, when \(f\) is smooth, all the first derivatives \(\partial f/\partial \theta_i\) are zero at the minimum, giving \(n\) equations \(\partial f/\partial \theta_i=0\).
Gradient descent (GD) is a first-order method. It uses the derivatives \(\partial f/\partial \theta_i\) to find the direction of steepest descent. The direction in which \(f(\boldsymbol \theta)\) decreases fastest is given by \(-\nabla f\), where \(\nabla f\) denotes the vector of \(n\) partial derivatives of \(f\): its gradient.
At \(t=0\), start with initial values \(\boldsymbol \theta_0\) for all parameters in the vector \(\boldsymbol \theta\).
Iterate until the target \(f(\boldsymbol \theta)\) fails to decrease:
At the \(t\)th iteration, \[\boldsymbol\theta_{t+1}=\boldsymbol\theta_t-\rho_t \nabla f(\boldsymbol\theta_t),\] where \(\rho_t\) is the step size, or learning rate.
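As an illustration (not part of the original notes), a minimal generic implementation of this update rule in R might look as follows; `f` and `grad_f` are assumed user-supplied functions returning \(f(\boldsymbol\theta)\) and \(\nabla f(\boldsymbol\theta)\), and a fixed step size stands in for \(\rho_t\).

```r
## Minimal gradient descent sketch (illustrative; `f` and `grad_f` are
## assumed to be supplied by the user, and the step size is held constant)
gd <- function(theta0, f, grad_f, stepsize = 0.1, tol = 1e-8, maxit = 1000) {
  theta <- theta0
  for (t in 1:maxit) {
    theta_new <- theta - stepsize * grad_f(theta)  # step along -grad f
    if (f(theta) - f(theta_new) < tol) break       # stop when f fails to decrease
    theta <- theta_new
  }
  theta
}
```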
Demonstration for a Linear Regression
Suppose we use a simple supervised statistical learning method: a linear regression model based on the linear hypothesis \[y_i = \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\epsilon_i\] with \(\epsilon_i \overset{iid}{\sim} \mathcal N(0, \sigma^2)\).
The target \(f(\boldsymbol \theta)\) in this case is the mean squared error (MSE) \[\text{MSE}=\frac{1}{n}\sum_{i=1}^n\left(y_i-\hat y_i\right)^2\]
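For this linear model, with model matrix \(\mathbf X\) and fitted values \(\hat{\mathbf y}=\mathbf X\boldsymbol\beta\), differentiating the MSE gives the gradient in closed form, \[\nabla_{\boldsymbol\beta}\,\text{MSE}=\frac{2}{n}\mathbf X^\top(\mathbf X\boldsymbol\beta-\mathbf y),\] which is the quantity GD (and SGD, below) repeatedly evaluates.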
In non-linear learning models, the corresponding target function is the loss function.
```r
library(tidyverse)
```
Data Setup
The true regression line is \(E(y)= 1+0.2 x_{1}+0.5x_{2}\).
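A simulation sketch consistent with this true line might look as follows; the sample size, predictor distributions, and noise level are illustrative assumptions, not taken from the original setup.

```r
set.seed(123)                             # illustrative seed
n  <- 1000                                # assumed sample size
x1 <- rnorm(n)                            # assumed predictor distributions
x2 <- rnorm(n)
y  <- 1 + 0.2 * x1 + 0.5 * x2 + rnorm(n)  # E(y) = 1 + 0.2*x1 + 0.5*x2
X  <- cbind(Intercept = 1, x1, x2)        # model matrix
```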
To accelerate computation for large data sets, instead of summing over all \(n\) observations, we sample a fraction, or minibatch, of them each time we compute a gradient step. This process is known as stochastic gradient descent (SGD) and is the state of the art for fitting deep neural networks.
For a loss function \(\mathcal L(\boldsymbol \theta)=\frac{1}{n}\sum_{i=1}^{n}\mathcal L_i(\boldsymbol\theta)\), the gradient used by GD has the form \[g_t=\frac{1}{n}\sum_{i=1}^{n}\nabla_{\boldsymbol\theta}\mathcal L_i(\boldsymbol\theta_t).\] This is slow when the training set is large. In SGD, we therefore approximate the gradient using a minibatch of \(b \ll n\) examples, \[g_t\approx\frac{1}{|\mathcal B_t|}\sum_{i\in\mathcal B_t}\nabla_{\boldsymbol\theta}\mathcal L_i(\boldsymbol\theta_t),\] where \(\mathcal B_t\) is a set of \(b\) randomly chosen examples used at iteration \(t\).
```r
sgd <- function(
  par,               # parameter estimates
  X,                 # model matrix
  y,                 # target variable
  stepsize = 1,      # the learning rate
  stepsize_tau = 0,  # if > 0, a check on the LR at early iterations
  average = FALSE    # a variation of the approach
) {
  # initialize
  beta = par
  names(beta) = colnames(X)
  betamat = matrix(0, nrow(X), ncol = length(beta))  # collect all estimates
  fits = NA          # collect fitted values at each point
  loss = NA          # collect loss at each point
  s = 0              # adagrad per-parameter learning rate adjustment
  eps = 1e-8         # a smoothing term to avoid division by zero

  for (i in 1:nrow(X)) {
    Xi = X[i, , drop = FALSE]
    yi = y[i]
    LP = Xi %*% beta             # matrix operations not necessary,
    grad = t(Xi) %*% (LP - yi)   # but consistent with the standard gd function
    s = s + grad^2               # adagrad approach

    # update
    beta = beta - stepsize / (stepsize_tau + sqrt(s + eps)) * grad

    if (average & i > 1) {
      beta = beta - 1/i * (betamat[i - 1, ] - beta)  # a variation of the approach
    }

    betamat[i, ] = beta
    fits[i] = LP
    loss[i] = (LP - yi)^2
  }

  LP = X %*% beta
  lastloss = crossprod(LP - y)

  list(
    par = beta,            # final estimates
    par_chain = betamat,   # estimates at each iteration
    RMSE = sqrt(sum(lastloss) / nrow(X)),
    fitted = LP
  )
}
```
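As a usage sketch, assuming the simulated `X` and `y` from the data setup above:

```r
init <- rep(0, ncol(X))       # start all coefficients at zero
fit  <- sgd(par = init, X = X, y = y, stepsize = 0.1)
round(fit$par, 3)             # should land near the true (1, 0.2, 0.5)
fit$RMSE
```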