MCMC I

Author

Jiaye Xu

Published

March 28, 2023

Why MCMC?

  1. We want to sample from some complicated (target) density \(\pi\). Often, this density is the result of a Bayesian computation, so it can be interpreted as a posterior density. The presumption here is that we can evaluate \(\pi\) (at least up to a normalizing constant) but cannot sample from it directly.

  2. We know that certain stochastic processes called Markov chains will converge to a stationary distribution (if it exists and if specific conditions are satisfied). Simulating from such a Markov chain for a long enough time will eventually give us a sample from the chain’s stationary distribution.

  3. Given the functional form of the density \(\pi\), we want to construct a Markov chain that has \(\pi\) as its stationary distribution.

  4. We want to sample values from the Markov chain such that the sequence \(\{x_n\}\) generated by the chain converges in distribution to the density \(\pi\).

A Take-Home Message:

For an irreducible, aperiodic and recurrent (or persistent) Markov chain \(\{\psi_t\}_{t \geq 1}\), having invariant distribution \(\pi\), it can be shown that for every initial value \(\psi_1\), the distribution of \(\psi_t\) tends to \(\pi\) as \(t\) increases to infinity.

For a sufficiently large \(M\), \(\{\psi_{M+1}, \ldots, \psi_{M+N}\}\) are all approximately distributed according to \(\pi\), and the law of large numbers holds for \(\{\psi_t\}_{t \geq 1}\), so that one has the approximation:

\[ \mathrm{E}_\pi(g(\psi)) \approx N^{-1} \sum_{j=1}^N g\left(\psi_{M+j}\right) . \]
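This take-home message can be illustrated with a toy sketch of my own (not one of the lecture's examples): an AR(1) chain \(\psi_t = \phi\,\psi_{t-1} + \varepsilon_t\), \(\varepsilon_t \sim \mathcal{N}(0,1)\) with \(|\phi|<1\), is an irreducible, aperiodic, recurrent Markov chain with invariant distribution \(\mathcal{N}(0, 1/(1-\phi^2))\), so ergodic averages after a burn-in \(M\) approximate stationary expectations even from a poor starting value.

```r
## AR(1) chain: invariant distribution is N(0, 1/(1 - phi^2)).
set.seed(42)
phi = 0.8
M = 1000; N = 50000            # burn-in length and number of retained draws
psi = numeric(M + N)
psi[1] = 10                    # start deliberately far from stationarity
for (t in 2:(M + N)) psi[t] = phi * psi[t - 1] + rnorm(1)
mean(psi[(M + 1):(M + N)])     # approximates E_pi(psi) = 0
mean(psi[(M + 1):(M + N)]^2)   # approximates 1/(1 - phi^2) = 2.78
```

Here \(g(\psi) = \psi\) and \(g(\psi) = \psi^2\) play the role of the function in the approximation above; the averages over the retained draws recover the stationary moments despite the starting point \(\psi_1 = 10\).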

Gibbs Sampler

Suppose that the unknown parameter is multidimensional, so the posterior distribution is multivariate, \(\psi= (\psi^{(1)},\psi^{(2)},\ldots,\psi^{(k)})\). In this case we can write the target density \(\pi(\psi)= \pi(\psi^{(1)},\ldots,\psi^{(k)})\). The Gibbs sampler starts from an arbitrary point \(\psi_0= (\psi^{(1)}_0,\psi^{(2)}_0,\ldots,\psi^{(k)}_0)\) in the parameter space and updates one component at a time by drawing \(\psi^{(i)}\), \(i =1, \ldots, k\) from the relevant conditional distribution, according to the scheme in Algorithm: The Gibbs sampler.

\[ \begin{aligned} &\begin{aligned} & \text { 0. Initialize the starting point } \psi_0=\left(\psi_0^{(1)}, \ldots, \psi_0^{(k)}\right) \text {; } \\ & \text { 1. for } j=1, \ldots, N \text { : } \\ & \text { 1.1) generate } \psi_j^{(1)} \text { from } \pi\left(\psi^{(1)} \mid \psi^{(2)}=\psi_{j-1}^{(2)}, \ldots, \psi^{(k)}=\psi_{j-1}^{(k)}\right) ; \\ & \text { 1.2) generate } \psi_j^{(2)} \text { from } \pi\left(\psi^{(2)} \mid \psi^{(1)}=\psi_j^{(1)}, \psi^{(3)}=\psi_{j-1}^{(3)}, \ldots, \psi^{(k)}=\psi_{j-1}^{(k)}\right) \text {; } \\ & \vdots \\ & \text { 1.k) generate } \psi_j^{(k)} \text { from } \pi\left(\psi^{(k)} \mid \psi^{(1)}=\psi_j^{(1)}, \ldots, \psi^{(k-1)}=\psi_j^{(k-1)}\right) \text {. } \end{aligned}\\ &\text { }\\ &\quad\quad\quad\quad \it \text { Algorithm: The Gibbs sampler } \end{aligned} \]

An important point, one that is often used in practical applications of the Gibbs sampler, is that the basic algorithm just described still works when one or more of the components \(\psi^{(i)}\) is itself multidimensional. In this case the Gibbs sampler updates in turn “blocks” of components of \(\psi\), drawing from their conditional distribution, given all the remaining components.
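To make the scheme concrete, here is a minimal Gibbs sampler of my own (a sketch, not from the notes) for a bivariate normal target with standard margins and correlation \(\rho\). Its full conditionals are known in closed form: \(\psi^{(1)} \mid \psi^{(2)} \sim \mathcal{N}(\rho\,\psi^{(2)}, 1-\rho^2)\), and symmetrically for \(\psi^{(2)}\).

```r
## Gibbs sampler for a bivariate normal with correlation rho.
set.seed(1)
rho = 0.6
N = 20000
psi = matrix(0, N, 2)          # column i holds the chain for psi^(i)
for (j in 2:N) {
  # draw psi^(1) from its full conditional given the current psi^(2)
  psi[j, 1] = rnorm(1, rho * psi[j - 1, 2], sqrt(1 - rho^2))
  # draw psi^(2) from its full conditional given the just-updated psi^(1)
  psi[j, 2] = rnorm(1, rho * psi[j, 1], sqrt(1 - rho^2))
}
cor(psi[1001:N, 1], psi[1001:N, 2])   # should be near rho = 0.6
```

Note how each component is updated using the most recent value of the other, exactly as in the algorithm: step 1.2 conditions on \(\psi_j^{(1)}\), not \(\psi_{j-1}^{(1)}\).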

Metropolis-Hastings Algorithm

A very flexible method to generate a Markov chain having a prescribed invariant distribution is provided by Metropolis–Hastings algorithm. The method is very general, since it allows us to generate the next state of the chain from an essentially arbitrary distribution: the invariance of the target distribution is then enforced by an accept/reject step.

  1. Initialize the starting point \(\psi_0\);

  2. for \(j=1, \ldots, N\) :

    2.1) generate a proposal \(\tilde{\psi}_j\) from \(q\left(\psi_{j-1}, \cdot\right)\);

    2.2) compute \(\alpha=\alpha\left(\psi_{j-1}, \tilde{\psi}_j\right)\) according to \(\alpha(\psi, \tilde{\psi})=\min \left\{1, \frac{\pi(\tilde{\psi}) q(\tilde{\psi}, \psi)}{\pi(\psi) q(\psi, \tilde{\psi})}\right\}\);

    2.3) generate an independent random variable \(U_j \sim \mathcal{B}ernoulli(\alpha)\);

    2.4) if \(U_j=1\) set \(\psi_j=\tilde{\psi}_j\), otherwise set \(\psi_j=\psi_{j-1}\).

This is how the Metropolis–Hastings algorithm works: the proposal \(\tilde{\psi}_j\) is accepted as the new state of the chain with probability \(\alpha\); otherwise the chain stays at \(\psi_{j-1}\).

NB. Here \(q\) is a density in its second argument (the “\(\cdot\)”), but it is parametrized by the first argument.

Comments:

  • The Gibbs sampler and Metropolis(–Hastings) algorithm are by no means competing approaches to Markov chain simulation: in fact, they can be combined and used together. When taking a Gibbs sampling approach, it may be infeasible, or simply not practical, to sample from one or more of the conditional distributions; in that case, those components can be updated with a Metropolis–Hastings step instead (a hybrid scheme often called Metropolis-within-Gibbs).

  • The M-H algorithm extends the Metropolis algorithm:

The Metropolis–Hastings algorithm generalizes the basic Metropolis algorithm: in M-H the proposal distribution need no longer be symmetric. When \(q\) is symmetric, \(q(\tilde{\psi}, \psi) = q(\psi, \tilde{\psi})\) cancels in the ratio and the acceptance probability reduces to \(\min\{1, \pi(\tilde{\psi})/\pi(\psi)\}\).
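The M-H steps above can be sketched in R (my own illustrative example, not from the notes) for a standard normal target with a random-walk proposal \(q(\psi, \cdot) = \mathcal{N}(\psi, \tau^2)\); since this proposal is symmetric, \(q\) cancels in \(\alpha\), so this is the basic Metropolis case.

```r
## Random-walk Metropolis for an (unnormalized) standard normal target.
set.seed(7)
target = function(psi) exp(-psi^2 / 2)   # unnormalized N(0,1) density
N = 50000; tau = 1                       # chain length, proposal std. dev.
psi = numeric(N)
psi[1] = 5                               # arbitrary starting point
for (j in 2:N) {
  prop  = rnorm(1, psi[j - 1], tau)                   # propose from q(psi, .)
  alpha = min(1, target(prop) / target(psi[j - 1]))   # acceptance probability
  U = rbinom(1, 1, alpha)                             # Bernoulli(alpha) draw
  psi[j] = if (U == 1) prop else psi[j - 1]           # accept or stay put
}
mean(psi[1001:N]); var(psi[1001:N])      # both should be near 0 and 1
```

Note that `target` only needs to be known up to a constant, since the normalizing constant cancels in the ratio \(\pi(\tilde{\psi})/\pi(\psi)\).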

Example: Mixture Distribution

Suppose we have observed data \(y_1, y_2, \ldots, y_{100}\) sampled independently and identically distributed from the mixture distribution

\[\delta\mathcal{N}(7, 0.5^2)+(1-\delta)\mathcal{N}(10, 0.5^2)\] Mixture densities are common in real-life applications where, for example, the data may come from more than one population. We will use M-H techniques to construct a chain whose stationary distribution equals the posterior density of \(\delta\), assuming a \(\mathcal{U}nif(0,1)\) prior distribution for \(\delta\). The data were generated with \(\delta=0.7\), so we should find that the posterior density is concentrated in this area.

## Mixture data
mixture.dat = read.table(file.choose(),header=TRUE) # from Givens & Hoeting book website, GH datasets/mixture.dat
y = mixture.dat$y

par(mfrow=c(1,1))
x=seq(5,14,by=.01)
d=.7*dnorm(x,7,.5) + .3*dnorm(x,10,.5) # mixture distribution. true delta = 0.7
hist(y,breaks=20,freq=FALSE,main="Histogram of mixture data",ylab="Density") 
points(x,d,type="l")

NOTES:

In the following, the density of the prior is uniform over the support hence not included in the computation of the Metropolis-Hastings ratio.

## INITIAL VALUES 
n = 10000 # number of iterations
x.val1 = NULL
x.val2=NULL
set.seed(0)

## FUNCTIONS
f = function(x){prod(x*dnorm(y,7,0.5) + (1-x)*dnorm(y,10,0.5))} # likelihood of delta = x given the data y.
# NB f is also proportional to the target posterior density, since the prior is Uniform(0,1).
R = function(xt,x){f(x)*g(xt)/(f(xt)*g(x))} # M-H ratio for current state xt and proposal x.

## MAIN
# BETA(1,1) PROPOSAL DENSITY. # i.e., Uniform(0,1).
g = function(x){dbeta(x,1,1)} # proposal density.
x.val1[1] = rbeta(1,1,1) # initialize x, prior beta(1,1)
for(i in 1:n){
      xt = x.val1[i] # current state
      x = rbeta(1,1,1) # propose for the next iteration.
      p = min(R(xt,x),1) 
      d = rbinom(1,1,p)# accept proposal x with probability p = min(ratio, 1). Bernoulli(p)
      x.val1[i+1] = x*d + xt*(1-d) # d is binary. If d==1, accept proposal x, o.w., stay with xt.
}
mean(x.val1[201:(n+1)]) # ergodic mean
[1] 0.6945711
par(mfrow=c(2,2))
plot(x.val1[201:(n+1)],ylim=c(0,1),type="l",ylab="delta",xlab="t") # after burn-in
title("Sample path for Beta(1,1) Proposal Dist.")

hist(x.val1[201:(n+1)],breaks=20,xlab="delta",
      main="Hist. for Beta(1,1) Proposal Dist.")

# BETA(2,10) PROPOSAL DENSITY
g = function(x){dbeta(x,2,10)}
x.val2[1] = rbeta(1,2,10)
for(i in 1:n){
      xt = x.val2[i]
      x = rbeta(1,2,10)
      p = min(R(xt,x),1)
      d = rbinom(1,1,p)
      x.val2[i+1] = x*d + xt*(1-d)
}
mean(x.val2[201:(n+1)])
[1] 0.6635089
plot(x.val2[201:(n+1)],ylim=c(0,1),type="l",ylab="delta",xlab="t")
title("Sample path for Beta(2,10) Proposal Dist.")

hist(x.val2[201:(n+1)],breaks=20,xlab="delta",
      main="Hist. for Beta(2,10) Proposal Dist.")

Example: Fur Seal Capture-Recapture Study

Our goal is to estimate the number of pups in a fur seal colony using a capture–recapture approach. In such studies, separate repeated efforts are made to count a population of unknown size. In our case, the population to be counted is the population of pups. No single census attempt is likely to provide a complete enumeration of the population, nor is it even necessary to try to capture most of the individuals. The individuals captured during each census are released with a marker indicating their capture. A capture of a marked individual during any subsequent census is termed a recapture. Population size can be estimated on the basis of the history of capture and recapture data. High recapture rates suggest that the true population size does not greatly exceed the total number of unique individuals ever captured.

Let \(N\) be the unknown population size to be estimated using \(I\) census attempts yielding total numbers of captures (including recaptures) equaling \(c = (c_1,\ldots, c_I)\). We assume that the population is closed during the period of the sampling, which means that deaths, births, and migrations are inconsequential during this period. The total number of distinct animals captured during the study is denoted by \(r\). We consider a model with separate, unknown capture probabilities for each census effort, \(\boldsymbol{\alpha}=(\alpha_1, \ldots, \alpha_I).\) This model assumes that all animals are equally catchable on any one capture occasion, but capture probabilities may change over time.

\[ \begin{aligned} \begin{array}{|c|c|c|c|c|c|c|c|c|} \hline \text { Census Attempt }& i & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ \hline \text { Number captured } & c_i & 30 & 22 & 29 & 26 & 31 & 32 & 35 \\ \hline \text { Number newly caught } & m_i & 30 & 8 & 17 & 7 & 9 & 8 & 5 \\ \hline \end{array} \end{aligned} \]

The likelihood for this model is

\[ L(N, \alpha \mid \mathbf{c}, r) \propto \frac{N !}{(N-r) !} \prod_{i=1}^I \alpha_i^{c_i}\left(1-\alpha_i\right)^{N-c_i} \]

It is reasonable to assume the population of pups was closed during the study period. For estimation, one might adopt a Bayesian framework where \(N\) and \(\alpha\) are assumed to be a priori independent with the following priors:

For the unknown population size we use an improper uniform prior, \(f(N)\propto 1\). For the capture probabilities, we use independent priors \[\alpha_i \mid \theta_1, \theta_2 \sim \mathrm{Beta}(\theta_1, \theta_2)\] for \(i=1, \ldots, 7\), with \(\theta_1=\theta_2=1/2\) (the Jeffreys prior), which gives the conditional posteriors used below.

A Gibbs sampler can then be constructed by simulating from the conditional posterior distributions

\[ \begin{array}{r} N^{(t+1)}-84 \mid \cdot \sim \operatorname{NegBin}\left(85,1-\prod_{i=1}^7\left(1-\alpha_i^{(t)}\right)\right), \\ \alpha_i^{(t+1)} \mid \cdot \sim \operatorname{Beta}\left(c_i+\frac{1}{2}, N^{(t+1)}-c_i+\frac{1}{2}\right) \end{array} \]

# ci      = number captured during each census
# mi      = number newly caught during each census
# r       = number of unique fur seal pups observed
# I       = number of census attempts
# N       = estimated population size
# alpha   = estimated capture probabilities for each census
# num.its = number of iterations
# burn.in = iterations to discard as burn-in

## INITIAL VALUES
ci = c(30,22,29,26,31,32,35)  
mi = c(30,8,17,7,9,8,5)
r = sum(mi)
I=7

num.its=100000
alpha= matrix(0,num.its,I)
N=rep(0,num.its)
N[1]=sample(84:500,1) 
set.seed(1234)
burn.in = 1:1000

#MAIN
for (i in 2:num.its) {
   alpha[i,]=rbeta(I,ci+.5,N[i-1]-ci+.5)
   N[i]=rnbinom(1,r+1,1-prod(1-alpha[i,]))+r
}

# DISCARDING BURN-IN
alpha.out = alpha[-burn.in,]
N.out = N[-burn.in]


## OUTPUT
mean(N.out)      # POSTERIOR MEAN OF N
[1] 89.47519
#MANUAL COMPUTATION OF AN HPD INTERVAL

table(N.out)
N.out
   84    85    86    87    88    89    90    91    92    93    94    95    96 
  917  3697  7999 12188 14696 14619 13371 10324  7727  5233  3486  2140  1207 
   97    98    99   100   101   102   103   104   105   108 
  674   367   157   101    46    26    12    10     2     1 
sum(N.out>=85 & N.out<=95)/length(N.out)
[1] 0.9644444
sum(N.out>=85 & N.out<=94)/length(N.out)
[1] 0.9428283
sum(N.out>=84 & N.out<=93)/length(N.out)
[1] 0.9168788
sum(N.out>=84 & N.out<=94)/length(N.out)
[1] 0.9520909
## OUTPUT PLOTS
plot((max(burn.in)+1):num.its,N.out,type="l",
      xlab="t",ylab="N",main="Sample path for N")

boxplot(split(rowMeans(alpha.out),N.out),ylab="Mean Capture Probability",
      xlab="N",main="Split boxplot for seal pup example")

hist(N.out,freq=FALSE,xlab="N",
      main="Estimated marginal posterior probabilities for N")