Bayesian Inference

Author

Jiaye Xu

Published

March 21, 2023

Bayesian Fundamentals

A Heuristic Example of Bayes' Theorem:

Given a personality description, guess whether this person is a librarian or a farmer:

Info 1: given the personality description, the probability that a librarian fits it is 40%, while the probability that a farmer fits it is 10%; that is, \(Pr(P | L)=0.4\) and \(Pr(P | F)=0.1\).

Info 2: in the population, the ratio of farmers to librarians is 20:1.

The posterior probability that the person is a librarian, given the personality description, is calculated with Bayes' formula:

\[Pr(L | P) = \frac{Pr(P | L) Pr(L)}{ Pr(P)}\]

and

\[Pr(P) = Pr(P | L) Pr(L) + Pr(P | F) Pr(F)\] where \(L\) is the event that this sampled person is a librarian, \(P\) is the event that this person has a certain personality, like meek and organized. \(F\) means the sampled person is a farmer.

Prior: \(Pr(L)=1/21\approx 0.0476\)

Likelihood: \(Pr(P | L)=0.4\)

Posterior: \(Pr(L | P)\)

Therefore, \[Pr(L | P) = \frac{0.4 \times (1/21)}{0.4 \times (1/21) + 0.1 \times (20/21)} \approx 0.1667.\]
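The same calculation can be reproduced in a few lines of R; the sketch below simply plugs the prior from Info 2 and the two likelihoods from Info 1 into the formula above.

prior_L <- 1 / 21                    ### Pr(L): one librarian per 20 farmers
prior_F <- 20 / 21                   ### Pr(F)
lik_L   <- 0.4                       ### Pr(P | L), from Info 1
lik_F   <- 0.1                       ### Pr(P | F), from Info 1

evidence <- lik_L * prior_L + lik_F * prior_F    ### Pr(P), law of total probability
lik_L * prior_L / evidence                       ### Pr(L | P), approx. 0.1667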

NB: New evidence does not determine your beliefs in a vacuum; it should update your prior beliefs.

In general, the formula is

\[Pr(H | E) = \frac{Pr(E | H) Pr(H)} {Pr(E)}\]

where \(H\) denotes the hypothesis and \(E\) denotes the evidence.

Bayesian Updating

Example: Flipping an Unfair Coin

Suppose we flip a coin up to \(200\) times. Unknown to us, the coin is far from fair – it is four times as likely to produce heads as tails, that is, \(\pi=0.8\). We gradually learn about this by flipping the coin repeatedly, keeping track of the number of flips \(n\) and the number of heads \(k\) after each flip. This process of revising our beliefs after each new observation is called Bayesian updating.

The code below implements this experiment. It simulates a series of \(n=200\) coin flips, records the number of heads \(k_i\) after the \(i\)th flip, and computes the analytical posterior mean along with its 95% credible interval at every step.

Simulating the experiment:

set.seed(123)                       ### set seed for replicability
a <- b <- 5                          ### hyperparameters
n <- 200                             ### num. of coin flips
pi_true <- .8                        ### true parameter
data <- rbinom(n, 1, pi_true)        ### n coin flips
posterior <- matrix(NA, 3L, n)       ### matrix container for posterior

for (i in seq_len(n)) {    
  current.sequence <- data[1:i]      ### sequence up until ith draw
  k <- sum(current.sequence)         ### number of heads in current sequence
  
  ##### Updating
  a.prime <- a + k               
  b.prime <- b + i - k
  
  ### Analytical means and credible intervals
  posterior[1, i] <- a.prime / (a.prime + b.prime)
  posterior[2, i] <- qbeta(0.025, a.prime, b.prime)
  posterior[3, i] <- qbeta(0.975, a.prime, b.prime)
}

## Plot
plot(                                ### set up empty plot with labels
  1:n, 1:n,
  type = 'n',
  xlab = "Number of Coin Flips",
  ylab = expression(paste("Posterior Means of ", pi, " (with 95% Credible Intervals)", sep = " ")),
  ylim = c(0, 1),
  xlim = c(1, n)
)
abline(                              ### reference lines: prior mean (0.5) and true pi (0.8)
  h = c(0.5, 0.8),
  col = "gray80"
)
rect(-0.5, qbeta(0.025, 5, 5),       ### prior 95% interval at i = 0
     0.5, qbeta(0.975, 5, 5),
     col = adjustcolor('red', 0.4),
     border = adjustcolor('red', 0.2))
segments(-0.5, 0.5,                  ### prior mean at i = 0
         0.5, 0.5,
         col = adjustcolor('red', 0.9),
         lwd = 1.5)
polygon(                             ### posterior intervals
  c(seq_len(n), rev(seq_len(n))),
  c(posterior[2, ], rev(posterior[3, ])),
  col = adjustcolor('blue', .4),
  border = adjustcolor('blue', .2)
)
lines( ### posterior means
  seq_len(n),
  posterior[1, ],
  col = adjustcolor('blue', .9),
  lwd = 1.5
)

The plot above shows the prior distribution with its 95% credible interval at \(i=0\) (in red) and the updated posterior distributions with their 95% credible intervals at every coin flip \(i=1,\ldots,n\). As we can see, even after just a couple of coin flips, the posterior distribution departs from the center of gravity of the prior distribution and converges toward the proportion of heads, \(k/n\), in the data. After \(n=200\) coin flips, we have \(k=161\) heads, a proportion of \(0.805\). By chance, this happens to be higher than the underlying true parameter value. After 200 flips, our posterior mean and its corresponding 95% interval are \(0.8\) and \((0.743, 0.851)\), which shows how strongly the likelihood has come to dominate the prior.

Analytical Solutions – Bayesian Essentials

Contrary to the frequentist paradigm, Bayesian methods treat data as known and fixed, whereas parameters are considered unknown random quantities. Bayesian statisticians express their knowledge about parameters distributionally.

Likelihood function

This involves stipulating a data-generating process: we specify a probability density function (pdf) or probability mass function (pmf) governed by a (set of) parameter(s) \(\theta\), commonly denoted \(p(y \mid\theta)\). Given \(\theta\), it specifies the relative likelihood of observing a given value \(y\).

Treating the observed values \(y\) as given yields the logical inversion: which unknown \(\theta\) is most likely to have produced the known data \(y\)? This is what we call the likelihood function, often denoted \(L(\theta \mid y)\).

In practice, Bayesian practitioners often use the notations \(p(y \mid\theta)\) and \(L(\theta \mid y)\) interchangeably. Here, we will stick to the former.

In the flipping-updating example:

Given a fixed number of flips, \(k\) is generated according to a binomial distribution governed by the parameter \(\pi\):

\[k\sim Binomial(n,\pi).\]

The parameter \(\pi\), then, is our unknown quantity of interest.
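As a small illustration, the binomial likelihood can be evaluated over a grid of candidate values for \(\pi\); the sketch below uses dbinom() together with the counts \(n=200\) and \(k=161\) reported for the coin-flip experiment above (the grid size is an arbitrary choice).

n_flips <- 200                                  ### number of flips
k_heads <- 161                                  ### observed number of heads (from above)
pi_grid <- seq(0, 1, length.out = 1001)         ### candidate values for pi
likelihood <- dbinom(k_heads, size = n_flips, prob = pi_grid)   ### p(k | pi) on the grid
pi_grid[which.max(likelihood)]                  ### grid maximum at k/n = 0.805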

Prior distribution

The prior distribution is a distributional characterization of our belief about a parameter prior to seeing the data. We denote the corresponding probability function as \(p(\pi)\). Specifying a prior distribution includes statements about the distribution’s family, density, and support.

In the flipping-updating example:

\[\pi \sim Beta(\alpha, \beta).\]

An uncertain quantity \(\theta\), known to be between \(0\) and \(1\), has a \(Beta(\alpha, \beta)\) distribution if

\[ p(\theta)=\operatorname{dbeta}(\theta, \alpha, \beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \theta^{\alpha-1}(1-\theta)^{\beta-1} \quad \text { for } 0 \leq \theta \leq 1 \]

For such a random variable,

\[ \begin{aligned} & \operatorname{mode}[\theta]=(\alpha-1) /[(\alpha-1)+(\beta-1)] \text { if } \alpha>1 \text { and } \beta>1 \\ & \mathrm{E}[\theta]=\alpha /(\alpha+\beta) \\ & \operatorname{Var}[\theta]=\alpha \beta /\left[(\alpha+\beta+1)(\alpha+\beta)^2\right]=\mathrm{E}[\theta] \times \mathrm{E}[1-\theta] /(\alpha+\beta+1) \end{aligned} \]
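These quantities are easy to verify numerically; the sketch below compares the analytical mean and variance of the \(Beta(5, 5)\) prior used above with Monte Carlo estimates based on rbeta() draws (the number of draws is an arbitrary choice).

a <- b <- 5                                    ### hyperparameters used above
c(mean = a / (a + b),                          ### analytical mean: 0.5
  var  = a * b / ((a + b + 1) * (a + b)^2))    ### analytical variance: ~0.0227

set.seed(123)
draws <- rbeta(1e5, a, b)                      ### Monte Carlo draws from Beta(5, 5)
c(mean = mean(draws), var = var(draws))        ### should be close to the analytical values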

len.pi <- 1001L                      ### number of candidate values for pi
pi <- seq(0, 1, length.out = len.pi) ### candidate values for pi
a <- b <- 5                         ### hyperparameters
prior <- dbeta(pi, a, b)             ### prior distribution

## Plot
plot(                                ### set up empty plot, specify labels
  pi, prior,
  type = 'n',
  xlab = expression(pi),
  ylab = expression(paste("Density of the Prior Distribution for ", pi))
)
polygon(                             ### draw density distribution
  x = c(pi, rev(pi)),
  y = c(prior, rep(0, length(pi))),
  col = adjustcolor('red', alpha.f = .4),
  border = NA
)
abline(                              ### mark the prior mean at pi = 0.5
  v = .5,
  col = 'white'
)

Posterior distribution

Following the proportional version of Bayes' Law, combining the prior with the likelihood yields the posterior distribution, i.e., our distributional belief about \(\theta\) given the data: \[p(\theta \mid \mathbf y)\propto p(\theta)\times p(\mathbf y \mid \theta )\]

In the flipping-updating example:

\[\pi\mid k\sim Beta(\alpha^\prime=\alpha+k,\ \beta^\prime=\beta+n-k)\]

NB. The beta distribution is conjugate to the binomial likelihood.

Conjugate

A class \(\mathcal P\) of prior distributions for \(\theta\) is called conjugate for sampling model \(p(y \mid \theta)\) if \[ p(\theta) \in \mathcal{P} \Rightarrow p(\theta \mid y) \in \mathcal{P} \text {. } \]
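For the Beta-Binomial pair, conjugacy can be checked numerically: renormalizing the product prior \(\times\) likelihood on a grid recovers the \(Beta(\alpha+k,\ \beta+n-k)\) density. A minimal sketch, reusing the hyperparameters and the counts from the coin-flip example:

a <- b <- 5; n <- 200; k <- 161              ### hyperparameters and data from above
pi_grid    <- seq(0, 1, length.out = 1001)   ### grid of candidate values for pi
prior      <- dbeta(pi_grid, a, b)           ### p(pi)
likelihood <- dbinom(k, n, pi_grid)          ### p(k | pi)
unnorm     <- prior * likelihood             ### unnormalized posterior
grid_post  <- unnorm / sum(unnorm) * (length(pi_grid) - 1)   ### renormalize to a density
conj_post  <- dbeta(pi_grid, a + k, b + n - k)               ### Beta(alpha', beta') density
max(abs(grid_post - conj_post))              ### should be near zero (numerical error only)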

Markov chain Monte Carlo methods

In Bayesian inference, it is very often the case that the posterior distribution of the parameters, denoted here by \(\psi\), is analytically intractable. The standard practice is to resort to simulation methods – Markov chain Monte Carlo (MCMC) methods.

For a sufficiently large number of retained draws \(N\), a law of large numbers holds for certain Markov chains \((\psi_t)_{t \geq 1}\) with invariant distribution \(\pi\): after discarding the first \(M\) draws as burn-in, one has the approximation

\[ \mathrm{E}_\pi(g(\psi)) \approx N^{-1} \sum_{j=1}^N g\left(\psi_{M+j}\right) . \] In practice it is important to determine how large \(M\) should be, i.e., how many iterations of a simulated Markov chain are to be considered burn-in and discarded in the calculation of ergodic averages in the above equation.
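As a preview of how this approximation is used in practice, the sketch below discards the first \(M\) draws of a simulated chain as burn-in and averages the remaining \(N\) draws. The chain is generated by a simple random-walk Metropolis sampler targeting the \(Beta(166, 44)\) posterior from the coin-flip example; it is used here purely as a placeholder, and the proposal scale and chain lengths are arbitrary choices.

set.seed(123)
M <- 1000                                    ### burn-in iterations to discard
N <- 10000                                   ### retained iterations for the average
log_target <- function(psi) dbeta(psi, 166, 44, log = TRUE)    ### log posterior density

psi <- numeric(M + N)                        ### container for the chain
psi[1] <- 0.5                                ### arbitrary starting value
for (t in 2:(M + N)) {
  proposal  <- psi[t - 1] + rnorm(1, 0, 0.05)                  ### random-walk proposal
  log_ratio <- log_target(proposal) - log_target(psi[t - 1])   ### log acceptance ratio
  psi[t]    <- if (log(runif(1)) < log_ratio) proposal else psi[t - 1]
}

mean(psi[(M + 1):(M + N)])                   ### ergodic average after burn-in
166 / 210                                    ### analytical posterior mean, for comparison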

In the following lectures, we present the most popular MCMC algorithms for simulating from a given distribution \(\pi\).