1. Specify a joint distribution for the outcome(s) and all the unknowns, which typically takes the form of a marginal prior distribution for the unknowns multiplied by a likelihood for the outcome(s) conditional on the unknowns. This joint distribution is proportional to a posterior distribution of the unknowns conditional on the observed data.
2. Draw from the posterior distribution using Markov Chain Monte Carlo (MCMC).
3. Evaluate how well the model fits the data and possibly revise the model.
4. Draw from the posterior predictive distribution of the outcome(s) given interesting values of the predictors in order to visualize how a manipulation of a predictor affects (a function of) the outcome(s).
NOTE:
Step 1 is necessarily model-specific. It is somewhat more involved than the corresponding first step of a frequentist analysis, which only requires that the likelihood of the outcome be specified. However, the default priors in the rstanarm package should work well in the majority of cases (see https://mc-stan.org/rstanarm/articles/priors.html).
Steps 2, 3, and 4 are largely not specific to how the joint distribution in Step 1 is specified.
The key concept in Step 3 and Step 4 is the posterior predictive distribution, which is the distribution of the outcome implied by the model after having used the observed data to update our beliefs about the unknown parameters.
Frequentists, by definition, have no posterior predictive distribution and frequentist predictions are subtly different from what applied researchers seek. Maximum likelihood estimates (MLE) do not condition on the observed outcome data and so the uncertainty in the estimates pertains to the variation in the sampling distribution of the estimator, i.e. the distribution of estimates that would occur if we could repeat the process of drawing a random sample from a well-defined population and apply the estimator to each sample. It is possible to construct a distribution of predictions under the frequentist paradigm but it evokes the hypothetical of repeating the process of drawing a random sample, applying the estimator each time, and generating point predictions of the outcome.
In contrast, the posterior predictive distribution conditions on the observed outcome data in hand to update beliefs about the unknowns and the variation in the resulting distribution of predictions reflects the remaining uncertainty in our beliefs about the unknowns.
Step 1: Specify a posterior distribution
For the sake of discussion, we need some posterior distribution to draw from. We will utilize an example from the HSAUR3 package by Brian S. Everitt and Torsten Hothorn, which is used in their 2014 book A Handbook of Statistical Analyses Using R (3rd Edition) (Chapman & Hall / CRC). This book is frequentist in nature and we will show how to obtain the corresponding Bayesian results.
The model in section 6.3.2 pertains to whether a survey respondent agrees or disagrees with a conservative statement about the role of women in society, which is modeled as a function of the gender and education of the respondents. The posterior distribution — with independent priors — can be written as
\[
f(\alpha,\beta_1,\beta_2 \vert \mathbf y, \mathbf X) \propto f(\alpha)f(\beta_1)f(\beta_2)\times \prod_{i=1}^J g^{-1}(\eta_i)^{y_i}\left(1-g^{-1}(\eta_i)\right)^{n_i-y_i},
\]where \(\eta_i=\alpha+\beta_1\text{Education}_i + \beta_2\text{Female}_i\) is the linear predictor and a function of an intercept (\(\alpha\)), a coefficient on the years of education (\(\beta_1\)), and an intercept-shift (\(\beta_2\)) for the case where the respondent is female. These data are organized such that \(y_i\) is the number of respondents who agree with the statement among those who have the same level of education and the same gender, and \(n_i-y_i\) is the number of such people who disagree with the statement. The inverse link function, \(p=g^{-1}(\eta_i)\), for a binomial likelihood can be one of several Cumulative Distribution Functions (CDFs), but in this case it is the standard logistic CDF, \(g^{-1}(\eta_i)=\frac{1}{1+e^{-\eta_i}}\).
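As a quick check of this formula, the standard logistic CDF is available in R as plogis, which agrees with the expression above by definition:

eta <- 0.5
plogis(eta)          # 0.6224593
1 / (1 + exp(-eta))  # identical, by the definition of the logistic CDF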
Suppose we believe — prior to seeing the data — that \(\alpha\), \(\beta_1\), and \(\beta_2\) are probably close to zero, are as likely to be positive as they are to be negative, but have a small chance of being quite far from zero. These beliefs can be represented by Student \(t\) distributions with a few degrees of freedom in order to produce moderately heavy tails. In particular, we will specify seven degrees of freedom. Note that these purported beliefs may well be more skeptical than your actual beliefs, which are probably that women and people with more education have less conservative societal views.
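To see what "moderately heavy tails" means here, one can compare the tail mass of a Student \(t\) with seven degrees of freedom to that of a standard normal (a quick illustration, not part of the model code):

2 * pt(-3, df = 7)  # P(|t_7| > 3), roughly 0.02
2 * pnorm(-3)       # P(|Z| > 3), roughly 0.003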
Note on “prior beliefs” and default priors
We use the term “prior beliefs” to refer in generality to the information content of the prior distribution (conditional on the model). Sometimes previous research on the topic of interest motivates beliefs about model parameters, but other times such work may not exist or several studies may make contradictory claims.
Regardless, we nearly always have some knowledge that should be reflected in our choice of prior distributions. For example, no one believes a logistic regression coefficient will be greater than five in absolute value if the predictors are scaled reasonably. You may also have seen examples of so-called “non-informative” (or “vague”, “diffuse”, etc.) priors like a normal distribution with a variance of 1000. When data are reasonably scaled, these priors are almost always a bad idea for various reasons (they give non-trivial weight to extreme values, reduce computational efficiency, etc).
The default priors in rstanarm are designed to be weakly informative, by which we mean that they avoid placing unwarranted prior weight on nonsensical parameter values and provide some regularization to avoid overfitting, but also do allow for extreme values if warranted by the data. If additional information is available, the weakly informative defaults can be replaced with more informative priors.
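Once a model has been fit (as in Step 2 below), the priors that were actually used — whether defaults or user-specified — can be inspected with rstanarm's prior_summary function; a sketch, assuming the womensrole_bglm_1 fit from Step 2:

prior_summary(womensrole_bglm_1)  # priors on the intercept and coefficients, after any rescaling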
Step 2: Draw from the posterior distribution
The likelihood for the sample is just the product over the \(J\) groups of \[g^{-1}(\eta_i)^{y_i}\left(1-g^{-1}(\eta_i)\right)^{n_i-y_i},\]
which can be maximized over \(\alpha\), \(\beta_1\), and \(\beta_2\) to obtain frequentist estimates by calling glm:
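A sketch of that call, with the data and formula taken from the model summary shown below (the object name womensrole_glm_1 is ours):

data("womensrole", package = "HSAUR3")
womensrole_glm_1 <- glm(cbind(agree, disagree) ~ education + gender,
                        data = womensrole, family = binomial(link = "logit"))
round(coef(summary(womensrole_glm_1)), 3)  # estimates, standard errors, z- and p-values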
The p-value for the null hypothesis that \(\beta_1=0\) is very small, while the p-value for the null hypothesis that \(\beta_2=0\) is very large. However, frequentist p-values are awkward because they do not pertain to the probability that a scientific hypothesis is true but rather to the probability of observing a \(z\)-statistic that is so large (in magnitude) if the null hypothesis were true. The desire to make probabilistic statements about a scientific hypothesis is one reason why many people are drawn to the Bayesian approach.
A model with the same likelihood but Student \(t\) priors with seven degrees of freedom can be specified using the rstanarm package in a similar way by prepending stan_ to the glm call and specifying priors (and optionally the number of cores on your computer to utilize):
library(rstanarm)
Loading required package: Rcpp
This is rstanarm version 2.32.1
- See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
- Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
- For execution on a local, multicore CPU with excess RAM we recommend calling
  options(mc.cores = parallel::detectCores())
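The fitting call itself looks along these lines; the formula, family, and Student \(t\) priors with seven degrees of freedom come from the discussion above, while the prior scale of 5 and the seed are illustrative choices:

womensrole_bglm_1 <- stan_glm(cbind(agree, disagree) ~ education + gender,
                              data = womensrole,
                              family = binomial(link = "logit"),
                              prior = student_t(df = 7, location = 0, scale = 5),  # scale is illustrative
                              prior_intercept = student_t(df = 7, location = 0, scale = 5),
                              cores = 2, seed = 12345)  # seed fixed only for reproducibility
womensrole_bglm_1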
stan_glm
family: binomial [logit]
formula: cbind(agree, disagree) ~ education + gender
observations: 42
predictors: 3
------
Median MAD_SD
(Intercept) 2.5 0.2
education -0.3 0.0
genderFemale 0.0 0.1
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
As can be seen, the “Bayesian point estimates” — which are represented by the posterior medians — are very similar to the maximum likelihood estimates. Frequentists would ask whether the point estimate is greater in magnitude than double the estimated standard deviation of the sampling distribution. But here we simply have estimates of the standard deviation of the marginal posterior distributions, which are based on a scaling of the Median Absolute Deviation (MAD) from the posterior medians to obtain a robust estimator of the posterior standard deviation. In addition, we can use the posterior_interval function to obtain a Bayesian uncertainty interval for \(\beta_1\):
ci95 <- posterior_interval(womensrole_bglm_1, prob = 0.95, pars = "education")
round(ci95, 2)
2.5% 97.5%
education -0.3 -0.24
Unlike frequentist confidence intervals — which are not interpretable in terms of post-data probabilities — the Bayesian uncertainty interval indicates we believe after seeing the data that there is a \(0.95\) probability that \(\beta_1\) is between \(-0.3\) and \(-0.24\). Alternatively, we could say that there is essentially zero probability that \(\beta_1>0\), although frequentists cannot make such a claim coherently.
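That posterior probability can be computed directly from the draws; a minimal sketch, using the fact that as.matrix extracts the posterior draws from a stanreg object:

draws <- as.matrix(womensrole_bglm_1)  # posterior draws, one column per parameter
mean(draws[, "education"] > 0)         # proportion of draws with beta_1 > 0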
Many of the post-estimation methods that are available for a model that is estimated by glm are also available for a model that is estimated by stan_glm. For example,
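the familiar glm-style extractors can be applied directly to the fitted object. A sketch using the womensrole_bglm_1 fit from above (se is rstanarm's extractor for the MAD_SD values shown earlier):

cbind(Median = coef(womensrole_bglm_1), MAD_SD = se(womensrole_bglm_1))
summary(residuals(womensrole_bglm_1))  # response residuals, not deviance residuals
cov2cor(vcov(womensrole_bglm_1))       # correlation matrix of the posterior draws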
rstanarm does provide a confint method, but it is reserved for computing confidence intervals when the user elects to estimate a model by (penalized) maximum likelihood. When using full Bayesian inference (the rstanarm default) or approximate Bayesian inference, the posterior_interval function should be used to obtain Bayesian uncertainty intervals.