Introduction to Bayesian Inference

Bayesian inference is fundamentally about updating what we know as we gather more information. Unlike frequentist approaches that treat parameters as fixed but unknown values, Bayesian methods represent our uncertainty about parameters as probability distributions, and we update these distributions as data comes in.

The Core Principle: Bayes’ Theorem

At the heart of Bayesian statistics lies Bayes’ theorem, which formalizes how we update our beliefs about parameters θ given observed data D:

\[P(\theta \mid D) = \frac{P(D \mid \theta) \cdot P(\theta)}{P(D)}\]

The elegance of Bayes’ theorem lies in its intuitive structure: the posterior balances what the data tell us (the likelihood) against what we believed beforehand (the prior).
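
To make the update concrete, here is a minimal sketch using the textbook Beta-Bernoulli model, one of the few cases where the posterior has a closed form (the coin-flip numbers are made up for illustration):

```python
# Beta-Bernoulli model: one of the few cases where Bayes' theorem has a
# closed-form answer. A Beta(a, b) prior on a coin's bias, updated with
# k heads in n flips, yields a Beta(a + k, b + n - k) posterior.

def update_beta(a, b, heads, flips):
    """Return the posterior Beta parameters after observing coin flips."""
    return a + heads, b + (flips - heads)

# Uniform prior Beta(1, 1), then observe 7 heads in 10 flips (made-up data).
a_post, b_post = update_beta(1, 1, heads=7, flips=10)   # Beta(8, 4)
post_mean = a_post / (a_post + b_post)                  # posterior mean of the bias
```

Conjugate pairs like this sidestep the integral entirely, which is exactly why the general case below is so much harder.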

The Fundamental Challenge

The catch is that computing the posterior requires the evidence:

\[P(D) = \int P(D \mid \theta) \cdot P(\theta) \, d\theta\]

For most real-world problems, this integral is intractable. Why? Parameter spaces routinely have hundreds or thousands of dimensions. Posteriors are often multimodal with complex correlations. Many models lack conjugate priors, so no closed-form solution exists. And some models require running a simulation or solving differential equations just to evaluate the likelihood once.

This is where computational methods become essential—we can’t compute the posterior analytically, so we need to approximate it.
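
A toy illustration: in one dimension the evidence can be brute-forced by quadrature on a grid, which also makes the scaling problem obvious, since the number of grid points grows exponentially with dimension (the coin-flip data here are made up):

```python
import numpy as np

# Brute-force quadrature in one dimension: tractable here, but the grid
# size grows exponentially with the number of parameters.
theta = np.linspace(0.0, 1.0, 1001)       # grid over a coin's bias
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)               # uniform prior density on [0, 1]
likelihood = theta**7 * (1 - theta)**3    # 7 heads, 3 tails (made-up data)

evidence = np.sum(likelihood * prior) * dtheta   # P(D) by quadrature
posterior = likelihood * prior / evidence        # normalized posterior density
```

With ten parameters instead of one, the same grid would need $1001^{10}$ evaluations, which is the intractability in a nutshell.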

Why Sampling?

Instead of computing $P(\theta \mid D)$ directly, modern Bayesian inference relies on Monte Carlo sampling: we generate samples $\{\theta_1, \theta_2, \ldots, \theta_N\}$ distributed according to the posterior.

With enough samples, we can:

- estimate posterior means and variances as simple sample averages,
- read off credible intervals from sample quantiles,
- marginalize by simply ignoring the other coordinates of each sample, and
- propagate uncertainty through any derived quantity by evaluating it on each sample.

The beauty of sampling is its generality—the same framework works whether you’re fitting a simple linear model or simulating galaxy formation.
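
Given an array of posterior draws, these summaries are one-liners. The sketch below fakes the draws with a known normal distribution purely to stand in for a sampler's output:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for posterior draws: in practice these come from a sampler.
samples = rng.normal(loc=2.0, scale=0.5, size=100_000)

post_mean = samples.mean()                             # E[theta | D]
ci_low, ci_high = np.percentile(samples, [2.5, 97.5])  # 95% credible interval
prob_positive = (samples > 0).mean()                   # P(theta > 0 | D)
```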

The Sampling Zoo: Different Algorithms for Different Problems

There’s no universal best sampler. Each method trades off efficiency, scalability, robustness, and ease of use differently. Some excel in high dimensions but struggle with multimodality. Others naturally estimate the evidence but are slower for pure parameter inference. Understanding these trade-offs is key to choosing the right tool.

Below are several sampling algorithms I’ve worked with extensively, complete with interactive demonstrations showing when each method shines and when it struggles.

Markov Chain Monte Carlo (MCMC)

The foundation of computational Bayesian inference.

MCMC builds a Markov chain whose stationary distribution is the posterior we want to sample. The classic Metropolis-Hastings algorithm is simple: propose a new parameter value, then accept or reject based on the posterior density ratio.

This works well for low-dimensional problems (up to ~10-20 dimensions) with smooth, unimodal posteriors. It’s also incredibly valuable pedagogically—MCMC is where most people build intuition about sampling.

The key concepts:

- Proposal scaling: the proposal width controls the acceptance rate and the trade-off between tiny accepted steps and frequent rejections.
- Burn-in: early samples reflect the starting point rather than the posterior and are discarded.
- Autocorrelation and effective sample size: successive samples are correlated, so N chain steps carry far less information than N independent draws.

While not competitive for high-dimensional or complex problems, MCMC is essential for understanding why more sophisticated methods exist.
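
A minimal sketch of random-walk Metropolis, assuming a user-supplied unnormalized log posterior (the step size and toy target below are placeholders, not recommendations):

```python
import numpy as np

def metropolis(log_post, theta0, n_steps=10_000, step=0.5, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal.

    log_post returns the *unnormalized* log posterior, which is all the
    acceptance ratio needs: the evidence P(D) cancels out.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    current_lp = log_post(theta)
    chain = np.empty((n_steps, theta.size))
    for i in range(n_steps):
        proposal = theta + step * rng.standard_normal(theta.size)
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, posterior ratio), in log space.
        if np.log(rng.random()) < proposal_lp - current_lp:
            theta, current_lp = proposal, proposal_lp
        chain[i] = theta
    return chain

# Toy target: a standard normal log posterior, started away from the mode.
chain = metropolis(lambda t: -0.5 * np.sum(t**2), theta0=[3.0])
```

Note that only the posterior density *ratio* appears in the accept step, which is why the intractable evidence never needs to be computed.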

➡️ Explore the MCMC sampler

Includes: trace plots, autocorrelation analysis, effective sample size computation, proposal scaling experiments


Hamiltonian Monte Carlo (HMC)

Geometry-aware sampling using gradient information.

HMC treats sampling as physics: imagine a frictionless particle sliding through the log-posterior landscape. By using gradient information, it proposes distant moves that follow the posterior’s geometry rather than wandering randomly.

Why is this so effective?

- Gradients steer proposals along the posterior’s geometry instead of letting them diffuse randomly.
- Long simulated trajectories produce distant proposals that are still accepted with high probability.
- As a result, mixing degrades far more slowly with dimension than for random-walk methods.

HMC handles high-dimensional smooth posteriors with ease—hundreds or thousands of dimensions where vanilla MCMC fails completely. This is why modern tools like Stan and PyMC use HMC as their default engine.

The challenges:

- It requires the gradient of the log posterior, which rules out discrete parameters and black-box likelihoods.
- Performance is sensitive to the step size and trajectory length (adaptive variants such as NUTS exist to tune these).
- Like other local methods, it can still get stuck in one mode of a multimodal posterior.

Despite these issues, HMC has become the workhorse of contemporary Bayesian computation.
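
The core of HMC is the leapfrog integrator, which simulates the particle's motion. A sketch, assuming the gradient of the log posterior is available (a full sampler would add momentum resampling and an accept/reject step on the energy error):

```python
import numpy as np

def leapfrog(q, p, grad_log_post, step_size, n_steps):
    """Leapfrog integration of Hamiltonian dynamics.

    q is position (the parameters), p is momentum, and grad_log_post(q)
    is the gradient of the log posterior. The symplectic structure is
    what keeps the energy error bounded over long trajectories.
    """
    q, p = q.copy(), p.copy()
    p = p + 0.5 * step_size * grad_log_post(q)     # initial half-step for momentum
    for _ in range(n_steps - 1):
        q = q + step_size * p                      # full position step
        p = p + step_size * grad_log_post(q)       # full momentum step
    q = q + step_size * p
    p = p + 0.5 * step_size * grad_log_post(q)     # final half-step
    return q, p

# Standard normal target: log p(q) = -q^2 / 2, so the gradient is -q and
# the exact dynamics trace an ellipse in (q, p) phase space.
q, p = leapfrog(np.array([1.0]), np.array([0.0]), lambda q: -q, 0.1, 10)
```

Because the integrator nearly conserves the Hamiltonian, the endpoint of even a long trajectory is accepted with high probability.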

➡️ Explore the HMC sampler

Includes: trajectory visualization, leapfrog integration, energy diagnostics, comparison with MCMC


Nested Sampling

Simultaneous sampling and evidence computation.

Unlike MCMC methods that target the posterior directly, nested sampling explores nested shells of increasing likelihood. It starts with N “live points” sampled from the prior. At each iteration, it removes the point with the lowest likelihood and replaces it with a new point from the prior that has higher likelihood. This systematically contracts the prior volume toward high-likelihood regions.

The key advantage: this process naturally computes the evidence $P(D)$ as a byproduct, making nested sampling invaluable for model comparison.

When to use it:

- Model comparison, where the evidence $P(D)$ is the quantity of interest.
- Multimodal posteriors: the live points can occupy several modes simultaneously.
- Problems of low to moderate dimension where systematic exploration pays off.

The trade-offs:

- It is usually less efficient than MCMC or HMC when you only need posterior samples.
- Drawing new points from the likelihood-constrained prior is the hard step, and implementations differ mainly in how they solve it.
- Efficiency degrades in very high dimensions.

In astrophysics and cosmology, where model selection is often the main question, nested sampling has become a standard tool.
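
A toy sketch of the nested sampling loop, using naive rejection for the constrained prior draw (real implementations replace this step with cleverer proposals; the Gaussian likelihood below is made up):

```python
import numpy as np

def nested_sampling(log_like, prior_sample, n_live=100, n_iter=300, seed=0):
    """Toy nested sampling with naive rejection for the constrained draw.

    Real implementations replace the rejection step with slice sampling or
    ellipsoidal proposals; this sketch only shows the bookkeeping.
    """
    rng = np.random.default_rng(seed)
    live = prior_sample(rng, n_live)
    live_ll = np.array([log_like(t) for t in live])
    log_z, log_x = -np.inf, 0.0            # running evidence, log prior volume
    for i in range(n_iter):
        worst = np.argmin(live_ll)
        # Each iteration shrinks the prior volume by roughly e^(-1/n_live).
        log_x_new = -(i + 1) / n_live
        log_w = np.log(np.exp(log_x) - np.exp(log_x_new))   # shell width
        log_z = np.logaddexp(log_z, live_ll[worst] + log_w)
        log_x = log_x_new
        # Replace the worst point with a prior draw above its likelihood.
        while True:
            cand = prior_sample(rng, 1)[0]
            if log_like(cand) > live_ll[worst]:
                live[worst], live_ll[worst] = cand, log_like(cand)
                break
    # Remaining live points account for the last sliver of prior volume.
    log_z = np.logaddexp(log_z, np.logaddexp.reduce(live_ll) + log_x - np.log(n_live))
    return log_z

# Made-up example: uniform prior on [0, 1], Gaussian likelihood around 0.5.
log_like = lambda t: -0.5 * ((t - 0.5) / 0.1) ** 2
prior_sample = lambda rng, n: rng.random(n)
log_z = nested_sampling(log_like, prior_sample)   # near log(0.1 * sqrt(2*pi))
```

The evidence accumulates as a sum of likelihood-times-shell-width terms, which is the "byproduct" mentioned above.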

➡️ Explore Nested Sampling

Includes: live point evolution, evidence computation, posterior reconstruction from nested samples


Parallel Tempering

Escaping local modes via temperature ladder.

Parallel tempering (also called replica exchange MCMC) runs multiple chains at different “temperatures”—high temperatures flatten the posterior to make mode-hopping easy, while the cold chain samples the true posterior accurately.

The algorithm runs M chains targeting $[P(\theta \mid D)]^{1/T_i}$ for temperatures $T_1 = 1 < T_2 < \cdots < T_M$. The hot chains explore freely across a flattened landscape. Periodically, we propose swapping states between adjacent temperature chains. This lets information from high-temperature exploration gradually inform the cold chain, allowing it to discover and properly weight all modes.

When to use it:

- Strongly multimodal posteriors where a single chain gets trapped in one mode.
- Rugged likelihood surfaces with many local optima.
- As a check that an ordinary chain has not silently missed modes.

Practical considerations:

- Adjacent temperatures must overlap enough for swaps to be accepted; the ladder spacing is the main tuning knob.
- Running M chains multiplies the cost, although the chains parallelize naturally.
- Only the $T = 1$ chain samples the true posterior; the hotter chains are scaffolding.

When configured properly, parallel tempering can turn impossible sampling problems into tractable ones.
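
A minimal sketch of the scheme described above, with hard-coded temperatures and a toy bimodal target (both are illustrative choices, not tuned values):

```python
import numpy as np

def parallel_tempering(log_post, theta0, temps=(1.0, 2.0, 4.0, 8.0),
                       n_steps=5000, step=0.5, seed=0):
    """One Metropolis move per chain per iteration, then a proposed state
    swap between a randomly chosen pair of adjacent temperatures."""
    rng = np.random.default_rng(seed)
    thetas = np.full(len(temps), float(theta0))
    lps = np.array([log_post(t) for t in thetas])
    cold_chain = np.empty(n_steps)
    for i in range(n_steps):
        for k, T in enumerate(temps):
            # Chain k targets the tempered posterior [P(theta | D)]^(1/T).
            prop = thetas[k] + step * T**0.5 * rng.standard_normal()
            prop_lp = log_post(prop)
            if np.log(rng.random()) < (prop_lp - lps[k]) / T:
                thetas[k], lps[k] = prop, prop_lp
        # The swap acceptance ratio preserves both chains' distributions.
        j = rng.integers(len(temps) - 1)
        log_alpha = (lps[j + 1] - lps[j]) * (1 / temps[j] - 1 / temps[j + 1])
        if np.log(rng.random()) < log_alpha:
            thetas[j], thetas[j + 1] = thetas[j + 1], thetas[j]
            lps[j], lps[j + 1] = lps[j + 1], lps[j]
        cold_chain[i] = thetas[0]
    return cold_chain

# Made-up bimodal target: equal-weight normals at -3 and +3. A single cold
# chain with this step size would cross between modes only rarely.
bimodal = lambda t: np.logaddexp(-0.5 * (t - 3.0) ** 2, -0.5 * (t + 3.0) ** 2)
chain = parallel_tempering(bimodal, theta0=-3.0)
```

Even though the cold chain starts in the left mode, swapped-in states from the hot chains let it visit both.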

➡️ Explore Parallel Tempering

Includes: temperature ladder visualization, swap statistics, mode discovery demos


Further Reading

If you want to go deeper, here are resources I’ve found particularly valuable:

Core texts:

Sampling methods:

Software:

I recommend trying multiple frameworks—each one will deepen your understanding in different ways.