Metropolis–Hastings MCMC Algorithm
Interactive visualization of Markov Chain Monte Carlo sampling from 2D posterior distributions
Part 1 of the MCMC Samplers Series | Coming soon: Hamiltonian Monte Carlo (HMC) & Nested Sampling
Markov Chain Monte Carlo (MCMC) is a family of algorithms for sampling from probability distributions that are too complex to sample from directly. These methods are central to Bayesian inference, statistical physics, and computational biology.
The idea is to build a Markov chain whose stationary distribution is the target \(\pi(\mathbf{x})\). Run the chain long enough and its samples will approximate draws from \(\pi(\mathbf{x})\).
MCMC only requires you to evaluate \(\pi(\mathbf{x})\) up to a normalizing constant, which makes it applicable to a vast range of problems where direct sampling or analytical integration is out of reach. Given enough iterations, it is asymptotically exact.
Metropolis–Hastings (MH) is the simplest and most widely used MCMC method. Introduced by Metropolis et al. (1953) and generalized by Hastings (1970), it builds a sequence of samples \(\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots\) where each state depends only on the previous one. At each step, a candidate \(\mathbf{x}'\) is drawn from a proposal distribution centered on the current state (here a Gaussian random walk with width \(\sigma\)) and, for a symmetric proposal, accepted with probability \(\min\!\left(1, \pi(\mathbf{x}')/\pi(\mathbf{x}_t)\right)\); if rejected, the chain stays where it is and the current state is recorded again.
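The demo's implementation isn't shown, but the whole algorithm fits in a few lines. The following is a minimal Python/NumPy sketch of random-walk MH, assuming a symmetric isotropic Gaussian proposal with width \(\sigma\); the function and argument names are illustrative, not the demo's.

```python
import numpy as np

def metropolis_hastings(log_target, x0, sigma, n_iter, rng=None):
    """Random-walk Metropolis-Hastings with an isotropic Gaussian proposal.

    log_target -- callable returning the unnormalized log density log pi(x)
    x0         -- starting point (1-D array-like)
    sigma      -- proposal width (std. dev. of the Gaussian step)
    n_iter     -- number of iterations
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    log_p = log_target(x)
    samples = np.empty((n_iter, x.size))
    accepted = np.zeros(n_iter, dtype=bool)
    for t in range(n_iter):
        # Symmetric proposal, so the Hastings ratio q(x|x')/q(x'|x) cancels
        # and the acceptance probability reduces to min(1, pi(x')/pi(x)).
        proposal = x + sigma * rng.standard_normal(x.size)
        log_p_prop = log_target(proposal)
        if np.log(rng.random()) < log_p_prop - log_p:
            x, log_p = proposal, log_p_prop
            accepted[t] = True
        samples[t] = x  # a rejected proposal repeats the current state
    return samples, accepted
```

Calling `metropolis_hastings(log_pi, x0=[0.0, 0.0], sigma=1.0, n_iter=5000)` returns the chain and a boolean mask of accepted steps; the mask's mean is the acceptance rate referred to throughout the post.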
The three distributions below range from easy to genuinely hard for a random-walk sampler. They are standard benchmarks in the MCMC literature.
Correlated Gaussian (\(\rho = 0.8\)). Exposes the axis-by-axis weakness of random-walk proposals.
Mass on a curved manifold. A fixed proposal has no way to follow the geometry.
Scale of \(x_2\) varies exponentially with \(x_1\). No single proposal width works everywhere.
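The demo's exact parameterizations aren't given, so the log densities below are only illustrative stand-ins for these three targets: the stated correlated Gaussian with \(\rho = 0.8\), a Rosenbrock-style "banana" for the curved target, and Neal's funnel for the varying-scale target.

```python
import numpy as np

# Illustrative unnormalized log densities for the three targets.
# Only rho = 0.8 is stated above; the banana and funnel parameterizations
# are common textbook forms, not necessarily the demo's exact choices.

def log_corr_gauss(x, rho=0.8):
    # Zero-mean bivariate Gaussian, unit variances, correlation rho
    return -(x[0]**2 - 2.0 * rho * x[0] * x[1] + x[1]**2) / (2.0 * (1.0 - rho**2))

def log_banana(x, a=1.0, b=5.0):
    # Rosenbrock-style "banana": Gaussian-like mass bent along a parabola
    return -(a - x[0])**2 - b * (x[1] - x[0]**2)**2

def log_funnel(x):
    # Neal's funnel: x1 ~ N(0, 3^2), x2 | x1 ~ N(0, exp(x1))
    return -x[0]**2 / 18.0 - 0.5 * (x[1]**2 * np.exp(-x[0]) + x[0])
```

Any of these can be passed directly as `log_target` to the sampler sketched above.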
The proposal width \(\sigma\) is the most important parameter to tune: too small and the chain barely moves; too large and almost every proposal is rejected.
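A quick experiment makes the trade-off concrete. Reusing the `metropolis_hastings` and `log_corr_gauss` sketches above (so the same caveats apply), the acceptance rate and the lag-1 autocorrelation move in opposite directions as \(\sigma\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for sigma in (0.05, 0.5, 2.5, 10.0):
    samples, accepted = metropolis_hastings(log_corr_gauss, x0=[0.0, 0.0],
                                            sigma=sigma, n_iter=20_000, rng=rng)
    x1 = samples[:, 0] - samples[:, 0].mean()
    lag1 = (x1[:-1] @ x1[1:]) / (x1 @ x1)   # lag-1 autocorrelation of x1
    print(f"sigma={sigma:5.2f}  accept rate={accepted.mean():.2f}  lag-1 ACF={lag1:.2f}")
```

A very small \(\sigma\) accepts nearly everything but barely moves, so the lag-1 autocorrelation sits near 1; a very large \(\sigma\) rejects almost everything, which also leaves the chain highly autocorrelated. A commonly cited rule of thumb for random-walk MH is to aim for an acceptance rate somewhere around 0.2–0.5.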
Darker regions correspond to higher density. Each point is one sample: green for accepted proposals and orange for rejected ones. Axes adjust automatically to each distribution's natural range.
Watch whether the chain covers the high-density regions or gets trapped in one area.
Histogram of \(x_1\) samples. With enough iterations it approximates the true marginal \(\pi(x_1) = \int \pi(x_1, x_2)\, dx_2\).
Histogram of \(x_2\) samples. With enough iterations this approximates the true marginal.
Trace plots show each parameter's sampled value at every iteration. A well-mixing chain looks like a "fuzzy caterpillar": rapid fluctuations around a stable mean, no trends, no long flat stretches where the chain sat still. Discard the initial burn-in period before drawing conclusions.
Rapid fluctuations indicate good mixing. Long stretches at similar values suggest the chain is stuck.
Compare with the \(x_1\) trace: if one parameter mixes well and the other does not, the proposal geometry or scale is likely mismatched.
The autocorrelation function measures how correlated two samples are when separated by \(\tau\) iterations. For a well-mixing chain, it should decay quickly to zero. Slow decay means successive samples carry redundant information and the effective number of independent samples is small.
For a chain \(\{x_t\}\) with sample mean \(\bar{x}\), the ACF at lag \(\tau\) is:
$$\hat{\rho}(\tau) = \frac{\sum_{t=1}^{n-\tau} (x_t - \bar{x})(x_{t+\tau} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2}$$

By definition \(\hat{\rho}(0) = 1\), and for an ergodic chain \(\hat{\rho}(\tau) \to 0\) as \(\tau \to \infty\). How fast it gets there is what matters in practice.
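Translating the estimator above into code is straightforward; this is a sketch rather than the demo's implementation:

```python
import numpy as np

def acf(x, max_lag):
    """Empirical autocorrelation function, following the estimator above."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = xc @ xc                       # sum of squared deviations
    return np.array([1.0] + [(xc[:-tau] @ xc[tau:]) / denom
                             for tau in range(1, max_lag + 1)])
```

For example, `acf(samples[:, 0], max_lag=200)` gives the analogue of the \(x_1\) curve in the plot below.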
Blue line: ACF for \(x_1\). Red line: ACF for \(x_2\). The lag axis extends to where the ACF first crosses zero, plus 33% extra to confirm it stays near zero.
Because MCMC samples are correlated, \(n\) iterations do not give you \(n\) independent samples. The ESS estimates how many independent samples they are worth:
$$\text{ESS} \approx \frac{n}{1 + 2\sum_{\tau=1}^{\infty} \rho(\tau)}$$

The ratio \(\text{ESS}/n\) is the sampling efficiency. High autocorrelation drives it toward zero. Maximizing ESS for a fixed computational budget is the central goal of MCMC tuning.
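The infinite sum has to be truncated in practice. A simple (and fairly crude) choice is to stop at the first negative ACF value; the sketch below reuses the `acf` helper from the previous sketch, whereas production libraries such as ArviZ use more careful estimators.

```python
import numpy as np

def effective_sample_size(x):
    """ESS with a crude truncation rule: sum rho(tau) until it first goes negative."""
    n = len(x)
    rho = acf(x, max_lag=n // 2)          # acf() from the sketch above
    tail = rho[1:]
    neg = np.flatnonzero(tail < 0)
    cutoff = neg[0] if neg.size else tail.size
    return n / (1.0 + 2.0 * tail[:cutoff].sum())
```

Dividing the result by \(n\) gives the efficiency discussed above.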
MH works for any target where you can evaluate \(\pi(\mathbf{x})\) up to a constant, requires no gradient information, and is easy to implement. The accept-reject step is constructed to satisfy detailed balance with respect to \(\pi\), so \(\pi\) is the chain's stationary distribution; under mild conditions (irreducibility and aperiodicity) the chain also converges to it.
The weaknesses become apparent in harder problems. The random-walk proposal is blind to the geometry of the posterior, so mixing degrades in high dimensions, along curved manifolds, and wherever the posterior has varying scales (as in Neal's funnel). The proposal width \(\sigma\) must be hand-tuned: too small and the chain barely moves, too large and the acceptance rate collapses. Multimodal distributions are a persistent problem, since the chain can spend arbitrarily long confined to one mode. These limitations motivated Hamiltonian Monte Carlo and other gradient-based methods.
The distributions above reveal where MH struggles: curved geometries, varying scales, and high dimensionality. Hamiltonian Monte Carlo addresses all three by using gradient information to propose distant states coherently rather than diffusing randomly. If you want to see how that plays out on the same distributions, the HMC demo is the natural next step.