Interactive visualization of Markov Chain Monte Carlo sampling from 2D posterior distributions
Part 1 of the MCMC Samplers Series | Coming soon: Hamiltonian Monte Carlo (HMC) & Nested Sampling
Introduction to MCMC
Markov Chain Monte Carlo (MCMC) is a family of algorithms for sampling from probability
distributions that are difficult or impossible to sample from directly. These methods are fundamental in
Bayesian inference, statistical physics, computational biology, and machine learning.
The key idea is to construct a Markov chain that has the target distribution \(\pi(\mathbf{x})\)
as its stationary distribution. By running the chain long enough, samples from the chain will approximate
samples from \(\pi(\mathbf{x})\).
Why MCMC?
Sample from complex, high-dimensional distributions where direct sampling is intractable
Only requires ability to evaluate the target density \(\pi(\mathbf{x})\) (up to a normalizing constant)
Asymptotically exact: given infinite time, samples converge to the true distribution
Widely applicable: works for continuous, discrete, or mixed parameter spaces
The Metropolis–Hastings Algorithm
The Metropolis–Hastings (MH) algorithm is one of the most fundamental MCMC methods.
It was introduced by Metropolis et al. (1953) and generalized by Hastings (1970). The algorithm generates
a sequence of samples \(\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots\) where each sample depends only
on the previous one (Markov property).
Algorithm Steps:
Initialize: Start with an initial state \(\mathbf{x}_0\)
Propose: Generate a candidate state \(\mathbf{x}' \sim q(\cdot \mid \mathbf{x}_t)\)
from a proposal distribution
In this demo: \(q(\mathbf{x}' \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_t, \sigma^2 \mathbf{I})\)
(Gaussian random walk)
Compute acceptance ratio:
$$\alpha = \min\left(1, \frac{\pi(\mathbf{x}') \, q(\mathbf{x}_t \mid \mathbf{x}')}{\pi(\mathbf{x}_t) \, q(\mathbf{x}' \mid \mathbf{x}_t)}\right)$$
For symmetric proposals (like our Gaussian random walk), this simplifies to:
$$\alpha = \min\left(1, \frac{\pi(\mathbf{x}')}{\pi(\mathbf{x}_t)}\right)$$
Key Insight: The algorithm always accepts moves to higher probability regions
(\(\pi(\mathbf{x}') > \pi(\mathbf{x}_t)\)), but also sometimes accepts moves to lower probability
regions. This ensures the chain doesn't get stuck in local modes and properly explores the entire
distribution.
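The accept/reject loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not the demo's actual code; the target is the correlated Gaussian introduced below, and all settings (\(\sigma\), step count, seed) are arbitrary choices.

```python
import numpy as np

def metropolis_hastings(log_pi, x0, sigma=0.5, n_steps=5000, seed=0):
    """Random-walk Metropolis with a Gaussian proposal N(x_t, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    log_p = log_pi(x)
    chain = np.empty((n_steps, x.size))
    accepted = 0
    for t in range(n_steps):
        x_prop = x + sigma * rng.standard_normal(x.size)  # propose
        log_p_prop = log_pi(x_prop)
        # Symmetric proposal, so alpha = min(1, pi(x')/pi(x_t));
        # compare in log space for numerical stability.
        if np.log(rng.uniform()) < log_p_prop - log_p:
            x, log_p = x_prop, log_p_prop
            accepted += 1
        chain[t] = x  # rejected proposals repeat the current state
    return chain, accepted / n_steps

# Example target: correlated bivariate Gaussian, rho = 0.8 (unnormalized).
rho = 0.8
prec = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))
log_pi = lambda x: -0.5 * x @ prec @ x

chain, acc_rate = metropolis_hastings(log_pi, x0=[0.0, 0.0])
```

Note that the chain records the current state even on rejection; dropping rejected iterations would bias the samples.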
Target Distributions
This demonstration provides three target distributions ranging from simple to challenging,
illustrating different aspects of MCMC performance:
Bivariate Gaussian (Default)
A standard correlated 2D Gaussian with correlation coefficient \(\rho = 0.8\):
This distribution has elliptical contours aligned with the correlation structure.
With strong correlation (\(\rho = 0.8\)), isotropic random-walk proposals struggle: the high-density
region is a narrow ellipse, so steps large enough to move along it mostly land off it and get rejected.
Rosenbrock's Banana Distribution
A transformed Gaussian that creates a curved, banana-shaped density (Haario et al., 1999):
This distribution is strongly correlated along a curved manifold, making it difficult for
random-walk samplers to explore efficiently. The "banana" shape arises from the nonlinear
transformation \(x_2 - x_1^2\), creating regions where standard proposals are inefficient.
Neal's Funnel Distribution
A hierarchical model that exhibits strong scale variations (Neal, 2003):
The joint density is \(\pi(x_1, x_2) \propto \exp(-x_1^2/18 - x_2^2/(2e^{2x_1}))\).
This creates a "funnel" where the scale of \(x_2\) depends exponentially on \(x_1\),
challenging for fixed-width proposals. Common in hierarchical Bayesian models with variance parameters.
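The three targets can be written as unnormalized log-densities. The funnel follows the formula above exactly and the Gaussian uses \(\rho = 0.8\) as stated; the banana's scale constants are illustrative guesses, since only the \(x_2 - x_1^2\) transformation is specified here.

```python
import numpy as np

def log_gaussian(x, rho=0.8):
    """Correlated bivariate Gaussian with unit marginal variances."""
    x1, x2 = x
    return -(x1**2 - 2.0 * rho * x1 * x2 + x2**2) / (2.0 * (1.0 - rho**2))

def log_banana(x, b=1.0):
    """Banana shape: a Gaussian in the warped coordinates (x1, x2 - b*x1^2).
    The widths (2.0 and 1.0) are illustrative, not the demo's exact values."""
    x1, x2 = x
    return -0.5 * ((x1 / 2.0) ** 2 + (x2 - b * x1**2) ** 2)

def log_funnel(x):
    """Neal's funnel: pi(x1, x2) ∝ exp(-x1^2/18 - x2^2/(2 e^{2 x1}))."""
    x1, x2 = x
    return -x1**2 / 18.0 - x2**2 / (2.0 * np.exp(2.0 * x1))
```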
Why these distributions? The Gaussian provides a baseline to understand basic behavior.
Rosenbrock's banana and Neal's funnel are standard benchmarks in the MCMC literature—the banana tests
handling of strong nonlinear correlations, while the funnel tests adaptation to varying scales.
These challenges motivate advanced methods like HMC and adaptive MCMC.
Simulation Controls
Adjust the parameters below to see how they affect the sampler's behavior. The proposal width
is particularly important: too small and the chain explores slowly; too large and most proposals are rejected.
Step size for proposals. Larger \(\sigma\) → larger jumps. Tune to balance exploration vs. acceptance.
Delay between iterations. Slower speeds help visualize individual steps.
Switch between different target distributions to test sampler behavior.
MCMC Chain Visualization
The main plot shows the joint posterior \(\pi(x_1, x_2)\) with darker regions indicating higher probability.
Each point represents a sample in the chain: green
for accepted proposals and orange for rejected proposals.
The plot axes automatically adjust to each distribution's natural range.
Joint Posterior Distribution π(x₁, x₂)
The chain should explore the high-density regions (darker areas) while maintaining enough
randomness to avoid getting trapped in any single location.
Marginal Distribution of x₁
Histogram of \(x_1\) samples. Should converge to the true marginal \(\pi(x_1) = \int \pi(x_1, x_2) dx_2\).
Marginal Distribution of x₂
Histogram of \(x_2\) samples. With enough samples, this approximates the true marginal distribution.
Current Iteration Details
Click "Start Sampling" or "Single Step" to begin the MCMC algorithm.
Chain Trace Plots
Trace plots show the evolution of each parameter value over iterations. They are essential
for diagnosing convergence and mixing of the MCMC chain.
What to look for in trace plots:
Good mixing: The trace should look like a "fuzzy caterpillar": random fluctuations
around a stable mean with no obvious patterns or trends
Stationarity: The mean and variance should remain constant over time (after burn-in)
No trends: The trace shouldn't show long-term upward or downward trends
No getting stuck: The chain shouldn't remain at the same value for extended periods
Trace Plot: x₁ Evolution
Time series of \(x_1\) values. Rapid fluctuations indicate good exploration; long periods at similar
values suggest poor mixing.
Trace Plot: x₂ Evolution
Time series of \(x_2\) values. Compare with \(x_1\) trace to check if both parameters mix at similar rates.
Chain Diagnostics
Autocorrelation Function (ACF)
The autocorrelation function measures the correlation between samples separated by
\(\tau\) iterations (the lag). For a well-mixing chain, the ACF should decay rapidly to zero.
Mathematical Definition:
For a chain \(\{x_t\}_{t=1}^{n}\) with sample mean \(\bar{x}\) and sample variance \(s^2\), the sample ACF at lag \(\tau\) is:
$$\hat{\rho}(\tau) = \frac{1}{(n-\tau)\,s^2} \sum_{t=1}^{n-\tau} (x_t - \bar{x})(x_{t+\tau} - \bar{x})$$
\(\hat{\rho}(0) = 1\): Perfect correlation with itself
\(\hat{\rho}(\tau) \to 0\) as \(\tau \to \infty\): Samples become independent (for ergodic chains)
Slow decay: High autocorrelation indicates the chain mixes slowly
Fast decay: Low autocorrelation indicates efficient exploration
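A direct implementation of this estimator (using the common biased normalization by the lag-0 autocovariance; the demo's exact convention may differ):

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation rho_hat(tau) for tau = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    c0 = np.dot(xc, xc) / n  # lag-0 autocovariance (biased estimator)
    return np.array([np.dot(xc[: n - tau], xc[tau:]) / (n * c0)
                     for tau in range(max_lag + 1)])

# Sanity check: white noise is uncorrelated at every positive lag,
# so its ACF should be 1 at lag 0 and near 0 elsewhere.
rng = np.random.default_rng(1)
acf_white = sample_acf(rng.standard_normal(10_000), max_lag=5)
```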
Autocorrelation Function (ACF)
Blue line: ACF for \(x_1\) parameter.
Red line: ACF for \(x_2\) parameter.
The lag range adapts automatically: it extends about 33% beyond the lag where the ACF first crosses
zero, so you can verify convergence toward independence.
Performance Metrics
Live counters updated as the chain runs: Total Iterations, Acceptance Rate, Accepted Proposals, and Effective Sample Size (est.).
Effective Sample Size (ESS)
Due to autocorrelation, MCMC samples are not independent. The ESS estimates how many independent
samples our correlated chain is equivalent to:
$$\text{ESS} = \frac{n}{1 + 2\sum_{\tau=1}^{\infty} \rho(\tau)}$$
where \(n\) is the total number of samples and \(\rho(\tau)\) is the ACF. Higher ESS means more
efficient sampling. The ratio ESS/\(n\) is called the sampling efficiency.
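In practice the infinite sum is truncated. A simple sketch, cutting the sum at the first non-positive ACF value (libraries such as ArviZ use more careful truncation rules):

```python
import numpy as np

def effective_sample_size(x):
    """ESS = n / (1 + 2 * sum_{tau>=1} rho(tau)), with the sum truncated
    at the first non-positive autocorrelation (a simple heuristic)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    c0 = np.dot(xc, xc) / n
    rho_sum = 0.0
    for tau in range(1, n // 2):
        rho = np.dot(xc[: n - tau], xc[tau:]) / (n * c0)
        if rho <= 0.0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)

# An i.i.d. sequence should give ESS close to n; a sticky, slowly
# mixing chain gives far fewer effective samples.
rng = np.random.default_rng(2)
ess_iid = effective_sample_size(rng.standard_normal(5_000))
```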
Tuning the Proposal Width: The proposal standard deviation \(\sigma\) directly affects
both the acceptance rate and the autocorrelation. For random-walk Metropolis in high dimensions,
the asymptotically optimal acceptance rate is approximately 0.234 (Roberts et al., 1997; Roberts & Rosenthal, 2001).
For low-dimensional problems (2D), acceptance rates between 40% and 60% are typically more efficient,
though the optimal rate depends on the target distribution's geometry.
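The tradeoff is easy to see empirically. This sketch sweeps \(\sigma\) on a standard 2D Gaussian (a stand-in target; all values and seeds are arbitrary) and reports the acceptance rate:

```python
import numpy as np

def acceptance_rate(sigma, n_steps=20_000, seed=3):
    """Fraction of accepted random-walk proposals on a standard 2D Gaussian."""
    rng = np.random.default_rng(seed)
    x = np.zeros(2)
    log_p = -0.5 * x @ x
    accepted = 0
    for _ in range(n_steps):
        x_prop = x + sigma * rng.standard_normal(2)
        log_p_prop = -0.5 * x_prop @ x_prop
        if np.log(rng.uniform()) < log_p_prop - log_p:
            x, log_p = x_prop, log_p_prop
            accepted += 1
    return accepted / n_steps

# Small sigma: nearly everything accepted, but steps are tiny.
# Large sigma: big jumps, but almost all rejected.
for sigma in (0.1, 1.0, 10.0):
    print(f"sigma={sigma:5.1f}  acceptance={acceptance_rate(sigma):.2f}")
```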
Strengths & Limitations of Metropolis–Hastings
✓ Strengths
Generality: Works for any distribution where you can evaluate \(\pi(\mathbf{x})\)
Simplicity: Easy to implement and understand
Flexibility: Can be adapted with different proposal distributions
Theoretical guarantees: Provably converges to the target distribution
No gradients needed: Only requires density evaluation
✗ Limitations
Random walk behavior: Inefficient in high dimensions
Tuning required: Proposal width must be carefully chosen
Slow mixing: Can take many iterations to explore the distribution
Correlated samples: High autocorrelation reduces effective sample size
Multimodality: Struggles to jump between well-separated modes
Common Challenges:
Curse of dimensionality: In high dimensions, the acceptance rate drops dramatically
unless \(\sigma\) is very small, leading to slow exploration
Correlated parameters: When parameters have strong posterior correlations
(like the banana distribution), isotropic fixed-scale proposals are inefficient: most moves step off
the narrow high-density ridge instead of along it
Varying scales: When different regions have vastly different scales
(like Neal's funnel), fixed-width proposals perform poorly: too large in narrow regions, too small in wide regions
Convergence assessment: It's difficult to know when the chain has converged to
the stationary distribution
Beyond Metropolis–Hastings
While MH is foundational, modern MCMC methods have been developed to address its limitations.
Future tutorials in this series will cover:
Hamiltonian Monte Carlo (HMC)
Uses gradient information to propose distant states while maintaining high acceptance rates.
HMC can traverse the parameter space much more efficiently than random-walk MH, especially in
high dimensions. The key insight is to treat sampling as simulating Hamiltonian dynamics in an
augmented space.
Key advantage: Explores parameter space coherently rather than randomly,
dramatically reducing autocorrelation.
Nested Sampling
A fundamentally different approach that simultaneously computes samples from the posterior
and the model evidence (marginal likelihood). Instead of evolving a Markov chain,
nested sampling progressively samples from constrained priors with increasing likelihood thresholds.
Key advantage: Natural computation of Bayesian evidence for model comparison,
robust exploration of multimodal distributions.
Practical Recommendations:
For low-dimensional, well-behaved posteriors: Metropolis–Hastings is often sufficient
For high-dimensional problems with smooth posteriors: Use HMC (implemented in Stan, PyMC)
For multimodal distributions or model comparison: Consider nested sampling (implemented in dynesty, MultiNest)
For complex, structured problems: Specialized variants like Gibbs sampling, NUTS, or adaptive MH
References & Further Reading
Key Papers:
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., & Teller, E. (1953).
"Equation of State Calculations by Fast Computing Machines."
Journal of Chemical Physics, 21(6), 1087-1092.
Hastings, W.K. (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications."
Biometrika, 57(1), 97-109.
Roberts, G.O., Gelman, A., & Gilks, W.R. (1997). "Weak Convergence and Optimal Scaling of
Random Walk Metropolis Algorithms." The Annals of Applied Probability, 7(1), 110-120.
Haario, H., Saksman, E., & Tamminen, J. (1999). "Adaptive Proposal Distribution for Random Walk Metropolis Algorithm."
Computational Statistics, 14, 375-395.
Neal, R.M. (2003). "Slice Sampling." The Annals of Statistics, 31(3), 705-767.
Roberts, G.O., & Rosenthal, J.S. (2001). "Optimal Scaling for Various Metropolis-Hastings Algorithms."
Statistical Science, 16(4), 351-367.
Books:
Robert, C.P., & Casella, G. (2004). Monte Carlo Statistical Methods (2nd ed.). Springer.
Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., & Rubin, D.B. (2013).
Bayesian Data Analysis (3rd ed.). CRC Press.