Interactive visualization of gradient-based MCMC sampling from 2D posterior distributions. Part 2 of the MCMC Samplers Series | See also: Metropolis–Hastings
Introduction to Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) is an advanced MCMC algorithm that uses gradient information
and concepts from Hamiltonian dynamics to generate efficient proposals. Unlike random-walk methods
(like Metropolis–Hastings), HMC proposes distant states while maintaining high acceptance rates by
following the natural geometry of the posterior distribution.
HMC addresses the two main weaknesses of random-walk MCMC: diffusive exploration (slow, small random steps) and sensitivity to correlations (inefficient axis-aligned proposals). By simulating Hamiltonian dynamics, HMC proposes states that move coherently along the posterior's contours, dramatically improving efficiency, especially in high dimensions.
Why HMC?
Efficient exploration: Uses gradient information to propose distant states with high acceptance probability
Reduced autocorrelation: Makes large, coherent moves through parameter space instead of small random steps
Handles correlations: Naturally adapts to the posterior geometry, efficient even with strong correlations
Scalable: Performance often improves (relatively) as dimensionality increases, unlike random-walk methods
Industry standard: Powers modern Bayesian software like Stan, PyMC, and NumPyro
The Key Insight: HMC augments the parameter space with auxiliary momentum variables and
simulates Hamiltonian dynamics. This transforms the sampling problem into a physics simulation where
a particle explores the posterior by conserving total energy—naturally spending more time in
high-probability regions while making efficient long-distance moves.
The Hamiltonian Monte Carlo Algorithm
HMC was introduced by Duane et al. (1987) for lattice field theory and popularized
for statistics by Neal (1996, 2011). The algorithm augments the parameter vector \(\mathbf{q}\)
(position) with a momentum vector \(\mathbf{p}\), defining a Hamiltonian system.
The Hamiltonian:
The total energy of the system combines potential energy (negative log posterior) and kinetic energy:
$$H(\mathbf{q}, \mathbf{p}) = U(\mathbf{q}) + K(\mathbf{p}), \qquad U(\mathbf{q}) = -\log \pi(\mathbf{q}), \qquad K(\mathbf{p}) = \tfrac{1}{2}\mathbf{p}^\top M^{-1}\mathbf{p}$$
Sample momentum: Draw \(\mathbf{p} \sim \mathcal{N}(\mathbf{0}, M)\), independently of the current position.
Simulate Hamiltonian dynamics: Starting from \((\mathbf{q}, \mathbf{p})\), integrate
Hamilton's equations for \(L\) steps of size \(\epsilon\):
$$\begin{aligned}
\frac{d\mathbf{q}}{dt} &= \frac{\partial H}{\partial \mathbf{p}} = M^{-1}\mathbf{p}\\
\frac{d\mathbf{p}}{dt} &= -\frac{\partial H}{\partial \mathbf{q}} = -\nabla U(\mathbf{q})
\end{aligned}$$
We use the leapfrog integrator (symplectic, reversible):
\(\mathbf{p} \leftarrow \mathbf{p} - \frac{\epsilon}{2}\nabla U(\mathbf{q})\) (initial half step for momentum)
for \(i = 1\) to \(L\):
\(\mathbf{q} \leftarrow \mathbf{q} + \epsilon M^{-1}\mathbf{p}\) (full step for position)
\(\mathbf{p} \leftarrow \mathbf{p} - \epsilon \nabla U(\mathbf{q})\) (full step for momentum; after the final position update, only a half step \(\frac{\epsilon}{2}\))
Metropolis acceptance: Propose \((\mathbf{q}^*, \mathbf{p}^*)\) from dynamics. Accept with probability:
$$\alpha = \min\left(1, \exp(H(\mathbf{q}, \mathbf{p}) - H(\mathbf{q}^*, \mathbf{p}^*))\right)$$
Due to numerical errors, \(H\) is not perfectly conserved, but acceptance corrects for this.
Return position: If accepted, set \(\mathbf{q}_{t+1} = \mathbf{q}^*\); else \(\mathbf{q}_{t+1} = \mathbf{q}_t\)
Why it works: The leapfrog integrator approximately conserves the Hamiltonian \(H\).
Regions with low \(U(\mathbf{q})\) (high probability) correspond to low potential energy, where the
particle naturally spends more time—just like a ball rolling in a valley. The momentum carries the
particle across low-probability regions, enabling efficient exploration without getting trapped.
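Putting the three steps together, a complete HMC transition can be sketched in a few lines of Python. This is a minimal illustration with an identity mass matrix; names such as `hmc_step` are ours, not the demo's actual implementation:

```python
import numpy as np

def hmc_step(q, log_pi, grad_log_pi, epsilon, n_steps, rng):
    """One HMC iteration with an identity mass matrix (M = I)."""
    grad_U = lambda x: -grad_log_pi(x)        # U(q) = -log pi(q)
    p = rng.standard_normal(q.shape)          # 1. sample momentum p ~ N(0, I)
    H0 = -log_pi(q) + 0.5 * p @ p             # initial total energy
    q_new, p_new = q.copy(), p.copy()
    p_new -= 0.5 * epsilon * grad_U(q_new)    # 2. leapfrog: initial half step
    for i in range(n_steps):
        q_new += epsilon * p_new              #    full position step
        if i < n_steps - 1:
            p_new -= epsilon * grad_U(q_new)  #    full momentum step
    p_new -= 0.5 * epsilon * grad_U(q_new)    #    final half step
    H1 = -log_pi(q_new) + 0.5 * p_new @ p_new
    # 3. Metropolis correction for leapfrog discretization error
    if rng.random() < np.exp(min(0.0, H0 - H1)):
        return q_new, True
    return q, False
```

Resampling the momentum from \(\mathcal{N}(\mathbf{0}, M)\) at every iteration is what makes the chain ergodic; the leapfrog trajectory alone would stay on a single energy contour.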
Target Distributions
This demonstration provides three target distributions ranging from simple to challenging,
illustrating different aspects of MCMC performance:
Bivariate Gaussian (Default)
A standard correlated 2D Gaussian (unit variances) with correlation coefficient \(\rho = 0.8\):
$$\pi(x_1, x_2) \propto \exp\left(-\frac{x_1^2 - 2\rho x_1 x_2 + x_2^2}{2(1-\rho^2)}\right)$$
This distribution has elliptical contours aligned with the correlation structure.
With strong correlation (\(\rho = 0.8\)), random-walk proposals struggle because they explore
axis-by-axis when parameters should move together along the correlation direction.
Rosenbrock's Banana Distribution
A transformed Gaussian that creates a curved, banana-shaped density (Haario et al., 1999).
This distribution is strongly correlated along a curved manifold, making it difficult for
random-walk samplers to explore efficiently. The "banana" shape arises from the nonlinear
transformation \(x_2 - x_1^2\), creating regions where standard proposals are inefficient.
Neal's Funnel Distribution
A hierarchical model that exhibits strong scale variations (Neal, 2003):
The joint density is \(\pi(x_1, x_2) \propto \exp(-x_1^2/18 - x_2^2/(2e^{2x_1}))\).
This creates a "funnel" where the scale of \(x_2\) depends exponentially on \(x_1\),
challenging for fixed-width proposals. Common in hierarchical Bayesian models with variance parameters.
Why these distributions? The Gaussian provides a baseline to understand basic behavior.
Rosenbrock's banana and Neal's funnel are standard benchmarks in the MCMC literature—the banana tests
handling of strong nonlinear correlations, while the funnel tests adaptation to varying scales.
These challenges motivate advanced methods like HMC and adaptive MCMC.
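The three targets can be written as unnormalized log densities. The Gaussian and the funnel follow the formulas given above; the banana's parameter values (\(b = 0.1\) and the 100 scale, in the style of Haario et al.) are illustrative and may differ from what the demo actually uses:

```python
import numpy as np

RHO = 0.8  # correlation for the Gaussian target

def log_gaussian(x):
    """Correlated standard bivariate Gaussian, rho = 0.8 (up to a constant)."""
    x1, x2 = x
    return -(x1**2 - 2 * RHO * x1 * x2 + x2**2) / (2 * (1 - RHO**2))

def log_banana(x, b=0.1):
    """Banana: Gaussian in (x1, x2 - b*(x1^2 - 100)); b and 100 are illustrative."""
    x1, x2 = x
    return -x1**2 / 200 - 0.5 * (x2 - b * (x1**2 - 100)) ** 2

def log_funnel(x):
    """Neal-style funnel: the scale of x2 grows exponentially with x1."""
    x1, x2 = x
    return -x1**2 / 18 - x2**2 / (2 * np.exp(2 * x1))
```

Only unnormalized log densities are needed: both the Metropolis ratio and the HMC potential \(U = -\log \pi\) are invariant to the normalizing constant.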
Simulation Controls
HMC has two key tuning parameters: the step size \(\epsilon\) (how large each
leapfrog step is) and the number of steps \(L\) (how long the trajectory is).
The product \(L \times \epsilon\) determines the trajectory length.
Size of each leapfrog integration step. Smaller → more accurate, but slower.
Length of trajectory. More steps → longer proposals, but can overshoot.
Delay between iterations. Slower speeds help visualize trajectories.
Same distributions as Metropolis–Hastings for direct comparison.
Tuning advice: Optimal step size balances integration accuracy vs. efficiency.
Too large → high rejection due to numerical error. Too small → inefficient short proposals.
Aim for acceptance rates around 65-90% (much higher than MH's 23-40%!). Modern implementations
like NUTS (No-U-Turn Sampler) automatically tune these parameters.
HMC Trajectory Visualization
The main plot shows the joint posterior \(\pi(x_1, x_2)\) with darker regions indicating higher probability.
Unlike random-walk methods, HMC trajectories follow curved paths determined by the gradient of the log posterior.
Green points indicate accepted proposals and
orange points indicate rejected proposals.
The plot axes automatically adjust to each distribution's natural range.
Joint Posterior Distribution π(x₁, x₂)
The chain should explore the high-density regions (darker areas) while maintaining enough
randomness to avoid getting trapped in any single location.
Marginal Distribution of x₁
Histogram of \(x_1\) samples. Should converge to the true marginal \(\pi(x_1) = \int \pi(x_1, x_2) dx_2\).
Marginal Distribution of x₂
Histogram of \(x_2\) samples. With enough samples, this approximates the true marginal distribution.
Current Iteration Details
Click "Start Sampling" or "Single Step" to begin the MCMC algorithm.
Chain Trace Plots
Trace plots show the evolution of each parameter value over iterations. They are essential
for diagnosing convergence and mixing of the MCMC chain.
What to look for in trace plots:
Good mixing: The trace should look like a "fuzzy caterpillar": random fluctuations
around a stable mean with no obvious patterns or trends
Stationarity: The mean and variance should remain constant over time (after burn-in)
No trends: The trace shouldn't show long-term upward or downward trends
No getting stuck: The chain shouldn't remain at the same value for extended periods
Trace Plot: x₁ Evolution
Time series of \(x_1\) values. Rapid fluctuations indicate good exploration; long periods at similar
values suggest poor mixing.
Trace Plot: x₂ Evolution
Time series of \(x_2\) values. Compare with \(x_1\) trace to check if both parameters mix at similar rates.
Chain Diagnostics
Autocorrelation Function (ACF)
The autocorrelation function measures the correlation between samples separated by
\(\tau\) iterations (the lag). For a well-mixing chain, the ACF should decay rapidly to zero.
Mathematical Definition:
For a chain \(\{x_t\}_{t=1}^{n}\) with sample mean \(\bar{x}\) and sample variance \(s^2\), the sample ACF at lag \(\tau\) is:
$$\hat{\rho}(\tau) = \frac{1}{n s^2} \sum_{t=1}^{n-\tau} (x_t - \bar{x})(x_{t+\tau} - \bar{x})$$
\(\hat{\rho}(0) = 1\): Perfect correlation with itself
\(\hat{\rho}(\tau) \to 0\) as \(\tau \to \infty\): Samples become independent (for ergodic chains)
Slow decay: High autocorrelation indicates the chain mixes slowly
Fast decay: Low autocorrelation indicates efficient exploration
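The sample ACF above translates directly into code (using the biased \(1/n\) normalization, a common convention that guarantees a positive-definite estimate):

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of a 1D chain x at lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    var = xc @ xc / n                          # biased sample variance
    acf = np.empty(max_lag + 1)
    for tau in range(max_lag + 1):
        # correlate the chain with a lagged copy of itself
        acf[tau] = (xc[: n - tau] @ xc[tau:]) / (n * var)
    return acf
```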
Autocorrelation Function (ACF)
Blue line: ACF for \(x_1\) parameter.
Red line: ACF for \(x_2\) parameter.
The lag range adapts automatically—showing where the ACF crosses zero plus 33% additional lags
to verify convergence to independence.
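The adaptive lag range described here can be sketched as a small helper; the function name and the exact rule are our guesses at the demo's behavior, not its actual code:

```python
import numpy as np

def adaptive_lag_range(acf, extra=0.33):
    """Lag-axis limit: first zero crossing of the ACF plus ~33% more lags."""
    cross = next((i for i, r in enumerate(acf) if r <= 0), len(acf) - 1)
    return int(np.ceil(cross * (1 + extra)))
```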
Performance Metrics
The metrics panel reports the total number of iterations, the acceptance rate, the number of accepted proposals, and an estimated effective sample size, updated live as the chain runs.
Effective Sample Size (ESS)
Due to autocorrelation, MCMC samples are not independent. The ESS estimates how many independent
samples our correlated chain is equivalent to:
$$\text{ESS} = \frac{n}{1 + 2\sum_{\tau=1}^{\infty} \rho(\tau)}$$
where \(n\) is the total number of samples and \(\rho(\tau)\) is the ACF. Higher ESS means more
efficient sampling. The ratio ESS/\(n\) is called the sampling efficiency.
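A simple ESS estimator truncates the autocorrelation sum at the first negative lag. This is a crude rule, chosen here for brevity; production code uses more careful truncations such as Geyer's initial-sequence estimators:

```python
import numpy as np

def effective_sample_size(x, max_lag=None):
    """ESS = n / (1 + 2 * sum of positive-lag autocorrelations)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if max_lag is None:
        max_lag = n // 2
    xc = x - x.mean()
    var = xc @ xc / n
    s = 0.0
    for tau in range(1, max_lag + 1):
        rho = (xc[: n - tau] @ xc[tau:]) / (n * var)
        if rho < 0:          # crude truncation at the first negative lag
            break
        s += rho
    return n / (1 + 2 * s)
```

For nearly independent samples the sum is close to zero and ESS approaches \(n\); for a slowly mixing chain the sum is large and ESS collapses.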
Tuning the Proposal Width: The proposal standard deviation \(\sigma\) directly affects
both the acceptance rate and the autocorrelation. For random-walk Metropolis in high dimensions,
the asymptotically optimal acceptance rate is approximately 0.234 (Roberts et al., 1997; Roberts & Rosenthal, 2001).
For low-dimensional problems (2D), acceptance rates between 40-60% are typically more efficient,
though the optimal rate depends on the target distribution's geometry.
Strengths & Limitations of HMC
✓ Strengths
Efficient exploration: Makes large, coherent moves instead of random walks
Reduced autocorrelation: Much lower correlation between samples than MH
Handles correlations naturally: Follows posterior geometry via gradients
High acceptance rates: Typically 65-90% (vs. 23-40% for MH)
Scales to high dimensions: Performance relatively better as \(d\) increases
✗ Limitations
Requires gradients: Need \(\nabla \log \pi(\mathbf{q})\), which isn't always available
Tuning parameters: Step size \(\epsilon\) and trajectory length \(L\) need tuning
Computational cost: Each iteration requires \(L\) gradient evaluations
Numerical errors: Discretization of continuous dynamics introduces error
Still struggles with multimodality: Difficult to jump between well-separated modes
When to use HMC:
Smooth posteriors: HMC excels when \(\pi(\mathbf{q})\) is differentiable and relatively smooth
High dimensions: Particularly valuable for \(d > 10\), where random-walk methods struggle
Strong correlations: Automatically adapts to correlation structure via gradient information
Modern variants: NUTS (No-U-Turn Sampler) in Stan/PyMC automatically tunes \(\epsilon\) and \(L\),
making HMC much more practical
Comparison: HMC vs. Metropolis–Hastings
Direct comparison on the same target distributions reveals HMC's advantages:
Key Differences:
Property | Metropolis–Hastings | HMC
Proposal mechanism | Random walk | Hamiltonian dynamics
Gradients needed? | No | Yes
Typical acceptance rate | 23-40% (high-d) | 65-90%
Autocorrelation | High | Low
Effective sample size | Much lower than \(n\) | Closer to \(n\)
Best for | Low-d, gradients unavailable | High-d, smooth posteriors
Key advantage: Explores parameter space coherently rather than randomly,
dramatically reducing autocorrelation.
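These claims can be checked empirically. The sketch below runs both samplers on the \(\rho = 0.8\) Gaussian target and compares acceptance rates and lag-1 autocorrelation of \(x_1\); the tuning values are illustrative, not optimal:

```python
import numpy as np

RHO = 0.8
COV_INV = np.linalg.inv(np.array([[1.0, RHO], [RHO, 1.0]]))

def log_pi(x):
    """Unnormalized log density of the rho = 0.8 Gaussian target."""
    return -0.5 * x @ COV_INV @ x

def grad_log_pi(x):
    return -COV_INV @ x

def run_mh(n, sigma, rng):
    """Random-walk Metropolis with isotropic Gaussian proposals of scale sigma."""
    x, chain, acc = np.zeros(2), [], 0
    for _ in range(n):
        prop = x + sigma * rng.standard_normal(2)
        if np.log(rng.random()) < log_pi(prop) - log_pi(x):
            x, acc = prop, acc + 1
        chain.append(x[0])
    return np.array(chain), acc / n

def run_hmc(n, eps, L, rng):
    """HMC with identity mass matrix and a fixed leapfrog schedule."""
    x, chain, acc = np.zeros(2), [], 0
    for _ in range(n):
        p = rng.standard_normal(2)
        q, pm = x.copy(), p.copy()
        pm += 0.5 * eps * grad_log_pi(q)      # half step (grad U = -grad log pi)
        for i in range(L):
            q += eps * pm
            if i < L - 1:
                pm += eps * grad_log_pi(q)
        pm += 0.5 * eps * grad_log_pi(q)
        h0 = -log_pi(x) + 0.5 * p @ p
        h1 = -log_pi(q) + 0.5 * pm @ pm
        if np.log(rng.random()) < h0 - h1:
            x, acc = q, acc + 1
        chain.append(x[0])
    return np.array(chain), acc / n

def lag1_autocorr(c):
    c = c - c.mean()
    return (c[:-1] @ c[1:]) / (c @ c)
```

Running both for a few thousand iterations typically shows HMC accepting far more proposals while producing a much less autocorrelated chain, consistent with the table.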
Nested Sampling
A fundamentally different approach that simultaneously computes samples from the posterior
and the model evidence (marginal likelihood). Instead of evolving a Markov chain,
nested sampling progressively samples from constrained priors with increasing likelihood thresholds.
References & Further Reading
Key Papers:
Duane, S., Kennedy, A.D., Pendleton, B.J., & Roweth, D. (1987).
"Hybrid Monte Carlo." Physics Letters B, 195(2), 216-222.
Neal, R.M. (1996). "Bayesian Learning for Neural Networks." PhD Thesis, University of Toronto.
Neal, R.M. (2011). "MCMC Using Hamiltonian Dynamics." Handbook of Markov Chain Monte Carlo
(eds. Brooks et al.), CRC Press, 113-162.
Betancourt, M. (2017). "A Conceptual Introduction to Hamiltonian Monte Carlo."
arXiv:1701.02434.
Hoffman, M.D., & Gelman, A. (2014). "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo."
Journal of Machine Learning Research, 15(1), 1593-1623.
Haario, H., Saksman, E., & Tamminen, J. (1999). "Adaptive Proposal Distribution for Random Walk Metropolis Algorithm."
Computational Statistics, 14, 375-395.
Roberts, G.O., Gelman, A., & Gilks, W.R. (1997). "Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms."
Annals of Applied Probability, 7(1), 110-120.
Roberts, G.O., & Rosenthal, J.S. (2001). "Optimal Scaling for Various Metropolis–Hastings Algorithms."
Statistical Science, 16(4), 351-367.
Books:
Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., & Rubin, D.B. (2013).
Bayesian Data Analysis (3rd ed.). CRC Press.
Software Implementations:
Stan: mc-stan.org - uses NUTS (No-U-Turn Sampler), an adaptive HMC variant
PyMC: pymc.io - Python probabilistic programming with HMC/NUTS
NumPyro: JAX-based probabilistic programming with efficient HMC