Interactive visualization of gradient-based MCMC sampling from 2D posterior distributions
Part 2 of the MCMC Samplers Series | See also: Metropolis–Hastings
Introduction to Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) is an MCMC algorithm that uses gradient information
to generate efficient proposals via Hamiltonian dynamics. Unlike random-walk methods
(such as Metropolis–Hastings), HMC makes distant proposals while maintaining high acceptance rates by
exploiting the geometry of the target distribution.
HMC addresses key limitations of random-walk MCMC: the diffusive nature of exploration and
poor performance with correlated parameters. By simulating Hamiltonian dynamics, HMC generates
coherent trajectories through parameter space, dramatically improving efficiency, especially in high dimensions.
Why HMC?
Efficient exploration: Generates distant proposals with high acceptance probability using gradient information
Reduced autocorrelation: Produces nearly independent samples by making coherent moves through parameter space
Handles correlations: Exploits geometric information from gradients, efficient even with strongly correlated parameters
Scales to high dimensions: Convergence rate depends on condition number rather than dimensionality
Industry standard: Powers modern probabilistic programming frameworks like Stan, PyMC, and NumPyro
The Key Insight: HMC augments the target distribution with auxiliary momentum variables
and simulates Hamiltonian dynamics using a symplectic integrator. Energy conservation ensures the
sampler spends more time in high-probability regions, while the momentum enables efficient exploration
by carrying the trajectory across low-probability regions.
The Hamiltonian Monte Carlo Algorithm
HMC was introduced by Duane et al. (1987) for lattice field theory and popularized
for statistics by Neal (1996, 2011). The algorithm augments the parameter vector \(\mathbf{q}\)
(position) with a momentum vector \(\mathbf{p}\), defining a Hamiltonian system.
The Hamiltonian:
The total energy of the system combines potential energy (the negative log posterior) and kinetic energy:
$$H(\mathbf{q}, \mathbf{p}) = U(\mathbf{q}) + K(\mathbf{p}), \qquad U(\mathbf{q}) = -\log \pi(\mathbf{q}), \qquad K(\mathbf{p}) = \tfrac{1}{2}\mathbf{p}^\top M^{-1}\mathbf{p},$$
where \(M\) is the mass matrix (the covariance of the momentum distribution). Up to a constant, \(\exp(-H(\mathbf{q}, \mathbf{p})) = \pi(\mathbf{q})\,\mathcal{N}(\mathbf{p} \mid \mathbf{0}, M)\), so discarding the momenta leaves samples from the target \(\pi(\mathbf{q})\).
Sample momentum: At each iteration, draw a fresh momentum \(\mathbf{p} \sim \mathcal{N}(\mathbf{0}, M)\), independent of the current position.
Simulate Hamiltonian dynamics: Starting from \((\mathbf{q}, \mathbf{p})\), integrate
Hamilton's equations for \(L\) steps of size \(\epsilon\):
$$\begin{aligned}
\frac{d\mathbf{q}}{dt} &= \frac{\partial H}{\partial \mathbf{p}} = M^{-1}\mathbf{p}\\
\frac{d\mathbf{p}}{dt} &= -\frac{\partial H}{\partial \mathbf{q}} = -\nabla U(\mathbf{q})
\end{aligned}$$
We use the leapfrog integrator (symplectic, reversible):
Half-step for momentum: \(\mathbf{p} \leftarrow \mathbf{p} - \frac{\epsilon}{2}\nabla U(\mathbf{q})\)
for \(i = 1\) to \(L\):
Full step for position: \(\mathbf{q} \leftarrow \mathbf{q} + \epsilon M^{-1}\mathbf{p}\)
Step for momentum: \(\mathbf{p} \leftarrow \mathbf{p} - \epsilon \nabla U(\mathbf{q})\) (half step \(\epsilon/2\) on the final iteration)
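As a concrete illustration, here is a minimal NumPy sketch of the leapfrog update. This is not the demo's own code: the gradient function grad_U and the (default identity) mass matrix are assumptions for illustration.

```python
import numpy as np

def leapfrog(q, p, grad_U, epsilon, L, M_inv=None):
    """Run L leapfrog steps of size epsilon from (q, p); returns the end point."""
    if M_inv is None:
        M_inv = np.eye(len(q))               # assume an identity mass matrix
    q, p = q.copy(), p.copy()
    p -= 0.5 * epsilon * grad_U(q)           # initial half-step for momentum
    for i in range(L):
        q += epsilon * (M_inv @ p)           # full step for position
        if i < L - 1:
            p -= epsilon * grad_U(q)         # full step for momentum
        else:
            p -= 0.5 * epsilon * grad_U(q)   # final half-step for momentum
    return q, p
```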
Metropolis acceptance: Propose \((\mathbf{q}^*, \mathbf{p}^*)\) from dynamics. Accept with probability:
$$\alpha = \min\left(1, \exp(H(\mathbf{q}, \mathbf{p}) - H(\mathbf{q}^*, \mathbf{p}^*))\right)$$
Due to numerical errors, \(H\) is not perfectly conserved, but acceptance corrects for this.
Return position: If accepted, set \(\mathbf{q}_{t+1} = \mathbf{q}^*\); else \(\mathbf{q}_{t+1} = \mathbf{q}_t\)
Why it works: The leapfrog integrator is symplectic (volume-preserving) and time-reversible, which keeps
the Metropolis correction simple and exact, and it approximately conserves the Hamiltonian \(H\), which keeps acceptance rates high.
Regions of high probability correspond to low potential energy \(U(\mathbf{q})\), where trajectories
naturally spend more time, while momentum enables coherent exploration across low-probability barriers.
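Putting the steps together, a single HMC transition might look like the following sketch. It reuses the leapfrog function from the snippet above; log_pi and grad_log_pi (the log density and its gradient) are assumed inputs, and an identity mass matrix is assumed.

```python
import numpy as np

def hmc_step(q, log_pi, grad_log_pi, epsilon, L, rng):
    """One HMC transition targeting exp(log_pi); returns (new position, accepted?)."""
    U = lambda x: -log_pi(x)                        # potential energy
    grad_U = lambda x: -grad_log_pi(x)
    p = rng.standard_normal(len(q))                 # sample momentum p ~ N(0, I)
    current_H = U(q) + 0.5 * (p @ p)                # energy at the start of the trajectory
    q_star, p_star = leapfrog(q, p, grad_U, epsilon, L)   # simulate the dynamics
    proposed_H = U(q_star) + 0.5 * (p_star @ p_star)
    if rng.random() < np.exp(current_H - proposed_H):     # Metropolis accept/reject
        return q_star, True
    return q, False

# Usage sketch: run a short chain on a standard 2D Gaussian.
rng = np.random.default_rng(0)
log_pi = lambda x: -0.5 * (x @ x)
grad_log_pi = lambda x: -x
q = np.zeros(2)
for _ in range(1000):
    q, accepted = hmc_step(q, log_pi, grad_log_pi, epsilon=0.2, L=20, rng=rng)
```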
Target Distributions
This demonstration provides three target distributions ranging from simple to challenging,
illustrating different aspects of MCMC performance:
Bivariate Gaussian (Default)
A standard correlated 2D Gaussian (zero mean, unit variances) with correlation coefficient \(\rho = 0.8\):
$$\pi(x_1, x_2) \propto \exp\!\left(-\frac{x_1^2 - 2\rho x_1 x_2 + x_2^2}{2(1 - \rho^2)}\right)$$
This distribution has elliptical contours aligned with the correlation structure.
With strong correlation (\(\rho = 0.8\)), random-walk proposals struggle because they explore
axis-by-axis when parameters should move together along the correlation direction.
Rosenbrock's Banana Distribution
A transformed Gaussian that creates a curved, banana-shaped density (Haario et al., 1999).
This distribution is strongly correlated along a curved manifold, making it difficult for
random-walk samplers to explore efficiently. The "banana" shape arises from the nonlinear
transformation \(x_2 - x_1^2\), creating regions where standard proposals are inefficient.
Neal's Funnel Distribution
A hierarchical model that exhibits strong scale variations (Neal, 2003):
The joint density is \(\pi(x_1, x_2) \propto \exp(-x_1^2/18 - x_2^2/(2e^{2x_1}))\).
This creates a "funnel" where the scale of \(x_2\) depends exponentially on \(x_1\),
challenging for fixed-width proposals. Common in hierarchical Bayesian models with variance parameters.
Why these distributions? The Gaussian provides a baseline to understand basic behavior.
Rosenbrock's banana and Neal's funnel are standard benchmarks in the MCMC literature—the banana tests
handling of strong nonlinear correlations, while the funnel tests adaptation to varying scales.
These challenges motivate advanced methods like HMC and adaptive MCMC.
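For readers who want to experiment outside the demo, hedged sketches of the three target log densities might look as follows. The funnel follows the density stated above; the Gaussian assumes zero mean and unit variances with \(\rho = 0.8\); the banana parameterisation (constants 20 and b = 0.1) is an illustrative choice, not necessarily the one used by this page.

```python
import numpy as np

RHO = 0.8   # correlation of the Gaussian target
B = 0.1     # banana "bend" parameter (illustrative assumption)

def log_gaussian(x):
    """Zero-mean bivariate Gaussian, unit variances, correlation RHO."""
    x1, x2 = x
    return -(x1**2 - 2 * RHO * x1 * x2 + x2**2) / (2 * (1 - RHO**2))

def log_banana(x):
    """A common Rosenbrock-style banana: curved ridge along x2 ~ B * x1**2."""
    x1, x2 = x
    return -x1**2 / 20.0 - (x2 - B * x1**2) ** 2 / 2.0

def log_funnel(x):
    """Neal-style funnel as stated above: -x1^2/18 - x2^2 / (2 e^{2 x1})."""
    x1, x2 = x
    return -x1**2 / 18.0 - x2**2 / (2.0 * np.exp(2.0 * x1))
```

For HMC, the corresponding gradients can be derived by hand or obtained with automatic differentiation (e.g. jax.grad).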
Simulation Controls
HMC has two key tuning parameters: the step size \(\epsilon\) (how large each
leapfrog step is) and the number of steps \(L\) (how long the trajectory is).
The product \(L \times \epsilon\) determines the trajectory length.
Step size (\(\epsilon\)): Size of each leapfrog integration step. Smaller → more accurate, but slower.
Number of steps (\(L\)): Length of the trajectory. More steps → longer proposals, but can overshoot.
Animation speed: Delay between iterations. Slower speeds help visualize trajectories.
Target distribution: Same distributions as Metropolis–Hastings for direct comparison.
Tuning advice: The step size controls the trade-off between integration accuracy and computational cost.
Too large a step leads to high rejection rates; too small a step wastes computational resources.
Aim for acceptance rates of 60-90% (compared to ~23% for optimally tuned random-walk MH in high dimensions).
The No-U-Turn Sampler (NUTS) automatically tunes both step size and trajectory length.
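As an example of such automatic tuning, a hedged NumPyro sketch of running NUTS on the funnel-like hierarchical model might look like this (the warmup and sample counts are illustrative choices):

```python
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def funnel_model():
    # Hierarchical "funnel": the scale of x2 depends exponentially on x1.
    x1 = numpyro.sample("x1", dist.Normal(0.0, 3.0))
    numpyro.sample("x2", dist.Normal(0.0, jnp.exp(x1)))

kernel = NUTS(funnel_model)                  # NUTS adapts step size and path length
mcmc = MCMC(kernel, num_warmup=1000, num_samples=2000)
mcmc.run(random.PRNGKey(0))
mcmc.print_summary()                         # effective sample sizes and R-hat per parameter
samples = mcmc.get_samples()                 # dictionary of posterior draws
```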
HMC Trajectory Visualization
The main plot shows the joint target distribution \(\pi(q_1, q_2)\) with darker regions indicating higher probability.
HMC trajectories follow curved paths determined by Hamilton's equations, guided by the gradient of the log density.
Green points indicate accepted proposals and
orange points indicate rejected proposals.
The plot axes automatically adjust to each distribution's natural range.
Joint Posterior Distribution π(x₁, x₂)
The chain should explore the high-density regions (darker areas) while maintaining enough
randomness to avoid getting trapped in any single location.
Marginal Distribution of x₁
Histogram of \(x_1\) samples. Should converge to the true marginal \(\pi(x_1) = \int \pi(x_1, x_2) dx_2\).
Marginal Distribution of x₂
Histogram of \(x_2\) samples. With enough samples, this approximates the true marginal distribution.
Current Iteration Details
Click "Start Sampling" or "Single Step" to begin the MCMC algorithm.
Chain Trace Plots
Trace plots show the evolution of each parameter value over iterations. They are essential
for diagnosing convergence and mixing of the MCMC chain.
What to look for in trace plots:
Good mixing: The trace should look like a "fuzzy caterpillar": random fluctuations
around a stable mean with no obvious patterns or trends
Stationarity: The mean and variance should remain constant over time (after burn-in)
No trends: The trace shouldn't show long-term upward or downward trends
No getting stuck: The chain shouldn't remain at the same value for extended periods
Trace Plot: x₁ Evolution
Chain trace for parameter \(x_1\). Rapid fluctuations indicate good mixing; long periods at similar
values suggest poor exploration.
Trace Plot: x₂ Evolution
Chain trace for parameter \(x_2\). Compare with \(x_1\) trace to assess if both parameters mix at similar rates.
Chain Diagnostics
Autocorrelation Function (ACF)
The autocorrelation function measures the correlation between samples separated by
\(\tau\) iterations (the lag). For a well-mixing chain, the ACF should decay rapidly to zero.
Mathematical Definition:
For a chain \(\{x_t\}_{t=1}^{n}\) with sample mean \(\bar{x}\) and sample variance \(s^2\), the sample ACF at lag \(\tau\) is:
$$\hat{\rho}(\tau) = \frac{1}{n\, s^2} \sum_{t=1}^{n-\tau} (x_t - \bar{x})(x_{t+\tau} - \bar{x})$$
\(\hat{\rho}(0) = 1\): Perfect correlation with itself
\(\hat{\rho}(\tau) \to 0\) as \(\tau \to \infty\): Samples become independent (for ergodic chains)
Slow decay: High autocorrelation indicates the chain mixes slowly
Fast decay: Low autocorrelation indicates efficient exploration
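A minimal NumPy sketch of this estimator (normalising every lag by \(n\) and by the lag-0 autocovariance) might look like:

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of a 1-D chain for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    c0 = np.dot(x, x) / n                        # lag-0 autocovariance, equals s^2
    acf = np.empty(max_lag + 1)
    for tau in range(max_lag + 1):
        acf[tau] = np.dot(x[: n - tau], x[tau:]) / (n * c0)
    return acf
```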
Autocorrelation Function (ACF)
Blue line: ACF for \(x_1\) parameter.
Red line: ACF for \(x_2\) parameter.
The lag range adapts automatically: it extends to the lag where the ACF first crosses zero, plus 33% additional lags,
so you can verify that the chain converges toward independence.
Performance Metrics
Live counters report the total number of iterations, the acceptance rate, the number of accepted proposals, and an estimate of the effective sample size.
Effective Sample Size (ESS)
Due to autocorrelation, MCMC samples are not independent. The ESS estimates how many independent
samples our correlated chain is equivalent to:
$$\text{ESS} = \frac{n}{1 + 2 \sum_{\tau=1}^{\infty} \rho(\tau)},$$
where \(n\) is the total number of samples and \(\rho(\tau)\) is the ACF. Higher ESS means more
efficient sampling. The ratio ESS/\(n\) is called the sampling efficiency.
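In practice the infinite sum is truncated; one simple convention stops at the first non-positive autocorrelation. A sketch using the sample_acf function from the previous snippet:

```python
import numpy as np

def effective_sample_size(x, max_lag=None):
    """Crude ESS estimate: truncate the ACF sum at the first non-positive lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if max_lag is None:
        max_lag = n // 2
    acf = sample_acf(x, max_lag)          # sample ACF from the sketch above
    tail = 0.0
    for tau in range(1, max_lag + 1):
        if acf[tau] <= 0.0:               # stop once correlations are no longer positive
            break
        tail += acf[tau]
    return n / (1.0 + 2.0 * tail)
```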
Note on acceptance rates: Unlike random-walk Metropolis (optimal acceptance ~23% in high dimensions),
HMC maintains much higher acceptance rates (60-90%). This is possible because the symplectic leapfrog
integrator approximately conserves the Hamiltonian, so proposals stay close to a surface of constant
energy and hence nearly constant joint probability.
Strengths & Limitations of HMC
✓ Strengths
Efficient exploration: Generates coherent trajectories rather than diffusive random walks
Low autocorrelation: Produces nearly independent samples, high effective sample size
Exploits geometry: Uses gradient information to navigate complex correlation structures
High acceptance probability: Typically 60-90% due to symplectic integration
Dimension-robust: Scales with condition number rather than dimensionality
✗ Limitations
Requires differentiability: Needs \(\nabla \log \pi(\mathbf{q})\) to compute Hamiltonian dynamics; discrete parameters and non-smooth densities are not directly supported
Gradient cost: Each proposal requires \(L\) gradient evaluations, which can be expensive for complex models
Tuning sensitivity: Performance depends strongly on the step size \(\epsilon\) and the trajectory length \(L\)
Practical implementations: NUTS (No-U-Turn Sampler) in Stan/PyMC automatically adapts
step size and trajectory length, making HMC accessible without manual tuning
Comparison: HMC vs. Metropolis–Hastings
Direct comparison on the same target distributions reveals HMC's advantages:
Key Differences:
Property | Metropolis–Hastings | HMC
Proposal mechanism | Random walk | Hamiltonian dynamics
Gradients needed? | No | Yes
Typical acceptance rate | 23-40% (high-d) | 65-90%
Autocorrelation | High | Low
Effective sample size | Much lower than \(n\) | Closer to \(n\)
Best for | Low-d, gradients unavailable | High-d, smooth posteriors
Key advantage: HMC explores parameter space coherently rather than diffusively,
dramatically reducing autocorrelation.
Nested Sampling
A fundamentally different approach that simultaneously computes samples from the posterior
and the model evidence (marginal likelihood). Instead of evolving a Markov chain,
nested sampling progressively samples from constrained priors with increasing likelihood thresholds.
References & Further Reading
Key Papers:
Duane, S., Kennedy, A.D., Pendleton, B.J., & Roweth, D. (1987).
"Hybrid Monte Carlo." Physics Letters B, 195(2), 216-222.
Neal, R.M. (1996). "Bayesian Learning for Neural Networks." PhD Thesis, University of Toronto.
Neal, R.M. (2011). "MCMC Using Hamiltonian Dynamics." Handbook of Markov Chain Monte Carlo
(eds. Brooks et al.), CRC Press, 113-162.
Betancourt, M. (2017). "A Conceptual Introduction to Hamiltonian Monte Carlo."
arXiv:1701.02434.
Hoffman, M.D., & Gelman, A. (2014). "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo."
Journal of Machine Learning Research, 15(1), 1593-1623.
Haario, H., Saksman, E., & Tamminen, J. (1999). "Adaptive Proposal Distribution for Random Walk Metropolis Algorithm."
Computational Statistics, 14, 375-395.
Neal, R.M. (2003). "Slice Sampling." The Annals of Statistics, 31(3), 705-767.
Books:
Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., & Rubin, D.B. (2013).
Bayesian Data Analysis (3rd ed.). CRC Press.
Software Implementations:
Stan (mc-stan.org): Uses NUTS (No-U-Turn Sampler), an adaptive HMC variant
PyMC (pymc.io): Python probabilistic programming with HMC/NUTS
NumPyro: JAX-based probabilistic programming with efficient HMC