Hamiltonian Monte Carlo (HMC)

Interactive visualization of gradient-based MCMC sampling from 2D posterior distributions
Part 2 of the MCMC Samplers Series | See also: Metropolis–Hastings

Introduction to Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) is an advanced MCMC algorithm that uses gradient information and concepts from Hamiltonian dynamics to generate efficient proposals. Unlike random-walk methods (like Metropolis–Hastings), HMC proposes distant states while maintaining high acceptance rates by following the natural geometry of the posterior distribution.

HMC addresses the two main weaknesses of random-walk MCMC: diffusive exploration (many small steps that frequently backtrack) and sensitivity to correlations (isotropic proposals are inefficient in elongated posteriors). By simulating Hamiltonian dynamics, HMC proposes states that move coherently along the posterior's contours, dramatically improving efficiency, especially in high dimensions.

Why HMC?
  • Efficient exploration: Uses gradient information to propose distant states with high acceptance probability
  • Reduced autocorrelation: Makes large, coherent moves through parameter space instead of small random steps
  • Handles correlations: Naturally adapts to the posterior geometry, efficient even with strong correlations
  • Scalable: Efficiency degrades far more slowly with dimension than it does for random-walk methods
  • Industry standard: Powers modern Bayesian software like Stan, PyMC, and NumPyro
The Key Insight: HMC augments the parameter space with auxiliary momentum variables and simulates Hamiltonian dynamics. This transforms the sampling problem into a physics simulation where a particle explores the posterior by conserving total energy—naturally spending more time in high-probability regions while making efficient long-distance moves.

The Hamiltonian Monte Carlo Algorithm

HMC was introduced by Duane et al. (1987) for lattice field theory and popularized for statistics by Neal (1996, 2011). The algorithm augments the parameter vector \(\mathbf{q}\) (position) with a momentum vector \(\mathbf{p}\), defining a Hamiltonian system.

The Hamiltonian:

The total energy of the system combines potential energy (negative log posterior) and kinetic energy:

$$H(\mathbf{q}, \mathbf{p}) = U(\mathbf{q}) + K(\mathbf{p})$$

where:

  • \(U(\mathbf{q}) = -\log \pi(\mathbf{q})\) is the potential energy (negative log target density)
  • \(K(\mathbf{p}) = \frac{1}{2}\mathbf{p}^T M^{-1} \mathbf{p}\) is the kinetic energy (mass matrix \(M\))

In this demo, we use \(M = I\) (identity matrix), so \(K(\mathbf{p}) = \frac{1}{2}\sum_i p_i^2\).
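To make the energy bookkeeping concrete, here is a minimal NumPy sketch of \(U\), \(K\), and \(H\) with \(M = I\) (not the demo's actual code; the function names and the standard-Gaussian example target are purely illustrative):

```python
import numpy as np

def potential_energy(q, log_target):
    """U(q) = -log pi(q): the negative log target density."""
    return -log_target(q)

def kinetic_energy(p):
    """K(p) = 0.5 * p^T p: kinetic energy with identity mass matrix M = I."""
    return 0.5 * np.dot(p, p)

# Illustrative target: standard 2D Gaussian, log pi(q) = -0.5 * ||q||^2 (up to a constant)
log_target = lambda q: -0.5 * np.dot(q, q)
q = np.array([1.0, -0.5])                          # position (the parameters)
p = np.random.default_rng(0).normal(size=2)        # momentum, p ~ N(0, M) with M = I
H = potential_energy(q, log_target) + kinetic_energy(p)
print(f"H(q, p) = {H:.3f}")
```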

HMC Algorithm Steps:
  1. Sample momentum: Draw \(\mathbf{p} \sim \mathcal{N}(0, M)\)
  2. Simulate Hamiltonian dynamics: Starting from \((\mathbf{q}, \mathbf{p})\), integrate Hamilton's equations for \(L\) steps of size \(\epsilon\): $$\begin{aligned} \frac{d\mathbf{q}}{dt} &= \frac{\partial H}{\partial \mathbf{p}} = M^{-1}\mathbf{p}\\ \frac{d\mathbf{p}}{dt} &= -\frac{\partial H}{\partial \mathbf{q}} = -\nabla U(\mathbf{q}) \end{aligned}$$ We use the leapfrog integrator (symplectic, reversible):
    \(\mathbf{p} \leftarrow \mathbf{p} - \frac{\epsilon}{2}\nabla U(\mathbf{q})\)
    for \(i = 1\) to \(L\):
      \(\mathbf{q} \leftarrow \mathbf{q} + \epsilon M^{-1}\mathbf{p}\)
      \(\mathbf{p} \leftarrow \mathbf{p} - \epsilon \nabla U(\mathbf{q})\)  (except last step: \(\epsilon/2\))
  3. Metropolis acceptance: Propose \((\mathbf{q}^*, \mathbf{p}^*)\) from dynamics. Accept with probability: $$\alpha = \min\left(1, \exp(H(\mathbf{q}, \mathbf{p}) - H(\mathbf{q}^*, \mathbf{p}^*))\right)$$ Due to numerical errors, \(H\) is not perfectly conserved, but acceptance corrects for this.
  4. Return position: If accepted, set \(\mathbf{q}_{t+1} = \mathbf{q}^*\); else \(\mathbf{q}_{t+1} = \mathbf{q}_t\)
Why it works: The leapfrog integrator approximately conserves the Hamiltonian \(H\). Regions with low \(U(\mathbf{q})\) (high probability) correspond to low potential energy, where the particle naturally spends more time—just like a ball rolling in a valley. The momentum carries the particle across low-probability regions, enabling efficient exploration without getting trapped.
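The sketch below puts steps 1-4 together for a single HMC transition, assuming \(M = I\). It is a minimal illustration under those assumptions, not the demo's implementation; the standard-Gaussian target and all names are placeholders:

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, L):
    """Leapfrog integration of Hamilton's equations (identity mass matrix)."""
    q, p = q.copy(), p.copy()
    p -= 0.5 * eps * grad_U(q)              # initial half step for momentum
    for i in range(L):
        q += eps * p                        # full step for position (M = I)
        if i < L - 1:
            p -= eps * grad_U(q)            # full step for momentum
    p -= 0.5 * eps * grad_U(q)              # final half step for momentum
    return q, p

def hmc_step(q, U, grad_U, eps, L, rng):
    """One HMC transition: sample momentum, integrate, accept/reject."""
    p = rng.normal(size=q.shape)                        # step 1: p ~ N(0, I)
    q_new, p_new = leapfrog(q, p, grad_U, eps, L)       # step 2: simulate dynamics
    # (Negating p_new would make the proposal exactly reversible, but since
    #  K(p) = K(-p) it does not change the acceptance probability.)
    H_old = U(q) + 0.5 * p @ p
    H_new = U(q_new) + 0.5 * p_new @ p_new
    if rng.random() < np.exp(H_old - H_new):            # step 3: alpha = min(1, exp(dH))
        return q_new, True                              # step 4: accept
    return q, False                                     # step 4: reject (stay put)

# Illustrative run on a standard 2D Gaussian: U(q) = 0.5 ||q||^2, grad U(q) = q
U = lambda q: 0.5 * q @ q
grad_U = lambda q: q
rng = np.random.default_rng(42)
q, accepts, samples = np.zeros(2), 0, []
for _ in range(1000):
    q, accepted = hmc_step(q, U, grad_U, eps=0.3, L=10, rng=rng)
    samples.append(q)
    accepts += accepted
print(f"acceptance rate: {accepts / 1000:.2f}")
```

On this easy target, \(\epsilon = 0.3\) and \(L = 10\) should give an acceptance rate close to 1; the interesting tuning questions arise on the harder targets below.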

Target Distributions

This demonstration provides three target distributions ranging from simple to challenging, illustrating different aspects of MCMC performance:

Bivariate Gaussian (Default)

A standard correlated 2D Gaussian with correlation coefficient \(\rho = 0.8\):

$$\pi(x_1, x_2) \propto \exp\left(-\frac{1}{2(1-\rho^2)}(x_1^2 - 2\rho x_1 x_2 + x_2^2)\right)$$

This distribution has elliptical contours aligned with the correlation structure. With strong correlation (\(\rho = 0.8\)), random-walk proposals struggle: isotropic steps do not align with the narrow ridge of the density, so the chain needs many small moves to travel along the correlation direction.
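Since HMC needs \(\nabla U\), here is the potential energy and its analytic gradient for this target as a NumPy sketch (constants dropped; names are illustrative):

```python
import numpy as np

RHO = 0.8  # correlation used in the demo

def U_gauss(x):
    """U(x) = -log pi(x) for the correlated Gaussian, up to an additive constant."""
    x1, x2 = x
    return (x1**2 - 2 * RHO * x1 * x2 + x2**2) / (2 * (1 - RHO**2))

def grad_U_gauss(x):
    """Gradient of U: equal to Sigma^{-1} x with Sigma = [[1, rho], [rho, 1]]."""
    x1, x2 = x
    return np.array([x1 - RHO * x2, x2 - RHO * x1]) / (1 - RHO**2)

# Sanity check against the covariance matrix directly
Sigma_inv = np.linalg.inv(np.array([[1.0, RHO], [RHO, 1.0]]))
x = np.array([0.5, -1.0])
assert np.allclose(grad_U_gauss(x), Sigma_inv @ x)
```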

Rosenbrock's Banana Distribution

A transformed Gaussian that creates a curved, banana-shaped density (Haario et al., 1999):

$$\pi(x_1, x_2) \propto \exp\left(-\frac{1}{200}(x_1^2 + 100(x_2 - x_1^2)^2)\right)$$

This distribution is strongly correlated along a curved manifold, making it difficult for random-walk samplers to explore efficiently. The "banana" shape arises from the nonlinear transformation \(x_2 - x_1^2\), creating regions where standard proposals are inefficient.
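A corresponding sketch of \(U = -\log \pi\) and its gradient for the banana density (illustrative names; the finite-difference comparison is only a sanity check):

```python
import numpy as np

def U_banana(x):
    """U(x) = (x1^2 + 100 * (x2 - x1^2)^2) / 200, up to an additive constant."""
    x1, x2 = x
    return (x1**2 + 100.0 * (x2 - x1**2)**2) / 200.0

def grad_U_banana(x):
    """Analytic gradient of U_banana."""
    x1, x2 = x
    dU_dx1 = x1 / 100.0 - 2.0 * x1 * (x2 - x1**2)
    dU_dx2 = x2 - x1**2
    return np.array([dU_dx1, dU_dx2])

# Sanity check against central finite differences
x, h = np.array([1.0, 2.0]), 1e-6
fd = np.array([(U_banana(x + h * e) - U_banana(x - h * e)) / (2 * h) for e in np.eye(2)])
assert np.allclose(grad_U_banana(x), fd, atol=1e-5)
```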

Neal's Funnel Distribution

A hierarchical model that exhibits strong scale variations (Neal, 2003):

$$\begin{aligned} x_1 &\sim \mathcal{N}(0, 3^2) \\ x_2 \mid x_1 &\sim \mathcal{N}(0, \exp(x_1)^2) \end{aligned}$$

The joint density is \(\pi(x_1, x_2) \propto \exp\left(-x_1^2/18 - x_1 - x_2^2/(2e^{2x_1})\right)\); the extra \(-x_1\) term comes from the \(x_1\)-dependent normalization of the conditional. This creates a "funnel" where the scale of \(x_2\) depends exponentially on \(x_1\), which is challenging for fixed-width proposals. This structure is common in hierarchical Bayesian models with variance parameters.
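And the same for the funnel, using the joint density above including the \(-x_1\) normalization term (again an illustrative sketch with a finite-difference sanity check):

```python
import numpy as np

def U_funnel(x):
    """U(x) = -log pi(x) for the funnel: x1 ~ N(0, 3^2), x2 | x1 ~ N(0, exp(x1)^2).
    The linear x1 term comes from the x1-dependent normalization of the conditional."""
    x1, x2 = x
    return x1**2 / 18.0 + x1 + 0.5 * x2**2 * np.exp(-2.0 * x1)

def grad_U_funnel(x):
    """Analytic gradient of U_funnel."""
    x1, x2 = x
    dU_dx1 = x1 / 9.0 + 1.0 - x2**2 * np.exp(-2.0 * x1)
    dU_dx2 = x2 * np.exp(-2.0 * x1)
    return np.array([dU_dx1, dU_dx2])

# Sanity check against central finite differences
x, h = np.array([0.5, 1.5]), 1e-6
fd = np.array([(U_funnel(x + h * e) - U_funnel(x - h * e)) / (2 * h) for e in np.eye(2)])
assert np.allclose(grad_U_funnel(x), fd, atol=1e-5)
```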

Why these distributions? The Gaussian provides a baseline to understand basic behavior. Rosenbrock's banana and Neal's funnel are standard benchmarks in the MCMC literature—the banana tests handling of strong nonlinear correlations, while the funnel tests adaptation to varying scales. These challenges motivate advanced methods like HMC and adaptive MCMC.

Simulation Controls

HMC has two key tuning parameters: the step size \(\epsilon\) (how large each leapfrog step is) and the number of steps \(L\) (how long the trajectory is). The product \(L \times \epsilon\) determines the trajectory length.

  • Step size (\(\epsilon\)): Size of each leapfrog integration step. Smaller → more accurate, but slower.
  • Number of steps (\(L\)): Length of trajectory. More steps → longer proposals, but can overshoot.
  • Animation speed: Delay between iterations. Slower speeds help visualize trajectories.
  • Target distribution: Same distributions as Metropolis–Hastings for direct comparison.
Tuning advice: Optimal step size balances integration accuracy vs. efficiency. Too large → high rejection due to numerical error. Too small → inefficient short proposals. Aim for acceptance rates around 65-90% (much higher than MH's 23-40%!). Modern implementations like NUTS (No-U-Turn Sampler) automatically tune these parameters.
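The self-contained sketch below illustrates the step-size trade-off on the correlated Gaussian target, using the same leapfrog logic as above in compact form; the specific \(\epsilon\) values, \(L = 10\), and all names are arbitrary choices for illustration:

```python
import numpy as np

def hmc_acceptance_rate(U, grad_U, eps, L, n_iter=2000, seed=0):
    """Run a short HMC chain (identity mass matrix) and return its acceptance rate."""
    rng = np.random.default_rng(seed)
    q, accepts = np.zeros(2), 0
    for _ in range(n_iter):
        p = rng.normal(size=2)
        q_new, p_new = q.copy(), p.copy()
        p_new -= 0.5 * eps * grad_U(q_new)                  # leapfrog: initial half step
        for i in range(L):
            q_new += eps * p_new
            p_new -= (0.5 if i == L - 1 else 1.0) * eps * grad_U(q_new)
        dH = (U(q) + 0.5 * p @ p) - (U(q_new) + 0.5 * p_new @ p_new)
        if np.log(rng.random()) < dH:                       # accept with prob min(1, exp(dH))
            q, accepts = q_new, accepts + 1
    return accepts / n_iter

# Correlated Gaussian target (rho = 0.8), as in the demo
RHO = 0.8
U = lambda x: (x[0]**2 - 2 * RHO * x[0] * x[1] + x[1]**2) / (2 * (1 - RHO**2))
grad_U = lambda x: np.array([x[0] - RHO * x[1], x[1] - RHO * x[0]]) / (1 - RHO**2)

for eps in (0.05, 0.2, 0.8, 1.6):
    print(f"eps = {eps:4}  L = 10  acceptance rate = {hmc_acceptance_rate(U, grad_U, eps, 10):.2f}")
```

Very small steps give near-perfect acceptance but waste gradient evaluations on short trajectories, while overly large steps destabilize the integrator and drive the acceptance rate toward zero.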

HMC Trajectory Visualization

The main plot shows the joint posterior \(\pi(x_1, x_2)\) with darker regions indicating higher probability. Unlike random-walk methods, HMC trajectories follow curved paths determined by the gradient of the log posterior. Green points indicate accepted proposals and orange points indicate rejected proposals. The plot axes automatically adjust to each distribution's natural range.

Joint Posterior Distribution π(x₁, x₂)

The chain should explore the high-density regions (darker areas) while maintaining enough randomness to avoid getting trapped in any single location.

Marginal Distribution of x₁

Histogram of \(x_1\) samples. Should converge to the true marginal \(\pi(x_1) = \int \pi(x_1, x_2) dx_2\).

Marginal Distribution of x₂

Histogram of \(x_2\) samples. With enough samples, this approximates the true marginal distribution.

Current Iteration Details
Click "Start Sampling" or "Single Step" to begin the MCMC algorithm.

Chain Trace Plots

Trace plots show the evolution of each parameter value over iterations. They are essential for diagnosing convergence and mixing of the MCMC chain.

What to look for in trace plots:
  • Good mixing: The trace should look like a "fuzzy caterpillar": random fluctuations around a stable mean with no obvious patterns or trends
  • Stationarity: The mean and variance should remain constant over time (after burn-in)
  • No trends: The trace shouldn't show long-term upward or downward trends
  • No getting stuck: The chain shouldn't remain at the same value for extended periods
Trace Plot: x₁ Evolution

Time series of \(x_1\) values. Rapid fluctuations indicate good exploration; long periods at similar values suggest poor mixing.

Trace Plot: x₂ Evolution

Time series of \(x_2\) values. Compare with \(x_1\) trace to check if both parameters mix at similar rates.
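For readers who want to reproduce these diagnostics outside the demo, here is a small self-contained matplotlib sketch contrasting a well-mixing trace with a slowly mixing one; the two synthetic "chains" (white noise and an AR(1) series) are stand-ins, not HMC output:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic "chains": white noise mixes perfectly; an AR(1) series mixes slowly
rng = np.random.default_rng(3)
good = rng.normal(size=2000)
slow = np.zeros(2000)
for t in range(1, 2000):
    slow[t] = 0.98 * slow[t - 1] + rng.normal() * np.sqrt(1 - 0.98**2)

fig, axes = plt.subplots(2, 1, figsize=(8, 4), sharex=True)
axes[0].plot(good, lw=0.5)
axes[0].set_title("Good mixing: a 'fuzzy caterpillar'")
axes[1].plot(slow, lw=0.5, color="tab:red")
axes[1].set_title("Poor mixing: long excursions, slow return to the mean")
axes[1].set_xlabel("iteration")
plt.tight_layout()
plt.show()
```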

Chain Diagnostics

Autocorrelation Function (ACF)

The autocorrelation function measures the correlation between samples separated by \(\tau\) iterations (the lag). For a well-mixing chain, the ACF should decay rapidly to zero.

Mathematical Definition:

For a chain \(\{x_t\}\) with sample mean \(\bar{x}\) and sample variance \(s^2\), the sample ACF at lag \(\tau\) is:

$$\hat{\rho}(\tau) = \frac{\sum_{t=1}^{n-\tau} (x_t - \bar{x})(x_{t+\tau} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2}$$

Interpretation:

  • \(\hat{\rho}(0) = 1\): Perfect correlation with itself
  • \(\hat{\rho}(\tau) \to 0\) as \(\tau \to \infty\): Samples become independent (for ergodic chains)
  • Slow decay: High autocorrelation indicates the chain mixes slowly
  • Fast decay: Low autocorrelation indicates efficient exploration
Autocorrelation Function (ACF)

Blue line: ACF for \(x_1\) parameter. Red line: ACF for \(x_2\) parameter. The lag range adapts automatically—showing where the ACF crosses zero plus 33% additional lags to verify convergence to independence.
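The ACF shown here can be computed directly from the formula above. The sketch below is a plain NumPy implementation; the AR(1) example series is synthetic and only demonstrates the expected geometric decay:

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation rho_hat(tau) for tau = 0..max_lag, normalized by the
    lag-0 sum of squares, exactly as in the formula above."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.sum(xc**2)
    return np.array([np.sum(xc[: len(x) - tau] * xc[tau:]) / denom
                     for tau in range(max_lag + 1)])

# Synthetic example: an AR(1) series with phi = 0.9 has ACF roughly phi**tau
rng = np.random.default_rng(1)
phi, n = 0.9, 5000
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()
print(np.round(sample_acf(x, max_lag=5), 3))   # roughly [1.0, 0.9, 0.81, 0.73, 0.66, 0.59]
```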

Performance Metrics

The dashboard reports the total number of iterations, the acceptance rate, the number of accepted proposals, and an estimated effective sample size (ESS).
Effective Sample Size (ESS)

Due to autocorrelation, MCMC samples are not independent. The ESS estimates how many independent samples our correlated chain is equivalent to:

$$\text{ESS} \approx \frac{n}{1 + 2\sum_{\tau=1}^{\infty} \rho(\tau)}$$

where \(n\) is the total number of samples and \(\rho(\tau)\) is the ACF. Higher ESS means more efficient sampling. The ratio ESS/\(n\) is called the sampling efficiency.
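One simple way to turn this formula into a number is to truncate the sum at the first lag where the ACF drops below zero (one common heuristic among several). The sketch below does that and checks it against an AR(1) series whose theoretical ESS is known; all names are illustrative:

```python
import numpy as np

def effective_sample_size(x, max_lag=None):
    """ESS = n / (1 + 2 * sum_tau rho(tau)), truncating the sum at the first
    negative ACF value (a simple heuristic, not the only estimator in use)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    max_lag = n // 3 if max_lag is None else max_lag
    xc = x - x.mean()
    denom = np.sum(xc**2)
    tau_sum = 0.0
    for tau in range(1, max_lag + 1):
        rho = np.sum(xc[: n - tau] * xc[tau:]) / denom
        if rho < 0:
            break
        tau_sum += rho
    return n / (1.0 + 2.0 * tau_sum)

# Check: an AR(1) chain with phi = 0.9 has theoretical ESS/n = (1 - phi)/(1 + phi)
rng = np.random.default_rng(2)
phi, n = 0.9, 20000
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()
print(f"ESS estimate: {effective_sample_size(x):.0f} of n = {n} "
      f"(theory: about {n * (1 - phi) / (1 + phi):.0f})")
```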

Tuning the Proposal Width (random-walk Metropolis, for comparison): In random-walk Metropolis, the proposal standard deviation \(\sigma\) directly affects both the acceptance rate and the autocorrelation. In high dimensions the asymptotically optimal acceptance rate is approximately 0.234 (Roberts et al., 1997; Roberts & Rosenthal, 2001); for low-dimensional problems (such as this 2D demo), acceptance rates between 40-60% are typically more efficient, though the optimal rate depends on the target distribution's geometry. HMC has no proposal width; its analogous tuning parameters are the step size \(\epsilon\) and trajectory length \(L\) discussed above.

Strengths & Limitations of HMC

✓ Strengths
  • Efficient exploration: Makes large, coherent moves instead of random walks
  • Reduced autocorrelation: Much lower correlation between samples than MH
  • Handles correlations naturally: Follows posterior geometry via gradients
  • High acceptance rates: Typically 65-90% (vs. 23-40% for MH)
  • Scales to high dimensions: Efficiency degrades much more slowly than random-walk methods as \(d\) increases
✗ Limitations
  • Requires gradients: Need \(\nabla \log \pi(\mathbf{q})\), which isn't always available
  • Tuning parameters: Step size \(\epsilon\) and trajectory length \(L\) need tuning
  • Computational cost: Each iteration requires \(L\) gradient evaluations
  • Numerical errors: Discretization of continuous dynamics introduces error
  • Still struggles with multimodality: Difficult to jump between well-separated modes
When to use HMC:
  • Smooth posteriors: HMC excels when \(\pi(\mathbf{q})\) is differentiable and relatively smooth
  • High dimensions: Particularly valuable for \(d > 10\), where random-walk methods struggle
  • Strong correlations: Automatically adapts to correlation structure via gradient information
  • Modern variants: NUTS (No-U-Turn Sampler) in Stan/PyMC automatically tunes \(\epsilon\) and \(L\), making HMC much more practical

Comparison: HMC vs. Metropolis–Hastings

Direct comparison on the same target distributions reveals HMC's advantages:

Key Differences:
  • Proposal mechanism: Metropolis–Hastings uses a random walk; HMC uses simulated Hamiltonian dynamics
  • Gradients needed: MH, no; HMC, yes
  • Typical acceptance rate: MH, 23-40% (high-dimensional); HMC, 65-90%
  • Autocorrelation: MH, high; HMC, low
  • Effective sample size: MH, much lower than \(n\); HMC, closer to \(n\)
  • Best for: MH, low-dimensional problems or when gradients are unavailable; HMC, high-dimensional, smooth posteriors
Key advantage: HMC explores parameter space coherently rather than randomly, dramatically reducing autocorrelation.

Nested Sampling

A fundamentally different approach that simultaneously computes samples from the posterior and the model evidence (marginal likelihood). Instead of evolving a Markov chain, nested sampling progressively samples from constrained priors with increasing likelihood thresholds.

References & Further Reading

Key Papers:
  • Duane, S., Kennedy, A. D., Pendleton, B. J., & Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2), 216–222.
  • Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In Brooks, Gelman, Jones, & Meng (Eds.), Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.
  • Neal, R. M. (2003). Slice sampling. Annals of Statistics, 31(3), 705–767.
  • Haario, H., Saksman, E., & Tamminen, J. (1999). Adaptive proposal distribution for random walk Metropolis algorithm. Computational Statistics, 14, 375–395.
  • Roberts, G. O., Gelman, A., & Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. Annals of Applied Probability, 7(1), 110–120.
  • Roberts, G. O., & Rosenthal, J. S. (2001). Optimal scaling for various Metropolis–Hastings algorithms. Statistical Science, 16(4), 351–367.
  • Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 1593–1623.

Books:
  • Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer.
  • Brooks, S., Gelman, A., Jones, G. L., & Meng, X.-L. (Eds.) (2011). Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.

Software Implementations:
  • Stan, PyMC, and NumPyro all use HMC/NUTS as their default sampler for continuous parameters.