Hamiltonian Monte Carlo (HMC)

Interactive visualization of gradient-based MCMC sampling from 2D posterior distributions
Part 2 of the MCMC Samplers Series | See also: Metropolis–Hastings

Introduction to Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) is an advanced MCMC algorithm that uses gradient information and concepts from Hamiltonian dynamics to generate efficient proposals. Unlike random-walk methods (like Metropolis–Hastings), HMC proposes distant states while maintaining high acceptance rates by following the natural geometry of the posterior distribution.

HMC addresses the two main weaknesses of random-walk MCMC: diffusive exploration (many small steps that frequently backtrack) and sensitivity to correlations (isotropic proposals are inefficient in elongated posteriors). By simulating Hamiltonian dynamics, HMC proposes states that move coherently along the posterior's contours, dramatically improving efficiency, especially in high dimensions.

Why HMC?
  • Efficient exploration: Uses gradient information to propose distant states with high acceptance probability
  • Reduced autocorrelation: Makes large, coherent moves through parameter space instead of small random steps
  • Handles correlations: Naturally adapts to the posterior geometry, efficient even with strong correlations
  • Scalable: Efficiency degrades far more slowly with dimension than it does for random-walk methods
  • Industry standard: Powers modern Bayesian software like Stan, PyMC, and NumPyro
The Key Insight: HMC augments the parameter space with auxiliary momentum variables and simulates Hamiltonian dynamics. This transforms the sampling problem into a physics simulation where a particle explores the posterior by conserving total energy—naturally spending more time in high-probability regions while making efficient long-distance moves.

The Hamiltonian Monte Carlo Algorithm

HMC was introduced by Duane et al. (1987) for lattice field theory and popularized for statistics by Neal (1996, 2011). The algorithm augments the parameter vector \(\mathbf{q}\) (position) with a momentum vector \(\mathbf{p}\), defining a Hamiltonian system.

The Hamiltonian:

The total energy of the system combines potential energy (negative log posterior) and kinetic energy:

$$H(\mathbf{q}, \mathbf{p}) = U(\mathbf{q}) + K(\mathbf{p})$$

where:

  • \(U(\mathbf{q}) = -\log \pi(\mathbf{q})\) is the potential energy (negative log target density)
  • \(K(\mathbf{p}) = \frac{1}{2}\mathbf{p}^T M^{-1} \mathbf{p}\) is the kinetic energy (mass matrix \(M\))

In this demo, we use \(M = I\) (identity matrix), so \(K(\mathbf{p}) = \frac{1}{2}\sum_i p_i^2\).
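To make the energy bookkeeping concrete, here is a minimal NumPy sketch of \(U\), \(K\), and \(H\) with \(M = I\) (not the demo's actual code; the function names and the standard-Gaussian example target are purely illustrative):

```python
import numpy as np

def potential_energy(q, log_target):
    """U(q) = -log pi(q): the negative log target density."""
    return -log_target(q)

def kinetic_energy(p):
    """K(p) = 0.5 * p^T p: kinetic energy with identity mass matrix M = I."""
    return 0.5 * np.dot(p, p)

# Illustrative target: standard 2D Gaussian, log pi(q) = -0.5 * ||q||^2 (up to a constant)
log_target = lambda q: -0.5 * np.dot(q, q)
q = np.array([1.0, -0.5])                          # position (the parameters)
p = np.random.default_rng(0).normal(size=2)        # momentum, p ~ N(0, M) with M = I
H = potential_energy(q, log_target) + kinetic_energy(p)
print(f"H(q, p) = {H:.3f}")
```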

HMC Algorithm Steps:
  1. Sample momentum: Draw \(\mathbf{p} \sim \mathcal{N}(0, M)\)
  2. Simulate Hamiltonian dynamics: Starting from \((\mathbf{q}, \mathbf{p})\), integrate Hamilton's equations for \(L\) steps of size \(\epsilon\): $$\begin{aligned} \frac{d\mathbf{q}}{dt} &= \frac{\partial H}{\partial \mathbf{p}} = M^{-1}\mathbf{p}\\ \frac{d\mathbf{p}}{dt} &= -\frac{\partial H}{\partial \mathbf{q}} = -\nabla U(\mathbf{q}) \end{aligned}$$ We use the leapfrog integrator (symplectic, reversible):
    \(\mathbf{p} \leftarrow \mathbf{p} - \frac{\epsilon}{2}\nabla U(\mathbf{q})\)
    for \(i = 1\) to \(L\):
      \(\mathbf{q} \leftarrow \mathbf{q} + \epsilon M^{-1}\mathbf{p}\)
      \(\mathbf{p} \leftarrow \mathbf{p} - \epsilon \nabla U(\mathbf{q})\)  (except last step: \(\epsilon/2\))
  3. Metropolis acceptance: Propose \((\mathbf{q}^*, \mathbf{p}^*)\) from dynamics. Accept with probability: $$\alpha = \min\left(1, \exp(H(\mathbf{q}, \mathbf{p}) - H(\mathbf{q}^*, \mathbf{p}^*))\right)$$ Due to numerical errors, \(H\) is not perfectly conserved, but acceptance corrects for this.
  4. Return position: If accepted, set \(\mathbf{q}_{t+1} = \mathbf{q}^*\); else \(\mathbf{q}_{t+1} = \mathbf{q}_t\)
Why it works: The leapfrog integrator approximately conserves the Hamiltonian \(H\). Regions with low \(U(\mathbf{q})\) (high probability) correspond to low potential energy, where the particle naturally spends more time—just like a ball rolling in a valley. The momentum carries the particle across low-probability regions, enabling efficient exploration without getting trapped.
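The sketch below puts steps 1-4 together for a single HMC transition, assuming \(M = I\). It is a minimal illustration under those assumptions, not the demo's implementation; the standard-Gaussian target and all names are placeholders:

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, L):
    """Leapfrog integration of Hamilton's equations (identity mass matrix)."""
    q, p = q.copy(), p.copy()
    p -= 0.5 * eps * grad_U(q)              # initial half step for momentum
    for i in range(L):
        q += eps * p                        # full step for position (M = I)
        if i < L - 1:
            p -= eps * grad_U(q)            # full step for momentum
    p -= 0.5 * eps * grad_U(q)              # final half step for momentum
    return q, p

def hmc_step(q, U, grad_U, eps, L, rng):
    """One HMC transition: sample momentum, integrate, accept/reject."""
    p = rng.normal(size=q.shape)                        # step 1: p ~ N(0, I)
    q_new, p_new = leapfrog(q, p, grad_U, eps, L)       # step 2: simulate dynamics
    # (Negating p_new would make the proposal exactly reversible, but since
    #  K(p) = K(-p) it does not change the acceptance probability.)
    H_old = U(q) + 0.5 * p @ p
    H_new = U(q_new) + 0.5 * p_new @ p_new
    if rng.random() < np.exp(H_old - H_new):            # step 3: alpha = min(1, exp(dH))
        return q_new, True                              # step 4: accept
    return q, False                                     # step 4: reject (stay put)

# Illustrative run on a standard 2D Gaussian: U(q) = 0.5 ||q||^2, grad U(q) = q
U = lambda q: 0.5 * q @ q
grad_U = lambda q: q
rng = np.random.default_rng(42)
q, accepts, samples = np.zeros(2), 0, []
for _ in range(1000):
    q, accepted = hmc_step(q, U, grad_U, eps=0.3, L=10, rng=rng)
    samples.append(q)
    accepts += accepted
print(f"acceptance rate: {accepts / 1000:.2f}")
```

On this easy target, \(\epsilon = 0.3\) and \(L = 10\) should give an acceptance rate close to 1; the interesting tuning questions arise on the harder targets below.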

Target Distributions

This demonstration provides three target distributions ranging from simple to challenging, illustrating different aspects of MCMC performance:

Bivariate Gaussian (Default)

A standard correlated 2D Gaussian with correlation coefficient \(\rho = 0.8\):

$$\pi(x_1, x_2) \propto \exp\left(-\frac{1}{2(1-\rho^2)}(x_1^2 - 2\rho x_1 x_2 + x_2^2)\right)$$

This distribution has elliptical contours aligned with the correlation structure. With strong correlation (\(\rho = 0.8\)), random-walk proposals struggle: isotropic steps do not align with the narrow ridge of the density, so the chain needs many small moves to travel along the correlation direction.
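Since HMC needs \(\nabla U\), here is the potential energy and its analytic gradient for this target as a NumPy sketch (constants dropped; names are illustrative):

```python
import numpy as np

RHO = 0.8  # correlation used in the demo

def U_gauss(x):
    """U(x) = -log pi(x) for the correlated Gaussian, up to an additive constant."""
    x1, x2 = x
    return (x1**2 - 2 * RHO * x1 * x2 + x2**2) / (2 * (1 - RHO**2))

def grad_U_gauss(x):
    """Gradient of U: equal to Sigma^{-1} x with Sigma = [[1, rho], [rho, 1]]."""
    x1, x2 = x
    return np.array([x1 - RHO * x2, x2 - RHO * x1]) / (1 - RHO**2)

# Sanity check against the covariance matrix directly
Sigma_inv = np.linalg.inv(np.array([[1.0, RHO], [RHO, 1.0]]))
x = np.array([0.5, -1.0])
assert np.allclose(grad_U_gauss(x), Sigma_inv @ x)
```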

Rosenbrock's Banana Distribution

A transformed Gaussian that creates a curved, banana-shaped density (Haario et al., 1999):

$$\pi(x_1, x_2) \propto \exp\left(-\frac{1}{200}(x_1^2 + 100(x_2 - x_1^2)^2)\right)$$

This distribution is strongly correlated along a curved manifold, making it difficult for random-walk samplers to explore efficiently. The "banana" shape arises from the nonlinear transformation \(x_2 - x_1^2\), creating regions where standard proposals are inefficient.
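A corresponding sketch of \(U = -\log \pi\) and its gradient for the banana density (illustrative names; the finite-difference comparison is only a sanity check):

```python
import numpy as np

def U_banana(x):
    """U(x) = (x1^2 + 100 * (x2 - x1^2)^2) / 200, up to an additive constant."""
    x1, x2 = x
    return (x1**2 + 100.0 * (x2 - x1**2)**2) / 200.0

def grad_U_banana(x):
    """Analytic gradient of U_banana."""
    x1, x2 = x
    dU_dx1 = x1 / 100.0 - 2.0 * x1 * (x2 - x1**2)
    dU_dx2 = x2 - x1**2
    return np.array([dU_dx1, dU_dx2])

# Sanity check against central finite differences
x, h = np.array([1.0, 2.0]), 1e-6
fd = np.array([(U_banana(x + h * e) - U_banana(x - h * e)) / (2 * h) for e in np.eye(2)])
assert np.allclose(grad_U_banana(x), fd, atol=1e-5)
```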

Neal's Funnel Distribution

A hierarchical model that exhibits strong scale variations (Neal, 2003):

$$\begin{aligned} x_1 &\sim \mathcal{N}(0, 3^2) \\ x_2 \mid x_1 &\sim \mathcal{N}(0, \exp(x_1)^2) \end{aligned}$$

The joint density is \(\pi(x_1, x_2) \propto \exp\left(-x_1^2/18 - x_1 - x_2^2/(2e^{2x_1})\right)\); the extra \(-x_1\) term comes from the \(x_1\)-dependent normalization of the conditional. This creates a "funnel" where the scale of \(x_2\) depends exponentially on \(x_1\), which is challenging for fixed-width proposals. This structure is common in hierarchical Bayesian models with variance parameters.
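And the same for the funnel, using the joint density above including the \(-x_1\) normalization term (again an illustrative sketch with a finite-difference sanity check):

```python
import numpy as np

def U_funnel(x):
    """U(x) = -log pi(x) for the funnel: x1 ~ N(0, 3^2), x2 | x1 ~ N(0, exp(x1)^2).
    The linear x1 term comes from the x1-dependent normalization of the conditional."""
    x1, x2 = x
    return x1**2 / 18.0 + x1 + 0.5 * x2**2 * np.exp(-2.0 * x1)

def grad_U_funnel(x):
    """Analytic gradient of U_funnel."""
    x1, x2 = x
    dU_dx1 = x1 / 9.0 + 1.0 - x2**2 * np.exp(-2.0 * x1)
    dU_dx2 = x2 * np.exp(-2.0 * x1)
    return np.array([dU_dx1, dU_dx2])

# Sanity check against central finite differences
x, h = np.array([0.5, 1.5]), 1e-6
fd = np.array([(U_funnel(x + h * e) - U_funnel(x - h * e)) / (2 * h) for e in np.eye(2)])
assert np.allclose(grad_U_funnel(x), fd, atol=1e-5)
```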

Why these distributions? The Gaussian provides a baseline to understand basic behavior. Rosenbrock's banana and Neal's funnel are standard benchmarks in the MCMC literature—the banana tests handling of strong nonlinear correlations, while the funnel tests adaptation to varying scales. These challenges motivate advanced methods like HMC and adaptive MCMC.

Simulation Controls

HMC has two key tuning parameters: the step size \(\epsilon\) (how large each leapfrog step is) and the number of steps \(L\) (how long the trajectory is). The product \(L \times \epsilon\) determines the trajectory length.

  • Step size (\(\epsilon\)): Size of each leapfrog integration step. Smaller → more accurate, but slower.
  • Number of steps (\(L\)): Length of trajectory. More steps → longer proposals, but can overshoot.
  • Animation speed: Delay between iterations. Slower speeds help visualize trajectories.
  • Target distribution: Same distributions as Metropolis–Hastings for direct comparison.
Tuning advice: Optimal step size balances integration accuracy vs. efficiency. Too large → high rejection due to numerical error. Too small → inefficient short proposals. Aim for acceptance rates around 65-90% (much higher than MH's 23-40%!). Modern implementations like NUTS (No-U-Turn Sampler) automatically tune these parameters.
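The self-contained sketch below illustrates the step-size trade-off on the correlated Gaussian target, using the same leapfrog logic as above in compact form; the specific \(\epsilon\) values, \(L = 10\), and all names are arbitrary choices for illustration:

```python
import numpy as np

def hmc_acceptance_rate(U, grad_U, eps, L, n_iter=2000, seed=0):
    """Run a short HMC chain (identity mass matrix) and return its acceptance rate."""
    rng = np.random.default_rng(seed)
    q, accepts = np.zeros(2), 0
    for _ in range(n_iter):
        p = rng.normal(size=2)
        q_new, p_new = q.copy(), p.copy()
        p_new -= 0.5 * eps * grad_U(q_new)                  # leapfrog: initial half step
        for i in range(L):
            q_new += eps * p_new
            p_new -= (0.5 if i == L - 1 else 1.0) * eps * grad_U(q_new)
        dH = (U(q) + 0.5 * p @ p) - (U(q_new) + 0.5 * p_new @ p_new)
        if np.log(rng.random()) < dH:                       # accept with prob min(1, exp(dH))
            q, accepts = q_new, accepts + 1
    return accepts / n_iter

# Correlated Gaussian target (rho = 0.8), as in the demo
RHO = 0.8
U = lambda x: (x[0]**2 - 2 * RHO * x[0] * x[1] + x[1]**2) / (2 * (1 - RHO**2))
grad_U = lambda x: np.array([x[0] - RHO * x[1], x[1] - RHO * x[0]]) / (1 - RHO**2)

for eps in (0.05, 0.2, 0.8, 1.6):
    print(f"eps = {eps:4}  L = 10  acceptance rate = {hmc_acceptance_rate(U, grad_U, eps, 10):.2f}")
```

Very small steps give near-perfect acceptance but waste gradient evaluations on short trajectories, while overly large steps destabilize the integrator and drive the acceptance rate toward zero.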

HMC Trajectory Visualization

The main plot shows the joint posterior \(\pi(x_1, x_2)\) with darker regions indicating higher probability. Unlike random-walk methods, HMC trajectories follow curved paths determined by the gradient of the log posterior. Green points indicate accepted proposals and orange points indicate rejected proposals. The plot axes automatically adjust to each distribution's natural range.

Joint Posterior Distribution π(x₁, x₂)

The chain should explore the high-density regions (darker areas) while maintaining enough randomness to avoid getting trapped in any single location.

Marginal Distribution of x₁

Histogram of \(x_1\) samples. Should converge to the true marginal \(\pi(x_1) = \int \pi(x_1, x_2) dx_2\).

Marginal Distribution of x₂

Histogram of \(x_2\) samples. With enough samples, this approximates the true marginal distribution.

Current Iteration Details
Click "Start Sampling" or "Single Step" to begin the MCMC algorithm.

Chain Trace Plots

Trace plots show the evolution of each parameter value over iterations. They are essential for diagnosing convergence and mixing of the MCMC chain.

What to look for in trace plots:
  • Good mixing: The trace should look like a "fuzzy caterpillar": random fluctuations around a stable mean with no obvious patterns or trends
  • Stationarity: The mean and variance should remain constant over time (after burn-in)
  • No trends: The trace shouldn't show long-term upward or downward trends
  • No getting stuck: The chain shouldn't remain at the same value for extended periods
Trace Plot: x₁ Evolution

Time series of \(x_1\) values. Rapid fluctuations indicate good exploration; long periods at similar values suggest poor mixing.

Trace Plot: x₂ Evolution

Time series of \(x_2\) values. Compare with \(x_1\) trace to check if both parameters mix at similar rates.
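For readers who want to reproduce these diagnostics outside the demo, here is a small self-contained matplotlib sketch contrasting a well-mixing trace with a slowly mixing one; the two synthetic "chains" (white noise and an AR(1) series) are stand-ins, not HMC output:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic "chains": white noise mixes perfectly; an AR(1) series mixes slowly
rng = np.random.default_rng(3)
good = rng.normal(size=2000)
slow = np.zeros(2000)
for t in range(1, 2000):
    slow[t] = 0.98 * slow[t - 1] + rng.normal() * np.sqrt(1 - 0.98**2)

fig, axes = plt.subplots(2, 1, figsize=(8, 4), sharex=True)
axes[0].plot(good, lw=0.5)
axes[0].set_title("Good mixing: a 'fuzzy caterpillar'")
axes[1].plot(slow, lw=0.5, color="tab:red")
axes[1].set_title("Poor mixing: long excursions, slow return to the mean")
axes[1].set_xlabel("iteration")
plt.tight_layout()
plt.show()
```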

Chain Diagnostics

Autocorrelation Function (ACF)

The autocorrelation function measures the correlation between samples separated by \(\tau\) iterations (the lag). For a well-mixing chain, the ACF should decay rapidly to zero.

Mathematical Definition:

For a chain \(\{x_t\}\) with sample mean \(\bar{x}\) and sample variance \(s^2\), the sample ACF at lag \(\tau\) is:

$$\hat{\rho}(\tau) = \frac{\sum_{t=1}^{n-\tau} (x_t - \bar{x})(x_{t+\tau} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2}$$

Interpretation:

  • \(\hat{\rho}(0) = 1\): Perfect correlation with itself
  • \(\hat{\rho}(\tau) \to 0\) as \(\tau \to \infty\): Samples become independent (for ergodic chains)
  • Slow decay: High autocorrelation indicates the chain mixes slowly
  • Fast decay: Low autocorrelation indicates efficient exploration
Autocorrelation Function (ACF)

Blue line: ACF for \(x_1\) parameter. Red line: ACF for \(x_2\) parameter. The lag range adapts automatically—showing where the ACF crosses zero plus 33% additional lags to verify convergence to independence.
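The ACF shown here can be computed directly from the formula above. The sketch below is a plain NumPy implementation; the AR(1) example series is synthetic and only demonstrates the expected geometric decay:

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation rho_hat(tau) for tau = 0..max_lag, normalized by the
    lag-0 sum of squares, exactly as in the formula above."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.sum(xc**2)
    return np.array([np.sum(xc[: len(x) - tau] * xc[tau:]) / denom
                     for tau in range(max_lag + 1)])

# Synthetic example: an AR(1) series with phi = 0.9 has ACF roughly phi**tau
rng = np.random.default_rng(1)
phi, n = 0.9, 5000
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()
print(np.round(sample_acf(x, max_lag=5), 3))   # roughly [1.0, 0.9, 0.81, 0.73, 0.66, 0.59]
```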

Performance Metrics

The dashboard reports the total number of iterations, the acceptance rate, the number of accepted proposals, and an estimated effective sample size (ESS).
Effective Sample Size (ESS)

Due to autocorrelation, MCMC samples are not independent. The ESS estimates how many independent samples our correlated chain is equivalent to:

$$\text{ESS} \approx \frac{n}{1 + 2\sum_{\tau=1}^{\infty} \rho(\tau)}$$

where \(n\) is the total number of samples and \(\rho(\tau)\) is the ACF. Higher ESS means more efficient sampling. The ratio ESS/\(n\) is called the sampling efficiency.
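One simple way to turn this formula into a number is to truncate the sum at the first lag where the ACF drops below zero (one common heuristic among several). The sketch below does that and checks it against an AR(1) series whose theoretical ESS is known; all names are illustrative:

```python
import numpy as np

def effective_sample_size(x, max_lag=None):
    """ESS = n / (1 + 2 * sum_tau rho(tau)), truncating the sum at the first
    negative ACF value (a simple heuristic, not the only estimator in use)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    max_lag = n // 3 if max_lag is None else max_lag
    xc = x - x.mean()
    denom = np.sum(xc**2)
    tau_sum = 0.0
    for tau in range(1, max_lag + 1):
        rho = np.sum(xc[: n - tau] * xc[tau:]) / denom
        if rho < 0:
            break
        tau_sum += rho
    return n / (1.0 + 2.0 * tau_sum)

# Check: an AR(1) chain with phi = 0.9 has theoretical ESS/n = (1 - phi)/(1 + phi)
rng = np.random.default_rng(2)
phi, n = 0.9, 20000
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()
print(f"ESS estimate: {effective_sample_size(x):.0f} of n = {n} "
      f"(theory: about {n * (1 - phi) / (1 + phi):.0f})")
```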

Tuning the Proposal Width (random-walk Metropolis, for comparison): In random-walk Metropolis, the proposal standard deviation \(\sigma\) directly affects both the acceptance rate and the autocorrelation. In high dimensions the asymptotically optimal acceptance rate is approximately 0.234 (Roberts et al., 1997; Roberts & Rosenthal, 2001); for low-dimensional problems (such as this 2D demo), acceptance rates between 40-60% are typically more efficient, though the optimal rate depends on the target distribution's geometry. HMC has no proposal width; its analogous tuning parameters are the step size \(\epsilon\) and trajectory length \(L\) discussed above.

Strengths & Limitations of HMC

✓ Strengths
  • Efficient exploration: Makes large, coherent moves instead of random walks
  • Reduced autocorrelation: Much lower correlation between samples than MH
  • Handles correlations naturally: Follows posterior geometry via gradients
  • High acceptance rates: Typically 65-90% (vs. 23-40% for MH)
  • Scales to high dimensions: Efficiency degrades much more slowly than random-walk methods as \(d\) increases
✗ Limitations
  • Requires gradients: Need \(\nabla \log \pi(\mathbf{q})\), which isn't always available
  • Tuning parameters: Step size \(\epsilon\) and trajectory length \(L\) need tuning
  • Computational cost: Each iteration requires \(L\) gradient evaluations
  • Numerical errors: Discretization of continuous dynamics introduces error
  • Still struggles with multimodality: Difficult to jump between well-separated modes
When to use HMC:
  • Smooth posteriors: HMC excels when \(\pi(\mathbf{q})\) is differentiable and relatively smooth
  • High dimensions: Particularly valuable for \(d > 10\), where random-walk methods struggle
  • Strong correlations: Automatically adapts to correlation structure via gradient information
  • Modern variants: NUTS (No-U-Turn Sampler) in Stan/PyMC automatically tunes \(\epsilon\) and \(L\), making HMC much more practical

Comparison: HMC vs. Metropolis–Hastings

Direct comparison on the same target distributions reveals HMC's advantages:

Key Differences:
  • Proposal mechanism: Metropolis–Hastings uses a random walk; HMC uses simulated Hamiltonian dynamics
  • Gradients needed: MH, no; HMC, yes
  • Typical acceptance rate: MH, 23-40% (high-dimensional); HMC, 65-90%
  • Autocorrelation: MH, high; HMC, low
  • Effective sample size: MH, much lower than \(n\); HMC, closer to \(n\)
  • Best for: MH, low-dimensional problems or when gradients are unavailable; HMC, high-dimensional, smooth posteriors
Key advantage: HMC explores parameter space coherently rather than randomly, dramatically reducing autocorrelation.

Nested Sampling

A fundamentally different approach that simultaneously computes samples from the posterior and the model evidence (marginal likelihood). Instead of evolving a Markov chain, nested sampling progressively samples from constrained priors with increasing likelihood thresholds.

References & Further Reading

Key Papers:
  • Duane, S., Kennedy, A. D., Pendleton, B. J., & Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2), 216–222.
  • Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In Brooks, Gelman, Jones, & Meng (Eds.), Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.
  • Neal, R. M. (2003). Slice sampling. Annals of Statistics, 31(3), 705–767.
  • Haario, H., Saksman, E., & Tamminen, J. (1999). Adaptive proposal distribution for random walk Metropolis algorithm. Computational Statistics, 14, 375–395.
  • Roberts, G. O., Gelman, A., & Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. Annals of Applied Probability, 7(1), 110–120.
  • Roberts, G. O., & Rosenthal, J. S. (2001). Optimal scaling for various Metropolis–Hastings algorithms. Statistical Science, 16(4), 351–367.
  • Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 1593–1623.

Books:
  • Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer.
  • Brooks, S., Gelman, A., Jones, G. L., & Meng, X.-L. (Eds.) (2011). Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.

Software Implementations:
  • Stan, PyMC, and NumPyro all use HMC/NUTS as their default sampler for continuous parameters.