Hamiltonian Monte Carlo (HMC)

Interactive visualization of gradient-based MCMC sampling from 2D posterior distributions
Part 2 of the MCMC Samplers Series | See also: Metropolis–Hastings

Introduction to Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) is an MCMC algorithm that uses gradient information to generate efficient proposals via Hamiltonian dynamics. Unlike random-walk methods (like Metropolis–Hastings), HMC makes distant proposals while maintaining high acceptance rates by exploiting the geometry of the target distribution.

HMC addresses key limitations of random-walk MCMC: the diffusive nature of exploration and poor performance with correlated parameters. By simulating Hamiltonian dynamics, HMC generates coherent trajectories through parameter space, dramatically improving efficiency, especially in high dimensions.

Why HMC?
  • Efficient exploration: Generates distant proposals with high acceptance probability using gradient information
  • Reduced autocorrelation: Produces nearly independent samples by making coherent moves through parameter space
  • Handles correlations: Exploits geometric information from gradients, efficient even with strongly correlated parameters
  • Scales to high dimensions: Convergence rate depends on condition number rather than dimensionality
  • Industry standard: Powers modern probabilistic programming frameworks like Stan, PyMC, and NumPyro
The Key Insight: HMC augments the target distribution with auxiliary momentum variables and simulates Hamiltonian dynamics using a symplectic integrator. Energy conservation ensures the sampler spends more time in high-probability regions, while the momentum enables efficient exploration by carrying the trajectory across low-probability regions.

The Hamiltonian Monte Carlo Algorithm

HMC was introduced by Duane et al. (1987) for lattice field theory and popularized for statistics by Neal (1996, 2011). The algorithm augments the parameter vector \(\mathbf{q}\) (position) with a momentum vector \(\mathbf{p}\), defining a Hamiltonian system.

The Hamiltonian:

The total energy of the system combines potential energy (negative log posterior) and kinetic energy:

$$H(\mathbf{q}, \mathbf{p}) = U(\mathbf{q}) + K(\mathbf{p})$$

where:

  • \(U(\mathbf{q}) = -\log \pi(\mathbf{q})\) is the potential energy (negative log target density)
  • \(K(\mathbf{p}) = \frac{1}{2}\mathbf{p}^T M^{-1} \mathbf{p}\) is the kinetic energy (mass matrix \(M\))

In this demo, we use \(M = I\) (identity matrix), so \(K(\mathbf{p}) = \frac{1}{2}\sum_i p_i^2\).
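As a minimal Python/NumPy sketch (not the demo's own code), the two energy terms can be written in terms of an assumed `log_target` function returning \(\log \pi(\mathbf{q})\):

```python
import numpy as np

def potential_energy(q, log_target):
    """U(q) = -log pi(q): the negative log target density."""
    return -log_target(q)

def kinetic_energy(p):
    """K(p) = 0.5 * p^T p, the identity-mass-matrix case used in this demo."""
    return 0.5 * np.dot(p, p)
```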

HMC Algorithm Steps:
  1. Sample momentum: Draw \(\mathbf{p} \sim \mathcal{N}(0, M)\)
  2. Simulate Hamiltonian dynamics: Starting from \((\mathbf{q}, \mathbf{p})\), integrate Hamilton's equations for \(L\) steps of size \(\epsilon\): $$\begin{aligned} \frac{d\mathbf{q}}{dt} &= \frac{\partial H}{\partial \mathbf{p}} = M^{-1}\mathbf{p}\\ \frac{d\mathbf{p}}{dt} &= -\frac{\partial H}{\partial \mathbf{q}} = -\nabla U(\mathbf{q}) \end{aligned}$$ We use the leapfrog integrator (symplectic, reversible):
    \(\mathbf{p} \leftarrow \mathbf{p} - \frac{\epsilon}{2}\nabla U(\mathbf{q})\)
    for \(i = 1\) to \(L\):
      \(\mathbf{q} \leftarrow \mathbf{q} + \epsilon M^{-1}\mathbf{p}\)
      \(\mathbf{p} \leftarrow \mathbf{p} - \epsilon \nabla U(\mathbf{q})\)  (except last step: \(\epsilon/2\))
  3. Metropolis acceptance: Propose \((\mathbf{q}^*, \mathbf{p}^*)\) from dynamics. Accept with probability: $$\alpha = \min\left(1, \exp(H(\mathbf{q}, \mathbf{p}) - H(\mathbf{q}^*, \mathbf{p}^*))\right)$$ Due to numerical errors, \(H\) is not perfectly conserved, but acceptance corrects for this.
  4. Return position: If accepted, set \(\mathbf{q}_{t+1} = \mathbf{q}^*\); else \(\mathbf{q}_{t+1} = \mathbf{q}_t\)
Why it works: The leapfrog integrator is symplectic and time-reversible, so it approximately conserves the Hamiltonian \(H\); the small energy error is what keeps acceptance rates high, while volume preservation and reversibility guarantee that the Metropolis correction needs no Jacobian term. Regions of high probability (low \(U(\mathbf{q})\)) correspond to low potential energy where trajectories naturally spend more time, and the momentum carries the trajectory coherently across low-probability barriers.
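The steps above translate almost line for line into code. The following is a minimal sketch (not the demo's implementation), assuming hypothetical `U` and `grad_U` functions that return \(U(\mathbf{q})\) and \(\nabla U(\mathbf{q})\):

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, L):
    """Leapfrog integration of Hamilton's equations with M = I."""
    q, p = q.copy(), p.copy()
    p -= 0.5 * eps * grad_U(q)                     # initial half-step for momentum
    for i in range(L):
        q += eps * p                               # full position step (M^{-1} p = p)
        if i < L - 1:
            p -= eps * grad_U(q)                   # full momentum step
    p -= 0.5 * eps * grad_U(q)                     # final half-step for momentum
    return q, p

def hmc_step(q, U, grad_U, eps, L, rng):
    """One HMC transition: sample momentum, simulate dynamics, accept/reject."""
    p = rng.standard_normal(q.shape)               # 1. p ~ N(0, I)
    q_new, p_new = leapfrog(q, p, grad_U, eps, L)  # 2. Hamiltonian dynamics
    h_old = U(q) + 0.5 * np.dot(p, p)
    h_new = U(q_new) + 0.5 * np.dot(p_new, p_new)
    if rng.random() < np.exp(h_old - h_new):       # 3. Metropolis acceptance
        return q_new, True
    return q, False                                # 4. reject: keep old position
```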

Target Distributions

This demonstration provides three target distributions ranging from simple to challenging, illustrating different aspects of MCMC performance:

Bivariate Gaussian (Default)

A standard correlated 2D Gaussian with correlation coefficient \(\rho = 0.8\):

$$\pi(x_1, x_2) \propto \exp\left(-\frac{1}{2(1-\rho^2)}(x_1^2 - 2\rho x_1 x_2 + x_2^2)\right)$$

This distribution has elliptical contours aligned with the correlation structure. With strong correlation (\(\rho = 0.8\)), random-walk proposals struggle: an isotropic proposal must be small enough to fit the narrow width of the ridge, so the chain diffuses only slowly along the long axis where the two parameters need to move together.
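For example, the unnormalized log density above and its gradient (\(\nabla U\) is the negative of this) could be coded as follows; this is an illustrative sketch, not the demo's source:

```python
import numpy as np

RHO = 0.8  # correlation coefficient used in the demo

def log_gaussian(q):
    """Unnormalized log density of the correlated bivariate Gaussian."""
    x1, x2 = q
    return -(x1**2 - 2.0 * RHO * x1 * x2 + x2**2) / (2.0 * (1.0 - RHO**2))

def grad_log_gaussian(q):
    """Gradient of the log density (grad U is the negative of this)."""
    x1, x2 = q
    c = 1.0 / (1.0 - RHO**2)
    return np.array([-c * (x1 - RHO * x2), -c * (x2 - RHO * x1)])
```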

Rosenbrock's Banana Distribution

A transformed Gaussian that creates a curved, banana-shaped density (Haario et al., 1999):

$$\pi(x_1, x_2) \propto \exp\left(-\frac{1}{200}(x_1^2 + 100(x_2 - x_1^2)^2)\right)$$

This distribution is strongly correlated along a curved manifold, making it difficult for random-walk samplers to explore efficiently. The "banana" shape arises from the nonlinear transformation \(x_2 - x_1^2\), creating regions where standard proposals are inefficient.
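Differentiating the exponent gives the gradient that HMC needs; a sketch (illustrative, not the demo's code):

```python
import numpy as np

def log_banana(q):
    """Unnormalized log density of the banana-shaped target."""
    x1, x2 = q
    return -(x1**2 + 100.0 * (x2 - x1**2)**2) / 200.0

def grad_log_banana(q):
    """Gradient of log_banana, obtained by differentiating the exponent."""
    x1, x2 = q
    d1 = -(2.0 * x1 - 400.0 * x1 * (x2 - x1**2)) / 200.0
    d2 = -(x2 - x1**2)
    return np.array([d1, d2])
```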

Neal's Funnel Distribution

A hierarchical model that exhibits strong scale variations (Neal, 2003):

$$\begin{aligned} x_1 &\sim \mathcal{N}(0, 3^2) \\ x_2 \mid x_1 &\sim \mathcal{N}(0, \exp(x_1)^2) \end{aligned}$$

The joint density is \(\pi(x_1, x_2) \propto \exp\left(-x_1^2/18 - x_1 - x_2^2/(2e^{2x_1})\right)\), where the \(-x_1\) term comes from the \(x_1\)-dependent normalization of the conditional. This creates a "funnel" in which the scale of \(x_2\) depends exponentially on \(x_1\), which is challenging for fixed-width proposals and common in hierarchical Bayesian models with variance parameters.
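A corresponding sketch of the funnel's log density and gradient (again illustrative, and including the \(-x_1\) normalization term):

```python
import numpy as np

def log_funnel(q):
    """Unnormalized log density of Neal's funnel.
    The -x1 term is the x1-dependent normalization of the conditional."""
    x1, x2 = q
    return -x1**2 / 18.0 - x1 - 0.5 * x2**2 * np.exp(-2.0 * x1)

def grad_log_funnel(q):
    """Gradient of log_funnel."""
    x1, x2 = q
    d1 = -x1 / 9.0 - 1.0 + x2**2 * np.exp(-2.0 * x1)
    d2 = -x2 * np.exp(-2.0 * x1)
    return np.array([d1, d2])
```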

Why these distributions? The Gaussian provides a baseline to understand basic behavior. Rosenbrock's banana and Neal's funnel are standard benchmarks in the MCMC literature—the banana tests handling of strong nonlinear correlations, while the funnel tests adaptation to varying scales. These challenges motivate advanced methods like HMC and adaptive MCMC.

Simulation Controls

HMC has two key tuning parameters: the step size \(\epsilon\) (how large each leapfrog step is) and the number of steps \(L\) (how long the trajectory is). The product \(L \times \epsilon\) determines the trajectory length.

  • Step size (\(\epsilon\)): Size of each leapfrog integration step. Smaller → more accurate, but slower.
  • Number of steps (\(L\)): Length of the trajectory. More steps → longer proposals, but the trajectory can overshoot.
  • Animation speed: Delay between iterations. Slower speeds help visualize trajectories.
  • Target distribution: Same distributions as Metropolis–Hastings for direct comparison.
Tuning advice: The step size controls the trade-off between integration accuracy and computational cost. Too large a step produces large energy errors and high rejection rates; too small a step wastes gradient evaluations. Target acceptance rates of 60-90% (compared to ~23% for optimal random-walk MH in high dimensions). The No-U-Turn Sampler (NUTS) automatically tunes both step size and trajectory length.
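To make the tuning concrete, here is a hypothetical driver loop reusing the `hmc_step` and `log_banana` sketches above; it reports the acceptance rate for a given \((\epsilon, L)\):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, L, n_iter = 0.1, 20, 5000            # step size, leapfrog steps, iterations

U = lambda q: -log_banana(q)              # potential energy = -log density
grad_U = lambda q: -grad_log_banana(q)

q = np.zeros(2)
samples, n_accept = [], 0
for _ in range(n_iter):
    q, accepted = hmc_step(q, U, grad_U, eps, L, rng)
    samples.append(q.copy())
    n_accept += accepted

print(f"acceptance rate: {n_accept / n_iter:.2f}")  # aim for roughly 0.6-0.9
```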

HMC Trajectory Visualization

The main plot shows the joint target distribution \(\pi(x_1, x_2)\) with darker regions indicating higher probability. HMC trajectories follow curved paths determined by Hamilton's equations, guided by the gradient of the log density. Green points indicate accepted proposals and orange points indicate rejected proposals. The plot axes automatically adjust to each distribution's natural range.

Joint Posterior Distribution π(x₁, x₂)

The chain should explore the high-density regions (darker areas) while maintaining enough randomness to avoid getting trapped in any single location.

Marginal Distribution of x₁

Histogram of \(x_1\) samples. Should converge to the true marginal \(\pi(x_1) = \int \pi(x_1, x_2) dx_2\).

Marginal Distribution of x₂

Histogram of \(x_2\) samples. With enough samples, this approximates the true marginal distribution.

Current Iteration Details
Click "Start Sampling" or "Single Step" to begin the MCMC algorithm.

Chain Trace Plots

Trace plots show the evolution of each parameter value over iterations. They are essential for diagnosing convergence and mixing of the MCMC chain.

What to look for in trace plots:
  • Good mixing: The trace should look like a "fuzzy caterpillar": random fluctuations around a stable mean with no obvious patterns or trends
  • Stationarity: The mean and variance should remain constant over time (after burn-in)
  • No trends: The trace shouldn't show long-term upward or downward trends
  • No getting stuck: The chain shouldn't remain at the same value for extended periods
Trace Plot: x₁ Evolution

Chain trace for parameter \(x_1\). Rapid fluctuations indicate good mixing; long periods at similar values suggest poor exploration.

Trace Plot: x₂ Evolution

Chain trace for parameter \(x_2\). Compare with \(x_1\) trace to assess if both parameters mix at similar rates.

Chain Diagnostics

Autocorrelation Function (ACF)

The autocorrelation function measures the correlation between samples separated by \(\tau\) iterations (the lag). For a well-mixing chain, the ACF should decay rapidly to zero.

Mathematical Definition:

For a chain \(\{x_t\}\) with sample mean \(\bar{x}\) and sample variance \(s^2\), the sample ACF at lag \(\tau\) is:

$$\hat{\rho}(\tau) = \frac{\sum_{t=1}^{n-\tau} (x_t - \bar{x})(x_{t+\tau} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2}$$

Interpretation:

  • \(\hat{\rho}(0) = 1\): Perfect correlation with itself
  • \(\hat{\rho}(\tau) \to 0\) as \(\tau \to \infty\): Samples become independent (for ergodic chains)
  • Slow decay: High autocorrelation indicates the chain mixes slowly
  • Fast decay: Low autocorrelation indicates efficient exploration
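The estimator above is straightforward to implement; a minimal NumPy sketch (a hypothetical helper, not the demo's code):

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of a 1-D chain for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.dot(xc, xc)                         # sum of squared deviations
    return np.array([np.dot(xc[:len(x) - tau], xc[tau:]) / denom
                     for tau in range(max_lag + 1)])
```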
Autocorrelation Function (ACF)

Blue line: ACF for \(x_1\) parameter. Red line: ACF for \(x_2\) parameter. The lag range adapts automatically—showing where the ACF crosses zero plus 33% additional lags to verify convergence to independence.

Performance Metrics

Live metrics displayed during sampling: total iterations, acceptance rate, number of accepted proposals, and an estimated effective sample size.
Effective Sample Size (ESS)

Due to autocorrelation, MCMC samples are not independent. The ESS estimates how many independent samples our correlated chain is equivalent to:

$$\text{ESS} \approx \frac{n}{1 + 2\sum_{\tau=1}^{\infty} \rho(\tau)}$$

where \(n\) is the total number of samples and \(\rho(\tau)\) is the ACF. Higher ESS means more efficient sampling. The ratio ESS/\(n\) is called the sampling efficiency.
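One simple estimator truncates the sum at the first non-positive autocorrelation; a sketch reusing the hypothetical `sample_acf` helper above (libraries such as ArviZ use more careful truncation rules):

```python
import numpy as np

def effective_sample_size(x, max_lag=200):
    """Crude ESS estimate: truncate the ACF sum at the first non-positive lag."""
    rho = sample_acf(x, max_lag)
    tail = rho[1:]                                  # lags 1..max_lag
    cut = int(np.argmax(tail <= 0)) if np.any(tail <= 0) else len(tail)
    return len(x) / (1.0 + 2.0 * tail[:cut].sum())
```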

Note on acceptance rates: Unlike random-walk Metropolis (optimal acceptance ~23% in high dimensions), HMC maintains much higher acceptance rates (60-90%) because the symplectic leapfrog integrator approximately conserves the Hamiltonian: proposals stay close to a level set of \(H\), so the energy error entering the acceptance ratio remains small even for distant proposals.

Strengths & Limitations of HMC

✓ Strengths
  • Efficient exploration: Generates coherent trajectories rather than diffusive random walks
  • Low autocorrelation: Produces nearly independent samples, high effective sample size
  • Exploits geometry: Uses gradient information to navigate complex correlation structures
  • High acceptance probability: Typically 60-90% due to symplectic integration
  • Dimension-robust: Scales with condition number rather than dimensionality
✗ Limitations
  • Requires differentiability: Needs \(\nabla \log \pi(\mathbf{q})\) to compute Hamiltonian dynamics
  • Tuning required: Step size \(\epsilon\) and trajectory length \(L\) affect performance
  • Computational cost: Each trajectory requires \(L\) gradient evaluations
  • Discretization error: Finite step size introduces numerical integration error
  • Mode transitions: Difficult to cross low-probability regions between well-separated modes
When to use HMC:
  • Smooth target distributions: Requires differentiable log-density \(\log \pi(\mathbf{q})\)
  • Moderate to high dimensions: Particularly effective for \(d > 10\) where random-walk methods fail
  • Correlated parameters: Gradient information naturally handles complex posterior geometry
  • Practical implementations: NUTS (No-U-Turn Sampler) in Stan/PyMC automatically adapts step size and trajectory length, making HMC accessible without manual tuning

Comparison: HMC vs. Metropolis–Hastings

Direct comparison on the same target distributions reveals HMC's advantages:

Key Differences:
Property                  | Metropolis–Hastings           | HMC
--------------------------|-------------------------------|---------------------------
Proposal mechanism        | Random walk                   | Hamiltonian dynamics
Gradients needed?         | No                            | Yes
Typical acceptance rate   | 23-40% (high-d)               | 65-90%
Autocorrelation           | High                          | Low
Effective sample size     | Much lower than \(n\)         | Closer to \(n\)
Best for                  | Low-d, gradients unavailable  | High-d, smooth posteriors
Key advantage: HMC explores parameter space coherently rather than diffusively, dramatically reducing autocorrelation.

Nested Sampling

A fundamentally different approach that simultaneously computes samples from the posterior and the model evidence (marginal likelihood). Instead of evolving a Markov chain, nested sampling progressively samples from constrained priors with increasing likelihood thresholds.

References & Further Reading

Key Papers:
  • Duane, S., Kennedy, A. D., Pendleton, B. J., & Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2), 216–222.
  • Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In Brooks, S., Gelman, A., Jones, G. L., & Meng, X.-L. (Eds.), Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.
  • Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 31(3), 705–767.
  • Haario, H., Saksman, E., & Tamminen, J. (1999). Adaptive proposal distribution for random walk Metropolis algorithm. Computational Statistics, 14, 375–395.
  • Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 1593–1623.

Books:
  • Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics 118. Springer.
  • Brooks, S., Gelman, A., Jones, G. L., & Meng, X.-L. (Eds.) (2011). Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.

Software Implementations:
  • Stan, PyMC, and NumPyro all provide adaptive HMC (NUTS) as their default sampler for continuous parameters.