Hamiltonian Monte Carlo (HMC)

Interactive visualization of gradient-based MCMC sampling from 2D posterior distributions
Part 2 of the MCMC Samplers Series | See also: Metropolis-Hastings

What is Hamiltonian Monte Carlo?

Hamiltonian Monte Carlo (HMC) is an MCMC algorithm that uses gradient information to propose new states. Rather than taking random steps, it simulates the physics of a particle moving through parameter space along trajectories shaped by the geometry of the log-density. The result is large, directed moves that are accepted at high rates.

The core idea is to augment the parameter vector \(\mathbf{q}\) with auxiliary momentum variables \(\mathbf{p}\), then simulate Hamiltonian dynamics using a symplectic integrator. Energy conservation keeps the sampler on surfaces of roughly constant probability, while the momentum carries it across low-density regions that would trap a random-walk sampler.

HMC was introduced by Duane et al. (1987) for lattice field theory and brought into statistics by Neal (1996, 2011). Today it underlies most modern probabilistic programming systems, usually in the form of the No-U-Turn Sampler (NUTS), which removes the need to hand-tune trajectory length. Compared to random-walk Metropolis-Hastings, HMC produces far lower autocorrelation and scales much better with the number of parameters.

The Algorithm

HMC defines a Hamiltonian over position \(\mathbf{q}\) (the parameters of interest) and momentum \(\mathbf{p}\):

$$H(\mathbf{q}, \mathbf{p}) = U(\mathbf{q}) + K(\mathbf{p})$$

The potential energy \(U(\mathbf{q}) = -\log \pi(\mathbf{q})\) is the negative log target density. The kinetic energy \(K(\mathbf{p}) = \frac{1}{2}\mathbf{p}^T M^{-1} \mathbf{p}\) comes from an auxiliary Gaussian over \(\mathbf{p}\), with mass matrix \(M\). This demo uses \(M = I\), so \(K(\mathbf{p}) = \frac{1}{2}\sum_i p_i^2\).
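In code, the identity-mass Hamiltonian is just these two terms (a minimal sketch; `log_prob` is assumed to be any function returning \(\log \pi(\mathbf{q})\) up to an additive constant):

```python
import numpy as np

def hamiltonian(q, p, log_prob):
    """Total energy H = U + K with identity mass matrix M = I."""
    U = -log_prob(q)          # potential energy: negative log target density
    K = 0.5 * np.sum(p**2)    # kinetic energy: (1/2) * sum_i p_i^2
    return U + K
```

Constants dropped from `log_prob` shift \(H\) uniformly and cancel in the acceptance ratio, so only relative energies matter.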

Each HMC iteration:
  1. Sample fresh momentum: \(\mathbf{p} \sim \mathcal{N}(0, M)\).
  2. Integrate Hamilton's equations for \(L\) steps of size \(\epsilon\) using the leapfrog integrator (symplectic and time-reversible): $$\begin{aligned} \frac{d\mathbf{q}}{dt} &= \frac{\partial H}{\partial \mathbf{p}} = M^{-1}\mathbf{p}\\ \frac{d\mathbf{p}}{dt} &= -\frac{\partial H}{\partial \mathbf{q}} = -\nabla U(\mathbf{q}) \end{aligned}$$
    \(\mathbf{p} \leftarrow \mathbf{p} - \frac{\epsilon}{2}\nabla U(\mathbf{q})\)
    for \(i = 1\) to \(L\):
      \(\mathbf{q} \leftarrow \mathbf{q} + \epsilon M^{-1}\mathbf{p}\)
      \(\mathbf{p} \leftarrow \mathbf{p} - \epsilon \nabla U(\mathbf{q})\)  (except last step: \(\epsilon/2\))
  3. Accept the proposed \((\mathbf{q}^*, \mathbf{p}^*)\) with probability: $$\alpha = \min\left(1, \exp(H(\mathbf{q}, \mathbf{p}) - H(\mathbf{q}^*, \mathbf{p}^*))\right)$$ The Metropolis correction handles the small numerical errors introduced by finite step sizes.
  4. Set \(\mathbf{q}_{t+1} = \mathbf{q}^*\) if accepted, otherwise keep \(\mathbf{q}_{t+1} = \mathbf{q}_t\).
The leapfrog integrator is volume-preserving (symplectic) and time-reversible. Both properties are needed for the Metropolis correction to be valid. Because the integrator approximately conserves \(H\), proposals land near the same energy level as the current state, unlike a random walk that diffuses freely into low-density regions.
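The four steps above can be sketched in a few lines of NumPy (a minimal sketch with identity mass matrix, as in this demo; `log_prob` and `grad_log_prob` are assumed to return \(\log\pi\) and \(\nabla\log\pi = -\nabla U\)):

```python
import numpy as np

def hmc_step(q, log_prob, grad_log_prob, eps=0.1, L=20, rng=np.random):
    """One HMC iteration: momentum refresh, leapfrog, Metropolis correction."""
    p = rng.standard_normal(q.shape)            # 1. fresh momentum ~ N(0, I)
    q_new, p_new = q.copy(), p.copy()

    # 2. leapfrog: half step on p, then alternate full steps, closing with half
    p_new += 0.5 * eps * grad_log_prob(q_new)   # grad log pi = -grad U
    for i in range(L):
        q_new += eps * p_new
        if i < L - 1:
            p_new += eps * grad_log_prob(q_new)
    p_new += 0.5 * eps * grad_log_prob(q_new)

    # 3. accept with probability min(1, exp(H_old - H_new))
    H_old = -log_prob(q) + 0.5 * p @ p
    H_new = -log_prob(q_new) + 0.5 * p_new @ p_new
    if np.log(rng.uniform()) < H_old - H_new:
        return q_new, True                      # 4. accept the proposal
    return q, False                             # reject: keep the current state
```

Because the leapfrog error is \(O(\epsilon^2)\), \(H_\text{old} - H_\text{new}\) stays small and most proposals are accepted even after a long trajectory.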

Target Distributions

The demo includes three distributions that illustrate different challenges for MCMC samplers, from well-behaved to genuinely difficult.

Bivariate Gaussian (default)
$$\pi(x_1, x_2) \propto \exp\left(-\frac{1}{2(1-\rho^2)}(x_1^2 - 2\rho x_1 x_2 + x_2^2)\right)$$

Correlated (\(\rho = 0.8\)): HMC follows the tilted ellipse naturally, where random walks struggle.

Rosenbrock's Banana
$$\pi(x_1, x_2) \propto \exp\left(-\frac{1}{200}(x_1^2 + 100(x_2 - x_1^2)^2)\right)$$

Curved manifold from the nonlinear transformation \(x_2 - x_1^2\): fixed-width proposals are too wide in one direction and too narrow in the other.

Neal's Funnel
$$\begin{aligned} x_1 &\sim \mathcal{N}(0, 3^2) \\ x_2 \mid x_1 &\sim \mathcal{N}(0, \exp(x_1)^2) \end{aligned}$$

Scale of \(x_2\) varies exponentially with \(x_1\): no single step size works everywhere, motivating adaptive methods and reparameterization.
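For reference, the three log-densities and their gradients transcribe directly into code (a sketch of the formulas above, not the demo's internals; normalizing constants are dropped since HMC only needs \(\log\pi\) up to an additive constant):

```python
import numpy as np

def gaussian(rho=0.8):
    """Bivariate Gaussian with correlation rho."""
    def logp(x):
        x1, x2 = x
        return -(x1**2 - 2 * rho * x1 * x2 + x2**2) / (2 * (1 - rho**2))
    def grad(x):
        x1, x2 = x
        c = 1.0 / (1 - rho**2)
        return np.array([-c * (x1 - rho * x2), -c * (x2 - rho * x1)])
    return logp, grad

def banana():
    """Rosenbrock's banana."""
    def logp(x):
        x1, x2 = x
        return -(x1**2 + 100 * (x2 - x1**2)**2) / 200
    def grad(x):
        x1, x2 = x
        return np.array([-(2 * x1 - 400 * x1 * (x2 - x1**2)) / 200,
                         -(x2 - x1**2)])
    return logp, grad

def funnel():
    """Neal's funnel: x1 ~ N(0, 9), x2 | x1 ~ N(0, exp(x1)^2)."""
    def logp(x):
        x1, x2 = x
        return -x1**2 / 18 - x1 - 0.5 * x2**2 * np.exp(-2 * x1)
    def grad(x):
        x1, x2 = x
        return np.array([-x1 / 9 - 1 + x2**2 * np.exp(-2 * x1),
                         -x2 * np.exp(-2 * x1)])
    return logp, grad
```

The `-x1` term in the funnel's `logp` comes from the log of the conditional standard deviation, \(-\frac{1}{2}\log \exp(x_1)^2\); forgetting it is a common transcription bug.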

Simulation Controls

HMC has two tuning parameters: the step size \(\epsilon\) and the number of leapfrog steps \(L\). Their product \(L \times \epsilon\) is the total trajectory length, which determines how far the sampler can move in one iteration.

  • Step size \(\epsilon\): smaller steps integrate more accurately but require more gradient evaluations.
  • Leapfrog steps \(L\): more steps mean longer trajectories and more distant proposals.
  • Animation speed: slow down to watch individual trajectories unfold.
  • Target distribution: same distributions as Metropolis-Hastings, for direct comparison.
A good starting point is \(\epsilon = 0.1\) and \(L = 20\). If acceptance drops below 60%, reduce \(\epsilon\). If the sampler moves too slowly, increase \(L\). In practice, NUTS (used in Stan and PyMC) tunes both parameters automatically during warmup.
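This tuning advice can be automated crudely (an illustrative heuristic, far simpler than the dual-averaging scheme NUTS actually uses during warmup) by nudging \(\epsilon\) toward a target acceptance rate between iterations:

```python
def tune_step_size(eps, accept_rate, target=0.75):
    """Shrink eps when acceptance is too low, grow it gently when high."""
    return eps * 0.9 if accept_rate < target else eps * 1.05
```

Any such adaptation must stop before the samples you keep: changing \(\epsilon\) on the fly breaks detailed balance, so adaptive tuning belongs in warmup only.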

Trajectory Visualization

The main plot shows the target density \(\pi(x_1, x_2)\), with darker regions indicating higher density. Each iteration traces a curved path through parameter space, guided by the gradient of the log density. Green points are accepted proposals; orange points are rejected. Axes scale automatically to each distribution.

Joint Posterior Distribution π(x₁, x₂)

A well-tuned chain covers the high-density region evenly without getting stuck or retracing the same path.

Marginal Distribution of x₁

Histogram of \(x_1\) samples, approximating the true marginal \(\pi(x_1) = \int \pi(x_1, x_2)\, dx_2\).

Marginal Distribution of x₂

Histogram of \(x_2\) samples. With enough iterations this approximates the true marginal.

Current Iteration Details
Click "Start Sampling" or "Single Step" to begin.

Trace Plots

Trace plots show each parameter's sampled value over iterations. A well-mixing chain looks like a stationary noisy signal: rapid fluctuations around a stable mean with no long-term trends. Long flat stretches mean the chain is stuck; slow drift means burn-in is still ongoing.

Trace Plot: x₁ Evolution

Rapid fluctuations indicate good mixing; extended runs at similar values suggest poor exploration.

Trace Plot: x₂ Evolution

Compare with the \(x_1\) trace to check whether both parameters mix at similar rates.

Chain Diagnostics

Autocorrelation Function (ACF)

The ACF measures the correlation between samples separated by \(\tau\) iterations. For a well-mixing chain it decays quickly to zero. Slow decay means successive samples carry redundant information, reducing the effective sample size per iteration.

For a chain \(\{x_t\}\) with sample mean \(\bar{x}\), the sample ACF at lag \(\tau\) is:

$$\hat{\rho}(\tau) = \frac{\sum_{t=1}^{n-\tau} (x_t - \bar{x})(x_{t+\tau} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2}$$

By construction \(\hat{\rho}(0) = 1\). For an ergodic chain, \(\hat{\rho}(\tau) \to 0\) as \(\tau \to \infty\). Fast decay means efficient exploration.
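The estimator above is a one-liner per lag (a direct transcription; note the denominator sums over all \(n\) terms, which shrinks high-lag estimates toward zero):

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation rho_hat(tau) for tau = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()                 # center the chain at its sample mean
    denom = np.sum(xc**2)             # lag-0 sum of squares
    return np.array([np.sum(xc[:len(x) - t] * xc[t:]) / denom
                     for t in range(max_lag + 1)])
```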

Autocorrelation Function (ACF)

Blue line: ACF for \(x_1\). Red line: ACF for \(x_2\). The lag range extends to where the ACF first crosses zero, plus 33% more lags to confirm it stays near zero.

Performance Metrics

The metrics panel tracks the total number of iterations, the acceptance rate, the count of accepted proposals, and an estimated effective sample size.

Effective Sample Size (ESS)

Because MCMC samples are correlated, \(n\) iterations do not give \(n\) independent samples. The ESS estimates the equivalent number of independent draws:

$$\text{ESS} \approx \frac{n}{1 + 2\sum_{\tau=1}^{\infty} \rho(\tau)}$$

Higher ESS means more information per iteration. The ratio ESS/\(n\) is the sampling efficiency. HMC typically achieves much higher efficiency than random-walk methods on the same target.
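In code, truncating the infinite sum at the first negative \(\hat\rho(\tau)\) gives a usable estimate (a simple heuristic; Geyer's initial-sequence estimators are the more careful choice):

```python
import numpy as np

def effective_sample_size(x):
    """ESS = n / (1 + 2 * sum of autocorrelations), truncated at the
    first negative sample estimate."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc**2)
    s = 0.0
    for tau in range(1, n):
        rho = np.sum(xc[:n - tau] * xc[tau:]) / denom
        if rho < 0.0:
            break                     # noise floor reached: stop summing
        s += rho
    return n / (1 + 2 * s)
```

For nearly independent draws the sum stops almost immediately and ESS ≈ \(n\); for a sticky chain the accumulated autocorrelations push ESS far below \(n\).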

HMC acceptance rates typically fall between 60% and 90%, far above the ~23% that is optimal for random-walk Metropolis in high dimensions. This is because the leapfrog integrator keeps proposals close to a surface of constant energy, making large moves easy to accept.

Strengths and Limitations

HMC's main advantage is sampling efficiency. By following the geometry of the posterior, it produces nearly independent draws in far fewer iterations than random-walk methods, and the gap widens as dimension increases. Acceptance rates stay high because the symplectic integrator approximately conserves energy.

The main constraint is differentiability: HMC needs \(\nabla \log \pi(\mathbf{q})\) at every leapfrog step. Each trajectory costs \(L\) gradient evaluations, which is acceptable when gradients are cheap but non-trivial otherwise. Finite step sizes introduce small integration errors, corrected by the Metropolis step at the cost of occasional rejections. Step size and trajectory length both need tuning, though NUTS removes this in practice. HMC also struggles to cross wide low-density regions between well-separated modes, where specialized techniques or parallel tempering are needed.

Use HMC when the log-density is smooth and differentiable, particularly for posteriors where parameters are correlated and dimension is not negligible.

HMC vs. Metropolis-Hastings

Both algorithms target the same distributions and use Metropolis acceptance, but they differ fundamentally in how proposals are generated.

| Property | Metropolis-Hastings | HMC |
|---|---|---|
| Proposal mechanism | Random walk | Hamiltonian dynamics |
| Gradients needed? | No | Yes |
| Typical acceptance rate | 23-40% (high-d) | 65-90% |
| Autocorrelation | High | Low |
| Effective sample size | Much lower than \(n\) | Closer to \(n\) |
| Best for | Low-d, gradients unavailable | Higher-d, smooth posteriors |

References

Papers:

  • Duane, S., Kennedy, A.D., Pendleton, B.J., & Roweth, D. (1987). "Hybrid Monte Carlo." Physics Letters B, 195(2), 216-222.
  • Neal, R.M. (1996). "Bayesian Learning for Neural Networks." PhD Thesis, University of Toronto.
  • Neal, R.M. (2011). "MCMC Using Hamiltonian Dynamics." Handbook of Markov Chain Monte Carlo (eds. Brooks et al.), CRC Press, 113-162.
  • Betancourt, M. (2017). "A Conceptual Introduction to Hamiltonian Monte Carlo." arXiv:1701.02434.
  • Hoffman, M.D., & Gelman, A. (2014). "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo." Journal of Machine Learning Research, 15(1), 1593-1623.
  • Haario, H., Saksman, E., & Tamminen, J. (1999). "Adaptive Proposal Distribution for Random Walk Metropolis Algorithm." Computational Statistics, 14, 375-395.

Books:

  • Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press.
  • Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., & Rubin, D.B. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.

Software:

  • Stan: mc-stan.org - uses NUTS, an adaptive HMC variant
  • PyMC: pymc.io - Python probabilistic programming with HMC/NUTS
  • NumPyro: JAX-based probabilistic programming with efficient HMC