Hamiltonian Monte Carlo (HMC)
Interactive visualization of gradient-based MCMC sampling from 2D posterior distributions
Part 2 of the MCMC Samplers Series | See also: Metropolis-Hastings
Hamiltonian Monte Carlo (HMC) is an MCMC algorithm that uses gradient information to propose new states. Rather than taking random steps, it simulates the physics of a particle moving through parameter space along trajectories shaped by the geometry of the log-density. The result is large, directed moves that are accepted at high rates.
The core idea is to augment the parameter vector \(\mathbf{q}\) with auxiliary momentum variables \(\mathbf{p}\), then simulate Hamiltonian dynamics using a symplectic integrator. Energy conservation keeps the sampler on surfaces of roughly constant probability, while the momentum carries it across low-density regions that would trap a random-walk sampler.
HMC was introduced by Duane et al. (1987) for lattice field theory and brought into statistics by Neal (1996, 2011). Today it underlies most modern probabilistic programming systems, usually in the form of the No-U-Turn Sampler (NUTS), which removes the need to hand-tune trajectory length. Compared to random-walk Metropolis-Hastings, HMC produces far lower autocorrelation and scales much better with the number of parameters.
HMC defines a Hamiltonian over position \(\mathbf{q}\) (the parameters of interest) and momentum \(\mathbf{p}\):
$$H(\mathbf{q}, \mathbf{p}) = U(\mathbf{q}) + K(\mathbf{p})$$
The potential energy \(U(\mathbf{q}) = -\log \pi(\mathbf{q})\) is the negative log target density. The kinetic energy \(K(\mathbf{p}) = \frac{1}{2}\mathbf{p}^T M^{-1} \mathbf{p}\) comes from an auxiliary Gaussian over \(\mathbf{p}\), with mass matrix \(M\). This demo uses \(M = I\), so \(K(\mathbf{p}) = \frac{1}{2}\sum_i p_i^2\).
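With \(M = I\), these energies translate directly to code. A minimal sketch for a 2D standard-normal target (the choice of target is illustrative, not the demo's):

```python
import numpy as np

def U(q):
    """Potential energy: negative log density of a 2D standard normal,
    up to an additive constant."""
    return 0.5 * np.dot(q, q)

def K(p):
    """Kinetic energy with identity mass matrix M = I."""
    return 0.5 * np.dot(p, p)

def H(q, p):
    """Total energy; Hamiltonian dynamics approximately conserve this."""
    return U(q) + K(p)
```

Because the normalizing constant of \(\pi\) only shifts \(U\) by a constant, it cancels in the acceptance step and can be dropped.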
The demo includes three distributions that illustrate different challenges for MCMC samplers, from well-behaved to genuinely difficult.
Correlated (\(\rho = 0.8\)): HMC follows the tilted ellipse naturally, where random walks struggle.
Banana: a curved manifold produced by the nonlinear transformation \(x_2 - x_1^2\). Fixed-width proposals are too wide in one direction and too narrow in the other.
Funnel: the scale of \(x_2\) varies exponentially with \(x_1\). No single step size works everywhere, motivating adaptive methods and reparameterization.
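The three targets can be sketched as unnormalized log-densities. The demo's exact parameterizations are not given here, so the constants below (the banana curvature `b` and the funnel scales) are assumptions based on the standard forms of these distributions:

```python
import numpy as np

RHO = 0.8  # correlation used by the demo's first target

def log_correlated(x):
    """Bivariate normal with correlation RHO, up to a constant."""
    x1, x2 = x
    return -(x1**2 - 2 * RHO * x1 * x2 + x2**2) / (2 * (1 - RHO**2))

def log_banana(x, b=1.0):
    """Curved target: a Gaussian warped via x2 -> x2 - b * x1**2.
    The curvature b is an assumed value, not necessarily the demo's."""
    x1, x2 = x
    return -0.5 * x1**2 - 0.5 * (x2 - b * x1**2)**2

def log_funnel(x):
    """Neal-style funnel: x1 ~ N(0, 3^2), x2 | x1 ~ N(0, exp(x1)),
    so the scale of x2 grows exponentially with x1."""
    x1, x2 = x
    return -0.5 * (x1 / 3.0)**2 - 0.5 * x1 - 0.5 * x2**2 * np.exp(-x1)
```

Each function returns \(\log \pi\) up to a constant, which is all HMC needs; the corresponding gradients follow by differentiating these expressions.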
HMC has two tuning parameters: the step size \(\epsilon\) and the number of leapfrog steps \(L\). Their product \(L \times \epsilon\) is the total trajectory length, which determines how far the sampler can move in one iteration.
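One HMC iteration built from these two parameters can be sketched as follows (NumPy, identity mass matrix; the function names and the standard-normal usage target are illustrative):

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, L):
    """Symplectic leapfrog: half momentum step, L alternating full steps,
    final half momentum step (M = I)."""
    q, p = q.copy(), p.copy()
    p -= 0.5 * eps * grad_U(q)           # half step for momentum
    for i in range(L):
        q += eps * p                     # full step for position
        if i < L - 1:
            p -= eps * grad_U(q)         # full step for momentum
    p -= 0.5 * eps * grad_U(q)           # final half step for momentum
    return q, p

def hmc_step(q, log_pi, grad_log_pi, eps, L, rng):
    """One HMC iteration: draw momentum, integrate, Metropolis-correct."""
    p = rng.standard_normal(q.shape)     # p ~ N(0, I)
    q_new, p_new = leapfrog(q, p, lambda x: -grad_log_pi(x), eps, L)
    # Energy change; zero under exact integration, small under leapfrog.
    dH = (-log_pi(q_new) + 0.5 * p_new @ p_new) - (-log_pi(q) + 0.5 * p @ p)
    if np.log(rng.uniform()) < -dH:
        return q_new, True               # accept
    return q, False                      # reject

# Usage on a 2D standard normal: log pi(q) = -||q||^2 / 2.
rng = np.random.default_rng(0)
q = np.zeros(2)
samples, accepts = [], 0
for _ in range(2000):
    q, ok = hmc_step(q, lambda x: -0.5 * x @ x, lambda x: -x,
                     eps=0.2, L=10, rng=rng)
    samples.append(q)
    accepts += ok
samples = np.array(samples)
```

Note the trade-off the text describes: shrinking \(\epsilon\) reduces integration error (raising acceptance) but requires a larger \(L\), and hence more gradient evaluations, to keep the trajectory length \(L \times \epsilon\) the same.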
The main plot shows the target density \(\pi(q_1, q_2)\), with darker regions indicating higher density. Each iteration traces a curved path through parameter space, guided by the gradient of the log density. Green points are accepted proposals; orange points are rejected. Axes scale automatically to each distribution.
A well-tuned chain covers the high-density region evenly without getting stuck or retracing the same path.
Histogram of \(x_1\) samples, approximating the true marginal \(\pi(x_1) = \int \pi(x_1, x_2)\, dx_2\).
Histogram of \(x_2\) samples. With enough iterations this approximates the true marginal.
Trace plots show each parameter's sampled value over iterations. A well-mixing chain looks like a stationary noisy signal: rapid fluctuations around a stable mean with no long-term trends. Long flat stretches mean the chain is stuck; slow drift means burn-in is still ongoing.
Rapid fluctuations indicate good mixing; extended runs at similar values suggest poor exploration.
Compare with the \(x_1\) trace to check whether both parameters mix at similar rates.
The ACF measures the correlation between samples separated by \(\tau\) iterations. For a well-mixing chain it decays quickly to zero. Slow decay means successive samples carry redundant information, reducing the effective sample size per iteration.
For a chain \(\{x_t\}\) with sample mean \(\bar{x}\), the sample ACF at lag \(\tau\) is:
$$\hat{\rho}(\tau) = \frac{\sum_{t=1}^{n-\tau} (x_t - \bar{x})(x_{t+\tau} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2}$$By construction \(\hat{\rho}(0) = 1\). For an ergodic chain, \(\hat{\rho}(\tau) \to 0\) as \(\tau \to \infty\). Fast decay means efficient exploration.
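The estimator translates directly to code. A sketch in NumPy, keeping the shared denominator \(\sum_t (x_t - \bar{x})^2\) across all lags, exactly as in the formula:

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation at lags 0..max_lag, with the usual
    shared denominator sum((x_t - mean)^2)."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.sum(xc**2)
    return np.array([np.sum(xc[:len(x) - tau] * xc[tau:]) / denom
                     for tau in range(max_lag + 1)])
```

For white noise the estimates at positive lags hover near zero (roughly within \(\pm 2/\sqrt{n}\)); a slowly decaying curve is the signature of a poorly mixing chain.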
Blue line: ACF for \(x_1\). Red line: ACF for \(x_2\). The lag range extends to where the ACF first crosses zero, plus 33% more lags to confirm it stays near zero.
Because MCMC samples are correlated, \(n\) iterations do not give \(n\) independent samples. The ESS estimates the equivalent number of independent draws:
$$\text{ESS} \approx \frac{n}{1 + 2\sum_{\tau=1}^{\infty} \rho(\tau)}$$Higher ESS means more information per iteration. The ratio ESS/\(n\) is the sampling efficiency. HMC typically achieves much higher efficiency than random-walk methods on the same target.
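In practice the infinite sum must be truncated. A common heuristic, sketched below, stops at the first non-positive autocorrelation; the demo's own truncation rule may differ:

```python
import numpy as np

def effective_sample_size(x):
    """ESS = n / (1 + 2 * sum of autocorrelations), truncating the sum
    at the first non-positive sample ACF value."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc**2)
    s = 0.0
    for tau in range(1, n):
        rho = np.sum(xc[:n - tau] * xc[tau:]) / denom
        if rho <= 0:
            break
        s += rho
    return n / (1 + 2 * s)
```

For nearly independent draws the ESS approaches \(n\); for a strongly autocorrelated chain, such as an AR(1) process with coefficient 0.9, it collapses to a small fraction of \(n\).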
HMC's main advantage is sampling efficiency. By following the geometry of the posterior, it produces nearly independent draws in far fewer iterations than random-walk methods, and the gap widens as dimension increases. Acceptance rates stay high because the symplectic integrator approximately conserves energy.
The main constraint is differentiability: HMC needs \(\nabla \log \pi(\mathbf{q})\) at every leapfrog step. Each trajectory costs \(L\) gradient evaluations, which is acceptable when gradients are cheap but non-trivial otherwise. Finite step sizes introduce small integration errors, corrected by the Metropolis step at the cost of occasional rejections. Step size and trajectory length both need tuning, though NUTS removes this in practice. HMC also struggles to cross wide low-density regions between well-separated modes, where specialized techniques or parallel tempering are needed.
Use HMC when the log-density is smooth and differentiable, particularly for moderate- to high-dimensional posteriors with correlated parameters.
Both algorithms target the same distributions and use Metropolis acceptance, but they differ fundamentally in how proposals are generated.
Papers:
Duane, S., Kennedy, A. D., Pendleton, B. J., & Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2), 216–222.
Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.
Books:
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer.
Software:
Stan and PyMC, probabilistic programming systems whose default sampler is NUTS.