Metropolis–Hastings MCMC Algorithm
Interactive visualization of Markov Chain Monte Carlo sampling from 2D posterior distributions
Part 1 of the MCMC Samplers Series | Coming soon: Hamiltonian Monte Carlo (HMC) & Nested Sampling
Markov Chain Monte Carlo (MCMC) is a family of algorithms for sampling from probability distributions that are too complex to sample from directly. These methods are central to Bayesian inference, statistical physics, and computational biology.
The idea is to build a Markov chain whose stationary distribution is the target \(\pi(\mathbf{x})\). Run the chain long enough and its samples will approximate draws from \(\pi(\mathbf{x})\).
MCMC only requires you to evaluate \(\pi(\mathbf{x})\) up to a normalizing constant, which makes it applicable to a vast range of problems where direct sampling or analytical integration is out of reach. Given enough iterations, it is asymptotically exact.
Metropolis–Hastings (MH) is the simplest and most widely used MCMC method. Introduced by Metropolis et al. (1953) and generalized by Hastings (1970), it builds a sequence of samples \(\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots\) where each state depends only on the previous one. At each step, a candidate \(\mathbf{x}'\) is drawn from a proposal distribution centered on the current state (here a Gaussian random walk with width \(\sigma\)) and, for a symmetric proposal, accepted with probability \(\min\!\left(1, \pi(\mathbf{x}')/\pi(\mathbf{x}_t)\right)\); if rejected, the chain stays where it is and the current state is recorded again.
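The demo's implementation isn't shown, but the whole algorithm fits in a few lines. The following is a minimal Python/NumPy sketch of random-walk MH, assuming a symmetric isotropic Gaussian proposal with width \(\sigma\); the function and argument names are illustrative, not the demo's.

```python
import numpy as np

def metropolis_hastings(log_target, x0, sigma, n_iter, rng=None):
    """Random-walk Metropolis-Hastings with an isotropic Gaussian proposal.

    log_target -- callable returning the unnormalized log density log pi(x)
    x0         -- starting point (1-D array-like)
    sigma      -- proposal width (std. dev. of the Gaussian step)
    n_iter     -- number of iterations
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    log_p = log_target(x)
    samples = np.empty((n_iter, x.size))
    accepted = np.zeros(n_iter, dtype=bool)
    for t in range(n_iter):
        # Symmetric proposal, so the Hastings ratio q(x|x')/q(x'|x) cancels
        # and the acceptance probability reduces to min(1, pi(x')/pi(x)).
        proposal = x + sigma * rng.standard_normal(x.size)
        log_p_prop = log_target(proposal)
        if np.log(rng.random()) < log_p_prop - log_p:
            x, log_p = proposal, log_p_prop
            accepted[t] = True
        samples[t] = x  # a rejected proposal repeats the current state
    return samples, accepted
```

Calling `metropolis_hastings(log_pi, x0=[0.0, 0.0], sigma=1.0, n_iter=5000)` returns the chain and a boolean mask of accepted steps; the mask's mean is the acceptance rate referred to throughout the post.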
The three distributions below range from easy to genuinely hard for a random-walk sampler. They are standard benchmarks in the MCMC literature.
Correlated Gaussian (\(\rho = 0.8\)). Exposes the axis-by-axis weakness of random-walk proposals.
Mass on a curved manifold. A fixed proposal has no way to follow the geometry.
Scale of \(x_2\) varies exponentially with \(x_1\). No single proposal width works everywhere.
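The demo's exact parameterizations aren't given, so the log densities below are only illustrative stand-ins for these three targets: the stated correlated Gaussian with \(\rho = 0.8\), a Rosenbrock-style "banana" for the curved target, and Neal's funnel for the varying-scale target.

```python
import numpy as np

# Illustrative unnormalized log densities for the three targets.
# Only rho = 0.8 is stated above; the banana and funnel parameterizations
# are common textbook forms, not necessarily the demo's exact choices.

def log_corr_gauss(x, rho=0.8):
    # Zero-mean bivariate Gaussian, unit variances, correlation rho
    return -(x[0]**2 - 2.0 * rho * x[0] * x[1] + x[1]**2) / (2.0 * (1.0 - rho**2))

def log_banana(x, a=1.0, b=5.0):
    # Rosenbrock-style "banana": Gaussian-like mass bent along a parabola
    return -(a - x[0])**2 - b * (x[1] - x[0]**2)**2

def log_funnel(x):
    # Neal's funnel: x1 ~ N(0, 3^2), x2 | x1 ~ N(0, exp(x1))
    return -x[0]**2 / 18.0 - 0.5 * (x[1]**2 * np.exp(-x[0]) + x[0])
```

Any of these can be passed directly as `log_target` to the sampler sketched above.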
The proposal width \(\sigma\) is the most important parameter to tune: too small and the chain barely moves; too large and almost every proposal is rejected.
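A quick experiment makes the trade-off concrete. Reusing the `metropolis_hastings` and `log_corr_gauss` sketches above (so the same caveats apply), the acceptance rate and the lag-1 autocorrelation move in opposite directions as \(\sigma\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for sigma in (0.05, 0.5, 2.5, 10.0):
    samples, accepted = metropolis_hastings(log_corr_gauss, x0=[0.0, 0.0],
                                            sigma=sigma, n_iter=20_000, rng=rng)
    x1 = samples[:, 0] - samples[:, 0].mean()
    lag1 = (x1[:-1] @ x1[1:]) / (x1 @ x1)   # lag-1 autocorrelation of x1
    print(f"sigma={sigma:5.2f}  accept rate={accepted.mean():.2f}  lag-1 ACF={lag1:.2f}")
```

A very small \(\sigma\) accepts nearly everything but barely moves, so the lag-1 autocorrelation sits near 1; a very large \(\sigma\) rejects almost everything, which also leaves the chain highly autocorrelated. A commonly cited rule of thumb for random-walk MH is to aim for an acceptance rate somewhere around 0.2–0.5.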
Darker regions correspond to higher density. Each point is one sample: green for accepted proposals and orange for rejected ones. Axes adjust automatically to each distribution's natural range.
Watch whether the chain covers the high-density regions or gets trapped in one area.
Histogram of \(x_1\) samples. With enough iterations it approximates the true marginal \(\pi(x_1) = \int \pi(x_1, x_2)\, dx_2\).
Histogram of \(x_2\) samples. With enough iterations this approximates the true marginal.
Trace plots show each parameter's sampled value at every iteration. A well-mixing chain looks like a "fuzzy caterpillar": rapid fluctuations around a stable mean, no trends, no long flat stretches where the chain sat still. Discard the initial burn-in period before drawing conclusions.
Rapid fluctuations indicate good mixing. Long stretches at similar values suggest the chain is stuck.
Compare with the \(x_1\) trace: if one parameter mixes well and the other does not, the proposal geometry or scale is likely mismatched.
The autocorrelation function measures how correlated two samples are when separated by \(\tau\) iterations. For a well-mixing chain, it should decay quickly to zero. Slow decay means successive samples carry redundant information and the effective number of independent samples is small.
For a chain \(\{x_t\}\) with sample mean \(\bar{x}\), the ACF at lag \(\tau\) is:
$$\hat{\rho}(\tau) = \frac{\sum_{t=1}^{n-\tau} (x_t - \bar{x})(x_{t+\tau} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2}$$

By definition \(\hat{\rho}(0) = 1\), and for an ergodic chain \(\hat{\rho}(\tau) \to 0\) as \(\tau \to \infty\). How fast it gets there is what matters in practice.
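Translating the estimator above into code is straightforward; this is a sketch rather than the demo's implementation:

```python
import numpy as np

def acf(x, max_lag):
    """Empirical autocorrelation function, following the estimator above."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = xc @ xc                       # sum of squared deviations
    return np.array([1.0] + [(xc[:-tau] @ xc[tau:]) / denom
                             for tau in range(1, max_lag + 1)])
```

For example, `acf(samples[:, 0], max_lag=200)` gives the analogue of the \(x_1\) curve in the plot below.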
Blue line: ACF for \(x_1\). Red line: ACF for \(x_2\). The lag axis extends to where the ACF first crosses zero, plus 33% extra to confirm it stays near zero.
Because MCMC samples are correlated, \(n\) iterations do not give you \(n\) independent samples. The ESS estimates how many independent samples they are worth:
$$\text{ESS} \approx \frac{n}{1 + 2\sum_{\tau=1}^{\infty} \rho(\tau)}$$

The ratio \(\text{ESS}/n\) is the sampling efficiency. High autocorrelation drives it toward zero. Maximizing ESS for a fixed computational budget is the central goal of MCMC tuning.
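The infinite sum has to be truncated in practice. A simple (and fairly crude) choice is to stop at the first negative ACF value; the sketch below reuses the `acf` helper from the previous sketch, whereas production libraries such as ArviZ use more careful estimators.

```python
import numpy as np

def effective_sample_size(x):
    """ESS with a crude truncation rule: sum rho(tau) until it first goes negative."""
    n = len(x)
    rho = acf(x, max_lag=n // 2)          # acf() from the sketch above
    tail = rho[1:]
    neg = np.flatnonzero(tail < 0)
    cutoff = neg[0] if neg.size else tail.size
    return n / (1.0 + 2.0 * tail[:cutoff].sum())
```

Dividing the result by \(n\) gives the efficiency discussed above.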
MH works for any target where you can evaluate \(\pi(\mathbf{x})\) up to a constant, requires no gradient information, and is easy to implement. The accept-reject step is constructed to satisfy detailed balance with respect to \(\pi\), so \(\pi\) is the chain's stationary distribution; under mild conditions (irreducibility and aperiodicity) the chain also converges to it.
The weaknesses become apparent in harder problems. The random-walk proposal is blind to the geometry of the posterior, so mixing degrades in high dimensions, along curved manifolds, and wherever the posterior has varying scales (as in Neal's funnel). The proposal width \(\sigma\) must be hand-tuned: too small and the chain barely moves, too large and the acceptance rate collapses. Multimodal distributions are a persistent problem, since the chain can spend arbitrarily long confined to one mode. These limitations motivated Hamiltonian Monte Carlo and other gradient-based methods.
The distributions above reveal where MH struggles: curved geometries, varying scales, and high dimensionality. Hamiltonian Monte Carlo addresses all three by using gradient information to propose distant states coherently rather than diffusing randomly. If you want to see how that plays out on the same distributions, the HMC demo is the natural next step.