Metropolis–Hastings MCMC Algorithm

Interactive visualization of Markov Chain Monte Carlo sampling from 2D posterior distributions
Part 1 of the MCMC Samplers Series | Coming soon: Hamiltonian Monte Carlo (HMC) & Nested Sampling

What is MCMC?

Markov Chain Monte Carlo (MCMC) is a family of algorithms for sampling from probability distributions that are too complex to sample from directly. They are central to Bayesian inference, statistical physics, and computational biology.

The idea is to build a Markov chain whose stationary distribution is the target \(\pi(\mathbf{x})\). Run the chain long enough and its samples will approximate draws from \(\pi(\mathbf{x})\).

Why MCMC?

MCMC only requires you to evaluate \(\pi(\mathbf{x})\) up to a normalizing constant, which makes it applicable to a vast range of problems where direct sampling or analytical integration is out of reach. Given enough iterations, it is asymptotically exact.

The Metropolis–Hastings Algorithm

Metropolis–Hastings (MH) is the simplest and most widely used MCMC method. Introduced by Metropolis et al. (1953) and generalized by Hastings (1970), it builds a sequence of samples \(\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots\) where each state depends only on the previous one.

Algorithm Steps:
  1. Initialize: Start with an initial state \(\mathbf{x}_0\)
  2. Propose: Generate a candidate state \(\mathbf{x}' \sim q(\cdot \mid \mathbf{x}_t)\) from a proposal distribution
    In this demo: \(q(\mathbf{x}' \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_t, \sigma^2 \mathbf{I})\) (Gaussian random walk)
  3. Compute acceptance ratio: $$\alpha = \min\left(1, \frac{\pi(\mathbf{x}') \, q(\mathbf{x}_t \mid \mathbf{x}')}{\pi(\mathbf{x}_t) \, q(\mathbf{x}' \mid \mathbf{x}_t)}\right)$$ For symmetric proposals (like our Gaussian random walk), this simplifies to: $$\alpha = \min\left(1, \frac{\pi(\mathbf{x}')}{\pi(\mathbf{x}_t)}\right)$$
  4. Accept or reject: Generate \(u \sim \text{Uniform}(0,1)\) $$\mathbf{x}_{t+1} = \begin{cases} \mathbf{x}' & \text{if } u < \alpha \\ \mathbf{x}_t & \text{otherwise} \end{cases}$$
The chain always accepts moves to higher-density regions, but also sometimes accepts moves to lower-density ones. Without these occasional downhill moves, the chain would collapse onto a local mode and never explore the full distribution.
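The four steps above fit in a few lines of code. The sketch below is a minimal Python implementation (not the demo's actual code) that works with log-densities for numerical stability, so the acceptance test compares \(\log u\) with \(\log \pi(\mathbf{x}') - \log \pi(\mathbf{x}_t)\):

```python
import numpy as np

def metropolis_hastings(log_target, x0, sigma=0.5, n_steps=5000, rng=None):
    """Random-walk Metropolis sampler for a target given by its log-density."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    accepted = 0
    log_p = log_target(x)
    for t in range(n_steps):
        # Step 2: propose from the symmetric Gaussian random walk N(x_t, sigma^2 I)
        proposal = x + sigma * rng.standard_normal(x.size)
        log_p_prop = log_target(proposal)
        # Steps 3-4: symmetric proposal, so alpha = min(1, pi(x')/pi(x_t));
        # in log space, accept iff log(u) < log pi(x') - log pi(x_t)
        if np.log(rng.uniform()) < log_p_prop - log_p:
            x, log_p = proposal, log_p_prop
            accepted += 1
        samples[t] = x  # on rejection this records x_t again, as the algorithm requires
    return samples, accepted / n_steps

# Example: sample a standard 2D Gaussian
log_std_normal = lambda x: -0.5 * np.dot(x, x)
samples, acc_rate = metropolis_hastings(log_std_normal, x0=[0.0, 0.0])
```

Note that a rejected proposal still advances the chain: the current state is recorded again, which is what makes the stationary distribution come out right.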

Target Distributions

The three distributions below range from easy to genuinely hard for a random-walk sampler. They are standard benchmarks in the MCMC literature.

Bivariate Gaussian
$$\pi(x_1, x_2) \propto \exp\!\left(-\frac{x_1^2 - 2\rho x_1 x_2 + x_2^2}{2(1-\rho^2)}\right)$$

Correlated Gaussian (\(\rho = 0.8\)). Exposes the axis-by-axis weakness of random-walk proposals.

Rosenbrock's Banana
$$\pi(x_1, x_2) \propto \exp\!\left(-\frac{x_1^2 + 100(x_2 - x_1^2)^2}{200}\right)$$

Mass on a curved manifold. A fixed proposal has no way to follow the geometry.

Neal's Funnel
$$x_1 \sim \mathcal{N}(0,9),\quad x_2 \mid x_1 \sim \mathcal{N}(0, e^{2x_1})$$

Scale of \(x_2\) varies exponentially with \(x_1\). No single proposal width works everywhere.
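The three targets above are easy to write down as log-densities (up to additive constants), which is all the sampler needs. The function names below are illustrative, not taken from the demo:

```python
import numpy as np

def log_correlated_gaussian(x, rho=0.8):
    """Bivariate Gaussian with unit variances and correlation rho."""
    x1, x2 = x
    return -(x1**2 - 2.0 * rho * x1 * x2 + x2**2) / (2.0 * (1.0 - rho**2))

def log_rosenbrock(x):
    """Rosenbrock's banana: mass concentrated along the parabola x2 = x1^2."""
    x1, x2 = x
    return -(x1**2 + 100.0 * (x2 - x1**2) ** 2) / 200.0

def log_neals_funnel(x):
    """Neal's funnel: x1 ~ N(0, 9), x2 | x1 ~ N(0, exp(2*x1))."""
    x1, x2 = x
    # The -x1 term is the log normalizer of N(0, e^{2 x1}); it cannot be
    # dropped because it depends on x1.
    return -x1**2 / 18.0 - x2**2 / (2.0 * np.exp(2.0 * x1)) - x1
```

The funnel example shows why "up to a constant" requires care: the conditional variance of \(x_2\) depends on \(x_1\), so its normalizer contributes an \(x_1\)-dependent term to the joint log-density.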

Simulation Controls

The proposal width \(\sigma\) is the most important parameter to tune: too small and the chain barely moves; too large and almost every proposal is rejected.

Step size for proposals. Larger \(\sigma\) means larger jumps.
Delay between iterations. Slow it down to watch individual steps.
Switch distributions to see how sampler behavior changes.
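One common warm-up heuristic automates the \(\sigma\) tuning described above: run short batches, then nudge \(\sigma\) up when the acceptance rate is above target and down when it is below, and freeze \(\sigma\) before collecting samples. This is a sketch of that idea (a multiplicative batch update, not the demo's mechanism or the adaptive scheme of Haario et al.):

```python
import numpy as np

def tune_sigma(log_target, x0, target_rate=0.5, rounds=20, batch=100, rng=None):
    """Crude warm-up tuner: adapt the proposal width toward target_rate.
    Hypothetical helper; target_rate=0.5 follows the 40-60% guidance for 2D."""
    rng = rng or np.random.default_rng(1)
    x = np.asarray(x0, dtype=float)
    sigma = 1.0
    log_p = log_target(x)
    for _ in range(rounds):
        accepts = 0
        for _ in range(batch):
            proposal = x + sigma * rng.standard_normal(x.size)
            log_p_prop = log_target(proposal)
            if np.log(rng.uniform()) < log_p_prop - log_p:
                x, log_p = proposal, log_p_prop
                accepts += 1
        rate = accepts / batch
        # Accepting too often -> steps are too timid -> grow sigma, and vice versa
        sigma *= np.exp(rate - target_rate)
    return sigma
```

Adaptation must stop (or decay) before the samples you keep, since a proposal that keeps changing based on the chain's history breaks the Markov property in general.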

MCMC Chain Visualization

Darker regions correspond to higher density. Each point is one sample: green for accepted proposals and orange for rejected ones. Axes adjust automatically to each distribution's natural range.

Joint Posterior Distribution π(x₁, x₂)

Watch whether the chain covers the high-density regions or gets trapped in one area.

Marginal Distribution of x₁

Histogram of \(x_1\) samples. With enough iterations it approximates the true marginal \(\pi(x_1) = \int \pi(x_1, x_2)\, dx_2\).

Marginal Distribution of x₂

Histogram of \(x_2\) samples. With enough iterations this approximates the true marginal.

Current Iteration Details
Click "Start Sampling" or "Single Step" to begin.

Chain Trace Plots

Trace plots show each parameter's sampled value at every iteration. A well-mixing chain looks like a "fuzzy caterpillar": rapid fluctuations around a stable mean, no trends, no long flat stretches where the chain sat still. Discard the initial burn-in period before drawing conclusions.

Trace Plot: x₁ Evolution

Rapid fluctuations indicate good mixing. Long stretches at similar values suggest the chain is stuck.

Trace Plot: x₂ Evolution

Compare with the \(x_1\) trace: if one parameter mixes well and the other does not, the proposal geometry or scale is likely mismatched.

Chain Diagnostics

Autocorrelation Function (ACF)

The autocorrelation function measures how correlated two samples are when separated by \(\tau\) iterations. For a well-mixing chain, it should decay quickly to zero. Slow decay means successive samples carry redundant information and the effective number of independent samples is small.

For a chain \(\{x_t\}\) with sample mean \(\bar{x}\), the ACF at lag \(\tau\) is:

$$\hat{\rho}(\tau) = \frac{\sum_{t=1}^{n-\tau} (x_t - \bar{x})(x_{t+\tau} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2}$$

By definition \(\hat{\rho}(0) = 1\), and for an ergodic chain \(\hat{\rho}(\tau) \to 0\) as \(\tau \to \infty\). How fast it gets there is what matters in practice.
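A direct implementation of this estimator is short; the sketch below (illustrative, not the demo's code) computes \(\hat{\rho}(\tau)\) for all lags up to `max_lag`:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation of a 1D chain at lags 0..max_lag,
    matching the estimator above (shared denominator, centered chain)."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.sum(xc**2)
    return np.array([
        np.sum(xc[: len(x) - tau] * xc[tau:]) / denom
        for tau in range(max_lag + 1)
    ])
```

For long chains and many lags, the same quantity is usually computed via the FFT, but the direct form makes the definition transparent.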

Autocorrelation Function (ACF)

Blue line: ACF for \(x_1\). Red line: ACF for \(x_2\). The lag axis extends to where the ACF first crosses zero, plus 33% extra to confirm it stays near zero.

Performance Metrics

Live counters report the total number of iterations, the acceptance rate, the number of accepted proposals, and an estimated effective sample size as the chain runs.
Effective Sample Size (ESS)

Because MCMC samples are correlated, \(n\) iterations do not give you \(n\) independent samples. The ESS estimates how many independent samples they are worth:

$$\text{ESS} \approx \frac{n}{1 + 2\sum_{\tau=1}^{\infty} \rho(\tau)}$$

The ratio ESS/\(n\) is the sampling efficiency. High autocorrelation drives it toward zero. Maximizing ESS for a fixed computational budget is the central goal of MCMC tuning.
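In practice the infinite sum must be truncated, since ACF estimates at large lags are pure noise. A common, slightly conservative rule is to stop at the first non-positive autocorrelation; this sketch (an illustration, not the demo's estimator) uses that rule:

```python
import numpy as np

def effective_sample_size(x):
    """ESS estimate: truncate the autocorrelation sum at the first
    lag where the estimated ACF drops to zero or below."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc**2)
    acf_sum = 0.0
    for tau in range(1, n):
        rho = np.sum(xc[: n - tau] * xc[tau:]) / denom
        if rho <= 0:
            break
        acf_sum += rho
    return n / (1.0 + 2.0 * acf_sum)
```

For nearly independent samples the estimate stays close to \(n\); for a strongly autocorrelated chain it collapses, which is exactly the efficiency loss the text describes.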

For random-walk Metropolis in high dimensions, the asymptotically optimal acceptance rate is approximately 0.234 (Roberts et al., 1997; Roberts & Rosenthal, 2001). In 2D, rates between 40% and 60% tend to work better in practice, though the right target depends on the geometry of the posterior.

Strengths and Limitations

MH works for any target where you can evaluate \(\pi(\mathbf{x})\) up to a constant, requires no gradient information, and is easy to implement. The accept-reject rule makes the transition kernel satisfy detailed balance with respect to \(\pi\) by construction, so \(\pi\) is a stationary distribution; under mild conditions (irreducibility and aperiodicity), the chain also converges to it.

The weaknesses become apparent in harder problems. The random-walk proposal is blind to the geometry of the posterior, so mixing degrades in high dimensions, along curved manifolds, and wherever the posterior has varying scales (as in Neal's funnel). The proposal width \(\sigma\) must be hand-tuned: too small and the chain barely moves, too large and the acceptance rate collapses. Multimodal distributions are a persistent problem, since the chain can spend arbitrarily long confined to one mode. These limitations motivated Hamiltonian Monte Carlo and other gradient-based methods.

Beyond Metropolis–Hastings

The distributions above reveal where MH struggles: curved geometries, varying scales, and high dimensionality. Hamiltonian Monte Carlo addresses all three by using gradient information to propose distant states coherently rather than diffusing randomly. If you want to see how that plays out on the same distributions, the HMC demo is the natural next step.

References

  • Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., & Teller, E. (1953). "Equation of State Calculations by Fast Computing Machines." Journal of Chemical Physics, 21(6), 1087–1092.
  • Hastings, W.K. (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications." Biometrika, 57(1), 97–109.
  • Roberts, G.O., Gelman, A., & Gilks, W.R. (1997). "Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms." The Annals of Applied Probability, 7(1), 110–120.
  • Haario, H., Saksman, E., & Tamminen, J. (1999). "Adaptive Proposal Distribution for Random Walk Metropolis Algorithm." Computational Statistics, 14, 375–395.
  • Neal, R.M. (2003). "Slice Sampling." The Annals of Statistics, 31(3), 705–767.
  • Roberts, G.O., & Rosenthal, J.S. (2001). "Optimal Scaling for Various Metropolis-Hastings Algorithms." Statistical Science, 16(4), 351–367.
  • Robert, C.P., & Casella, G. (2004). Monte Carlo Statistical Methods (2nd ed.). Springer.
  • Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press.
  • Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., & Rubin, D.B. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.