Parallel Tempering (Temperature Ladder MCMC)

Part 4 of the MCMC Samplers Series • Interactive visualization of temperature ladder sampling for multimodal distributions

What is Parallel Tempering?

Parallel Tempering is an MCMC method designed to overcome poor mixing between modes in multimodal distributions, where a single chain can stay trapped in one mode. The approach is to run multiple MCMC chains simultaneously at different "temperatures" and periodically propose swapping states between chains.

The Core Idea:

Instead of sampling only from the target posterior \(\pi(\theta)\), run \(K\) parallel chains, each sampling from a "tempered" distribution:

$$\pi_\beta(\theta) \propto \pi(\theta)^\beta = [\mathcal{L}(\theta) \pi_0(\theta)]^\beta$$

where \(0 < \beta_K < \beta_{K-1} < \ldots < \beta_2 < \beta_1 = 1\) is the temperature ladder. The cold chain at \(\beta_1 = 1\) samples the true posterior, while hotter chains at smaller \(\beta\) sample increasingly flattened versions of the distribution, up to near-uniform sampling as \(\beta_k \to 0\).

Why does this work? At low \(\beta\), the energy landscape is flattened and barriers between modes are suppressed, so hot chains cross freely between regions that trap a cold chain. Replica exchange swaps carry information about newly discovered modes from hot chains down to the cold chain, which samples the true posterior.
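
To make the flattening concrete, here is a minimal sketch that evaluates the bimodal mixture used later on this page at one mode and at the midpoint between the modes, then reports how the mode-to-barrier density ratio shrinks as \(\beta\) decreases. The function names are illustrative and are not taken from the page's implementation.

```python
import math

def log_mixture(x1, x2):
    """Log density of 0.4*N((-2,-2), 0.8^2 I) + 0.6*N((2,2), 0.8^2 I)."""
    s2 = 0.8 ** 2

    def log_comp(weight, mean):
        d2 = (x1 - mean) ** 2 + (x2 - mean) ** 2
        return math.log(weight) - math.log(2 * math.pi * s2) - d2 / (2 * s2)

    a, b = log_comp(0.4, -2.0), log_comp(0.6, 2.0)
    hi = max(a, b)  # log-sum-exp for numerical stability
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))

def barrier_ratio(beta):
    """pi_beta(mode) / pi_beta(midpoint); the normalizing constant cancels."""
    return math.exp(beta * (log_mixture(2.0, 2.0) - log_mixture(0.0, 0.0)))

for beta in (1.0, 0.5, 0.1, 0.01):
    print(f"beta={beta:<5} mode/barrier density ratio = {barrier_ratio(beta):.2f}")
```

At \(\beta = 1\) the mode is roughly 300 times more probable than the midpoint, while at \(\beta = 0.1\) the ratio drops below 2, so a random-walk chain at that temperature crosses the barrier with little resistance.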

The Algorithm

Parallel tempering alternates between two steps: local MCMC moves within each chain and replica exchange swaps between adjacent temperatures.

Algorithm: Parallel Tempering
  1. Initialize: Set up \(K\) chains, each at temperature \(\beta_k\), with initial states \(\theta^{(k)}\)
  2. MCMC step: For each chain \(k = 1, \ldots, K\):
    • Propose \(\theta' \sim q(\cdot \mid \theta^{(k)})\) using a symmetric Gaussian random walk
    • Accept with probability \(\alpha = \min\!\left(1,\, \left[\frac{\pi(\theta')}{\pi(\theta^{(k)})}\right]^{\!\beta_k}\right)\) (the proposal densities cancel because \(q\) is symmetric)
  3. Exchange step: For each adjacent pair \((k, k+1)\):
    • Propose to swap states: \(\theta^{(k)} \leftrightarrow \theta^{(k+1)}\)
    • Accept swap with probability: $$\alpha_{\text{swap}} = \min\left(1, \frac{\pi(\theta^{(k+1)})^{\beta_k} \pi(\theta^{(k)})^{\beta_{k+1}}}{\pi(\theta^{(k)})^{\beta_k} \pi(\theta^{(k+1)})^{\beta_{k+1}}}\right)$$ which simplifies to: $$\alpha_{\text{swap}} = \min\left(1, \exp\left[(\beta_k - \beta_{k+1})(\log \pi(\theta^{(k+1)}) - \log \pi(\theta^{(k)}))\right]\right)$$
  4. Repeat: Iterate steps 2-3. Collect samples from the cold chain (\(\beta_1 = 1\)) only
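
The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the page's actual implementation: the target is the bimodal mixture described later on this page, and the ladder values, the step size, the `swap_every` interval, and the choice to widen hot-chain proposals by `1/sqrt(beta)` are all assumptions made for the example.

```python
import math
import random

def log_target(theta):
    """Bimodal mixture 0.4*N((-2,-2), 0.8^2 I) + 0.6*N((2,2), 0.8^2 I),
    up to an additive constant."""
    x1, x2 = theta
    s2 = 0.8 ** 2
    a = math.log(0.4) - ((x1 + 2) ** 2 + (x2 + 2) ** 2) / (2 * s2)
    b = math.log(0.6) - ((x1 - 2) ** 2 + (x2 - 2) ** 2) / (2 * s2)
    hi = max(a, b)
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))

def parallel_tempering(log_pi, betas, n_iter, step=0.8, swap_every=5, seed=0):
    """Return cold-chain samples and per-pair swap acceptance rates."""
    rng = random.Random(seed)
    K = len(betas)
    # Step 1: initialize every chain at a random point.
    states = [(rng.uniform(-4, 4), rng.uniform(-4, 4)) for _ in range(K)]
    logps = [log_pi(s) for s in states]
    attempts = [0] * (K - 1)
    accepts = [0] * (K - 1)
    cold = []
    for it in range(n_iter):
        # Step 2: one random-walk Metropolis update per chain.
        for k, beta in enumerate(betas):
            scale = step / math.sqrt(beta)  # assumption: hotter chains step wider
            prop = tuple(x + rng.gauss(0.0, scale) for x in states[k])
            lp = log_pi(prop)
            # 1 - random() lies in (0, 1], so the log is always defined.
            if math.log(1.0 - rng.random()) < beta * (lp - logps[k]):
                states[k], logps[k] = prop, lp
        # Step 3: propose swaps between adjacent temperatures.
        if (it + 1) % swap_every == 0:
            for k in range(K - 1):
                attempts[k] += 1
                log_alpha = (betas[k] - betas[k + 1]) * (logps[k + 1] - logps[k])
                if math.log(1.0 - rng.random()) < log_alpha:
                    states[k], states[k + 1] = states[k + 1], states[k]
                    logps[k], logps[k + 1] = logps[k + 1], logps[k]
                    accepts[k] += 1
        # Step 4: record only the beta = 1 chain.
        cold.append(states[0])
    rates = [a / n if n else 0.0 for a, n in zip(accepts, attempts)]
    return cold, rates

betas = [1.0, 0.5, 0.25, 0.1, 0.05]  # illustrative ladder, cold chain first
samples, rates = parallel_tempering(log_target, betas, n_iter=20000)
kept = samples[500:]  # discard an initial transient
frac_upper = sum(1 for x1, x2 in kept if x1 + x2 > 0) / len(kept)
print("swap rates per adjacent pair:", [round(r, 2) for r in rates])
print("fraction of cold samples near (+2,+2):", round(frac_upper, 2))
```

The untempered log density of each state is cached in `logps`, so a swap needs no new density evaluations: the acceptance test uses only the two cached values and the two inverse temperatures.
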
Replica exchange: A swap is always accepted when the hotter chain's current state has higher target density than the colder chain's state; otherwise it is accepted with a probability that decays exponentially in the product of the gap \(\beta_k - \beta_{k+1}\) and the log-density difference. Adjacent distributions must therefore overlap sufficiently for swaps to succeed regularly, which is why temperature ladder spacing matters. Empirically, swap acceptance rates of 20–40% per adjacent pair give a good balance between exploration and information flow.
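
As a small worked example of the simplified swap formula, the sketch below computes \(\alpha_{\text{swap}}\) for a few hand-picked log-density values. The function name and the numbers are illustrative only.

```python
import math

def swap_probability(beta_cold, beta_hot, logp_cold, logp_hot):
    """Acceptance probability for swapping the states of two adjacent chains.

    logp_cold and logp_hot are log pi(theta) at the current states of the
    colder and hotter chain, evaluated on the untempered target.
    """
    log_alpha = (beta_cold - beta_hot) * (logp_hot - logp_cold)
    return 1.0 if log_alpha >= 0 else math.exp(log_alpha)

# Hot chain holds the higher-density state: the swap is always accepted.
p_up = swap_probability(1.0, 0.5, logp_cold=-3.0, logp_hot=-1.0)
# Hot chain is 4 log-units lower with a gap of 0.5: exp(0.5 * -4) = exp(-2).
p_wide = swap_probability(1.0, 0.5, logp_cold=-1.0, logp_hot=-5.0)
# Same states, gap of 0.1: exp(0.1 * -4) = exp(-0.4), a much likelier swap.
p_narrow = swap_probability(1.0, 0.9, logp_cold=-1.0, logp_hot=-5.0)
print(p_up, round(p_wide, 3), round(p_narrow, 3))
```

The last two calls show why ladder spacing matters: for the same pair of states, shrinking the gap between inverse temperatures raises the swap probability.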

Target Distributions

Four distributions illustrate parallel tempering's strengths and allow comparison with standard MCMC:

Bivariate Gaussian (Baseline)
$$\pi(x_1, x_2) \propto \exp\left(-\frac{1}{2(1-\rho^2)}(x_1^2 - 2\rho x_1 x_2 + x_2^2)\right)$$

Single mode (\(\rho = 0.8\)): PT is not needed here, but it is useful for understanding how temperature affects a well-behaved target.

Bimodal Gaussian Mixture (Best case for PT)
$$\pi(\theta) = 0.4 \cdot \mathcal{N}((-2,-2), 0.8^2I) + 0.6 \cdot \mathcal{N}((+2,+2), 0.8^2I)$$

Hot chains cross the barrier between modes and pass that information to the cold chain via swaps, where standard MH gets stuck.

Rosenbrock's Banana
$$\pi(x_1, x_2) \propto \exp\left(-\frac{1}{200}(x_1^2 + 100(x_2 - x_1^2)^2)\right)$$

Curved manifold, single mode: PT helps explore the geometry but HMC is better suited for this kind of challenge.

Neal's Funnel
$$\begin{aligned} x_1 &\sim \mathcal{N}(0, 3^2) \\ x_2 \mid x_1 &\sim \mathcal{N}(0, \exp(x_1)^2) \end{aligned}$$

Hot chains explore the narrow neck more easily, though the deeper fix is reparameterization or HMC.
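
For readers who want to experiment outside the visualization, the four targets can be written as unnormalized log densities. This is a sketch that follows the formulas exactly as written above; the function names are made up for illustration. One convention to be aware of: the funnel formula above sets the conditional standard deviation to \(\exp(x_1)\), whereas Neal's original funnel sets the conditional variance to \(\exp(x_1)\).

```python
import math

def log_gaussian(x1, x2, rho=0.8):
    """Correlated bivariate Gaussian with unit variances."""
    return -(x1 ** 2 - 2 * rho * x1 * x2 + x2 ** 2) / (2 * (1 - rho ** 2))

def log_bimodal(x1, x2, sigma=0.8):
    """Mixture 0.4*N((-2,-2), sigma^2 I) + 0.6*N((2,2), sigma^2 I)."""
    a = math.log(0.4) - ((x1 + 2) ** 2 + (x2 + 2) ** 2) / (2 * sigma ** 2)
    b = math.log(0.6) - ((x1 - 2) ** 2 + (x2 - 2) ** 2) / (2 * sigma ** 2)
    hi = max(a, b)  # log-sum-exp keeps the far tails finite
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))

def log_rosenbrock(x1, x2):
    """Banana-shaped density from the Rosenbrock formula above."""
    return -(x1 ** 2 + 100 * (x2 - x1 ** 2) ** 2) / 200

def log_funnel(x1, x2):
    """x1 ~ N(0, 3^2), x2 | x1 ~ N(0, exp(x1)^2), as written above."""
    # The -x1 term is the log of the conditional normalizer 1/exp(x1).
    # It depends on x1, so it cannot be dropped as a constant.
    return -x1 ** 2 / 18 - x1 - x2 ** 2 / (2 * math.exp(2 * x1))

for name, f in [("gaussian", log_gaussian), ("bimodal", log_bimodal),
                ("rosenbrock", log_rosenbrock), ("funnel", log_funnel)]:
    print(f"{name:<10} log density at (1, 1): {f(1.0, 1.0):.4f}")
```

Any of these can be tempered by multiplying the returned value by \(\beta\), which is all the acceptance rules in the algorithm require.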

Simulation Controls

  • Proposal step size: sets the width of the proposal distribution. Hotter chains automatically use larger steps. Aim for 20–40% acceptance in the cold chain.
  • Number of temperatures: more levels give better exploration but a higher computational cost.
  • Ladder spacing: Linear / Geometric (0.1^k) / Exponential / Adaptive.
  • Swap interval: number of MCMC steps between swap attempts.
  • Animation speed: delay between iterations. Slower speeds make swaps easier to follow.
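
Two of the spacing options can be sketched as below. The exact formulas used by the visualization are not given on this page, so the `beta_min` endpoint and both constructions are assumptions chosen for illustration.

```python
def linear_ladder(K, beta_min=0.05):
    """K inverse temperatures equally spaced from 1 down to beta_min."""
    if K < 1 or not 0.0 < beta_min < 1.0:
        raise ValueError("need K >= 1 and 0 < beta_min < 1")
    if K == 1:
        return [1.0]
    return [1.0 - k * (1.0 - beta_min) / (K - 1) for k in range(K)]

def geometric_ladder(K, beta_min=0.05):
    """K inverse temperatures with a constant ratio between neighbours."""
    if K < 1 or not 0.0 < beta_min < 1.0:
        raise ValueError("need K >= 1 and 0 < beta_min < 1")
    if K == 1:
        return [1.0]
    ratio = beta_min ** (1.0 / (K - 1))
    return [ratio ** k for k in range(K)]

print("linear:   ", [round(b, 3) for b in linear_ladder(5)])
print("geometric:", [round(b, 3) for b in geometric_ladder(5)])
```

A linear ladder leaves a large relative jump between the two hottest levels, while a geometric ladder concentrates levels near small \(\beta\). Geometric spacing is often a reasonable starting point, with swap acceptance rates used afterwards to judge whether levels need to be added or moved.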

Temperature Chain Visualizations

Each subplot shows one temperature level in the ladder. Chain 1 (top-left, β=1.0) samples the true posterior and is the only chain used for inference. Hotter chains (lower β) sample progressively flattened distributions and explore more freely.

Swaps exchange the states between chains while each chain retains its temperature. When a point jumps, a parameter value has moved from one temperature level to another. Mode information discovered by hot chains propagates down the ladder to the cold chain through this mechanism.

Parallel Tempering Diagnostics

The key diagnostics for PT are: swap acceptance rates (are chains communicating well?), chain exploration (is each temperature exploring its target properly?), and cold chain coverage (does the final posterior sample all modes correctly?).

Swap Acceptance Rate Between Adjacent Chains

Target: 20-40%. Green = optimal, Orange = acceptable, Red = problematic.
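
A minimal sketch of how per-pair swap rates could be computed and compared with the 20–40% target, given counters of attempted and accepted swaps. The function name, the status strings, and the example counts are hypothetical; the color thresholds used by the chart itself are not specified here.

```python
def classify_swap_rates(accepts, attempts, low=0.20, high=0.40):
    """Return (chain_a, chain_b, rate, status) for each adjacent pair."""
    report = []
    for k, (a, n) in enumerate(zip(accepts, attempts)):
        if n == 0:
            rate, status = float("nan"), "no attempts yet"
        else:
            rate = a / n
            if rate < low:
                status = "below target: levels may be too far apart"
            elif rate > high:
                status = "above target: spacing could be wider"
            else:
                status = "within target"
        report.append((k + 1, k + 2, rate, status))
    return report

# Made-up counters for a four-chain ladder (three adjacent pairs).
for a, b, rate, status in classify_swap_rates([120, 310, 30], [1000, 1000, 1000]):
    print(f"chains {a}-{b}: {rate:.0%}  {status}")
```

A pair that falls below the target is the usual place to insert an extra temperature level.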

Cold Chain Trace Plot (θ₁)

Chain 1 (β=1.0) position over time. Good mixing shows rapid fluctuation around each mode with frequent switches between modes; long flat stretches or slow drifts indicate poor mixing.

Mixing Diagnostics

Unlike single-chain MCMC, mixing quality in parallel tempering depends on both within-chain exploration (each chain covering its tempered target) and between-chain communication (swaps transferring information up and down the ladder). The initial burn-in transient ends once each chain has settled into its stationary distribution.

Cold Chain (β=1.0) - Both Coordinates Over Time

θ₁ (blue) and θ₂ (orange) vs iteration for the cold chain. Look for burn-in during the initial transient as the chain moves from its random start to the posterior region, then stable wandering once stationarity is reached. For the bimodal distribution, jumps between the (-2,-2) and (+2,+2) regions indicate successful mode switching.

Discard the first 50–200 iterations as burn-in before analyzing the posterior. Trace plots should show rapid fluctuations without trends or extended flat segments. For multimodal targets, verify that the cold chain visits all modes: swap acceptance rates of 20–40% per adjacent pair indicate that information is flowing through the ladder. If a mode is absent from the cold chain, adding more temperature levels or tightening the ladder spacing usually helps.

Cold Chain Final Posterior

Samples from Chain 1 (β=1.0) represent the true posterior after discarding burn-in.

Cold Chain (β=1.0) - True Posterior Samples

Only samples from the cold chain (Chain 1, β=1.0) represent the true posterior. For the bimodal distribution, both modes should be explored proportionally to their weights (40% and 60%). If only one mode appears, try adding more chains or adjusting the temperature spacing.
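
One simple way to check the mode proportions is to discard burn-in and count which mode each cold-chain sample is closer to. The sketch below assumes the samples are available as a list of `(theta1, theta2)` pairs; the function name and the default burn-in length are illustrative.

```python
def mode_fractions(samples, burn_in=200):
    """Fraction of post-burn-in samples nearer (-2,-2) versus (+2,+2)."""
    kept = samples[burn_in:]
    if not kept:
        raise ValueError("no samples left after discarding burn-in")
    # The line x1 + x2 = 0 is equidistant from the two mode centres.
    upper = sum(1 for x1, x2 in kept if x1 + x2 > 0)
    return {"lower": (len(kept) - upper) / len(kept), "upper": upper / len(kept)}

# Synthetic stand-in for cold-chain output: 4 draws at one mode, 6 at the other.
demo = [(-2.0, -2.0)] * 4 + [(2.0, 2.0)] * 6
print(mode_fractions(demo, burn_in=0))
```

With real output, fractions far from 0.4 and 0.6 after a long run suggest that swaps are not carrying mode information down to the cold chain.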

Performance Metrics

Counters displayed: Total Iterations, Cold Chain Samples, Swap Attempts, Successful Swaps.
Total Iterations counts MCMC steps multiplied by the number of chains, since each chain advances every iteration. Cold Chain Samples is the number of posterior samples drawn from Chain 1 alone. The swap success rate should fall between 20% and 40% per adjacent pair. Note that PT requires K times more likelihood evaluations than standard single-chain MCMC, where K is the number of temperature levels.

Strengths & Limitations

Parallel tempering is well suited to multimodal posteriors with well-separated modes. Hot chains discover new modes and swaps propagate that information to the cold chain. Because both the within-chain moves and the swaps satisfy detailed balance, the cold chain still produces correct posterior samples. The chains evolve independently between swap steps, making the algorithm straightforward to parallelize. It requires only density evaluations, with no gradients needed. The main cost is computational: running K chains means K times more likelihood evaluations, and temperature ladder design requires care. For a fixed ladder, swap acceptance rates fall as dimension grows, so more temperature levels are needed and performance degrades in high-dimensional settings. For unimodal posteriors, single-chain methods such as HMC are preferable.

Parallel tempering is most appropriate for low-to-moderate dimensional problems (roughly \(d \lesssim 20\)) with multimodal posteriors where gradients are unavailable, such as physical phase transitions, mixture models, or rugged energy landscapes. When gradients are available and the posterior is unimodal, HMC or NUTS will typically be more efficient.

Comparison with Other Methods

| Property | Standard MH | HMC | Parallel Tempering | Nested Sampling |
|---|---|---|---|---|
| Multimodality | Poor (stuck) | Poor (stuck) | Excellent | Excellent |
| High dimensions | Poor | Excellent | Poor (swap rate decreases) | Poor (curse of d) |
| Gradient-free? | Yes | No | Yes | Yes |
| Computes evidence? | No | No | No | Yes |
| Computational cost | Low (1×) | Medium (gradient) | High (\(K\)×) | Medium-High |

Use PT for low-to-moderate dimensional multimodal problems without gradients. Use HMC/NUTS for high-dimensional unimodal posteriors with gradients. Use nested sampling when you need evidence (model comparison) or for multimodal problems in very low dimensions.

References & Further Reading

Key Papers:

  • Geyer, C.J. (1991). "Markov Chain Monte Carlo Maximum Likelihood." In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, 156-163.
  • Hukushima, K., & Nemoto, K. (1996). "Exchange Monte Carlo Method and Application to Spin Glass Simulations." Journal of the Physical Society of Japan, 65(6), 1604-1608.
  • Earl, D.J., & Deem, M.W. (2005). "Parallel Tempering: Theory, Applications, and New Perspectives." Physical Chemistry Chemical Physics, 7(23), 3910-3916.
  • Swendsen, R.H., & Wang, J.S. (1986). "Replica Monte Carlo Simulation of Spin-Glasses." Physical Review Letters, 57(21), 2607-2609. (Original replica exchange idea)
  • Katzgraber, H.G., Trebst, S., Huse, D.A., & Troyer, M. (2006). "Feedback-Optimized Parallel Tempering Monte Carlo." Journal of Statistical Mechanics: Theory and Experiment, 2006(03), P03018.

Temperature Ladder Design:

  • Kone, A., & Kofke, D.A. (2005). "Selection of Temperature Intervals for Parallel-Tempering Simulations." Journal of Chemical Physics, 122(20), 206101.
  • Rathore, N., Chopra, M., & de Pablo, J.J. (2005). "Optimal Allocation of Replicas in Parallel Tempering Simulations." Journal of Chemical Physics, 122(2), 024111.

Books:

  • Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press. (Chapter 11: Importance Sampling, Simulated Tempering, and Umbrella Sampling)
  • Liu, J.S. (2001). Monte Carlo Strategies in Scientific Computing. Springer. (Chapter 10: Multilevel Sampling and Optimization Methods)

Software:

  • emcee: Python ensemble sampler, "The MCMC Hammer" (versions before 3.0 included parallel tempering as PTSampler; that sampler now lives in ptemcee)
  • PyMC: Can implement PT with custom samplers
  • ptemcee: Parallel-tempered ensemble MCMC in Python
  • MrBayes: Phylogenetics software with Metropolis-Coupled MCMC (MC³)