Parallel Tempering (Temperature Ladder MCMC)

Part 4 of the MCMC Samplers Series • Interactive visualization of temperature ladder sampling for multimodal distributions

What is Parallel Tempering?

Parallel Tempering (also called Metropolis-Coupled MCMC or MC³) is an MCMC method designed for multimodal distributions, where a single chain tends to get trapped in one mode and rarely crosses to the others. The key insight: run multiple MCMC chains simultaneously at different "temperatures," then periodically propose swaps of states between chains.

The Core Idea:

Instead of sampling from the target posterior \(\pi(\theta)\), run \(K\) parallel chains sampling from "tempered" distributions:

$$\pi_\beta(\theta) \propto \pi(\theta)^\beta = [\mathcal{L}(\theta) \pi_0(\theta)]^\beta$$

where \(0 < \beta_K < \beta_{K-1} < \ldots < \beta_2 < \beta_1 = 1\) is the temperature ladder:

  • \(\beta_1 = 1\): The "cold" chain samples the true posterior \(\pi(\theta)\)
  • \(\beta_k < 1\): "Hot" chains sample flattened versions (easier to move between modes)
  • \(\beta_k \to 0\): The hottest chain samples almost uniformly (pure exploration)
Why does this work? Hot chains can easily hop between modes because the energy barriers are reduced (think of heating a molecule to make it more mobile). When a hot chain discovers a new mode, it can swap states with cooler chains, eventually transferring this information to the cold chain that targets the true posterior. This creates an efficient exploration-exploitation trade-off.
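
As a minimal sketch of the ladder and the tempered densities, the snippet below builds a geometrically spaced set of \(\beta\) values and evaluates \(\log \pi_\beta(\theta) = \beta \log \pi(\theta)\). Geometric spacing, the function names, and the choice of 4 chains with \(\beta_K = 0.1\) are illustrative assumptions, not settings taken from the visualization.

```python
import numpy as np

def geometric_ladder(n_chains, beta_min):
    """Inverse temperatures from 1.0 (cold) down to beta_min (hottest),
    spaced so that the ratio between neighbours is constant."""
    return np.geomspace(1.0, beta_min, n_chains)

def tempered_log_density(log_target, theta, beta):
    """Unnormalised log of pi(theta)**beta."""
    return beta * log_target(theta)

def log_std_normal(x):
    """Toy target: standard normal, up to an additive constant."""
    return -0.5 * x ** 2

betas = geometric_ladder(n_chains=4, beta_min=0.1)

# The log-density drop between x = 0 and x = 2 is 2.0 at beta = 1.
# At beta = 0.5 the same drop shrinks to 1.0, so the barrier is lower.
drop_cold = tempered_log_density(log_std_normal, 0.0, 1.0) - tempered_log_density(log_std_normal, 2.0, 1.0)
drop_hot = tempered_log_density(log_std_normal, 0.0, 0.5) - tempered_log_density(log_std_normal, 2.0, 0.5)
```

Multiplying the log density by \(\beta < 1\) shrinks every log-density difference by the same factor, which is why barriers between modes become easier to cross in the hot chains.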

The Algorithm

Parallel tempering alternates between two steps: local MCMC moves within each chain and replica exchange swaps between adjacent temperatures.

Algorithm: Parallel Tempering
  1. Initialize: Set up \(K\) chains, each at temperature \(\beta_k\), with initial states \(\theta^{(k)}\)
  2. MCMC step: For each chain \(k = 1, \ldots, K\):
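    • Propose \(\theta' \sim q(\cdot \mid \theta^{(k)})\) from a symmetric proposal (random-walk Metropolis)
    • Accept with probability \(\alpha = \min(1, r^{\beta_k})\) where \(r = \frac{\pi(\theta')}{\pi(\theta^{(k)})}\); an asymmetric proposal would also need the Hastings correction \(q(\theta^{(k)} \mid \theta') / q(\theta' \mid \theta^{(k)})\)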
  3. Exchange step: For each adjacent pair \((k, k+1)\):
    • Propose to swap states: \(\theta^{(k)} \leftrightarrow \theta^{(k+1)}\)
    • Accept swap with probability: $$\alpha_{\text{swap}} = \min\left(1, \frac{\pi(\theta^{(k)})^{\beta_{k+1}} \pi(\theta^{(k+1)})^{\beta_k}}{\pi(\theta^{(k)})^{\beta_k} \pi(\theta^{(k+1)})^{\beta_{k+1}}}\right)$$ which simplifies to: $$\alpha_{\text{swap}} = \min\left(1, \exp\left[(\beta_k - \beta_{k+1})(\log \pi(\theta^{(k+1)}) - \log \pi(\theta^{(k)}))\right]\right)$$
  4. Repeat: Iterate steps 2-3. Collect samples from the cold chain (\(\beta_1 = 1\)) only
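Key insight on the swap acceptance: With \(\beta_k > \beta_{k+1}\), the swap is always accepted when the hotter chain's state has the higher posterior density; otherwise the acceptance probability decays with the product of the temperature gap and the log-posterior gap. Swaps are therefore more likely when the two states have similar posterior densities or when the temperatures are close together. This motivates careful design of the temperature ladder to ensure good "communication" between chains. A common target is 20-40% swap acceptance between adjacent temperatures.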
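
The swap rule can be sketched in a few lines by taking the log of the ratio inside \(\alpha_{\text{swap}}\). The function names are illustrative, and the inputs are the untempered log densities \(\log \pi(\theta^{(k)})\) and \(\log \pi(\theta^{(k+1)})\).

```python
import math

def log_swap_ratio(beta_k, beta_k1, logp_k, logp_k1):
    """Log of the swap ratio for chain k (colder, beta_k) and chain k+1
    (hotter, beta_k1). logp_k and logp_k1 are untempered log densities
    of the states currently held by chains k and k+1."""
    return (beta_k - beta_k1) * (logp_k1 - logp_k)

def swap_probability(beta_k, beta_k1, logp_k, logp_k1):
    """min(1, ratio), computed without overflowing exp()."""
    return math.exp(min(0.0, log_swap_ratio(beta_k, beta_k1, logp_k, logp_k1)))

# Hot chain holds the better state: swap is always accepted.
p_up = swap_probability(1.0, 0.5, logp_k=-8.0, logp_k1=-3.0)

# Hot chain holds a state 5 nats worse: acceptance depends on the gap in beta.
p_wide = swap_probability(1.0, 0.5, logp_k=-3.0, logp_k1=-8.0)    # exp(-2.5)
p_narrow = swap_probability(1.0, 0.9, logp_k=-3.0, logp_k1=-8.0)  # exp(-0.5)
```

For the same pair of states, the narrower temperature gap gives the higher acceptance probability, which is the reason ladders with more closely spaced temperatures communicate better.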

Target Distributions

We provide four distributions to illustrate parallel tempering's strengths and compare with standard MCMC:

Bivariate Gaussian (Baseline)

A standard correlated 2D Gaussian with \(\rho = 0.8\). Single mode—parallel tempering is overkill here, but useful to understand temperature effects.

🌟 Bimodal Gaussian Mixture (Perfect for PT!)

Two well-separated modes at (-2,-2) and (+2,+2):

$$\pi(\theta) = 0.4 \cdot \mathcal{N}((-2,-2), 0.8^2I) + 0.6 \cdot \mathcal{N}((+2,+2), 0.8^2I)$$

This is where PT shines! Standard MH gets stuck in one mode. Hot chains easily hop between modes, and swaps propagate this information to the cold chain. Watch how different temperature chains explore differently!
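
A compact sketch of parallel tempering on this bimodal mixture is given below. It is not the visualization's own code: the step size, the \(1/\sqrt{\beta}\) proposal scaling, the geometric ladder down to \(\beta = 0.05\), the swap interval, and the choice to start every chain near the left mode are all assumptions made for illustration.

```python
import numpy as np

def log_mixture(theta):
    """Log density (up to a constant) of
    0.4 N((-2,-2), 0.8^2 I) + 0.6 N((+2,+2), 0.8^2 I)."""
    theta = np.asarray(theta, dtype=float)
    var = 0.8 ** 2
    d_left = np.sum((theta + 2.0) ** 2)
    d_right = np.sum((theta - 2.0) ** 2)
    return np.logaddexp(np.log(0.4) - 0.5 * d_left / var,
                        np.log(0.6) - 0.5 * d_right / var)

def parallel_tempering(log_target, betas, n_iter, step=0.5, swap_every=2, seed=0):
    rng = np.random.default_rng(seed)
    n_chains = len(betas)
    # Assumption: every chain starts near the left mode at (-2, -2).
    states = rng.normal(size=(n_chains, 2)) - 2.0
    logps = np.array([log_target(s) for s in states])
    cold = np.empty((n_iter, 2))
    tried = np.zeros(n_chains - 1, dtype=int)
    accepted = np.zeros(n_chains - 1, dtype=int)
    for it in range(n_iter):
        # Local move: symmetric random-walk Metropolis in each chain.
        for k in range(n_chains):
            prop = states[k] + step / np.sqrt(betas[k]) * rng.normal(size=2)
            lp = log_target(prop)
            if np.log(rng.uniform()) < betas[k] * (lp - logps[k]):
                states[k], logps[k] = prop, lp
        # Exchange move: try every adjacent pair (k, k+1).
        if (it + 1) % swap_every == 0:
            for k in range(n_chains - 1):
                tried[k] += 1
                log_r = (betas[k] - betas[k + 1]) * (logps[k + 1] - logps[k])
                if np.log(rng.uniform()) < log_r:
                    states[[k, k + 1]] = states[[k + 1, k]]
                    logps[[k, k + 1]] = logps[[k + 1, k]]
                    accepted[k] += 1
        cold[it] = states[0]  # only the beta = 1 chain is recorded
    return cold, accepted / np.maximum(tried, 1)

betas = np.geomspace(1.0, 0.05, 5)
cold, rates = parallel_tempering(log_mixture, betas, n_iter=8000)

# Fraction of post-burn-in cold-chain samples in the right-hand mode.
frac_right = float(np.mean(cold[1000:].sum(axis=1) > 0))
```

Setting `betas = np.array([1.0])` turns the same function into plain random-walk Metropolis, which makes it easy to compare how often the cold chain crosses between modes with and without the hotter chains.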

Rosenbrock's Banana

Strongly correlated along a curved manifold. Single mode but challenging geometry. PT helps but HMC is better here.

Neal's Funnel

Hierarchical model with exponentially varying scales. Hot chains can explore the narrow neck more easily. PT helps, but the real solution is reparameterization or HMC.
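
To illustrate the reparameterization remark, here is a sketch using a two-dimensional funnel in the form commonly attributed to Neal: \(v \sim \mathcal{N}(0, 3^2)\) and \(x \mid v \sim \mathcal{N}(0, e^{v})\). The page does not state which funnel parameters the visualization uses, so these values are an assumption.

```python
import numpy as np

def funnel_log_density(v, x):
    """Log density (up to a constant) of the assumed funnel:
    v ~ N(0, 3^2), x | v ~ N(0, exp(v))."""
    return -0.5 * v ** 2 / 9.0 - 0.5 * v - 0.5 * x ** 2 * np.exp(-v)

def to_funnel(v, z):
    """Non-centred map: a standard-normal z becomes x = exp(v / 2) * z."""
    return np.exp(0.5 * v) * z

rng = np.random.default_rng(0)
v = 3.0 * rng.normal(size=1000)
z = rng.normal(size=1000)
x = to_funnel(v, z)

# In (v, z) coordinates the density is the funnel density times the
# Jacobian |dx/dz| = exp(v / 2), which leaves two independent Gaussians.
log_density_vz = funnel_log_density(v, x) + 0.5 * v
log_independent = -0.5 * v ** 2 / 9.0 - 0.5 * z ** 2
```

Sampling in \((v, z)\) and mapping back to \((v, x)\) removes the scale dependence that makes the neck hard to explore, which is why reparameterization addresses the geometry more directly than adding temperatures.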

Simulation Controls

  • Number of temperatures: more temperatures give better exploration but higher computational cost
  • Temperature spacing: controls how quickly temperatures decrease along the ladder
  • Swap interval: number of MCMC steps between swap attempts
  • Speed: delay between iterations; slower speeds help visualize swaps

Temperature Chain Visualizations

Each subplot shows one temperature level in the ladder. Chain 1 (top-left, β=1.0) always samples the true posterior—this is your target chain. Hotter chains (lower β) sample flattened distributions and explore more freely.

How swaps work: States (the x,y positions) move between chains, but each chain position keeps its temperature. When you see a point jump, that's a state moving from one temperature to another. The cold chain (position 1) receives information about newly discovered modes from the hot chains through these swaps.

Parallel Tempering Diagnostics

The key diagnostics for PT are: swap acceptance rates (are chains communicating well?), chain exploration (is each temperature exploring its target properly?), and cold chain coverage (does the final posterior sample all modes correctly?).

Swap Acceptance Rate Between Adjacent Chains

Target: 20-40%. Green = optimal, Orange = acceptable, Red = problematic.
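
A sketch of how per-pair swap rates could be computed from counters and compared with the 20-40% band. The counts are made-up illustration data. Only the 20-40% band comes from the text; the orange and red thresholds are not specified here, so they are not modeled.

```python
import numpy as np

def swap_rates(accepted, attempted):
    """Acceptance rate for each adjacent pair; NaN where nothing was attempted."""
    accepted = np.asarray(accepted, dtype=float)
    attempted = np.asarray(attempted, dtype=float)
    out = np.full_like(accepted, np.nan)
    return np.divide(accepted, attempted, out=out, where=attempted > 0)

def in_target_band(rates, low=0.20, high=0.40):
    """True where a pair's swap rate falls inside the target band."""
    rates = np.asarray(rates, dtype=float)
    return (rates >= low) & (rates <= high)

# Hypothetical counters for pairs (1,2), (2,3), (3,4).
rates_example = swap_rates(accepted=[30, 55, 5], attempted=[100, 100, 100])
in_band = in_target_band(rates_example)
```

A rate well below the band suggests the two temperatures are too far apart, while a rate well above it suggests the pair is closer than necessary and a chain could be spent elsewhere on the ladder.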

Cold Chain Trace Plot (θ₁)

Chain 1 (β=1.0) position over time. Good mixing shows rapid, trend-free fluctuation within each mode together with frequent switches between modes.

Convergence Diagnostics

Unlike single-chain MCMC, parallel tempering convergence depends on both within-chain mixing (each chain exploring its target) and between-chain communication (swaps transferring information). Watch for the burn-in phase where chains move from initial positions to the target distribution.

Cold Chain (β=1.0) - Both Coordinates Over Time

θ₁ (blue) and θ₂ (orange) vs iteration for the cold chain. Look for: (1) Burn-in - initial transient behavior as chain moves from random start to posterior, (2) Stationarity - stable wandering within posterior region, (3) Mode switching - for bimodal, jumps between (-2,-2) and (+2,+2) regions.

Assessing Convergence:
  • Burn-in phase: First ~50-200 iterations as chains move from random initialization. Discard these when analyzing posterior.
  • Within-chain mixing: Trace plots should show random fluctuations without trends or getting stuck.
  • Between-chain communication: Swap acceptance rates of 20-40% ensure information flows between temperatures.
  • Mode exploration (multimodal): Cold chain should visit all modes. Count transitions between modes in trace plot.
  • Visual check: Does the cold chain posterior coverage match the true distribution? Run longer if modes are missing.
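
The burn-in and mode-exploration checks can be sketched as a transition counter for the bimodal target. Labelling a sample by the sign of \(\theta_1 + \theta_2\) is equivalent to assigning it to the nearer of the two modes at (-2,-2) and (+2,+2). The trace below is synthetic and the burn-in of 200 is just the upper end of the range quoted above.

```python
import numpy as np

def count_mode_switches(trace, burn_in=200):
    """Count transitions between the two modes in a cold-chain trace.
    trace: array of shape (n_samples, 2)."""
    kept = np.asarray(trace, dtype=float)[burn_in:]
    # Nearest mode is (+2,+2) when theta1 + theta2 > 0, else (-2,-2).
    labels = (kept.sum(axis=1) > 0).astype(int)
    return int(np.sum(labels[1:] != labels[:-1]))

# Synthetic trace: 300 samples at the left mode, 300 at the right, 300 at the left.
left = np.tile([-2.0, -2.0], (300, 1))
right = np.tile([2.0, 2.0], (300, 1))
trace = np.vstack([left, right, left])

n_switches = count_mode_switches(trace, burn_in=200)
```

A count of zero after a long run means the cold chain never left its starting mode, which is the situation the ladder is supposed to prevent.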

Cold Chain Final Posterior

This is the payoff: samples from Chain 1 (β=1.0) represent the true posterior after discarding burn-in.

Cold Chain (β=1.0) - True Posterior Samples

This is what matters! Only samples from the cold chain (Chain 1, β=1.0) represent the true posterior. For the bimodal distribution: both modes should be explored proportionally to their weights (40% and 60%). If only one mode appears, parallel tempering isn't helping (try more chains or better temperature spacing).
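
One way to check the mode proportions is to estimate the fraction of samples in each mode and compare it with the 40% and 60% weights. In this sketch, independent draws from the mixture stand in for cold-chain samples; real cold-chain samples are autocorrelated, so their estimate would be noisier for the same sample count.

```python
import numpy as np

def mode_fractions(samples):
    """Fractions of samples nearer to (-2,-2) and to (+2,+2)."""
    samples = np.asarray(samples, dtype=float)
    right = float(np.mean(samples.sum(axis=1) > 0))
    return 1.0 - right, right

# Stand-in data: independent draws from 0.4 N((-2,-2), 0.8^2 I) + 0.6 N((+2,+2), 0.8^2 I).
rng = np.random.default_rng(1)
n = 5000
is_right = rng.random(n) < 0.6
centres = np.where(is_right[:, None], 2.0, -2.0)
samples = centres + 0.8 * rng.normal(size=(n, 2))

frac_left, frac_right = mode_fractions(samples)
```

Fractions far from 0.4 and 0.6 after burn-in are a sign that the run is too short or that the ladder is not passing states down to the cold chain.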

Performance Metrics

Understanding the metrics:
  • Total Iterations: MCMC steps × number of chains (each chain moves every iteration)
  • Cold Chain Samples: Number of posterior samples (one per iteration from Chain 1)
  • Swap Success Rate: Proportion of accepted swaps (should be 20-40% per adjacent pair)
  • Computational Cost: PT requires \(K\) times as many likelihood evaluations as a single standard MCMC chain of the same length
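
As a worked example of these counters, the sketch below tallies the cost of a run. It assumes that every adjacent pair is attempted at each swap round, as in step 3 of the algorithm; the function name and the run settings are illustrative.

```python
def pt_bookkeeping(n_steps, n_chains, swap_every):
    """Counters for a PT run in which all adjacent pairs are tried each swap round."""
    swap_rounds = n_steps // swap_every
    return {
        "total_iterations": n_steps * n_chains,   # one density evaluation per chain per step
        "cold_chain_samples": n_steps,            # one sample per step from chain 1
        "swap_attempts": swap_rounds * (n_chains - 1),
    }

# 1000 steps, 4 chains, swaps attempted every 10 steps.
counts = pt_bookkeeping(n_steps=1000, n_chains=4, swap_every=10)
```

Here the run costs 4000 density evaluations to produce 1000 cold-chain samples, with 300 swap attempts spread over the three adjacent pairs.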

Strengths & Limitations

✓ Strengths
  • Multimodality: Excellent for exploring multiple well-separated modes
  • Mode discovery: Hot chains find modes; swaps propagate to cold chain
  • Correct weights: Unlike simulated annealing, maintains detailed balance
  • Parallelizable: Each chain runs independently between swaps, so the \(K\) local updates can run concurrently; only the swap step needs synchronization
  • Gradient-free: Works when gradients unavailable (unlike HMC)
✗ Limitations
  • Computational cost: \(K\) chains means \(K\times\) function evaluations
  • Temperature ladder design: Requires careful tuning for efficiency
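  • High dimensions: For a fixed temperature ladder, swap acceptance decreases exponentially with \(d\), so more chains are needed as dimension grows (roughly \(\propto \sqrt{d}\) for Gaussian-like targets)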
  • Not for all problems: Overkill for unimodal posteriors (use HMC instead)
  • Memory overhead: Must store \(K\) states simultaneously
When to use Parallel Tempering:
  • Multimodal posteriors with well-separated modes (physical phase transitions, mixture models)
  • Rugged energy landscapes (statistical mechanics, protein folding)
  • When you suspect multiple modes but don't know how many
  • When gradients are unavailable (unlike HMC) but you have computational budget
  • Dimensions \(d \lesssim 20\) (swap acceptance degrades in high-d)
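
The effect of dimension on swap acceptance can be illustrated with a small Monte Carlo estimate. The sketch assumes a standard Gaussian target, for which \(\pi^\beta\) is \(\mathcal{N}(0, I/\beta)\) and can be sampled exactly, and holds the pair of temperatures fixed at \(\beta = 1\) and \(\beta = 0.5\). These choices are illustrative and do not come from the visualization.

```python
import numpy as np

def mean_swap_acceptance(d, beta_cold=1.0, beta_hot=0.5, n=20000, seed=0):
    """Average swap acceptance between two chains at stationarity for a
    standard Gaussian target in d dimensions (log pi(x) = -0.5 * |x|^2)."""
    rng = np.random.default_rng(seed)
    # Exact draws from the two tempered distributions N(0, I / beta).
    x_cold = rng.normal(size=(n, d)) / np.sqrt(beta_cold)
    x_hot = rng.normal(size=(n, d)) / np.sqrt(beta_hot)
    logp_cold = -0.5 * np.sum(x_cold ** 2, axis=1)
    logp_hot = -0.5 * np.sum(x_hot ** 2, axis=1)
    log_r = (beta_cold - beta_hot) * (logp_hot - logp_cold)
    return float(np.mean(np.exp(np.minimum(0.0, log_r))))

rates_by_dim = {d: mean_swap_acceptance(d) for d in (2, 10, 50)}
```

With the temperature gap held fixed, the estimated acceptance drops sharply from \(d = 2\) to \(d = 50\), so keeping swap rates in a useful range in higher dimensions requires more closely spaced temperatures and therefore more chains.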

Comparison with Other Methods

| Property | Standard MH | HMC | Parallel Tempering | Nested Sampling |
| --- | --- | --- | --- | --- |
| Multimodality | Poor (stuck) | Poor (stuck) | Excellent | Excellent |
| High dimensions | Poor | Excellent | Poor (swap rate ↓) | Poor (curse of d) |
| Gradient-free? | Yes | No | Yes | Yes |
| Computes evidence? | No | No | No | Yes |
| Computational cost | Low (1×) | Medium (gradient) | High (\(K\)×) | Medium-High |

Practical recommendation: Use PT for low-to-moderate dimensional multimodal problems without gradients. Use HMC/NUTS for high-dimensional unimodal posteriors with gradients. Use nested sampling when you need evidence (model comparison) or for multimodal problems in very low dimensions.

References & Further Reading

Key Papers:

  • Geyer, C.J. (1991). "Markov Chain Monte Carlo Maximum Likelihood." In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, 156-163.
  • Hukushima, K., & Nemoto, K. (1996). "Exchange Monte Carlo Method and Application to Spin Glass Simulations." Journal of the Physical Society of Japan, 65(6), 1604-1608.
  • Earl, D.J., & Deem, M.W. (2005). "Parallel Tempering: Theory, Applications, and New Perspectives." Physical Chemistry Chemical Physics, 7(23), 3910-3916.
  • Swendsen, R.H., & Wang, J.S. (1986). "Replica Monte Carlo Simulation of Spin-Glasses." Physical Review Letters, 57(21), 2607-2609. (Original replica exchange idea)
  • Katzgraber, H.G., Trebst, S., Huse, D.A., & Troyer, M. (2006). "Feedback-Optimized Parallel Tempering Monte Carlo." Journal of Statistical Mechanics: Theory and Experiment, 2006(03), P03018.

Temperature Ladder Design:

  • Kone, A., & Kofke, D.A. (2005). "Selection of Temperature Intervals for Parallel-Tempering Simulations." Journal of Chemical Physics, 122(20), 206101.
  • Rathore, N., Chopra, M., & de Pablo, J.J. (2005). "Optimal Allocation of Replicas in Parallel Tempering Simulations." Journal of Chemical Physics, 122(2), 024111.

Books:

  • Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press. (Chapter 7: Parallel and Interacting Chains)
  • Liu, J.S. (2001). Monte Carlo Strategies in Scientific Computing. Springer. (Chapter 6: Simulated Tempering and Related Methods)

Software:

  • emcee: Python, "The MCMC Hammer" (versions before 3.0 included parallel tempering as PTSampler; that sampler has since moved to ptemcee)
  • PyMC: Can implement PT with custom samplers
  • ptemcee: Parallel-tempered ensemble MCMC in Python, forked from emcee
  • MrBayes: Phylogenetics software with Metropolis-Coupled MCMC (MC³)