Part 4 of the MCMC Samplers Series • Interactive visualization of temperature ladder sampling for multimodal distributions
What is Parallel Tempering?
Parallel Tempering (also called Metropolis-Coupled MCMC or MC³)
is a powerful MCMC method designed to overcome the mode-hopping problem in multimodal distributions. The key insight:
run multiple MCMC chains simultaneously at different "temperatures," then periodically swap states between chains.
The Core Idea:
Instead of sampling from the target posterior \(\pi(\theta)\) directly, run \(K\) parallel chains sampling from
"tempered" distributions:
$$\pi_k(\theta) \propto \pi(\theta)^{\beta_k}, \qquad k = 1, \ldots, K,$$
where \(0 < \beta_K < \beta_{K-1} < \ldots < \beta_2 < \beta_1 = 1\) is the temperature ladder:
\(\beta_1 = 1\): The "cold" chain samples the true posterior \(\pi(\theta)\)
\(\beta_k < 1\): "Hot" chains sample flattened versions (easier to move between modes)
\(\beta_K \to 0\): The hottest chain samples almost uniformly (pure exploration)
Why does this work? Hot chains can easily hop between modes because the energy barriers are reduced
(think of heating a molecule to make it more mobile). When a hot chain discovers a new mode, it can swap states
with cooler chains, eventually transferring this information to the cold chain that targets the true posterior.
This creates an efficient exploration-exploitation trade-off.
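The flattening effect is easy to see numerically. A minimal sketch (the 1-D mixture below is an illustrative stand-in, not the app's target): raising a bimodal density to the power \(\beta\) shrinks the mode-to-valley density ratio, which is exactly the barrier a local sampler must cross.

```python
import numpy as np

def log_mixture(x):
    """Log density of an equal-weight mixture of N(-2, 0.5^2) and N(+2, 0.5^2)."""
    a = -0.5 * ((x + 2.0) / 0.5) ** 2
    b = -0.5 * ((x - 2.0) / 0.5) ** 2
    return np.logaddexp(a, b) - np.log(2.0) - 0.5 * np.log(2 * np.pi * 0.25)

# Ratio of density at a mode (x = +2) to the valley between modes (x = 0),
# for the tempered density pi(x)^beta: smaller beta -> smaller barrier.
for beta in (1.0, 0.5, 0.1):
    ratio = np.exp(beta * (log_mixture(2.0) - log_mixture(0.0)))
    print(f"beta={beta}: mode-to-valley density ratio ~ {ratio:.1f}")
```

At \(\beta = 1\) the ratio is in the thousands; at \(\beta = 0.1\) it is close to 2, so the hot chain crosses the valley routinely.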
The Algorithm
Parallel tempering alternates between two steps: local MCMC moves within each chain and
replica exchange swaps between adjacent temperatures.
Algorithm: Parallel Tempering
Initialize: Set up \(K\) chains, each at temperature \(\beta_k\), with initial states \(\theta^{(k)}\)
MCMC step: For each chain \(k = 1, \ldots, K\):
Propose \(\theta' \sim q(\cdot \mid \theta^{(k)})\) using Metropolis-Hastings
Accept with probability \(\alpha = \min(1, r^{\beta_k})\), where \(r = \frac{\pi(\theta')}{\pi(\theta^{(k)})}\) (assuming a symmetric proposal \(q\))
Exchange step: For each adjacent pair \((k, k+1)\):
Propose to swap states: \(\theta^{(k)} \leftrightarrow \theta^{(k+1)}\)
Accept swap with probability:
$$\alpha_{\text{swap}} = \min\left(1, \frac{\pi(\theta^{(k)})^{\beta_{k+1}} \pi(\theta^{(k+1)})^{\beta_k}}{\pi(\theta^{(k)})^{\beta_k} \pi(\theta^{(k+1)})^{\beta_{k+1}}}\right)$$
which simplifies to:
$$\alpha_{\text{swap}} = \min\left(1, \exp\left[(\beta_k - \beta_{k+1})\left(\log \pi(\theta^{(k+1)}) - \log \pi(\theta^{(k)})\right)\right]\right)$$
Repeat: Iterate steps 2-3. Collect samples from the cold chain (\(\beta_1 = 1\)) only
Key insight on the swap acceptance: The swap is more likely when states have similar posteriors,
or when temperatures are close together. This motivates careful design of the temperature ladder to ensure
good "communication" between chains. A common target is 20-40% swap acceptance between adjacent temperatures.
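The two alternating steps can be sketched end-to-end in a few dozen lines. This is a minimal illustration, not the code behind the visualization: the bimodal target, the ladder, and the step size are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(theta):
    """Unnormalized log density: equal-weight Gaussian modes at (-2,-2) and (+2,+2)."""
    a = -0.5 * np.sum((theta + 2.0) ** 2)
    b = -0.5 * np.sum((theta - 2.0) ** 2)
    return np.logaddexp(a, b)

def parallel_tempering(log_target, betas, n_iter=5000, step=0.5):
    """Minimal PT sketch: local Metropolis moves plus adjacent-pair swaps."""
    K = len(betas)
    states = rng.normal(size=(K, 2))              # one 2-D state per temperature
    logps = np.array([log_target(s) for s in states])
    cold = np.empty((n_iter, 2))
    swap_attempts = swap_accepts = 0
    for it in range(n_iter):
        # 1) Local move in every chain (symmetric Gaussian proposal,
        #    tempered acceptance ratio r^beta in log space)
        for k in range(K):
            prop = states[k] + step * rng.normal(size=2)
            lp = log_target(prop)
            if np.log(rng.random()) < betas[k] * (lp - logps[k]):
                states[k], logps[k] = prop, lp
        # 2) Exchange step between adjacent temperatures
        for k in range(K - 1):
            swap_attempts += 1
            log_alpha = (betas[k] - betas[k + 1]) * (logps[k + 1] - logps[k])
            if np.log(rng.random()) < log_alpha:
                states[[k, k + 1]] = states[[k + 1, k]]
                logps[[k, k + 1]] = logps[[k + 1, k]]
                swap_accepts += 1
        cold[it] = states[0]                      # keep only the beta = 1 chain
    return cold, swap_accepts / swap_attempts

betas = [1.0, 0.5, 0.25, 0.1]                     # geometric temperature ladder
samples, swap_rate = parallel_tempering(log_target, betas)
```

Note that only `states[0]` is recorded: the hot chains exist purely to ferry modes down the ladder, and their samples are discarded.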
Target Distributions
We provide four distributions to illustrate parallel tempering's strengths and compare with standard MCMC:
Bivariate Gaussian (Baseline)
A standard correlated 2D Gaussian with \(\rho = 0.8\). Single mode—parallel tempering is overkill here,
but useful to understand temperature effects.
Bimodal Gaussian Mixture
This is where PT shines! Standard MH gets stuck in one mode. Hot chains easily hop between modes,
and swaps propagate this information to the cold chain. Watch how chains at different temperatures explore differently!
Rosenbrock's Banana
Strongly correlated along a curved manifold. Single mode but challenging geometry. PT helps but HMC is better here.
Neal's Funnel
Hierarchical model with exponentially varying scales. Hot chains can explore the narrow neck more easily.
PT helps, but the real solution is reparameterization or HMC.
Simulation Controls
Number of chains: More temperatures = better exploration but higher computational cost
Temperature spacing: Controls how quickly temperatures decrease
Swap frequency: Number of MCMC steps between swap attempts
Animation speed: Delay between iterations; slower speeds help visualize swaps
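A geometric ladder is the common default for the temperature spacing; a sketch (the ratio here plays the role of the spacing control):

```python
import numpy as np

def geometric_ladder(n_chains, ratio=0.5):
    """Inverse temperatures beta_1 = 1 > beta_2 > ... with a constant ratio."""
    return ratio ** np.arange(n_chains)

print(geometric_ladder(4))  # -> [1.0, 0.5, 0.25, 0.125]
```

A constant ratio keeps adjacent tempered distributions roughly equally "far apart", which tends to equalize swap acceptance along the ladder; feedback-optimized ladders (Katzgraber et al., 2006) refine this further.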
Temperature Chain Visualizations
Each subplot shows one temperature level in the ladder. Chain 1 (top-left, β=1.0) always samples
the true posterior—this is your target chain. Hotter chains (lower β) sample flattened distributions
and explore more freely.
How swaps work: States (the x,y positions) move between chains, but each chain position keeps its
temperature. When you see a point jump, that's a state moving from one temperature to another. The cold chain
(position 1) receives information about newly discovered modes from the hot chains through these swaps.
Parallel Tempering Diagnostics
The key diagnostics for PT are: swap acceptance rates (are chains communicating well?),
chain exploration (is each temperature exploring its target properly?), and
cold chain coverage (does the final posterior sample all modes correctly?).
Swap Acceptance Rate Between Adjacent Chains
Target: 20-40%. Green = optimal, Orange = acceptable, Red = problematic.
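Tracking this diagnostic needs only two counters per adjacent pair. A sketch with hypothetical counts (`attempts` and `accepts` are assumed bookkeeping arrays, not part of the app):

```python
import numpy as np

# Hypothetical counters from a run with 4 chains (3 adjacent pairs)
attempts = np.array([500, 500, 500])
accepts = np.array([160, 140, 60])

rates = accepts / attempts
for k, r in enumerate(rates):
    status = "optimal" if 0.2 <= r <= 0.4 else "check ladder"
    print(f"pair ({k + 1},{k + 2}): swap rate {r:.2f} ({status})")
```

A pair well below the 20% band usually means those two temperatures are too far apart; the fix is to insert a chain between them or tighten the spacing.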
Cold Chain Trace Plot (θ₁)
Chain 1 (β=1.0) position over time. Good mixing shows random walk behavior with mode switches.
Convergence Diagnostics
Unlike single-chain MCMC, parallel tempering convergence depends on both within-chain mixing
(each chain exploring its target) and between-chain communication (swaps transferring information).
Watch for the burn-in phase where chains move from initial positions to the target distribution.
Cold Chain (β=1.0) - Both Coordinates Over Time
θ₁ (blue) and θ₂ (orange) vs iteration for the cold chain.
Look for: (1) Burn-in - initial transient behavior as chain moves from random start to posterior,
(2) Stationarity - stable wandering within posterior region,
(3) Mode switching - for bimodal, jumps between (-2,-2) and (+2,+2) regions.
Assessing Convergence:
Burn-in phase: First ~50-200 iterations as chains move from random initialization.
Discard these when analyzing posterior.
Within-chain mixing: Trace plots should show random fluctuations without trends or getting stuck.
Between-chain communication: Swap acceptance rates of 20-40% ensure information flows between temperatures.
Mode exploration (multimodal): Cold chain should visit all modes. Count transitions between modes in trace plot.
Visual check: Does the cold chain posterior coverage match the true distribution? Run longer if modes are missing.
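The mode-transition count mentioned above can be computed directly from the cold-chain trace; a sketch assuming a 1-D array of θ₁ values from Chain 1 and modes centred at -2 and +2:

```python
import numpy as np

def mode_transitions(trace, burn_in=200):
    """Count sign changes of theta_1 after discarding burn-in (modes at -2, +2)."""
    kept = np.asarray(trace)[burn_in:]
    labels = np.sign(kept)            # -1 for the left mode, +1 for the right
    labels = labels[labels != 0]      # ignore exact zeros
    return int(np.sum(labels[1:] != labels[:-1]))

# Synthetic trace that switches modes twice after burn-in
trace = np.concatenate([np.full(300, -2.0), np.full(300, 2.0), np.full(300, -2.0)])
print(mode_transitions(trace))  # -> 2
```

Zero transitions over a long run is the classic failure mode: the cold chain found one mode and the swaps never delivered the other.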
Cold Chain Final Posterior
This is the payoff: samples from Chain 1 (β=1.0) represent the true posterior after discarding burn-in.
Cold Chain (β=1.0) - True Posterior Samples
This is what matters! Only samples from the cold chain (Chain 1, β=1.0) represent the true posterior.
For the bimodal distribution: both modes should be explored proportionally to their weights (40% and 60%).
If only one mode appears, parallel tempering isn't helping (try more chains or better temperature spacing).
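One way to check the 40/60 split is to classify cold-chain samples by nearest mode; a sketch assuming modes at (-2,-2) and (+2,+2), verified here on a synthetic 40/60 sample rather than real sampler output:

```python
import numpy as np

def mode_fractions(samples):
    """Fraction of 2-D samples nearer (-2,-2) vs (+2,+2)."""
    samples = np.asarray(samples)
    d_plus = np.sum((samples - 2.0) ** 2, axis=1)
    d_minus = np.sum((samples + 2.0) ** 2, axis=1)
    frac_plus = np.mean(d_plus < d_minus)
    return 1.0 - frac_plus, frac_plus

# Synthetic check: 40/60 mixture of tight Gaussians at the two modes
rng = np.random.default_rng(1)
n = 10_000
comp = rng.random(n) < 0.6
samples = np.where(comp[:, None], 2.0, -2.0) + 0.3 * rng.normal(size=(n, 2))
lo, hi = mode_fractions(samples)
print(round(lo, 2), round(hi, 2))  # close to 0.40 and 0.60
```

If the cold-chain fractions are far from the true weights even after a long run, the chains are communicating too rarely: look at the swap acceptance rates first.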
Performance Metrics
Live counters, all starting at 0: Total Iterations, Cold Chain Samples, Swap Attempts, Successful Swaps.
Understanding the metrics:
Total Iterations: MCMC steps × number of chains (each chain moves every iteration)
Cold Chain Samples: Number of posterior samples (one per iteration from Chain 1)
Swap Success Rate: Proportion of accepted swaps (should be 20-40% per adjacent pair)
Computational Cost: PT requires K× more likelihood evaluations than standard MCMC
Click "Start Sampling" to begin parallel tempering.
Strengths & Limitations
✓ Strengths
Multimodality: Excellent for exploring multiple well-separated modes
Mode discovery: Hot chains find modes; swaps propagate to cold chain
Embarrassingly parallel: Each chain runs independently between swaps
Gradient-free: Works when gradients unavailable (unlike HMC)
✗ Limitations
Computational cost: \(K\) chains means \(K\times\) function evaluations
Temperature ladder design: Requires careful tuning for efficiency
High dimensions: Swap acceptance decreases exponentially with \(d\)
Not for all problems: Overkill for unimodal posteriors (use HMC instead)
Memory overhead: Must store \(K\) states simultaneously
When to use Parallel Tempering:
Multimodal posteriors with well-separated modes (physical phase transitions, mixture models)
Rugged energy landscapes (statistical mechanics, protein folding)
When you suspect multiple modes but don't know how many
When gradients are unavailable (unlike HMC) but you have computational budget
Dimensions \(d \lesssim 20\) (swap acceptance degrades in high-d)
Comparison with Other Methods
Property
Standard MH
HMC
Parallel Tempering
Nested Sampling
Multimodality
Poor (stuck)
Poor (stuck)
Excellent
Excellent
High dimensions
Poor
Excellent
Poor (swap rate ↓)
Poor (curse of d)
Gradient-free?
Yes
No
Yes
Yes
Computes evidence?
No
No
No
Yes
Computational cost
Low (1×)
Medium (gradient)
High (\(K\)×)
Medium-High
Practical recommendation: Use PT for low-to-moderate dimensional multimodal problems without gradients.
Use HMC/NUTS for high-dimensional unimodal posteriors with gradients. Use nested sampling when you need evidence
(model comparison) or for multimodal problems in very low dimensions.
References & Further Reading
Key Papers:
Geyer, C.J. (1991). "Markov Chain Monte Carlo Maximum Likelihood."
In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, 156-163.
Hukushima, K., & Nemoto, K. (1996). "Exchange Monte Carlo Method and Application to Spin Glass Simulations."
Journal of the Physical Society of Japan, 65(6), 1604-1608.
Earl, D.J., & Deem, M.W. (2005). "Parallel Tempering: Theory, Applications, and New Perspectives."
Physical Chemistry Chemical Physics, 7(23), 3910-3916.
Swendsen, R.H., & Wang, J.S. (1986). "Replica Monte Carlo Simulation of Spin-Glasses."
Physical Review Letters, 57(21), 2607-2609. (Original replica exchange idea)
Katzgraber, H.G., Trebst, S., Huse, D.A., & Troyer, M. (2006). "Feedback-Optimized Parallel Tempering Monte Carlo."
Journal of Statistical Mechanics: Theory and Experiment, 2006(03), P03018.
Temperature Ladder Design:
Kone, A., & Kofke, D.A. (2005). "Selection of Temperature Intervals for Parallel-Tempering Simulations."
Journal of Chemical Physics, 122(20), 206101.
Rathore, N., Chopra, M., & de Pablo, J.J. (2005). "Optimal Allocation of Replicas in Parallel Tempering Simulations."
Journal of Chemical Physics, 122(2), 024111.
Books:
Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo.
CRC Press. (Chapter 7: Parallel and Interacting Chains)
Liu, J.S. (2001). Monte Carlo Strategies in Scientific Computing. Springer.
(Chapter 6: Simulated Tempering and Related Methods)
Software:
emcee: Python ensemble sampler ("the MCMC Hammer"); versions 2.x included parallel tempering as PTSampler
PyMC: Can implement PT with custom samplers
ptemcee: Parallel-tempered ensemble MCMC in Python (maintained successor to emcee's PTSampler)
MrBayes: Phylogenetics software with Metropolis-Coupled MCMC (MC³)