Part 4 of the MCMC Samplers Series • Interactive visualization of temperature ladder sampling for multimodal distributions
What is Parallel Tempering?
Parallel Tempering is an MCMC method designed to overcome the mode-hopping problem in multimodal distributions.
The approach is to run multiple MCMC chains simultaneously at different "temperatures," then periodically swap states between chains.
The Core Idea:
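Instead of sampling only from the target posterior \(\pi(\theta)\), run \(K\) parallel chains sampling from
"tempered" distributions:
$$\pi_k(\theta) \propto \pi(\theta)^{\beta_k}, \qquad k = 1, \ldots, K$$
where \(0 < \beta_K < \beta_{K-1} < \ldots < \beta_2 < \beta_1 = 1\) is the temperature ladder (each \(\beta_k\) is an inverse temperature, so smaller \(\beta\) means hotter).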
The cold chain at \(\beta_1 = 1\) samples the true posterior, while hotter chains at smaller \(\beta\) sample
increasingly flattened versions of the distribution, up to near-uniform sampling as \(\beta_k \to 0\).
Why does this work? At low \(\beta\), the energy landscape is flattened and barriers between
modes are suppressed, so hot chains cross freely between regions that trap a cold chain.
Replica exchange swaps carry information about newly discovered modes from hot chains down to the cold chain,
which samples the true posterior.
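The flattening effect can be checked numerically. The sketch below is illustrative only: the 1-D bimodal `log_target` and the helper `barrier` are hypothetical and not part of the visualization. It measures how far the log density drops between a mode and the midpoint between the modes; because tempering multiplies the log density by \(\beta\), that drop shrinks in proportion to \(\beta\).

```python
import math

def log_target(x):
    """Hypothetical 1-D bimodal target (unnormalized log density):
    equal mixture of N(-3, 0.5^2) and N(3, 0.5^2)."""
    a = -0.5 * ((x + 3.0) / 0.5) ** 2
    b = -0.5 * ((x - 3.0) / 0.5) ** 2
    m = max(a, b)  # log-sum-exp for numerical stability
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_tempered(x, beta):
    """Log of the tempered density pi(x)^beta, up to an additive constant."""
    return beta * log_target(x)

def barrier(beta):
    """Log-density drop from a mode (x = 3) to the midpoint (x = 0)."""
    return log_tempered(3.0, beta) - log_tempered(0.0, beta)

for beta in (1.0, 0.3, 0.05):
    print(f"beta = {beta:4.2f}   barrier = {barrier(beta):6.2f} nats")
```

At \(\beta = 1\) the barrier is roughly 17 nats, which a random-walk chain almost never crosses; at \(\beta = 0.05\) it is under 1 nat.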
The Algorithm
Parallel tempering alternates between two steps: local MCMC moves within each chain and
replica exchange swaps between adjacent temperatures.
Algorithm: Parallel Tempering
Initialize: Set up \(K\) chains, each at inverse temperature \(\beta_k\), with initial states \(\theta^{(k)}\)
MCMC step: For each chain \(k = 1, \ldots, K\):
Propose \(\theta' \sim q(\cdot \mid \theta^{(k)})\) using a symmetric Gaussian random walk
Accept with probability \(\alpha = \min\!\left(1,\, \left[\frac{\pi(\theta')}{\pi(\theta^{(k)})}\right]^{\!\beta_k}\right)\) (the symmetric proposal density cancels in the ratio)
Exchange step: For each adjacent pair \((k, k+1)\):
Propose to swap states: \(\theta^{(k)} \leftrightarrow \theta^{(k+1)}\)
Accept swap with probability:
$$\alpha_{\text{swap}} = \min\left(1, \frac{\pi(\theta^{(k+1)})^{\beta_k} \pi(\theta^{(k)})^{\beta_{k+1}}}{\pi(\theta^{(k)})^{\beta_k} \pi(\theta^{(k+1)})^{\beta_{k+1}}}\right)$$
which simplifies to:
$$\alpha_{\text{swap}} = \min\left(1, \exp\left[(\beta_k - \beta_{k+1})(\log \pi(\theta^{(k+1)}) - \log \pi(\theta^{(k)}))\right]\right)$$
Repeat: Iterate steps 2-3. Collect samples from the cold chain (\(\beta_1 = 1\)) only
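The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the visualization: the 1-D bimodal `log_target`, the ladder values, the \(1/\sqrt{\beta}\) proposal scaling, and the helper names `accept` and `parallel_tempering` are all hypothetical choices made for the example.

```python
import math
import random

def log_target(x):
    """Hypothetical 1-D bimodal target (unnormalized log density):
    0.4 * N(-2, 0.5^2) + 0.6 * N(2, 0.5^2)."""
    a = math.log(0.4) - 0.5 * ((x + 2.0) / 0.5) ** 2
    b = math.log(0.6) - 0.5 * ((x - 2.0) / 0.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def accept(log_alpha, rng):
    """Metropolis test: accept with probability min(1, exp(log_alpha))."""
    return rng.random() < math.exp(min(0.0, log_alpha))

def parallel_tempering(log_target, betas, n_iter, step=0.5, seed=0):
    rng = random.Random(seed)
    K = len(betas)
    states = [rng.gauss(0.0, 1.0) for _ in range(K)]  # step 1: initialize
    logp = [log_target(x) for x in states]
    cold_samples = []
    attempts = [0] * (K - 1)
    accepted = [0] * (K - 1)
    for _ in range(n_iter):
        # Step 2: one random-walk Metropolis update per chain.
        for k in range(K):
            # Assumption: proposal width grows as 1/sqrt(beta) for hot chains.
            proposal = states[k] + rng.gauss(0.0, step / math.sqrt(betas[k]))
            lp = log_target(proposal)
            if accept(betas[k] * (lp - logp[k]), rng):
                states[k], logp[k] = proposal, lp
        # Step 3: attempt a swap for each adjacent pair (k, k+1).
        for k in range(K - 1):
            attempts[k] += 1
            log_alpha = (betas[k] - betas[k + 1]) * (logp[k + 1] - logp[k])
            if accept(log_alpha, rng):
                states[k], states[k + 1] = states[k + 1], states[k]
                logp[k], logp[k + 1] = logp[k + 1], logp[k]
                accepted[k] += 1
        cold_samples.append(states[0])  # keep only the beta = 1 chain
    swap_rates = [a / n for a, n in zip(accepted, attempts)]
    return cold_samples, swap_rates

betas = [1.0, 0.5, 0.25, 0.1]  # illustrative ladder, not tuned
cold, swap_rates = parallel_tempering(log_target, betas, n_iter=20000)
print("swap rates per adjacent pair:", [round(r, 2) for r in swap_rates])
print("fraction of cold samples in the right-hand mode:",
      round(sum(x > 0 for x in cold) / len(cold), 2))
```

Caching `logp` alongside each state means a swap costs no extra density evaluations: the two cached values are simply exchanged with the states.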
Replica exchange: A swap is always accepted when the hotter chain currently holds the higher-density state.
Otherwise it is accepted with a probability that decays exponentially in the product of the inverse-temperature
gap and the log-density difference. Adjacent distributions must overlap sufficiently for swaps
to be accepted regularly, which is why temperature ladder spacing matters. Empirically, swap acceptance rates
of 20–40% per adjacent pair give a good balance between exploration and information flow.
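As a small numerical check, the two forms of the swap acceptance probability given in the algorithm can be compared directly. The function names and the example log-density values below are hypothetical, chosen only to illustrate the two cases: a swap that always fires and one that fires with probability \(e^{-1}\).

```python
import math

def swap_prob_ratio(pi_k, pi_k1, beta_k, beta_k1):
    """Swap acceptance written as a ratio of tempered densities."""
    num = pi_k1 ** beta_k * pi_k ** beta_k1
    den = pi_k ** beta_k * pi_k1 ** beta_k1
    return min(1.0, num / den)

def swap_prob_log(log_pi_k, log_pi_k1, beta_k, beta_k1):
    """Equivalent log-space form, which avoids underflow for small densities."""
    log_alpha = (beta_k - beta_k1) * (log_pi_k1 - log_pi_k)
    return math.exp(min(0.0, log_alpha))

beta_cold, beta_hot = 1.0, 0.5

# Case A: the hot chain holds the higher-density state, so the swap always fires.
p_a = swap_prob_log(-5.0, -3.0, beta_cold, beta_hot)

# Case B: the hot chain holds the lower-density state.
# log_alpha = (1.0 - 0.5) * (-5.0 - (-3.0)) = -1, so the probability is exp(-1).
p_b = swap_prob_log(-3.0, -5.0, beta_cold, beta_hot)

print(p_a, round(p_b, 4))
```

In practice the log-space form is the safer one to implement, since raw densities underflow quickly in more than a few dimensions.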
Target Distributions
Four distributions illustrate parallel tempering's strengths and allow comparison with standard MCMC:
Hot chains explore the narrow neck more easily, though the deeper fix is reparameterization or HMC.
Simulation Controls
Controls proposal distribution width. Hot chains automatically scale larger.
Target: 20–40% acceptance for the cold chain.
More temperatures = better exploration but higher computational cost
Linear / Geometric (0.1^k) / Exponential / Adaptive
Number of MCMC steps between swap attempts
Delay between iterations. Slower speeds help visualize swaps.
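Two of the ladder spacing options listed above can be sketched as follows. The exact formulas used by the visualization are not shown in the text, so the functions below are assumptions: `linear_ladder` spaces \(\beta\) evenly between 1 and a chosen `beta_min`, and `geometric_ladder` uses a constant ratio between adjacent \(\beta\) values. Both names and the `beta_min` parameter are hypothetical.

```python
def linear_ladder(n_chains, beta_min):
    """Evenly spaced inverse temperatures from 1.0 down to beta_min."""
    if n_chains == 1:
        return [1.0]
    step = (1.0 - beta_min) / (n_chains - 1)
    return [1.0 - k * step for k in range(n_chains)]

def geometric_ladder(n_chains, beta_min):
    """Constant ratio between adjacent inverse temperatures,
    running from 1.0 down to beta_min."""
    if n_chains == 1:
        return [1.0]
    ratio = beta_min ** (1.0 / (n_chains - 1))
    return [ratio ** k for k in range(n_chains)]

lin = linear_ladder(5, beta_min=0.1)
geo = geometric_ladder(5, beta_min=0.01)
print("linear:   ", [round(b, 3) for b in lin])
print("geometric:", [round(b, 3) for b in geo])
```

A geometric ladder places more levels near the hot end of the \(\beta\) range than a linear one with the same endpoints, which is one reason it is a common default.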
Temperature Chain Visualizations
Each subplot shows one temperature level in the ladder. Chain 1 (top-left, β=1.0)
samples the true posterior and is the only chain used for inference. Hotter chains (lower β)
sample progressively flattened distributions and explore more freely.
Swaps exchange the states between chains while each chain retains its temperature. When
a point jumps, a parameter value has moved from one temperature level to another. Mode information
discovered by hot chains propagates down the ladder to the cold chain through this mechanism.
Parallel Tempering Diagnostics
The key diagnostics for PT are: swap acceptance rates (are chains communicating well?),
chain exploration (is each temperature exploring its target properly?), and
cold chain coverage (does the final posterior sample all modes correctly?).
Swap Acceptance Rate Between Adjacent Chains
Target: 20–40%. Green = optimal, Orange = acceptable, Red = problematic.
Cold Chain Trace Plot (θ₁)
Chain 1 (β=1.0) position over time. Good mixing shows random walk behavior with mode switches.
Mixing Diagnostics
Unlike single-chain MCMC, mixing quality in parallel tempering depends on both within-chain exploration
(each chain covering its tempered target) and between-chain communication (swaps transferring
information up and down the ladder). The initial burn-in transient ends once each chain has settled
into its stationary distribution.
Cold Chain (β=1.0) - Both Coordinates Over Time
θ₁ (blue) and θ₂ (orange) vs iteration for the cold chain.
Look for burn-in during the initial transient as the chain moves from its random start to the posterior region,
then stable wandering once stationarity is reached. For the bimodal distribution, jumps between the (-2,-2)
and (+2,+2) regions indicate successful mode switching.
Discard the first 50–200 iterations as burn-in before analyzing the posterior. Trace plots should show
rapid fluctuations without trends or extended flat segments. For multimodal targets, verify that the cold chain
visits all modes and that swap acceptance rates sit around 20–40% per adjacent pair, which indicates that
information is flowing through the ladder. If a mode is absent from the cold chain, adding more temperature
levels or tightening the ladder spacing usually helps.
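One way to turn the trace-plot check into a number is to discard burn-in and count how often the cold chain moves between the two mode regions. The helper below is a hypothetical sketch for a single coordinate of the bimodal example; the `threshold` that separates the regions and the default `burn_in` are assumptions, not values taken from the visualization.

```python
def count_mode_switches(trace, burn_in=200, threshold=1.0):
    """Count moves between the left region (x < -threshold) and the right
    region (x > threshold) after discarding the first burn_in samples.
    Samples between the two regions are ignored, so brief excursions
    toward the midpoint are not counted as switches."""
    switches = 0
    current = None
    for x in trace[burn_in:]:
        if x > threshold:
            region = "right"
        elif x < -threshold:
            region = "left"
        else:
            continue
        if current is not None and region != current:
            switches += 1
        current = region
    return switches

# Tiny synthetic trace: three burn-in samples, then right -> left -> right.
trace = [0.0, 0.0, 0.0, 2.1, 1.9, 0.2, -2.0, -1.8, 2.2]
print(count_mode_switches(trace, burn_in=3))
```

A count of zero over a long run is the numerical counterpart of a trace plot with one extended flat band: the cold chain never left its starting mode.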
Cold Chain Final Posterior
Samples from Chain 1 (β=1.0) represent the true posterior after discarding burn-in.
Cold Chain (β=1.0) - True Posterior Samples
Only samples from the cold chain (Chain 1, β=1.0) represent the true posterior.
For the bimodal distribution, both modes should be explored proportionally to their weights (40% and 60%).
If only one mode appears, try adding more chains or adjusting the temperature spacing.
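The proportionality check can be made concrete by assigning each cold-chain sample to its nearest mode and comparing the resulting fractions with the known weights. The sketch below assumes mode centers at (-2,-2) and (+2,+2) as in the bimodal example; the function name `mode_weights` and the toy samples are hypothetical.

```python
import math

def mode_weights(samples, modes):
    """Fraction of samples whose nearest mode center (Euclidean distance)
    is each entry of modes."""
    counts = [0] * len(modes)
    for s in samples:
        nearest = min(range(len(modes)), key=lambda i: math.dist(s, modes[i]))
        counts[nearest] += 1
    return [c / len(samples) for c in counts]

modes = [(-2.0, -2.0), (2.0, 2.0)]
# Toy post-burn-in samples standing in for cold-chain output.
samples = [(-2.1, -1.9), (-1.8, -2.2), (2.0, 2.1), (1.9, 1.8), (2.2, 2.0)]
print(mode_weights(samples, modes))
```

With real output, the estimated fractions only approach the true weights once the cold chain has switched modes many times, so a large mismatch after a short run is not by itself evidence of a bug.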
Performance Metrics
Metrics displayed: Total Iterations, Cold Chain Samples, Swap Attempts, Successful Swaps.
Total Iterations counts MCMC steps multiplied by the number of chains, since each chain advances every iteration.
Cold Chain Samples is the number of posterior samples drawn from Chain 1 alone. The swap success rate should fall
between 20% and 40% per adjacent pair. Note that PT requires K times as many likelihood evaluations as standard
single-chain MCMC, where K is the number of temperature levels.
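The bookkeeping behind these metrics is simple arithmetic. The sketch below assumes one density evaluation per chain per iteration and one cold-chain sample per iteration; the function name and the example counts are hypothetical.

```python
def pt_cost_summary(n_iterations, n_chains, swap_attempts, swap_successes):
    """Summarize the cost of a PT run, assuming one density evaluation
    per chain per iteration."""
    return {
        "likelihood_evals": n_iterations * n_chains,
        "single_chain_evals": n_iterations,  # same run length with K = 1
        "cold_chain_samples": n_iterations,
        "swap_rate": swap_successes / swap_attempts if swap_attempts else float("nan"),
    }

summary = pt_cost_summary(n_iterations=5000, n_chains=6,
                          swap_attempts=25000, swap_successes=7500)
print(summary)
```

Here six chains cost 30,000 evaluations to yield 5,000 cold-chain samples, and the overall swap rate of 30% falls inside the 20–40% target band.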
Strengths & Limitations
Parallel tempering is well suited to multimodal posteriors with well-separated modes. Hot chains discover
new modes and swaps propagate that information to the cold chain. Because both the within-chain moves and
the swap moves satisfy detailed balance, the cold chain still produces correct posterior samples. The chains
evolve independently between swap steps, making the algorithm straightforward to parallelize. It requires
only density evaluations, with no gradients needed.
The main cost is computational: running K chains takes K times as many likelihood evaluations, and temperature
ladder design requires care. For a fixed ladder, swap acceptance rates decrease as dimension grows, so
performance degrades in high-dimensional settings. For unimodal posteriors, a single-chain method such as HMC is preferable.
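The dimension effect can be illustrated on a case where the tempered distributions are known exactly. The sketch below assumes a standard normal target in \(d\) dimensions, for which \(\pi(\theta)^{\beta}\) is a normal distribution with covariance \(I/\beta\), and estimates the mean swap acceptance between \(\beta = 1\) and \(\beta = 0.5\) from independent exact draws. The target, the pair of \(\beta\) values, and the function name are assumptions made for the example.

```python
import math
import random

def mean_swap_acceptance(dim, beta_cold=1.0, beta_hot=0.5, n_draws=4000, seed=0):
    """Monte Carlo estimate of the swap acceptance rate between two levels
    for a standard normal target, using exact draws from each tempered
    distribution N(0, I / beta)."""
    rng = random.Random(seed)
    sd_cold = 1.0 / math.sqrt(beta_cold)
    sd_hot = 1.0 / math.sqrt(beta_hot)
    total = 0.0
    for _ in range(n_draws):
        # log pi(theta) = -0.5 * ||theta||^2 for the untempered target
        lp_cold = -0.5 * sum(rng.gauss(0.0, sd_cold) ** 2 for _ in range(dim))
        lp_hot = -0.5 * sum(rng.gauss(0.0, sd_hot) ** 2 for _ in range(dim))
        log_alpha = (beta_cold - beta_hot) * (lp_hot - lp_cold)
        total += math.exp(min(0.0, log_alpha))
    return total / n_draws

rates = {d: mean_swap_acceptance(d) for d in (1, 20, 80)}
for d, r in rates.items():
    print(f"d = {d:3d}   mean swap acceptance = {r:.3f}")
```

With the gap between the two levels held fixed, the acceptance rate falls steadily as \(d\) grows, so keeping it in a usable range requires more closely spaced levels and therefore more chains.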
Parallel tempering is most appropriate for low-to-moderate dimensional problems (roughly \(d \lesssim 20\))
with multimodal posteriors where gradients are unavailable, such as physical phase transitions, mixture models,
or rugged energy landscapes. When gradients are available and the posterior is unimodal, HMC or NUTS will
typically be more efficient.
Comparison with Other Methods
| Property | Standard MH | HMC | Parallel Tempering | Nested Sampling |
|---|---|---|---|---|
| Multimodality | Poor (stuck) | Poor (stuck) | Excellent | Excellent |
| High dimensions | Poor | Excellent | Poor (swap rate decreases) | Poor (curse of d) |
| Gradient-free? | Yes | No | Yes | Yes |
| Computes evidence? | No | No | No | Yes |
| Computational cost | Low (1×) | Medium (gradient) | High (\(K\)×) | Medium-High |
Use PT for low-to-moderate dimensional multimodal problems without gradients.
Use HMC/NUTS for high-dimensional unimodal posteriors with gradients. Use nested sampling when you need evidence
(model comparison) or for multimodal problems in very low dimensions.
References & Further Reading
Key Papers:
Geyer, C.J. (1991). "Markov Chain Monte Carlo Maximum Likelihood."
In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, 156-163.
Hukushima, K., & Nemoto, K. (1996). "Exchange Monte Carlo Method and Application to Spin Glass Simulations."
Journal of the Physical Society of Japan, 65(6), 1604-1608.
Earl, D.J., & Deem, M.W. (2005). "Parallel Tempering: Theory, Applications, and New Perspectives."
Physical Chemistry Chemical Physics, 7(23), 3910-3916.
Swendsen, R.H., & Wang, J.S. (1986). "Replica Monte Carlo Simulation of Spin-Glasses."
Physical Review Letters, 57(21), 2607-2609. (Original replica exchange idea)
Katzgraber, H.G., Trebst, S., Huse, D.A., & Troyer, M. (2006). "Feedback-Optimized Parallel Tempering Monte Carlo."
Journal of Statistical Mechanics: Theory and Experiment, 2006(03), P03018.
Temperature Ladder Design:
Kone, A., & Kofke, D.A. (2005). "Selection of Temperature Intervals for Parallel-Tempering Simulations."
Journal of Chemical Physics, 122(20), 206101.
Rathore, N., Chopra, M., & de Pablo, J.J. (2005). "Optimal Allocation of Replicas in Parallel Tempering Simulations."
Journal of Chemical Physics, 122(2), 024111.
Books:
Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo.
CRC Press. (Chapter 7: Parallel and Interacting Chains)
Liu, J.S. (2001). Monte Carlo Strategies in Scientific Computing. Springer.
(Chapter 6: Simulated Tempering and Related Methods)
Software:
emcee: Python - The MCMC Hammer (versions before 3.0 included parallel tempering as PTSampler)
PyMC: Can implement PT with custom samplers
ptemcee: Parallel-tempered ensemble MCMC in Python (a fork of emcee)
MrBayes: Phylogenetics software with Metropolis-Coupled MCMC (MC³)