Interactive Parallel Tempering Demo

What is Parallel Tempering?

Parallel Tempering (also known as replica exchange MCMC) is an MCMC method designed to overcome poor mixing between modes in multimodal distributions. The key insight: run multiple MCMC chains simultaneously at different "temperatures," then periodically swap states between chains.

The Core Idea:

Instead of sampling only from the target posterior $\pi(\theta)$, run $K$ parallel chains, each sampling from a "tempered" distribution:

$$\pi_\beta(\theta) \propto \pi(\theta)^\beta = [\mathcal{L}(\theta) \pi_0(\theta)]^\beta$$

where $0 < \beta_K < \beta_{K-1} < \ldots < \beta_2 < \beta_1 = 1$ is the temperature ladder, with each $\beta_k = 1/T_k$ an inverse temperature:

$\beta_1 = 1$: The "cold" chain samples the true posterior $\pi(\theta)$
$\beta_k < 1$: "Hot" chains sample flattened versions (easier to move between modes)
$\beta_k \to 0$: The hottest chain samples almost uniformly (pure exploration)

Why does this work? Hot chains explore a flattened energy landscape where energy barriers between modes are suppressed, enabling efficient mode-hopping. Replica exchange swaps allow information about newly discovered modes to propagate from hot chains to the cold chain sampling the target distribution. This creates an efficient exploration-exploitation trade-off.

The Algorithm

Parallel tempering alternates between two steps: local MCMC moves within each chain and replica exchange swaps between adjacent temperatures.

Algorithm: Parallel Tempering

1. Initialize: Set up $K$ chains, chain $k$ at inverse temperature $\beta_k$, with initial states $\theta^{(k)}$
2. MCMC step: For each chain $k = 1, \ldots, K$:
- Propose $\theta' \sim q(\cdot \mid \theta^{(k)})$ using Metropolis-Hastings with a symmetric proposal $q$
- Accept with probability $\alpha = \min(1, r^{\beta_k})$ where $r = \frac{\pi(\theta')}{\pi(\theta^{(k)})}$
3. Exchange step: For each adjacent pair $(k, k+1)$, $k = 1, \ldots, K-1$:
- Propose to swap states: $\theta^{(k)} \leftrightarrow \theta^{(k+1)}$
- Accept swap with probability: $$\alpha_{\text{swap}} = \min\left(1, \frac{\pi(\theta^{(k)})^{\beta_{k+1}} \pi(\theta^{(k+1)})^{\beta_k}}{\pi(\theta^{(k)})^{\beta_k} \pi(\theta^{(k+1)})^{\beta_{k+1}}}\right)$$ which simplifies to: $$\alpha_{\text{swap}} = \min\left(1, \exp\left[(\beta_k - \beta_{k+1})(\log \pi(\theta^{(k+1)}) - \log \pi(\theta^{(k)}))\right]\right)$$
4. Repeat: Iterate steps 2-3. Collect samples from the cold chain ($\beta_1 = 1$) only
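
The loop above can be sketched in a few dozen lines of plain Python. This is a minimal illustration, not the code behind this demo: the function names, the starting point (0, 0), the example ladder, and the rule of widening each chain's proposal by $1/\sqrt{\beta_k}$ are all assumptions made for the sketch. The target is the bimodal mixture described under Target Distributions.

```python
import math
import random

def log_target(theta):
    """Unnormalized log density of the bimodal Gaussian mixture."""
    x, y = theta
    s2 = 0.8 ** 2
    a = math.log(0.4) - ((x + 2) ** 2 + (y + 2) ** 2) / (2 * s2)
    b = math.log(0.6) - ((x - 2) ** 2 + (y - 2) ** 2) / (2 * s2)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))  # log-sum-exp

def parallel_tempering(log_pi, betas, n_iters, steps_per_swap=10, sigma0=0.4, seed=0):
    """betas[0] must be 1.0 (cold chain); betas decrease toward the hot end."""
    rng = random.Random(seed)
    K = len(betas)
    states = [(0.0, 0.0) for _ in range(K)]   # assumed common starting point
    logps = [log_pi(s) for s in states]       # cached untempered log densities
    attempts = [0] * (K - 1)
    accepts = [0] * (K - 1)
    cold_samples = []
    for it in range(1, n_iters + 1):
        # MCMC step: one random-walk Metropolis move per chain.
        for k in range(K):
            sigma = sigma0 / math.sqrt(betas[k])  # assumed scaling for hot chains
            prop = (states[k][0] + rng.gauss(0.0, sigma),
                    states[k][1] + rng.gauss(0.0, sigma))
            lp = log_pi(prop)
            # log(u) < beta * log r, with u in (0, 1] to avoid log(0)
            if math.log(1.0 - rng.random()) < betas[k] * (lp - logps[k]):
                states[k], logps[k] = prop, lp
        # Exchange step: try each adjacent pair every steps_per_swap iterations.
        if it % steps_per_swap == 0:
            for k in range(K - 1):
                attempts[k] += 1
                log_alpha = (betas[k] - betas[k + 1]) * (logps[k + 1] - logps[k])
                if math.log(1.0 - rng.random()) < log_alpha:
                    states[k], states[k + 1] = states[k + 1], states[k]
                    logps[k], logps[k + 1] = logps[k + 1], logps[k]
                    accepts[k] += 1
        cold_samples.append(states[0])  # keep only the beta = 1 chain
    return cold_samples, attempts, accepts

cold, attempts, accepts = parallel_tempering(log_target, [1.0, 0.3, 0.1, 0.03], n_iters=2000)
```

Caching the untempered log density of each state means a swap costs no extra density evaluations: only the cached values are exchanged along with the states.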

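Key insight on replica exchange: A swap is always accepted when the hotter chain holds the higher-density state; otherwise it is accepted with a probability that decays exponentially in the temperature gap $\beta_k - \beta_{k+1}$ times the log-density difference between the two states. This motivates careful temperature ladder design to maintain sufficient overlap between adjacent distributions. Swap acceptance rates of 20-40% are a common target for efficient information exchange. Detailed balance holds for any ladder, so this target affects efficiency, not correctness.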
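
A small worked example may help here. Taking the log of the ratio of tempered densities in the exchange step gives $(\beta_k - \beta_{k+1})(\log \pi(\theta^{(k+1)}) - \log \pi(\theta^{(k)}))$. The helper below (a hypothetical name, used only for illustration) evaluates that expression for a few made-up log-density values.

```python
import math

def swap_accept_prob(beta_k, beta_k1, logp_k, logp_k1):
    """Swap acceptance probability for chains k (colder) and k+1 (hotter).

    logp_k and logp_k1 are the untempered log densities log pi(theta)
    at the current states of chain k and chain k+1.
    """
    log_ratio = (beta_k - beta_k1) * (logp_k1 - logp_k)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

# Hot chain holds the better state: the swap is always accepted.
p_up = swap_accept_prob(1.0, 0.5, logp_k=-5.0, logp_k1=-2.0)

# Hot chain holds a state 3 nats worse: accepted with probability exp(-1.5).
p_down = swap_accept_prob(1.0, 0.5, logp_k=-2.0, logp_k1=-5.0)

# Same states, but a narrower temperature gap: exp(-0.3), so swaps are easier.
p_down_small_gap = swap_accept_prob(1.0, 0.9, logp_k=-2.0, logp_k1=-5.0)
```

The last two calls show why ladder spacing matters: for the same pair of states, shrinking the gap in $\beta$ raises the acceptance probability.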

Target Distributions

We provide four distributions to illustrate parallel tempering's strengths and compare with standard MCMC:

Bivariate Gaussian (Baseline)

A standard correlated 2D Gaussian with $\rho = 0.8$. Single mode—parallel tempering is overkill here, but useful to understand temperature effects.
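
To make the temperature effect concrete for this target, assume the Gaussian has mean $\mu$ and covariance $\Sigma$ (the description above fixes only $\rho = 0.8$). Raising its density to the power $\beta$ gives another Gaussian with inflated covariance:

$$\pi_\beta(\theta) \propto \exp\left[-\tfrac{\beta}{2}(\theta - \mu)^\top \Sigma^{-1} (\theta - \mu)\right] \quad\Longrightarrow\quad \pi_\beta = \mathcal{N}(\mu, \Sigma / \beta)$$

So a chain at $\beta = 0.25$ sees the same correlation but twice the standard deviation in every direction.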

🌟 Bimodal Gaussian Mixture (Perfect for PT!)

Two well-separated modes at (-2,-2) and (+2,+2):

$$\pi(\theta) = 0.4 \cdot \mathcal{N}((-2,-2), 0.8^2I) + 0.6 \cdot \mathcal{N}((+2,+2), 0.8^2I)$$

This is where PT shines! Standard MH gets stuck in one mode. Hot chains easily hop between modes, and swaps propagate this information to the cold chain. Watch how different temperature chains explore differently!
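
To see how much tempering lowers the barrier between these two modes, the sketch below evaluates the mixture's log density at the heavier mode (+2,+2) and at the midpoint (0,0), which lies roughly at the lowest point on the straight path between the modes. The function name is illustrative, and the shared normalizing constant of the two components is dropped since it cancels in ratios.

```python
import math

def log_mixture(x, y):
    """Unnormalized log density of 0.4*N((-2,-2), 0.8^2 I) + 0.6*N((2,2), 0.8^2 I)."""
    s2 = 0.8 ** 2
    a = math.log(0.4) - ((x + 2) ** 2 + (y + 2) ** 2) / (2 * s2)
    b = math.log(0.6) - ((x - 2) ** 2 + (y - 2) ** 2) / (2 * s2)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))  # log-sum-exp

# Log-density drop from the heavier mode to the midpoint between modes.
barrier = log_mixture(2.0, 2.0) - log_mixture(0.0, 0.0)

# Under tempering the drop is multiplied by beta, so the density ratio
# mode / midpoint becomes exp(beta * barrier).
ratios = {beta: math.exp(beta * barrier) for beta in (1.0, 0.3, 0.1)}

for beta, ratio in ratios.items():
    print(f"beta={beta}: mode is {ratio:.1f}x denser than the midpoint")
```

At $\beta = 1$ the mode is roughly 300 times denser than the midpoint, while at $\beta = 0.1$ the ratio is below 2, which is why the hot chains cross between modes so easily.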

Rosenbrock's Banana

Strongly correlated along a curved manifold. Single mode but challenging geometry. PT helps but HMC is better here.

Neal's Funnel

Hierarchical model with exponentially varying scales. Hot chains can explore the narrow neck more easily. PT helps, but the real solution is reparameterization or HMC.
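
The phrase "exponentially varying scales" can be made concrete with the commonly used two-dimensional form of the funnel from Neal (2003): $v \sim \mathcal{N}(0, 3^2)$ and $x \mid v \sim \mathcal{N}(0, e^{v})$. Whether this demo uses exactly these constants is an assumption; the sketch is only meant to show how quickly the width of $x$ changes with $v$.

```python
import math

def log_funnel(v, x):
    """Unnormalized log density of a 2D Neal's funnel (assumed standard form)."""
    log_p_v = -0.5 * (v / 3.0) ** 2                      # v ~ N(0, 3^2)
    log_p_x = -0.5 * x * x * math.exp(-v) - 0.5 * v      # x | v ~ N(0, exp(v))
    return log_p_v + log_p_x

def conditional_std(v):
    """Standard deviation of x given v."""
    return math.exp(v / 2.0)

# Width of the funnel in the neck (v = -4) versus the mouth (v = +4).
neck, mouth = conditional_std(-4.0), conditional_std(4.0)
scale_ratio = mouth / neck   # equals exp(4), about 55
```

A single fixed step size cannot suit both regions, which is the difficulty the paragraph above describes.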

Simulation Controls

Target Distribution

Base Proposal Step Size (σ₀): 0.4

Controls the width of the proposal distribution. Hot chains automatically use larger steps. Target: 20-40% acceptance for the cold chain.
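
The exact rule this demo uses to enlarge hot-chain steps is not stated here. One natural choice, shown below purely as an assumption, is $\sigma_k = \sigma_0 / \sqrt{\beta_k}$, which matches how the standard deviation of a Gaussian target grows under tempering.

```python
import math

def proposal_sigmas(sigma0, betas):
    """Per-chain proposal widths under the assumed rule sigma0 / sqrt(beta)."""
    return [sigma0 / math.sqrt(beta) for beta in betas]

# Base step size 0.4 with an example four-rung ladder.
sigmas = proposal_sigmas(0.4, [1.0, 0.1, 0.01, 0.001])
```

The cold chain keeps the base step size, and each hotter chain proposes proportionally wider moves.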

Number of Temperatures: 4

More temperatures = better exploration but higher computational cost

Temperature Spacing: Geometric

Linear / Geometric (0.1^k) / Exponential / Adaptive
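
As a rough illustration of the first two spacing options, the helpers below build a linear and a geometric ladder. The function names and the `beta_min` parameter are hypothetical, and reading the "0.1^k" label as a constant ratio of 0.1 between neighboring rungs is an interpretation, not something confirmed by the demo.

```python
def linear_ladder(n_temps, beta_min):
    """Evenly spaced inverse temperatures from 1.0 down to beta_min."""
    if n_temps < 2:
        raise ValueError("need at least two temperatures")
    step = (1.0 - beta_min) / (n_temps - 1)
    return [1.0 - step * k for k in range(n_temps)]

def geometric_ladder(n_temps, ratio=0.1):
    """Inverse temperatures 1, ratio, ratio^2, ... (constant ratio between rungs)."""
    return [ratio ** k for k in range(n_temps)]

linear = linear_ladder(4, beta_min=0.1)      # approximately [1.0, 0.7, 0.4, 0.1]
geometric = geometric_ladder(4, ratio=0.1)   # approximately [1.0, 0.1, 0.01, 0.001]
```

Geometric spacing places more rungs near the hot end in absolute terms of $\beta$, while keeping the ratio between neighbors fixed.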

Iterations per Swap: 10

Number of MCMC steps between swap attempts

Iteration Speed (ms): 50

Delay between iterations. Slower speeds help visualize swaps.

Temperature Chain Visualizations

Each subplot shows one temperature level in the ladder. Chain 1 (top-left, β=1.0) always samples the true posterior—this is your target chain. Hotter chains (lower β) sample flattened distributions and explore more freely.

How swaps work: States (the x,y positions) move between chains, but each chain position keeps its temperature. When you see a point jump, that's a state moving from one temperature to another. The cold chain (position 1) receives information about newly discovered modes from the hot chains through these swaps.

Parallel Tempering Diagnostics

The key diagnostics for PT are: swap acceptance rates (are chains communicating well?), chain exploration (is each temperature exploring its target properly?), and cold chain coverage (does the final posterior sample all modes correctly?).

Swap Acceptance Rate Between Adjacent Chains

Target: 20-40%. Green = optimal, Orange = acceptable, Red = problematic.
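
The per-pair rates behind this chart are simple ratios of counters. The sketch below computes them and checks each against the 20-40% band. The function names are illustrative, and since the thresholds separating orange from red are not given here, the code only reports whether a rate is below, inside, or above the band.

```python
def swap_rates(attempts, accepts):
    """Acceptance rate for each adjacent pair; None if no swaps were attempted."""
    return [a / n if n > 0 else None for n, a in zip(attempts, accepts)]

def band_label(rate, low=0.20, high=0.40):
    """Position of a rate relative to the 20-40% target band."""
    if rate < low:
        return "below"   # rungs too far apart: consider adding temperatures
    if rate > high:
        return "above"   # rungs closer than needed: fewer chains may do
    return "in band"

# Example counters for three adjacent pairs (made-up numbers).
rates = swap_rates([200, 200, 200], [90, 50, 10])
labels = [band_label(r) for r in rates]
```

A pair that sits far below the band is the usual bottleneck, because states cannot pass through it on their way to the cold chain.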

Cold Chain Trace Plot (θ₁)

Chain 1 (β=1.0) position over time. Good mixing looks like rapid, trendless fluctuation within each mode, with regular switches between modes for multimodal targets.

Convergence Diagnostics

Unlike single-chain MCMC, parallel tempering convergence depends on both within-chain mixing (each chain exploring its target) and between-chain communication (swaps transferring information). Watch for the burn-in phase where chains move from initial positions to the target distribution.

Cold Chain (β=1.0) - Both Coordinates Over Time

θ₁ (blue) and θ₂ (orange) vs iteration for the cold chain. Look for: (1) Burn-in - initial transient behavior as chain moves from random start to posterior, (2) Stationarity - stable wandering within posterior region, (3) Mode switching - for bimodal, jumps between (-2,-2) and (+2,+2) regions.

Assessing Convergence:

Burn-in phase: First ~50-200 iterations as chains move from random initialization. Discard these when analyzing posterior.
Within-chain mixing: Trace plots should show random fluctuations without trends or getting stuck.
Between-chain communication: Swap acceptance rates of 20-40% ensure information flows between temperatures.
Mode exploration (multimodal): Cold chain should visit all modes. Count transitions between modes in trace plot.
Visual check: Does the cold chain posterior coverage match the true distribution? Run longer if modes are missing.
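
Counting mode transitions can be automated for the bimodal target. The sketch below discards a burn-in prefix and then labels each cold-chain sample by the sign of $\theta_1 + \theta_2$, which separates the modes at (-2,-2) and (+2,+2). That labeling rule and the function name are assumptions for illustration; samples that hover near the boundary can inflate the count.

```python
def count_mode_switches(samples, burn_in=200):
    """Number of times the cold chain crosses between the two modes.

    samples: sequence of (theta1, theta2) pairs from the cold chain.
    """
    kept = samples[burn_in:]
    labels = [1 if x + y > 0 else 0 for x, y in kept]
    return sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)

# Tiny synthetic trace: lower mode, then upper mode, then back again.
trace = [(-2.0, -2.0)] * 3 + [(2.0, 2.0)] * 2 + [(-2.0, -2.0)]
switches = count_mode_switches(trace, burn_in=0)
```

A count of zero after burn-in means the cold chain never left its starting mode, which is the signal to run longer or adjust the ladder.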

Cold Chain Final Posterior

This is the payoff: samples from Chain 1 (β=1.0) represent the true posterior after discarding burn-in.

Cold Chain (β=1.0) - True Posterior Samples

This is what matters! Only samples from the cold chain (Chain 1, β=1.0) represent the true posterior. For the bimodal distribution: both modes should be explored proportionally to their weights (40% and 60%). If only one mode appears, parallel tempering isn't helping (try more chains or better temperature spacing).
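
A quick numerical version of this check is to estimate the fraction of cold-chain samples in each mode and compare with the 40% and 60% weights. As a simplifying assumption, the sketch below assigns samples to modes by the sign of $\theta_1 + \theta_2$; the function name is illustrative.

```python
def mode_fractions(samples, burn_in=0):
    """Fractions of samples near (-2,-2) and near (+2,+2), in that order."""
    kept = samples[burn_in:]
    if not kept:
        raise ValueError("no samples left after burn-in")
    upper = sum(1 for x, y in kept if x + y > 0)
    frac_upper = upper / len(kept)
    return 1.0 - frac_upper, frac_upper

# Synthetic example: 4 samples in the lower mode, 6 in the upper mode.
example = [(-2.1, -1.9)] * 4 + [(1.8, 2.2)] * 6
frac_lower, frac_upper = mode_fractions(example)
```

Because successive MCMC samples are correlated, the estimated fractions settle more slowly than they would for independent draws, so a modest mismatch early in a run is not by itself a sign of failure.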

Performance Metrics

Counters shown: Total Iterations, Cold Chain Samples, Swap Attempts, Successful Swaps (all start at 0).

Understanding the metrics:

Total Iterations: MCMC steps × number of chains (each chain moves every iteration)
Cold Chain Samples: Number of posterior samples (one per iteration from Chain 1)
Swap Success Rate: Successful Swaps divided by Swap Attempts (should be 20-40% per adjacent pair)
Computational Cost: PT requires $K\times$ as many density evaluations as single-chain MCMC
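
As a worked example of these counts, the sketch below tallies the work for a run with the default settings of 4 temperatures and 10 iterations per swap. It assumes one density evaluation per chain per iteration, that every adjacent pair is attempted at each swap round, and that swaps reuse cached densities. The function name is hypothetical.

```python
def pt_cost(n_steps, n_chains, steps_per_swap):
    """Bookkeeping for a parallel tempering run under the stated assumptions."""
    swap_rounds = n_steps // steps_per_swap
    return {
        "total_iterations": n_steps * n_chains,        # MCMC steps x chains
        "cold_chain_samples": n_steps,                 # one per step from Chain 1
        "swap_attempts": swap_rounds * (n_chains - 1), # all adjacent pairs per round
    }

cost = pt_cost(n_steps=1000, n_chains=4, steps_per_swap=10)
```

For 1000 steps this gives 4000 chain updates for 1000 usable posterior samples, which is the $K\times$ overhead described above.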


Strengths & Limitations

✓ Strengths

Multimodality: Excellent for exploring multiple well-separated modes
Mode discovery: Hot chains find modes; swaps propagate to cold chain
Proper sampling: Maintains detailed balance, produces correct posterior samples
Parallelizable: Chains evolve independently between replica exchange steps
Gradient-free: Only requires density evaluation (unlike HMC)

✗ Limitations

Computational cost: $K$ chains means $K\times$ function evaluations
Temperature ladder design: Requires careful tuning for efficiency
High dimensions: For a fixed ladder, swap acceptance falls off rapidly as $d$ grows; keeping it steady typically requires on the order of $\sqrt{d}$ chains
Not for all problems: Overkill for unimodal posteriors (use HMC instead)
Memory overhead: Must store $K$ states simultaneously
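
The high-dimension limitation can be checked numerically in a simple setting. The sketch below assumes a standard Gaussian target in $d$ dimensions, for which the tempered distribution at inverse temperature $\beta$ is $\mathcal{N}(0, I/\beta)$ and can be sampled exactly. It then averages the swap acceptance probability between two fixed rungs. The target, the rung values, and the function name are all assumptions chosen for illustration.

```python
import math
import random

def mean_swap_acceptance(d, beta_cold=1.0, beta_hot=0.5, n_pairs=2000, seed=0):
    """Monte Carlo estimate of the average swap acceptance for N(0, I_d)."""
    rng = random.Random(seed)
    sd_cold = 1.0 / math.sqrt(beta_cold)
    sd_hot = 1.0 / math.sqrt(beta_hot)
    total = 0.0
    for _ in range(n_pairs):
        # Untempered log density log pi(theta) = -|theta|^2 / 2 at exact draws
        # from the cold and hot tempered distributions.
        logp_cold = -0.5 * sum(rng.gauss(0.0, sd_cold) ** 2 for _i in range(d))
        logp_hot = -0.5 * sum(rng.gauss(0.0, sd_hot) ** 2 for _i in range(d))
        log_ratio = (beta_cold - beta_hot) * (logp_hot - logp_cold)
        total += 1.0 if log_ratio >= 0 else math.exp(log_ratio)
    return total / n_pairs

acceptance = {d: mean_swap_acceptance(d) for d in (2, 10, 50)}
for d, rate in acceptance.items():
    print(f"d={d}: mean swap acceptance {rate:.3f}")
```

With the gap between rungs held fixed, the estimated acceptance drops steadily as $d$ increases, so more rungs are needed to keep neighboring chains communicating.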

When to use Parallel Tempering:

Multimodal posteriors with well-separated modes (physical phase transitions, mixture models)
Rugged energy landscapes (statistical mechanics, protein folding)
When you suspect multiple modes but don't know how many
When gradients are unavailable (unlike HMC) but you have computational budget
Dimensions $d \lesssim 20$ (swap acceptance degrades in high-d)

Comparison with Other Methods

Property	Standard MH	HMC	Parallel Tempering	Nested Sampling
Multimodality	Poor (stuck)	Poor (stuck)	Excellent	Excellent
High dimensions	Poor	Excellent	Poor (swap rate ↓)	Poor (curse of d)
Gradient-free?	Yes	No	Yes	Yes
Computes evidence?	No	No	Not directly (possible via thermodynamic integration)	Yes
Computational cost	Low (1×)	Medium (gradient)	High ($K$×)	Medium-High

Practical recommendation: Use PT for low-to-moderate dimensional multimodal problems without gradients. Use HMC/NUTS for high-dimensional unimodal posteriors with gradients. Use nested sampling when you need evidence (model comparison) or for multimodal problems in very low dimensions.

References & Further Reading

Key Papers:

Geyer, C.J. (1991). "Markov Chain Monte Carlo Maximum Likelihood." In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, 156-163.
Hukushima, K., & Nemoto, K. (1996). "Exchange Monte Carlo Method and Application to Spin Glass Simulations." Journal of the Physical Society of Japan, 65(6), 1604-1608.
Earl, D.J., & Deem, M.W. (2005). "Parallel Tempering: Theory, Applications, and New Perspectives." Physical Chemistry Chemical Physics, 7(23), 3910-3916.
Swendsen, R.H., & Wang, J.S. (1986). "Replica Monte Carlo Simulation of Spin-Glasses." Physical Review Letters, 57(21), 2607-2609. (Original replica exchange idea)
Katzgraber, H.G., Trebst, S., Huse, D.A., & Troyer, M. (2006). "Feedback-Optimized Parallel Tempering Monte Carlo." Journal of Statistical Mechanics: Theory and Experiment, 2006(03), P03018.

Temperature Ladder Design:

Kone, A., & Kofke, D.A. (2005). "Selection of Temperature Intervals for Parallel-Tempering Simulations." Journal of Chemical Physics, 122(20), 206101.
Rathore, N., Chopra, M., & de Pablo, J.J. (2005). "Optimal Allocation of Replicas in Parallel Tempering Simulations." Journal of Chemical Physics, 122(2), 024111.

Books:

Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press. (Chapter 7: Parallel and Interacting Chains)
Liu, J.S. (2001). Monte Carlo Strategies in Scientific Computing. Springer. (Chapter 6: Simulated Tempering and Related Methods)

Software:

emcee: Python, "The MCMC Hammer" (versions before 3.0 included parallel tempering as PTSampler; it has since moved to ptemcee)
PyMC: Can implement PT with custom samplers
ptemcee: Parallel-tempered ensemble MCMC in Python (a fork of emcee's PTSampler)
MrBayes: Phylogenetics software with Metropolis-Coupled MCMC (MC³)
