Interactive visualization of evidence computation and posterior sampling via nested sampling.
Part 3 of the MCMC Samplers Series | See also: Metropolis–Hastings · HMC
Introduction to Nested Sampling
Nested Sampling is a computational approach to Bayesian inference introduced by John Skilling (2004, 2006).
Unlike traditional MCMC methods that sample from the posterior distribution, nested sampling is primarily designed
to compute the Bayesian evidence (marginal likelihood) \(\mathcal{Z}\), with posterior samples
obtained as a byproduct.
The key innovation is transforming the multi-dimensional integration problem into a one-dimensional problem by
evolving samples from regions of low likelihood to high likelihood. This makes nested sampling particularly
effective for model comparison and exploring multimodal posteriors.
Why Nested Sampling?
Evidence computation: Directly calculates \(\mathcal{Z} = \int \mathcal{L}(\mathbf{\theta}) \pi(\mathbf{\theta}) d\mathbf{\theta}\) for model comparison
Multimodal posteriors: Naturally explores multiple modes without mode-hopping difficulties
No burn-in: Every point contributes to the evidence calculation; no samples are discarded
Automatic termination: Algorithm stops when the remaining evidence contribution is negligible
Posterior samples: Importance-weighted samples from the posterior are obtained automatically
The Key Insight: Instead of sampling directly from the posterior \(\pi(\mathbf{\theta}|\mathbf{d})\),
nested sampling samples from constrained priors with increasing likelihood thresholds. By tracking the shrinking
prior volume at each likelihood level, we can numerically integrate to compute the evidence and generate posterior samples.
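In symbols, define the prior volume enclosed by the likelihood contour at level \(\lambda\) as \(X(\lambda) = \int_{\mathcal{L}(\mathbf{\theta}) > \lambda} \pi(\mathbf{\theta})\, d\mathbf{\theta}\). The evidence then collapses to a one-dimensional integral over the unit interval,
\[
\mathcal{Z} = \int_0^1 \mathcal{L}(X)\, dX \approx \sum_i \mathcal{L}_i \, \Delta X_i ,
\]
where \(\mathcal{L}(X)\) is the likelihood level enclosing prior volume \(X\). The sum on the right is exactly the quadrature rule the algorithm accumulates from its dead points.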
The Nested Sampling Algorithm
Nested Sampling was introduced by Skilling (2004, 2006) for calculating the Bayesian evidence.
The algorithm maintains a set of "live points" sampled from the prior, progressively replacing the point
with the lowest likelihood while shrinking the prior volume.
1. Discard: Remove the live point with the lowest likelihood \(\mathcal{L}_i^*\); record it as a dead point with estimated prior volume \(X_i \approx \exp(-i/N)\) and evidence contribution \(\mathcal{L}_i^* \, \Delta X_i\)
2. Replace: Replace the discarded point by sampling from the prior with constraint \(\mathcal{L}(\mathbf{\theta}) > \mathcal{L}_i^*\)
3. Terminate: Stop when the estimated remaining evidence contribution is negligible
(typically when \(\mathcal{L}_{\text{max}} \times X < \epsilon \times \mathcal{Z}_{\text{current}}\))
Key Challenge: Step 2 requires sampling from a constrained prior (the hardest part).
This demo uses a simple rejection sampling approach: draw from the prior and reject if \(\mathcal{L} \leq \mathcal{L}_{\text{min}}\).
Real implementations use MCMC within nested sampling (e.g., slice sampling, Metropolis) or advanced techniques
like MultiNest's ellipsoidal decomposition or dynesty's dynamic nested sampling.
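To make the loop concrete, here is a minimal Python sketch in the spirit of this demo's rejection-based approach. The 2D setup (a standard Gaussian likelihood over a uniform prior box), the live-point count, and all function names are illustrative assumptions, not the demo's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (assumed for this sketch): uniform prior on the box
# [-5, 5]^2 and a standard 2D Gaussian likelihood.
LO, HI = -5.0, 5.0

def log_likelihood(theta):
    """Vectorised log-likelihood; theta has shape (..., 2)."""
    return -0.5 * np.sum(theta**2, axis=-1)

def sample_constrained(logL_min, batch=2000, max_batches=500):
    """Rejection sampling: draw from the prior, keep the first point with logL > logL_min."""
    for _ in range(max_batches):
        thetas = rng.uniform(LO, HI, size=(batch, 2))
        ok = np.flatnonzero(log_likelihood(thetas) > logL_min)
        if ok.size:
            return thetas[ok[0]]
    raise RuntimeError("rejection sampling stalled; likelihood constraint too tight")

N = 200                                            # number of live points
live = rng.uniform(LO, HI, size=(N, 2))
live_logL = log_likelihood(live)

log_X_prev, log_Z = 0.0, -np.inf                   # X_0 = 1, Z = 0
dead = []                                          # (theta, logL, log dX) per iteration

for i in range(1, 5000):
    worst = int(np.argmin(live_logL))              # step 1: lowest-likelihood live point
    logL_star = live_logL[worst]
    log_X = -i / N                                 # expected shrinkage X_i ~ exp(-i/N)
    log_dX = np.log(np.exp(log_X_prev) - np.exp(log_X))
    log_Z = np.logaddexp(log_Z, logL_star + log_dX)        # accumulate evidence
    dead.append((live[worst].copy(), logL_star, log_dX))

    live[worst] = sample_constrained(logL_star)            # step 2: constrained replacement
    live_logL[worst] = log_likelihood(live[worst])
    log_X_prev = log_X

    # step 3: stop when L_max * X is negligible relative to the accumulated Z
    if np.max(live_logL) + log_X < np.log(1e-3) + log_Z:
        break

print(f"log Z ~ {log_Z:.3f} after {i} iterations ({len(dead)} dead points)")
```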
Target Distributions
This demonstration provides four target distributions ranging from simple to challenging,
illustrating different aspects of sampler performance:
Bivariate Gaussian (Default)
A standard correlated 2D Gaussian with correlation coefficient \(\rho = 0.8\):
This distribution has elliptical contours aligned with the correlation structure.
With strong correlation (\(\rho = 0.8\)), random-walk proposals struggle because they explore
axis-by-axis when parameters should move together along the correlation direction.
Rosenbrock's Banana Distribution
A transformed Gaussian that creates a curved, banana-shaped density (Haario et al., 1999):
This distribution is strongly correlated along a curved manifold, making it difficult for
random-walk samplers to explore efficiently. The "banana" shape arises from the nonlinear
transformation \(x_2 - x_1^2\), creating regions where standard proposals are inefficient.
Neal's Funnel Distribution
A hierarchical model that exhibits strong scale variations (Neal, 2003):
The joint density is \(\pi(x_1, x_2) \propto \exp(-x_1^2/18 - x_2^2/(2e^{2x_1}))\).
This creates a "funnel" where the scale of \(x_2\) depends exponentially on \(x_1\),
challenging for fixed-width proposals. Common in hierarchical Bayesian models with variance parameters.
Bimodal Gaussian Mixture
A mixture of two well-separated Gaussian modes.
Why this matters for nested sampling: This is where nested sampling truly shines!
MCMC methods (MH, HMC) struggle with mode-hopping—they often get stuck in one mode and never find the other.
Nested sampling naturally explores both modes because it samples from the entire prior and progressively
contracts. Watch the posterior weights plot—you should see two peaks, one for each mode!
Why these distributions? The Gaussian provides a baseline. The banana and funnel test
challenging geometry (the same targets as the MH and HMC demos, for comparison). The bimodal distribution is the star here: it
demonstrates nested sampling's key advantage, naturally finding and weighing multiple modes without getting trapped.
Try running MH or HMC on the bimodal distribution and compare; they will likely find only one mode!
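For reference, here are log-density sketches of the four targets in the notation above. Only the funnel's density and the Gaussian's \(\rho = 0.8\) are stated explicitly in this section, so the Gaussian variances, the banana curvature, and the bimodal mode separation below are plausible assumptions rather than the demo's exact parameters.

```python
import numpy as np

def log_gaussian(x1, x2, rho=0.8):
    """Correlated bivariate Gaussian; unit marginal variances assumed, rho = 0.8 as stated."""
    return -0.5 * (x1**2 - 2 * rho * x1 * x2 + x2**2) / (1 - rho**2)

def log_banana(x1, x2, b=0.5):
    """Rosenbrock-style banana: a Gaussian in (x1, x2 - b*x1^2); b is an assumed curvature."""
    return -0.5 * (x1**2 / 4.0 + (x2 - b * x1**2) ** 2)

def log_funnel(x1, x2):
    """Neal-style funnel as written above: pi ∝ exp(-x1^2/18 - x2^2 / (2 e^{2 x1}))."""
    return -x1**2 / 18.0 - x2**2 / (2.0 * np.exp(2.0 * x1))

def log_bimodal(x1, x2, mu=3.0):
    """Equal-weight mixture of two unit Gaussians at (±mu, 0); mu is an assumed separation."""
    logp1 = -0.5 * ((x1 - mu) ** 2 + x2**2)
    logp2 = -0.5 * ((x1 + mu) ** 2 + x2**2)
    return np.logaddexp(logp1, logp2) - np.log(2.0)
```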
Simulation Controls
Nested sampling has one key parameter: the number of live points \(N\). More live points
improve accuracy but slow down computation. The algorithm automatically terminates when the remaining
evidence contribution becomes negligible.
More points → better evidence estimate, but slower. Typically 50-500.
Delay between iterations. Slower helps visualize the progression.
The bimodal distribution showcases nested sampling's key advantage: naturally finding multiple modes.
Note on implementation: This demo uses simple rejection sampling to generate new live points
(draw from prior, reject if likelihood too low). Real implementations use MCMC methods (slice sampling,
Metropolis) to efficiently sample from the constrained prior, especially important in high dimensions or
for complex likelihood contours.
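To illustrate the contrast, below is a hedged sketch of the MCMC-within-nested-sampling idea: start a short random-walk Metropolis chain from a randomly chosen surviving live point and accept only moves that stay inside the (uniform) prior and above the current likelihood threshold. The step size, chain length, and helper names are assumptions for illustration; production codes (slice sampling, MultiNest, dynesty) are far more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(1)

LO, HI = -5.0, 5.0                      # same assumed uniform prior box as the sketch above

def in_prior(theta):
    return np.all((theta >= LO) & (theta <= HI))

def constrained_mcmc_step(live, live_logL, worst, log_likelihood,
                          n_steps=20, step=0.5):
    """Replace live[worst] with a draw from the prior restricted to logL > logL_min.

    For a uniform prior, a symmetric random-walk proposal accepted whenever it
    stays inside the prior box and above the likelihood threshold is a valid
    Metropolis sampler for the constrained (uniform) target.
    """
    logL_min = live_logL[worst]
    # Start from a randomly chosen *other* live point, which already satisfies the constraint.
    others = [j for j in range(len(live)) if j != worst]
    theta = np.array(live[rng.choice(others)], dtype=float)
    logL = log_likelihood(theta)
    for _ in range(n_steps):
        prop = theta + step * rng.standard_normal(theta.shape)
        prop_logL = log_likelihood(prop)
        if in_prior(prop) and prop_logL > logL_min:
            theta, logL = prop, prop_logL            # accept; otherwise keep the current point
    return theta, logL
```

In the rejection-based sketch above, a function like this could stand in for the constrained draw; it returns both the new point and its log-likelihood.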
Nested Sampling Visualization
The main plot shows the joint posterior \(\pi(\theta_1, \theta_2)\) with darker regions indicating higher probability.
Blue points show the current live points (actively sampling), and
red points show dead points (discarded) that contribute
to the evidence calculation. Watch how live points progressively contract toward higher-likelihood regions.
Joint Posterior Distribution π(θ₁, θ₂)
Live points sample the constrained prior uniformly, naturally exploring all high-density regions
(darker areas). For multimodal distributions, you'll see live points in multiple modes simultaneously.
Marginal Posterior Distribution of θ₁
Importance-weighted histogram of θ₁ samples. Each dead point contributes
proportionally to its posterior weight \(w_i = \mathcal{L}_i \Delta X_i / \mathcal{Z}\).
Approximates the true marginal \(\pi(\theta_1 | \text{data})\).
Marginal Posterior Distribution of θ₂
Importance-weighted histogram of θ₂ samples. Properly accounts for the fact that
different samples contribute differently to the posterior. This is crucial for nested sampling
as raw dead points are NOT posterior samples without weighting.
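As a sketch of how such a weighted marginal could be assembled with NumPy, reusing the `dead` list from the nested-sampling sketch earlier (an illustration, not the demo's plotting code):

```python
import numpy as np

# Reuses the `dead` list of (theta, logL, log_dX) produced by the sketch above.
thetas = np.array([theta for theta, _, _ in dead])
log_w = np.array([logL + log_dX for _, logL, log_dX in dead])
log_w -= log_w.max()                        # subtract the max for numerical stability
w = np.exp(log_w)
w /= w.sum()                                # normalised weights: w_i = L_i * dX_i / Z

# Importance-weighted marginal histogram of theta_1 (and likewise for theta_2)
counts, edges = np.histogram(thetas[:, 0], bins=40, weights=w, density=True)
```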
Current Iteration Details
Click "Start Sampling" or "Single Iteration" to begin nested sampling.
Nested Sampling Diagnostics
Unlike MCMC diagnostics (which focus on chain mixing and autocorrelation), nested sampling requires monitoring the prior volume shrinkage
and evidence evolution. These plots show how the algorithm progressively constrains the prior
and accumulates evidence.
What to monitor in nested sampling:
Prior volume X(i): Should shrink exponentially as \(X_i \approx \exp(-i/N)\)
Log-likelihood evolution: Should increase as we move to higher-likelihood regions
Evidence accumulation: log(Z) should stabilize when most evidence is captured
Posterior weights: Show which samples contribute most to the posterior
Prior Volume Shrinkage: log(X) vs Iteration
Prior volume X shrinks exponentially. The slope should be approximately -1/N.
Linear on log scale confirms proper shrinkage rate.
Log-Likelihood Evolution
Minimum log-likelihood at each iteration. Should increase monotonically as we sample
from progressively more constrained priors.
Evidence Evolution: log(Z) vs Iteration
Cumulative log-evidence. Should plateau when remaining prior volume × max likelihood becomes negligible.
The stabilization indicates convergence.
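These three curves can be reconstructed from a finished run. Here is a sketch reusing `dead` and `N` from the nested-sampling sketch earlier (again an illustration, not the demo's code):

```python
import numpy as np

# Reuses `dead` and N from the nested-sampling sketch above.
n_iter = len(dead)
iters = np.arange(1, n_iter + 1)
log_X = -iters / N                                     # expected shrinkage: slope -1/N
log_L = np.array([logL for _, logL, _ in dead])        # lowest log-likelihood removed each step
log_contrib = np.array([logL + log_dX for _, logL, log_dX in dead])

# Cumulative log-evidence log Z_k: should flatten out once most of the mass is captured
log_Z_cum = np.logaddexp.accumulate(log_contrib)
```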
Posterior Analysis
Understanding Posterior Weights
This is the most important diagnostic for nested sampling. It shows you which dead points
actually matter for your posterior inference.
Each dead point \(i\) carries posterior weight \(w_i = \mathcal{L}_i \Delta X_i / \mathcal{Z}\), the product of two competing factors:
\(\mathcal{L}_i\): Likelihood (how well the parameters fit the data) — increases with iteration
\(\Delta X_i\): Prior volume width — decreases exponentially with iteration
What the plot shows you:
📊 Early iterations (left side):
• Large prior volume (\(\Delta X_i\) is big) ✓
• BUT low likelihood (\(\mathcal{L}_i\) is small) ✗
• → Low weights (these points don't contribute much to posterior)
🎯 Middle iterations (peak of the curve):
• Moderate prior volume (\(\Delta X_i\) still reasonable) ✓
• High likelihood (\(\mathcal{L}_i\) is large) ✓
• → HIGH WEIGHTS — These are your most important samples!
• These points are near the posterior mode and dominate your inference
📉 Late iterations (right side):
• Very high likelihood (\(\mathcal{L}_i\) is huge) ✓
• BUT tiny prior volume (\(\Delta X_i \approx 0\)) ✗
• → Low weights (not much prior volume left to contribute)
Practical interpretation:
Sharp, narrow peak: Posterior is concentrated; only a few iterations matter.
May need more live points for accuracy.
Broad, smooth peak: Many iterations contribute; good sampling of posterior.
ESS will be closer to number of iterations.
Multiple peaks: Multimodal posterior! Each peak corresponds to exploring a different mode.
Peak location: Shows when you've transitioned from prior-dominated to likelihood-dominated region.
Posterior Weights vs Iteration
What to look for: The peak shows the "sweet spot" where nested sampling finds the bulk of the posterior mass.
This is where \(\mathcal{L}_i\) is high enough to matter, but \(\Delta X_i\) hasn't shrunk to zero yet.
Before the peak, you're exploring low-likelihood regions (prior). After the peak, you're refining details
but with negligible volume contribution.
💡 Key takeaway:
Unlike MCMC where every sample contributes equally, nested sampling naturally identifies which samples
matter most. The posterior weights plot reveals this directly—you can literally see which ~10-30% of
iterations carry most of the posterior information!
Both early iterations (low likelihood) and late iterations (tiny prior volume) receive low weights; the information is concentrated in between, where the likelihood is already high but the prior volume hasn't shrunk to negligibility yet.
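One way to quantify that takeaway for a particular run is to sort the normalised weights and count how many iterations are needed to cover, say, 90% of the posterior mass (a sketch reusing `w` from the weighting snippet above):

```python
import numpy as np

# Reuses the normalised weights w from the weighting snippet above.
order = np.argsort(w)[::-1]                  # heaviest weights first
cum = np.cumsum(w[order])
n90 = int(np.searchsorted(cum, 0.90)) + 1    # iterations covering 90% of the posterior mass
print(f"{n90} of {len(w)} iterations ({n90 / len(w):.0%}) carry 90% of the posterior weight")
```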
Performance Metrics
The metrics panel updates live during sampling and reports: Iterations, Log Evidence, Dead Points, and Effective Sample Size.
Effective Sample Size for Nested Sampling
In nested sampling, ESS is computed from the variance of the normalized importance weights,
following standard importance sampling theory (Kish, 1965; Kong, 1992):
\[
\mathrm{ESS} = \frac{\left(\sum_i \tilde{w}_i\right)^2}{\sum_i \tilde{w}_i^2} = \frac{1}{\sum_i w_i^2},
\]
where \(w_i = \mathcal{L}_i \Delta X_i / \mathcal{Z}\) are the normalized posterior weights satisfying \(\sum_i w_i = 1\),
so \(\tilde{w}_i = w_i\). This measures the spread of the weights: when all weights are equal (\(w_i = 1/n\)),
ESS \(= n\); when the weights are highly unequal (a few samples dominate), ESS \(\ll n\).
Interpretation: ESS quantifies how many equally-weighted samples would provide the same
Monte Carlo variance as the weighted sample. This is the standard diagnostic for importance sampling
(Liu, 2001; Chopin & Ridgway, 2017) and is reported by nested sampling implementations like dynesty
(Speagle, 2020) and NestedFit.
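In code, this ESS is a one-liner once the normalised weights are available (reusing `w` from the weighting snippet above; a sketch of Kish's formula, not dynesty's implementation):

```python
import numpy as np

# Reuses the normalised weights w (summing to 1) from the weighting snippet above.
ess = 1.0 / np.sum(w**2)     # Kish ESS: equals n when all w_i = 1/n, much smaller when a few dominate
print(f"ESS = {ess:.1f} from {len(w)} weighted samples")
```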
Tuning the Number of Live Points: The number of live points \(N\) controls the resolution
of the prior volume shrinkage. More live points give more accurate evidence estimates and better posterior
resolution, but increase computational cost. A practical rule: \(N \gtrsim 25d\) for \(d\) dimensions
(Skilling, 2006). For multimodal problems, increase \(N\) to ensure adequate sampling of each mode.
Advanced implementations like MultiNest use ellipsoidal decomposition to adaptively allocate live points
across modes.
Strengths & Limitations of Nested Sampling
✓ Strengths
Evidence computation: Directly calculates \(\mathcal{Z}\) for model comparison (Bayes factors)
Multimodal posteriors: Naturally explores all modes without mode-hopping difficulties
No burn-in: Every point contributes; no samples wasted
Automatic termination: Stops when evidence is accurately determined
Parallel-friendly: Multiple live points can be updated independently
✗ Limitations
Constrained sampling challenge: Sampling from prior with \(\mathcal{L} > \mathcal{L}_{\text{min}}\) is hard
Not ideal for posteriors alone: If you only want posterior samples, MCMC is often more efficient
Computational cost: Can be expensive, especially with rejection sampling
Curse of dimensionality: Rejection sampling becomes impractical in high dimensions (\(d > 20\))
Implementation complexity: Advanced variants (MultiNest, PolyChord) are complex
When to use Nested Sampling:
Model comparison: When you need Bayes factors or posterior odds ratios
Multimodal posteriors: When the posterior has multiple well-separated modes
Phase transitions: When the posterior has sharp features or phase transition behavior
Low-to-moderate dimensions: Most effective for \(d \lesssim 20\) (with basic rejection) or \(d \lesssim 100\) (with advanced methods)
Comparison: Nested Sampling vs. MCMC
Nested sampling and MCMC methods serve different primary purposes:
Key Differences:
Property                  | MCMC (MH/HMC)                           | Nested Sampling
--------------------------|-----------------------------------------|-------------------------
Primary goal              | Posterior sampling                      | Evidence computation
Computes \(\mathcal{Z}\)? | No (requires thermodynamic integration) | Yes (directly)
Multimodality             | Struggles (mode-hopping is hard)        | Handles naturally
Burn-in needed?           | Yes                                     | No
Parallelization           | Sequential chain                        | Embarrassingly parallel
Best for                  | Posterior inference                     | Model comparison
Practical recommendation: Use nested sampling when you need to compare models
(compute Bayes factors) or when the posterior is multimodal. Use HMC/NUTS for high-dimensional
unimodal posterior inference. Use Metropolis–Hastings when gradients aren't available and
dimensionality is low.
References & Further Reading
Nested Sampling Papers:
Skilling, J. (2004). "Nested Sampling." In American Institute of Physics Conference Series,
Vol. 735 (eds. R. Fischer, R. Preuss, & U. V. Toussaint), 395-405.
Skilling, J. (2006). "Nested Sampling for General Bayesian Computation."
Bayesian Analysis, 1(4), 833-860.
Feroz, F., Hobson, M.P., & Bridges, M. (2009). "MultiNest: An Efficient and Robust Bayesian Inference Tool
for Cosmology and Particle Physics." Monthly Notices of the Royal Astronomical Society, 398(4), 1601-1614.
Speagle, J.S. (2020). "dynesty: A Dynamic Nested Sampling Package for Estimating Bayesian Posteriors and Evidences."
Monthly Notices of the Royal Astronomical Society, 493(3), 3132-3158.
Ashton, G., et al. (2022). "Nested Sampling for Physical Scientists."
Nature Reviews Methods Primers, 2, 39.
Importance Sampling & ESS Theory:
Kish, L. (1965). Survey Sampling. New York: Wiley. (Original ESS formula for weighted samples)
Kong, A. (1992). "A Note on Importance Sampling Using Standardized Weights."
Technical Report, University of Chicago.
Liu, J.S. (2001). Monte Carlo Strategies in Scientific Computing. New York: Springer.
(Chapter 2: Importance Sampling)
Chopin, N., & Ridgway, J. (2017). "Leave Pima Indians Alone: Binary Regression as a Benchmark for Bayesian Computation."
Statistical Science, 32(1), 64-87. (Discussion of ESS for importance sampling)
Books:
Sivia, D.S., & Skilling, J. (2006). Data Analysis: A Bayesian Tutorial (2nd ed.). Oxford University Press.
Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press.
Trotta, R. (2008). "Bayes in the Sky: Bayesian Inference and Model Selection in Cosmology."
Contemporary Physics, 49(2), 71-104. (Review with nested sampling applications)
Software Implementations:
dynesty: dynesty.readthedocs.io - Pure Python, dynamic nested sampling with excellent documentation (reports ESS)