Nested Sampling

Interactive visualization of evidence computation and posterior sampling via nested sampling
Part 3 of the MCMC Samplers Series | See also: Metropolis–Hastings · HMC

Introduction to Nested Sampling

Nested Sampling is a computational approach to Bayesian inference introduced by John Skilling (2004, 2006). Unlike traditional MCMC methods that sample from the posterior distribution, nested sampling is primarily designed to compute the Bayesian evidence (marginal likelihood) \(\mathcal{Z}\), with posterior samples obtained as a byproduct.

The key innovation is transforming the multi-dimensional integration problem into a one-dimensional problem by evolving samples from regions of low likelihood to high likelihood. This makes nested sampling particularly effective for model comparison and exploring multimodal posteriors.

Why Nested Sampling?
  • Evidence computation: Directly calculates \(\mathcal{Z} = \int \mathcal{L}(\mathbf{\theta}) \pi(\mathbf{\theta}) d\mathbf{\theta}\) for model comparison
  • Multimodal posteriors: Naturally explores multiple modes without mode-hopping difficulties
  • No burn-in: Every point contributes to the evidence calculation; no samples are wasted on burn-in
  • Automatic termination: Algorithm stops when the remaining evidence contribution is negligible
  • Posterior samples: Importance-weighted samples from the posterior are obtained automatically
The Key Insight: Instead of sampling directly from the posterior \(\pi(\mathbf{\theta}|\mathbf{d})\), nested sampling samples from constrained priors with increasing likelihood thresholds. By tracking the shrinking prior volume at each likelihood level, we can numerically integrate to compute the evidence and generate posterior samples.

The Nested Sampling Algorithm

Nested Sampling was introduced by Skilling (2004, 2006) for calculating the Bayesian evidence. The algorithm maintains a set of "live points" sampled from the prior, progressively replacing the point with the lowest likelihood while shrinking the prior volume.

The Evidence Integral:

Bayes' theorem in terms of evidence:

$$\pi(\mathbf{\theta}|\mathbf{d}) = \frac{\mathcal{L}(\mathbf{\theta})\pi(\mathbf{\theta})}{\mathcal{Z}}$$

where the evidence is:

$$\mathcal{Z} = \int \mathcal{L}(\mathbf{\theta}) \pi(\mathbf{\theta}) d\mathbf{\theta}$$

Nested sampling rewrites this integral in terms of the prior volume (prior mass) \(X(\lambda) = \int_{\mathcal{L}(\mathbf{\theta}) > \lambda} \pi(\mathbf{\theta})\, d\mathbf{\theta}\) enclosed by the likelihood contour \(\mathcal{L}(\mathbf{\theta}) = \lambda\). Writing \(\mathcal{L}(X)\) for the inverse of this monotonic relation, the evidence becomes a one-dimensional integral:

$$\mathcal{Z} = \int_0^1 \mathcal{L}(X) dX$$
Nested Sampling Algorithm:
  1. Initialize: Draw \(N\) live points \(\{\mathbf{\theta}_i\}\) from the prior \(\pi(\mathbf{\theta})\). Set \(X_0 = 1\) (entire prior volume).
  2. Iterate: At iteration \(i\):
    • Find the live point with lowest likelihood: \(\mathcal{L}_i^* = \min_j \mathcal{L}(\mathbf{\theta}_j)\)
    • Save this point as a sample: \((\mathbf{\theta}_i^*, \mathcal{L}_i^*, X_i)\)
    • Estimate prior volume shrinkage: \(X_i \approx t_i X_{i-1}\), where \(t_i \sim \text{Beta}(N, 1)\) is distributed as the largest of \(N\) uniform draws; since \(\mathbb{E}[\ln t_i] = -1/N\), on average \(X_i \approx e^{-i/N}\)
    • Replace the discarded point by sampling from the prior with constraint \(\mathcal{L}(\mathbf{\theta}) > \mathcal{L}_i^*\)
  3. Terminate: Stop when the estimated remaining evidence contribution is negligible (typically when \(\mathcal{L}_{\text{max}} \times X < \epsilon \times \mathcal{Z}_{\text{current}}\))
  4. Compute evidence: Numerical integration: $$\mathcal{Z} \approx \sum_{i=1}^{M} \mathcal{L}_i^* \Delta X_i$$ where \(\Delta X_i = X_{i-1} - X_i\) (simple rule) or \(\Delta X_i = \tfrac{1}{2}(X_{i-1} - X_{i+1})\) (trapezoidal rule)
  5. Extract posterior: Samples \(\{\mathbf{\theta}_i^*\}\) with importance weights \(w_i = \mathcal{L}_i^* \Delta X_i / \mathcal{Z}\)
Key Challenge: Step 2 requires sampling from a constrained prior (the hardest part). This demo uses a simple rejection sampling approach: draw from the prior and reject if \(\mathcal{L} \leq \mathcal{L}_{\text{min}}\). Real implementations use MCMC within nested sampling (e.g., slice sampling, Metropolis) or advanced techniques like MultiNest's ellipsoidal decomposition or dynesty's dynamic nested sampling.
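
To make the algorithm above concrete, here is a minimal Python sketch in the spirit of this demo: it uses the same rejection-sampling shortcut (draw from the prior until the likelihood constraint is met) and the deterministic shrinkage estimate \(\log X_i \approx -i/N\) rather than sampling \(t_i\). The 2-D Gaussian likelihood, the uniform prior box, and all names (`nested_sampling`, `sample_prior`, and so on) are illustrative assumptions, not the demo's actual code.

```python
import numpy as np

def log_likelihood(theta):
    """Illustrative target: standard 2-D Gaussian log-density (up to a constant)."""
    return -0.5 * float(np.sum(theta**2))

def sample_prior(rng):
    """One draw from a uniform prior on the box [-5, 5]^2 (an assumption of this sketch)."""
    return rng.uniform(-5.0, 5.0, size=2)

def nested_sampling(n_live=100, max_iter=2000, tol=1e-2, seed=0):
    rng = np.random.default_rng(seed)

    # Step 1: draw N live points from the prior
    live = np.array([sample_prior(rng) for _ in range(n_live)])
    live_logL = np.array([log_likelihood(t) for t in live])

    dead_theta, dead_logL, dead_logX = [], [], []
    logZ = -np.inf
    logX_prev = 0.0                                 # log X_0 = log 1

    for i in range(1, max_iter + 1):
        # Step 2a: find and record the worst live point
        worst = int(np.argmin(live_logL))
        logL_star = live_logL[worst]

        # Step 2b: deterministic shrinkage estimate, E[log X_i] = -i / N
        logX = -i / n_live
        dX = np.exp(logX_prev) - np.exp(logX)       # ΔX_i = X_{i-1} - X_i
        logZ = np.logaddexp(logZ, logL_star + np.log(dX))

        dead_theta.append(live[worst].copy())
        dead_logL.append(logL_star)
        dead_logX.append(logX)
        logX_prev = logX

        # Step 2c: replace the discarded point by rejection sampling from the
        # constrained prior (slow as X shrinks -- exactly the weakness noted above)
        while True:
            candidate = sample_prior(rng)
            cand_logL = log_likelihood(candidate)
            if cand_logL > logL_star:
                break
        live[worst], live_logL[worst] = candidate, cand_logL

        # Step 3: terminate when L_max * X_i < tol * Z_current
        if np.max(live_logL) + logX < np.log(tol) + logZ:
            break

    # add the final contribution of the surviving live points, spread over X_final
    logZ = np.logaddexp(logZ, np.logaddexp.reduce(live_logL) + logX_prev - np.log(n_live))

    return np.array(dead_theta), np.array(dead_logL), np.array(dead_logX), logZ

theta_dead, logL_dead, logX_dead, logZ = nested_sampling()
print(f"estimated log Z = {logZ:.3f}  ({len(theta_dead)} dead points)")
```

The returned dead-point arrays (parameters, \(\log\mathcal{L}\), \(\log X\)) are reused in the later sketches on this page for the weighted marginals, diagnostics, and ESS.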

Target Distributions

This demonstration provides four target distributions ranging from simple to challenging, illustrating different aspects of sampler performance:

Bivariate Gaussian (Default)

A standard correlated 2D Gaussian with correlation coefficient \(\rho = 0.8\):

$$\pi(x_1, x_2) \propto \exp\left(-\frac{1}{2(1-\rho^2)}(x_1^2 - 2\rho x_1 x_2 + x_2^2)\right)$$

This distribution has elliptical contours aligned with the correlation structure. With strong correlation (\(\rho = 0.8\)), isotropic random-walk proposals struggle: the step size is limited by the narrow direction of the ellipse, so the chain diffuses slowly along the long correlation direction, where the parameters need to move together.

Rosenbrock's Banana Distribution

A transformed Gaussian that creates a curved, banana-shaped density (Haario et al., 1999):

$$\pi(x_1, x_2) \propto \exp\left(-\frac{1}{200}(x_1^2 + 100(x_2 - x_1^2)^2)\right)$$

This distribution is strongly correlated along a curved manifold, making it difficult for random-walk samplers to explore efficiently. The "banana" shape arises from the nonlinear transformation \(x_2 - x_1^2\), creating regions where standard proposals are inefficient.

Neal's Funnel Distribution

A hierarchical model that exhibits strong scale variations (Neal, 2003):

$$\begin{aligned} x_1 &\sim \mathcal{N}(0, 3^2) \\ x_2 \mid x_1 &\sim \mathcal{N}(0, \exp(x_1)^2) \end{aligned}$$

The joint density is \(\pi(x_1, x_2) \propto \exp(-x_1^2/18 - x_1 - x_2^2/(2e^{2x_1}))\), where the extra \(-x_1\) term comes from the \(x_1\)-dependent normalization of the conditional. This creates a "funnel" in which the scale of \(x_2\) depends exponentially on \(x_1\), which is challenging for fixed-width proposals. This geometry is common in hierarchical Bayesian models with variance parameters.

🌟 Bimodal Gaussian Mixture (Showcases Nested Sampling!)

Two well-separated Gaussian modes demonstrating multimodal inference:

$$\pi(\theta_1, \theta_2) = 0.4 \cdot \mathcal{N}((-2,-2), 0.8^2I) + 0.6 \cdot \mathcal{N}((+2,+2), 0.8^2I)$$

Why this matters for nested sampling: This is where nested sampling truly shines! MCMC methods (MH, HMC) struggle with mode-hopping—they often get stuck in one mode and never find the other. Nested sampling naturally explores both modes because it samples from the entire prior and progressively contracts. Watch the marginal posterior plots—you should see two peaks, one for each mode!

Why these distributions? The Gaussian provides a baseline. The banana and funnel test challenging geometry (from MCMC/HMC demos for comparison). The bimodal distribution is the star here—it demonstrates nested sampling's key advantage: naturally finding and weighing multiple modes without getting trapped. Try running MH or HMC on the bimodal distribution and compare—they'll likely only find one mode!
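
To make the four targets concrete, here is a sketch of their (unnormalized) log-densities as they might be coded for a demo like this one. The constants follow the formulas above; the function names and the mixture bookkeeping are assumptions. Any of these can be plugged into the nested-sampling sketch above as the log-likelihood over a suitable prior box.

```python
import numpy as np

RHO = 0.8  # correlation coefficient for the bivariate Gaussian

def log_gaussian(x1, x2, rho=RHO):
    # correlated bivariate Gaussian (unnormalized)
    return -(x1**2 - 2 * rho * x1 * x2 + x2**2) / (2 * (1 - rho**2))

def log_banana(x1, x2):
    # Rosenbrock-style "banana" density
    return -(x1**2 + 100.0 * (x2 - x1**2) ** 2) / 200.0

def log_funnel(x1, x2):
    # Neal's funnel: x1 ~ N(0, 3^2), x2 | x1 ~ N(0, exp(x1)^2);
    # the -x1 term is the conditional's x1-dependent normalization
    return -x1**2 / 18.0 - x1 - x2**2 / (2.0 * np.exp(2.0 * x1))

def log_bimodal(x1, x2):
    # two-component Gaussian mixture, weights 0.4 / 0.6, std 0.8, modes at (-2,-2) and (+2,+2)
    var = 0.8**2
    la = np.log(0.4) - ((x1 + 2) ** 2 + (x2 + 2) ** 2) / (2 * var)
    lb = np.log(0.6) - ((x1 - 2) ** 2 + (x2 - 2) ** 2) / (2 * var)
    return np.logaddexp(la, lb)
```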

Simulation Controls

Nested sampling has one key parameter: the number of live points \(N\). More live points improve accuracy but slow down computation. The algorithm automatically terminates when the remaining evidence contribution becomes negligible.

  • Number of live points: more points → better evidence estimate, but slower. Typically 50-500.
  • Iteration delay: delay between iterations; slower helps visualize the progression.
  • Target distribution: the bimodal distribution showcases nested sampling's key advantage of naturally finding multiple modes.
Note on implementation: This demo uses simple rejection sampling to generate new live points (draw from prior, reject if likelihood too low). Real implementations use MCMC methods (slice sampling, Metropolis) to efficiently sample from the constrained prior, especially important in high dimensions or for complex likelihood contours.
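
For contrast with the rejection approach, here is a hedged sketch of what an MCMC-within-nested-sampling replacement step might look like: a short Metropolis random walk started from a randomly chosen surviving live point, accepting only moves that stay above the likelihood threshold and satisfy the prior ratio. The function signature, step size, and number of steps are assumptions, not any particular package's implementation.

```python
import numpy as np

def constrained_metropolis_replace(live, live_logL, worst, log_likelihood,
                                   log_prior, rng, n_steps=20, step=0.5):
    """Return a new point with log-likelihood above live_logL[worst]."""
    logL_star = live_logL[worst]

    # start the walk from a random *surviving* live point (already above the threshold)
    others = [j for j in range(len(live)) if j != worst]
    current = live[rng.choice(others)].copy()
    current_logL = log_likelihood(current)

    for _ in range(n_steps):
        proposal = current + step * rng.standard_normal(current.shape)
        prop_logL = log_likelihood(proposal)
        # Metropolis rule on the prior, restricted to the likelihood-constrained region
        if prop_logL > logL_star and \
           np.log(rng.uniform()) < log_prior(proposal) - log_prior(current):
            current, current_logL = proposal, prop_logL

    return current, current_logL
```

Swapping this in for the rejection loop of the earlier sketch keeps the per-iteration cost roughly constant instead of growing as the constrained region shrinks.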

Nested Sampling Visualization

The main plot shows the joint posterior \(\pi(\theta_1, \theta_2)\) with darker regions indicating higher probability. Blue points show the current live points (actively sampling), and red points show dead points (discarded) that contribute to the evidence calculation. Watch how live points progressively contract toward higher-likelihood regions.

Joint Posterior Distribution π(θ₁, θ₂)

Live points sample the constrained prior uniformly, naturally exploring all high-density regions (darker areas). For multimodal distributions, you'll see live points in multiple modes simultaneously.

Marginal Posterior Distribution of θ₁

Importance-weighted histogram of θ₁ samples. Each dead point contributes proportionally to its posterior weight \(w_i = \mathcal{L}_i \Delta X_i / \mathcal{Z}\). Approximates the true marginal \(\pi(\theta_1 | \text{data})\).

Marginal Posterior Distribution of θ₂

Importance-weighted histogram of θ₂ samples. Properly accounts for the fact that different samples contribute differently to the posterior. This is crucial for nested sampling as raw dead points are NOT posterior samples without weighting.
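
Given the dead-point record from a run (for instance the arrays returned by the earlier sketch), the importance-weighted marginals might be built as follows; the array names match that sketch and are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def posterior_weights(logL_dead, logX_dead):
    """Normalized importance weights w_i ∝ L_i ΔX_i for the dead points."""
    X = np.exp(logX_dead)
    dX = np.concatenate(([1.0 - X[0]], X[:-1] - X[1:]))   # ΔX_i = X_{i-1} - X_i
    logw = logL_dead + np.log(dX)
    w = np.exp(logw - logw.max())
    return w / w.sum()

# theta_dead, logL_dead, logX_dead come from the nested-sampling sketch above
w = posterior_weights(logL_dead, logX_dead)

# importance-weighted histogram approximates the marginal posterior of θ1
plt.hist(theta_dead[:, 0], bins=40, weights=w, density=True)
plt.xlabel(r"$\theta_1$")
plt.ylabel("marginal posterior density")
plt.show()
```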

Current Iteration Details
Click "Start Sampling" or "Single Iteration" to begin nested sampling.

Nested Sampling Diagnostics

Unlike MCMC diagnostics (which focus on chain mixing and autocorrelation), nested sampling requires monitoring the prior volume shrinkage and evidence evolution. These plots show how the algorithm progressively constrains the prior and accumulates evidence.

What to monitor in nested sampling:
  • Prior volume X(i): Should shrink exponentially as \(X_i \approx \exp(-i/N)\)
  • Log-likelihood evolution: Should increase as we move to higher-likelihood regions
  • Evidence accumulation: log(Z) should stabilize when most evidence is captured
  • Posterior weights: Show which samples contribute most to the posterior
Prior Volume Shrinkage: log(X) vs Iteration

Prior volume X shrinks exponentially; the slope of log(X) versus iteration should be approximately -1/N. A linear trend on this log scale confirms the expected shrinkage rate.

Log-Likelihood Evolution

Minimum log-likelihood at each iteration. Should increase monotonically as we sample from progressively more constrained priors.

Evidence Evolution: log(Z) vs Iteration

Cumulative log-evidence. Should plateau when remaining prior volume × max likelihood becomes negligible. The stabilization indicates convergence.
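
A sketch of these three diagnostics, computed from the dead-point arrays of the earlier nested-sampling sketch (the array names are assumptions carried over from it):

```python
import numpy as np
import matplotlib.pyplot as plt

# logL_dead, logX_dead come from the nested-sampling sketch earlier on this page
iters = np.arange(1, len(logX_dead) + 1)
X = np.exp(logX_dead)
dX = np.concatenate(([1.0 - X[0]], X[:-1] - X[1:]))
logZ_running = np.logaddexp.accumulate(logL_dead + np.log(dX))  # cumulative evidence

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(iters, logX_dead)        # should be roughly linear with slope -1/N
axes[0].set(xlabel="iteration", ylabel="log X", title="Prior volume shrinkage")
axes[1].plot(iters, logL_dead)        # monotonically increasing likelihood thresholds
axes[1].set(xlabel="iteration", ylabel=r"$\log\mathcal{L}^*$", title="Log-likelihood evolution")
axes[2].plot(iters, logZ_running)     # should plateau once most evidence is captured
axes[2].set(xlabel="iteration", ylabel=r"$\log\mathcal{Z}$", title="Evidence accumulation")
plt.tight_layout()
plt.show()
```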

Posterior Analysis

Understanding Posterior Weights

This is the most important diagnostic for nested sampling. It shows you which dead points actually matter for your posterior inference.

The Key Insight:

Each dead point has a posterior weight:

$$w_i = \frac{\mathcal{L}_i \times \Delta X_i}{\mathcal{Z}}$$

This weight is the product of two competing factors:

  • \(\mathcal{L}_i\): Likelihood (how well the parameters fit the data) — increases with iteration
  • \(\Delta X_i\): Prior volume width — decreases exponentially with iteration
What the plot shows you:

📊 Early iterations (left side):
• Large prior volume (\(\Delta X_i\) is big) ✓
• BUT low likelihood (\(\mathcal{L}_i\) is small) ✗
→ Low weights (these points don't contribute much to posterior)

🎯 Middle iterations (peak of the curve):
• Moderate prior volume (\(\Delta X_i\) still reasonable) ✓
• High likelihood (\(\mathcal{L}_i\) is large) ✓
→ HIGH WEIGHTS — These are your most important samples!
• These points are near the posterior mode and dominate your inference

📉 Late iterations (right side):
• Very high likelihood (\(\mathcal{L}_i\) is huge) ✓
• BUT tiny prior volume (\(\Delta X_i \approx 0\)) ✗
→ Low weights (not much prior volume left to contribute)

Practical interpretation:
  • Sharp, narrow peak: Posterior is concentrated; only a few iterations matter. May need more live points for accuracy.
  • Broad, smooth peak: Many iterations contribute; good sampling of posterior. ESS will be closer to number of iterations.
  • Multiple peaks: Multimodal posterior! Each peak corresponds to exploring a different mode.
  • Peak location: Shows when you've transitioned from prior-dominated to likelihood-dominated region.
Posterior Weights vs Iteration

What to look for: The peak shows the "sweet spot" where nested sampling finds the bulk of the posterior mass. This is where \(\mathcal{L}_i\) is high enough to matter, but \(\Delta X_i\) hasn't shrunk to zero yet. Before the peak, you're exploring low-likelihood regions (prior). After the peak, you're refining details but with negligible volume contribution.
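
Continuing with the assumed arrays from the earlier sketches, the weights curve itself is just the normalized product \(\mathcal{L}_i \Delta X_i\) plotted against iteration, and the "which iterations matter" question can be answered numerically:

```python
import numpy as np
import matplotlib.pyplot as plt

# w = posterior_weights(logL_dead, logX_dead), as defined in the earlier sketch
iters = np.arange(1, len(w) + 1)
plt.plot(iters, w)
plt.xlabel("iteration")
plt.ylabel(r"posterior weight $w_i$")
plt.show()

# how concentrated are the weights? fraction of iterations carrying 90% of the mass
order = np.argsort(w)[::-1]
n90 = int(np.searchsorted(np.cumsum(w[order]), 0.9)) + 1
print(f"{n90} of {len(w)} iterations carry 90% of the posterior weight")
```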

💡 Key takeaway:

Unlike MCMC where every sample contributes equally, nested sampling naturally identifies which samples matter most. The posterior weights plot reveals this directly—you can literally see which ~10-30% of iterations carry most of the posterior information!


Performance Metrics

The demo's metrics panel reports the iteration count, the running log-evidence \(\log \mathcal{Z}\), the number of dead points, and the effective sample size (ESS).

Effective Sample Size for Nested Sampling

In nested sampling, ESS is computed from the variance of the normalized importance weights, following standard importance sampling theory (Kish, 1965; Kong, 1992):

$$\text{ESS} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2} = \frac{1}{\sum_{i=1}^{n} \tilde{w}_i^2}$$

where \(w_i = \mathcal{L}_i \Delta X_i / \mathcal{Z}\) are the normalized posterior weights satisfying \(\sum_i w_i = 1\), so \(\tilde{w}_i = w_i\). ESS is inversely related to the spread of the weights: when all weights are equal (\(w_i = 1/n\)), ESS = \(n\); when the weights are highly unequal (a few samples dominate), ESS ≪ \(n\).

Interpretation: ESS quantifies how many equally-weighted samples would provide the same Monte Carlo variance as the weighted sample. This is the standard diagnostic for importance sampling (Liu, 2001; Chopin & Ridgway, 2017) and is reported by nested sampling implementations like dynesty (Speagle, 2020) and NestedFit.
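
The ESS formula above is a one-liner on the normalized weights; here it is applied to the assumed `w` array from the earlier sketches:

```python
import numpy as np

def effective_sample_size(w):
    """Kish effective sample size for importance weights: (sum w)^2 / sum w^2."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()               # ensure the weights are normalized
    return 1.0 / np.sum(w**2)

# w = posterior_weights(logL_dead, logX_dead), as in the earlier sketch
print(f"ESS = {effective_sample_size(w):.1f} out of {len(w)} dead points")
```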

Tuning the Number of Live Points: The number of live points \(N\) controls the resolution of the prior volume shrinkage. More live points give more accurate evidence estimates and better posterior resolution, but increase computational cost. A practical rule: \(N \gtrsim 25d\) for \(d\) dimensions (Skilling, 2006). For multimodal problems, increase \(N\) to ensure adequate sampling of each mode. Advanced implementations like MultiNest use ellipsoidal decomposition to adaptively allocate live points across modes.
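
For real problems one would reach for a dedicated package rather than this demo's rejection sampler. A rough sketch with dynesty (Speagle, 2020) might look like the following; it is based on dynesty's documented NestedSampler interface, but treat the details as assumptions and consult the current documentation. The prior transform maps the unit cube to the prior, and nlive is the live-point count discussed above.

```python
import dynesty

def loglike(theta):
    # same correlated 2-D Gaussian example as above
    rho = 0.8
    x1, x2 = theta
    return -(x1**2 - 2 * rho * x1 * x2 + x2**2) / (2 * (1 - rho**2))

def prior_transform(u):
    # map the unit cube to a uniform prior on [-5, 5]^2
    return 10.0 * u - 5.0

sampler = dynesty.NestedSampler(loglike, prior_transform, ndim=2, nlive=500)
sampler.run_nested(dlogz=0.01)
results = sampler.results
print("log Z =", results.logz[-1], "+/-", results.logzerr[-1])
```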

Strengths & Limitations of Nested Sampling

✓ Strengths
  • Evidence computation: Directly calculates \(\mathcal{Z}\) for model comparison (Bayes factors)
  • Multimodal posteriors: Naturally explores all modes without mode-hopping difficulties
  • No burn-in: Every point contributes; no samples wasted
  • Automatic termination: Stops when evidence is accurately determined
  • Parallel-friendly: Multiple live points can be updated independently
✗ Limitations
  • Constrained sampling challenge: Sampling from prior with \(\mathcal{L} > \mathcal{L}_{\text{min}}\) is hard
  • Not ideal for posteriors alone: If you only want posterior samples, MCMC is often more efficient
  • Computational cost: Can be expensive, especially with rejection sampling
  • Curse of dimensionality: Rejection sampling becomes impractical in high dimensions (\(d > 20\))
  • Implementation complexity: Advanced variants (MultiNest, PolyChord) are complex
When to use Nested Sampling:
  • Model comparison: When you need Bayes factors or posterior odds ratios
  • Multimodal posteriors: When the posterior has multiple well-separated modes
  • Phase transitions: When the posterior has sharp features or phase transition behavior
  • Low-to-moderate dimensions: Most effective for \(d \lesssim 20\) (with basic rejection) or \(d \lesssim 100\) (with advanced methods)

Comparison: Nested Sampling vs. MCMC

Nested sampling and MCMC methods serve different primary purposes:

Key Differences:
  • Primary goal: posterior sampling (MCMC: MH/HMC) vs. evidence computation (nested sampling)
  • Computes \(\mathcal{Z}\)? MCMC: no (requires thermodynamic integration); nested sampling: yes, directly
  • Multimodality: MCMC struggles (mode-hopping is hard); nested sampling handles it naturally
  • Burn-in needed? MCMC: yes; nested sampling: no
  • Parallelization: MCMC is a sequential chain; nested sampling is embarrassingly parallel
  • Best for: posterior inference (MCMC) vs. model comparison (nested sampling)

Practical recommendation: Use nested sampling when you need to compare models (compute Bayes factors) or when the posterior is multimodal. Use HMC/NUTS for high-dimensional unimodal posterior inference. Use Metropolis–Hastings when gradients aren't available and dimensionality is low.

References & Further Reading

Nested Sampling Papers:

  • Skilling, J. (2004). "Nested Sampling." In American Institute of Physics Conference Series, Vol. 735 (eds. R. Fischer, R. Preuss, & U. V. Toussaint), 395-405.
  • Skilling, J. (2006). "Nested Sampling for General Bayesian Computation." Bayesian Analysis, 1(4), 833-860.
  • Feroz, F., Hobson, M.P., & Bridges, M. (2009). "MultiNest: An Efficient and Robust Bayesian Inference Tool for Cosmology and Particle Physics." Monthly Notices of the Royal Astronomical Society, 398(4), 1601-1614.
  • Speagle, J.S. (2020). "dynesty: A Dynamic Nested Sampling Package for Estimating Bayesian Posteriors and Evidences." Monthly Notices of the Royal Astronomical Society, 493(3), 3132-3158.
  • Ashton, G., et al. (2022). "Nested Sampling for Physical Scientists." Nature Reviews Methods Primers, 2, 39.

Importance Sampling & ESS Theory:

  • Kish, L. (1965). Survey Sampling. New York: Wiley. (Original ESS formula for weighted samples)
  • Kong, A. (1992). "A Note on Importance Sampling Using Standardized Weights." Technical Report, University of Chicago.
  • Liu, J.S. (2001). Monte Carlo Strategies in Scientific Computing. New York: Springer. (Chapter 2: Importance Sampling)
  • Chopin, N., & Ridgway, J. (2017). "Leave Pima Indians Alone: Binary Regression as a Benchmark for Bayesian Computation." Statistical Science, 32(1), 64-87. (Discussion of ESS for importance sampling)

Books & Review Articles:

  • Sivia, D.S., & Skilling, J. (2006). Data Analysis: A Bayesian Tutorial (2nd ed.). Oxford University Press.
  • Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press.
  • Trotta, R. (2008). "Bayes in the Sky: Bayesian Inference and Model Selection in Cosmology." Contemporary Physics, 49(2), 71-104. (Review with nested sampling applications)

Software Implementations:

  • dynesty: dynesty.readthedocs.io - Pure Python, dynamic nested sampling with excellent documentation (reports ESS)
  • MultiNest: github.com/farhanferoz/MultiNest - Fortran implementation with Python interface (PyMultiNest)
  • PolyChord: Efficient nested sampling for high-dimensional problems
  • UltraNest: Modern Python implementation with advanced diagnostics