Nested Sampling

Interactive visualization of evidence computation and posterior sampling via nested sampling
Part 3 of the MCMC Samplers Series | See also: Metropolis-Hastings · HMC

Introduction to Nested Sampling

Nested Sampling is a computational approach to Bayesian inference introduced by John Skilling (2004, 2006). Unlike standard MCMC methods that target the posterior directly, nested sampling is primarily designed to compute the Bayesian evidence (marginal likelihood) \(\mathcal{Z}\), with posterior samples as a byproduct.

The key innovation is transforming the multi-dimensional evidence integral into a one-dimensional integral over prior volume, by sorting parameter space according to likelihood. The algorithm naturally handles multimodal posteriors, since live points are drawn from the entire prior and progressively constrained.

Why Nested Sampling?

Nested sampling directly calculates the evidence \(\mathcal{Z} = \int \mathcal{L}(\mathbf{\theta}) \pi(\mathbf{\theta}) d\mathbf{\theta}\), the quantity needed for principled model comparison via Bayes factors. It explores multimodal posteriors naturally, because live points cover the whole prior rather than following a single chain. Every collected point contributes to the estimate, and the algorithm terminates automatically once the remaining evidence contribution is negligible.

Rather than targeting the posterior \(\pi(\mathbf{\theta}|\mathbf{d})\) directly, nested sampling draws from the prior subject to progressively tighter likelihood constraints. By tracking the shrinking prior volume at each likelihood level, it numerically integrates the evidence and accumulates posterior samples along the way.

The Nested Sampling Algorithm

The algorithm maintains a set of \(N\) "live points" sampled from the prior, progressively replacing the point with the lowest likelihood while shrinking the prior volume. Each discarded point is saved as a posterior sample, weighted by the prior volume it represents.

The Evidence Integral:

Bayes' theorem in terms of evidence:

$$\pi(\mathbf{\theta}|\mathbf{d}) = \frac{\mathcal{L}(\mathbf{\theta})\pi(\mathbf{\theta})}{\mathcal{Z}}$$

where the evidence is:

$$\mathcal{Z} = \int \mathcal{L}(\mathbf{\theta}) \pi(\mathbf{\theta}) d\mathbf{\theta}$$

Nested sampling rewrites this integral in terms of the prior volume \(X\) enclosed by a likelihood contour. Define \(X(\lambda) = \int_{\mathcal{L}(\mathbf{\theta}) > \lambda} \pi(\mathbf{\theta})\, d\mathbf{\theta}\), the prior mass carrying likelihood greater than \(\lambda\); writing \(\mathcal{L}(X)\) for the inverse of this decreasing function, the evidence becomes a one-dimensional integral:

$$\mathcal{Z} = \int_0^1 \mathcal{L}(X) dX$$

Nested Sampling Algorithm:
  1. Initialize: Draw \(N\) live points \(\{\mathbf{\theta}_i\}\) from the prior \(\pi(\mathbf{\theta})\). Set \(X_0 = 1\) (entire prior volume).
  2. Iterate: At iteration \(i\):
    • Find the live point with lowest likelihood: \(\mathcal{L}_i^* = \min_j \mathcal{L}(\mathbf{\theta}_j)\)
    • Save this point as a sample: \((\mathbf{\theta}_i^*, \mathcal{L}_i^*, X_i)\)
    • Estimate prior volume shrinkage: \(X_i \approx t_i X_{i-1}\) where \(t_i \sim \text{Beta}(N, 1)\)
    • Replace the discarded point by sampling from the prior with constraint \(\mathcal{L}(\mathbf{\theta}) > \mathcal{L}_i^*\)
  3. Terminate: Stop when the estimated remaining evidence contribution is negligible (typically when \(\mathcal{L}_{\text{max}} \times X < \epsilon \times \mathcal{Z}_{\text{current}}\))
  4. Compute evidence: Numerical integration (trapezoid rule): $$\mathcal{Z} \approx \sum_{i=1}^{M} \mathcal{L}_i^* \Delta X_i$$ where \(\Delta X_i = \tfrac{1}{2}(X_{i-1} - X_{i+1})\) is the trapezoid weight (or \(X_{i-1} - X_i\) for the simpler left-rectangle approximation)
  5. Extract posterior: Samples \(\{\mathbf{\theta}_i^*\}\) with importance weights \(w_i = \mathcal{L}_i^* \Delta X_i / \mathcal{Z}\)

Step 2 requires sampling from a constrained prior, which is the hardest part of the algorithm. This demo uses simple rejection sampling: draw from the prior and reject if \(\mathcal{L} \leq \mathcal{L}_i^*\). Real implementations use MCMC within nested sampling (e.g., slice sampling, Metropolis) or advanced techniques like MultiNest's ellipsoidal decomposition or dynesty's dynamic nested sampling.
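
The whole procedure is compact enough to sketch in code. Below is a minimal, illustrative Python implementation (not this demo's actual source): it uses the deterministic mean shrinkage \(X_i \approx e^{-i/N}\) in place of the stochastic \(\text{Beta}(N, 1)\) draw of step 2, the simpler rectangle weights \(\Delta X_i = X_{i-1} - X_i\), and rejection sampling for the constrained draw. The functions `log_likelihood` and `sample_prior` are assumed to be supplied by the user.

```python
import numpy as np

def nested_sampling(log_likelihood, sample_prior, n_live=100,
                    max_iter=5000, tol=1e-3):
    """Minimal nested sampling sketch: log-evidence plus weighted dead points."""
    live = [sample_prior() for _ in range(n_live)]          # step 1
    live_logL = np.array([log_likelihood(t) for t in live])

    dead, dead_logL, dead_logw = [], [], []
    log_X_prev, log_Z = 0.0, -np.inf                        # X_0 = 1, Z = 0

    for i in range(1, max_iter + 1):                        # step 2
        worst = np.argmin(live_logL)
        logL_star = live_logL[worst]

        log_X = -i / n_live                                 # mean shrinkage: E[ln t] = -1/N
        # Rectangle weight dX_i = X_{i-1} - X_i, computed in log space
        log_dX = log_X_prev + np.log1p(-np.exp(log_X - log_X_prev))

        dead.append(live[worst])
        dead_logL.append(logL_star)
        dead_logw.append(logL_star + log_dX)                # unnormalized log-weight
        log_Z = np.logaddexp(log_Z, logL_star + log_dX)     # step 4, running sum

        # Rejection-sample the constrained prior (slow once X is tiny)
        while True:
            theta = sample_prior()
            logL = log_likelihood(theta)
            if logL > logL_star:
                break
        live[worst], live_logL[worst] = theta, logL
        log_X_prev = log_X

        # Step 3: stop when L_max * X < tol * Z
        if np.max(live_logL) + log_X < np.log(tol) + log_Z:
            break

    return log_Z, np.array(dead), np.array(dead_logL), np.array(dead_logw)
```

The normalized posterior weights of step 5 are then `np.exp(dead_logw - log_Z)`, and \(\log \mathcal{Z}\) is returned directly.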

Target Distributions

Four target distributions illustrate different aspects of sampler behavior, ranging from well-behaved to genuinely difficult:

Bivariate Gaussian (Default)
$$\pi(x_1, x_2) \propto \exp\left(-\frac{1}{2(1-\rho^2)}(x_1^2 - 2\rho x_1 x_2 + x_2^2)\right)$$

Correlated (\(\rho = 0.8\)): a clean baseline where nested sampling behaves predictably.

Rosenbrock's Banana
$$\pi(x_1, x_2) \propto \exp\left(-\frac{1}{200}(x_1^2 + 100(x_2 - x_1^2)^2)\right)$$

Curved manifold from the nonlinear transformation \(x_2 - x_1^2\): tests the sampler's ability to follow non-ellipsoidal iso-likelihood contours.

Neal's Funnel
$$\begin{aligned} x_1 &\sim \mathcal{N}(0, 3^2) \\ x_2 \mid x_1 &\sim \mathcal{N}(0, \exp(x_1)^2) \end{aligned}$$

Scale of \(x_2\) varies exponentially with \(x_1\): a common challenge in hierarchical models, where the neck of the funnel is easy to miss.

Bimodal Gaussian Mixture
$$\pi(x_1, x_2) = 0.4 \cdot \mathcal{N}((-2,-2), 0.8^2I) + 0.6 \cdot \mathcal{N}((+2,+2), 0.8^2I)$$

This is where nested sampling shines: it naturally finds and correctly weights both separated modes, where MCMC methods typically get stuck in one.
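
For concreteness, the four targets can be written as unnormalized log-densities in a few lines each. This is a sketch matching the formulas above; the function names are ours, and constants independent of \(x\) are dropped:

```python
import numpy as np

def log_gaussian(x, rho=0.8):
    # Correlated bivariate Gaussian
    return -(x[0]**2 - 2*rho*x[0]*x[1] + x[1]**2) / (2*(1 - rho**2))

def log_rosenbrock(x):
    # Rosenbrock's banana
    return -(x[0]**2 + 100*(x[1] - x[0]**2)**2) / 200

def log_funnel(x):
    # Neal's funnel: x1 ~ N(0, 3^2), x2 | x1 ~ N(0, exp(x1)^2).
    # The -x[0] term is the x1-dependent normalization of the conditional.
    return -x[0]**2 / 18 - x[1]**2 / (2*np.exp(2*x[0])) - x[0]

def log_bimodal(x):
    # 0.4 N((-2,-2), 0.8^2 I) + 0.6 N((+2,+2), 0.8^2 I); shared covariance,
    # so the common normalizing constant can be dropped.
    d1 = np.sum((np.asarray(x) - [-2.0, -2.0])**2)
    d2 = np.sum((np.asarray(x) - [+2.0, +2.0])**2)
    return np.logaddexp(np.log(0.4) - d1 / (2 * 0.8**2),
                        np.log(0.6) - d2 / (2 * 0.8**2))
```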

Simulation Controls

Nested sampling has one key parameter: the number of live points \(N\). More live points improve accuracy but slow down computation. The algorithm automatically terminates when the remaining evidence contribution becomes negligible.

  • Live points \(N\): more points → better evidence estimate, but slower. Typically 50-500.
  • Iteration delay: a slower animation helps visualize the progression.
  • Target distribution: the bimodal mixture showcases nested sampling's key advantage of naturally finding multiple modes.
This demo generates new live points by simple rejection sampling: draw from the prior and reject if the likelihood falls below the current threshold. Real implementations instead run short MCMC chains (slice sampling, Metropolis) within the constrained prior, which becomes important in high dimensions or for complex likelihood contours; a sketch of such a step follows.
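
As a sketch of what MCMC-within-nested-sampling looks like, the hypothetical helper below could replace the rejection loop: starting from an existing live point, it runs a short Metropolis random walk that targets the prior while rejecting any proposal violating the likelihood constraint. Every name and default here is illustrative, not taken from any particular library:

```python
import numpy as np

def constrained_walk(start, log_likelihood, log_prior, logL_min,
                     n_steps=20, step_size=0.5, rng=None):
    """Metropolis walk targeting the prior, restricted to L(theta) > L_min.

    `start` should be a surviving live point, which already satisfies
    the constraint.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(start, dtype=float).copy()
    lp = log_prior(theta)
    for _ in range(n_steps):
        prop = theta + step_size * rng.standard_normal(theta.shape)
        if log_likelihood(prop) <= logL_min:
            continue                      # outside the likelihood shell: reject
        lp_prop = log_prior(prop)
        # Metropolis acceptance with respect to the prior density
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
    return theta
```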

Nested Sampling Visualization

The main plot shows the joint posterior density \(\pi(\theta_1, \theta_2)\) with darker regions indicating higher density. Blue points show the current live points (actively sampling), and red points show dead points (discarded) that contribute to the evidence calculation. Watch how live points progressively contract toward higher-likelihood regions.

Joint Posterior Distribution π(θ₁, θ₂)

Live points sample the constrained prior uniformly, naturally exploring all high-density regions (darker areas). For multimodal distributions, you'll see live points in multiple modes simultaneously.

Marginal Posterior Distribution of θ₁

Importance-weighted histogram of θ₁ samples. Each dead point contributes proportionally to its posterior weight \(w_i = \mathcal{L}_i \Delta X_i / \mathcal{Z}\). Approximates the true marginal \(\pi(\theta_1 | \text{data})\).

Marginal Posterior Distribution of θ₂

Importance-weighted histogram of θ₂ samples. The weighting properly accounts for the fact that different samples contribute differently to the posterior; this is crucial because raw dead points are not posterior samples until weighted.
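
Given arrays of dead points and normalized weights (e.g., as produced by the sketch in the algorithm section), building such a marginal is essentially a one-liner with numpy:

```python
import numpy as np

# dead_points: (M, 2) array of dead points; weights: normalized posterior weights
density, edges = np.histogram(dead_points[:, 0], bins=40,
                              weights=weights, density=True)
# `density` now approximates the marginal posterior of theta_1;
# use dead_points[:, 1] for theta_2
```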

Current Iteration Details
Click "Start Sampling" or "Single Iteration" to begin nested sampling.

Nested Sampling Diagnostics

Unlike MCMC diagnostics (which focus on chain mixing and autocorrelation), nested sampling requires monitoring the prior volume shrinkage and evidence evolution. These plots show how the algorithm progressively constrains the prior and accumulates evidence.

What to monitor in nested sampling:
  • Prior volume \(X_i\): should shrink exponentially as \(X_i \approx \exp(-i/N)\) (see the sketch after this list)
  • Log-likelihood evolution: Should increase as we move to higher-likelihood regions
  • Evidence accumulation: log(Z) should stabilize when most evidence is captured
  • Posterior weights: Show which samples contribute most to the posterior
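
The first of these is easy to verify in isolation: each iteration multiplies \(X\) by \(t_i \sim \text{Beta}(N, 1)\), whose mean log is \(-1/N\), so simulated \(\log X_i\) should track \(-i/N\). A quick standalone numpy check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
N, iters = 100, 2000
t = rng.beta(N, 1, size=iters)     # per-iteration shrinkage factors
log_X = np.cumsum(np.log(t))       # log X_i = sum_k log t_k
print(f"empirical slope: {log_X[-1] / iters:.5f}, expected: {-1 / N:.5f}")
```
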
Prior Volume Shrinkage: log(X) vs Iteration

Prior volume X shrinks exponentially, so log X versus iteration is approximately linear with slope -1/N; linearity on the log scale confirms the expected shrinkage rate.

Log-Likelihood Evolution

Minimum log-likelihood at each iteration. Should increase monotonically as we sample from progressively more constrained priors.

Evidence Evolution: log(Z) vs Iteration

Cumulative log-evidence. It should plateau once the product of the remaining prior volume and the maximum live-point likelihood becomes negligible; stabilization signals that essentially all of the evidence has been captured.

Posterior Analysis

Understanding Posterior Weights

Each dead point carries a posterior weight \(w_i = \mathcal{L}_i \Delta X_i / \mathcal{Z}\), the product of its likelihood and the prior volume width at that iteration. Because likelihood increases while prior volume shrinks exponentially, the weights peak in a middle range of iterations and fall off on both sides. The plot below shows this distribution directly: a broad, smooth peak indicates many iterations contribute to the posterior, while a narrow peak suggests most weight is concentrated in a small number of samples. For multimodal targets, two distinct peaks appear, one per mode.

$$w_i = \frac{\mathcal{L}_i \times \Delta X_i}{\mathcal{Z}}$$
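
A common way to use these weights in practice is to convert the weighted dead points into an (approximately) equal-weight posterior sample by multinomial resampling, e.g., for a corner plot. A minimal sketch, assuming the `dead_points` and `weights` arrays from earlier:

```python
import numpy as np

rng = np.random.default_rng()
# Draw indices with replacement, in proportion to the posterior weights
idx = rng.choice(len(dead_points), size=1000, p=weights)
posterior_samples = dead_points[idx]   # approx. equal-weight posterior draws
```
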
Posterior Weights vs Iteration

Performance Metrics

The metrics panel tracks the iteration count, the running log-evidence, the number of dead points, and the effective sample size of the weighted samples.
Effective Sample Size for Nested Sampling

In nested sampling, ESS is computed from the variance of the normalized importance weights, following standard importance sampling theory (Kish, 1965; Kong, 1992):

$$\text{ESS} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2} = \frac{1}{\sum_{i=1}^{n} \tilde{w}_i^2}$$

where \(w_i = \mathcal{L}_i \Delta X_i / \mathcal{Z}\) are the posterior weights, already normalized to \(\sum_i w_i = 1\), so \(\tilde{w}_i = w_i\). The ESS is governed by the spread of the weights: when all weights are equal (\(w_i = 1/n\)), ESS \(= n\); when a few samples dominate, ESS ≪ \(n\).

ESS quantifies how many equally-weighted samples would provide the same Monte Carlo variance as the weighted sample. This is the standard diagnostic for importance sampling (Liu, 2001; Chopin & Ridgway, 2017) and is reported by nested sampling implementations like dynesty (Speagle, 2020) and NestedFit.
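
In code, this is two lines given the weights (normalized or not); a minimal sketch:

```python
import numpy as np

def effective_sample_size(weights):
    """Kish ESS: (sum w)^2 / sum w^2; equals n when all weights are equal."""
    w = np.asarray(weights, dtype=float)
    return w.sum()**2 / np.sum(w**2)
```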

Tuning the Number of Live Points: The number of live points \(N\) controls the resolution of the prior volume shrinkage. More live points give more accurate evidence estimates and better posterior resolution, but increase computational cost. A practical rule: \(N \gtrsim 25d\) for \(d\) dimensions (Skilling, 2006). For multimodal problems, increase \(N\) to ensure adequate sampling of each mode. Advanced implementations like MultiNest use ellipsoidal decomposition to adaptively allocate live points across modes.

Strengths & Limitations of Nested Sampling

Nested sampling is well suited to problems where the Bayesian evidence is the primary quantity of interest, such as model comparison via Bayes factors, or where the posterior has multiple well-separated modes that would trap a standard MCMC chain. Every collected sample contributes to the estimate, there is no burn-in, and the algorithm terminates automatically. The main difficulty is the constrained sampling step: drawing a new point from the prior subject to a likelihood constraint is easy in low dimensions with rejection sampling, but becomes expensive beyond roughly 20 dimensions without specialized techniques such as ellipsoidal decomposition (MultiNest) or slice sampling (PolyChord). If the goal is posterior inference alone in a well-behaved unimodal distribution, HMC or NUTS will typically be faster and scale better to higher dimensions.

Use nested sampling when you need Bayes factors, when the posterior is multimodal, or when the problem dimension is moderate (roughly below 20 with basic rejection sampling, or below 100 with advanced methods).

Comparison: Nested Sampling vs. MCMC

Nested sampling and MCMC methods serve different primary purposes, summarized below.

Key Differences:
Property | MCMC (MH/HMC) | Nested Sampling
---------|---------------|----------------
Primary goal | Posterior sampling | Evidence computation
Computes \(\mathcal{Z}\)? | No (requires thermodynamic integration) | Yes (directly)
Multimodality | Struggles (mode-hopping is hard) | Handles naturally
Burn-in needed? | Yes | No
Parallelization | Sequential chain | Embarrassingly parallel
Best for | Posterior inference | Model comparison

Use nested sampling when you need to compare models (compute Bayes factors) or when the posterior is multimodal. Use HMC/NUTS for high-dimensional unimodal posterior inference. Use Metropolis-Hastings when gradients are unavailable and dimensionality is low.

References & Further Reading

Nested Sampling Papers:

  • Skilling, J. (2004). "Nested Sampling." In American Institute of Physics Conference Series, Vol. 735 (eds. R. Fischer, R. Preuss, & U. von Toussaint), 395-405.
  • Skilling, J. (2006). "Nested Sampling for General Bayesian Computation." Bayesian Analysis, 1(4), 833-860.
  • Feroz, F., Hobson, M.P., & Bridges, M. (2009). "MultiNest: An Efficient and Robust Bayesian Inference Tool for Cosmology and Particle Physics." Monthly Notices of the Royal Astronomical Society, 398(4), 1601-1614.
  • Speagle, J.S. (2020). "dynesty: A Dynamic Nested Sampling Package for Estimating Bayesian Posteriors and Evidences." Monthly Notices of the Royal Astronomical Society, 493(3), 3132-3158.
  • Ashton, G., et al. (2022). "Nested Sampling for Physical Scientists." Nature Reviews Methods Primers, 2, 39.

Importance Sampling & ESS Theory:

  • Kish, L. (1965). Survey Sampling. New York: Wiley. (Original ESS formula for weighted samples)
  • Kong, A. (1992). "A Note on Importance Sampling Using Standardized Weights." Technical Report, University of Chicago.
  • Liu, J.S. (2001). Monte Carlo Strategies in Scientific Computing. New York: Springer. (Chapter 2: Importance Sampling)
  • Chopin, N., & Ridgway, J. (2017). "Leave Pima Indians Alone: Binary Regression as a Benchmark for Bayesian Computation." Statistical Science, 32(1), 64-87. (Discussion of ESS for importance sampling)

Books:

  • Sivia, D.S., & Skilling, J. (2006). Data Analysis: A Bayesian Tutorial (2nd ed.). Oxford University Press.
  • Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press.
  • Trotta, R. (2008). "Bayes in the Sky: Bayesian Inference and Model Selection in Cosmology." Contemporary Physics, 49(2), 71-104. (Review with nested sampling applications)

Software Implementations:

  • dynesty: dynesty.readthedocs.io - Pure Python, dynamic nested sampling with excellent documentation (reports ESS)
  • MultiNest: github.com/farhanferoz/MultiNest - Fortran implementation with Python interface (PyMultiNest)
  • PolyChord: Slice-sampling-based nested sampling, efficient for high-dimensional problems
  • UltraNest: Modern Python implementation with advanced diagnostics