Nested Sampling
Interactive visualization of evidence computation and posterior sampling via nested sampling
Part 3 of the MCMC Samplers Series | See also:
Metropolis-Hastings ·
HMC
Nested Sampling is a computational approach to Bayesian inference introduced by John Skilling (2004, 2006). Unlike standard MCMC methods that target the posterior directly, nested sampling is primarily designed to compute the Bayesian evidence (marginal likelihood) \(\mathcal{Z}\), with posterior samples as a byproduct.
The key innovation is transforming the multi-dimensional evidence integral into a one-dimensional integral over prior volume, by sorting parameter space according to likelihood. The algorithm naturally handles multimodal posteriors, since live points are drawn from the entire prior and progressively constrained.
Nested sampling directly calculates the evidence \(\mathcal{Z} = \int \mathcal{L}(\mathbf{\theta}) \pi(\mathbf{\theta}) d\mathbf{\theta}\), the quantity needed for principled model comparison via Bayes factors. Because the live points span the whole prior rather than following a single chain, multimodal posteriors are explored naturally; every collected point contributes to the estimate, and the algorithm terminates automatically once the remaining evidence contribution is negligible.
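For instance, once the log-evidences of two competing models are in hand, the (log) Bayes factor is simply their difference. A minimal sketch with made-up numbers (these values are illustrative, not from the demo above):

```python
import numpy as np

# Hypothetical log-evidences from two nested sampling runs (illustrative values).
log_Z1 = -145.2   # model 1
log_Z2 = -148.9   # model 2

log_B12 = log_Z1 - log_Z2       # ln B_12 = ln Z_1 - ln Z_2
B12 = np.exp(log_B12)           # B_12 = Z_1 / Z_2
print(f"ln B_12 = {log_B12:.2f}, B_12 = {B12:.1f}")   # ln B_12 = 3.70, B_12 = 40.4
```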
The algorithm maintains a set of \(N\) "live points" sampled from the prior, progressively replacing the point with the lowest likelihood while shrinking the prior volume. Each discarded point is saved as a posterior sample, weighted by the prior volume it represents.
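As a rough sketch of this loop, assuming a uniform prior on the unit cube and using plain rejection sampling for the likelihood-constrained draw (fine in low dimensions; all function and variable names here are illustrative, not taken from any particular library):

```python
import numpy as np

def nested_sampling(log_likelihood, ndim, n_live=200, n_iter=2000, seed=None):
    """Basic nested sampling for a uniform prior on the unit cube [0, 1]^ndim."""
    rng = np.random.default_rng(seed)
    live = rng.random((n_live, ndim))                       # live points drawn from the prior
    live_logL = np.array([log_likelihood(p) for p in live])

    log_Z = -np.inf            # accumulated log-evidence
    log_X = 0.0                # log of the remaining prior volume (X_0 = 1)
    dead, log_w = [], []

    for i in range(n_iter):
        worst = np.argmin(live_logL)                        # lowest-likelihood live point
        logL_star = live_logL[worst]

        # Expected shrinkage: X_i ~ exp(-i/N), so the shell width is dX_i = X_{i-1} - X_i.
        log_X_new = -(i + 1) / n_live
        log_dX = np.log(np.exp(log_X) - np.exp(log_X_new))

        # The dead point contributes L_i * dX_i to the evidence: Z ~ sum_i L_i dX_i.
        dead.append(live[worst].copy())
        log_w.append(logL_star + log_dX)
        log_Z = np.logaddexp(log_Z, logL_star + log_dX)

        # Replace it with a new prior draw subject to L > L_star
        # (rejection sampling: simple, but slow in higher dimensions).
        while True:
            proposal = rng.random(ndim)
            logL_new = log_likelihood(proposal)
            if logL_new > logL_star:
                break
        live[worst], live_logL[worst] = proposal, logL_new
        log_X = log_X_new

    # Return dead points, their normalised log-weights, and the log-evidence.
    return np.array(dead), np.array(log_w) - log_Z, log_Z
```

For clarity the sketch runs a fixed number of iterations and omits the final contribution of the remaining live points; the stopping rule is discussed in the diagnostics section below.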
Bayes' theorem in terms of evidence:
$$\pi(\mathbf{\theta}|\mathbf{d}) = \frac{\mathcal{L}(\mathbf{\theta})\pi(\mathbf{\theta})}{\mathcal{Z}}$$

where the evidence is:

$$\mathcal{Z} = \int \mathcal{L}(\mathbf{\theta}) \pi(\mathbf{\theta}) d\mathbf{\theta}$$

Nested sampling rewrites this integral in terms of the prior volume \(X\) enclosed by a likelihood contour: \(X(\lambda) = \int_{\mathcal{L}(\mathbf{\theta}) > \lambda} \pi(\mathbf{\theta})\, d\mathbf{\theta}\) is the prior mass with likelihood above \(\lambda\), and \(\mathcal{L}(X)\) denotes its inverse. Then

$$\mathcal{Z} = \int_0^1 \mathcal{L}(X) dX$$

Three target distributions illustrate different aspects of sampler behavior, ranging from well-behaved to genuinely difficult:
Correlated (\(\rho = 0.8\)): a clean baseline where nested sampling behaves predictably.
Curved manifold from the nonlinear transformation \(x_2 - x_1^2\): tests the sampler's ability to follow non-ellipsoidal iso-likelihood contours.
Funnel, in which the scale of \(x_2\) varies exponentially with \(x_1\): a common challenge in hierarchical models, where the narrow neck is easy to miss.
For a bimodal target with two well-separated modes, nested sampling shines: it naturally finds and correctly weights both modes, where MCMC methods typically get stuck in one.
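For reference, illustrative (unnormalized) log-densities for these targets, plus a bimodal mixture like the one just mentioned, might look like the following; the exact constants used in the visualization are not stated here, so these parameter values are assumptions:

```python
import numpy as np

def log_correlated(theta, rho=0.8):
    """Correlated 2-D Gaussian with unit variances and correlation rho."""
    x1, x2 = theta
    return -(x1**2 - 2 * rho * x1 * x2 + x2**2) / (2 * (1 - rho**2))

def log_curved(theta, curvature=1.0, width=0.5):
    """Curved manifold: Gaussian in x1 and in the transformed coordinate x2 - curvature * x1**2."""
    x1, x2 = theta
    return -0.5 * (x1**2 + ((x2 - curvature * x1**2) / width) ** 2)

def log_funnel(theta):
    """Funnel: the scale of x2 varies exponentially with x1 (Neal-style)."""
    x1, x2 = theta
    return -0.5 * (x1**2 / 9.0) - 0.5 * x2**2 * np.exp(-x1) - 0.5 * x1

def log_bimodal(theta, sep=4.0):
    """Two well-separated Gaussian modes along x1, mixed with equal weights."""
    x1, x2 = theta
    logp1 = -0.5 * ((x1 - sep / 2) ** 2 + x2**2)
    logp2 = -0.5 * ((x1 + sep / 2) ** 2 + x2**2)
    return np.logaddexp(logp1, logp2) - np.log(2.0)
```

Normalizing constants that do not depend on \(\theta\) are dropped; a prior over the parameters would be specified separately.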
Nested sampling has one key parameter: the number of live points \(N\). More live points improve accuracy but slow down computation. The algorithm automatically terminates when the remaining evidence contribution becomes negligible.
The main plot shows the joint posterior density \(\pi(\theta_1, \theta_2)\) with darker regions indicating higher density. Blue points show the current live points (actively sampling), and red points show dead points (discarded) that contribute to the evidence calculation. Watch how live points progressively contract toward higher-likelihood regions.
Live points sample the constrained prior uniformly, naturally exploring all high-density regions (darker areas). For multimodal distributions, you'll see live points in multiple modes simultaneously.
Importance-weighted histogram of θ₁ samples. Each dead point contributes proportionally to its posterior weight \(w_i = \mathcal{L}_i \Delta X_i / \mathcal{Z}\). Approximates the true marginal \(\pi(\theta_1 | \text{data})\).
Importance-weighted histogram of θ₂ samples. Properly accounts for the fact that different samples contribute differently to the posterior. This is crucial for nested sampling as raw dead points are NOT posterior samples without weighting.
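One simple way to obtain equally weighted posterior draws is to resample the dead points in proportion to their weights. A minimal sketch, assuming the samples and normalized log-weights returned by the loop sketched earlier:

```python
import numpy as np

def equal_weight_posterior(samples, log_weights, n_draws=1000, seed=None):
    """Resample dead points with probability proportional to w_i = L_i * dX_i / Z."""
    rng = np.random.default_rng(seed)
    w = np.exp(log_weights - np.max(log_weights))   # stabilise before normalising
    w /= w.sum()
    idx = rng.choice(len(samples), size=n_draws, replace=True, p=w)
    return samples[idx]

# Usage: samples, log_w, log_Z = nested_sampling(log_like, ndim=2)
#        post = equal_weight_posterior(samples, log_w)
#        np.histogram(post[:, 0], bins=40)   # weighted marginal of theta_1
```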
Unlike MCMC diagnostics (which focus on chain mixing and autocorrelation), nested sampling requires monitoring the prior volume shrinkage and evidence evolution. These plots show how the algorithm progressively constrains the prior and accumulates evidence.
Prior volume X shrinks exponentially. The slope should be approximately -1/N. Linear on log scale confirms proper shrinkage rate.
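The slope follows from the shrinkage distribution: with \(N\) live points, each iteration multiplies the enclosed prior volume by a factor \(t_i\), distributed as the largest of \(N\) uniform draws, i.e. \(t_i \sim \mathrm{Beta}(N, 1)\), so that

$$\mathbb{E}[\ln t_i] = -\frac{1}{N} \qquad\Rightarrow\qquad \mathbb{E}[\ln X_i] = \sum_{j=1}^{i} \mathbb{E}[\ln t_j] = -\frac{i}{N}.$$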
Minimum log-likelihood at each iteration. Should increase monotonically as we sample from progressively more constrained priors.
Cumulative log-evidence. Should plateau once the remaining prior volume times the maximum live-point likelihood becomes negligible; stabilization signals that the run can stop.
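A common way to implement this stopping rule is to bound the evidence still to be gathered by the best live-point likelihood times the remaining prior volume, and stop once adding that bound would barely change \(\ln \mathcal{Z}\). A minimal sketch, reusing the names from the loop above (the tolerance value is an illustrative choice):

```python
import numpy as np

def should_stop(live_logL, log_X, log_Z, dlogz_tol=0.01):
    """Stop once the remaining prior volume times the best live likelihood
    could increase log Z by less than dlogz_tol."""
    log_Z_remain = np.max(live_logL) + log_X          # upper bound on the evidence left
    log_Z_upper = np.logaddexp(log_Z, log_Z_remain)   # evidence if all of it were added
    return (log_Z_upper - log_Z) < dlogz_tol
```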
Each dead point carries a posterior weight \(w_i = \mathcal{L}_i \Delta X_i / \mathcal{Z}\), the product of its likelihood and the prior volume width at that iteration. Because likelihood increases while prior volume shrinks exponentially, the weights peak in a middle range of iterations and fall off on both sides. The plot below shows this distribution directly: a broad, smooth peak indicates many iterations contribute to the posterior, while a narrow peak suggests most weight is concentrated in a small number of samples. For multimodal targets, two distinct peaks appear, one per mode.
$$w_i = \frac{\mathcal{L}_i \times \Delta X_i}{\mathcal{Z}}$$

In nested sampling, the ESS is computed from the variance of the normalized importance weights, following standard importance sampling theory (Kish, 1965; Kong, 1992):

$$\text{ESS} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2} = \frac{1}{\sum_{i=1}^{n} \tilde{w}_i^2}$$

where \(w_i = \mathcal{L}_i \Delta X_i / \mathcal{Z}\) are the normalized posterior weights satisfying \(\sum w_i = 1\), so \(\tilde{w}_i = w_i\). The ESS reflects the spread of the weights: when all weights are equal (\(w_i = 1/n\)), ESS = \(n\); when a few samples dominate, ESS ≪ \(n\).
ESS quantifies how many equally-weighted samples would provide the same Monte Carlo variance as the weighted sample. This is the standard diagnostic for importance sampling (Liu, 2001; Chopin & Ridgway, 2017) and is reported by nested sampling implementations like dynesty (Speagle, 2020) and NestedFit.
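Computing this from the log-weights is direct; a small sketch, using the same conventions as the code above:

```python
import numpy as np

def effective_sample_size(log_weights):
    """Kish effective sample size: 1 / sum(w_i^2) for normalised weights w_i."""
    w = np.exp(log_weights - np.max(log_weights))
    w /= w.sum()                                    # enforce sum(w) == 1
    return 1.0 / np.sum(w**2)
```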
Nested sampling is well suited to problems where the Bayesian evidence is the primary quantity of interest, such as model comparison via Bayes factors, or where the posterior has multiple well-separated modes that would trap a standard MCMC chain. Every collected sample contributes to the estimate, there is no burn-in, and the algorithm terminates automatically. The main difficulty is the constrained sampling step: drawing a new point from the prior subject to a likelihood constraint is easy in low dimensions with rejection sampling, but becomes expensive beyond roughly 20 dimensions without specialized techniques such as ellipsoidal decomposition (MultiNest) or slice sampling (PolyChord). If the goal is posterior inference alone in a well-behaved unimodal distribution, HMC or NUTS will typically be faster and scale better to higher dimensions.
Nested sampling and MCMC methods serve different primary purposes, summarized below.
Use nested sampling when you need to compare models (compute Bayes factors) or when the posterior is multimodal. Use HMC/NUTS for high-dimensional unimodal posterior inference. Use Metropolis-Hastings when gradients are unavailable and dimensionality is low.
Nested Sampling Papers:
Importance Sampling & ESS Theory:
Books:
Software Implementations: