UQ4NN Solvers
This section provides the mathematical foundations underlying each UQ solver in QUiNN. All solvers share a common goal: given a neural network model \(M(x; w)\) with weights \(w \in \mathbb{R}^K\) and training data \(\{(x_i, y_i)\}_{i=1}^{N}\), approximate the posterior distribution
\[p(w \mid \mathcal{D}) \propto p(\mathcal{D} \mid w)\, p(w),\]
where \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}\) is the dataset, and use samples from this posterior to propagate uncertainty through the network predictions.
MCMC (NN_MCMC)
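Markov chain Monte Carlo directly samples from the posterior \(p(w \mid \mathcal{D})\) by constructing a Markov chain whose stationary distribution is the target posterior. Given \(M_{\text{MCMC}}\) chain samples \(\{w^{(j)}\}_{j=1}^{M_{\text{MCMC}}}\) (after discarding burn-in), predictions are obtained as
\[y^{(j)}(x) = M(x; w^{(j)}), \qquad j = 1, \ldots, M_{\text{MCMC}}.\]
The posterior mean and variance of the prediction at a test point \(x^*\) are estimated as
\[\hat{\mu}(x^*) = \frac{1}{M_{\text{MCMC}}}\sum_{j=1}^{M_{\text{MCMC}}} y^{(j)}(x^*), \qquad \hat{\sigma}^2(x^*) = \frac{1}{M_{\text{MCMC}}}\sum_{j=1}^{M_{\text{MCMC}}} \left(y^{(j)}(x^*) - \hat{\mu}(x^*)\right)^2.\]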
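As a minimal sketch of this post-processing step, the snippet below pushes a set of weight samples through a model and summarizes the predictions. The function `predictive_moments`, the stand-in `toy_model`, and the synthetic "chain" are illustrative assumptions, not part of the QUiNN API; the variance uses a \(1/M_{\text{MCMC}}\) normalization.

```python
import numpy as np

def predictive_moments(model, weight_samples, x_star):
    """Evaluate the model once per weight sample and summarize the outputs."""
    preds = np.stack([model(x_star, w) for w in weight_samples])  # (n_samples, n_points)
    return preds.mean(axis=0), preds.var(axis=0)  # variance normalized by n_samples

# Hypothetical stand-in for M(x; w): a line with slope w[0] and intercept w[1].
def toy_model(x, w):
    return w[0] * x + w[1]

rng = np.random.default_rng(0)
# Synthetic samples standing in for a post-burn-in chain (not a real MCMC run).
chain = rng.normal(loc=[2.0, -1.0], scale=0.1, size=(500, 2))
x_star = np.array([0.0, 1.0, 2.0])
mean, var = predictive_moments(toy_model, chain, x_star)
```

The same helper applies to any solver that returns a collection of weight samples, since only the source of the samples differs.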
QUiNN supports three MCMC samplers:
Adaptive Metropolis (AMCMC)
The Adaptive Metropolis algorithm [3] uses a random-walk Metropolis-Hastings sampler with an adaptively tuned proposal covariance. At step \(t\), the proposal is
\[w' \sim \mathcal{N}\left(w^{(t)}, \Sigma_t\right),\]
where the proposal covariance is updated online using the sample covariance of the chain history:
\[\Sigma_t = \gamma\, \frac{2.4^2}{K}\, \hat{C}_t,\]
with \(\hat{C}_t\) the running sample covariance of \(\{w^{(0)}, \ldots, w^{(t)}\}\), \(K\) the parameter dimensionality, and \(\gamma\) a user-tunable scaling factor. Adaptation starts after an initial burn-in period \(t_0\), and the covariance is refreshed every \(t_{\text{adapt}}\) steps. The standard Metropolis-Hastings acceptance criterion applies; because the proposal is symmetric, the acceptance probability reduces to the posterior ratio:
\[\alpha = \min\left(1, \frac{p(w' \mid \mathcal{D})}{p(w^{(t)} \mid \mathcal{D})}\right).\]
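The adaptation loop can be sketched as follows. This is an illustrative NumPy implementation under stated assumptions, not QUiNN's sampler: the function name, the fixed initial proposal covariance `0.1 * I`, the small diagonal jitter, and the Gaussian test target are all hypothetical choices.

```python
import numpy as np

def adaptive_metropolis(log_post, w0, n_steps, gamma=1.0, t0=200, t_adapt=50, seed=0):
    """Random-walk Metropolis with a proposal covariance adapted from chain history."""
    rng = np.random.default_rng(seed)
    K = len(w0)
    chain = np.empty((n_steps + 1, K))
    chain[0] = w0
    lp = log_post(chain[0])
    prop_cov = 0.1 * np.eye(K)  # assumed fixed proposal used before adaptation starts
    n_acc = 0
    for t in range(n_steps):
        if t >= t0 and t % t_adapt == 0:
            C_hat = np.atleast_2d(np.cov(chain[: t + 1].T))
            # Scaled sample covariance; the jitter keeps it positive definite.
            prop_cov = gamma * 2.4**2 / K * (C_hat + 1e-8 * np.eye(K))
        w_new = rng.multivariate_normal(chain[t], prop_cov)
        lp_new = log_post(w_new)
        if np.log(rng.uniform()) < lp_new - lp:  # symmetric proposal: posterior ratio only
            chain[t + 1], lp = w_new, lp_new
            n_acc += 1
        else:
            chain[t + 1] = chain[t]
    return chain, n_acc / n_steps

# Toy target: independent Gaussian with known mean and variances.
target_mean = np.array([1.0, -2.0])
target_var = np.array([1.0, 0.25])

def log_post(w):
    return -0.5 * np.sum((w - target_mean) ** 2 / target_var)

chain, acc_rate = adaptive_metropolis(log_post, np.zeros(2), n_steps=5000)
posterior_mean = chain[1000:].mean(axis=0)  # discard the first 1000 steps as burn-in
```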
Hamiltonian Monte Carlo (HMC)
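Hamiltonian Monte Carlo [4] augments the parameter space with an auxiliary momentum variable \(p \in \mathbb{R}^K\) and defines a Hamiltonian
\[H(w, p) = -\log p(w \mid \mathcal{D}) + \frac{1}{2} p^\top p.\]
The leapfrog integrator evolves the state \((w, p)\) for \(L\) steps with step size \(\varepsilon\), each step consisting of a half-step in momentum, a full step in position, and a second half-step in momentum:
\[p \leftarrow p + \frac{\varepsilon}{2} \nabla_w \log p(w \mid \mathcal{D}), \qquad w \leftarrow w + \varepsilon\, p, \qquad p \leftarrow p + \frac{\varepsilon}{2} \nabla_w \log p(w \mid \mathcal{D}).\]
The proposal \((w', p')\) is accepted with probability
\[\alpha = \min\left(1, \exp\left(H(w, p) - H(w', p')\right)\right).\]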
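A single HMC transition can be sketched as below, assuming an identity mass matrix and a user-supplied gradient of the log-posterior. The function `hmc_step` and the standard-normal test target are illustrative assumptions rather than QUiNN code.

```python
import numpy as np

def hmc_step(w, log_post, grad_log_post, eps, L, rng):
    """One HMC transition: sample momentum, run L leapfrog steps, accept or reject."""
    p = rng.standard_normal(w.shape)
    H_old = -log_post(w) + 0.5 * p @ p
    w_new = w.copy()
    p_new = p + 0.5 * eps * grad_log_post(w_new)    # initial half-step in momentum
    for i in range(L):
        w_new = w_new + eps * p_new                 # full step in position
        scale = eps if i < L - 1 else 0.5 * eps     # merge interior half-steps; half-step at the end
        p_new = p_new + scale * grad_log_post(w_new)
    H_new = -log_post(w_new) + 0.5 * p_new @ p_new
    if np.log(rng.uniform()) < H_old - H_new:
        return w_new, True
    return w, False

# Toy target: standard normal in 3 dimensions.
log_post = lambda w: -0.5 * w @ w
grad_log_post = lambda w: -w

rng = np.random.default_rng(0)
w = np.zeros(3)
samples, n_acc = [], 0
for _ in range(2000):
    w, accepted = hmc_step(w, log_post, grad_log_post, eps=0.2, L=10, rng=rng)
    samples.append(w)
    n_acc += accepted
samples = np.array(samples)
acc_rate = n_acc / len(samples)
```

Consecutive momentum half-steps are merged into full steps inside the loop, which is equivalent to applying the three-part update \(L\) times.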
Metropolis-Adjusted Langevin Algorithm (MALA)
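MALA [5] is a gradient-informed sampler that uses the Langevin diffusion to construct proposals. With step size \(\varepsilon\), the proposal is
\[w' = w^{(t)} + \frac{\varepsilon^2}{2} \nabla_w \log p(w^{(t)} \mid \mathcal{D}) + \varepsilon\, \xi, \qquad \xi \sim \mathcal{N}(0, I_K),\]
which corresponds to a single Euler-Maruyama discretization step of the Langevin stochastic differential equation. Because this proposal is not symmetric, the Metropolis-Hastings correction must include the forward and reverse proposal densities; with it, the chain targets the exact posterior despite the discretization error.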
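The sketch below shows one MALA transition, including the asymmetric proposal correction. It assumes the step-size convention with drift \(\tfrac{\varepsilon^2}{2}\nabla \log p\) and noise scale \(\varepsilon\); the function name and the Gaussian test target are illustrative.

```python
import numpy as np

def mala_step(w, log_post, grad_log_post, eps, rng):
    """One MALA transition with the Metropolis-Hastings correction."""
    def drift(v):
        return v + 0.5 * eps**2 * grad_log_post(v)

    def log_q(to, frm):
        # Log-density (up to a constant) of proposing `to` from `frm`.
        return -np.sum((to - drift(frm)) ** 2) / (2.0 * eps**2)

    w_new = drift(w) + eps * rng.standard_normal(w.shape)
    log_alpha = (log_post(w_new) + log_q(w, w_new)) - (log_post(w) + log_q(w_new, w))
    if np.log(rng.uniform()) < log_alpha:
        return w_new, True
    return w, False

# Toy target: unit-variance Gaussian centered at target_mean.
target_mean = np.array([1.0, -1.0])
log_post = lambda w: -0.5 * np.sum((w - target_mean) ** 2)
grad_log_post = lambda w: -(w - target_mean)

rng = np.random.default_rng(0)
w = np.zeros(2)
samples, n_acc = [], 0
for _ in range(4000):
    w, accepted = mala_step(w, log_post, grad_log_post, eps=0.8, rng=rng)
    samples.append(w)
    n_acc += accepted
samples = np.array(samples)
acc_rate = n_acc / len(samples)
```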
Deep Ensemble (NN_Ens)
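Deep Ensembles train \(J\) independent networks from random initializations, optionally on random subsets of the data (controlled by the data fraction parameter \(\delta \in (0, 1]\)). Each ensemble member \(j\) minimizes the standard MSE loss
\[\mathcal{L}_j(w) = \frac{1}{|\mathcal{D}_j|} \sum_{(x_i, y_i) \in \mathcal{D}_j} \|y_i - M(x_i; w)\|^2,\]
where \(\mathcal{D}_j \subseteq \mathcal{D}\), \(|\mathcal{D}_j| = \lfloor \delta \cdot N \rfloor\). Denoting the trained weights of member \(j\) by \(w_j^*\), predictions from all members are aggregated:
\[\hat{\mu}(x^*) = \frac{1}{J} \sum_{j=1}^{J} M(x^*; w_j^*), \qquad \hat{\sigma}^2(x^*) = \frac{1}{J} \sum_{j=1}^{J} \left(M(x^*; w_j^*) - \hat{\mu}(x^*)\right)^2.\]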
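The data-subsetting and aggregation logic can be illustrated with a deliberately simple stand-in: a linear model fitted by least squares replaces the neural network and its training loop. This is an assumption made to keep the sketch short; because the problem is convex, member diversity here comes only from the random subsets, not from random initializations.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, delta = 200, 10, 0.5

# Synthetic data from a noisy line.
x = rng.uniform(-1.0, 1.0, N)
y = 3.0 * x + 0.5 + 0.1 * rng.standard_normal(N)
X = np.column_stack([x, np.ones(N)])  # features: slope and intercept

members = []
for j in range(J):
    # D_j: a random subset of floor(delta * N) points, drawn without replacement.
    idx = rng.choice(N, size=int(np.floor(delta * N)), replace=False)
    # Least squares minimizes the MSE loss on D_j (stand-in for network training).
    w_j, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    members.append(w_j)

# Aggregate member predictions at test points.
x_star = np.array([-0.5, 0.0, 0.5])
X_star = np.column_stack([x_star, np.ones_like(x_star)])
preds = np.stack([X_star @ w for w in members])  # shape (J, n_test)
ens_mean, ens_var = preds.mean(axis=0), preds.var(axis=0)
```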
Randomized MAP Sampling (NN_RMS)
Randomized MAP Sampling (RMS) [2] extends the deep ensemble approach by training each member with a randomized prior anchor. Each ensemble member \(j\) minimizes the negative log-posterior
\[\mathcal{L}_j(w) = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \|y_i - M(x_i; w)\|^2 + \frac{1}{2\sigma_{\text{prior}}^2} \|w - w_0^{(j)}\|^2,\]
where \(\sigma\) is the data noise standard deviation and \(w_0^{(j)} \sim \mathcal{N}(0, \sigma_{\text{prior}}^2 I_K)\) is a random anchor independently drawn for each member. This provides an implicit sampling scheme: the MAP solutions \(\{w_j^*\}_{j=1}^J\) serve as approximate posterior samples.
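For a linear model the anchored objective has a closed-form minimizer, which makes the effect of the random anchor easy to inspect. The sketch below assumes a Gaussian likelihood with noise level `sigma` and the quadratic anchor penalty; the linear model and all names are illustrative stand-ins for the network and its optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, sigma, sigma_prior = 50, 200, 0.2, 1.0

x = rng.uniform(-1.0, 1.0, N)
X = np.column_stack([x, np.ones(N)])
y = X @ np.array([1.0, 0.5]) + sigma * rng.standard_normal(N)

# Setting the gradient of the anchored loss to zero gives the linear system
#   (X^T X / sigma^2 + I / sigma_prior^2) w = X^T y / sigma^2 + w0 / sigma_prior^2
A = X.T @ X / sigma**2 + np.eye(2) / sigma_prior**2
b = X.T @ y / sigma**2

def rms_member(rng):
    w0 = sigma_prior * rng.standard_normal(2)  # random anchor for this member
    return np.linalg.solve(A, b + w0 / sigma_prior**2)

members = np.stack([rms_member(rng) for _ in range(J)])  # shape (J, 2)
posterior_mean = np.linalg.solve(A, b)  # exact posterior mean for a zero-mean prior
```

In this linear-Gaussian setting the average of the members approaches the exact posterior mean as \(J\) grows, since the anchors have zero mean.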
Variational Inference (NN_VI)
Variational inference approximates the posterior \(p(w \mid \mathcal{D})\) with a tractable distribution \(q_\phi(w)\) by minimizing the Kullback-Leibler (KL) divergence, which is equivalent to maximizing the Evidence Lower Bound (ELBO). QUiNN implements the Bayes by Backprop method [1].
Variational Family
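Each weight \(w_k\) is assigned an independent Gaussian variational distribution:
\[q_\phi(w) = \prod_{k=1}^{K} \mathcal{N}\left(w_k;\, \mu_k, \sigma_k^2\right), \qquad \sigma_k = \log\left(1 + e^{\rho_k}\right),\]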
where \(\phi = \{\mu_k, \rho_k\}_{k=1}^K\) are the variational parameters. The softplus transformation ensures \(\sigma_k > 0\).
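A minimal sketch of the softplus parameterization and reparameterized sampling follows. The helper names are hypothetical and the arrays are plain NumPy; an actual implementation would keep \(\mu\) and \(\rho\) as trainable tensors so that gradients flow through the samples.

```python
import numpy as np

def softplus(rho):
    """log(1 + exp(rho)), computed in a numerically stable way."""
    return np.logaddexp(0.0, rho)

def sample_weights(mu, rho, n_samples, rng):
    """Reparameterized draws: w = mu + sigma * eps with eps ~ N(0, I)."""
    sigma = softplus(rho)
    eps = rng.standard_normal((n_samples, mu.size))
    return mu + sigma * eps

def log_q(w, mu, rho):
    """Log-density of the factored Gaussian, summed over weights."""
    sigma = softplus(rho)
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                  - 0.5 * ((w - mu) / sigma) ** 2, axis=-1)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])
rho = np.array([-3.0, 0.0, 1.0])  # unconstrained; softplus maps it to sigma > 0
w_samples = sample_weights(mu, rho, n_samples=20000, rng=rng)
```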
Scale Mixture Prior
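The prior over each weight is a scale mixture of two zero-mean Gaussians:
\[p(w_k) = \pi\, \mathcal{N}\left(w_k;\, 0, \sigma_1^2\right) + (1 - \pi)\, \mathcal{N}\left(w_k;\, 0, \sigma_2^2\right),\]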
where \(\pi \in [0,1]\) and \(\sigma_1, \sigma_2 > 0\) are hyperparameters.
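The log-density of such a mixture is best evaluated in log space to avoid underflow when one component is very narrow. The function below is an illustrative sketch (the name and example values are assumptions); it requires \(0 < \pi < 1\) so that both mixture log-weights are finite.

```python
import numpy as np

def log_scale_mixture_prior(w, pi, sigma1, sigma2):
    """Sum over weights of log[pi * N(w; 0, sigma1^2) + (1 - pi) * N(w; 0, sigma2^2)]."""
    def log_normal(v, s):
        return -0.5 * np.log(2 * np.pi) - np.log(s) - 0.5 * (v / s) ** 2

    comp1 = np.log(pi) + log_normal(w, sigma1)
    comp2 = np.log1p(-pi) + log_normal(w, sigma2)
    return np.sum(np.logaddexp(comp1, comp2))  # stable log-sum-exp of the two components

example_w = np.array([0.3, -1.2])
lp = log_scale_mixture_prior(example_w, pi=0.25, sigma1=1.0, sigma2=0.1)
```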
ELBO Loss
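The variational loss (per mini-batch) combines a complexity term, down-weighted by the number of mini-batches, with a data-fit term:
\[\mathcal{L}_{\text{VI}}(\phi) = \frac{1}{B} \left[\log q_\phi(w) - \log p(w)\right] + \text{MSE}(w),\]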
where \(w \sim q_\phi\), \(B\) is the number of mini-batches, and \(\text{MSE}(w) = \frac{1}{|b|}\sum_{i \in b}\|y_i - M(x_i; w)\|^2\) over the current mini-batch \(b\). At each training step, \(S\) weight samples are drawn for a Monte Carlo estimate of the ELBO. At prediction time, weight samples from \(q_\phi(w)\) are drawn to produce an ensemble of outputs.
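A Monte Carlo estimate of this loss can be sketched as follows. The sketch assumes the data-fit term enters as the plain mini-batch MSE with no additional noise scaling, uses a two-parameter linear model as a hypothetical stand-in for the network, and picks arbitrary prior hyperparameters; none of these choices are taken from the QUiNN source.

```python
import numpy as np

def softplus(rho):
    return np.logaddexp(0.0, rho)

def log_normal(v, mean, std):
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

def vi_loss(mu, rho, x_b, y_b, n_batches, S, rng, pi=0.5, s1=1.0, s2=0.1):
    """Average over S weight samples of (log q - log p) / B + MSE on the mini-batch."""
    sigma = softplus(rho)
    total = 0.0
    for _ in range(S):
        w = mu + sigma * rng.standard_normal(mu.shape)  # reparameterized sample
        log_q = np.sum(log_normal(w, mu, sigma))
        log_p = np.sum(np.logaddexp(np.log(pi) + log_normal(w, 0.0, s1),
                                    np.log1p(-pi) + log_normal(w, 0.0, s2)))
        mse = np.mean((y_b - (w[0] * x_b + w[1])) ** 2)  # linear stand-in for M(x; w)
        total += (log_q - log_p) / n_batches + mse
    return total / S

rng = np.random.default_rng(0)
x_b = rng.uniform(-1.0, 1.0, 32)
y_b = 2.0 * x_b - 1.0
rho = np.full(2, -3.0)
loss_good = vi_loss(np.array([2.0, -1.0]), rho, x_b, y_b, n_batches=10, S=8, rng=rng)
loss_bad = vi_loss(np.array([-2.0, 1.0]), rho, x_b, y_b, n_batches=10, S=8, rng=rng)
```

The two evaluations differ mainly in the data-fit term, so the variational mean that matches the data-generating line yields the lower loss.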
Laplace Approximation (NN_Laplace)
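The Laplace approximation [6] constructs a Gaussian approximation to the posterior centered at the MAP estimate \(w^*\):
\[p(w \mid \mathcal{D}) \approx \mathcal{N}\left(w^*,\, \left[\nabla^2_w \mathcal{L}(w^*)\right]^{-1}\right),\]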
where \(\mathcal{L}(w) = -\log p(w \mid \mathcal{D})\) is the negative log-posterior and \(\nabla^2_w \mathcal{L}(w^*)\) is its Hessian evaluated at the MAP.
Step 1: MAP Training. The network is trained by minimizing the negative log-posterior \(\mathcal{L}(w)\), yielding the MAP estimate \(w^*\).
Step 2: Hessian Computation. QUiNN supports two Hessian approximations:
Full Hessian: The exact \(K \times K\) Hessian is computed via second-order automatic differentiation:
\[H_{ij} = \frac{\partial^2 \mathcal{L}}{\partial w_i \partial w_j}\Bigg|_{w=w^*}.\]
Diagonal (Fisher) approximation: The diagonal of the empirical Fisher information matrix is used as a Hessian proxy:
\[\tilde{H}_{kk} = \frac{1}{N}\sum_{i=1}^{N} \left(\frac{\partial \mathcal{L}_i}{\partial w_k}\Bigg|_{w=w^*}\right)^2,\]
where \(\mathcal{L}_i\) denotes the per-sample loss. The resulting Hessian is diagonal: \(\tilde{H} = \text{diag}(\tilde{H}_{11}, \ldots, \tilde{H}_{KK})\).
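Step 3: Posterior Covariance. The posterior covariance is
\[\Sigma = s\, H^{-1}\]
(with \(\tilde{H}\) in place of \(H\) for the diagonal approximation),
where \(s\) is a user-tunable covariance scaling factor.
Step 4: Prediction. A predictive sample is drawn as
\[w \sim \mathcal{N}(w^*, \Sigma), \qquad y(x^*) = M(x^*; w).\]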
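The four steps can be traced end to end on a linear-Gaussian model, where the MAP estimate and the Hessian are available in closed form. The sketch assumes a covariance of the form \(s\,H^{-1}\), a Gaussian likelihood and prior, and a per-sample loss that splits the prior term evenly across samples; a real network would obtain the Hessian and per-sample gradients through automatic differentiation instead.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, sigma_prior, s = 100, 0.2, 1.0, 1.0

x = rng.uniform(-1.0, 1.0, N)
X = np.column_stack([x, np.ones(N)])
y = X @ np.array([1.5, -0.5]) + sigma * rng.standard_normal(N)

# Step 1: MAP estimate (closed form for this model).
H_full = X.T @ X / sigma**2 + np.eye(2) / sigma_prior**2  # exact Hessian, constant in w
w_map = np.linalg.solve(H_full, X.T @ y / sigma**2)

# Step 2: diagonal empirical Fisher from per-sample gradients at the MAP.
resid = X @ w_map - y
grads = resid[:, None] * X / sigma**2 + w_map / (N * sigma_prior**2)  # shape (N, K)
H_diag = np.mean(grads**2, axis=0)

# Step 3: scaled inverse Hessian as posterior covariance.
cov_full = s * np.linalg.inv(H_full)
cov_diag = s * np.diag(1.0 / H_diag)

# Step 4: sample weights and push them through the model.
w_samples = rng.multivariate_normal(w_map, cov_full, size=1000)
x_star = np.array([0.0, 0.5])
preds = w_samples @ np.vstack([x_star, np.ones_like(x_star)])  # shape (1000, n_test)
```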
SWAG (NN_SWAG)
Stochastic Weight Averaging-Gaussian (SWAG) [7] approximates the posterior by fitting a Gaussian distribution to the SGD trajectory after initial training.
Step 1: Pre-training. The network is trained with the negative log-posterior loss to obtain a good initialization.
Step 2: SGD Trajectory Collection. Starting from the pre-trained weights, \(T\) additional SGD steps are performed. At every \(c\)-th step, the current weight vector \(w_t\) is recorded and the running first and second moments are updated:
\[\bar{w} \leftarrow \frac{n\, \bar{w} + w_t}{n + 1}, \qquad \overline{w^2} \leftarrow \frac{n\, \overline{w^2} + w_t \odot w_t}{n + 1},\]
where \(n = \lfloor t/c \rfloor\) is the snapshot counter and \(\odot\) is element-wise product.
Step 3: Covariance Approximation. The diagonal variance is
\[\Sigma_{\text{diag}} = \text{diag}\left(\overline{w^2} - \bar{w} \odot \bar{w}\right).\]
For the low-rank variant, the last \(k\) deviation vectors \(d_i = w_{t_i} - \bar{w}\) are stored as columns of a matrix \(D \in \mathbb{R}^{K \times k}\).
Step 4: Prediction. A posterior sample is drawn as
\[w = \bar{w} + \frac{1}{\sqrt{2}}\, \Sigma_{\text{diag}}^{1/2}\, z_1 + \frac{1}{\sqrt{2(k-1)}}\, D\, z_2,\]
where \(z_1 \sim \mathcal{N}(0, I_K)\) and \(z_2 \sim \mathcal{N}(0, I_k)\). If the covariance type is not low-rank, the term involving \(D z_2\) is omitted. Predictions are obtained as \(y(x^*) = M(x^*; w)\).
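The moment tracking and sampling steps can be sketched as below. Noisy draws around a fixed point stand in for real SGD iterates, the running moments are initialized at the pre-trained weights, and the sampling scale factors follow the SWAG paper [7]; all three are assumptions of this sketch rather than confirmed details of NN_SWAG.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, c, k = 5, 1000, 10, 20  # weights, SGD steps, snapshot interval, rank

w_mode = np.linspace(-1.0, 1.0, K)
w_pre = w_mode + 0.1 * rng.standard_normal(K)  # pretend pre-trained weights
w_bar, w2_bar = w_pre.copy(), w_pre**2         # running first and second moments
snapshots = []

for t in range(1, T + 1):
    w_t = w_mode + 0.1 * rng.standard_normal(K)  # stand-in for an SGD iterate
    if t % c == 0:
        n = t // c  # snapshot counter
        w_bar = (n * w_bar + w_t) / (n + 1)
        w2_bar = (n * w2_bar + w_t * w_t) / (n + 1)
        snapshots.append(w_t)

# Diagonal variance, clipped to guard against tiny negative values from round-off.
var_diag = np.clip(w2_bar - w_bar**2, 1e-30, None)
# Last k deviations from the final running mean, stored as columns.
D = np.stack([w - w_bar for w in snapshots[-k:]], axis=1)  # shape (K, k)

def swag_sample(low_rank=True):
    """Draw one weight vector from the fitted Gaussian."""
    w = w_bar + np.sqrt(var_diag) * rng.standard_normal(K) / np.sqrt(2.0)
    if low_rank:
        w = w + D @ rng.standard_normal(k) / np.sqrt(2.0 * (k - 1))
    return w

w_sample = swag_sample()
```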
Summary of Solvers
| Solver | Posterior approximation | Training cost | Memory cost | Key hyperparameters |
|---|---|---|---|---|
| NN_MCMC | Exact (asymptotically) | High (\(O(M_{\text{MCMC}})\) forward/backward passes) | \(O(M_{\text{MCMC}} \cdot K)\) | \(M_{\text{MCMC}}\), sampler type, \(\sigma\), \(\varepsilon\) (HMC) |
| NN_Ens | Implicit (point estimates) | \(J \times\) single training | \(O(J \cdot K)\) | \(J\), \(\delta\) |
| NN_RMS | Implicit (randomized MAP) | \(J \times\) single training | \(O(J \cdot K)\) | \(J\), \(\sigma\), \(\sigma_{\text{prior}}\) |
| NN_VI | Factored Gaussian \(q_\phi(w)\) | \(\sim 2\times\) single training | \(O(2K)\) (for \(\mu, \rho\)) | \(\pi\), \(\sigma_1\), \(\sigma_2\), \(S\) |
| NN_Laplace | Gaussian at MAP | Single training + Hessian | \(O(K^2)\) full / \(O(K)\) diag | |
| NN_SWAG | Low-rank Gaussian | Single training + \(T\) SGD steps | \(O(K \cdot k)\) low-rank | \(k\), \(T\), \(c\) |
References
See the References page for the full reference list. Key references for the solvers: