Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems

arXiv cs.LG 05/25/26, 04:00 AM Papers
Summary
This paper demonstrates that pointwise metrics like RMSE and MAE structurally mislead for inverse problems with multimodal posteriors, because optimal point estimators collapse the posterior and distort spectral features. It proposes a three-part evaluation protocol using per-event distributional accuracy, spectrum-fidelity diagnostics, and coverage-based calibration to address these failures.
arXiv:2605.22891v1 Announce Type: new Abstract: Evaluation in scientific reconstruction is dominated by pointwise metrics - RMSE, MAE, per-event resolution - under the implicit assumption that lower error means better reconstruction. We show that this assumption fails structurally for inverse problems with multimodal posteriors. By the law of total variance, point estimators trained to minimize MSE or MAE produce a marginal spectrum strictly narrower than the truth whenever the posterior has nonzero width. The resulting bias is independent of architecture, training, and dataset size, and it compresses precisely the spectral features - tails, modes, shapes - that downstream scientific measurements rely on. We propose a three-part evaluation protocol where each step targets a failure mode the others miss: per-event distributional accuracy via CRPS, population-level marginal accuracy via a spectrum-fidelity diagnostic, and uncertainty trustworthiness via coverage-based calibration. On a synthetic benchmark with an analytic posterior and on a realistic many-to-one inverse problem from particle physics, model rankings reverse between pointwise and distributional metrics, and calibration further separates architectures indistinguishable under CRPS. The evaluation protocol, not the model, determines the scientific conclusion.
Original Article
View Cached Full Text
Cached at: 05/25/26, 08:56 AM
# Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems
Source: [https://arxiv.org/html/2605.22891](https://arxiv.org/html/2605.22891)
Mads H\. Baattrup Deutsches Elektronen\-Synchrotron Hamburg, Germany mads\.baattrup@desy\.de &Jörn Bach Deutsches Elektronen\-Synchrotron Hamburg, Germany joern\.bach@desy\.de &Laurids Jeppe CERN Geneva, Switzerland laurids\.jeppe@cern\.ch &Finn Labe Deutsches Elektronen\-Synchrotron Hamburg, Germany finn\.labe@desy\.de &Alexander Grohsjean University of Hamburg Hamburg, Germany alexander\.grohsjean@uni\-hamburg\.de &Christian Schwanenberger Deutsches Elektronen\-Synchrotron Hamburg, Germany christian\.schwanenberger@desy\.de &Peer Stelldinger HAW Hamburg Hamburg, Germany peer\.stelldinger@haw\-hamburg\.de

###### Abstract

Evaluation in scientific reconstruction is dominated by pointwise metrics – RMSE, MAE, per\-event resolution – under the implicit assumption that lower error means better reconstruction\. We show that this assumption fails structurally for inverse problems with multimodal posteriors\. By the law of total variance, point estimators trained to minimize MSE or MAE produce a marginal spectrum strictly narrower than the truth whenever the posterior has nonzero width\. The resulting bias is independent of architecture, training, and dataset size, and it compresses precisely the spectral features – tails, modes, shapes – that downstream scientific measurements rely on\. We propose a three\-part evaluation protocol where each step targets a failure mode the others miss: per\-event distributional accuracy via CRPS, population\-level marginal accuracy via a spectrum\-fidelity diagnostic, and uncertainty trustworthiness via coverage\-based calibration\. On a synthetic benchmark with an analytic posterior and on a realistic many\-to\-one inverse problem from particle physics, model rankings reverse between pointwise and distributional metrics, and calibration further separates architectures indistinguishable under CRPS\. The evaluation protocol, not the model, determines the scientific conclusion\.

## 1Introduction and Scope

In scientific reconstruction, evaluation is not a neutral postprocessing step: the metrics we report determine which models advance and which physical conclusions hold\. Across particle physics, medical imaging, and geophysics, evaluation is dominated by pointwise resolution metrics \(root\-mean\-squared\-error \(RMSE\), mean\-absolute\-error \(MAE\), per\-event bias\) that ask how close each prediction lies to the truth\. We show that this convention fails structurally for under\-constrained inverse problems with multimodal posteriors\. The optimal predictor under mean\-squared\-error \(MSE\) is the conditional expectation\[[6](https://arxiv.org/html/2605.22891#bib.bib1)\], which for multimodal posteriors can fall between modes in regions of vanishing probability density\. Models achieving lower MSE do so by collapsing the posterior more aggressively, producing predictions that are individually “unphysical” and collectively distort reconstructed spectra\. This is an*evaluation oversight*rather than a model failure\.

We present the following contributions:

1. 1\.A theoretical argument \([section˜3\.2](https://arxiv.org/html/2605.22891#S3.SS2)\) showing that any point estimator minimizing MSE or MAE produces a marginal spectrum strictly narrower than the truth whenever the posterior has nonzero variance – a bias independent of architecture, training objective, and dataset size\.
2. 2\.A three\-part evaluation protocol – per\-event distributional accuracy, population\-level spectrum fidelity, and coverage\-based calibration – that diagnoses the failure modes pointwise metrics miss and applies across regression, mixture, and generative model families on a common scale\.
3. 3\.An empirical demonstration on a controlled synthetic benchmark with an analytic posterior and on a realistic many\-to\-one inverse problem from particle physics, showing that model rankings reverse between pointwise and distributional metrics, and that calibration further separates architectures indistinguishable under the per\-event distributional accuracy metric\.

##### Scope

We consider fully\-supervised inverse problems where paired latentszzand observationsxxare available via simulation, and the downstream quantity of interest is the marginalp\(z\)p\(z\)over a dataset rather than a posterior over global parameters\. Ground\-truthzzis assumed accessible at evaluation time\. We do not address joint posterior structure across high\-dimensionalzz, domain shift between training and deployment, or unsupervised settings\. Our analysis is otherwise domain\-agnostic: any inverse problem with non\-negligible posterior variance falls within scope\.

## 2Background and Related Work

##### Fully\-Supervised Inverse Problems

We consider the task of recovering a latent statez∈ℝnz\\in\\mathbb\{R\}^\{n\}from noisy observationsx∈ℝmx\\in\\mathbb\{R\}^\{m\}related by a forward modelx=G\(z,ξ\)\{x\}=G\(\{z\},\\xi\), whereξ\\xirepresents measurement noise or physical stochasticity – such a pair,\(z,x\)\(z,x\), is commonly called an*event*or a*sample*\. In scientific domains – particle reconstruction, seismic imaging, medical tomography –GGis a high\-fidelity simulator that is analytically intractable to invert\. When a large corpus of paired\(z,x\)\(z,x\)samples is available, the task becomes a*fully\-supervised inverse problem*: learn a surrogateG†G^\{\\dagger\}approximating the posteriorp\(z∣x\)p\(z\\mid x\)\. Such problems are frequently ill\-posed, admitting multiple or no exact solutions, either through the stochasticity ofξ\\xior through information loss inx\{x\}\.

##### Evaluation in Scientific Reconstruction

Because scientific reconstruction is typically performed event\-by\-event, evaluation has traditionally followed the same structure, dominated by RMSE, MAE, or resolution and bias of the per\-event error distribution\. This convention appears universal: RMSE and MAE are the headline metrics in particle reconstruction\[[18](https://arxiv.org/html/2605.22891#bib.bib24),[46](https://arxiv.org/html/2605.22891#bib.bib25)\], sparse\-view computed tomography\[[47](https://arxiv.org/html/2605.22891#bib.bib7)\], image super\-resolution benchmarks\[[54](https://arxiv.org/html/2605.22891#bib.bib6)\], climate downscaling\[[52](https://arxiv.org/html/2605.22891#bib.bib8)\], and geophysical inversion\[[32](https://arxiv.org/html/2605.22891#bib.bib5)\]and are routinely adopted as the training objective itself\. Lower error is implicitly assumed to mean better reconstruction \(also when aggregated\) – true for well\-constrained unimodal problems, but not when the conditional posterior has nonzero width \([section˜3](https://arxiv.org/html/2605.22891#S3)\)\. Point predictions are also commonly reported, even when the underlying method produces a full posterior: generative models for gravitational\-wave parameter estimation are routinely summarized by medians and 90% intervals\[[45](https://arxiv.org/html/2605.22891#bib.bib10),[15](https://arxiv.org/html/2605.22891#bib.bib11)\], and normalizing flows for neutrino kinematics are ultimately evaluated via point estimates extracted from the posterior\[[43](https://arxiv.org/html/2605.22891#bib.bib9)\]\. This “generative\-to\-regressive” bottleneck persists because the community lacks a structured evaluation protocol that rewards preservation of posterior structure\.

##### Connection to Simulation\-Based Inference \(SBI\)

Learning the surrogate,G†G^\{\\dagger\}, is an instance of amortized neural posterior estimation \(NPE\)\[[13](https://arxiv.org/html/2605.22891#bib.bib15),[38](https://arxiv.org/html/2605.22891#bib.bib16)\]\. Our setting differs from canonical SBI in two respects: the latentzzis per\-event and amortized over𝒪\(104\-106\)\\mathcal\{O\}\(10^\{4\}\\text\{\-\}10^\{6\}\)events, and the scientific quantity of interest is the aggregated marginalp\(z\)p\(z\)rather than a posterior over global parameters\. This motivates an evaluation protocol targeting both per\-event accuracy and population\-level fidelity\.

##### Proper Scoring Rules, Calibration, and Evaluation Principles

Evaluation of probabilistic predictions requires metrics that reward*faithful*distributions rather than sharp point predictions\. A scoring ruleS\(F,z\)S\(F,z\)is*proper*if its expectation under the true distribution is minimized byFFequal to that distribution, and*strictly proper*if uniquely so\[[23](https://arxiv.org/html/2605.22891#bib.bib28)\]\. Propriety distinguishes honest probabilistic evaluation from metrics that can be “gamed” by posterior collapse\.

The continuous ranked probability score \(CRPS\) is the standard proper scoring rule for univariate distributional prediction, developed in the probabilistic weather forecasting literature\[[23](https://arxiv.org/html/2605.22891#bib.bib28),[28](https://arxiv.org/html/2605.22891#bib.bib22)\]\. For a predictive cumulative distributionFFand an observed valuezz, it is defined as:

CRPS\(F,z\)=∫−∞∞\[F\(t\)−𝟏\(t≥z\)\]2dt,\\mathrm\{CRPS\}\(F,z\)=\\int\_\{\-\\infty\}^\{\\infty\}\\left\[F\(t\)\-\\mathbf\{1\}\(t\\geq z\)\\right\]^\{2\}\\mathrm\{d\}t,\(1\)where𝟏\(⋅\)\\mathbf\{1\}\(\\cdot\)denotes the indicator function\. When the predictive distribution is represented byNNposterior samples\{z^k\}k=1N\\\{\\hat\{z\}^\{k\}\\\}\_\{k=1\}^\{N\}, the CRPS can be estimated in𝒪\(Nlog⁡N\)\\mathcal\{O\}\(N\\log N\)by sorting\[[31](https://arxiv.org/html/2605.22891#bib.bib27)\]and the explicit form is given in[section˜A\.1](https://arxiv.org/html/2605.22891#A1.SS1)\.

Two properties are particularly relevant for this setting\. First, CRPS reduces to the MAE whenF=δ\(z^\)F=\\delta\(\\hat\{z\}\)is a Dirac distribution at a point predictionz^\\hat\{z\}, making it directly applicable to both generative and regression models on a common scale\. Second, the CRPS is reported per\-event and then averaged providing the same granularity as RMSE or MAE so dataset\-averaged CRPS can be reported as a drop\-in replacement for standard pointwise resolution metrics\.

Beyond proper scoring, calibration assesses whether nominal credible regions contain the true value at their stated frequencies\[[27](https://arxiv.org/html/2605.22891#bib.bib19),[51](https://arxiv.org/html/2605.22891#bib.bib20),[26](https://arxiv.org/html/2605.22891#bib.bib13),[33](https://arxiv.org/html/2605.22891#bib.bib14)\]; we use a conformal variant\[[21](https://arxiv.org/html/2605.22891#bib.bib31),[53](https://arxiv.org/html/2605.22891#bib.bib32),[1](https://arxiv.org/html/2605.22891#bib.bib30)\]for finite\-sample guarantees and cross\-family comparability\. SBI benchmarks\[[37](https://arxiv.org/html/2605.22891#bib.bib18)\]and coverage diagnostics\[[27](https://arxiv.org/html/2605.22891#bib.bib19),[51](https://arxiv.org/html/2605.22891#bib.bib20),[34](https://arxiv.org/html/2605.22891#bib.bib21)\]develop these tools for Bayesian posteriors over global parameters\. The aggregated\-latent setting we consider – where the scientific quantity of interest is the marginalp\(z\)p\(z\)over a dataset – is not their target, and a population\-level spectrum diagnostic is not part of their protocols\. InverseBench\[[55](https://arxiv.org/html/2605.22891#bib.bib23)\]establishes strong reconstruction baselines using primarily pointwise quality metrics\. Multimodal posterior structure is not assessed, and the authors note that systematic evaluation of this remains open\. Conformal calibration has been applied to HEP\[[1](https://arxiv.org/html/2605.22891#bib.bib30)\], but combining it with CRPS and spectrum fidelity for cross\-family comparison of supervised reconstruction is, to our knowledge, novel\.

##### Current Methods

Current approaches to scientific inverse problems fall into three categories, each established across multiple domains\.*Analytic solvers*exploit the forward model together with domain\-specific physical constraints to recoverzzfromxxby solving algebraic or variational constraints\. Examples span radio\-astronomy image reconstruction\[[29](https://arxiv.org/html/2605.22891#bib.bib3)\], top quark reconstruction in physics\[[48](https://arxiv.org/html/2605.22891#bib.bib4)\], and classical geophysical inversion\[[32](https://arxiv.org/html/2605.22891#bib.bib5)\]\. These methods are interpretable but tend to degrade under noise and under\-determination\.*Deep regression*has become the dominant learning\-based approach across the domains above, typically trained to optimize pointwise metrics\. Population\-level regularizers \(e\.g\., maximum\-mean\-discrepancy \(MMD\) penalties\[[11](https://arxiv.org/html/2605.22891#bib.bib26)\]\) augment this with marginal\-distribution alignment, but do not address per\-event structure\.*Generative models*offer a path toward full posterior estimation\[[2](https://arxiv.org/html/2605.22891#bib.bib12)\], with applications including neutrino reconstruction\[[43](https://arxiv.org/html/2605.22891#bib.bib9)\]and gravitational\-wave parameter estimation\[[45](https://arxiv.org/html/2605.22891#bib.bib10)\]\.

## 3Limitations of Pointwise Evaluation Metrics

Pointwise metrics are misaligned with the goals of multimodal reconstruction\. The arguments below are model\-independent, applying to any method that extracts point predictions from a multimodal posterior\. We focus on the multimodality\-induced failure of point estimation\. A complementary Jensen\-gap failure mode, arising when downstream observables are nonlinear functions of the latent – also highlighting that the MSE\-optimal predictor in latent space is generally not MSE\-optimal once mapped to observables – is treated in[section˜D\.8\.1](https://arxiv.org/html/2605.22891#A4.SS8.SSS1)\.

### 3\.1The Conditional Mean Pathology

Consider a generic inverse problem: Given an observationxx, recover a latent quantityzzthat is distributed according to the conditional posteriorz∼p\(z∣x\)z\\sim p\(z\\mid x\)\. The MSE for a point estimator,fθ\(x\)f\_\{\\theta\}\(x\):

ℒ\(θ\)=𝔼z∼p\(z∣x\)\[‖fθ\(x\)−z‖2\]\\mathcal\{L\}\(\\theta\)=\\mathbb\{E\}\_\{z\\sim p\(z\\mid x\)\}\\left\[\\\|f\_\{\\theta\}\(x\)\-z\\\|^\{2\}\\right\]\(2\)is minimized by the conditional expectation,𝔼\[z∣x\]\\mathbb\{E\}\[z\\mid x\]\[[6](https://arxiv.org/html/2605.22891#bib.bib1)\], wherefθ\(x\)f\_\{\\theta\}\(x\)is theG†G^\{\\dagger\}surrogate and\|\|⋅\|\|\|\|\\cdot\|\|is the L2\-norm\. Whenp\(z∣x\)p\(z\\mid x\)is multimodal, the conditional mean can occupy a region of*zero*posterior density\. An instructive illustration is a bimodal symmetric posterior with two Gaussian components with means at±a\\pm aand variance,σ2\\sigma^\{2\}:

p\(z∣x\)=12𝒩\(z;\+a,σ2\)\+12𝒩\(z;−a,σ2\),a≫σ\.p\(z\\mid x\)=\\tfrac\{1\}\{2\}\\,\\mathcal\{N\}\(z;\\,\+a,\\,\\sigma^\{2\}\)\+\\tfrac\{1\}\{2\}\\,\\mathcal\{N\}\(z;\\,\-a,\\,\\sigma^\{2\}\),\\qquad a\\gg\\sigma\.\(3\)The MSE\-optimal prediction isfθ\(x\)=0f\_\{\\theta\}\(x\)=0, which has posterior densityp\(z=0∣x\)≈0p\(z=0\\mid x\)\\approx 0whenever the modes are well separated \(a≫σa\\gg\\sigma\)\. In other words, the “best” prediction under the squared error is a value that the posterior assigns essentially zero probability\.

### 3\.2Spectral Mismodeling under Aggregation

This per\-event pathology has direct population\-level consequences\. In many scientific applications, the final quantity of interest is not a per\-event estimatefθ\(x\)f\_\{\\theta\}\(x\)but rather a*marginal distribution*ofzzacross the dataset, which can be referred to as a*spectrum*or an*unfolded distribution*\.

By the law of total variance\[[25](https://arxiv.org/html/2605.22891#bib.bib2)\]:

Var\[z\]=𝔼\[Var\[z∣x\]\]⏟within\-posterior variance\+Var\[𝔼\[z∣x\]\]⏟variance of point estimates,\\text\{Var\}\[z\]\\;=\\;\\underbrace\{\\mathbb\{E\}\\left\[\\text\{Var\}\\left\[z\\mid x\\right\]\\right\]\}\_\{\\text\{within\-posterior variance\}\}\\;\+\\;\\underbrace\{\\text\{Var\}\\left\[\\mathbb\{E\}\\left\[z\\mid x\\right\]\\right\]\}\_\{\\text\{variance of point estimates\}\},\(4\)from which it follows immediately that

Var\[𝔼\[z∣x\]\]≤Var\[z\],\\text\{Var\}\\left\[\\mathbb\{E\}\[z\\mid x\]\\right\]\\;\\leq\\;\\text\{Var\}\[z\],\(5\)with equality only whenVar\[z∣x\]=0\\text\{Var\}\[z\\mid x\]=0almost surely, i\.e\. when the true spectrum has no within\-posterior variance\. It states that the distribution of conditional means is*always narrower*than the true marginal distribution ofzzwheneverVar\[z∣x\]\>0\\text\{Var\}\[z\\mid x\]\>0\.

The compression in[eq\.˜5](https://arxiv.org/html/2605.22891#S3.E5)is a*bias, not a variance term*: it does not decrease with dataset size and is inherited by any estimator converging to𝔼\[z∣x\]\\mathbb\{E\}\[z\\mid x\]\. It has*direct physical consequences*: tails and shapes of marginal distributions encode physical parameters, so their compression directly biases downstream measurements\.

The same decomposition explains why population\-level regularizers – e\.g\., MMD penalties added to an MSE objective \(e\.g\.\[[11](https://arxiv.org/html/2605.22891#bib.bib26)\]\) – reduce but fail to eliminate the bias\. A marginal penalty can reshape the aggregate distribution of point predictions, but it cannot correct their per\-event placement; matched marginal variance is then achieved by the wrong assignment of predictions to events, leaving the within\-posterior variance term of[eq\.˜4](https://arxiv.org/html/2605.22891#S3.E4)unmodelled\.

Finally, the within\-posterior variance in[eq\.˜4](https://arxiv.org/html/2605.22891#S3.E4)is largest precisely when multimodality is most pronounced \(it includes the spread between modes\); an analogous argument applies to MAE\-trained models, which target the conditional median\.

The practical implication is direct: evaluating models only by pointwise metrics such as RMSE actively rewards suppression of posterior structure and penalizes models that preserve it, biasing reconstructed spectra in ways that do not diminish with more data\.

## 4Proposed Three\-Step Evaluation Protocol

We propose a three\-metric protocol – per\-event distributional accuracy \(CRPS\), population\-level spectral accuracy \(spectrum fidelity\), and uncertainty calibration \(coverage\) – where each metric targets a failure mode the others miss\. See[appendix˜A](https://arxiv.org/html/2605.22891#A1)for more details\.

### 4\.1Metric I: CRPS \(Per\-Event Distributional Accuracy\)

Our per\-event metric is the CRPS \([section˜2](https://arxiv.org/html/2605.22891#S2)\)\. Three properties make it the right choice for evaluating multimodal inverse problems:*\(i\)*strict propriety ensures that it is minimized only when the predictive distribution equals the true posterior, so it penalizes posterior collapse rather than rewarding it,*\(ii\)*its reduction to MAE for point\-estimators places generative and regression models on a common footing, enabling fair comparison, and*\(iii\)*it is per\-event and averaged, matching RMSE’s granularity while measuring distributional rather than pointwise proximity to the truth\. Univariate CRPS suffices when downstream measurements are marginal; the energy score\[[24](https://arxiv.org/html/2605.22891#bib.bib46)\]extends the protocol to joint posteriors\.

### 4\.2Metric II: Spectrum Fidelity \(Population\-Level Accuracy\)

CRPS does not measure the marginal,p\(z\)p\(z\), across the dataset – the population\-level quantity downstream analyses depend on\. To evaluate this, we introduce a population\-level diagnostic not targeted by existing SBI benchmarks\[[37](https://arxiv.org/html/2605.22891#bib.bib18)\], based on the binnedχ2\\chi^\{2\}statistic\. Let\{nbtrue\}\\\{n\_\{b\}^\{\\text\{true\}\}\\\}and\{nbpred\}\\\{n\_\{b\}^\{\\text\{pred\}\}\\\}denote the histogram bin counts of the true and predicted values ofzzover the test dataset\. The spectrum fidelity is then defined as the Pearsonχ2\\chi^\{2\}\[[40](https://arxiv.org/html/2605.22891#bib.bib29)\]:

χspec2=∑b=1B\(nbpred−nbtrue\)2nbtrue,\\chi^\{2\}\_\{\\text\{spec\}\}=\\sum\_\{b=1\}^\{B\}\\frac\{\\left\(n\_\{b\}^\{\\text\{pred\}\}\-n\_\{b\}^\{\\text\{true\}\}\\right\)^\{2\}\}\{n\_\{b\}^\{\\text\{true\}\}\},\(6\)where the sum runs overBBbins, and assuming Poisson uncertainties with the variance in each bin approximated asσb2≈nbtrue\\sigma\_\{b\}^\{2\}\\approx n\_\{b\}^\{\\text\{true\}\}which is valid whennbtrue≫1n\_\{b\}^\{\\text\{true\}\}\\gg 1\. For generative models, we construct the predicted histogram by using*one random posterior sample per event*\. If the model is well calibrated, the resulting set of samples\{z^i∼p\(z∣xi\)\}i=1N\\\{\\hat\{z\}\_\{i\}\\sim p\(z\\mid x\_\{i\}\)\\\}\_\{i=1\}^\{N\}is marginally distributed asp\(z\)p\(z\)by construction,111This follows fromp\(z\)=∫p\(z∣x\)p\(x\)dxp\(z\)=\\int p\(z\\mid x\)\\,p\(x\)\\,\\mathrm\{d\}xand requires that the learned posterior equals the true posterior; in practice,χspec2\\chi^\{2\}\_\{\\text\{spec\}\}simultaneously tests this assumption\.and the expectedχspec2\\chi^\{2\}\_\{\\text\{spec\}\}equalsB−1B\-1\(the number of degrees of freedom\)\. For point\-estimators, the single prediction per event is used directly\.

We report the binnedχ2\\chi^\{2\}\-metric as the primary spectrum fidelity diagnostic due to its interpretability and direct connection to goodness\-of\-fit tests standard in scientific analyses\. The Earth Mover’s distance\[[44](https://arxiv.org/html/2605.22891#bib.bib34)\]is a natural unbinned alternative\.

### 4\.3Metric III: Calibration \(Uncertainty Trustworthiness\)

Neither CRPS nor spectrum fidelity directly assesses whether the*width*of the predicted posterior is trustworthy\. A model can obtain a reasonable CRPS by placing probability mass near the truth without being calibrated in its uncertainty estimates\. The third axis of our evaluation protocol is therefore a calibration diagnostic: a*coverage curve*that plots empirical coverage,C\(α\)C\(\\alpha\), against nominal confidence level,α\\alpha\. A perfectly calibrated model produces a coverage curve tracking the diagonal; undercoverage \(C\(α\)<αC\(\\alpha\)<\\alpha\) indicates posteriors that are too narrow, overcoverage indicates posteriors that are too wide\. The area between the curve and the diagonal,∫01\|C\(α\)−α\|dα\\int\_\{0\}^\{1\}\|C\(\\alpha\)\-\\alpha\|\\,\\mathrm\{d\}\\alpha, quantifies the overall deviation\.

The essential requirement is that calibration*is assessed*, not which specific procedure is used\. Some approaches are listed in[appendix˜B](https://arxiv.org/html/2605.22891#A2)\. Expected\-coverage curves are standard in the SBI literature\[[27](https://arxiv.org/html/2605.22891#bib.bib19),[51](https://arxiv.org/html/2605.22891#bib.bib20),[34](https://arxiv.org/html/2605.22891#bib.bib21)\]\. We use a conformal variant because it provides finite\-sample guarantees and admits a cross\-family comparison that is the point of this protocol\[[1](https://arxiv.org/html/2605.22891#bib.bib30)\]\. Following the forecasting principle of “maximize sharpness subject to calibration”\[[22](https://arxiv.org/html/2605.22891#bib.bib33)\], well\-calibrated models should be ranked by sharpness \(e\.g\., mean prediction\-set width\)\.

Calibration should also be assessed*conditionally*on phase\-space bins \(e\.g\. defined by event topology or signal\-to\-noise ratio\) to surface regional failures hidden by global coverage\. The choice of nonconformity score \(or density estimator\) interacts with the training objective: negative log\-likelihood \(NLL\) is natural for likelihood\-trained models but is not guaranteed to be calibrated for likelihood\-free objectives, even when density evaluation is correct\.

Point estimators without uncertainty cannot be assessed on this axis; this inability is itself a limitation worth reporting given the variance compression of[section˜3\.2](https://arxiv.org/html/2605.22891#S3.SS2)\.

## 5Benchmark I: A Synthetic Inverse Problem with known Multimodal Posterior

We first demonstrate the failure of pointwise evaluation in a controlled setting where the true posterior is known analytically\.

Table 1:Synthetic benchmark metrics \(10,000 test events\)\. Values are mean±\\pmstandard deviation across 5 random seeds\. The regression wins on RMSE; the normalizing flow and MDN dominate all distributional metrics\. Averaging the normalizing flow posterior recovers the regression’s RMSE, CRPS, and spectral mismodeling\. Calibration can only be assessed for uncertainty\-aware architectures\.χ2\\chi^\{2\}has 49 degrees of freedom; Best per column inbold\.∗Lowest by construction \([section˜3\.1](https://arxiv.org/html/2605.22891#S3.SS1)\); converged to constant prediction with negligible seed variance\.†One random posterior sample forχspec2/ndf\\chi^\{2\}\_\{\\text\{spec\}\}/\\text\{ndf\}\.‡Mixture density network needs manual initialization; see[section˜C\.3](https://arxiv.org/html/2605.22891#A3.SS3)\.

![Refer to caption](https://arxiv.org/html/2605.22891v1/x1.png)Figure 1:Synthetic benchmark results\.\(a\)Per\-event posteriors for an observation \(x∗=1x^\{\*\}=1\) in bimodal regime\.\(b\)Marginal distribution of reconstructedzzover 10,000 test events\.\(c\)Global conformal calibration coverage curves for the flow and MDN models\.##### Problem design

We consider the forward model,f:ℝ→ℝf:\\mathbb\{R\}\\rightarrow\\mathbb\{R\}:

x=f\(z\)=z2\+ε,ε∼𝒩\(0,σε2\),x=f\(z\)=z^\{2\}\+\\varepsilon,\\qquad\\varepsilon\\sim\\mathcal\{N\}\\left\(0,\\sigma\_\{\\varepsilon\}^\{2\}\\right\),\(7\)with uniform priorz∼𝒰\(−a,a\)z\\sim\\mathcal\{U\}\\left\(\-a,a\\right\), wherea=5a=5andσε=0\.5\\sigma\_\{\\varepsilon\}=0\.5\. The mappingz↦z2z\\mapsto z^\{2\}is symmetric: for any observationx\>0x\>0, the values±x\\pm\\sqrt\{x\}are equally likely*a priori*, producing an analytically tractable bimodal posterior \(see[appendix˜C](https://arxiv.org/html/2605.22891#A3)\)\. In this model,xxcontrols a smooth transition from unimodal \(x≈0x\\approx 0\) to strongly bimodal regimes, covering the generic many\-to\-one pathology\.

##### Compared methods

We compare four models on identical training data, differing only in their treatment of per\-event uncertainty\. The*regression MLP*is a 4\-layer MLP trained with MSE; by posterior symmetry, its MSE\-optimal prediction isz^\(x\)=𝔼\[z∣x\]=0\\hat\{z\}\(x\)=\\mathbb\{E\}\[z\\mid x\]=0for allxx, providing no per\-event uncertainty and cannot be calibrated using conformal prediction\. The*heteroscedastic regression*shares the backbone but outputs\(μ,σ\)\(\\mu,\\sigma\)trained by Gaussian NLL, modeling per\-event uncertainty under a unimodal Gaussian assumption\. The*mixture density network*\(MDN\) shares the regression backbone but outputs the parameters of aK=2K=2Gaussian mixture, trained by NLL; this is our lightweight posterior\-preserving baseline\. The*conditional normalizing flow*is a 4\-layer autoregressive rational\-quadratic spline \(ARQS\) flow\[[17](https://arxiv.org/html/2605.22891#bib.bib35)\]trained by NLL\[[39](https://arxiv.org/html/2605.22891#bib.bib36)\], included as a more expressive reference\. More details can be found in[appendix˜C](https://arxiv.org/html/2605.22891#A3)\.

##### Evaluation setup

Following[section˜4](https://arxiv.org/html/2605.22891#S4), we evaluate the models on 10,000 held\-out samples\. For the uncertainty\-aware architectures, we draw500500samples per event to approximate the conditional posterior \(see[section˜C\.5](https://arxiv.org/html/2605.22891#A3.SS5)for CRPS convergence study\)\. We compute CRPS,χspec2\\chi\_\{\\text\{spec\}\}^\{2\}withB=50B=50bins over\[−a,a\]\[\-a,a\], and a conformal calibration where we first split the test set into a calibration set \(ncal=1,000n\_\{\\text\{cal\}\}=1\{,\}000\) and an evaluation set \(neval=9,000n\_\{\\text\{eval\}\}=9\{,\}000\)\. We use the NLL as nonconformity score, and for this reason, we only assess calibration for the uncertainty\-aware models\.

##### Results under pointwise metrics

[Table˜1](https://arxiv.org/html/2605.22891#S5.T1)\(first column\) confirms that the regression model achieves the lowest RMSE\. The normalizing flow posterior mean, which is obtained by averaging all 500 samples per event, recovers approximately the same RMSE, confirming that the regression is simply the conditional expectation of the normalizing flow’s posterior\. Under RMSE alone, the regression model is the best available method\.

##### Results under the proposed protocol

The remaining columns of[table˜1](https://arxiv.org/html/2605.22891#S5.T1)invert the ranking\. The flow and MDN both dominate every distributional metric: CRPS scores are roughly half of the regression’s, andχspec2/ndf≈1\\chi^\{2\}\_\{\\text\{spec\}\}/\\text\{ndf\}\\approx 1indicates reconstructed marginal spectra statistically consistent with the truth, compared toχspec2/ndf∼104\\chi^\{2\}\_\{\\text\{spec\}\}/\\text\{ndf\}\\sim 10^\{4\}for the regression\.[Figure˜1](https://arxiv.org/html/2605.22891#S5.F1)makes the mechanism visible\. Atx∗=1x^\{\*\}=1, the true posterior is bimodal at±1\\pm 1; the regression predictsz^≈0\\hat\{z\}\\approx 0, a region of low posterior density, while the flow recovers both modes \(panel a\)\. Aggregated over the test set, the regression’s marginal collapses to a spike at zero and the flow’s matches the true uniform on\[−a,a\]\[\-a,\\,a\]\(panel b\) – the extreme case of[eq\.˜5](https://arxiv.org/html/2605.22891#S3.E5)\. The heteroscedastic regression’sχspec2\\chi^\{2\}\_\{\\text\{spec\}\}remains far above the MDN’s despite per\-event uncertainty – the unimodal Gaussian family cannot represent the true posterior\.

All uncertainty\-aware methods achieve nominal coverage \([fig\.˜1](https://arxiv.org/html/2605.22891#S5.F1)\(c\), deviance<0\.02<0\.02\), but with very different sharpness: the heteroscedastic regression’s 90% prediction sets span most of the prior, while the flow’s and MDN’s are disjoint, concentrated on the modes, and substantially narrower\. Coverage alone is therefore insufficient – see[table˜1](https://arxiv.org/html/2605.22891#S5.T1)for marginal deviances and[section˜C\.5](https://arxiv.org/html/2605.22891#A3.SS5)for prediction\-set widths and conditional calibration\.

##### Takeaways

The same models on the same test data yield opposite rankings depending on the metric, but the deeper finding is that the relevant axis is posterior\-family expressivity, not generative\-vs\-regression\. The MDN, sharing the regression’s backbone with only a 2\-component mixture head, matches the flow on every distributional metric; the heteroscedastic regression, with per\-event uncertainty but a unimodal Gaussian family, improves over the regression by two orders of magnitude onχspec2\\chi^\{2\}\_\{\\text\{spec\}\}yet remains far from the flow and MDN because its predictive family cannot represent the true posterior\. The flow remains the right default when posterior geometry is unknown a priori, which motivates using flows in Benchmark II\.

## 6Benchmark II: A Real\-World Multimodal Inverse Problem

Table 2:Physics benchmark for theΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)observable\. Five methods evaluated under pointwise \(RMSE\) and distributional \(CRPS,χspec2\\chi^\{2\}\_\{\\text\{spec\}\}\) metrics\. The transformer wins on RMSE; the discrete flow dominates distributional metrics\. Best per column inbold\.†One random posterior sample forχspec2/ndf\\chi^\{2\}\_\{\\text\{spec\}\}\\,/\\,\\text\{ndf\}\. Extended table with more observables in[section˜D\.8](https://arxiv.org/html/2605.22891#A4.SS8)\.

We benchmark our approach on a particular particle physics inverse problem: reconstructing top\-quark pairs in dileptonic decays \(see[appendix˜D](https://arxiv.org/html/2605.22891#A4)\)\. The problem exhibits all the pathologies the protocol targets: underdetermination \(two neutrinos escape detection\), combinatorial assignment, non\-Gaussian posterior geometry, and measurement noise\. Full physics background, observable definitions, and the forward model are in[appendix˜D](https://arxiv.org/html/2605.22891#A4)\. The latentzzencodes the kinematics of two top\-quarks; observationsxxare detector\-level \(jets, leptons, missing transverse energy\)\. Unlike Benchmark I, no analytic posterior is available; evaluation uses the per\-event true latentz\{z\}as reference\. Reconstructed spectra feed directly into top\-mass measurements and new\-particle searches\[[19](https://arxiv.org/html/2605.22891#bib.bib41),[12](https://arxiv.org/html/2605.22891#bib.bib42)\], with parameters of interest encoded in shapes and tails – exactly the regime[section˜3\.2](https://arxiv.org/html/2605.22891#S3.SS2)warns about\.

##### Problem design

We use a public Delphes\[[16](https://arxiv.org/html/2605.22891#bib.bib39)\]Monte Carlo dataset\[[42](https://arxiv.org/html/2605.22891#bib.bib40)\]\(~950k simulated events with realistic detector effects\) as our forward model\. Inputsx\{x\}are detector\-level observables \(jets, leptons, missing transverse energy\); targetsz\{z\}are the top\-quark four\-momenta\. Evaluation focuses on a multimodal derived observableΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\); its definition and additional observables are given in[section˜D\.3](https://arxiv.org/html/2605.22891#A4.SS3)\.

##### Compared methods

We evaluate four methods under identical conditions\. The*analytic solver*\[[48](https://arxiv.org/html/2605.22891#bib.bib4)\]algebraically imposes the kinematic mass constraints of the decay topology to return a single neutrino solution per event\. The*transformer regression*\(inspired byCMS Collaboration \[[11](https://arxiv.org/html/2605.22891#bib.bib26)\]\) is a transformer encoder trained with a hybrid MSE \+ MMD loss; we also report a pure\-MSE variant\. The*discrete normalizing flow*\(inspired byν2\\nu^\{2\}\-flows\[[43](https://arxiv.org/html/2605.22891#bib.bib9)\]\) is a conditional invertible network with exact likelihood\. The*continuous normalizing flow*uses the conditional flow\-matching objective\[[9](https://arxiv.org/html/2605.22891#bib.bib43),[35](https://arxiv.org/html/2605.22891#bib.bib44)\], with likelihood and sampling via ODE integration\. We use a conformal calibration procedure\[[1](https://arxiv.org/html/2605.22891#bib.bib30)\]\. Full details in[appendix˜D](https://arxiv.org/html/2605.22891#A4)\.

##### Evaluation setup

We follow the protocol of[section˜4](https://arxiv.org/html/2605.22891#S4), reporting CRPS, spectrum fidelity, and calibration coverage alongside RMSE forΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)\. We draw500500posterior samples per event from each flow; CRPS uses the full ensemble,χspec2\\chi^\{2\}\_\{\\text\{spec\}\}a single random draw per event\. Calibration coverage is assessed for the flows via their log\-likelihoods; it is unavailable for the point estimators, a limitation of those methods rather than a protocol gap\. Results are single\-seed per architecture;[section˜D\.7](https://arxiv.org/html/2605.22891#A4.SS7)shows that across four discrete\-flow configurations, metric spread is orders of magnitude below the inter\-method gaps reported in[tables˜2](https://arxiv.org/html/2605.22891#S6.T2)and[8](https://arxiv.org/html/2605.22891#A4.T8)\.

![Refer to caption](https://arxiv.org/html/2605.22891v1/x2.png)Figure 2:Top reconstruction benchmark results\.\(a\)Reconstructed per\-event posterior overΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)\. The flow\-based posteriors are nearly identical\.\(b\)Marginal distribution ofΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)over the test set\. The flows sample a random point per event; the point estimators provide a point estimate without uncertainties\.\(c\)Conformal coverage curve for the flows\.
##### Results under pointwise metrics

[Table˜2](https://arxiv.org/html/2605.22891#S6.T2)shows that the pure\-MSE transformer achieves the best RMSE onΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\), with the MMD\-regularized variant trailing slightly\. The analytic solver’s RMSE is comparable to the generative models rather than the transformers, because it commits to a single mode per event rather than computing a posterior mean: when the selected mode is correct, the error is small, when it is not, the error is of order the inter\-mode separation\. Both the solver and a single generative sample are “mode\-picking” estimators; the transformers achieve lower RMSE precisely by averaging across modes, which is the source of the spectral distortion identified in[section˜3\.2](https://arxiv.org/html/2605.22891#S3.SS2)\. The MMD transformer retains competitive RMSE and closes most of the marginal gap on unimodal observables \(see[appendix˜D](https://arxiv.org/html/2605.22891#A4)\), making it the natural choice under current evaluation practice\[[11](https://arxiv.org/html/2605.22891#bib.bib26)\]\. The generative models offer no obvious advantage on either axis\.

##### Results under the proposed protocol

The distributional metrics flip this conclusion\. OnΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\), the discrete flow achieves the best CRPS and lowestχspec2\\chi^\{2\}\_\{\\text\{spec\}\}, while the performance of both transformers remains orders of magnitude below that of the flow despite the MMD penalty\. This pattern is exactly the prediction of[section˜3\.2](https://arxiv.org/html/2605.22891#S3.SS2): the marginal penalty fixes what marginal information can fix – on unimodal observables, MMD reducesχspec2\\chi^\{2\}\_\{\\text\{spec\}\}by an order of magnitude and approaches that of the flows \([appendix˜D](https://arxiv.org/html/2605.22891#A4)\) – but on the multimodalΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)it cannot correct the per\-event conditional, and the spectral distortion persists \([fig\.˜2](https://arxiv.org/html/2605.22891#S6.F2)\(b\)\)\. Per\-event posteriors \([fig\.˜2](https://arxiv.org/html/2605.22891#S6.F2)\(a\)\) make the mechanism visible: both flows capture the bimodal structure ofΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)while both transformers collapse to the mean between modes, exactly like the synthetic pathology of[fig\.˜1](https://arxiv.org/html/2605.22891#S5.F1)\. The variance compression is visually clearest on unimodal observables \(see[section˜D\.8](https://arxiv.org/html/2605.22891#A4.SS8)\)\.

The coverage curves \([fig\.˜2](https://arxiv.org/html/2605.22891#S6.F2)\(c\)\) expose a qualitative difference invisible to RMSE or CRPS: the discrete flow is well\-calibrated with only slight undercoverage, while the continuous flow systematically undercovers despite comparable CRPS\. The discrepancy possibly reflects both approximate ODE\-based density evaluation and the mismatch between the flow\-matching objective and the likelihood used for calibration – the discrete flow optimizes this likelihood directly\. Two architectures indistinguishable on per\-event accuracy thus produce qualitatively different uncertainty estimates, surfaced only by the full three\-metric protocol\.

##### Takeaways

The physics benchmark confirms and extends the synthetic results\. OnΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\), model rankings flip between pointwise and distributional evaluation: the pure\-MSE transformer wins on RMSE while the discrete flow dominates CRPS and spectrum fidelity\. The MMD\-regularized transformer partially mitigates spectral distortion on unimodal observables \([appendix˜D](https://arxiv.org/html/2605.22891#A4)\) but cannot correct the per\-event multimodality ofΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\), validating the claim of[section˜3\.2](https://arxiv.org/html/2605.22891#S3.SS2)\. Among the generative models, the discrete and continuous flows appear approximately interchangeable under CRPS andχspec2\\chi^\{2\}\_\{\\text\{spec\}\}, but calibration exposes a consequential architecture difference: exact\-likelihood flows produce trustworthy posteriors while approximate\-likelihood flows do not\. The three\-step protocol is necessary to surface these distinctions\.

## 7Discussion

##### Limitations

Univariate CRPS loses joint posterior structure for high\-dimensionalzz; multivariate proper scores \(energy score\[[24](https://arxiv.org/html/2605.22891#bib.bib46)\]\) and joint coverage diagnostics \(SBC\[[51](https://arxiv.org/html/2605.22891#bib.bib20)\], TARP\[[34](https://arxiv.org/html/2605.22891#bib.bib21)\]\) extend the protocol\. The physics benchmark uses a single Monte Carlo generator with Delphes detector simulation, and we have not tested how the learned posteriors or their calibration behave under domain shift to full simulation or real collision data – validating this robustness is a prerequisite for experimental deployment\. Finally, the Benchmark II models are trained on a modest dataset without exhaustive tuning, so the absolute values in[table˜2](https://arxiv.org/html/2605.22891#S6.T2)should not be read as each architecture’s best achievable performance\. The ranking flip itself is robust:[section˜3\.2](https://arxiv.org/html/2605.22891#S3.SS2)shows that no MMD strength can correct per\-event multimodality, so while gaps may shrink with tuning, they cannot close\.

##### Applicability beyond the studied benchmarks

The protocol is domain\-agnostic\. Any many\-to\-one inverse problem produces the same posterior multimodality and the same failure of pointwise metrics \(e\.g\. phase\-retrieval, cosmological parameter inference, or gravitational wave parameter estimation\)\. The protocol requires no domain\-specific infrastructure: CRPS reduces to MAE for point\-estimators, spectrum fidelity requires only a binned histogram, and calibration requires either a tractable likelihood or conformal nonconformity scores\. The ranking flip is also not limited to bimodal posteriors: by[eq\.˜4](https://arxiv.org/html/2605.22891#S3.E4), any non\-negligible within\-posterior variance produces systematic tail compression that does not average away with more data, though the distortion is milder for heavy\-tailed unimodal cases and negligible for sharply peaked ones\. We recommend practitioners inspect posterior samples for a subset of events before defaulting to RMSE\.222We also note that marginally well\-calibrated generative reconstruction is essentially generative unfolding\[[30](https://arxiv.org/html/2605.22891#bib.bib47),[4](https://arxiv.org/html/2605.22891#bib.bib48),[41](https://arxiv.org/html/2605.22891#bib.bib49)\]\.

##### Open Problems

Our findings suggest three directions for future work\. First, lightweight per\-event uncertainty heads \(e\.g\., MDNs\) may correct spectral mismodeling more efficiently than population\-level regularizers by directly modeling the \(conditional\) within\-posterior variance of[eq\.˜4](https://arxiv.org/html/2605.22891#S3.E4)\[[14](https://arxiv.org/html/2605.22891#bib.bib50)\]\. Whether they match flow\-based fidelity on realistic multimodal problems is open, and their calibration can be assessed within the same protocol\. Second, distributional evaluation gains do not automatically translate into improved measurements: how posterior samples should be carried through downstream inference is application\-specific and largely unstudied\. Third, closing the calibration gap of likelihood\-free generative models likely requires higher\-order ODE solvers or likelihood\-aware finetuning\.

## 8Conclusion

We have shown that pointwise metrics are structurally misaligned with the scientific goals of supervised reconstruction: the law of total variance forces any point\-estimator without per\-event uncertainty to produce a marginal spectrum narrower than the truth, a bias that does not diminish with more data\. Our three\-part protocol – CRPS, spectrum fidelity, and coverage\-based calibration – addresses each component\. We show across a synthetic benchmark with analytic posterior and a realistic many\-to\-one inverse problem from particle physics that model rankings reverse between pointwise and distributional evaluation, and calibration separates architectures indistinguishable under CRPS\. Which model is judged best is a function of how its outputs are scored, not only of what those outputs are\. Careful evaluation makes this step visible and exposes where it has been going wrong\. By rewarding posterior collapse, pointwise\-only evaluation has been favoring models whose reconstructed spectra cannot support the downstream measurements built on them\. The protocol we propose makes this visible and gives practitioners a basis for comparison that scientific conclusions can rest on\. Code and evaluation pipelines are available from the corresponding author on request; see[appendix˜E](https://arxiv.org/html/2605.22891#A5)\.

## Acknowledgments and Disclosure of Funding

We thank D\. Stafford for providing a Python implementation of the analytic reconstruction of dileptonic top\-quark pairs and for fruitful discussions\. The authors acknowledge support from Deutsches Elektronen\-Synchrotron \(DESY\), a member of the Helmholtz Association, and from DASHH, Data Science in Hamburg – Helmholtz Graduate School for the Structure of Matter, as well as funding by the DFG under Germany’s Excellence Strategy – EXC 2121 “Quantum Universe” – 390833306\. This research was supported in part through the Maxwell computational resources operated at DESY, Hamburg, Germany\.

## References

- \[1\]J\. Y\. Araz and M\. Spannowsky\(2025\)Another fit bites the dust: conformal prediction as a calibration standard for machine learning in high\-energy physics\.External Links:[Link](https://arxiv.org/abs/2512.17048),2512\.17048Cited by:[Appendix B](https://arxiv.org/html/2605.22891#A2.SS0.SSS0.Px1.p1.4),[§C\.4](https://arxiv.org/html/2605.22891#A3.SS4.p4.4),[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px4.p4.1),[§4\.3](https://arxiv.org/html/2605.22891#S4.SS3.p2.1),[§6](https://arxiv.org/html/2605.22891#S6.SS0.SSS0.Px2.p1.1)\.
- \[2\]L\. Ardizzone, J\. Kruse, C\. Rother, and U\. Köthe\(2019\)Analyzing inverse problems with invertible neural networks\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=rJed6j0cKX)Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px5.p1.2)\.
- \[3\]S\. Baker and R\. D\. Cousins\(1984\)Clarification of the use of chi\-square and likelihood functions in fits to histograms\.Nuclear Instruments and Methods in Physics Research221\(2\),pp\. 437–442\.External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.1016/0167-5087%2884%2990016-4),ISSN 0167\-5087,[Link](https://www.sciencedirect.com/science/article/pii/0167508784900164)Cited by:[§A\.2](https://arxiv.org/html/2605.22891#A1.SS2.p1.2),[§D\.8](https://arxiv.org/html/2605.22891#A4.SS8.SSS0.Px4.p1.5),[Table 8](https://arxiv.org/html/2605.22891#A4.T8)\.
- \[4\]M\. Bellagente, A\. Butter, G\. Kasieczka, T\. Plehn, A\. Rousselot, R\. Winterhalder, L\. Ardizzone, and U\. Köthe\(2020\)Invertible networks or partons to detector and back again\.SciPost Phys\.9,pp\. 074\.External Links:[Document](https://dx.doi.org/10.21468/SciPostPhys.9.5.074),[Link](https://scipost.org/10.21468/SciPostPhys.9.5.074)Cited by:[footnote 2](https://arxiv.org/html/2605.22891#footnote2)\.
- \[5\]C\. M\. Bishop\(1994\)Mixture density networks\.WorkingPaperTechnical Report4288,NCRG,Aston University,Aston University\(English\)\.Cited by:[§C\.3](https://arxiv.org/html/2605.22891#A3.SS3.SSS0.Px2.p1.3)\.
- \[6\]C\. M\. Bishop\(2006\-01\)Pattern recognition and machine learning\.Springer\.External Links:[Link](https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/)Cited by:[§1](https://arxiv.org/html/2605.22891#S1.p1.1),[§3\.1](https://arxiv.org/html/2605.22891#S3.SS1.p1.11)\.
- \[7\]J\. Brehmer, V\. Bresó, P\. de Haan, T\. Plehn, H\. Qu, J\. Spinner, and J\. Thaler\(2025\)A Lorentz\-equivariant transformer for all of the LHC\.SciPost Phys\.19,pp\. 108\.External Links:[Document](https://dx.doi.org/10.21468/SciPostPhys.19.4.108),[Link](https://scipost.org/10.21468/SciPostPhys.19.4.108)Cited by:[§D\.5](https://arxiv.org/html/2605.22891#A4.SS5.SSS0.Px4.p1.1)\.
- \[8\]J\. Brehmer, P\. de Haan, S\. Behrends, and T\. Cohen\(2023\)Geometric algebra transformer\.InAdvances in Neural Information Processing Systems,Vol\.36\.External Links:[Link](https://arxiv.org/abs/2305.18415),2305\.18415Cited by:[§D\.5](https://arxiv.org/html/2605.22891#A4.SS5.SSS0.Px4.p1.1)\.
- \[9\]R\. T\. Q\. Chen, Y\. Rubanova, J\. Bettencourt, and D\. Duvenaud\(2018\)Neural ordinary differential equations\.Advances in Neural Information Processing Systems\.Cited by:[§6](https://arxiv.org/html/2605.22891#S6.SS0.SSS0.Px2.p1.1)\.
- \[10\]R\. T\. Q\. Chen\(2018\)Torchdiffeq\.External Links:[Link](https://github.com/rtqichen/torchdiffeq)Cited by:[§E\.3](https://arxiv.org/html/2605.22891#A5.SS3.p1.1)\.
- \[11\]T\. CMS Collaboration\(2025\)Enhanced reconstruction of dileptonic top quark\-antiquark events using supervised machine learning methods\.Technical reportCERN,Geneva\.External Links:[Link](https://cds.cern.ch/record/2944724)Cited by:[§D\.5](https://arxiv.org/html/2605.22891#A4.SS5.SSS0.Px2.p1.4),[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px5.p1.2),[§3\.2](https://arxiv.org/html/2605.22891#S3.SS2.p4.1),[§6](https://arxiv.org/html/2605.22891#S6.SS0.SSS0.Px2.p1.1),[§6](https://arxiv.org/html/2605.22891#S6.SS0.SSS0.Px4.p1.1)\.
- \[12\]T\. CMS Collaboration\(2025\-08\)Observation of a pseudoscalar excess at the top quark pair production threshold\.Reports on Progress in Physics88\(8\),pp\. 087801\.External Links:[Document](https://dx.doi.org/10.1088/1361-6633/adf7d3),[Link](https://doi.org/10.1088/1361-6633/adf7d3)Cited by:[§6](https://arxiv.org/html/2605.22891#S6.p1.3)\.
- \[13\]K\. Cranmer, J\. Brehmer, and G\. Louppe\(2020\)The frontier of simulation\-based inference\.Proceedings of the National Academy of Sciences117\(48\),pp\. 30055–30062\.External Links:[Document](https://dx.doi.org/10.1073/pnas.1912789117),[Link](https://www.pnas.org/doi/abs/10.1073/pnas.1912789117),https://www\.pnas\.org/doi/pdf/10\.1073/pnas\.1912789117Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px3.p1.4)\.
- \[14\]P\. Cui, W\. Hu, and J\. Zhu\(2020\)Calibrated reliable regression using maximum mean discrepancy\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 17164–17175\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/c74c4bf0dad9cbae3d80faa054b7d8ca-Paper.pdf)Cited by:[§7](https://arxiv.org/html/2605.22891#S7.SS0.SSS0.Px3.p1.1)\.
- \[15\]M\. Dax, S\. R\. Green, J\. Gair, J\. H\. Macke, A\. Buonanno, and B\. Schölkopf\(2021\-12\)Real\-time gravitational wave science with neural posterior estimation\.Phys\. Rev\. Lett\.127,pp\. 241103\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevLett.127.241103),[Link](https://link.aps.org/doi/10.1103/PhysRevLett.127.241103)Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px2.p1.1)\.
- \[16\]J\. de Favereau, C\. Delaere, P\. Demin, A\. Giammanco, V\. Lemaître, A\. Mertens, M\. Selvaggi, and T\. D\. 3\. collaboration\(2014\)DELPHES 3: a modular framework for fast simulation of a generic collider experiment\.Journal of High Energy Physics2014\(2\),pp\. 57\.External Links:[Document](https://dx.doi.org/10.1007/JHEP02%282014%29057)Cited by:[§D\.2](https://arxiv.org/html/2605.22891#A4.SS2.p1.2),[§6](https://arxiv.org/html/2605.22891#S6.SS0.SSS0.Px1.p1.3)\.
- \[17\]C\. Durkan, A\. Bekasov, I\. Murray, and G\. Papamakarios\(2019\)Neural spline flows\.InAdvances in Neural Information Processing Systems,H\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d'Alché\-Buc, E\. Fox, and R\. Garnett \(Eds\.\),Vol\.32,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2019/file/7ac71d433f282034e088473244df8c02-Paper.pdf)Cited by:[§C\.3](https://arxiv.org/html/2605.22891#A3.SS3.SSS0.Px3.p1.1),[§5](https://arxiv.org/html/2605.22891#S5.SS0.SSS0.Px2.p1.4)\.
- \[18\]L\. Ehrke, J\. A\. Raine, K\. Zoch, M\. Guth, and T\. Golling\(2023\)Topological reconstruction of particle physics processes using graph neural networks\.Phys\. Rev\. D107\(11\),pp\. 116019\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevD.107.116019),2303\.13937Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px2.p1.1)\.
- \[19\]L\. Favaro, R\. Kogler, A\. Paasch, S\. P\. Schweitzer, T\. Plehn, and D\. Schwarz\(2025\)How to unfold top decays\.SciPost Phys\. Core8,pp\. 053\.External Links:[Document](https://dx.doi.org/10.21468/SciPostPhysCore.8.3.053),[Link](https://scipost.org/10.21468/SciPostPhysCore.8.3.053)Cited by:[§6](https://arxiv.org/html/2605.22891#S6.p1.3)\.
- \[20\]C\. A\. T\. Ferro, D\. S\. Richardson, and A\. P\. Weigel\(2008\)On the effect of ensemble size on the discrete and continuous ranked probability scores\.Meteorological Applications15\(1\),pp\. 19–24\.External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.1002/met.45),[Link](https://rmets.onlinelibrary.wiley.com/doi/abs/10.1002/met.45),https://rmets\.onlinelibrary\.wiley\.com/doi/pdf/10\.1002/met\.45Cited by:[§A\.1](https://arxiv.org/html/2605.22891#A1.SS1.p1.9),[§C\.4](https://arxiv.org/html/2605.22891#A3.SS4.p1.3),[§C\.5](https://arxiv.org/html/2605.22891#A3.SS5.SSS0.Px3.p1.6),[§D\.7](https://arxiv.org/html/2605.22891#A4.SS7.SSS0.Px1.p1.6)\.
- \[21\]A\. Gammerman, V\. Vovk, and V\. Vapnik\(1998\)Learning by transduction\.InProceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence,UAI’98,San Francisco, CA, USA,pp\. 148–155\.External Links:ISBN 155860555XCited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px4.p4.1)\.
- \[22\]T\. Gneiting, F\. Balabdaoui, and A\. E\. Raftery\(2007\)Probabilistic forecasts, calibration and sharpness\.Journal of the Royal Statistical Society Series B69\(2\),pp\. 243–268\.External Links:[Link](https://econpapers.repec.org/RePEc:bla:jorssb:v:69:y:2007:i:2:p:243-268)Cited by:[Appendix B](https://arxiv.org/html/2605.22891#A2.SS0.SSS0.Px3.p1.1),[§4\.3](https://arxiv.org/html/2605.22891#S4.SS3.p2.1)\.
- \[23\]T\. Gneiting and A\. E\. Raftery\(2007\)Strictly proper scoring rules, prediction, and estimation\.Journal of the American Statistical Association102\(477\),pp\. 359–378\.External Links:[Document](https://dx.doi.org/10.1198/016214506000001437),[Link](https://doi.org/10.1198/016214506000001437),https://doi\.org/10\.1198/016214506000001437Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px4.p1.2),[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px4.p2.2)\.
- \[24\]T\. Gneiting, L\. I\. Stanberry, E\. P\. Grimit, L\. Held, and N\. A\. Johnson\(2008\-08\-01\)Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds\.TEST17\(2\),pp\. 211–235\.External Links:[Document](https://dx.doi.org/10.1007/s11749-008-0114-x),ISSN 1863\-8260,[Link](https://doi.org/10.1007/s11749-008-0114-x)Cited by:[§4\.1](https://arxiv.org/html/2605.22891#S4.SS1.p1.1),[§7](https://arxiv.org/html/2605.22891#S7.SS0.SSS0.Px1.p1.1)\.
- \[25\]G\. Grimmett and D\. Stirzaker\(2020\)Probability and random processes\.OUP Oxford\.External Links:ISBN 9780192586865Cited by:[§3\.2](https://arxiv.org/html/2605.22891#S3.SS2.p2.4)\.
- \[26\]C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger\(2017\)On calibration of modern neural networks\.CoRRabs/1706\.04599\.External Links:[Link](http://arxiv.org/abs/1706.04599),1706\.04599Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px4.p4.1)\.
- \[27\]J\. Hermans, A\. Delaunoy, F\. Rozet, A\. Wehenkel, V\. Begy, and G\. Louppe\(2022\)A trust crisis in simulation\-based inference? your posterior approximations can be unfaithful\.External Links:[Link](https://arxiv.org/abs/2110.06581),2110\.06581Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px4.p4.1),[§4\.3](https://arxiv.org/html/2605.22891#S4.SS3.p2.1)\.
- \[28\]H\. Hersbach\(2000\)Decomposition of the continuous ranked probability score for ensemble prediction systems\.Weather and Forecasting15\(5\),pp\. 559–570\.External Links:[Document](https://dx.doi.org/10.1175/1520-0434%282000%29015%3C0559%3ADOTCRP%3E2.0.CO%3B2),[Link](https://journals.ametsoc.org/view/journals/wefo/15/5/1520-0434_2000_015_0559_dotcrp_2_0_co_2.xml)Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px4.p2.2)\.
- \[29\]J\. A\. Högbom\(1974\-06\)Aperture Synthesis with a Non\-Regular Distribution of Interferometer Baselines\.Astron\. Astrophys\. Suppl\. Ser\.15,pp\. 417\.Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px5.p1.2)\.
- \[30\]N\. Huetsch, J\. M\. Villadamigo, A\. Shmakov, S\. Diefenbacher, V\. Mikuni, T\. Heimel, M\. Fenton, K\. Greif, B\. Nachman, D\. Whiteson, A\. Butter, and T\. Plehn\(2025\)The landscape of unfolding with machine learning\.SciPost Phys\.18,pp\. 070\.External Links:[Document](https://dx.doi.org/10.21468/SciPostPhys.18.2.070),[Link](https://scipost.org/10.21468/SciPostPhys.18.2.070)Cited by:[footnote 2](https://arxiv.org/html/2605.22891#footnote2)\.
- \[31\]A\. Jordan, F\. Krüger, and S\. Lerch\(2019\)Evaluating probabilistic forecasts with scoringrules\.Journal of Statistical Software90\(12\),pp\. 1–37\.External Links:[Document](https://dx.doi.org/10.18637/jss.v090.i12),[Link](https://www.jstatsoft.org/index.php/jss/article/view/v090i12)Cited by:[§A\.1](https://arxiv.org/html/2605.22891#A1.SS1.p1.9),[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px4.p2.6)\.
- \[32\]Y\. Kim and N\. Nakata\(2018\-12\)Geophysical inversion versus machine learning in inverse problems\.Leading Edge37\(12\),pp\. 894–901\.External Links:[Document](https://dx.doi.org/10.1190/tle37120894.1)Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px5.p1.2)\.
- \[33\]V\. Kuleshov, N\. Fenner, and S\. Ermon\(2018\)Accurate uncertainties for deep learning using calibrated regression\.CoRRabs/1807\.00263\.External Links:[Link](http://arxiv.org/abs/1807.00263),1807\.00263Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px4.p4.1)\.
- \[34\]P\. Lemos, A\. Coogan, Y\. Hezaveh, and L\. Perreault\-Levasseur\(2023\-07\)Sampling\-based accuracy testing of posterior estimators for general inference\.InProceedings of the 40th International Conference on Machine Learning,A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\),Proceedings of Machine Learning Research, Vol\.202,pp\. 19256–19273\.External Links:[Link](https://proceedings.mlr.press/v202/lemos23a.html)Cited by:[Appendix B](https://arxiv.org/html/2605.22891#A2.SS0.SSS0.Px4.p1.1),[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px4.p4.1),[§4\.3](https://arxiv.org/html/2605.22891#S4.SS3.p2.1),[§7](https://arxiv.org/html/2605.22891#S7.SS0.SSS0.Px1.p1.1)\.
- \[35\]Y\. Lipman, R\. T\. Q\. Chen, H\. Ben\-Hamu, M\. Nickel, and M\. Le\(2023\)Flow matching for generative modeling\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=PqvMRDCJT9t)Cited by:[§D\.5](https://arxiv.org/html/2605.22891#A4.SS5.SSS0.Px4.p1.1),[Table 5](https://arxiv.org/html/2605.22891#A4.T5.9.2),[§6](https://arxiv.org/html/2605.22891#S6.SS0.SSS0.Px2.p1.1)\.
- \[36\]I\. Loshchilov and F\. Hutter\(2017\)Fixing weight decay regularization in adam\.CoRRabs/1711\.05101\.External Links:[Link](http://arxiv.org/abs/1711.05101),1711\.05101Cited by:[§C\.3](https://arxiv.org/html/2605.22891#A3.SS3.p1.3),[§D\.6](https://arxiv.org/html/2605.22891#A4.SS6.p1.1)\.
- \[37\]J\. Lueckmann, J\. Boelts, D\. Greenberg, P\. Goncalves, and J\. Macke\(2021\-05\)Benchmarking simulation\-based inference\.InProceedings of The 24th International Conference on Artificial Intelligence and Statistics,A\. Banerjee and K\. Fukumizu \(Eds\.\),Proceedings of Machine Learning Research, Vol\.130,pp\. 343–351\.External Links:[Link](https://proceedings.mlr.press/v130/lueckmann21a.html)Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px4.p4.1),[§4\.2](https://arxiv.org/html/2605.22891#S4.SS2.p1.6)\.
- \[38\]G\. Papamakarios and I\. Murray\(2016\)Fastϵ\\epsilon\-free inference of simulation models with bayesian conditional density estimation\.InAdvances in Neural Information Processing Systems,D\. Lee, M\. Sugiyama, U\. Luxburg, I\. Guyon, and R\. Garnett \(Eds\.\),Vol\.29,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2016/file/6aca97005c68f1206823815f66102863-Paper.pdf)Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px3.p1.4)\.
- \[39\]G\. Papamakarios, E\. Nalisnick, D\. J\. Rezende, S\. Mohamed, and B\. Lakshminarayanan\(2021\)Normalizing flows for probabilistic modeling and inference\.Journal of Machine Learning Research22\(57\),pp\. 1–64\.External Links:[Link](http://jmlr.org/papers/v22/19-1028.html)Cited by:[§5](https://arxiv.org/html/2605.22891#S5.SS0.SSS0.Px2.p1.4)\.
- \[40\]K\. Pearson\(1992\)On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling\.InBreakthroughs in Statistics: Methodology and Distribution,pp\. 11–28\.External Links:[Document](https://dx.doi.org/10.1007/978-1-4612-4380-9%5F2),ISBN 978\-1\-4612\-4380\-9,[Link](https://doi.org/10.1007/978-1-4612-4380-9_2)Cited by:[§4\.2](https://arxiv.org/html/2605.22891#S4.SS2.p1.6)\.
- \[41\]A\. Petitjean, A\. Butter, K\. Greif, S\. P\. Schweitzer, T\. Plehn, J\. Spinner, and D\. Whiteson\(2025\)Generative unfolding of jets and their substructure\.External Links:[Link](https://arxiv.org/abs/2510.19906),2510\.19906Cited by:[footnote 2](https://arxiv.org/html/2605.22891#footnote2)\.
- \[42\]J\. A\. Raine, M\. Leigh, K\. Zoch, L\. Ehrke, D\. Sengupta, and T\. Golling\(2023\-07\)Dileptonic ttbar neutrino regression dataset\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.8113516),[Link](https://doi.org/10.5281/zenodo.8113516)Cited by:[§D\.2](https://arxiv.org/html/2605.22891#A4.SS2.p1.2),[§D\.4](https://arxiv.org/html/2605.22891#A4.SS4.p1.2),[§E\.4](https://arxiv.org/html/2605.22891#A5.SS4.p1.1),[§6](https://arxiv.org/html/2605.22891#S6.SS0.SSS0.Px1.p1.3)\.
- \[43\]J\. A\. Raine, M\. Leigh, K\. Zoch, and T\. Golling\(2024\-01\)Fast and improved neutrino reconstruction in multineutrino final states with conditional normalizing flows\.Phys\. Rev\. D109,pp\. 012005\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevD.109.012005),[Link](https://link.aps.org/doi/10.1103/PhysRevD.109.012005)Cited by:[§D\.2](https://arxiv.org/html/2605.22891#A4.SS2.p1.2),[§D\.5](https://arxiv.org/html/2605.22891#A4.SS5.SSS0.Px3.p1.1),[Table 6](https://arxiv.org/html/2605.22891#A4.T6),[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px5.p1.2),[§6](https://arxiv.org/html/2605.22891#S6.SS0.SSS0.Px2.p1.1)\.
- \[44\]Y\. Rubner, C\. Tomasi, and L\. J\. Guibas\(1998\-01\)A metric for distributions with applications to image databases\.InSixth International Conference on Computer Vision \(IEEE Cat\. No\.98CH36271\),Vol\.,pp\. 59–66\.External Links:[Document](https://dx.doi.org/10.1109/ICCV.1998.710701),ISSNCited by:[§4\.2](https://arxiv.org/html/2605.22891#S4.SS2.p2.1)\.
- \[45\]H\. Shen, E\. A\. Huerta, E\. O’Shea, P\. Kumar, and Z\. Zhao\(2021\-11\)Statistically\-informed deep learning for gravitational wave parameter estimation\.Machine Learning: Science and Technology3\(1\),pp\. 015007\.External Links:[Document](https://dx.doi.org/10.1088/2632-2153/ac3843),[Link](https://doi.org/10.1088/2632-2153/ac3843)Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px5.p1.2)\.
- \[46\]A\. Shmakov, M\. J\. Fenton, T\. Ho, S\. Hsu, D\. Whiteson, and P\. Baldi\(2022\)SPANet: Generalized permutationless set assignment for particle physics using symmetry preserving attention\.SciPost Phys\.12\(5\),pp\. 178\.External Links:[Document](https://dx.doi.org/10.21468/SciPostPhys.12.5.178),2106\.03898Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px2.p1.1)\.
- \[47\]E\. Y\. Sidky and X\. Pan\(2022\)Report on the aapm deep\-learning sparse\-view ct grand challenge\.Medical Physics49\(8\),pp\. 4935–4943\.External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.1002/mp.15489),[Link](https://aapm.onlinelibrary.wiley.com/doi/abs/10.1002/mp.15489),https://aapm\.onlinelibrary\.wiley\.com/doi/pdf/10\.1002/mp\.15489Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px2.p1.1)\.
- \[48\]L\. Sonnenschein\(2005\-11\)Algebraic approach to solvett¯t\\overline\{t\}dilepton equations\.Phys\. Rev\. D72,pp\. 095020\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevD.72.095020),[Link](https://link.aps.org/doi/10.1103/PhysRevD.72.095020)Cited by:[§D\.5](https://arxiv.org/html/2605.22891#A4.SS5.SSS0.Px1.p1.3),[§E\.1](https://arxiv.org/html/2605.22891#A5.SS1.SSS0.Px1.p2.1),[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px5.p1.2),[§6](https://arxiv.org/html/2605.22891#S6.SS0.SSS0.Px2.p1.1)\.
- \[49\]J\. Spinner, V\. Bresó, P\. De Haan, T\. Plehn, J\. Thaler, and J\. Brehmer\(2024\)Lorentz\-equivariant geometric algebra transformers for high\-energy physics\.InAdvances in Neural Information Processing Systems,Vol\.37\.External Links:[Link](https://arxiv.org/abs/2405.14806),2405\.14806Cited by:[§D\.5](https://arxiv.org/html/2605.22891#A4.SS5.SSS0.Px4.p1.1)\.
- \[50\]V\. Stimper, D\. Liu, A\. Campbell, V\. Berenz, L\. Ryll, B\. Schölkopf, and J\. M\. Hernández\-Lobato\(2023\)Normflows: a pytorch package for normalizing flows\.Journal of Open Source Software8\(86\),pp\. 5361\.External Links:[Document](https://dx.doi.org/10.21105/joss.05361),[Link](https://doi.org/10.21105/joss.05361)Cited by:[§C\.3](https://arxiv.org/html/2605.22891#A3.SS3.SSS0.Px3.p1.1),[§E\.3](https://arxiv.org/html/2605.22891#A5.SS3.p1.1)\.
- \[51\]S\. Talts, M\. Betancourt, D\. Simpson, A\. Vehtari, and A\. Gelman\(2020\)Validating bayesian inference algorithms with simulation\-based calibration\.External Links:[Link](https://arxiv.org/abs/1804.06788),1804\.06788Cited by:[Appendix B](https://arxiv.org/html/2605.22891#A2.SS0.SSS0.Px4.p1.1),[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px4.p4.1),[§4\.3](https://arxiv.org/html/2605.22891#S4.SS3.p2.1),[§7](https://arxiv.org/html/2605.22891#S7.SS0.SSS0.Px1.p1.1)\.
- \[52\]T\. Vandal, E\. Kodra, S\. Ganguly, A\. Michaelis, R\. Nemani, and A\. R\. Ganguly\(2018\)Generating high resolution climate change projections through single image super\-resolution: an abridged version\.InProceedings of the 27th International Joint Conference on Artificial Intelligence,IJCAI’18,pp\. 5389–5393\.External Links:ISBN 9780999241127Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px2.p1.1)\.
- \[53\]Z\. Wang, R\. Gao, M\. Yin, M\. Zhou, and D\. Blei\(2023\-04\)Probabilistic conformal prediction using conditional random samples\.InProceedings of The 26th International Conference on Artificial Intelligence and Statistics,F\. Ruiz, J\. Dy, and J\. van de Meent \(Eds\.\),Proceedings of Machine Learning Research, Vol\.206,pp\. 8814–8836\.External Links:[Link](https://proceedings.mlr.press/v206/wang23n.html)Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px4.p4.1)\.
- \[54\]Z\. Wang, J\. Chen, and S\. C\. H\. Hoi\(2021\)Deep learning for image super\-resolution: a survey\.IEEE Transactions on Pattern Analysis and Machine Intelligence43\(10\),pp\. 3365–3387\.External Links:[Document](https://dx.doi.org/10.1109/TPAMI.2020.2982166)Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px2.p1.1)\.
- \[55\]H\. Zheng, W\. Chu, B\. Zhang, Z\. Wu, A\. Wang, B\. Feng, C\. Zou, Y\. Sun, N\. B\. Kovachki, Z\. E\. Ross, K\. Bouman, and Y\. Yue\(2025\)InverseBench: benchmarking plug\-and\-play diffusion priors for inverse problems in physical sciences\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=U3PBITXNG6)Cited by:[§2](https://arxiv.org/html/2605.22891#S2.SS0.SSS0.Px4.p4.1)\.

## Appendix AEvaluation Metrics

### A\.1Empirical CRPS estimator

For a predictive distribution represented byNNposterior samples\{z^\(k\)\}k=1N\\\{\\hat\{z\}^\{\(k\)\}\\\}\_\{k=1\}^\{N\}, the CRPS of[eq\.˜1](https://arxiv.org/html/2605.22891#S2.E1)is estimated as

CRPS^=1N∑k=1N\|z^\(k\)−z\|−12N2∑k=1N∑j=1N\|z^\(k\)−z^\(j\)\|,\\widehat\{\\mathrm\{CRPS\}\}=\\frac\{1\}\{N\}\\sum\_\{k=1\}^\{N\}\\bigl\|\\hat\{z\}^\{\(k\)\}\-z\\bigr\|\-\\frac\{1\}\{2N^\{2\}\}\\sum\_\{k=1\}^\{N\}\\sum\_\{j=1\}^\{N\}\\bigl\|\\hat\{z\}^\{\(k\)\}\-\\hat\{z\}^\{\(j\)\}\\bigr\|,\(8\)computable in𝒪\(Nlog⁡N\)\\mathcal\{O\}\(N\\log N\)via sorting\[[31](https://arxiv.org/html/2605.22891#bib.bib27)\]\. The estimator has finite\-sample bias of𝒪\(1/N\)\\mathcal\{O\}\(1/N\)from the second term\[[20](https://arxiv.org/html/2605.22891#bib.bib53)\]\. For the sample sizes used in this work \(N=500N=500for both benchmarks\) the bias≈0\.05%\\approx 0\.05\\%of the reported CRPS values is well below the gaps between methods\. ReducingFFto a Dirac distribution at a point predictionz^\\hat\{z\}recoversCRPS^=\|z^−z\|\\widehat\{\\mathrm\{CRPS\}\}=\|\\hat\{z\}\-z\|, the MAE\.

### A\.2Spectrum fidelity

The binnedχ2\\chi^\{2\}statistic of[eq\.˜6](https://arxiv.org/html/2605.22891#S4.E6)requires only that bin counts are well\-approximated as Poisson, i\.e\.nbtrue≳10n^\{\\text\{true\}\}\_\{b\}\\gtrsim 10\. Bin counts and per\-benchmark binning choices are stated in the respective evaluation subsections \([appendices˜C](https://arxiv.org/html/2605.22891#A3)and[D](https://arxiv.org/html/2605.22891#A4)\)\. For tail\-heavy spectra where some bins violate the Poisson approximation, a likelihood\-ratio statistic such as Baker\-Cousins\[[3](https://arxiv.org/html/2605.22891#bib.bib54)\]is a more accurate goodness\-of\-fit measure\. We discuss this where relevant in[section˜D\.8](https://arxiv.org/html/2605.22891#A4.SS8)\.

### A\.3Calibration

The conformal procedure used throughout this work, including the choice of nonconformity score per model and the finite\-sample marginal coverage guarantee, is described next\.

## Appendix BAlternative Calibration Procedures

The protocol of[section˜4\.3](https://arxiv.org/html/2605.22891#S4.SS3)uses conformal calibration as a concrete instantiation of the calibration axis, but the protocol’s requirement is that calibration is assessed – not which procedure is used\. We briefly review three methods and indicate when each is appropriate\.

##### Conformal calibration

Given a nonconformity scores\(z,x\)s\(z,x\)\(e\.g\. the NLL\), the thresholdq^1−α\\hat\{q\}\_\{1\-\\alpha\}is the⌈\(1−α\)\(ncal\+1\)⌉\\lceil\(1\-\\alpha\)\(n\_\{\\text\{cal\}\}\+1\)\\rceil\-th smallest score on a held\-out calibration set\[[1](https://arxiv.org/html/2605.22891#bib.bib30)\]\. The resulting prediction set satisfiesℙ\(z∈𝒞1−α\(x\)\)≥1−α\\mathbb\{P\}\(z\\in\\mathcal\{C\}\_\{1\-\\alpha\}\(x\)\)\\geq 1\-\\alphaunder exchangeability, giving a finite\-sample marginal coverage guarantee that holds across model families\. We use it throughout this work for cross\-family comparability\.

##### HPD credible intervals

For models with tractable densities, the highest posterior density regionℛα=\{z:pθ\(z∣x\)≥τα\}\\mathcal\{R\}\_\{\\alpha\}=\\\{z:p\_\{\\theta\}\(z\\mid x\)\\geq\\tau\_\{\\alpha\}\\\}tests the learned density directly without requiring a calibration split\. It offers no finite\-sample guarantee and is most appropriate when the density is the object of interest itself, e\.g\. for likelihood\-trained models being assessed on their own training objective\.

##### PIT / CDF\-based diagnostics

The probability integral transform of the true value under the predicted CDF should be uniform under correct calibration\[[22](https://arxiv.org/html/2605.22891#bib.bib33)\]and deviations reveal systematic biases\. PIT is most informative for univariate predictions and provides a richer diagnostic than scalar coverage, but no formal guarantee\.

##### SBC and TARP

For joint posteriors over high\-dimensional latents, simulation\-based calibration\[[51](https://arxiv.org/html/2605.22891#bib.bib20)\]and TARP\[[34](https://arxiv.org/html/2605.22891#bib.bib21)\]extend coverage diagnostics to the joint setting\. They are the natural choice when the protocol’s univariate CRPS is replaced by the energy score \([section˜4\.1](https://arxiv.org/html/2605.22891#S4.SS1)\) for joint\-posterior applications\.

##### Choosing among them

We recommend to use conformal prediction when comparing across model families \(its finite\-sample guarantee is family\-agnostic\)\. Use HPD or PIT when assessing a single model’s density quality on its own terms\. Use SBC or TARP when the latent of interest is high\-dimensional and the joint posterior structure matters\.

## Appendix CBenchmark I: Synthetic Inverse Problem

This appendix provides architecture, training, and evaluation details for the synthetic benchmark of[section˜5](https://arxiv.org/html/2605.22891#S5), along with extended results supporting the main\-text claims\.

### C\.1Forward Model

We wantp\(z∣x\)p\(z\\mid x\)for the modelx=z2\+εx=z^\{2\}\+\\varepsilon, withε∼𝒩\(0,σε2\)\\varepsilon\\sim\\mathcal\{N\}\(0,\\sigma\_\{\\varepsilon\}^\{2\}\)andz∼𝒰\(−a,a\)z\\sim\\mathcal\{U\}\(\-a,a\)\. Here,𝒩\(μ,σ2\)\\mathcal\{N\}\(\\mu,\\sigma^\{2\}\)denotes the standard normal distribution with meanμ\\muand varianceσ2\\sigma^\{2\}, and𝒰\(−a,a\)\\mathcal\{U\}\(\-a,a\)denotes the uniform distribution on the range\[−a,a\]\[\-a,a\]\. From Bayes’ theorem:

p\(z∣x\)=p\(x∣z\)p\(z\)p\(x\)\.p\(z\\mid x\)=\\frac\{p\(x\\mid z\)\\,p\(z\)\}\{p\(x\)\}\.\(9\)For a givenzzwe havex∣z∼𝒩\(z2,σε2\)x\\mid z\\sim\\mathcal\{N\}\(z^\{2\},\\sigma\_\{\\varepsilon\}^\{2\}\)which is expanded to

p\(x∣z\)=12πσεexp⁡\(−\(x−z2\)22σε2\)\.p\(x\\mid z\)=\\frac\{1\}\{\\sqrt\{2\\pi\}\\,\\sigma\_\{\\varepsilon\}\}\\exp\\\!\\left\(\-\\frac\{\(x\-z^\{2\}\)^\{2\}\}\{2\\sigma\_\{\\varepsilon\}^\{2\}\}\\right\)\.The prior is uniform, so it contributes nozzdependence on the likelihood\. The unnormalized posterior is:

p\(z∣x\)∝p\(x∣z\)p\(z\)∝exp⁡\(−\(x−z2\)22σε2\)forz∈\[−a,a\]p\(z\\mid x\)\\propto p\(x\\mid z\)\\,p\(z\)\\propto\\exp\\\!\\left\(\-\\frac\{\(x\-z^\{2\}\)^\{2\}\}\{2\\sigma\_\{\\varepsilon\}^\{2\}\}\\right\)\\quad\\text\{for \}z\\in\[\-a,a\]\(10\)The factors12πσε\\frac\{1\}\{\\sqrt\{2\\pi\}\\sigma\_\{\\varepsilon\}\}and12a\\frac\{1\}\{2a\}are constants w\.r\.t\.zzand are absorbed into the normalization\. This leads to the unnormalized log\-posterior:

log⁡p~\(z∣x\)=−12\(x−z2σε\)2,forz∈\[−a,a\]\\log\\tilde\{p\}\(z\\mid x\)=\-\\frac\{1\}\{2\}\\left\(\\frac\{x\-z^\{2\}\}\{\\sigma\_\{\\varepsilon\}\}\\right\)^\{2\},\\quad\\text\{for \}z\\in\[\-a,a\]\(11\)
p\(x\)p\(x\)is intractable analytically:

p\(x\)=∫−aap\(x∣z\)p\(z\)dz=12a∫−aa12πσεexp⁡\(−\(x−z2\)22σε2\)dz,p\(x\)=\\int\_\{\-a\}^\{a\}p\(x\\mid z\)\\,p\(z\)\\,\\mathrm\{d\}z=\\frac\{1\}\{2a\}\\int\_\{\-a\}^\{a\}\\frac\{1\}\{\\sqrt\{2\\pi\}\\,\\sigma\_\{\\varepsilon\}\}\\exp\\\!\\left\(\-\\frac\{\(x\-z^\{2\}\)^\{2\}\}\{2\\sigma\_\{\\varepsilon\}^\{2\}\}\\right\)\\mathrm\{d\}z,\(12\)but constant with respect tozz\. We can evaluate the unnormalized posterior on a grid and normalize it numerically:

p\(z∣x\)=exp⁡\(log⁡p\)∫−aaexp⁡\(log⁡p\)dz\.p\(z\\mid x\)=\\frac\{\\exp\(\\log\{p\}\)\}\{\\int\_\{\-a\}^\{a\}\\exp\(\\log\{p\}\)\\,\\mathrm\{d\}z\}\.\(13\)We use[eq\.˜13](https://arxiv.org/html/2605.22891#A3.E13)as the truth posterior to compare against in Benchmark I, providing an exact ground\-truth posterior for every observation,xx\.

### C\.2Data generation and splits

We samplez∼𝒰\(−a,a\)z\\sim\\mathcal\{U\}\(\-a,a\)witha=5a=5, computey=z2\+εy=z^\{2\}\+\\varepsilonwithε∼𝒩\(0,σε2\)\\varepsilon\\sim\\mathcal\{N\}\(0,\\sigma\_\{\\varepsilon\}^\{2\}\)andσε=0\.5\\sigma\_\{\\varepsilon\}=0\.5, and split independently\-drawn samples into training \(50,000\), validation \(10,000\), and test \(10,000\) sets\. Splits are generated with fixed random seeds; because the prior and noise are sampled fresh for each set, train\-test contamination is impossible by construction\.

### C\.3Model architectures and training details

The model architectures used in Benchmark I are shown in[table˜3](https://arxiv.org/html/2605.22891#A3.T3)\. Reference implementations of all four models, training scripts, and the exact configuration files used to produce[table˜1](https://arxiv.org/html/2605.22891#S5.T1)and[fig\.˜1](https://arxiv.org/html/2605.22891#S5.F1)are available on request \([appendix˜E](https://arxiv.org/html/2605.22891#A5)\)\. All models are trained with AdamW\[[36](https://arxiv.org/html/2605.22891#bib.bib38)\]\(learning rate10−310^\{\-3\}, batch size512512\) on 50,000 samples for8080epochs to ensure all models are fully converged\. We do not employ early stopping, and we use cosine annealing learning rate scheduling\. We have trained on aNVIDIA A100 80GB PCIeGPU\. Training the MLP\-backbone models takes approximately one minute, while training the normalizing flow takes approximately 7 minutes\.

Table 3:Benchmark I model comparison\. All non\-flow models share a 4\-layer MLP backbone and differ only in output head and training objective\.##### Shared backbone

The regression, heteroscedastic regression, and MDN share a 4\-layer fully\-connected backbone with128128hidden units per layer, ReLU activations, and a single scalar inputxx\. Models differ only in the output head and training objective; the normalizing flow uses a different architecture entirely \(described below\)\.

##### MDN initialization

On the symmetric posterior ofx=z2\+εx=z^\{2\}\+\\varepsilon, MDNs with default random initialization frequently collapse to an effectively unimodal posterior – a known property of MDN training on symmetric multimodal targets\[[5](https://arxiv.org/html/2605.22891#bib.bib51)\]\. We break the symmetry by initializing the two component means atμ1=\+1\\mu\_\{1\}=\+1andμ2=−1\\mu\_\{2\}=\-1, with no other changes to architecture, optimizer, or training schedule\. This stabilizes training across all 5 seeds reported in[table˜1](https://arxiv.org/html/2605.22891#S5.T1)\. This instability further motivates using more expressive and stabler flows in Benchmark II\.

##### Conditional normalizing flow

We use a conditional normalizing flow built with thenormflowslibrary\[[50](https://arxiv.org/html/2605.22891#bib.bib52)\]\. The base distribution is a standard 1D Gaussian \(fixed mean and unit variance\)\. The flow stacks 4 transform blocks, each consisting of an autoregressive rational\-quadratic spline \(ARQS\) coupling\[[17](https://arxiv.org/html/2605.22891#bib.bib35)\]followed by an LU\-decomposed linear permutation\. Each spline transform has 2 hidden blocks with 64 hidden units conditioning on the scalar observationxx\(1 context channel\)\. Defaultnormflowshyperparameters are used for spline tail bounds and number of knots\.

### C\.4Evaluation details

The regression MLP model reports a point estimate per event, and CRPS reduces to MAE for this model\. For the remaining models, CRPS is calculated fromN=500N=500posterior samples per event using the sorting\-based estimator of[eq\.˜8](https://arxiv.org/html/2605.22891#A1.E8)\. The estimator bias from finiteNNis𝒪\(1/N\)≈0\.05%\\mathcal\{O\}\(1/N\)\\approx 0\.05\\%\[[20](https://arxiv.org/html/2605.22891#bib.bib53)\], well below the gaps reported in[table˜1](https://arxiv.org/html/2605.22891#S5.T1)\.

WithB=50B=50uniform bins over\[−5,5\]\[\-5,5\], expected counts under the true uniform prior are∼200\\sim 200per bin, validating the Gaussian approximationσb2≈nbtrue\\sigma\_\{b\}^\{2\}\\approx n^\{\\text\{true\}\}\_\{b\}\. The regression’s collapse toz≈0z\\approx 0concentrates predictions in a single bin, producing the dominant contribution toχspec2\\chi^\{2\}\_\{\\text\{spec\}\}\.

We apply conformal prediction as a model\-agnostic calibration diagnostic\. For the uncertainty\-aware models, the nonconformity score is the NLLs\(z,x\)=−log⁡pθ\(z∣x\)s\(z,x\)=\-\\log p\_\{\\theta\}\(z\\mid x\)\. The regression has no notion of nonconformity score that scales with input – a uniform residual acrossxxproduces calibrated coverage trivially via fixed\-width intervals\.

Given a calibration set of sizencaln\_\{\\text\{cal\}\}, the conformal threshold is then given as the empirical threshold on the calibration set corresponding to some mis\-coverage level,α\\alpha\(e\.g\.,0\.050\.05for95%95\\%confidence\)\[[1](https://arxiv.org/html/2605.22891#bib.bib30)\]:

q^1−α=the⌈\(1−α\)\(ncal\+1\)⌉\-th smallest value of\{si\}\.\\hat\{q\}\_\{1\-\\alpha\}=\\text\{the\}\\;\\lceil\(1\-\\alpha\)\(n\_\{\\text\{cal\}\}\+1\)\\rceil\\text\{\-th smallest value of\}\\;\\\{s\_\{i\}\\\}\.\(14\)A well\-calibrated model should produce a coverage curve that tracks the diagonal\.

Calibration alone is not enough – a model can be perfectly calibrated yet produce sets so wide that they are practically useless\. We therefore also report the average Lebesgue size of the conformal prediction set:

C^1−α\(y\)=\{z:s\(z,y\)≤q^1−α\}\\hat\{C\}\_\{1\-\\alpha\}\(y\)\\;=\\;\\\{\\,z:s\(z,y\)\\leq\\hat\{q\}\_\{1\-\\alpha\}\\,\\\}on a uniform grid of 1,000 points over\[−a,a\]\[\-a,a\]\. The reported width is the total Lebesgue measure \(sum of interval lengths for disjoint regions\)\.

### C\.5Additional Results

Table 4:Expanded version of[table˜1](https://arxiv.org/html/2605.22891#S5.T1)\. Synthetic benchmark metrics \(10,000 test events\)\. Values are mean±\\pmstandard deviation across 5 random seeds\. The regression wins on RMSE; the normalizing flow and MDN dominate all distributional metrics\. Averaging the normalizing flow posterior recovers the regression’s RMSE, CRPS, and spectral mismodeling\. Calibration only possible for uncertainty\-aware architectures\.χ2\\chi^\{2\}has 49 degrees of freedom; Best \(mean\) per column inbold\.∗Lowest by construction \([section˜3\.1](https://arxiv.org/html/2605.22891#S3.SS1)\); converged to constant prediction with negligible seed variance\.†One random posterior sample forχspec2/ndf\\chi^\{2\}\_\{\\text\{spec\}\}/\\text\{ndf\}\.‡Mixture density network needs manual initialization\.

![Refer to caption](https://arxiv.org/html/2605.22891v1/x3.png)Figure 3:Additional conditional posteriors for the toy model presented in[section˜5](https://arxiv.org/html/2605.22891#S5)\. We present the reconstructed posteriors for six differentx∗x^\{\*\}values similar to[fig\.˜1](https://arxiv.org/html/2605.22891#S5.F1)\(a\)\. Here, we also include the heteroscedastic regression model\.![Refer to caption](https://arxiv.org/html/2605.22891v1/x4.png)Figure 4:\(a\)Marginalzzover 10,000 test events\. Similar to[fig\.˜1](https://arxiv.org/html/2605.22891#S5.F1)\(b\), but we have included the marginal recovered by the heteroscedastic regression\.\(b\)The sensitivity of the CRPS score is measured by calculating it as a function of ensemble size,MM, for the distributional models\. The flow’s curve coincides with MDN’s\.![Refer to caption](https://arxiv.org/html/2605.22891v1/x5.png)Figure 5:Calibration diagnostics beyond marginal coverage\.\(a\)Mean conformal prediction\-set size at each nominal coverage level: all methods are calibrated \([fig\.˜1](https://arxiv.org/html/2605.22891#S5.F1)\(c\)\) but produce sets of very different widths, demonstrating that coverage and sharpness are independent diagnostics\.\(b\)Empirical coverage at the90%90\\%nominal level as a function of the observationxx, showing whether conditional calibration is uniform across phase space or hides regional deviations\.This subsection presents extended visualizations and a sensitivity analysis supporting the main\-text claims of[section˜5](https://arxiv.org/html/2605.22891#S5)\.

##### Per\-event posteriors across the unimodal\-to\-bimodal transition

[Figure˜3](https://arxiv.org/html/2605.22891#A3.F3)extends[fig\.˜1](https://arxiv.org/html/2605.22891#S5.F1)\(a\) by showing the recovered per\-event posteriors at six different observationsx∗x^\{\*\}spanning the unimodal regime \(x∗≈0x^\{\*\}\\approx 0, where the two modes merge\) through the strongly bimodal regime \(x∗≫1x^\{\*\}\\gg 1where modes are well separated at±x∗\\pm\\sqrt\{x^\{\*\}\}\)\. The flow and MDN track the analytic posterior across all regimes, including the smooth transition between regimes\. The heteroscedastic regression – shown here but not in the main text – collapses to a single Gaussian centered near zero with inflated variance whenever the true posterior is bimodal, illustrating the failure mode quantified by itsχspec2\\chi^\{2\}\_\{\\text\{spec\}\}in[table˜1](https://arxiv.org/html/2605.22891#S5.T1)\. Per\-event uncertainty is insufficient if the predicted family cannot represent the true posterior geometry\.

##### Marginal recovery including the heteroscedastic regression

[Figure˜4](https://arxiv.org/html/2605.22891#A3.F4)\(a\) extends[fig\.˜1](https://arxiv.org/html/2605.22891#S5.F1)\(b\) with the heteroscedastic regression’s marginal\. The flow and MDN marginals match the true uniform on\[−a,a\]\[\-a,a\]and the regression collapses to a spike at zero\. The heteroscedastic regression fall somewhere in between, and it overestimates the center and underestimates the tails\. This visually confirms the variance compression argument \([section˜3\.2](https://arxiv.org/html/2605.22891#S3.SS2)\) that adding per\-event uncertainty improves the marginal over a pure point estimator but cannot recover bimodal structure under a unimodal posterior assumption\.

##### CRPS sensitivity to posterior ensemble size

[Figure˜4](https://arxiv.org/html/2605.22891#A3.F4)\(b\) shows dataset\-averaged CRPS as a function of the number of posterior samplesMMused for its estimation, for each distributional model\. The estimator stabilizes well below the per\-method gaps reported in[table˜1](https://arxiv.org/html/2605.22891#S5.T1): byM≈100M\\approx 100all curves are within Monte Carlo noise of their asymptotic values, justifying the choice ofM=500M=500in the main text and confirming that method rankings are robust to ensemble size\. The estimator bias from finiteNNis𝒪\(1/N\)≈0\.05%\\mathcal\{O\}\(1/N\)\\approx 0\.05\\%\[[20](https://arxiv.org/html/2605.22891#bib.bib53)\]whenM=500M=500\.

##### Prediction\-set efficiency

[Figure˜5](https://arxiv.org/html/2605.22891#A3.F5)\(a\) shows the mean prediction\-set size as a function of nominal coverage1−α1\-\\alphafor each model calibrated with conformal prediction\. All three uncertainty\-aware methods achieve nominal coverage \(see[fig\.˜1](https://arxiv.org/html/2605.22891#S5.F1)\(c\)\), but their prediction sets differ substantially in size\. At90%90\\%nominal coverage, the heteroscedastic regression’s sets span more than half of the full prior support\[−a,a\]\[\-a,a\]and convey almost no information beyond the prior\. In contrast, the MDN and flow produce sets a few times narrower, concentrated as two disjoint intervals around the true modes±x\\pm\\sqrt\{x\}\([fig\.˜3](https://arxiv.org/html/2605.22891#A3.F3)\)\. This quantifies the claim of[section˜4](https://arxiv.org/html/2605.22891#S4)that coverage alone is an insufficient calibration diagnostic: a model can be perfectly calibrated and yet uninformative, and sharpness must be reported alongside coverage to distinguish and rank the two\.

##### Conditional calibration across the posterior\-geometry transition

[Figure˜5](https://arxiv.org/html/2605.22891#A3.F5)\(b\) shows empirical coverage at the90%90\\%nominal level as a function of the observationxx, probing whether marginal calibration hides regions of systematic over\- or undercoverage\. This addresses the recommendation of[section˜4\.3](https://arxiv.org/html/2605.22891#S4.SS3)that calibration should be assessed conditionally in bins of phase space rather than only marginally\. In addition to marginal calibration, we show the conditional calibration in[fig\.˜5](https://arxiv.org/html/2605.22891#A3.F5)\(b\) at90%90\\%\. The heteroscedastic model achieves marginal calibration by overcovering for smallxxand undercovering for largexx\. Only conditional calibration makes this visible\. The MDN and flow slightly undercover in the unimodal regime and slightly overcover in the strongly bimodal regime\.

## Appendix DBenchmark II: Dileptonic Top\-Quark Reconstruction

This appendix provides architecture, training, and evaluation details for the physics benchmark of[section˜6](https://arxiv.org/html/2605.22891#S6), along with extended results supporting the main\-text claims\.

### D\.1Physics background

In this work, we evaluate our method on the reconstruction of top quark pairs\(tt¯\)\\left\(t\\bar\{t\}\\right\)produced in proton\-proton collisions\. The top quark is the heaviest known elementary particle, and its precise reconstruction is a cornerstone of the Large Hadron Collider \(LHC\) physics program\.

#### D\.1\.1Dileptonic Decay Topology

Top quarks decay almost exclusively into aWWboson and a bottom\(b\)\\left\(b\\right\)quark\. When bothWWbosons from att¯\{t\\bar\{t\}\}pair decay into a charged lepton\(ℓ=e,μ\)\\left\(\\ell=e,\\mu\\right\)and a neutrino\(ν\)\\left\(\\nu\\right\), the process is called thedileptonic decay channel\. The full decay chain is represented as:

pp→tt¯→\(W\+b\)\(W−b¯\)→\(ℓ\+νℓb\)\(ℓ−ν¯ℓb¯\)\.pp\\to t\\bar\{t\}\\to\\left\(W^\{\+\}b\\right\)\\big\(W^\{\-\}\\bar\{b\}\\big\)\\to\\big\(\\ell^\{\+\}\\nu\_\{\\ell\}b\\big\)\\big\(\\ell^\{\-\}\\bar\{\\nu\}\_\{\\ell\}\\bar\{b\}\\big\)\.\(15\)The experimental signature in the detector consists of two high\-momentum leptons, two jets originating from thebb\-quarks \(bb\-jets\), and significant Missing Transverse Energy\(E→Tmissor MET\)\\left\(\\vec\{E\}\_\{T\}^\{miss\}\\;\\text\{or MET\}\\right\)due to the undetected neutrinos\.

From an algorithmic perspective, the reconstruction of the parent top quark four\-momenta presents three primary challenges:

1. 1\.Underdetermined Kinematics:While the detector measures the transverse components of the sum of the neutrino momenta\(E→Tmiss\)\\left\(\\vec\{E\}\_\{T\}^\{miss\}\\right\), the individual longitudinal momenta\(pz\)\\left\(p\_\{z\}\\right\)and the specific distribution of transverse momentum between the two neutrinos are unknown\. This results in a system with six unknown degrees of freedom \(the three\-momentum components for each neutrino\) but only two direct experimental constraints\. Even when assuming the known masses of theWWboson and the top quark as constraints, the system often remains underdetermined or yields multiple algebraic solutions\.
2. 2\.Combinatorial Ambiguity:In a standard event, the detector identifies twobb\-jets, but it is not a priori known which jet originated from the top quark and which from the anti\-top quark\. Fornnadditional light\-flavor jets in the event, the number of possible permutations for the final\-state assignment grows factorially, creating a complex assignment problem\.
3. 3\.Detector Resolution and Noise:The measured momenta of jets and theE→Tmiss\\vec\{E\}\_\{T\}^\{miss\}are subject to experimental uncertainties and resolution effects\. Traditional analytical “kinematic fitting” methods often fail when the measured values fluctuate such that no physical solution exists for the mass constraints\.

### D\.2Dataset details

We use the public Delphes\[[16](https://arxiv.org/html/2605.22891#bib.bib39)\]Monte Carlo dataset ofRaineet al\.\[[42](https://arxiv.org/html/2605.22891#bib.bib40)\]\(Zenodo DOI[10\.5281/zenodo\.8113516](https://doi.org/10.5281/zenodo.8113516)\), released under CC\-BY 4\.0\. The dataset comprises∼\\sim950k simulated dileptonictt¯\{t\\bar\{t\}\}events with realistic detector effects via Delphes\. Event generation, selection, and detector simulation details are in the dataset release andRaineet al\.\[[43](https://arxiv.org/html/2605.22891#bib.bib9)\]\. We use the dataset’s provided train/test splits without modification and we only consider the events generated with MadGraph\.

The dataset was originally constructed for neutrino reconstruction, with bundled targets \(calledneutrinos\) corresponding to the two neutrino four\-momenta\. For top\-quark reconstruction we instead construct the target vectors ourselves from the dataset keytruth\_particles\. The latent target is two four\-vectors per event\. The observables per event comprise up to six four\-vector jets, two lepton four\-vectors, a missing\-transverse\-energy vector \(pTp\_\{T\}andϕ\\phi\), and a boolean array indicating whether jets are b\-tagged \(whether a reconstructed jet is*likely*originating from a bottom quark\)\.

### D\.3Observable definitions

Once the four\-momenta of the top quark\(pt\)\\left\(p\_\{t\}\\right\)and the anti\-top quark\(pt¯\)\\left\(p\_\{\\bar\{t\}\}\\right\)are reconstructed and the inverse problem is solved, several high\-level observables can be computed for the actual physical study of the process\. These variables are critical for testing Standard Model predictions and searching for new physics\. The distributions over the observables we use here, vary from low\-variance and unimodal\(pTtt¯\)\\left\(p\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\}\\right\)to higher\-variance and multimodal\(Δϕ\(ℓhel,t¯tt¯\)\)\\left\(\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)\\right\)\.

#### D\.3\.1Global Kinematic Variables

These observables describe the overall scale and motion of thett¯\{t\\bar\{t\}\}system\.

- •Invariant Mass\(mtt¯\)\\left\(m\_\{\{t\\bar\{t\}\}\}\\right\):Defined as the Lorentzian norm of the combined four\-momentum,mtt¯=\(pt\+pt¯\)2m\_\{\{t\\bar\{t\}\}\}=\\sqrt\{\(p\_\{t\}\+p\_\{\\bar\{t\}\}\)^\{2\}\}\. It represents the total energy scale of thett¯\{t\\bar\{t\}\}production\. In the context of machine learning, this is often the most important target for reconstruction, as local deviations in themtt¯m\_\{\{t\\bar\{t\}\}\}spectrum could indicate the existence of heavy resonant particles \(e\.g\., a heavy Higgs boson\) decaying into top pairs\.
- •Transverse Momentum\(pTtt¯\)\\left\(p\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\}\\right\):The magnitude of the vector sum of the transverse momenta,pTtt¯=\|p→T,t\+p→T,t¯\|p\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\}=\|\\vec\{p\}\_\{T,t\}\+\\vec\{p\}\_\{T,\\bar\{t\}\}\|\. In a simple leading\-order collision, the top quarks are produced back\-to\-back in the transverse plane, leading topTtt¯≈0p\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\}\\approx 0\. A non\-zero value typically indicates the presence of "Initial State Radiation" \(ISR\), where the incoming partons radiate gluons before colliding\.

#### D\.3\.2Angular and Spin\-Sensitive Observables

These variables describe the relative geometry of the decay products and are sensitive to the quantum state \(spin\) of the quarks\.

- •Azimuthal Angle Difference\(Δϕ\(ℓhel,t¯tt¯\)\)\\left\(\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)\\right\):The angle between the lepton and the anti\-top quark in the plane transverse to the beam pipe in the helicity frame of thett¯\{t\\bar\{t\}\}system,Δϕ\(ℓhel,t¯tt¯\)=\|ϕℓhel−ϕt¯tt¯\|\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)=\|\\phi\_\{\\ell\_\{\\text\{hel\}\}\}\-\\phi\_\{\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\}\|\. This metric is a close proxy for the spin correlation between the top and the anti\-top quarks in a pair system, see the next paragraph forchelc\_\{\\mathrm\{hel\}\}\.
- •Spin Correlation Variable\(chel\)\\left\(c\_\{\\mathrm\{hel\}\}\\right\): The spin correlation observablechelc\_\{\\mathrm\{hel\}\}is computed from the scalar product of the lepton momentum vectors in the rest frames of their respective parent top quarks\. It is closely related toΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)and the equivalent angle between the lepton and its parent top quark\. It is a suitable variable to distinguish the spin state of the top quark pair and to search for top quark pairs which were produced by heavier \(pseudo\-\)scalar particles\. To computechelc\_\{\\mathrm\{hel\}\}, one has to boost the lepton momenta into thett¯\{t\\bar\{t\}\}rest frame and then into the parent top quark frames\. In this frame,chelc\_\{\\mathrm\{hel\}\}is chel=ℓ1^⋅ℓ2^c\_\{\\mathrm\{hel\}\}=\\hat\{\\ell^\{1\}\}\\cdot\\hat\{\\ell^\{2\}\}\(16\)

### D\.4Forward model

The forward model maps generator\-level top\-quark four\-momenta𝒛\\bm\{z\}to detector\-level observations𝒙\\bm\{x\}through three stochastic stages:*\(i\)*hard\-scattering matrix\-element generation,*\(ii\)*parton showering and hadronization, and*\(iii\)*detector response\. TheRaineet al\.\[[42](https://arxiv.org/html/2605.22891#bib.bib40)\]dataset uses MadGraph \(v3\.10\) for the hard process, Pythia8 \(v8\.243\) for showering, and Delphes \(v3\.4\.2\) for detector simulation\. The full configurations are also found in the dataset release\[[42](https://arxiv.org/html/2605.22891#bib.bib40)\]\.

Two sources of information loss make the inverse problem under\-constrained\. First, the two neutrinos in dileptonictt¯t\\bar\{t\}decays escape detection entirely and only their summed transverse component is observable as missing transverse momentum\(E→Tmiss\)\\left\(\\vec\{E\}\_\{T\}^\{miss\}\\right\), which provides two constraints in place of the eight neutrino degrees of freedom\. Second, parton\-jet matching is combinatorial: the two reconstructed bottom\-jets can each be associated with either the top or antitop, with no detector\-level information distinguishing the two assignments\.

Detector effects \(energy smearing, angular resolution, jet reconstruction efficiency, bottom\-tagging etc\.\) introduce additional measurement noise on top of the structural under\-determination\.

### D\.5Model architectures

Architectural and training hyperparameters for all four methods are summarized in[table˜5](https://arxiv.org/html/2605.22891#A4.T5)and[table˜6](https://arxiv.org/html/2605.22891#A4.T6)\. This section provides supplementary detail and design rationale where relevant\.

##### Analytic solver

We use the algebraic on\-shell\-mass solver ofSonnenschein \[[48](https://arxiv.org/html/2605.22891#bib.bib4)\]\. Our Python implementation is available from the corresponding author on request\. The solver returns up to 4 discrete neutrino\-momentum solutions per event satisfying the on\-shell mass constraints\. We select the solution per event withmtt¯m\_\{\{t\\bar\{t\}\}\}closest to2mt2m\_\{t\}\. The solver returns no solution for∼10%\\sim 10\\%of events and these events are excluded from all metrics and spectra for fair comparison \([section˜D\.7](https://arxiv.org/html/2605.22891#A4.SS7)\)\.

##### Transformer regression

The transformer regression is inspired byCMS Collaboration \[[11](https://arxiv.org/html/2605.22891#bib.bib26)\], predicting the top and antitop four\-momenta directly from the input tokens\. The pure\-MSE variant uses standard squared\-error loss; the MSE \+ MMD variant adds two MMD penalties on themtt¯m\_\{t\\bar\{t\}\}andpTtt¯p\_\{T\}^\{t\\bar\{t\}\}marginals \([table˜6](https://arxiv.org/html/2605.22891#A4.T6), “MMD regularization” block\)\. We attempted MMD penalties onchelc\_\{\\mathrm\{hel\}\}andΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)across multiple bandwidth and weight configurations and in all cases the MMD term stalled during training without meaningful decrease while MSE continued to improve normally\. Marginal regularization requires moving conditional\-mean predictions into vanishing per\-event posterior density, fighting the MSE objective rather than complementing it when the problem is multimodal\. The two losses are structurally incompatible on multimodal observables\.

##### Discrete normalizing flow

The discrete flow is inspired by theν2\\nu^\{2\}\-flow architecture\[[43](https://arxiv.org/html/2605.22891#bib.bib9)\]: a transformer condition encoder feeds into a stack of conditional ARQS coupling layers with LU\-linear permutations\. Targets are normalized via an iterative mean\-variance layer before flow training, with the corresponding Jacobian included in the likelihood evaluation\. The base distribution is standard Gaussian\.

##### Continuous normalizing flow

The continuous flow uses the conditional flow\-matching objective\[[35](https://arxiv.org/html/2605.22891#bib.bib44)\]\. The vector fieldvθ\(𝒛t,t,𝒙\)v\_\{\\theta\}\(\\bm\{z\}\_\{t\},t,\\bm\{x\}\)is parameterized as an encoder\-decoder transformer\. An encoder processes the conditioning tokens and a decoder cross\-attends to the encoded condition while conditioning on the integration time via a Gaussian Fourier embedding \([table˜6](https://arxiv.org/html/2605.22891#A4.T6)\)\. Sampling and likelihood evaluation use ODE integration and details are listed in the same table\. Our transformer adapts baseline\-transformer building blocks from the L\-GATr codebase\[[7](https://arxiv.org/html/2605.22891#bib.bib55),[49](https://arxiv.org/html/2605.22891#bib.bib56),[8](https://arxiv.org/html/2605.22891#bib.bib57)\]\.

### D\.6Training details

Table 5:Benchmark II model architectures\. All learned models share the same input representation and training data; they differ in encoder backbone, output head, and training objective\.∗Trained with conditional flow matching \(CFM\)\[[35](https://arxiv.org/html/2605.22891#bib.bib44)\]\.

Table 6:Benchmark II training hyperparameters\. Three architectures: a transformer encoder for the MSE/MMD regressions, an encoder\-decoder vector field for the continuous flow, and aν2\\nu^\{2\}\-flow\-style\[[43](https://arxiv.org/html/2605.22891#bib.bib9)\]architecture combining a transformer condition encoder with a stack of RQS coupling layers for the discrete flow\.∗MMD penalties onchelc\_\{\\mathrm\{hel\}\}andΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)were attempted but did not converge \(see[section˜D\.5](https://arxiv.org/html/2605.22891#A4.SS5)\);†PID is a one\-hot\-style 6D particle ID encoding of charge and particle type\.

The models used in Benchmark II are shown in table[table˜5](https://arxiv.org/html/2605.22891#A4.T5)\. Reference implementations of all four models, training scripts, and the exact configuration files used to produce[table˜2](https://arxiv.org/html/2605.22891#S6.T2)and[fig\.˜2](https://arxiv.org/html/2605.22891#S6.F2)are available on request \([appendix˜E](https://arxiv.org/html/2605.22891#A5)\)\. All models are trained with AdamW\[[36](https://arxiv.org/html/2605.22891#bib.bib38)\]with hyperparameters shown in[table˜6](https://arxiv.org/html/2605.22891#A4.T6)\.

### D\.7Evaluation details

The analytic solver returns a solution for∼90%\\sim 90\\%of events; we restrict marginal\-spectrum and metric computation to its solvable subset for fair comparison\. The regression and flow\-based models reconstruct all events \(100%100\\%efficiency\), and their metrics on the full test set differ by only a small amount\. For the flows we drawN=500N=500posterior samples per event: CRPS uses the full ensemble;χspec2\\chi^\{2\}\_\{\\text\{spec\}\}uses one random draw per event\. Histograms useB=100B=100bins \(only bins with nonzero expected content enter in computation\) for calculatingχspec2\\chi^\{2\}\_\{\\text\{spec\}\}\. Conformal calibration usesncal=1,000n\_\{\\text\{cal\}\}=1\{,\}000events from the test set, with the remaining events for evaluation\. We use the per\-event NLL as nonconformity score for both flows\.

##### CRPS sensitivity to posterior ensemble size

![Refer to caption](https://arxiv.org/html/2605.22891v1/x6.png)Figure 6:The sensitivity of the CRPS score overΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)is measured by calculating it as a function of ensemble size,MM, for the flow\-based models\.Based on the discussion in[section˜C\.5](https://arxiv.org/html/2605.22891#A3.SS5)we also include a plot showing dataset\-averaged CRPS as a function of the number of posterior samplesMMused for its estimation in[fig\.˜6](https://arxiv.org/html/2605.22891#A4.F6)\. The estimator stabilizes well below the per\-method gaps reported in[table˜2](https://arxiv.org/html/2605.22891#S6.T2): byM≈100M\\approx 100all curves are within Monte Carlo noise of their asymptotic values, justifying the choice ofM=500M=500in the main text and confirming that method rankings are robust to ensemble size\. The estimator bias from finiteNNis𝒪\(1/N\)≈0\.05%\\mathcal\{O\}\(1/N\)\\approx 0\.05\\%\[[20](https://arxiv.org/html/2605.22891#bib.bib53)\]whenM=500M=500\.

##### Robustness across flow configurations

Table 7:Robustness across four discrete\-flow implementations sharing architecture and hyperparameters but differing in likelihood implementation\. Values are mean±\\pmstandard deviation on the test set\. Calibration is omitted because it depends on the absolute log\-likelihood, which is not comparable across implementations\.Compute cost precluded multi\-seed training per architecture\. As an alternative, we trained four discrete\-flow variants sharing the same architecture and hyperparameters but differing in how the change\-of\-variables Jacobian is computed\. Because absolute log\-likelihoods are not comparable across these implementations, we restrict the robustness check to implementation\-invariant metrics: RMSE, CRPS, andχspec2\\chi^\{2\}\_\{\\text\{spec\}\}\([table˜7](https://arxiv.org/html/2605.22891#A4.T7)\)\. On CRPS andχspec2\\chi^\{2\}\_\{\\text\{spec\}\}, the uncertainties are two to three orders of magnitude smaller than the inter\-method gaps to the transformer baselines in[table˜8](https://arxiv.org/html/2605.22891#A4.T8): forΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)the CRPS spread is roughly1%1\\%and theχspec2\\chi^\{2\}\_\{\\text\{spec\}\}spread0\.2%0\.2\\%of the corresponding gaps\. RMSE uncertainties are larger in relative terms and reach the same order as some inter\-method gaps\. We interpret this as bounding implementation noise on the metrics that drive the ranking flip of[section˜6](https://arxiv.org/html/2605.22891#S6)well below the effect sizes themselves\. This is a different constraint from seed variance for a fixed implementation: a multi\-seed study remains future work\.

### D\.8Results across all four observables

Table 8:Benchmark II: methods evaluated under pointwise \(RMSE\) and distributional \(CRPS,χspec2\\chi^\{2\}\_\{\\text\{spec\}\}\) metrics across four observables \(ordered by increasing per\-event posterior multimodality\)\. Averaging the flow posterior*in observable space*per event recovers RMSE substantially better than any latent\-space regression, demonstrating the Jensen\-gap argument of[section˜D\.8\.1](https://arxiv.org/html/2605.22891#A4.SS8.SSS1): the MSE\-optimal observable estimator is the conditional mean of the observable, not the observable of the conditional mean\. Theχ2\\chi^\{2\}forpTtt¯p\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\}andmtt¯m\_\{\{t\\bar\{t\}\}\}is biased since the Gaussian approximation breaks down in the heavy tails \(Baker\-Cousins\[[3](https://arxiv.org/html/2605.22891#bib.bib54)\]would be more appropriate, but within\-method comparisons remain valid\)\. Best per column inbold![Refer to caption](https://arxiv.org/html/2605.22891v1/x7.png)Figure 7:Marginal reconstructed spectra forpTtt¯p\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\},mtt¯m\_\{\{t\\bar\{t\}\}\},chelc\_\{\\mathrm\{hel\}\}, andΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)\(in order of increasing multimodality\)\. The dashed line shows the truth marginal and the colored curves show each method’s reconstructed marginal\. Quantitative comparison viaχspec2\\chi^\{2\}\_\{\\text\{spec\}\}is in[table˜8](https://arxiv.org/html/2605.22891#A4.T8)\.The main text focuses onΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)as a strongly multimodal observable where our evaluation protocol has the highest impact\. This section presents the full evaluation across all four observables \(pTtt¯p\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\},mtt¯m\_\{\{t\\bar\{t\}\}\},chelc\_\{\\mathrm\{hel\}\},Δϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\), ordered by increasing multimodality of the per\-event posterior\)\.[Table˜8](https://arxiv.org/html/2605.22891#A4.T8)shows that the ranking flip described in[section˜6](https://arxiv.org/html/2605.22891#S6)is consistent across all observables: the pure\-MSE transformer wins or near\-wins RMSE everywhere disregarding posterior means \(see[section˜D\.8\.1](https://arxiv.org/html/2605.22891#A4.SS8.SSS1)\), while the flows dominate CRPS andχspec2\\chi^\{2\}\_\{\\text\{spec\}\}\.

##### The MMD penalty pattern

The MMD\-regularized transformer reveals the mechanism predicted in[section˜3\.2](https://arxiv.org/html/2605.22891#S3.SS2)\. OnpTtt¯p\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\}, where the per\-event posterior is predominantly unimodal, the MMD penalty achievesχspec2/ndf\\chi^\{2\}\_\{\\text\{spec\}\}/\\text\{ndf\}, comparable to the flows\. On the multimodalΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\), the same penalty producesχspec2/ndf\\chi^\{2\}\_\{\\text\{spec\}\}/\\text\{ndf\}in the thousands\. This asymmetry is the empirical signature of the law of total variance: marginal regularization can compensate global distortion when within\-posterior variance is small, but cannot correct per\-event multimodality when within\-posterior variance is large\. The intermediate observables \(mtt¯m\_\{\{t\\bar\{t\}\}\},chelc\_\{\\mathrm\{hel\}\}\) show intermediate behavior, with MMD reducing but not closing the gap to the flows\.

##### Marginal spectra

[Figure˜7](https://arxiv.org/html/2605.22891#A4.F7)visualizes the same conclusion\. The MSE transformer compresses the marginal of every observable, with the distortion most visually severe forΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)\. The MMD transformer recovers the unimodalpTtt¯p\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\}andmtt¯m\_\{\{t\\bar\{t\}\}\}marginals, where it was specifically tuned \([section˜D\.5](https://arxiv.org/html/2605.22891#A4.SS5)\), deviates noticeably forchelc\_\{\\mathrm\{hel\}\}, and fails onΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)\. Both flows track all four marginals closely, with no observable\-specific tuning\.

##### Variance compression of spectra

The variance compression of[section˜3\.2](https://arxiv.org/html/2605.22891#S3.SS2)is visually clearest onpTtt¯p\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\}in[fig\.˜7](https://arxiv.org/html/2605.22891#A4.F7)\(a\): the MSE transformer’s marginal is sharper at the peak and lower in the tail than the truth, with the peak itself shifted upward because the right\-tailed underlying distribution pulls conditional means toward higherpTtt¯p\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\}values\. The same pattern is also more pronounced inmtt¯m\_\{\{t\\bar\{t\}\}\}in[fig\.˜7](https://arxiv.org/html/2605.22891#A4.F7)\(b\)\.

##### Caveat onχspec2\\chi^\{2\}\_\{\\text\{spec\}\}for heavy\-tailed observables

Theχspec2\\chi^\{2\}\_\{\\text\{spec\}\}values formtt¯m\_\{\{t\\bar\{t\}\}\}andpTtt¯p\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\}reported in[table˜8](https://arxiv.org/html/2605.22891#A4.T8)are systematically biased\. Both observables have heavy tails with bins where the Poisson approximationσb2≈nbtrue\\sigma\_\{b\}^\{2\}\\approx n\_\{b\}^\{\\text\{true\}\}underestimates the true uncertainty, inflatingχ2\\chi^\{2\}\. A proper goodness\-of\-fit assessment for tail\-heavy spectra uses the Baker\-Cousins likelihood\-ratio statistic\[[3](https://arxiv.org/html/2605.22891#bib.bib54)\], which we leave for future work\. Within\-method comparisons remain valid because the bias affects all methods equally on a given observable\.

#### D\.8\.1Jensen gap between latent and observable estimators

The conditional\-mean pathology of[section˜3\.1](https://arxiv.org/html/2605.22891#S3.SS1)has a basis\-dependent refinement when downstream quantities are nonlinear functions of the latent\. Let𝒪:ℝDz→ℝ\\mathcal\{O\}:\\mathbb\{R\}^\{D\_\{z\}\}\\to\\mathbb\{R\}be a physics observable computed from the latent four\-vector𝒛\\bm\{z\}\. The MSE\-optimal estimator for𝒪\(𝒛\)\\mathcal\{O\}\(\\bm\{z\}\)given observations𝒙\\bm\{x\}is

𝒪^∗\(𝒙\)=𝔼\[𝒪\(𝒛\)∣𝒙\]=∫𝒪\(𝒛\)p\(𝒛∣𝒙\)𝑑𝒛\.\\hat\{\\mathcal\{O\}\}^\{\*\}\(\\bm\{x\}\)=\\mathbb\{E\}\[\\mathcal\{O\}\(\\bm\{z\}\)\\mid\\bm\{x\}\]=\\int\\mathcal\{O\}\(\\bm\{z\}\)\\,p\(\\bm\{z\}\\mid\\bm\{x\}\)\\,d\\bm\{z\}\.\(17\)A regression trained on the latent under squared\-error loss instead converges to𝒛^\(𝒙\)=𝔼\[𝒛∣𝒙\]\\bm\{\\hat\{z\}\}\(\\bm\{x\}\)=\\mathbb\{E\}\[\\bm\{z\}\\mid\\bm\{x\}\]and yields the prediction𝒪\(𝔼\[𝒛∣𝒙\]\)\\mathcal\{O\}\(\\mathbb\{E\}\[\\bm\{z\}\\mid\\bm\{x\}\]\)for the observable\. By Jensen’s inequality,

𝒪\(𝔼\[𝒛∣𝒙\]\)≠𝔼\[𝒪\(𝒛\)∣𝒙\]\\mathcal\{O\}\(\\mathbb\{E\}\[\\bm\{z\}\\mid\\bm\{x\}\]\)\\neq\\mathbb\{E\}\[\\mathcal\{O\}\(\\bm\{z\}\)\\mid\\bm\{x\}\]\(18\)in general, with the gap vanishing only when𝒪\\mathcal\{O\}is linear or the posteriorp\(𝒛∣𝒙\)p\(\\bm\{z\}\\mid\\bm\{x\}\)is degenerate\. The regression is therefore not MSE\-optimal for the observable even setting aside the multimodality issues of[section˜3\.2](https://arxiv.org/html/2605.22891#S3.SS2)\.

Generative methods, by contrast, recover the MSE\-optimal observable estimator at evaluation time: given posterior samples𝒛\(k\)∼p\(𝒛∣𝒙\)\\bm\{z\}^\{\(k\)\}\\sim p\(\\bm\{z\}\\mid\\bm\{x\}\), the Monte Carlo average1N∑k𝒪\(𝒛\(k\)\)\\frac\{1\}\{N\}\\sum\_\{k\}\\mathcal\{O\}\(\\bm\{z\}^\{\(k\)\}\)converges to𝔼\[𝒪\(𝒛\)∣𝒙\]\\mathbb\{E\}\[\\mathcal\{O\}\(\\bm\{z\}\)\\mid\\bm\{x\}\]\. This explains the systematic RMSE advantage of the “posterior mean” rows in[table˜8](https://arxiv.org/html/2605.22891#A4.T8)\. Across all four observables, both flow posterior means achieve RMSE substantially below all latent\-space regression baselines, despite the regressions being trained directly to minimize four\-vector MSE\. The Jensen gap for these observables is large because top\-quark posteriors are wide and the observable maps are strongly nonlinear \(squared invariant masses, transverse\-plane angles\)\.

This argument is independent of the multimodality argument of[section˜3\.2](https://arxiv.org/html/2605.22891#S3.SS2): even for unimodal posteriors with substantial variance, the Jensen gap persists\. It provides a second, independent reason to prefer methods that preserve the per\-event posterior over those that collapse it to a four\-vector point estimate\. The argument also highlights why the basis choice for any latent state regression directly influences the reconstruction RMSE on observables\.

### D\.9Per\-event posterior structure across observables

![Refer to caption](https://arxiv.org/html/2605.22891v1/x8.png)Figure 8:Reconstructed posteriors \(filled curves\) and point estimates \(vertical lines\) for three randomly selected test events across the four observablespTtt¯p\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\},mtt¯m\_\{\{t\\bar\{t\}\}\},chelc\_\{\\mathrm\{hel\}\}, andΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)\. True targets are shown as dashed black lines\. Event 135 \(bottom row ofΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)column\) is the event displayed in[fig\.˜2](https://arxiv.org/html/2605.22891#S6.F2)\(a\)\.[Figure˜8](https://arxiv.org/html/2605.22891#A4.F8)extends[fig\.˜2](https://arxiv.org/html/2605.22891#S6.F2)\(a\) by showing per\-event reconstructed posteriors for three randomly selected test events across all four observables\. The same per\-event collapse pattern observed forΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)in the main text persists across all observables: both transformer regressions concentrate near the per\-event posterior mean, while the flows recover the underlying multimodal or asymmetric structure where present\. The contrast is most visually striking onΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)\(strongly multimodal\)\. OnpTtt¯p\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\}andmtt¯m\_\{\{t\\bar\{t\}\}\}the posteriors are predominantly unimodal but skewed, and the regression’s bias is correspondingly milder\. Event 135 \(bottom\-right in theΔϕ\(ℓhel,t¯tt¯\)\\Delta\\phi\(\\ell\_\{\\mathrm\{hel\}\},\\,\\bar\{t\}\_\{\{t\\bar\{t\}\}\}\)row\) is the same event shown in[fig\.˜2](https://arxiv.org/html/2605.22891#S6.F2)\(a\)\. Across all events, the analytic solver returns at most a single point estimate per event and fails to return any solution for Event 0\.

### D\.10Conditional calibration reveals architecture and data limitations

![Refer to caption](https://arxiv.org/html/2605.22891v1/x9.png)Figure 9:Empirical coverage at the90%90\\%nominal level as a function of the true value of each observable\. Shaded bands show binomial uncertainty per bin\. The rightmost bins forpTtt¯p\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\}andmtt¯m\_\{\{t\\bar\{t\}\}\}have low statistics as indicated by the uncertainty bands\.[Section˜4\.3](https://arxiv.org/html/2605.22891#S4.SS3)recommends assessing calibration conditionally – in bins of phase space – to identify regions where uncertainty estimates fail despite acceptable global coverage\.[Figure˜9](https://arxiv.org/html/2605.22891#A4.F9)applies this diagnostic to the two flows, plotting empirical coverage at the90%90\\%nominal level as a function of each observable\. Both flows undercover throughout, but the discrete flow tracks the nominal level substantially more closely than the continuous flow across all four observables, consistent with the marginal coverage curves of[fig\.˜2](https://arxiv.org/html/2605.22891#S6.F2)\(c\)\. The continuous flow’s undercoverage is approximately uniform in observable value, suggesting a global density\-evaluation issue \(likely from approximate ODE\-based likelihood; see[section˜6](https://arxiv.org/html/2605.22891#S6)\) rather than a phase\-space\-specific failure\. The exception is the high\-pTtt¯p\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\}region \(pTtt¯≳450GeVp\_\{\\mathrm\{T\}\}^\{\{t\\bar\{t\}\}\}\\gtrsim 450\\,\\text\{GeV\}\), where both flows undercover more severely\. This regime is sparsely populated in the training data, and the degraded calibration likely reflects extrapolation rather than a model failure\. Conditional calibration thus surfaces both a global architectural difference \(continuous vs discrete\) and a phase\-space\-specific data limitation, neither of which is visible in the global coverage curve alone\.

## Appendix EReproducibility

This appendix documents the artifacts released with the paper\. Beyond reproducing the reported experiments, the released evaluation pipeline and protocol documentation are intended to let practitioners apply the three\-axis protocol to their own supervised reconstruction models, on benchmarks beyond the two studied here\.

### E\.1Code repository and structure

The code accompanying this paper is available from the corresponding author on request\. The repository contains the implementations of all learned models in Benchmark II, the training and evaluation pipelines, the reference implementations of the three protocol metrics, and the analysis notebooks used to produce every table and figure in[sections˜5](https://arxiv.org/html/2605.22891#S5)and[6](https://arxiv.org/html/2605.22891#S6)\. Compute requirements, training times, and hyperparameters are summarised in[sections˜C\.3](https://arxiv.org/html/2605.22891#A3.SS3)and[D\.6](https://arxiv.org/html/2605.22891#A4.SS6)\. Executed notebooks and the pre\-computed evaluation arrays underlying every Benchmark II table and figure are available from the corresponding author on request\.

##### Repository layout

- •README\.mdA more practical oriented description on how the repository is set up, how to install the virtual environment, and how to run different parts of the code base\.
- •src/top\_reco/– the core Python package for the studies presented in Benchmark II\. Themodels/subpackage contains the model implementations;losses/,tasks/anddata/provide the training objectives, the dataset wiring and the Delphes\-format data loader\. Themetrics/subpackage provides reference implementations of CRPS \([section˜4\.1](https://arxiv.org/html/2605.22891#S4.SS1)\), spectrum fidelity \([section˜4\.2](https://arxiv.org/html/2605.22891#S4.SS2)\), and the conformal calibration diagnostic \([section˜4\.3](https://arxiv.org/html/2605.22891#S4.SS3)\), each independently testable and importable asfrom top\_reco\.metrics import …\.
- •configs/– The model configuration tree with hyperparameters for each model\.
- •scripts/–train\.py\(entry point for training a model\),evaluate\.py\(produces the per\-event metric arrays\),posterior\_sampling\.py\(produces the dense posterior samples used for the per\-event posterior figures\)\.
- •notebooks/–benchmark1\_synthetic\.ipynbis end\-to\-end and self\-contained \(no external data\)\. It produces all tables and figures related to Benchmark I\.benchmark2\_plotting\.ipynbreads pre\-computed evaluation arrays fromdata/benchmark2/and produces all tables and figures related to Benchmark II\.
- •tests/– unit tests for thecoordinates/andmetrics/modules\.
- •docs/– Sphinx\-built API reference for the metrics module, generated from the same docstrings that document the formulae of[section˜4](https://arxiv.org/html/2605.22891#S4)\.
- •NOTICE\.mdandLICENSES/– third\-party code attribution; portions ofsrc/top\_reco/coordinates/and the transformer building blocks ofsrc/top\_reco/models/are adapted from other sources\.

The*analytic solver*baseline used in Benchmark II is the algebraic on\-shell\-mass solver of\[[48](https://arxiv.org/html/2605.22891#bib.bib4)\]; the implementation is external to this repository and is available from the corresponding author on request\.

### E\.2Compute resources

All experiments were run on a singleNVIDIA A100 80GB PCIeGPU\. Per\-model training times are: Benchmark I – MLP\-backbone models 1 minute each, normalizing flow 7 minutes \(5 seeds each\); Benchmark II – transformer regressions 4 hours, continuous flow 16 hours, discrete flow 8 hours \(single seed each, table 6\)\. In addition, the flow\-based models needed GPU\-time for sampling which we estimate to a combined sampling time of 30 GPU\-hours\. Total reported\-experiment compute is approximately 65 GPU\-hours\. Total project compute including hyperparameter exploration, MMD\-bandwidth scans on additional observables \([section˜D\.5](https://arxiv.org/html/2605.22891#A4.SS5)\), and discarded architectural variants is approximately 400 GPU\-hours\.

### E\.3Dependencies and environment

Models are implemented in PyTorch \(≥\\geq2\.2\)\. The synthetic\-benchmark normalizing flow uses normflows\[[50](https://arxiv.org/html/2605.22891#bib.bib52)\]\. The continuous flow uses standard PyTorch ODE integration utilities\[[10](https://arxiv.org/html/2605.22891#bib.bib45)\]\. Full dependency lists with pinned versions are provided in the repository inuv\.lock\.

### E\.4Dataset licensing and access

The Benchmark II dataset is the public Delphes Monte Carlo release ofRaineet al\.\[[42](https://arxiv.org/html/2605.22891#bib.bib40)\], distributed under CC\-BY 4\.0\. We use the train/test splits provided with the release without modification, and we restrict ourselves to events generated with MadGraph\. The Benchmark I synthetic dataset is generated by the released code from the forward model of[eq\.˜7](https://arxiv.org/html/2605.22891#S5.E7); the data\-generation script is included in the repository and produces deterministic outputs given the documented seeds\.
Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems

Similar Articles

Improving Multimodal Reasoning via Worst Dimension Optimization

Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

Objective-Induced Bias and Search Dynamics in Multiobjective Unsupervised Feature Selection

Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

Submit Feedback

Similar Articles

Improving Multimodal Reasoning via Worst Dimension Optimization
Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring
Objective-Induced Bias and Search Dynamics in Multiobjective Unsupervised Feature Selection
Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking
The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm