The Degeneracy Distillery

arXiv cs.LG Papers

Summary

This paper introduces the degeneracy distillery, a method that automatically detects and resolves degenerate parameter combinations in physical models by estimating and flattening the Fisher information matrix, reducing the simulation budget required for neural posterior estimation while providing physical insight.

arXiv:2606.23838v1 Announce Type: new Abstract: When two or more parameters or labels produce similar data, they are degenerate, or hard to distinguish. Degeneracies render both label prediction and inverse problems difficult, since both machine learning algorithms and probabilistic samplers rely on the distinguishability of data and its gradients with respect to parameters. However, identifying degeneracies in physical models or real-world datasets can be elucidating about the choice of model or the underlying process that produces the data. We present the degeneracy distillery, a method that (1) detects and (2) resolves degenerate parameter combinations (a) automatically and (b) symbolically, from parameter-data (or parameter-simulation) pairs alone, through estimation and flattening of the Fisher information matrix. By exploring the information geometry of the likelihood, we characterize degeneracies as an intrinsic property of the physical model, requiring no realised data observation. We demonstrate our approach on a range of synthetic and real-world problems, discovering symbolic coordinate transformations that identify the combinations of parameters of a model which yield independent effects on the data. The resulting coordinates flatten the Fisher information in expectation globally, in contrast to posterior-based methods that flatten only at a single point, and substantially reduce the simulation budget required for downstream neural posterior estimation. In test cases we require up to $10\times$ fewer simulations for posterior estimation at matched validation calibration whilst simultaneously gaining physical insight on the system.
Original Article
View Cached Full Text

Cached at: 06/24/26, 07:49 AM

# The Degeneracy Distillery
Source: [https://arxiv.org/html/2606.23838](https://arxiv.org/html/2606.23838)
T\. Lucas Makinen1,2Deaglan J\. Bartlett3,4Niall Jeffrey5,6Benjamin D\. Wandelt7,8 1Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Wilberforce Rd, Cambridge CB3 0WA, United Kingdom2Imperial Centre for Inference and Cosmology \(ICIC\), Imperial College London, Prince Consort Road, London SW7 2AZ, United Kingdom3Astrophysics, University of Oxford, Oxford OX1 3RH, United Kingdom4CNRS & Sorbonne Université, Institut d’Astrophysique de Paris \(IAP\), UMR 7095, 98 bis bd Arago, F\-75014 Paris, France5Department of Physics and Astronomy, University College London, London WC1E 6BT, United Kingdom6Department of Physics & King’s Institute for Artificial Intelligence, King’s College London, London WC2R 2LS, United Kingdom7Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218, USA8Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218, USA

###### Abstract

When two or more parameters or labels produce similar data, they aredegenerate, or hard to distinguish\. Degeneracies render both label prediction and inverse problems difficult, since both machine learning algorithms and probabilistic samplers rely on the distinguishability of data and its gradients with respect to parameters\. However, identifying degeneracies in physical models or real\-world datasets can be elucidating about the choice of model or the underlying process that produces the data\. We present thedegeneracy distillery, a method that \(1\) detects and \(2\) resolves degenerate parameter combinations \(a\) automatically and \(b\) symbolically, from parameter\-data \(or parameter\-simulation\) pairs alone, through estimation and flattening of the Fisher information matrix\. By exploring the information geometry of the likelihood, we characterize degeneracies as an intrinsic property of the physical model, requiring no realised data observation\. We demonstrate our approach on a range of synthetic and real\-world problems, discovering symbolic coordinate transformations that identify the combinations of parameters of a model which yield independent effects on the data\. The resulting coordinates flatten the Fisher information in expectation*globally*, in contrast to posterior\-based methods that flatten only at a single point, and substantially reduce the simulation budget required for downstream neural posterior estimation\. In test cases we require up to10×10\\timesfewer simulations for posterior estimation at matched validation calibration whilst simultaneously gaining physical insight on the system\.

## 1Introduction

Many scientific models are parametrised by a set of variables, yet their predictions often depend on these in a highly nontrivial way, with different parameter variations producing similar or indistinguishable effects\. As a result, variations in individual parameters can produce correlated effects on observables, with the model responding primarily to particular combinations of parameters rather than to each independently\. In such situations, we say that there aredegeneraciesin parameter space\.

Identifying these directions is important for both statistical and practical reasons\. From an inference perspective, aligning coordinates with directions of sensitivity improves conditioning, making likelihood or posterior exploration and optimisation significantly more efficient\. It also provides physical insight by revealing the combinations of parameters that actually control observables, rather than relying on arbitrary coordinate choices\. By resolving the true directions along which predictions change, one can capture parameter\-data relationships more precisely and avoid redundancies\. This has direct implications for computational cost: simulations can be targeted along informative directions rather than wasted along degeneracies that produce indistinguishable outputs\. Finally, such structure enables the construction of more informed priors, defined in terms of meaningful parameter combinations rather than poorly aligned individual parameters\.

Such behaviour is naturally captured by the Fisher information, which defines a geometry over parameters that is typically highly anisotropic and ill\-conditioned\. In settings with analytic control, one can sometimes identify appropriate reparametrisations by hand\. However, in more complex or computationally intensive models this becomes infeasible, motivating the need for systematic, data\-driven approaches\. In this work, we develop a method to discover global reparametrisations that regularise this geometry, yielding coordinates in which the Fisher information is approximately isotropic across parameter space\.

Thedegeneracy distillerymaps the information geometry of a system from parameter\-data pairs and discovers a symbolic parametrisation in terms of the system’s original parameters for which the Fisher matrix is as close to the identity as possible\. As outlined in[Fig\.˜1](https://arxiv.org/html/2606.23838#S1.F1), we achieve this in three main steps: \(i\) Estimate the information geometry everywhere with neural networks; \(ii\) Learn neural coordinates that flatten the Fisher geometry and ensure a unique solution; \(iii\) Find nonlinear symbolic expressions that approximate these neural coordinates and collect like terms\. The output of the distillery is therefore a coordinate transformation that removes the degeneracies inherent in the original parametrisation\.

##### Contributions\.

\(i\) We introduce a three\-stage pipeline that infers the Fisher metric globally over parameter space from simulations alone, and uses it to learn a symbolic, invertible reparametrisation that flattens the Fisher information\. \(ii\) We provide ensemble estimators for both the Fisher matrix and the flattened coordinates, with an alignment scheme that fixes the residual constant offset and orthogonal symmetries of the flattening loss \(see[Appendix˜B](https://arxiv.org/html/2606.23838#A2)\)\. \(iii\) We validate the method on two synthetic problems with known geometry \(Rosenbrock and a one\-dimensional Gaussian\), where analytic flat coordinates and geodesic normal coordinates serve as benchmarks\. \(iv\) We apply the distillery to four scientific problems—SIR epidemic dynamics, gravitational\-wave inspiral waveforms, weak\-lensing cosmology from expensive dark matter simulations, and an industrial heater control problem—each chosen to probe a different regime of nonlinearity, dimensionality, and prior knowledge\. \(v\) We demonstrate that the resulting coordinates lower the simulation budget required by downstream neural posterior estimation while preserving calibrated coverage\.

We briefly review the related literature in[Section˜2](https://arxiv.org/html/2606.23838#S2), then describe the principles of information geometry in[Section˜3](https://arxiv.org/html/2606.23838#S3)\. We detail the degeneracy distillery pipeline in[Section˜4](https://arxiv.org/html/2606.23838#S4), before validating it against synthetic examples and applying it to scientific problems in[Section˜5](https://arxiv.org/html/2606.23838#S5)\. We discuss limitations and conclude in[Section˜6](https://arxiv.org/html/2606.23838#S6)\. Full experimental, architectural, and theoretical details are deferred to the appendices\.

![Refer to caption](https://arxiv.org/html/2606.23838v1/x1.png)Figure 1:Degeneracy distillery pipeline in three steps\. \(1\) Parameter\-data pairs\(θ,x\)\(\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\},\\textbf\{x\}\)are passed to an ensemble of Fishnet networks to learn approximations to the Fisher matrix at each parameter value\. \(2\) A “flattener” network is trained so that its Jacobian maps the learned metric to the identity as a function ofθ\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\. \(3\) Symbolic regression is performed on each outputη\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}coordinate to obtain short, nonlinear, degeneracy\-resolving expressions\.

## 2Related Work

##### Sloppy models\.

“Sloppy” models describe a common feature of multiparameter nonlinear systems: predictions depend strongly on a few parameter combinations while remaining insensitive to many others\. This is reflected in a Fisher information matrix with eigenvalues spanning many orders of magnitude, producing a highly anisotropic parameter geometry and a “hyperribbon” model manifold\[[15](https://arxiv.org/html/2606.23838#bib.bib69),[41](https://arxiv.org/html/2606.23838#bib.bib70),[40](https://arxiv.org/html/2606.23838#bib.bib68)\]\.Gutenkunstet al\.\[[15](https://arxiv.org/html/2606.23838#bib.bib69)\]interpret the Fisher information as a Riemannian metric, using its eigenstructure to distinguish stiff and sloppy directions and explain how accurate predictions coexist with poorly constrained parameters\. Exact or near degeneracies are handled explicitly, either as directions of vanishing sensitivity or via reductions such as the manifold boundary approximation method \(MBAM\), which collapse these directions at manifold boundaries\[[42](https://arxiv.org/html/2606.23838#bib.bib71)\]\.

##### Local conformal autoencoders and intrinsic coordinates\.

The LOCA method\[[36](https://arxiv.org/html/2606.23838#bib.bib73)\]andEvangelouet al\.\[[12](https://arxiv.org/html/2606.23838#bib.bib72)\]also construct intrinsic coordinates that remove redundancies from nonlinear observation maps, but without explicit use of likelihood\-based geometry\. LOCA enforces local whitening in data space, normalising the pull\-back metric and producing coordinates invariant \(up to conformal factors\) under smooth observation transformations\.Evangelouet al\.\[[12](https://arxiv.org/html/2606.23838#bib.bib72)\]instead identify parameter combinations that determine outputs by analysing the parameter\-to\-observable map, separating predictive directions from level sets\. Both are data\-driven and focus on removing non\-identifiable directions rather than flattening an information metric\. They yield coordinates adapted to the model manifold, whereas our approach seeks to regularise the Fisher geometry over parameter space\.

##### Geometric variational inference and posterior reparametrisation\.

Geometric variational inference \(geoVI\)\[[13](https://arxiv.org/html/2606.23838#bib.bib74)\]also uses the Fisher information, but locally: it constructs posterior\-specific coordinates that flatten the geometry near the posterior, enabling Gaussian approximations\. Our approach instead provides a global, model\-intrinsic reparametrisation independent of data\. Similarly,Dacunhaet al\.\[[10](https://arxiv.org/html/2606.23838#bib.bib75)\]learn global density models to derive the local Fisher geometry and decompose parameters into constrained and unconstrained directions\. This is also posterior\-specific, whereas our method is global; additionally, their coordinates are neural, while ours use explicit analytic forms via symbolic regression\.

##### Symbolic regression and information\-geometry priors\.

The symbolic component of our pipeline builds onoperon\[[9](https://arxiv.org/html/2606.23838#bib.bib77),[26](https://arxiv.org/html/2606.23838#bib.bib76)\]and on minimum\-description\-length selection criteria for symbolic regression\[[4](https://arxiv.org/html/2606.23838#bib.bib30),[3](https://arxiv.org/html/2606.23838#bib.bib31)\]\. Ours is, to our knowledge, the first method to combine simulation\-based Fisher estimation with a global, symbolic flattening of the resulting metric\.

## 3Theoretical Background

Consider a model that producesnxn\_\{\\textbf\{x\}\}data pointsx, controlled bynθn\_\{\\theta\}parametersθ\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\. The likelihood,p​\(x\|θ\)p\(\\textbf\{x\}\|\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\), describes the probability distribution of data that is produced at a particular parameter value\. A higher likelihood value indicates a better description of the model, viaθ\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}, of the data\. The Kullback–Leibler \(KL\) divergence between two points,θ1\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\_\{1\}andθ2\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\_\{2\}, describes how much more likely the choice ofθ1\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\_\{1\}describes the data thanθ2\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\_\{2\}:

DKL\(θ1\|\|θ2\)=∫dxp\(x\|θ1\)\(logp\(x\|θ1\)−logp\(x\|θ2\)\)\.D\_\{\\rm KL\}\(\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\_\{1\}\\,\|\|\\,\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\_\{2\}\)=\\int\{\\rm d\}\\textbf\{x\}\\ p\(\\textbf\{x\}\|\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\_\{1\}\)\\left\(\\log p\(\\textbf\{x\}\|\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\_\{1\}\)\-\\log p\(\\textbf\{x\}\|\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\_\{2\}\)\\right\)\.\(1\)This does not satisfy the triangle inequality, and is not symmetric, making it a poor choice of distance metric\. However,Rao \[[37](https://arxiv.org/html/2606.23838#bib.bib38)\]’s insight came from considering a very small deviation from a point,θ\+δ​θ\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\+\\delta\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\. Here, the KL divergence reduces to:

DKL\(θ\|\|θ\+δθ\)=12Fa​bδθaδθb\+𝒪\(δθ3\),D\_\{\\rm KL\}\(\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\\,\|\|\\,\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\+\\delta\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\)=\\frac\{1\}\{2\}F\_\{ab\}\\,\\delta\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}^\{a\}\\,\\delta\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}^\{b\}\+\\mathcal\{O\}\(\\delta\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}^\{3\}\),\(2\)where theFisher Information Matrix,𝗙\\bm\{\\mathsf\{F\}\}, is defined as:

Fa​b​\(θ\)=−∫dx​p​\(x\|θ\)​∂∂θa​∂∂θb​log⁡p​\(x\|θ\),F\_\{ab\}\(\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\)=\-\\int\{\\rm d\}\\textbf\{x\}\\ p\(\\textbf\{x\}\|\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\)\\,\\frac\{\\partial\}\{\\partial\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}^\{a\}\}\\frac\{\\partial\}\{\\partial\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}^\{b\}\}\\log p\(\\textbf\{x\}\|\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}\),\(3\)where we use the Einstein summation convention\. The Fisher Matrix is positive semi\-definite, and transforms properly under smooth diffeomorphisms, making it an ideal candidate for a Riemannian metric\. This metric underpins the underlyinginformation geometryof the likelihood\[[33](https://arxiv.org/html/2606.23838#bib.bib89),[37](https://arxiv.org/html/2606.23838#bib.bib38)\], and is controlled by the parametersθ\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}via the expectation over the data generated by the model\. The Fisher matrix is not invariant to the way we choose to parametrise our system\. Suppose we have some parametrisation in terms of the parameters\{θa\}\\\{\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}^\{a\}\\\}, for which we have the Fisher matrix𝗙\\bm\{\\mathsf\{F\}\}\. If we now change to a new parametrisation in terms of\{ηa\}\\\{\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}^\{a\}\\\}then we have the new Fisher matrix

Fa​b′=∂θc∂ηa​∂θd∂ηb​Fc​d\.F\_\{ab\}^\{\\prime\}=\\frac\{\\partial\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}^\{c\}\}\{\\partial\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}^\{a\}\}\\frac\{\\partial\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}^\{d\}\}\{\\partial\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}^\{b\}\}\\,F\_\{cd\}\.\(4\)If all the parameters of a system acted completely independently, then we would have a diagonal Fisher matrix\. However, in general, this is not the case, since some of the parameters aredegenerate: changing the value of one of them changes the data in a similar way to changing another\. In terms of the Fisher information, a statistical degeneracy is a parameter\-space directionδ​𝜽\\delta\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\bm\{\\theta\}\}with vanishing or very small curvature, i\.e\.δ​𝜽⊤​𝗙​δ​𝜽≈0\\delta\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\bm\{\\theta\}\}^\{\\top\}\\bm\{\\mathsf\{F\}\}\\,\\delta\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\bm\{\\theta\}\}\\approx 0, where𝗙\\bm\{\\mathsf\{F\}\}has small eigenvalues\.

Ideally, one would find a parametrisation in𝜼\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}that makes the Fisher matrix diagonal everywhere in parameter space, and hence all degeneracies would be removed\. It is even more desirable to have a parametrisation in which𝗙\\bm\{\\mathsf\{F\}\}is equal to the identity, such that a change of unity in any of the𝜼\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}parameters has a similar effect on the likelihood\.

Utilising the analogy with geometry, this corresponds to finding a coordinate system \(parametrisation\) in which the metric \(Fisher matrix\) is as close to the identity matrix as possible\. In some special cases, it is possible to do this exactly everywhere, which in our geometry analogy corresponds to a system with zero curvature \(the Riemann tensor is zero\[[27](https://arxiv.org/html/2606.23838#bib.bib35)\]\)\. In general, however, there will be some curvature, and thus we can only aim to make our Fisher matrix as flat as possible, even if we cannot completely remove all degeneracies or curvature\. Our goal is therefore to find such a \(symbolic\) coordinate transformation automatically from data–parameter pairs, for which the average deviation from flatness across the prior range of the parameters is minimised\.

## 4Method

The pipeline is summarised in[Fig\.˜1](https://arxiv.org/html/2606.23838#S1.F1)\. We separate the construction into three stages so that each can be assessed independently\.

##### 1\. Mapping information geometry with Fishnets\.

Given a set of parameter\-data pairs,\{\(𝜽,x\)\}\\\{\(\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\bm\{\\theta\}\},\\textbf\{x\}\)\\\}, we first wish to estimate the Fisher matrix at all points in parameter space\. Traditionally, this is done by running new simulations at perturbed values of the parameters and using finite differences, or by leveraging auto\-differentiation\. The former can be costly, and the latter is only possible when one has a differentiable pipeline\. To circumvent these issues, we use the Fishnets formalism\[[30](https://arxiv.org/html/2606.23838#bib.bib23)\], which we briefly review here\. Given the data vectorx, we train a compression network to predict estimates of the parameters,𝜽^​\(x\)\\hat\{\\bm\{\\theta\}\}\(\\textbf\{x\}\), as well as the Fisher matrix,𝗙​\(x\)\\bm\{\\mathsf\{F\}\}\(\\textbf\{x\}\)\. Both functions are controlled by a set of model weights,φ\\varphi\. To ensure a positive\-definite matrix, the network does not directly output𝗙\\bm\{\\mathsf\{F\}\}, but instead its Cholesky decomposition,𝗟\\bm\{\\mathsf\{L\}\}, such that𝗙=𝗟𝗟T\\bm\{\\mathsf\{F\}\}=\\bm\{\\mathsf\{L\}\}\\bm\{\\mathsf\{L\}\}^\{\\rm T\}\.

Our optimisation objective is

Lfish​\(φ\)=12​\(𝜽−𝜽^​\(x\)\)⊤​𝗙​\(x\)​\(𝜽−𝜽^​\(x\)\)−12​log​det𝗙​\(x\)\.L\_\{\\rm fish\}\(\\varphi\)=\\tfrac\{1\}\{2\}\\\!\\left\(\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\bm\{\\theta\}\}\-\\hat\{\\bm\{\\theta\}\}\(\\textbf\{x\}\)\\right\)^\{\\\!\\top\}\\bm\{\\mathsf\{F\}\}\(\\textbf\{x\}\)\\left\(\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\bm\{\\theta\}\}\-\\hat\{\\bm\{\\theta\}\}\(\\textbf\{x\}\)\\right\)\-\\tfrac\{1\}\{2\}\\log\\det\\bm\{\\mathsf\{F\}\}\(\\textbf\{x\}\)\.\(5\)This objective is minimised over batches of pairs\(𝜽,x\)\(\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\bm\{\\theta\}\},\\textbf\{x\}\):

minφ⁡Lfish=minφ​∫dx​d𝜽​Lfish​p​\(x,𝜽\),\\min\_\{\\varphi\}L\_\{\\rm fish\}=\\min\_\{\\varphi\}\\int\\\!\{\\rm d\}\\textbf\{x\}\\,\{\\rm d\}\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\bm\{\\theta\}\}\\,L\_\{\\rm fish\}\\,p\(\\textbf\{x\},\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\bm\{\\theta\}\}\),\(6\)which can be seen as a variational minimisation over parameter space of[Eq\.˜2](https://arxiv.org/html/2606.23838#S3.E2)\. We show in[Appendix˜A](https://arxiv.org/html/2606.23838#A1)that minimising this objective yields the posterior mean and \(inverse\) covariance, which asymptotically approaches the underlying Fisher information\.

In practice, the Fishnets prediction is a noisy estimator of the “true” underlying Fisher information\. To compensate, we train an ensemble of networks\. For each memberii, we re\-evaluate[Eq\.˜6](https://arxiv.org/html/2606.23838#S4.E6)on a validation set, obtaining a valueLiL\_\{i\}, and form a validation\-weighted mean and uncertainty

𝗙¯=∑iwi​𝗙\(i\),δ​Fa​b=∑iwi​\(Fa​b\(i\)−F¯a​b\)2,wi=exp⁡\(−Li\)∑jexp⁡\(−Lj\),\\bar\{\\bm\{\\mathsf\{F\}\}\}=\\sum\_\{i\}w\_\{i\}\\,\\bm\{\\mathsf\{F\}\}^\{\(i\)\},\\quad\\delta F\_\{ab\}=\\sqrt\{\\sum\_\{i\}w\_\{i\}\\\!\\left\(F\_\{ab\}^\{\(i\)\}\-\\bar\{F\}\_\{ab\}\\right\)^\{\\\!2\}\},\\quad w\_\{i\}=\\frac\{\\exp\(\-L\_\{i\}\)\}\{\\sum\_\{j\}\\exp\(\-L\_\{j\}\)\},\(7\)on the individual entries of the Fisher matrix\. We do not consider any covariance between the elements of𝗙\\bm\{\\mathsf\{F\}\}\. To ensure sufficient diversity, we vary hidden size over a wide uniform range and choose activation functions at random from a library of options \([Appendix˜C](https://arxiv.org/html/2606.23838#A3)\)\. We find that1010–2020networks produce Gaussian errors on the entries of𝗙\\bm\{\\mathsf\{F\}\}\.

##### 2\. Neural coordinates\.

Once we have an estimate for the Fisher matrix at all points in parameter space, we wish to find a new set of coordinates for which the Fisher is close to the identity everywhere\. We parametrise this coordinate transformation with a neural network which takes the original coordinate values as input and outputs the transformed coordinates,𝜼=𝜼NN​\(𝜽\)\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}=\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}\_\{\\rm NN\}\(\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\bm\{\\theta\}\}\)\. One can take the derivative of this network with respect to the inputs to obtain its Jacobian𝗝=\[∂𝜼/∂𝜽\]\\bm\{\\mathsf\{\{J\}\}\}=\[\\partial\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}/\\partial\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\bm\{\\theta\}\}\]\. We use the pseudo\-inverse of𝓙\\bm\{\\mathsf\{\\mathcal\{J\}\}\}to transform𝗙¯\\bar\{\\bm\{\\mathsf\{F\}\}\}to its value in the new coordinates,𝗙¯′\\bar\{\\bm\{\\mathsf\{F\}\}\}^\{\\prime\}, using[Eq\.˜4](https://arxiv.org/html/2606.23838#S3.E4)\.

To train the network, we compute the Frobenius norm between𝗙¯′\\bar\{\\bm\{\\mathsf\{F\}\}\}^\{\\prime\}and the identity:

ℓ​\(𝗔\)=‖Aa​b−δa​b‖ℱ2≡tr⁡\(\(𝗔−𝟭\)​\(𝗔−𝟭\)T\)\.\\ell\(\\bm\{\\mathsf\{A\}\}\)=\\big\\\|\{A\}\_\{ab\}\-\\delta\_\{ab\}\\big\\\|\_\{\\mathcal\{F\}\}^\{2\}\\equiv\\operatorname\{tr\}\\\!\\left\(\(\\bm\{\\mathsf\{A\}\}\-\\bm\{\\mathsf\{1\}\}\)\(\\bm\{\\mathsf\{A\}\}\-\\bm\{\\mathsf\{1\}\}\)^\{\\rm T\}\\right\)\.\(8\)We additionally penalise the inverse, so that the loss isL​\(𝜼\)=ℓ​\(𝗙¯′\)\+ℓ​\(𝗙¯′⁣−1\)L\(\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}\)=\\ell\(\\bar\{\\bm\{\\mathsf\{F\}\}\}^\{\\prime\}\)\+\\ell\(\\bar\{\\bm\{\\mathsf\{F\}\}\}^\{\\prime\\,\-1\}\)\. This loss is summed over all training points, averaging over the prior from which the simulations were drawn\. Instead of locally flattening our coordinates around a fiducial point, this corresponds to making them approximately flat over all of space\. As with estimating the Fisher information, we fit an ensemble of networks to this coordinate transformation to compute uncertainties\. We begin by fitting a single network to the coordinate transformation using𝗙¯\\bar\{\\bm\{\\mathsf\{F\}\}\}as our Fisher estimate\. This network is then copied multiple times, and each copy is fine\-tuned to a different Fisher estimate,𝗙\(i\)\\bm\{\\mathsf\{F\}\}^\{\(i\)\}, from the earlier ensemble, to obtain𝜼\(i\)\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}^\{\(i\)\}\. After ensuring that each of these estimates is aligned \(see[Appendix˜B](https://arxiv.org/html/2606.23838#A2)\), we compute the mean and uncertainty

η¯a=∑iwi​η\(i\)a,δ​ηa=∑iwi​\(η\(i\)a−η¯a\)2,\\bar\{\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}\}^\{a\}=\\sum\_\{i\}w\_\{i\}\\,\{\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}^\{\(i\)\}\}^\{a\},\\quad\\delta\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}^\{a\}=\\sqrt\{\\sum\_\{i\}w\_\{i\}\\\!\\left\(\{\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}^\{\(i\)\}\}^\{a\}\-\\bar\{\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}\}^\{a\}\\right\)^\{\\\!2\}\},\(9\)on each transformed coordinate, with the same weighting as before\. We provide several architecture options for the flattening operator, but use an inverse\-conditioned, fully\-connected MLP by default \(see[Appendix˜C](https://arxiv.org/html/2606.23838#A3)\)\.

##### 3\. Symbolic coordinates\.

For increased interpretability and ease of use, we wish to find a symbolic approximation to the coordinate transformation learned by the network\. We apply symbolic regression\[[26](https://arxiv.org/html/2606.23838#bib.bib76)\]withoperon\[[9](https://arxiv.org/html/2606.23838#bib.bib77)\]separately for each output coordinate \(and thus assume sufficiently independent errors for each component\), using a Gaussian loss function with the previously calculated mean and uncertainty\. Crucially, once trained, the neural map𝜼​\(𝜽\)\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}\(\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\bm\{\\theta\}\}\)is decoupled from the original simulator: we therefore draw a fresh, dense set of parameters from the prior and pass them through the flattening ensemble to construct the augmented\(𝜽,𝜼¯,δ​𝜼\)\(\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\bm\{\\theta\}\},\\bar\{\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}\},\\delta\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}\)training set used by SR\. In our experiments, this augmentation was the most consistent way to improve symbolic\-regression convergence–e\.g\. on Rosenbrock we use∼2,000\{\\sim\}2\{,\}000prior draws against∼250\{\\sim\}250Fishnet validation points \([Appendix˜E](https://arxiv.org/html/2606.23838#A5)\)\. We use the defaultoperonconfiguration described in[Appendix˜C](https://arxiv.org/html/2606.23838#A3): an offspring\-generator with maximum equation length2525, depth1010,1010optimiser iterations per candidate, the default operator set\{add, mul, pow, sqrt, square, exp, logabs\}, and a per\-component time budget of120120s\. Given a final Pareto front of the most accurate models at each complexity, we select our final equation using a formulation of the minimum\-description\-length \(MDL\) principle relevant to symbolic regression\[[4](https://arxiv.org/html/2606.23838#bib.bib30)\]\. This single objective penalises functions which are poor fitting, highly complex, or have many or finely tuned parameters, and is an approximation to the logarithm of the probability of the function given the data, under a specific prior on the equation’s form\[[3](https://arxiv.org/html/2606.23838#bib.bib31)\]\.

For comparison, we also leverage a “flatness” metric in the form of[Eq\.˜8](https://arxiv.org/html/2606.23838#S4.E8)\. To select expressions component\-wise we replace the Jacobian column∂ηi/∂θ\\partial\\eta\_\{i\}/\\partial\\thetacalculated from the neural network with the gradient computed from the proposed symbolic expression, and select those expressions which result in the minimum change in loss with respect to the neural coordinates\. This criterion does not penalise expression length, but most often yields a flatter geometry\. All models undergo a pruning procedure to remove unnecessary terms, as outlined in[Section˜C\.4](https://arxiv.org/html/2606.23838#A3.SS4)\.

## 5Experiments

We apply our method to both synthetic and scientific applications\. In both settings we demonstrate both the physical insights \(learned coordinates\) and efficiency gains degeneracy resolution brings to Bayesian inference problems, in particular neural posterior estimation\.

### 5\.1Synthetic validation

![Refer to caption](https://arxiv.org/html/2606.23838v1/x2.png)Figure 2:Rosenbrock validation case\.Left: Learned Fisher curvature \(top\) is flattened by learningη\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}\(bottom\)\.Right: Knowing the symbolic, flat geometry A\) makes learning posteriors easier even with increasing simulation budget B\) yields faithful posteriors in targetθ\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}coordinates and C\) is better calibrated than learning in degenerateθ\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\theta\}space\.##### Rosenbrock\.

We consider the two\-dimensional Rosenbrock function: a notoriously difficult, non\-convex test problem whose information geometry is known and can be analytically mapped to the identity using “flat” coordinates \([Section˜E\.1](https://arxiv.org/html/2606.23838#A5.SS1)\)\. Using data generated as described in[Section˜E\.1](https://arxiv.org/html/2606.23838#A5.SS1), we display the results of our three\-step procedure in[Fig\.˜2](https://arxiv.org/html/2606.23838#S5.F2), where our original coordinates are𝜽=\(μ0,μ1\)\\bm\{\\theta\}=\(\\mu\_\{0\},\\mu\_\{1\}\)and the transformed coordinates are𝜼=\(ν0,ν1\)\{\\bm\{\\eta\}\}=\(\\nu\_\{0\},\\nu\_\{1\}\)\.

Results\.The Fishnet ensemble produces an unbiased and well\-calibrated estimate of the true underlying information geometry of the Rosenbrock posterior, illustrated via \(a\) Fisher Cholesky factors and \(b\) trace\-normalised Fisher matrix shapeMi​j=Fi​j/tr⁡𝗙M\_\{ij\}=F\_\{ij\}/\\operatorname\{tr\}\\bm\{\\mathsf\{F\}\}\(Fig\.[6](https://arxiv.org/html/2606.23838#A5.F6)\)\. The flattening operation produces ensemble\-computed neural coordinatesη\\etaand residualsδ​η\\delta\\etathat are consistent with𝒩​\(0,1\)\\mathcal\{N\}\(0,1\), ideally suited for the symbolic\-regression MDL selection criterion\. The symbolic regression and post\-processing step consistently recovers the true expressions,η=\[b0​μ02\+b1​μ1,b2​μ1\]\\eta=\[b\_\{0\}\\mu\_\{0\}^\{2\}\+b\_\{1\}\\mu\_\{1\},\\;b\_\{2\}\\mu\_\{1\}\]\. Knowing amortised coordinates makes downstream inference in the Rosenbrock setting more simulation efficient\. We train an ensemble of NPE networks at different simulation budgets on an exacerbated Rosenbrock geometry \(Σ=diag​\(42,12\)\\Sigma=\{\\rm diag\}\(4^\{2\},1^\{2\}\)\)\. Using the learned coordinates we reconstruct the true posteriors at all simulation budgets with 1\) better calibration and 2\) up to10×10\\timesfewer simulations than NPE in rawθ\\theta\.

##### Gaussian with unknown variance\.

We now turn to an example which cannot be flattened exactly at all points in parameter space\. We consider a one\-dimensional Gaussian, parametrised by meanμ\\muand standard deviationσ\\sigma\. Taking draws from this distribution, the Fisher matrix isFa​b=nd​diag⁡\(1/σ2,2/σ2\)F\_\{ab\}=n\_\{\\rm d\}\\operatorname\{diag\}\(1/\\sigma^\{2\},2/\\sigma^\{2\}\)\. As discussed in[Appendix˜D](https://arxiv.org/html/2606.23838#A4), the Riemann tensor for a metric withga​b=Fa​bg\_\{ab\}=F\_\{ab\}is non\-zero in this case \([Eq\.˜49](https://arxiv.org/html/2606.23838#A4.E49)\), confirming that one cannot flatten the Fisher matrix everywhere\. Hence, we wish to find a choice of parameters which makes the Fisher as flat as possible, even if one cannot make it identical to the identity everywhere\. Given a fiducial point, it is possible to enforce only quadratic deviations from flatness through the use of geodesic normal coordinates\. These are constructed in[Appendix˜D](https://arxiv.org/html/2606.23838#A4), and are used as a benchmark: when averaged over the prior, our method should construct coordinates with a lower loss than these analytic ones, although close to the fiducial point the analytic ones must perform better\. We choose the fiducial point to be\(0\.0,1\.0\)\(0\.0,1\.0\)\. We generate10 00010\\,000simulations viaμ∼𝒰​\[−5,5\]\\mu\\sim\\mathcal\{U\}\[\-5,5\]andσ∼𝒰​\[0\.2,7\.0\]\\sigma\\sim\\mathcal\{U\}\[0\.2,7\.0\], and generatex∼𝒩​\(μ,σ\)x\\sim\\mathcal\{N\}\(\\mu,\\sigma\)withdim\(x\)=50\\dim\(x\)=50\. We adopt the same network setup calibrated on the Rosenbrock example\.

Results\.We obtain \(i\) neural coordinates and \(ii\) two sets of symbolic coordinates selected according to \(a\) MDL and \(b\) Frobenius criteria\. We compare the “flatness” of our coordinates againsttrue, analytic Fisher matrices as a function of geodesic distanceβ\\beta\(see[Appendix˜D](https://arxiv.org/html/2606.23838#A4)\), computed over test data\. For comparison, we compute geodesic normal coordinates around the fiducial pointθ⋆=\(μ⋆,σ⋆\)=\(0,1\)\\theta^\{\\star\}=\(\\mu^\{\\star\},\\sigma^\{\\star\}\)=\(0,1\), as well as “ad\-hoc” coordinates typically used for Gaussian parameter inference,ηad​\-​hoc=\(μ/σ,log⁡σ\)\\eta^\{\\rm ad\\text\{\-\}hoc\}=\(\\mu/\\sigma,\\,\\log\\sigma\)\.[Fig\.˜3](https://arxiv.org/html/2606.23838#S5.F3)shows that geodesic coordinates obtain perfect flattening close to the fiducial point \(β≈0\\beta\\approx 0\), but fail exponentially quickly further away\. Both neural and symbolic coordinates consistently flatten the true geometry across the parameter space\.

![Refer to caption](https://arxiv.org/html/2606.23838v1/x3.png)Figure 3:Left:Neural and symbolic regression components outperform \(i\) hand\-picked \(ad\-hoc\) coordinates and \(ii\) flatten geometry far from the fiducial point[Eq\.˜74](https://arxiv.org/html/2606.23838#A4.E74), without the need for an analytic Fisher matrix to derive geodesic normal coordinates\.Right:Inference validationlogp\(⋅\|y\)\\log p\(\\cdot\|y\)shows little degredation for NPE estimating distilledη=∏iθi∈ℝ\\eta=\\prod\_\{i\}\\theta\_\{i\}\\in\\mathbb\{R\}for the chain\-product heater simulator with intrinsic dim=1=1, while raw NPE inθ∈ℝd\\theta\\in\\mathbb\{R\}^\{d\}degrades withddat fixed training budgetNN\.

### 5\.2Scientific applications

##### Industrial Heater control\.

We construct a minimal synthetic experiment to demonstrate that the distillery can perform rank\-deficient*dimensionality reduction*, improving NPE scaling\. Here we explore a rank\-deficient diagnostic case, not a global flattening case\. A first\-order resistive\-heater plant is driven by an applied voltage and currentθ=\(V,I\)\\theta=\(V,I\), with observable 20\-point time series of the temperature step response \(see e\.g\.\[[20](https://arxiv.org/html/2606.23838#bib.bib86)\]\),

y​\(t\)=\(V⋅I\)​\(1−e−t/τ\)\+ε​\(t\),ε​\(t\)​∼iid​𝒩​\(0,σ2\)\.y\(t\)\\;=\\;\(V\\cdot I\)\\,\(1\-e^\{\-t/\\tau\}\)\\;\+\\;\\varepsilon\(t\),\\qquad\\varepsilon\(t\)\\overset\{\\text\{iid\}\}\{\\sim\}\\mathcal\{N\}\(0,\\sigma^\{2\}\)\.\(10\)The plant dynamics depend on\(V,I\)\(V,I\)onlythrough the dissipated powerP=V⋅IP=V\\cdot I, so the Fisher information matrixFθ​\(θ\)=\(‖k‖2/σ2\)​\(θ2,θ1\)⊤​\(θ2,θ1\)F\_\{\\theta\}\(\\theta\)=\(\\\|k\\\|^\{2\}/\\sigma^\{2\}\)\\,\(\\theta\_\{2\},\\theta\_\{1\}\)^\{\\top\}\(\\theta\_\{2\},\\theta\_\{1\}\)is structurally rank\-1*everywhere*:θ\\thetaparametrises a 2\-D box but the data manifold is exactly 1\-D\. This is a common occurrence in industrial control problems where two physical dials multiply together to produce a single output\.

Results\.Running our method onNsim=103N\_\{\\text\{sim\}\}=10^\{3\}simulations with a uniform priorθi∼𝒰​\[1,2\]\\theta\_\{i\}\\sim\\mathcal\{U\}\[1,2\]recovers two affine coordinates whose symbolic form isηnuis≈0\.10​\(V−I\)\+0\.4\\eta\_\{\\text\{nuis\}\}\\;\\approx\\;0\.10\\,\(V\-I\)\+0\.4andηiden≈−0\.6​\(V\+I\)\+6\.0\\eta\_\{\\text\{iden\}\}\\;\\approx\\;\-0\.6\\,\(V\+I\)\+6\.0\. The pipeline therefore*automatically*identifies the 1\-D submanifold of identifiable parameters and the orthogonal nuisance direction from the data alone \(see App\.[Section˜E\.5](https://arxiv.org/html/2606.23838#A5.SS5)for details\)\.Scaling Impact\.Density estimation becomes increasingly difficult in high dimensions, requiringN​\(d,ε\)∼ε−\(2​β\+d\)/\(2​β\)N\(d,\\varepsilon\)\\;\\sim\\;\\varepsilon^\{\-\(2\\beta\+d\)/\(2\\beta\)\}simulations to reach an error toleranceε\\varepsilon\(Stone’s relation;\[[39](https://arxiv.org/html/2606.23838#bib.bib78)\]\)\. If the data manifold has intrinsic dimensiond∗<dd^\{\*\}<d, training on the distilled coordinates costs onlyN​\(d∗,ε\)N\(d^\{\*\},\\varepsilon\), giving an exponential\-in\-\(d−d∗\)\(d\-d^\{\*\}\)saving \(see App[E\.5](https://arxiv.org/html/2606.23838#A5.SS5)\)\.

Explicit symbolic \(or neural\) distillation of theintrinsicparameter combinations delivers this exponential gain in density estimation by effectively shrinking the parameter space\.

Empirical scaling sweep\.To make this explicit in the context of neural posterior estimation \(NPE;\[[34](https://arxiv.org/html/2606.23838#bib.bib40),[2](https://arxiv.org/html/2606.23838#bib.bib11)\]\), we extend the heater problem to a chain\-product simulator

y​\(t\)=\(∏i=1dθi\)​\(1−e−t/τ\)\+ε​\(t\),θi∼𝒰​\[1,2\],y\(t\)\\;=\\;\\Big\(\\prod\_\{i=1\}^\{d\}\\theta\_\{i\}\\Big\)\\,\(1\-e^\{\-t/\\tau\}\)\\;\+\\;\\varepsilon\(t\),\\qquad\\theta\_\{i\}\\sim\\mathcal\{U\}\[1,2\],\(11\)in which the intrinsic dimension is fixed at one for everydd\. Figure[3](https://arxiv.org/html/2606.23838#S5.F3)shows the best validation log\-probability of two neural density estimators trained on a fixed budget ofNsim=103N\_\{\\text\{sim\}\}=10^\{3\}simulations: the raw flow with targetθ∈ℝd\\theta\\in\\mathbb\{R\}^\{d\}and the distilled flow with targetη=∏iθi∈ℝ\\eta=\\prod\_\{i\}\\theta\_\{i\}\\in\\mathbb\{R\}\. The raw flow’s inference quality degrades monotonically withdd, tracing Stone’s relation, while the distilled flow is, by construction, insensitive todd\. The gap widens at the rate predicted by Eq\. \([92](https://arxiv.org/html/2606.23838#A5.E92)\), despite the underlying observed information being identical at everydd\.

![Refer to caption](https://arxiv.org/html/2606.23838v1/x4.png)Figure 4:Epidemiology example\. Distilled coordinates flatten learned curvature \(left\)\. Amortisedη\\etamake downstream NPE more efficient inR0=β/γR\_\{0\}=\\beta/\\gamma, measured by CRPS at fixed simulation budgets and higher noise levels \(middle\)\.Right:Posteriors sampled inη\\etacaptureR0R\_\{0\}with higher fidelity and trace the degeneracy direction in\(β,γ\)\(\\beta,\\gamma\)\(shown for theNsim=500N\_\{\\rm sim\}=500case\)\.
##### Epidemic infection rates \(SIR\)\.

The Susceptible–Infected–Recovered model is a standard model of epidemic dynamics\[[23](https://arxiv.org/html/2606.23838#bib.bib66),[16](https://arxiv.org/html/2606.23838#bib.bib67)\]\. Epidemic\-outbreak dynamics can be modelled using coupled ordinary differential equations describing susceptible, infectious, and recovered populations, governed by infection \(β\\beta\) and recovery \(γ\\gamma\) rates\. A disease becomes epidemic when its basic reproduction numberR0=β/γ\>1R\_\{0\}=\\beta/\\gamma\>1, corresponding to the case where one infected person infects more than one other person on average\.Data\.Epidemiologists usually only see noisy estimates of infection numbers over time\. Here we ask whether a nonlinear reparameterisation of\(β,γ\)\(\\beta,\\gamma\)can be learned from noisy simulations alone\. We focus on the*epidemic regime*, retaining simulations withR0≥1\.15R\_\{0\}\\geq 1\.15and discarding a narrow band around the epidemic threshold \(details in[Section˜E\.3](https://arxiv.org/html/2606.23838#A5.SS3)\)\. Each simulation vector is a normalised infectious trajectoryI​\(t\)/NI\(t\)/Non a uniform grid of3030time\-points over\[0,50\]\[0,50\]with additive Gaussian noise of standard deviation0\.050\.05\. To distil equations, we use 500 simulations in\(β,γ\)\(\\beta,\\gamma\)\. To demonstrate the usefulness of the learned coordinates, we extend our validation inference problem to include the initial number of infected individuals,I0I\_\{0\}\. The full parameter vector at inference is thenθ=\(β,γ,I0/10\)\\theta=\(\\beta,\\gamma,I\_\{0\}/10\)\.

Results\.We recover the symbolic coordinates

η1=0\.64​β1\.41​γ−0\.72\+0\.54,η2=0\.55​β\+0\.67\\eta\_\{1\}=\\frac\{0\.64\\beta\}\{1\.41\\gamma\-0\.72\}\+0\.54,\\quad\\eta\_\{2\}=0\.55\\beta\+0\.67which closely track the analytically expectedR0∝β/γR\_\{0\}\\propto\\beta/\\gammadirection in the first component, flattening the learned Fisher matrices shown in[Fig\.˜4](https://arxiv.org/html/2606.23838#S5.F4)\. Downstream neural posterior estimation using amortisedη\\etacoordinates provides sharperR0R\_\{0\}posterior reconstruction \(measured by CRPS\) and attains the same validation log\-probability as theθ\\theta\-space baseline using up to five times fewer simulations in low\-simulation settings, since𝜼\\bm\{\\eta\}more closely tracks parameter fluctuations in the data \(details in[Section˜E\.3](https://arxiv.org/html/2606.23838#A5.SS3)\)\.

##### Gravitational\-wave chirp mass\.

Here we apply our method to gravitational\-wave \(GW\) data, with the goal of inferring the component massesθ=\(m1,m2\)\\theta=\(m\_\{1\},m\_\{2\}\)of a compact binary black\-hole system from full waveform data\. We compare two simulation regimes with increasing complexity\. In both cases, we benchmark discovered coordinates against the conventional chirp\-mass / mass\-ratio combinationℳc=\(m1​m2\)3/5/\(m1\+m2\)1/5\\mathcal\{M\}\_\{c\}=\(m\_\{1\}m\_\{2\}\)^\{3/5\}/\(m\_\{1\}\+m\_\{2\}\)^\{1/5\},q=m2/m1q=m\_\{2\}/m\_\{1\}used in GW astronomy\.

Run 1\.We use TaylorF2 frequency\-domain inspiral waveforms\[[8](https://arxiv.org/html/2606.23838#bib.bib44),[7](https://arxiv.org/html/2606.23838#bib.bib45)\]at 2PN order \(non\-spinning\), whitened by the Advanced LIGO design sensitivity\[[1](https://arxiv.org/html/2606.23838#bib.bib46)\]and PCA\-compressed to4040dimensions with additive unit Gaussian noise in the whitened domain, at a luminosity distance of200200Mpc\. The Fisher matrix varies non\-trivially over the uniform prior inm1≥m2∈\[5,50\]​M⊙m\_\{1\}\\geq m\_\{2\}\\in\[5,50\]\\,M\_\{\\odot\}\. Heavy systems terminate at lowerfISCO∝M−1f\_\{\\rm ISCO\}\\propto M^\{\-1\}and accumulate fewer in\-band cycles, so no closed\-form flat coordinates inm1,m2m\_\{1\},m\_\{2\}exist\.

Results\.For masses rescaled to the\[0\.1,1\.0\]\[0\.1,1\.0\]range, the distillery returnsη1=0\.2​m1,η2=5\.5​exp⁡\(−0\.08​\(m1\+m2\)2\)\\eta\_\{1\}=0\.2\\,m\_\{1\},\\ \\eta\_\{2\}=5\.5\\,\\exp\\\!\\left\(\-0\.08\\,\(m\_\{1\}\+m\_\{2\}\)^\{2\}\\right\), which is invertible in the prior studied\. The second coordinate captures the inverse of the chirp\-mass\-like sensitivity along the heavy\-mass direction, and together the learned coordinates are a bijection of\(ℳc,q\)\(\\mathcal\{M\}\_\{c\},q\)\.

Run 2\. Beyond the chirp\-mass approximation\.TaylorF2 only captures the inspiral phase of the binary signal; for moderately heavy systems the merger and ringdown carry most of the in\-band signal\-to\-noise and the leading\-orderℳc∝\(m1​m2\)3/5/\(m1\+m2\)1/5\\mathcal\{M\}\_\{c\}\\propto\(m\_\{1\}m\_\{2\}\)^\{3/5\}/\(m\_\{1\}\+m\_\{2\}\)^\{1/5\}scaling no longer dominates the waveform\. We repeat the experiment with the phenomenological inspiral\-merger\-ringdown waveformIMRPhenomD\[[24](https://arxiv.org/html/2606.23838#bib.bib47),[19](https://arxiv.org/html/2606.23838#bib.bib48)\], keeping the prior and PCA pipeline of the TaylorF2 case unchanged\.

Results\.The distillery now returns coordinates that do not contain the leading\-order chirp\-mass factor explicitly: the symbolic outputs are

η1=0\.34​\(m1\+m2\)2−1\.9,η2=−0\.1​m1−0\.3​m2\+0\.10​\(m1\+m2\)2,\\eta\_\{1\}=0\.34\\,\(m\_\{1\}\+m\_\{2\}\)^\{2\}\-1\.9,\\quad\\eta\_\{2\}=\-0\.1\\,m\_\{1\}\-0\.3\\,m\_\{2\}\+0\.10\\,\(m\_\{1\}\+m\_\{2\}\)^\{2\},i\.e\. a monotonic function of total massg​\(M\)g\(M\)and the asymmetric combinationΔ​m=m1−m2\\Delta m=m\_\{1\}\-m\_\{2\}, with no chirp\-mass\-like factor anywhere\. The discovered coordinates are dramatically flatter than\(ℳc,q\)\(\\mathcal\{M\}\_\{c\},q\)for theIMRPhenomDcase, and the downstream posteriors are noticeably rounder and better TARP\-calibrated\[[28](https://arxiv.org/html/2606.23838#bib.bib88)\]than the chirp\-mass baseline \([Fig\.˜5](https://arxiv.org/html/2606.23838#S5.F5); details in[Section˜E\.7](https://arxiv.org/html/2606.23838#A5.SS7)\)\. This result indicates that the distillery has identified a more informative, data\-driven combination of parameters\.

##### Weak gravitational lensing\.

Weak gravitational lensing \(WL\) alters the trajectories of photons as they pass through massive structures of visible and dark matter\. The two\-point correlation function of this observable is sensitive to two degenerate parameters,\(Ωm,σ8\)\(\\Omega\_\{\\rm m\},\\sigma\_\{8\}\), which control the Universe’s matter content and dark\-matter clustering, respectively\. The nonlinear parameterS8=σ8​\(Ωm/0\.3\)0\.5S\_\{8\}=\\sigma\_\{8\}\(\\Omega\_\{\\rm m\}/0\.3\)^\{0\.5\}is conventionally used to summarise the leading amplitude degeneracy to changes in the two\-point function\[[21](https://arxiv.org/html/2606.23838#bib.bib57),[5](https://arxiv.org/html/2606.23838#bib.bib58),[25](https://arxiv.org/html/2606.23838#bib.bib59)\]\. We apply our method to 2\-point statistics from mock WL convergence images, computed from expensive dark matter simulations \(see App\.[E\.4](https://arxiv.org/html/2606.23838#A5.SS4)\), varied over wide priors in\(Ωm,σ8\)\(\\Omega\_\{m\},\\sigma\_\{8\}\)and initial conditions\.

Results\.We recover flattened coordinatesη1=Ωm​σ8−0\.9​Ωm0\.144\\eta\_\{1\}=\\Omega\_\{m\}\\sigma\_\{8\}\-0\.9\\,\\Omega\_\{m\}^\{0\.144\}andη2=0\.5​\(Ωm−1\.0\)\\eta\_\{2\}=0\.5\(\\Omega\_\{m\}\-1\.0\)\. These coordinates are similar to the standard\(Ωm,S8\)\(\\Omega\_\{m\},S\_\{8\}\)parameterization\. In particular,η1\\eta\_\{1\}captures the dominant weak\-lensing amplitude degeneracy: near typical valuesΩm≃0\.3\\Omega\_\{m\}\\simeq 0\.3andσ8≃0\.8\\sigma\_\{8\}\\simeq 0\.8, contours of constantη1\\eta\_\{1\}have local sloped​ln⁡σ8/d​ln⁡Ωm≃−0\.55d\\ln\\sigma\_\{8\}/d\\ln\\Omega\_\{m\}\\simeq\-0\.55, close to the constant\-S8S\_\{8\}slope of−0\.5\-0\.5\.

![Refer to caption](https://arxiv.org/html/2606.23838v1/x5.png)Figure 5:Main GW result \(IMRPhenomD\)\.Left: Example Fisher matrices in\(m1,m2\)\(m\_\{1\},m\_\{2\}\)and in the learnedη\\etacoordinates\.Middle: posterior contours from neural posterior estimators trained inθ=\(m1,m2\)\\theta=\(m\_\{1\},m\_\{2\}\), in theη\\etabasis, and in the conventional\(ℳc,q\)\(\\mathcal\{M\}\_\{c\},q\)basis\. Theη\\eta\-trained posterior is round and centred on the truth, while bothθ\\theta\-space and\(ℳc,q\)\(\\mathcal\{M\}\_\{c\},q\)posteriors stretch into bananas along theℳc\\mathcal\{M\}\_\{c\}direction \(grey contours\)\.Right: TARP joint coverage;η\\etais closest to the diagonal across the full credibility range, indicating the learned coordinates are well\-calibrated and more informative\.

## 6Discussion and Limitations

Across all experiments, the distillery captured information geometry and produced amortised, symbolic coordinates that make downstream inference more simulation\-efficient and better calibrated in low\- and noisy\-data regimes, and provide physical insight about how observables are controlled by model parameters of interest\.

##### Limitations\.

Our method has four main limitations\. \(i\) Genuinely*multimodal*posteriors cannot be flattened by an invertible reparametrisation, since the Fisher information is a local curvature object that is blind to global modes\. \(ii\) Very high\-dimensional parameter spaces \(nθ≳100n\_\{\\theta\}\\gtrsim 100\) require large networks for stable Fisher estimation, and symbolic regression scales poorly in the number of input variables\. \(iii\) Exact degeneracies \(where𝗙\\bm\{\\mathsf\{F\}\}is singular\) cannot be resolved by an invertible map and require a prior reduction step\[[42](https://arxiv.org/html/2606.23838#bib.bib71)\], although this can be mitigated in most cases with observational noise\. \(iv\) The Fishnet stage is only as accurate as the underlying Gaussian\-posterior assumption \([Appendix˜A](https://arxiv.org/html/2606.23838#A1)\); for strongly non\-Gaussian likelihoods, the recovered metric is the inverse of the posterior covariance rather than the Fisher\.

We introduced thedegeneracy distillery, a method for automatically discovering and resolving parameter degeneracies directly from parameter\-data pairs\. By estimating the Fisher information geometry with neural networks, learning coordinate transformations that flatten this geometry, and extracting symbolic representations of these transformations, our approach produces interpretable reparametrisations that align with the true directions of sensitivity in a model\. This yields coordinates in which the Fisher information is approximately isotropic, improving conditioning for optimisation and sampling, while simultaneously exposing the combinations of parameters that meaningfully control observables\. Across synthetic and real\-world examples, we demonstrated that the distillery can recover known degeneracies as well as uncover previously unrecognised structure, providing both computational and scientific benefits\. In particular, the resulting symbolic transformations offer a compact and human\-interpretable description of physical parameter dependencies\. By focusing computation on informative directions, the method can reduce the cost of simulation\-based inference and improve the robustness of downstream tasks\. The degeneracy distillery provides a principled and practical tool for uncovering and resolving degeneracies, offering both improved numerical performance and deeper insight into the structure of parametrised models\.

## Broader Impact

This method is a tool for understanding and accelerating scientific inference\. The most direct impact is on simulation\-based inference for problems where simulation cost is the dominant bottleneck\. By focusing learning from simulations along informative directions, the distillery can reduce the energy and compute cost of large inference campaigns and uncertainty quantification\. The amortised, symbolic outputs additionally mitigates the opacity of black\-box neural simulation\-based inference\. We do not foresee specific negative societal impacts beyond those generic to machine\-learning tools deployed in scientific domains, although the epidemiology application could be misapplied if used for policy decisions without expert calibration\.

## Data Availability

## Acknowledgements

TLM acknowledges support from the InfoSys Cambridge AI Lab and Imperial College London’s President’s Scholarship, as well as fruitful discussions with Will Handley, Alan Heavens, Natalia Porqueres, Roberto Trotta, Laurence Perreault\-Levasseur, and François Lanusse\. DJB was supported by the Simons Collaboration on “Learning the Universe”, and acknowledges that support was provided by Schmidt Sciences, LLC\. NJ is supported by the ERC\-selected UKRI Frontier Research Grant EP/Y03015X/1\.

## References

- \[1\]J\. Aasiet al\.\(2015\)Advanced LIGO\.Class\. Quantum Grav\.32,pp\. 074001\.External Links:[Document](https://dx.doi.org/10.1088/0264-9381/32/7/074001)Cited by:[§5\.2](https://arxiv.org/html/2606.23838#S5.SS2.SSS0.Px3.p2.5)\.
- \[2\]J\. Alsing, T\. Charnock, S\. Feeney, and B\. Wandelt\(2019\-07\)Fast likelihood\-free cosmology with neural density estimators and active learning\.Monthly Notices of the Royal Astronomical Society\.External Links:ISSN 1365\-2966,[Link](http://dx.doi.org/10.1093/mnras/stz1960),[Document](https://dx.doi.org/10.1093/mnras/stz1960)Cited by:[§E\.5](https://arxiv.org/html/2606.23838#A5.SS5.SSS0.Px1.p1.8),[§5\.2](https://arxiv.org/html/2606.23838#S5.SS2.SSS0.Px1.p4.8)\.
- \[3\]D\. Bartlett, H\. Desmond, and P\. Ferreira\(2023\-07\)Priors for symbolic regression\.InProceedings of the Companion Conference on Genetic and Evolutionary Computation,GECCO ’23 Companion,pp\. 2402–2411\.External Links:[Link](http://dx.doi.org/10.1145/3583133.3596327),[Document](https://dx.doi.org/10.1145/3583133.3596327)Cited by:[1st item](https://arxiv.org/html/2606.23838#A3.I1.i1.p1.1),[§2](https://arxiv.org/html/2606.23838#S2.SS0.SSS0.Px4.p1.1),[§4](https://arxiv.org/html/2606.23838#S4.SS0.SSS0.Px3.p1.8)\.
- \[4\]D\. J\. Bartlett, H\. Desmond, and P\. G\. Ferreira\(2024\-08\)Exhaustive symbolic regression\.IEEE Transactions on Evolutionary Computation28\(4\),pp\. 950–964\.External Links:ISSN 1941\-0026,[Link](http://dx.doi.org/10.1109/TEVC.2023.3280250),[Document](https://dx.doi.org/10.1109/tevc.2023.3280250)Cited by:[1st item](https://arxiv.org/html/2606.23838#A3.I1.i1.p1.1),[§2](https://arxiv.org/html/2606.23838#S2.SS0.SSS0.Px4.p1.1),[§4](https://arxiv.org/html/2606.23838#S4.SS0.SSS0.Px3.p1.8)\.
- \[5\]F\. Bernardeau, L\. van Waerbeke, and Y\. Mellier\(1997\)Weak lensing statistics as a probe ofΩ\\Omegaand power spectrum\.Astronomy & Astrophysics322,pp\. 1–18\.External Links:astro\-ph/9609122Cited by:[§5\.2](https://arxiv.org/html/2606.23838#S5.SS2.SSS0.Px4.p1.3)\.
- \[6\]C\. M\. Biwer, C\. D\. Capano, S\. De, M\. Cabero, D\. A\. Brown, A\. H\. Nitz, and V\. Raymond\(2019\)PyCBC Inference: A Python\-based parameter estimation toolkit for compact binary coalescence signals\.Publ\. Astron\. Soc\. Pac\.131,pp\. 024503\.External Links:[Document](https://dx.doi.org/10.1088/1538-3873/aaef0b),1807\.10312Cited by:[§E\.7](https://arxiv.org/html/2606.23838#A5.SS7.p1.5)\.
- \[7\]L\. Blanchet\(2014\)Gravitational radiation from post\-Newtonian sources and inspiralling compact binaries\.Living Rev\. Relativ\.17,pp\. 2\.External Links:[Document](https://dx.doi.org/10.12942/lrr-2014-2)Cited by:[§5\.2](https://arxiv.org/html/2606.23838#S5.SS2.SSS0.Px3.p2.5)\.
- \[8\]A\. Buonanno, B\. R\. Iyer, E\. Ochsner, Y\. Pan, and B\. S\. Sathyaprakash\(2009\)Comparison of post\-Newtonian templates for compact binary inspiral signals in gravitational\-wave detectors\.Phys\. Rev\. D80,pp\. 084043\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevD.80.084043)Cited by:[§5\.2](https://arxiv.org/html/2606.23838#S5.SS2.SSS0.Px3.p2.5)\.
- \[9\]B\. Burlacu, G\. Kronberger, and M\. Kommenda\(2020\)Operon c\+\+: an efficient genetic programming framework for symbolic regression\.InProceedings of the 2020 Genetic and Evolutionary Computation Conference Companion,GECCO ’20,New York, NY, USA,pp\. 1562–1570\.External Links:ISBN 9781450371278,[Link](https://doi.org/10.1145/3377929.3398099),[Document](https://dx.doi.org/10.1145/3377929.3398099)Cited by:[§C\.3](https://arxiv.org/html/2606.23838#A3.SS3.p1.3),[§2](https://arxiv.org/html/2606.23838#S2.SS0.SSS0.Px4.p1.1),[§4](https://arxiv.org/html/2606.23838#S4.SS0.SSS0.Px3.p1.8)\.
- \[10\]T\. Dacunha, M\. Raveri, M\. Park, C\. Doux, and B\. Jain\(2022\-03\)What does a cosmological experiment really measure? Covariant posterior decomposition with normalizing flows\.Physical Review D105\(6\),pp\. 063529\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevD.105.063529),2112\.05737Cited by:[§2](https://arxiv.org/html/2606.23838#S2.SS0.SSS0.Px3.p1.1)\.
- \[11\]L\. Dinh, J\. Sohl\-Dickstein, and S\. Bengio\(2017\)Density estimation using real NVP\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=HkpbnH9lx)Cited by:[§C\.5](https://arxiv.org/html/2606.23838#A3.SS5.p1.1)\.
- \[12\]N\. Evangelou, N\. J\. Wichrowski, G\. A\. Kevrekidis, F\. Dietrich, M\. Kooshkbaghi, S\. McFann, and I\. G\. Kevrekidis\(2021\-10\)On the Parameter Combinations That Matter and on Those That do Not\.arXiv e\-prints,pp\. arXiv:2110\.06717\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2110.06717),2110\.06717Cited by:[§2](https://arxiv.org/html/2606.23838#S2.SS0.SSS0.Px2.p1.1)\.
- \[13\]P\. Frank, R\. Leike, and T\. A\. Enßlin\(2021\-07\)Geometric Variational Inference\.Entropy23\(7\),pp\. 853\.External Links:[Document](https://dx.doi.org/10.3390/e23070853),2105\.10470Cited by:[§2](https://arxiv.org/html/2606.23838#S2.SS0.SSS0.Px3.p1.1)\.
- \[14\]T\. Gneiting and A\. E\. Raftery\(2007\-03\)Strictly Proper Scoring Rules, Prediction, and Estimation\.Journal of the American Statistical Association102\(477\),pp\. 359–378\.Note:\_eprint: https://doi\.org/10\.1198/016214506000001437External Links:ISSN 0162\-1459,[Link](https://doi.org/10.1198/016214506000001437),[Document](https://dx.doi.org/10.1198/016214506000001437)Cited by:[§E\.3](https://arxiv.org/html/2606.23838#A5.SS3.p3.14)\.
- \[15\]R\. N\. Gutenkunst, J\. J\. Waterfall, F\. P\. Casey, K\. S\. Brown, C\. R\. Myers, and J\. P\. Sethna\(2007\-01\)Universally Sloppy Parameter Sensitivities in Systems Biology Models\.PLoS Computational Biology3\(10\),pp\. e189\.External Links:[Document](https://dx.doi.org/10.1371/journal.pcbi.0030189),q\-bio/0701039Cited by:[§2](https://arxiv.org/html/2606.23838#S2.SS0.SSS0.Px1.p1.1)\.
- \[16\]H\. W\. Hethcote\(2000\)The mathematics of infectious diseases\.SIAM Review42\(4\),pp\. 599–653\.External Links:[Document](https://dx.doi.org/10.1137/S0036144500371907)Cited by:[§E\.3](https://arxiv.org/html/2606.23838#A5.SS3.p1.4),[§5\.2](https://arxiv.org/html/2606.23838#S5.SS2.SSS0.Px2.p1.12)\.
- \[17\]M\. Ho, D\. Bartlett, N\. Chartier, C\. Cuesta\-Lazaro, S\. Ding, A\. Lapel, P\. Lemos, C\. C\. Lovell, T\. L\. Makinen, C\. Modi, V\. Pandya, S\. Pandey, L\. Perez\-Granado, H\. Shao, G\. Valogiannis, and B\. Wandelt\(2024\)LtU\-ili: an all\-in\-one framework for implicit inference in astrophysics and cosmology\.arXiv:2402\.05137\.Cited by:[§E\.5](https://arxiv.org/html/2606.23838#A5.SS5.SSS0.Px6.p1.11)\.
- \[18\]A\. S\. Householder\(1958\-10\)Unitary Triangularization of a Nonsymmetric Matrix\.J\. ACM5\(4\),pp\. 339–342\.External Links:ISSN 0004\-5411,[Link](https://dl.acm.org/doi/10.1145/320941.320947),[Document](https://dx.doi.org/10.1145/320941.320947)Cited by:[Appendix B](https://arxiv.org/html/2606.23838#A2.p2.3)\.
- \[19\]S\. Husa, S\. Khan, M\. Hannam, M\. Pürrer, F\. Ohme, X\. Jiménez Forteza, and A\. Bohé\(2016\)Frequency\-domain gravitational waves from nonprecessing black\-hole binaries\. I\. New numerical waveforms and anatomy of the signal\.Phys\. Rev\. D93,pp\. 044006\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevD.93.044006),1508\.07250Cited by:[§E\.7](https://arxiv.org/html/2606.23838#A5.SS7.p1.5),[§5\.2](https://arxiv.org/html/2606.23838#S5.SS2.SSS0.Px3.p4.1)\.
- \[20\]F\. P\. Incropera, D\. P\. DeWitt, T\. L\. Bergman, and A\. S\. Lavine\(2011\)Fundamentals of heat and mass transfer\.7 edition,Wiley,Hoboken, NJ\.Note:Lumped\-capacitance method, Ch\. 5; canonical derivation of the first\-order thermal step response used in our simulator\.Cited by:[§E\.5](https://arxiv.org/html/2606.23838#A5.SS5.p2.1),[§5\.2](https://arxiv.org/html/2606.23838#S5.SS2.SSS0.Px1.p1.1)\.
- \[21\]B\. Jain and U\. Seljak\(1997\)Cosmological model predictions for weak lensing: linear and nonlinear regimes\.The Astrophysical Journal484\(2\),pp\. 560–573\.External Links:[Document](https://dx.doi.org/10.1086/304372),astro\-ph/9611077Cited by:[§5\.2](https://arxiv.org/html/2606.23838#S5.SS2.SSS0.Px4.p1.3)\.
- \[22\]W\. Kabsch\(1976\-09\)A solution for the best rotation to relate two sets of vectors\.Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography32\(5\),pp\. 922–923\(en\)\.Note:Publisher: International Union of CrystallographyExternal Links:ISSN 0567\-7394,[Link](https://journals.iucr.org/a/issues/1976/05/00/a12999/),[Document](https://dx.doi.org/10.1107/S0567739476001873)Cited by:[Appendix B](https://arxiv.org/html/2606.23838#A2.p4.5)\.
- \[23\]W\. O\. Kermack and A\. G\. McKendrick\(1927\)A contribution to the mathematical theory of epidemics\.Proceedings of the Royal Society of London\. Series A115\(772\),pp\. 700–721\.External Links:[Document](https://dx.doi.org/10.1098/rspa.1927.0118)Cited by:[§E\.3](https://arxiv.org/html/2606.23838#A5.SS3.p1.4),[§5\.2](https://arxiv.org/html/2606.23838#S5.SS2.SSS0.Px2.p1.12)\.
- \[24\]S\. Khan, S\. Husa, M\. Hannam, F\. Ohme, M\. Pürrer, X\. Jiménez Forteza, and A\. Bohé\(2016\)Frequency\-domain gravitational waves from nonprecessing black\-hole binaries\. II\. A phenomenological model for the advanced detector era\.Phys\. Rev\. D93,pp\. 044007\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevD.93.044007),1508\.07253Cited by:[§E\.7](https://arxiv.org/html/2606.23838#A5.SS7.p1.5),[§5\.2](https://arxiv.org/html/2606.23838#S5.SS2.SSS0.Px3.p4.1)\.
- \[25\]M\. Kilbinger\(2015\)Cosmology with cosmic shear observations: a review\.Reports on Progress in Physics78\(8\),pp\. 086901\.External Links:[Document](https://dx.doi.org/10.1088/0034-4885/78/8/086901),1411\.0115Cited by:[§5\.2](https://arxiv.org/html/2606.23838#S5.SS2.SSS0.Px4.p1.3)\.
- \[26\]G\. Kronberger, B\. Burlacu, M\. Kommenda, S\. M\. Winkler, and M\. Affenzeller\(2024\)Symbolic regression\.Chapman & Hall / CRC Press\.Cited by:[§2](https://arxiv.org/html/2606.23838#S2.SS0.SSS0.Px4.p1.1),[§4](https://arxiv.org/html/2606.23838#S4.SS0.SSS0.Px3.p1.8)\.
- \[27\]J\. Lee\(2018\)Introduction to Riemannian manifolds\.Springer Berlin Heidelberg,New York, NY\.External Links:ISBN 978\-3\-319\-91754\-2Cited by:[§3](https://arxiv.org/html/2606.23838#S3.p3.1)\.
- \[28\]P\. Lemos, A\. Coogan, Y\. Hezaveh, and L\. Perreault\-Levasseur\(2023\)Sampling\-based accuracy testing of posterior estimators for general inference\.External Links:2302\.03026,[Link](https://arxiv.org/abs/2302.03026)Cited by:[§5\.2](https://arxiv.org/html/2606.23838#S5.SS2.SSS0.Px3.p5.3)\.
- \[29\]Y\. Li, L\. Lu, C\. Modi, D\. Jamieson, Y\. Zhang, Y\. Feng, W\. Zhou, N\. P\. Kwan, F\. Lanusse, and L\. Greengard\(2022\)Pmwd: A Differentiable Cosmological Particle\-Mesh N\-body Library\.Note:\_eprint: 2211\.09958Cited by:[§E\.4](https://arxiv.org/html/2606.23838#A5.SS4.p1.4)\.
- \[30\]T\. L\. Makinen, J\. Alsing, and B\. D\. Wandelt\(2023\-08\)Fishnets: scalable neural compression with optimal aggregation\.arxiv2023\.Cited by:[§4](https://arxiv.org/html/2606.23838#S4.SS0.SSS0.Px1.p1.8)\.
- \[31\]T\. L\. Makinen, A\. Heavens, N\. Porqueres, T\. Charnock, A\. Lapel, and B\. D\. Wandelt\(2025\-01\)Hybrid summary statistics: neural weak lensing inference beyond the power spectrum\.Journal of Cosmology and Astroparticle Physics2025\(01\),pp\. 095\.External Links:ISSN 1475\-7516,[Link](http://dx.doi.org/10.1088/1475-7516/2025/01/095),[Document](https://dx.doi.org/10.1088/1475-7516/2025/01/095)Cited by:[§E\.4](https://arxiv.org/html/2606.23838#A5.SS4.p1.4)\.
- \[32\]A\. Meurer, C\. P\. Smith, M\. Paprocki, O\. Čertík, S\. B\. Kirpichev, M\. Rocklin, A\. Kumar, S\. Ivanov, J\. K\. Moore, S\. Singh, T\. Rathnayake, S\. Vig, B\. E\. Granger, R\. P\. Muller, F\. Bonazzi, H\. Gupta, S\. Vats, F\. Johansson, F\. Pedregosa, M\. J\. Curry, A\. R\. Terrel, Š\. Roučka, A\. Saboo, I\. Fernando, S\. Kulal, R\. Cimrman, and A\. Scopatz\(2017\-01\)SymPy: symbolic computing in python\.PeerJ Computer Science3,pp\. e103\.External Links:ISSN 2376\-5992,[Link](https://doi.org/10.7717/peerj-cs.103),[Document](https://dx.doi.org/10.7717/peerj-cs.103)Cited by:[§C\.3\.1](https://arxiv.org/html/2606.23838#A3.SS3.SSS1.p1.1)\.
- \[33\]F\. Nielsen\(2020\-10\)An Elementary Introduction to Information Geometry\.Entropy22\(10\),pp\. 1100\(en\)\.External Links:ISSN 1099\-4300,[Link](https://www.mdpi.com/1099-4300/22/10/1100),[Document](https://dx.doi.org/10.3390/e22101100)Cited by:[§3](https://arxiv.org/html/2606.23838#S3.p1.16)\.
- \[34\]G\. Papamakarios, E\. Nalisnick, D\. J\. Rezende, S\. Mohamed, and B\. Lakshminarayanan\(2021\)Normalizing flows for probabilistic modeling and inference\.Journal of Machine Learning Research22\(57\),pp\. 1–64\.Cited by:[§C\.5](https://arxiv.org/html/2606.23838#A3.SS5.p1.1),[Appendix E](https://arxiv.org/html/2606.23838#A5.SS0.SSS0.Px2.p1.13),[§E\.5](https://arxiv.org/html/2606.23838#A5.SS5.SSS0.Px1.p1.8),[§5\.2](https://arxiv.org/html/2606.23838#S5.SS2.SSS0.Px1.p4.8)\.
- \[35\]G\. Papamakarios, T\. Pavlakou, and I\. Murray\(2017\)Masked autoregressive flow for density estimation\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§E\.5](https://arxiv.org/html/2606.23838#A5.SS5.SSS0.Px6.p1.11)\.
- \[36\]E\. Peterfreund, O\. Lindenbaum, F\. Dietrich, T\. Bertalan, M\. Gavish, I\. G\. Kevrekidis, and R\. R\. Coifman\(2020\-12\)Local conformal autoencoder for standardized data coordinates\.Proceedings of the National Academy of Science117\(49\),pp\. 30918–30927\.External Links:[Document](https://dx.doi.org/10.1073/pnas.2014627117),2004\.07234Cited by:[§2](https://arxiv.org/html/2606.23838#S2.SS0.SSS0.Px2.p1.1)\.
- \[37\]C\. R\. Rao\(1945\)Information and the accuracy attainable in the estimation of statistical parameters\.Bulletin of the Calcutta Mathematical Society37,pp\. 81–91\.Cited by:[§3](https://arxiv.org/html/2606.23838#S3.p1.11),[§3](https://arxiv.org/html/2606.23838#S3.p1.16)\.
- \[38\]H\. H\. Rosenbrock\(1960\-01\)An Automatic Method for Finding the Greatest or Least Value of a Function\.The Computer Journal3\(3\),pp\. 175–184\.Note:\_eprint: https://academic\.oup\.com/comjnl/article\-pdf/3/3/175/988633/030175\.pdfExternal Links:ISSN 0010\-4620,[Link](https://doi.org/10.1093/comjnl/3.3.175),[Document](https://dx.doi.org/10.1093/comjnl/3.3.175)Cited by:[§E\.1](https://arxiv.org/html/2606.23838#A5.SS1.p2.3)\.
- \[39\]C\. J\. Stone\(1980\)Optimal rates of convergence for nonparametric estimators\.The Annals of Statistics8\(6\),pp\. 1348–1360\.External Links:[Document](https://dx.doi.org/10.1214/aos/1176345206)Cited by:[§E\.5](https://arxiv.org/html/2606.23838#A5.SS5.SSS0.Px1.p1.5),[§5\.2](https://arxiv.org/html/2606.23838#S5.SS2.SSS0.Px1.p2.9)\.
- \[40\]M\. K\. Transtrum, B\. B\. Machta, K\. S\. Brown, B\. C\. Daniels, C\. R\. Myers, and J\. P\. Sethna\(2015\-07\)Perspective: Sloppiness and emergent theories in physics, biology, and beyond\.J\. Chem\. Phys\.143\(1\),pp\. 010901\.External Links:[Document](https://dx.doi.org/10.1063/1.4923066),1501\.07668Cited by:[§2](https://arxiv.org/html/2606.23838#S2.SS0.SSS0.Px1.p1.1)\.
- \[41\]M\. K\. Transtrum, B\. B\. Machta, and J\. P\. Sethna\(2011\-03\)Geometry of nonlinear least squares with applications to sloppy models and optimization\.Physical Review E83\(3\),pp\. 036701\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevE.83.036701),1010\.1449Cited by:[§2](https://arxiv.org/html/2606.23838#S2.SS0.SSS0.Px1.p1.1)\.
- \[42\]M\. K\. Transtrum and P\. Qiu\(2014\-08\)Model reduction by manifold boundaries\.Phys\. Rev\. Lett\.113,pp\. 098701\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevLett.113.098701),[Link](https://link.aps.org/doi/10.1103/PhysRevLett.113.098701)Cited by:[§2](https://arxiv.org/html/2606.23838#S2.SS0.SSS0.Px1.p1.1),[§6](https://arxiv.org/html/2606.23838#S6.SS0.SSS0.Px1.p1.2)\.

The appendices are organised as follows\.[Appendix˜A](https://arxiv.org/html/2606.23838#A1)proves that the Fishnet objective recovers the posterior mean and covariance, and motivates its use as a Fisher estimator\.[Appendix˜B](https://arxiv.org/html/2606.23838#A2)describes the alignment scheme used to fix the constant\-offset and orthogonal symmetries of the flattening loss when ensembling\.[Appendix˜C](https://arxiv.org/html/2606.23838#A3)details the Fishnet, flattener, and symbolic\- regression architectures and training hyperparameters\.[Appendix˜D](https://arxiv.org/html/2606.23838#A4)constructs geodesic normal coordinates for the Gaussian validation example used in[Section˜5\.1](https://arxiv.org/html/2606.23838#S5.SS1)\.[Appendix˜E](https://arxiv.org/html/2606.23838#A5)collects per\-experiment simulation, prior, and downstream\-inference details\.

## Appendix AForward\-KL Minimisation Yields the Posterior Moments

This appendix establishes that the Fishnet objective \([6](https://arxiv.org/html/2606.23838#S4.E6)\), when treated as a variational loss over a Gaussian family, is minimised by the posterior mean and posterior covariance ofθ\\thetagivenXX\. The identification with the Fisher information then follows in the standard Laplace / Bernstein–von Mises regime\.

##### Setup\.

LetXXbe observed data andθ∈ℝd\\theta\\in\\mathbb\{R\}^\{d\}parameters, with joint densityp​\(X,θ\)p\(X,\\theta\)and posteriorp​\(θ∣X\)p\(\\theta\\mid X\)\. Throughout we assume the posterior has finite first and second conditional moments

m​\(X\):=𝔼p​\(θ∣X\)​\[θ\],S​\(X\):=Covp​\(θ∣X\)​\[θ\]∈𝕊\+\+d,m\(X\):=\\mathbb\{E\}\_\{p\(\\theta\\mid X\)\}\[\\theta\],\\qquad S\(X\):=\\mathrm\{Cov\}\_\{p\(\\theta\\mid X\)\}\[\\theta\]\\in\\mathbb\{S\}\_\{\+\+\}^\{d\},\(12\)where𝕊\+\+d\\mathbb\{S\}\_\{\+\+\}^\{d\}denotes the cone ofd×dd\\times dsymmetric positive\-definite matrices\. The Fishnet model parametrises a Gaussian variational family

qφ​\(θ∣X\)=𝒩​\(θ;μφ​\(X\),Σφ​\(X\)\),Σφ​\(X\)∈𝕊\+\+d,q\_\{\\varphi\}\(\\theta\\mid X\)=\\mathcal\{N\}\\\!\\bigl\(\\theta;\\,\\mu\_\{\\varphi\}\(X\),\\,\\Sigma\_\{\\varphi\}\(X\)\\bigr\),\\quad\\Sigma\_\{\\varphi\}\(X\)\\in\\mathbb\{S\}\_\{\+\+\}^\{d\},\(13\)through the network outputsμφ​\(X\):=θ^​\(X\)\\mu\_\{\\varphi\}\(X\):=\\hat\{\\theta\}\(X\)andΣφ​\(X\):=𝐅φ​\(X\)−1\\Sigma\_\{\\varphi\}\(X\):=\\mathbf\{F\}\_\{\\varphi\}\(X\)^\{\-1\}, where𝐅φ\\mathbf\{F\}\_\{\\varphi\}is the log\-Cholesky\-parametrised matrix returned by the network\. Writingℒ\\mathcal\{L\}for the loss \([6](https://arxiv.org/html/2606.23838#S4.E6)\), the per\-sample loss is

ℓ​\(φ;X,θ\)=12​\(θ−μφ​\(X\)\)⊤​𝐅φ​\(X\)​\(θ−μφ​\(X\)\)−12​log​det𝐅φ​\(X\),\\ell\(\\varphi;X,\\theta\)=\\tfrac\{1\}\{2\}\\bigl\(\\theta\-\\mu\_\{\\varphi\}\(X\)\\bigr\)^\{\\\!\\top\}\\mathbf\{F\}\_\{\\varphi\}\(X\)\\bigl\(\\theta\-\\mu\_\{\\varphi\}\(X\)\\bigr\)\-\\tfrac\{1\}\{2\}\\log\\det\\mathbf\{F\}\_\{\\varphi\}\(X\),\(14\)and the population loss is𝔼p​\(X,θ\)​\[ℓ​\(φ;X,θ\)\]\\mathbb\{E\}\_\{p\(X,\\theta\)\}\[\\ell\(\\varphi;X,\\theta\)\]\.

### A\.1Main result

###### Proposition 1\(Forward\-KL Gaussian moment matching\)\.

FixXX\. Within the Gaussian family \([13](https://arxiv.org/html/2606.23838#A1.E13)\), the conditional forward KL divergenceDKL\(p\(θ∣X\)∥qφ\(θ∣X\)\)D\_\{\\mathrm\{KL\}\}\\\!\\bigl\(p\(\\theta\\mid X\)\\,\\big\\\|\\,q\_\{\\varphi\}\(\\theta\\mid X\)\\bigr\)admits a unique minimum over\(μφ,𝐅φ\)∈ℝd×𝕊\+\+d\(\\mu\_\{\\varphi\},\\,\\mathbf\{F\}\_\{\\varphi\}\)\\in\\mathbb\{R\}^\{d\}\\times\\mathbb\{S\}\_\{\+\+\}^\{d\}at

μφ⋆​\(X\)=m​\(X\),Σφ⋆​\(X\)=𝐅φ⋆​\(X\)−1=S​\(X\),\\mu\_\{\\varphi\}^\{\\star\}\(X\)=m\(X\),\\qquad\\Sigma\_\{\\varphi\}^\{\\star\}\(X\)=\\mathbf\{F\}\_\{\\varphi\}^\{\\star\}\(X\)^\{\-1\}=S\(X\),\(15\)the posterior mean and posterior covariance ofθ\\thetagivenXX\.

###### Proof\.

Direct expansion of the Gaussian density in \([13](https://arxiv.org/html/2606.23838#A1.E13)\) gives, for any\(μφ,𝐅φ\)\(\\mu\_\{\\varphi\},\\mathbf\{F\}\_\{\\varphi\}\),

𝔼p​\(θ∣X\)\[ℓ\(φ;X,θ\)\]=DKL\(p\(θ∣X\)∥qφ\(θ∣X\)\)\+H\(p\(θ∣X\)\)−d2log\(2π\),\\mathbb\{E\}\_\{p\(\\theta\\mid X\)\}\\bigl\[\\ell\(\\varphi;X,\\theta\)\\bigr\]=D\_\{\\mathrm\{KL\}\}\\\!\\bigl\(p\(\\theta\\mid X\)\\,\\big\\\|\\,q\_\{\\varphi\}\(\\theta\\mid X\)\\bigr\)\+H\\\!\\bigl\(p\(\\theta\\mid X\)\\bigr\)\-\\tfrac\{d\}\{2\}\\log\(2\\pi\),\(16\)whereH​\(⋅\)H\(\\cdot\)denotes differential entropy\. Both correction terms are independent ofφ\\varphi, so minimising \([14](https://arxiv.org/html/2606.23838#A1.E14)\) in expectation and minimising the conditional KL are equivalent\.

Decomposeθ−μφ=\(θ−m\)\+\(m−μφ\)\\theta\-\\mu\_\{\\varphi\}=\(\\theta\-m\)\+\(m\-\\mu\_\{\\varphi\}\)in the quadratic form\. The cross\-term vanishes because𝔼p​\[θ−m\]=0\\mathbb\{E\}\_\{p\}\[\\theta\-m\]=0, and the trace identity𝔼p​\[\(θ−m\)⊤​A​\(θ−m\)\]=tr​\(A​S\)\\mathbb\{E\}\_\{p\}\\\!\\left\[\(\\theta\-m\)^\{\\\!\\top\}\\\!A\(\\theta\-m\)\\right\]=\\mathrm\{tr\}\(AS\), valid for any symmetricAA, reduces \([14](https://arxiv.org/html/2606.23838#A1.E14)\) to

𝔼p​\[ℓ​\(φ;X,θ\)\]=12​\(m−μφ\)⊤​𝐅φ​\(m−μφ\)⏟\(I\)≥0\+12​tr​\(𝐅φ​S\)−12​log​det𝐅φ⏟\(II\)\.\\mathbb\{E\}\_\{p\}\\bigl\[\\ell\(\\varphi;X,\\theta\)\\bigr\]=\\underbrace\{\\tfrac\{1\}\{2\}\(m\-\\mu\_\{\\varphi\}\)^\{\\\!\\top\}\\mathbf\{F\}\_\{\\varphi\}\(m\-\\mu\_\{\\varphi\}\)\}\_\{\\text\{\(I\)\}\\;\\geq 0\}\+\\underbrace\{\\tfrac\{1\}\{2\}\\mathrm\{tr\}\(\\mathbf\{F\}\_\{\\varphi\}S\)\-\\tfrac\{1\}\{2\}\\log\\det\\mathbf\{F\}\_\{\\varphi\}\}\_\{\\text\{\(II\)\}\}\.\(17\)Term \(II\) does not depend onμφ\\mu\_\{\\varphi\}, and term \(I\) is a non\-negative quadratic form that vanishes if and only ifμφ=m\\mu\_\{\\varphi\}=m\(since𝐅φ≻0\\mathbf\{F\}\_\{\\varphi\}\\succ 0\)\. For any𝐅φ∈𝕊\+\+d\\mathbf\{F\}\_\{\\varphi\}\\in\\mathbb\{S\}\_\{\+\+\}^\{d\}, replacingμφ\\mu\_\{\\varphi\}bymmtherefore strictly decreases \([17](https://arxiv.org/html/2606.23838#A1.E17)\) unlessμφ=m\\mu\_\{\\varphi\}=malready; any minimiser must satisfyμφ⋆=m\\mu\_\{\\varphi\}^\{\\star\}=m\.

Substitutingμφ=m\\mu\_\{\\varphi\}=mleaves the precision objective

g​\(𝐅\):=12​tr​\(𝐅​S\)−12​log​det𝐅,𝐅∈𝕊\+\+d\.g\(\\mathbf\{F\}\):=\\tfrac\{1\}\{2\}\\mathrm\{tr\}\(\\mathbf\{F\}S\)\-\\tfrac\{1\}\{2\}\\log\\det\\mathbf\{F\},\\qquad\\mathbf\{F\}\\in\\mathbb\{S\}\_\{\+\+\}^\{d\}\.\(18\)The map𝐅↦−log​det𝐅\\mathbf\{F\}\\mapsto\-\\log\\det\\mathbf\{F\}is strictly convex on𝕊\+\+d\\mathbb\{S\}\_\{\+\+\}^\{d\}and𝐅↦tr​\(𝐅​S\)\\mathbf\{F\}\\mapsto\\mathrm\{tr\}\(\\mathbf\{F\}S\)is linear, soggis strictly convex on the \(convex\) cone𝕊\+\+d\\mathbb\{S\}\_\{\+\+\}^\{d\}\. Standard matrix calculus gives∇𝐅g​\(𝐅\)=12​\(S−𝐅−1\)\\nabla\_\{\\mathbf\{F\}\}\\,g\(\\mathbf\{F\}\)=\\tfrac\{1\}\{2\}\\bigl\(S\-\\mathbf\{F\}^\{\-1\}\\bigr\), which vanishes uniquely at𝐅⋆=S−1\\mathbf\{F\}^\{\\star\}=S^\{\-1\}, henceΣφ⋆=S\\Sigma\_\{\\varphi\}^\{\\star\}=S\. The pair\(μφ⋆,𝐅φ⋆\)=\(m,S−1\)\(\\mu\_\{\\varphi\}^\{\\star\},\\mathbf\{F\}\_\{\\varphi\}^\{\\star\}\)=\(m,S^\{\-1\}\)is the unique global minimiser of \([17](https://arxiv.org/html/2606.23838#A1.E17)\), and by \([16](https://arxiv.org/html/2606.23838#A1.E16)\) also of the conditional KL\. ∎

###### Corollary 1\(Fishnets recover the posterior moments\)\.

Suppose the variational class\{\(μφ,𝐅φ\):φ∈Φ\}\\\{\(\\mu\_\{\\varphi\},\\,\\mathbf\{F\}\_\{\\varphi\}\):\\varphi\\in\\Phi\\\}is sufficiently expressive to represent the conditional moment mapsX↦m​\(X\)X\\mapsto m\(X\)andX↦S​\(X\)−1X\\mapsto S\(X\)^\{\-1\}, and that the training distribution coversp​\(X\)p\(X\)\. Then any global minimiser of the population Fishnet loss𝔼p​\(X,θ\)​\[ℓ​\(φ;X,θ\)\]\\mathbb\{E\}\_\{p\(X,\\theta\)\}\[\\ell\(\\varphi;X,\\theta\)\]satisfies, forp​\(X\)p\(X\)\-almost everyXX,

θ^​\(X\)=𝔼p​\(θ∣X\)​\[θ\],𝐅φ​\(X\)=Covp​\(θ∣X\)​\[θ\]−1\.\\hat\{\\theta\}\(X\)=\\mathbb\{E\}\_\{p\(\\theta\\mid X\)\}\[\\theta\],\\qquad\\mathbf\{F\}\_\{\\varphi\}\(X\)=\\mathrm\{Cov\}\_\{p\(\\theta\\mid X\)\}\[\\theta\]^\{\-1\}\.\(19\)

###### Proof\.

By the tower property,𝔼p​\(X,θ\)​\[ℓ\]=𝔼p​\(X\)​\[𝔼p​\(θ∣X\)​\[ℓ\]\]\\mathbb\{E\}\_\{p\(X,\\theta\)\}\[\\ell\]=\\mathbb\{E\}\_\{p\(X\)\}\\\!\\bigl\[\\mathbb\{E\}\_\{p\(\\theta\\mid X\)\}\[\\ell\]\\bigr\]\. Applying Proposition[1](https://arxiv.org/html/2606.23838#Thmproposition1)pointwise inXXand using the identity \([16](https://arxiv.org/html/2606.23838#A1.E16)\), the conditional minimum is attained at \([15](https://arxiv.org/html/2606.23838#A1.E15)\); expressivity of the variational class allows this minimum to be realised simultaneously forp​\(X\)p\(X\)\-almost everyXX\. ∎

## Appendix BObtaining a unique solution in flattened coordinates

The loss function we minimise \([Eq\.˜6](https://arxiv.org/html/2606.23838#S4.E6)\) has two important symmetries:

1. 1\.L​\(𝜼\)L\(\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}\)is invariant ifηa→ηa\+ca\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}^\{a\}\\to\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}^\{a\}\+c^\{a\}, for some constant vectorcac^\{a\}\(this is trivial from the invariance ofFa​bF\_\{ab\}to this\)\.
2. 2\.L​\(𝜼\)L\(\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}\)is invariant under global orthogonal transformations\. To prove this, consider new coordinates\{ϕ~a\}\\\{\\tilde\{\\phi\}^\{a\}\\\}such thatηa=Ra​ϕ~bb\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}^\{a\}=R^\{a\}\{\}\_\{b\}\\tilde\{\\phi\}^\{b\}where𝗥𝗥T=𝟭\\bm\{\\mathsf\{R\}\}\\bm\{\\mathsf\{R\}\}^\{\\rm T\}=\\bm\{\\mathsf\{1\}\}\. The new Fisher is 𝗙~​\(ϕ\)=𝗥T​𝗙~​\(θ\)​𝗥\.\\tilde\{\\bm\{\\mathsf\{F\}\}\}\(\\phi\)=\\bm\{\\mathsf\{R\}\}^\{\\rm T\}\\tilde\{\\bm\{\\mathsf\{F\}\}\}\(\\theta\)\\bm\{\\mathsf\{R\}\}\.\(20\)Using the cyclic property of the trace, one sees thatL​\(ϕ~\)=L​\(η\)L\(\\tilde\{\\phi\}\)=L\(\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}\)\.

Optimising the Frobenius norm loss does not, therefore, yield a unique set of parameters, but one that is known only up to a constant offset and an orthogonal transformation\. This is particularly problematic when we train an ensemble of networks, since each member of the ensemble will have a different offset and orthogonal transformation relative to some fiducial network\.

To deal with this, we employ a two\-step rotation scheme to align ensemble members\{𝜼\}\\\{\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}\\\}\. First, we choose a reference ensemble member and employ a Householder reflection\[[18](https://arxiv.org/html/2606.23838#bib.bib36)\]to rotate its learned coordinates at a fiducial valueη∗=η​\(θ∗\)\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}^\{\*\}=\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}\(\\theta^\{\*\}\)such thatη∗\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}^\{\*\}points along the first eigenvalue of the variance\-weighted Fisher matrix:

𝗙∗/δ​𝗙​v=λ​v,RH​η∗=λ1\.\\bm\{\\mathsf\{F\}\}^\{\*\}/\\delta\\bm\{\\mathsf\{F\}\}v=\\lambda v,\\quad R^\{H\}\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}^\{\*\}=\\lambda\_\{1\}\.\(21\)
The Householder reflection is given as

ref𝐧​\(𝐮\)=𝐮−2​𝐧𝐧⊤​𝐮𝐧⊤​𝐧\\text\{ref\}\_\{\\mathbf\{n\}\}\(\\mathbf\{u\}\)=\\mathbf\{u\}\-2\\frac\{\\mathbf\{n\}\\mathbf\{n\}^\{\\top\}\\mathbf\{u\}\}\{\\mathbf\{n\}^\{\\top\}\\mathbf\{n\}\}\(22\)Where𝐧\\mathbf\{n\}is the normal to the reflection plane and𝐧⊤​𝐧\\mathbf\{n\}^\{\\top\}\\mathbf\{n\}is the squared norm‖𝐧‖2\\\|\\mathbf\{n\}\\\|^\{2\}\. This can be cast as two successive reflections from𝐱^=𝐱‖𝐱‖\\hat\{\\mathbf\{x\}\}=\\frac\{\\mathbf\{x\}\}\{\\\|\\mathbf\{x\}\\\|\},𝐲^=𝐲‖𝐲‖\\hat\{\\mathbf\{y\}\}=\\frac\{\\mathbf\{y\}\}\{\\\|\\mathbf\{y\}\\\|\}, one across the hyperplane defined by the sum of the two vectors, and one across the target vector:

S\\displaystyle S=I−2​\(𝐲^\+𝐱^\)​\(𝐲^\+𝐱^\)⊤\(𝐲^\+𝐱^\)⊤​\(𝐲^\+𝐱^\)\\displaystyle=I\-2\\frac\{\(\\hat\{\\mathbf\{y\}\}\+\\hat\{\\mathbf\{x\}\}\)\(\\hat\{\\mathbf\{y\}\}\+\\hat\{\\mathbf\{x\}\}\)^\{\\top\}\}\{\(\\hat\{\\mathbf\{y\}\}\+\\hat\{\\mathbf\{x\}\}\)^\{\\top\}\(\\hat\{\\mathbf\{y\}\}\+\\hat\{\\mathbf\{x\}\}\)\}\(23\)R\\displaystyle R=S−2​𝐲^​𝐲^⊤​S𝐲^⊤​𝐲^,\\displaystyle=S\-2\\frac\{\\hat\{\\mathbf\{y\}\}\\hat\{\\mathbf\{y\}\}^\{\\top\}S\}\{\\hat\{\\mathbf\{y\}\}^\{\\top\}\\hat\{\\mathbf\{y\}\}\},\(24\)whereRRsatisfies

R​𝐱^=𝐲^\.R\\hat\{\\mathbf\{x\}\}=\\hat\{\\mathbf\{y\}\}\.\(25\)
Once this reference point is established, we employ a Kabsch rotation\[[22](https://arxiv.org/html/2606.23838#bib.bib33)\]to re\-align each ensemble member’s coordinates\. We wish to align each ensemble member’s output centroid\-subtracted coordinatesP=\[ηji−⟨ηi⟩\]∈ℝN×npP=\[\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}^\{i\}\_\{j\}\-\\langle\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}^\{i\}\\rangle\]\\in\\mathbb\{R\}^\{N\\times n\_\{p\}\}, whereNNis the number of samples andnpn\_\{p\}is the number of parameters, to the reference ensemble member’s centroid\-subtracted coordinatesP=\[ηj0−⟨η0⟩\]∈ℝN×npP=\[\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}^\{0\}\_\{j\}\-\\langle\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}^\{0\}\\rangle\]\\in\\mathbb\{R\}^\{N\\times n\_\{p\}\}\. The Kabsch algorithm finds an optimal rotationR∗R^\{\*\}which minimises

ℒkabsch=12​∑i=0N‖Q−R​P‖2\.\\mathcal\{L\}\_\{\\rm kabsch\}=\\frac\{1\}\{2\}\\sum\_\{i=0\}^\{N\}\|\|Q\-RP\|\|^\{2\}\.\(26\)The optimal rotation can be found from the Singular Value Decomposition \(SVD\) of the point cloud covariance matrix,C=PT​QC=P^\{T\}Q:

C=U​Σ​VT,R∗=V​B​UT,C=U\\Sigma V^\{T\},\\quad R^\{\*\}=VBU^\{T\},\(27\)whered=sign​\(detV​UT\)d=\{\\rm sign\}\(\\det VU^\{T\}\)andB=diag​\(1,1,d\)B=\{\\rm diag\}\(1,1,d\)fixes rotations to be right\-handed\.

##### Jacobian Procrustes alignment\.

Coordinate Kabsch matches ensemble members in the𝜼\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}\-space, but the loss \([8](https://arxiv.org/html/2606.23838#S4.E8)\) cares only about the Jacobian𝓙=∂𝜼/∂𝜽\\bm\{\\mathsf\{\\mathcal\{J\}\}\}=\\partial\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}/\\partial\\bm\{\\theta\}through the metric pullback\. As an option to users we also provide another alignment scheme to highlight nonlinear flattened components along directions of high variance in the network Jacobian\. We perform an orthogonal Procrustes fit performed directly on the batch of Jacobians: given source Jacobians\{𝗝b\(i\)\}b=1B\\\{\\bm\{\\mathsf\{\{J\}\}\}^\{\(i\)\}\_\{b\}\\\}\_\{b=1\}^\{B\}for ensemble memberiiand reference Jacobians\{𝗝b\(0\)\}b=1B\\\{\\bm\{\\mathsf\{\{J\}\}\}^\{\(0\)\}\_\{b\}\\\}\_\{b=1\}^\{B\}, the rotation acts on the𝜼\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}\-axis \(left index\) of𝗝\\bm\{\\mathsf\{\{J\}\}\},𝗝b′=𝗥​𝗝b\(i\)\\bm\{\\mathsf\{\{J\}\}\}^\{\\prime\}\_\{b\}=\\bm\{\\mathsf\{R\}\}\\,\\bm\{\\mathsf\{\{J\}\}\}^\{\(i\)\}\_\{b\}, and the alignment problem

𝗥∗=arg⁡min𝗥∈O​\(np\)​∑b=1Bwb​‖𝗥​𝗝b\(i\)−𝗝b\(0\)‖ℱ2\\bm\{\\mathsf\{R\}\}^\{\*\}\\;=\\;\\arg\\min\_\{\\bm\{\\mathsf\{R\}\}\\in O\(n\_\{p\}\)\}\\sum\_\{b=1\}^\{B\}w\_\{b\}\\,\\big\\\|\\bm\{\\mathsf\{R\}\}\\,\\bm\{\\mathsf\{\{J\}\}\}^\{\(i\)\}\_\{b\}\-\\bm\{\\mathsf\{\{J\}\}\}^\{\(0\)\}\_\{b\}\\big\\\|\_\{\\mathcal\{F\}\}^\{2\}\(28\)admits the closed\-form solution𝗥∗=𝗨𝗩T\\bm\{\\mathsf\{R\}\}^\{\*\}=\\bm\{\\mathsf\{U\}\}\\bm\{\\mathsf\{V\}\}^\{\\rm T\}, where𝗨𝗦𝗩T\\bm\{\\mathsf\{U\}\}\\bm\{\\mathsf\{S\}\}\\bm\{\\mathsf\{V\}\}^\{\\rm T\}is the SVD of the cross\-covariance𝗠=∑bwb​𝗝b\(0\)​\(𝗝b\(i\)\)⊤\\bm\{\\mathsf\{M\}\}=\\sum\_\{b\}w\_\{b\}\\,\\bm\{\\mathsf\{\{J\}\}\}^\{\(0\)\}\_\{b\}\\bigl\(\\bm\{\\mathsf\{\{J\}\}\}^\{\(i\)\}\_\{b\}\\bigr\)^\{\\\!\\top\}\. The sample weightswbw\_\{b\}optionally reflect the validation weighting from[Eq\.˜6](https://arxiv.org/html/2606.23838#S4.E6)\. Compared with coordinate Kabsch, this scheme uses every sample and unit\-magnitude derivatives instead of noisy coordinates, so it is markedly more stable on near\-degenerate axes whose𝜼\\bm\{\\eta\}\-displacements are dominated by ensemble noise\. We allow improper rotations \(det𝗥∗=−1\\det\\bm\{\\mathsf\{R\}\}^\{\*\}=\-1\) per member when this strictly reduces \([28](https://arxiv.org/html/2606.23838#A2.E28)\): the flatness loss‖𝗥​𝗙¯′​𝗥T−𝟭‖ℱ2\\big\\\|\\bm\{\\mathsf\{R\}\}\\bar\{\\bm\{\\mathsf\{F\}\}\}^\{\\prime\}\\bm\{\\mathsf\{R\}\}^\{\\rm T\}\-\\bm\{\\mathsf\{1\}\}\\big\\\|\_\{\\mathcal\{F\}\}^\{2\}is invariant under arbitrary𝗥∈O​\(np\)\\bm\{\\mathsf\{R\}\}\\in O\(n\_\{p\}\), so the only requirement is that all members land in a common frame, which the Procrustes solution achieves\.

##### Member\-independent canonical basis\.

After Procrustes alignment the ensemble shares a single orthogonal frame, but the choice of frame is still arbitrary\. We fix it once on the reference member by composing two operations\. First, a*nonlinearity\-separating rotation*takes the SVD of the sample\-centred Jacobian stackΔ​𝗝b=𝗝b\(0\)−𝗝¯\(0\)\\Delta\\bm\{\\mathsf\{\{J\}\}\}\_\{b\}=\\bm\{\\mathsf\{\{J\}\}\}^\{\(0\)\}\_\{b\}\-\\bar\{\\bm\{\\mathsf\{\{J\}\}\}\}^\{\(0\)\}\(with optional column rescaling by the prior widthsΔ​θa=θamax−θamin\\Delta\\theta\_\{a\}=\\theta\_\{a\}^\{\\max\}\-\\theta\_\{a\}^\{\\min\}to put all parameters on a dimensionless footing\): reshapingΔ​𝗝∈ℝB×np×np\\Delta\\bm\{\\mathsf\{\{J\}\}\}\\in\\mathbb\{R\}^\{B\\times n\_\{p\}\\times n\_\{p\}\}toΩ∈ℝnp×B​np\\Omega\\in\\mathbb\{R\}^\{n\_\{p\}\\times Bn\_\{p\}\}and writingΩ=𝗨​Σ​𝗩T\\Omega=\\bm\{\\mathsf\{U\}\}\{\\Sigma\}\\bm\{\\mathsf\{V\}\}^\{\\rm T\}, the rotation𝗥nl=𝗨T\\bm\{\\mathsf\{R\}\}\_\{\\rm nl\}=\\bm\{\\mathsf\{U\}\}^\{\\rm T\}orders the𝜼\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}\-axes by decreasing nonlinearity “energy” \(singular values ofΩ\\Omega\): axes whose Jacobian rows are sample\-independent \(i\.e\.ηa\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\eta\}\_\{a\}is a linear function of𝜽\\bm\{\\theta\}\) collect into the trailing rows\. Second, a sign and \(optionally\) permutation fix\-up𝗥canon\\bm\{\\mathsf\{R\}\}\_\{\\rm canon\}uses the eigenstructure of the mean prior\-normalised𝜽\\bm\{\\theta\}\-FisherF¯a​b/\(Δ​θa​Δ​θb\)\\bar\{F\}\_\{ab\}/\(\\Delta\\theta\_\{a\}\\,\\Delta\\theta\_\{b\}\)to flip per\-axis signs so that⟨\(𝗥nl​𝗝\)a,va⟩≥0\\langle\(\\bm\{\\mathsf\{R\}\}\_\{\\rm nl\}\\,\\bm\{\\mathsf\{\{J\}\}\}\)\_\{a\},v\_\{a\}\\rangle\\geq 0for the corresponding Fisher eigenvectorvav\_\{a\}\. Composing𝗥basis=𝗥canon​𝗥nl\\bm\{\\mathsf\{R\}\}\_\{\\rm basis\}=\\bm\{\\mathsf\{R\}\}\_\{\\rm canon\}\\bm\{\\mathsf\{R\}\}\_\{\\rm nl\}once on the reference and applying𝗥total\(i\)=𝗥basis​𝗥∗\(i\)\\bm\{\\mathsf\{R\}\}\_\{\\rm total\}^\{\(i\)\}=\\bm\{\\mathsf\{R\}\}\_\{\\rm basis\}\\,\\bm\{\\mathsf\{R\}\}^\{\*\\,\(i\)\}to every member guarantees the same axis ordering and orientation across the ensemble, which keeps the dispersion of𝜼\(i\)\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}^\{\(i\)\}tight on the near\-degenerate \(least\-nonlinear\) axes\.

## Appendix CArchitecture details

### C\.1Fisher Information Neural Networks

We train an ensemble of neural networks \(Fishnets\) to predict Fisher information matricesF​\(θ\)F\(\\theta\)directly from simulation datax\. Each network comprises a residual MLP followed by a Fisher prediction layer\. The Fisher matrices are parameterised via log\-Cholesky decomposition to ensure positive\-definiteness:

F=L​L⊤,Li​i=softplus​\(ℓi\)\+10−4,F=LL^\{\\top\},\\quad L\_\{ii\}=\\mathrm\{softplus\}\(\\ell\_\{i\}\)\+10^\{\-4\},\(29\)whereℓi\\ell\_\{i\}are learned diagonal parameters and off\-diagonal elements are unconstrained\.

#### C\.1\.1SoftSquelch activation

The Fisher prediction branch employs a learned sparse activation function:

SoftSquelch​\(x\)=x⋅σ​\(s​\(\|x\|−t\)\),\\mathrm\{SoftSquelch\}\(x\)=x\\cdot\\sigma\\bigl\(s\(\|x\|\-t\)\\bigr\),\(30\)whereσ\\sigmais the sigmoid function, and thresholdttand sharpnessssare hyperparameters\. This suppresses small activations whilst preserving large ones, promoting sparsity\.

#### C\.1\.2Ensemble hyperparameter priors

Ensemble diversity is achieved through randomised architectural choices, which contribute to pushing the sampling distribution of learned Fisher matrix entries and downstream flattened coordinates towards Gaussian\. Hidden layer widths are sampled uniformly from a user\-specified range \(default\[10,300\)\[10,300\)neurons\), and activation functions are drawn from a discrete distribution:

p​\(activation\)=\{3/23ReLU4/23Leaky ReLU3/23Swish2/23Mish3/23smooth\_leaky8/23GELU\\displaystyle p\(\\mathrm\{activation\}\)=\\begin\{cases\}3/23&\\text\{ReLU\}\\\\ 4/23&\\text\{Leaky ReLU\}\\\\ 3/23&\\text\{Swish\}\\\\ 2/23&\\text\{Mish\}\\\\ 3/23&\\text\{smooth\\\_leaky\}\\\\ 8/23&\\text\{GELU\}\\end\{cases\}\(31\)where Mish is defined asMish​\(x\)=x​tanh⁡\(softplus​\(x\)\)\\mathrm\{Mish\}\(x\)=x\\tanh\(\\mathrm\{softplus\}\(x\)\), and smooth\_leaky is defined as:

smooth​\_​leaky​\(x\)=α​x\+\(1−α\)​x​log⁡σ​\(x\),α=0\.2,\\mathrm\{smooth\\\_leaky\}\(x\)=\\alpha x\+\(1\-\\alpha\)x\\log\\sigma\(x\),\\quad\\alpha=0\.2,\(32\)withlog⁡σ​\(x\)=−log⁡\(1\+e−x\)\\log\\sigma\(x\)=\-\\log\(1\+e^\{\-x\}\)ensuring no discontinuities in derivatives\. The SoftSquelch hyperparameters are sampled from independent Gaussians:

s\\displaystyle s∼𝒩​\(5\.0,0\.72\),\\displaystyle\\sim\\mathcal\{N\}\(5\.0,0\.7^\{2\}\),\(33\)t\\displaystyle t∼𝒩​\(1\.0,0\.72\),\\displaystyle\\sim\\mathcal\{N\}\(1\.0,0\.7^\{2\}\),\(34\)wheresscontrols the sharpness of the transition andttsets the suppression threshold\.

#### C\.1\.3Training objective

Networks minimise the Kullback–Leibler divergence between predicted and true parameter distributions:

ℒKL=12​𝔼​\[tr​\(F​\(θ^−θ\)​\(θ^−θ\)⊤\)−log​detF\],\\mathcal\{L\}\_\{\\mathrm\{KL\}\}=\\frac\{1\}\{2\}\\mathbb\{E\}\\Bigl\[\\mathrm\{tr\}\\bigl\(F\(\\hat\{\\theta\}\-\\theta\)\(\\hat\{\\theta\}\-\\theta\)^\{\\top\}\\bigr\)\-\\log\\det F\\Bigr\],\(35\)whereθ^\\hat\{\\theta\}denotes the predicted parameter estimate\.

### C\.2Geometric Flattening

We learn a coordinate transformationη=f​\(θ;w\)\\eta=f\(\\theta;w\)that renders the Fisher metric approximately Euclidean\. The network architecture is a residual MLP with inverse whitening:

η=W−1⋅MLP​\(θ\),\\eta=W^\{\-1\}\\cdot\\mathrm\{MLP\}\(\\theta\),\(36\)whereW−1=Fmean1/2W^\{\-1\}=F\_\{\\mathrm\{mean\}\}^\{1/2\}is a fixed inverse whitening matrix computed from the global mean Fisher\.

#### C\.2\.1Flattening loss

The transformation is learned by minimising deviations of the pullback metric from the identity:

ℒflat=‖J−⊤​F​J−1−I‖F\+‖\(J−⊤​F​J−1\)−1−I‖F,\\mathcal\{L\}\_\{\\mathrm\{flat\}\}=\\Bigl\\\|J^\{\-\\top\}FJ^\{\-1\}\-I\\Bigr\\\|\_\{F\}\+\\Bigl\\\|\(J^\{\-\\top\}FJ^\{\-1\}\)^\{\-1\}\-I\\Bigr\\\|\_\{F\},\(37\)whereJ=∂η/∂θJ=\\partial\\eta/\\partial\\thetais the Jacobian\. To speed up progress on hard points, eachℓ\\ellis rescaled by an adaptive factor

r​\(ℒ\)=λ​ℒℓ\+e−α​ℒ,α=α​\(λ,ε\),r\(\\mathcal\{L\}\)\\;=\\;\\frac\{\\lambda\\,\\mathcal\{L\}\}\{\\ell\+\\mathrm\{e\}^\{\-\\alpha\\mathcal\{L\}\}\},\\qquad\\alpha=\\alpha\(\\lambda,\\varepsilon\),and the training objective is the batch mean oflog⁡\(r​\(ℒ\)​ℒ\)\\log\\bigl\(r\(\\mathcal\{L\}\)\\,\\mathcal\{L\}\\bigr\)\. Largerλ\\lambdamakes the mapr​\(⋅\)r\(\\cdot\)more aggressive about emphasizing large\-residual samples relative to easy ones\.

#### C\.2\.2Whitening

The inverse whitening layer implicitly transforms the training objective\. Withη=W−1​ηraw\\eta=W^\{\-1\}\\eta\_\{\\mathrm\{raw\}\}andJ=W−1​JrawJ=W^\{\-1\}J\_\{\\mathrm\{raw\}\}, the loss becomes:

ℒflat=‖Jraw−⊤​\(W​F​W\)​Jraw−1−I‖F,\\mathcal\{L\}\_\{\\mathrm\{flat\}\}=\\Bigl\\\|J\_\{\\mathrm\{raw\}\}^\{\-\\top\}\(WFW\)J\_\{\\mathrm\{raw\}\}^\{\-1\}\-I\\Bigr\\\|\_\{F\},\(38\)effectively training on pre\-whitened Fishers without modifying the data\.

### C\.3Symbolic Coordinate Discovery

We apply symbolic regression \(operon,\[[9](https://arxiv.org/html/2606.23838#bib.bib77)\]\) to the flattened coordinatesη\\etato discover interpretable expressions\. Each componentηi​\(θ\)\\eta\_\{i\}\(\\theta\)is fitted independently using genetic programming with objectives balancing accuracy \(R2R^\{2\}\) and complexity \(equation length\)\. The Pareto frontier yields candidate expressions\.

##### SR input augmentation\.

The trained flattening ensemble defines a deterministic mapη¯​\(θ\)\\bar\{\\eta\}\(\\theta\)with attached uncertaintyδ​η​\(θ\)\\delta\\eta\(\\theta\)that is independent of the simulations originally used to fit the Fisher–flattener stack\. Rather than feeding only the \(small\) Fishnet validation set into SR, we draw a fresh dense sample from the prior support and evaluate\(η¯,δ​η\)\(\\bar\{\\eta\},\\delta\\eta\)on those points; the resulting triples are used as the Gaussian\-loss SR targets\. We find this materially improves SR convergence in every experiment we have tried \(cleaner Pareto fronts, more reliable recovery of the analytic coordinates on Rosenbrock\) and is essentially free since it requires no additional simulator calls; see[Appendix˜E](https://arxiv.org/html/2606.23838#A5)for the per\-experiment sample counts\.

#### C\.3\.1Model selection

We rank expressions by multiple criteria:

- •Minimum description length \(MDL\):Bayesian model comparison penalising complexity\[[3](https://arxiv.org/html/2606.23838#bib.bib31),[4](https://arxiv.org/html/2606.23838#bib.bib30)\]\.
- •Frobenius loss:Flattening quality‖Jsymb−⊤​F​Jsymb−1−I‖F\\\|J\_\{\\mathrm\{symb\}\}^\{\-\\top\}FJ\_\{\\mathrm\{symb\}\}^\{\-1\}\-I\\\|\_\{F\}on held\-out data\.
- •R2R^\{2\}score:Prediction accuracy on test set\.

The symbolic JacobianJsymbJ\_\{\\mathrm\{symb\}\}is computed via automatic differentiation with Sympy\[[32](https://arxiv.org/html/2606.23838#bib.bib87)\]\. Final coordinates are selected to minimise either the MDL or the Frobenius loss whilst maintaining interpretability\.

### C\.4Postprocessing Symbolic Expressions

Letℓflat​\(θ\)=1B​∑b=1B‖F~b​\(θ\)−I‖F\\ell\_\{\\mathrm\{flat\}\}\(\\theta\)=\\frac\{1\}\{B\}\\sum\_\{b=1\}^\{B\}\\bigl\\\|\\widetilde\{F\}\_\{b\}\(\\theta\)\-I\\bigr\\\|\_\{F\}be the mean \(over samplesbb\) Frobenius distance of the flattened FisherF~b\\widetilde\{F\}\_\{b\}from the identity, for linear coefficientsθ\\thetain the symbolic coordinates\. Fixℓref=ℓflat​\(θ0\)\\ell\_\{\\mathrm\{ref\}\}=\\ell\_\{\\mathrm\{flat\}\}\(\\theta\_\{0\}\)at the current parameterization\.

*Importance \(importance sampling / local sensitivity\):*for each scalar coefficientccwithc≠0c\\neq 0, use a one\-sided finite difference stepε\\varepsilonand set

wc=\|c\|​\|ℓflat​\(θ0\+ε​ec\)−ℓrefε\|,w\_\{c\}\\;=\\;\\bigl\|c\\bigr\|\\;\\left\|\\frac\{\\ell\_\{\\mathrm\{flat\}\}\(\\theta\_\{0\}\+\\varepsilon e\_\{c\}\)\-\\ell\_\{\\mathrm\{ref\}\}\}\{\\varepsilon\}\\right\|,whereece\_\{c\}perturbs only that coefficient\.

*Pruning:*sort coefficients by increasingwcw\_\{c\}\. Greedily try setting coefficients to zero \(optionally in batches\); accept a proposed sparsificationθ′\\theta^\{\\prime\}only if the*relative*flattening increase stays below a toleranceτ\\tau,

ℓflat​\(θ′\)−ℓrefℓref<τ\.\\frac\{\\ell\_\{\\mathrm\{flat\}\}\(\\theta^\{\\prime\}\)\-\\ell\_\{\\mathrm\{ref\}\}\}\{\\ell\_\{\\mathrm\{ref\}\}\}\\;<\\;\\tau\.\(39\)

### C\.5Invertible network mapping: Real NVP

We optionally parameterize the reparameterizationη​\(θ\)\\eta\(\\theta\)with a Real NVP normalizing flow\[[11](https://arxiv.org/html/2606.23838#bib.bib39)\], implemented in JAX/Flax as a stack of affine coupling layers\[[11](https://arxiv.org/html/2606.23838#bib.bib39),[34](https://arxiv.org/html/2606.23838#bib.bib40)\]\. Each layer fixes half the coordinates \(alternating even/odd dimensions across layers\) and applies an elementwise affine map to the remaining coordinates,

yi=xi⊙exp⁡\(s​\(xfix\)\)\+t​\(xfix\),y\_\{i\}=x\_\{i\}\\odot\\exp\\bigl\(s\(x\_\{\\mathrm\{fix\}\}\)\\bigr\)\+t\(x\_\{\\mathrm\{fix\}\}\),wheressandttare MLPs of one hidden layer with widthhidden\_dims; thess\-head uses a finaltanh\\tanhfor bounded logits, and scales useexp\\expby default so Jacobians stay tractable\. The forward pass accumulateslog⁡\|detJ\|\\log\|\\det J\|over layers; the inverse applies the couplings in reverse order\. For training,θ\\thetais min–max scaled to\[0,1\]\[0,1\], shifted by one \(so inputs lie in\[1,2\]\[1,2\]\), passed through the flow, and optionally left\-multiplied by a fixed inverse\-whitening matrix𝐖−1\\mathbf\{W\}^\{\-1\}onηraw\\eta\_\{\\mathrm\{raw\}\}\(WhitenedRealNVP\); the inverse reverses these steps in order\.

### C\.6Forward–backward MLP

As a lightweight alternative to the RealNVP flow, we pair the forward MLPη=fϕ​\(θ\)\\eta=f\_\{\\phi\}\(\\theta\)with a separate*reverse\-path*MLPθ^=gψ​\(η\)\\hat\{\\theta\}=g\_\{\\psi\}\(\\eta\)that approximates the inverse, and add a mean\-square cycle\-consistency penalty‖θ−gψ​\(fϕ​\(θ\)\)‖2\\\|\\theta\-g\_\{\\psi\}\(f\_\{\\phi\}\(\\theta\)\)\\\|^\{2\}to the flattening loss with weightλinv\\lambda\_\{\\rm inv\}, so that the total objective isℒflat\+λinv​𝔼θ​‖θ−gψ​\(fϕ​\(θ\)\)‖2\\mathcal\{L\}\_\{\\rm flat\}\+\\lambda\_\{\\rm inv\}\\,\\mathbb\{E\}\_\{\\theta\}\\\|\\theta\-g\_\{\\psi\}\(f\_\{\\phi\}\(\\theta\)\)\\\|^\{2\}; this softly enforces invertibility on the prior support without constraining the architecture as a normalising flow\.

## Appendix DGeodesic normal coordinates for Gaussian example

### D\.1Motivation

The aim of this paper is to construct coordinates which make the Fisher information as flat as possible\. In[Section˜5\.1](https://arxiv.org/html/2606.23838#S5.SS1), we considered the example of a Gaussian, parametrised by meanμ\\muand standard deviationσ\\sigma\. We will write these as coordinatesya=\(μ,σ\)y^\{a\}=\(\\mu,\\sigma\)\. The Fisher information will be our metric, which isFa​b=diag⁡\(1/σ2,2/σ2\)F\_\{ab\}=\\operatorname\{diag\}\(1/\\sigma^\{2\},2/\\sigma^\{2\}\), where we have ignored the multiplicative factor ofndn\_\{\\rm d\}for simplicity\.

In this appendix, we wish to find an analytic coordinate transformation to compare against the one obtained by our method\. As we will show, there does not exist a coordinate transformation which can make the Fisher equal to the identity everywhere, and thus we construct geodesic normal coordinates to find the coordinates which make the metric appear as flat as possible in the vicinity of some point\.

Suppose we were at some special pointppwith coordinatesy⋆a=\(μ⋆,σ⋆\)y\_\{\\star\}^\{a\}=\(\\mu\_\{\\star\},\\sigma\_\{\\star\}\)and wanted to make the metric flat in the vicinity of that point\. For simplicity, let us begin by rescalingμ\\muandσ\\sigmaby functions ofσ⋆\\sigma\_\{\\star\}to make the metric equal to the identity atpp\. We define

u≡μσ⋆,v≡2​σσ⋆,u\\equiv\\frac\{\\mu\}\{\\sigma\_\{\\star\}\},\\quad v\\equiv\\sqrt\{2\}\\frac\{\\sigma\}\{\\sigma\_\{\\star\}\},\(40\)so that the corresponding line element for a metricga​b=Fa​bg\_\{ab\}=F\_\{ab\}becomes

d​s2=\(σ⋆σ\)2​\(d​u2\+d​v2\)=2v2​\(d​u2\+d​v2\),\{\\rm d\}s^\{2\}=\\left\(\\frac\{\\sigma\_\{\\star\}\}\{\\sigma\}\\right\)^\{2\}\\left\(\{\\rm d\}u^\{2\}\+\{\\rm d\}v^\{2\}\\right\)=\\frac\{2\}\{v^\{2\}\}\\left\(\{\\rm d\}u^\{2\}\+\{\\rm d\}v^\{2\}\\right\),\(41\)i\.e\. the metric in these coordinates is

ga​b=2v2​δa​b\.g\_\{ab\}=\\frac\{2\}\{v^\{2\}\}\\delta\_\{ab\}\.\(42\)We will usexa=\(u,v\)x^\{a\}=\(u,v\)\. The coordinates of the pointppare then

u⋆=μ⋆σ⋆,v⋆=2\.u\_\{\\star\}=\\frac\{\\mu\_\{\\star\}\}\{\\sigma\_\{\\star\}\},\\quad v\_\{\\star\}=\\sqrt\{2\}\.\(43\)
Although the metric is Euclidean atppin these coordinates, we would see that as we moved away frompp, the leading order terms for the metric would be linear in our coordinates\. A better set of coordinates to use is geodesic normal coordinates, since these give quadratic corrections to the metric at leading order:

g~a​b=δa​b−13​Ra​c​b​d​\(x⋆\)​\(xc−x⋆c\)​\(xd−x⋆d\)\+𝒪​\(\|x−x⋆\|3\),\\tilde\{g\}\_\{ab\}=\\delta\_\{ab\}\-\\frac\{1\}\{3\}R\_\{acbd\}\(x\_\{\\star\}\)\\left\(x^\{c\}\-x\_\{\\star\}^\{c\}\\right\)\\left\(x^\{d\}\-x\_\{\\star\}^\{d\}\\right\)\+\\mathcal\{O\}\\left\(\\left\|x\-x\_\{\\star\}\\right\|^\{3\}\\right\),\(44\)whereRa​b​c​dR\_\{abcd\}is the Riemann tensor\. The aim of this appendix is to construct these coordinates for a Gaussian\. Of course, the flatness is only valid for some region of space\. This does, however, represent a good benchmark to see whether our ML method can provide something better over all space than this construction\.

### D\.2Series expansion for geodesic normal coordinates

Before explicitly constructing geodesic normal coordinates, let us first obtain a series expansion for the metric in these coordinates, so that we have something to verify our result against\.

Let us first construct the Christoffel symbols for this metric, given by

Γb​ca=12​ga​d​\(∂cgd​b\+∂bgd​c−∂dgb​c\)\.\\Gamma^\{a\}\_\{bc\}=\\frac\{1\}\{2\}g^\{ad\}\\left\(\\partial\_\{c\}g\_\{db\}\+\\partial\_\{b\}g\_\{dc\}\-\\partial\_\{d\}g\_\{bc\}\\right\)\.\(45\)For our metric, these are

Γa​b0=\(0−1v−1v0\),Γa​b1=\(1v00−1v\)\.\\Gamma^\{0\}\_\{ab\}=\\begin\{pmatrix\}0&\-\\frac\{1\}\{v\}\\\\ \-\\frac\{1\}\{v\}&0\\end\{pmatrix\},\\quad\\Gamma^\{1\}\_\{ab\}=\\begin\{pmatrix\}\\frac\{1\}\{v\}&0\\\\ 0&\-\\frac\{1\}\{v\}\\end\{pmatrix\}\.\(46\)Now we must compute the Riemann tensor

Rc​a​bd=∂aΓb​cd−∂bΓa​cd\+Γa​ed​Γb​ce−Γb​ed​Γa​ce\.R^\{d\}\_\{cab\}=\\partial\_\{a\}\\Gamma^\{d\}\_\{bc\}\-\\partial\_\{b\}\\Gamma^\{d\}\_\{ac\}\+\\Gamma^\{d\}\_\{ae\}\\Gamma^\{e\}\_\{bc\}\-\\Gamma^\{d\}\_\{be\}\\Gamma^\{e\}\_\{ac\}\.\(47\)Since we are in two dimensions, this has only one independent component, and is thus fully determined by the Ricci scalar,RR\. For our metric, we find that

which yields the non\-zero components of the Riemann tensor

R0110=R1001=−R1010=−R0101=1v2\.R\_\{0110\}=R\_\{1001\}=\-R\_\{1010\}=\-R\_\{0101\}=\\frac\{1\}\{v^\{2\}\}\.\(49\)Since the Riemann tensor is non\-zero, we cannot find global coordinates which flatten the metric\. Instead, we are now in a position to write down the metric expansion in geodesic normal coordinates using[Eq\.˜44](https://arxiv.org/html/2606.23838#A4.E44)\.

g~a​b≈𝟭\+13​v⋆2​\(\(v−v⋆\)2−\(u−u⋆\)​\(v−v⋆\)−\(u−u⋆\)​\(v−v⋆\)\(u−u⋆\)2\)\+𝒪​\(\|x−x⋆\|3\)\.\\tilde\{g\}\_\{ab\}\\approx\\bm\{\\mathsf\{1\}\}\+\\frac\{1\}\{3v\_\{\\star\}^\{2\}\}\\begin\{pmatrix\}\\left\(v\-v\_\{\\star\}\\right\)^\{2\}&\-\\left\(u\-u\_\{\\star\}\\right\)\\left\(v\-v\_\{\\star\}\\right\)\\\\ \-\\left\(u\-u\_\{\\star\}\\right\)\\left\(v\-v\_\{\\star\}\\right\)&\\left\(u\-u\_\{\\star\}\\right\)^\{2\}\\end\{pmatrix\}\+\\mathcal\{O\}\\left\(\\left\|x\-x\_\{\\star\}\\right\|^\{3\}\\right\)\.\(50\)

### D\.3Geodesic Equations

To construct geodesic normal coordinates, we first need to compute the geodesics on this manifold\. Let us usettas our affine parameter, and use a dot to denote a derivative with respect tott\. The geodesic equation is defined to be

d2​xad​t2\+Γb​ca​d​xbd​t​d​xcd​t=0,\\frac\{\{\\rm d\}^\{2\}x^\{a\}\}\{\{\\rm d\}t^\{2\}\}\+\\Gamma^\{a\}\_\{bc\}\\frac\{\{\\rm d\}x^\{b\}\}\{\{\\rm d\}t\}\\frac\{\{\\rm d\}x^\{c\}\}\{\{\\rm d\}t\}=0,\(51\)hence our equations are

u¨−2v​u˙​v˙=0,v¨\+u˙2v−v˙2v=0\.\\ddot\{u\}\-\\frac\{2\}\{v\}\\dot\{u\}\\dot\{v\}=0,\\qquad\\ddot\{v\}\+\\frac\{\\dot\{u\}^\{2\}\}\{v\}\-\\frac\{\\dot\{v\}^\{2\}\}\{v\}=0\.\(52\)
The equation foruuis easy to integrate once\. If we multiply by1/v21/v^\{2\}, we note that this can be written as

dd​t​\(u˙v2\)=0⟹u˙=A​v2,\\frac\{\{\\rm d\}\}\{\{\\rm d\}t\}\\left\(\\frac\{\\dot\{u\}\}\{v^\{2\}\}\\right\)=0\\quad\\implies\\quad\\dot\{u\}=Av^\{2\},\(53\)whereAAis a constant\. Similarly, we note that we can write the equation forvvas

v​dd​t​\(v˙v\)\+u˙2v=v​d2​log⁡vd​t2\+u˙2v=0\.v\\frac\{\{\\rm d\}\}\{\{\\rm d\}t\}\\left\(\\frac\{\\dot\{v\}\}\{v\}\\right\)\+\\frac\{\\dot\{u\}^\{2\}\}\{v\}=v\\frac\{\{\\rm d\}^\{2\}\\log v\}\{\{\\rm d\}t^\{2\}\}\+\\frac\{\\dot\{u\}^\{2\}\}\{v\}=0\.\(54\)We can then substitute in our result foru˙\\dot\{u\}to obtain an equation forvvonly\. This becomes easier to solve if we definew=log⁡vw=\\log v, since this is

w¨\+A2​e2​w=0,\\ddot\{w\}\+A^\{2\}e^\{2w\}=0,\(55\)which has the solution

w˙=±\(B−A2​e2​w\)12,\\dot\{w\}=\\pm\\left\(B\-A^\{2\}e^\{2w\}\\right\)^\{\\frac\{1\}\{2\}\},\(56\)whereBBis a constant\. This can then be integrated to obtain

±\(t−t0\)=−1B​tanh−1⁡\(1−A2B​e2​w\),\\pm\\left\(t\-t\_\{0\}\\right\)=\\frac\{\-1\}\{\\sqrt\{B\}\}\\tanh^\{\-1\}\\left\(\\sqrt\{1\-\\frac\{A^\{2\}\}\{B\}e^\{2w\}\}\\right\),\(57\)wheret0t\_\{0\}is an integration constant\. Rearranging \(squaring will remove the±\\pm\), one finds

v​\(t\)=B\|A\|​sech⁡\(B​\(t−t0\)\),v\(t\)=\\frac\{\\sqrt\{B\}\}\{\|A\|\}\\operatorname\{sech\}\\left\(\\sqrt\{B\}\\left\(t\-t\_\{0\}\\right\)\\right\),\(58\)where we take the positive square root, since we will always havev\>0v\>0for a Gaussian\. This can then be used to obtain an equation foru˙\\dot\{u\}

u˙=A​v2=BA​sech2⁡\(B​\(t−t0\)\),\\dot\{u\}=Av^\{2\}=\\frac\{B\}\{A\}\\operatorname\{sech\}^\{2\}\\left\(\\sqrt\{B\}\\left\(t\-t\_\{0\}\\right\)\\right\),\(59\)which gives

u​\(t\)=BA​tanh⁡\(B​\(t−t0\)\)\+C,u\(t\)=\\frac\{\\sqrt\{B\}\}\{A\}\\tanh\\left\(\\sqrt\{B\}\\left\(t\-t\_\{0\}\\right\)\\right\)\+C,\(60\)whereCCis again a constant\.

We will redefine our constants such thatβ=B\\beta=\\sqrt\{B\}, and we will assume that the geodesic goes through the point\(u⋆,v⋆\)\(u\_\{\\star\},v\_\{\\star\}\)att=0t=0\. This gives us

u​\(t\)\\displaystyle u\(t\)=u⋆\+v⋆​cosh⁡\(β​t0\)​\[tanh⁡\(β​\(t−t0\)\)\+tanh⁡\(β​t0\)\],\\displaystyle=u\_\{\\star\}\+v\_\{\\star\}\\cosh\\left\(\\beta t\_\{0\}\\right\)\\left\[\\tanh\\left\(\\beta\\left\(t\-t\_\{0\}\\right\)\\right\)\+\\tanh\(\\beta t\_\{0\}\)\\right\],\(61\)v​\(t\)\\displaystyle v\(t\)=v⋆​cosh⁡\(β​t0\)​sech⁡\(β​\(t−t0\)\)\.\\displaystyle=v\_\{\\star\}\\cosh\(\\beta t\_\{0\}\)\\operatorname\{sech\}\\left\(\\beta\\left\(t\-t\_\{0\}\\right\)\\right\)\.\(62\)It is useful to rewrite this with the identitytanh⁡x\+tanh⁡y=sech⁡x​sech⁡y​sinh⁡\(x\+y\)\\tanh x\+\\tanh y=\\operatorname\{sech\}x\\operatorname\{sech\}y\\sinh\(x\+y\), to give

u​\(t\)\\displaystyle u\(t\)=u⋆\+v⋆​sinh⁡\(β​t0\)​sech⁡\(β​\(t−t0\)\),\\displaystyle=u\_\{\\star\}\+v\_\{\\star\}\\sinh\\left\(\\beta t\_\{0\}\\right\)\\operatorname\{sech\}\\left\(\\beta\\left\(t\-t\_\{0\}\\right\)\\right\),\(63\)v​\(t\)\\displaystyle v\(t\)=v⋆​cosh⁡\(β​t0\)​sech⁡\(β​\(t−t0\)\)\.\\displaystyle=v\_\{\\star\}\\cosh\(\\beta t\_\{0\}\)\\operatorname\{sech\}\\left\(\\beta\\left\(t\-t\_\{0\}\\right\)\\right\)\.\(64\)

### D\.4Geodesic Normal Coordinates

Now we have found the geodesics which pass through the pointpp, we are in a position to construct geodesic normal coordinates about this point\. These are constructed as follows\. Suppose one has a vectorVaV^\{a\}which lives in the tangent space atpp, then one can parallel transport this to the pointXX, which one defines as the point on the geodesic with affine parametert=1t=1\(the exponential map\)\. In this case \(provided the metric is Euclidean atpp\) the geodesic normal coordinates ofXXareVaV^\{a\}\.

Hence, to construct the geodesic normal coordinates, we need to know the tangent vector of our geodesic at the pointpp\. If we defineVa=\(ξ,η\)V^\{a\}=\(\\xi,\\eta\), then these are

ξ≡u˙​\(0\)=β​v⋆​sech⁡\(β​t0\),η≡v˙​\(0\)=β​v⋆​tanh⁡\(β​t0\)\.\\xi\\equiv\\dot\{u\}\(0\)=\\beta v\_\{\\star\}\\operatorname\{sech\}\\left\(\\beta t\_\{0\}\\right\),\\quad\\eta\\equiv\\dot\{v\}\(0\)=\\beta v\_\{\\star\}\\tanh\\left\(\\beta t\_\{0\}\\right\)\.\(65\)We can therefore relate the remaining two free constants \(β\\betaandt0t\_\{0\}\) in our geodesic equation to the geodesic normal coordinatesξ\\xiandη\\eta\.

It is useful to use the identitycosh⁡\(x−y\)=cosh⁡x​cosh⁡y−sinh⁡x​sinh⁡y\\cosh\(x\-y\)=\\cosh x\\cosh y\-\\sinh x\\sinh yto split up thesech⁡\(β​\(t−t0\)\)\\operatorname\{sech\}\(\\beta\(t\-t\_\{0\}\)\)term into terms which are functions of onlyβ​t\\beta torβ​t0\\beta t\_\{0\}\. Doing this, and recalling the definition that\(u,v\)\(u,v\)is at the pointt=1t=1, we obtain the following expressions to relate\(u,v\)\(u,v\)to\(ξ,η\)\(\\xi,\\eta\)

u=u⋆\+v⋆​ξ​sinh⁡ββ​v⋆​cosh⁡β−η​sinh⁡β,v=v⋆​β​v⋆β​v⋆​cosh⁡β−η​sinh⁡β,u=u\_\{\\star\}\+v\_\{\\star\}\\frac\{\\xi\\sinh\\beta\}\{\\beta v\_\{\\star\}\\cosh\\beta\-\\eta\\sinh\\beta\},\\qquad v=v\_\{\\star\}\\frac\{\\beta v\_\{\\star\}\}\{\\beta v\_\{\\star\}\\cosh\\beta\-\\eta\\sinh\\beta\},\(66\)for

β=1v⋆​ξ2\+η2\.\\beta=\\frac\{1\}\{v\_\{\\star\}\}\\sqrt\{\\xi^\{2\}\+\\eta^\{2\}\}\.\(67\)
This gives expressions foruuandvvin terms of the geodesic normal coordinatesξ\\xiandη\\eta\. One can see by Taylor expansion, that the coordinatesξ=η=0\\xi=\\eta=0indeed corresponds to the pointxa=\(u⋆,v⋆\)x^\{a\}=\(u\_\{\\star\},v\_\{\\star\}\)\. For small perturbations, we find that

u−u⋆≈ξ\+𝒪​\(quadratic\),v−v⋆≈η\+𝒪​\(quadratic\),u\-u\_\{\\star\}\\approx\\xi\+\\mathcal\{O\}\(\{\\rm quadratic\}\),\\quad v\-v\_\{\\star\}\\approx\\eta\+\\mathcal\{O\}\(\{\\rm quadratic\}\),\(68\)which will be useful later\.

### D\.5Inverting the coordinate transformation

In the previous section we founduuandvvas a function ofξ\\xiandη\\eta\. In this section, we find the opposite:ξ\\xiandη\\etaas a function ofuuandvv\. We begin by takingu−u⋆u\-u\_\{\\star\}and dividing by our expression forvv, one can obtain the relation

μ−μ⋆σ=ξ​sinh⁡ββ\.\\frac\{\\mu\-\\mu\_\{\\star\}\}\{\\sigma\}=\\xi\\frac\{\\sinh\\beta\}\{\\beta\}\.\(69\)The left hand side of this expression is reminiscent of thezz\-score, or standardised Gaussian variable\. For smallβ\\beta,sinh⁡β/β≈1\\sinh\\beta/\\beta\\approx 1, and thusξ\\xiis approximately equal to this standardised variable when one is close to the fiducial point\. The rest of the derivation requires us to consider separate cases, depending on the values ofξ\\xiandβ\\beta, which we outline below before summarising the result\.

##### Case 1:ξ≠0\\xi\\neq 0,β≠0\\beta\\neq 0

If we rearrange our expression forvvand assumeξ≠0\\xi\\neq 0, and substitute in this result to removesinh⁡β\\sinh\\beta, we obtain

β​v⋆​cosh⁡β−η​βξ​μ−μ⋆σ=β​v⋆2v=β​v⋆​σ⋆σ\.\\beta v\_\{\\star\}\\cosh\\beta\-\\frac\{\\eta\\beta\}\{\\xi\}\\frac\{\\mu\-\\mu\_\{\\star\}\}\{\\sigma\}=\\frac\{\\beta v\_\{\\star\}^\{2\}\}\{v\}=\\beta v\_\{\\star\}\\frac\{\\sigma\_\{\\star\}\}\{\\sigma\}\.\(70\)Assumingβ≠0\\beta\\neq 0, this can be rearranged to obtain

cosh⁡β=σ⋆σ\+μ−μ⋆v⋆​σ​ηξ\.\\cosh\\beta=\\frac\{\\sigma\_\{\\star\}\}\{\\sigma\}\+\\frac\{\\mu\-\\mu\_\{\\star\}\}\{v\_\{\\star\}\\sigma\}\\frac\{\\eta\}\{\\xi\}\.\(71\)These can be combined using a hypergeometric identity to obtain

ηξ=12​\(μ−μ⋆\)2\+σ2−σ⋆2σ⋆​\(μ−μ⋆\)​2,\\frac\{\\eta\}\{\\xi\}=\\frac\{\\frac\{1\}\{2\}\\left\(\\mu\-\\mu\_\{\\star\}\\right\)^\{2\}\+\\sigma^\{2\}\-\\sigma\_\{\\star\}^\{2\}\}\{\\sigma\_\{\\star\}\\left\(\\mu\-\\mu\_\{\\star\}\\right\)\\sqrt\{2\}\},\(72\)where we have used thatv⋆=2v\_\{\\star\}=\\sqrt\{2\}\. We can now substitute this into our earlier expression forsinh⁡β\\sinh\\beta

sinh⁡β=βξ​μ−μ⋆σ=sgn⁡\(ξ\)v⋆​1\+η2ξ2​μ−μ⋆σ\.\\sinh\\beta=\\frac\{\\beta\}\{\\xi\}\\frac\{\\mu\-\\mu\_\{\\star\}\}\{\\sigma\}=\\frac\{\\operatorname\{sgn\}\(\\xi\)\}\{v\_\{\\star\}\}\\sqrt\{1\+\\frac\{\\eta^\{2\}\}\{\\xi^\{2\}\}\}\\frac\{\\mu\-\\mu\_\{\\star\}\}\{\\sigma\}\.\(73\)Sinceβ\>0\\beta\>0, we require the conditionsgn⁡\(ξ\)=sgn⁡\(μ−μ⋆\)\\operatorname\{sgn\}\(\\xi\)=\\operatorname\{sgn\}\(\\mu\-\\mu\_\{\\star\}\)\. But given thatsgn⁡\(sinh⁡β/β\)=1\\operatorname\{sgn\}\(\\sinh\\beta/\\beta\)=1, from the earlier formula, we know that this must be true\. We therefore find that we can obtainβ\\betafrom

sinh⁡β=\(12​\(μ−μ⋆\)2\+\(σ−σ⋆\)2\)​\(12​\(μ−μ⋆\)2\+\(σ\+σ⋆\)2\)2​σ⋆​σ,\\sinh\\beta=\\frac\{\\sqrt\{\\left\(\\frac\{1\}\{2\}\\left\(\\mu\-\\mu\_\{\\star\}\\right\)^\{2\}\+\(\\sigma\-\\sigma\_\{\\star\}\)^\{2\}\\right\)\\left\(\\frac\{1\}\{2\}\\left\(\\mu\-\\mu\_\{\\star\}\\right\)^\{2\}\+\(\\sigma\+\\sigma\_\{\\star\}\)^\{2\}\\right\)\}\}\{2\\sigma\_\{\\star\}\\sigma\},\(74\)and thus we now have a way of computing\(\|ξ\|,\|η\|\)\(\|\\xi\|,\|\\eta\|\)from\(μ,σ\)\(\\mu,\\sigma\)\. We note that the signs of the coordinates can be determined using

sgn⁡\(ξ\)=sgn⁡\(μ−μ⋆\),sgn⁡\(η\)=sgn⁡\(12​\(μ−μ⋆\)2\+σ2−σ⋆2\)\.\\operatorname\{sgn\}\(\\xi\)=\\operatorname\{sgn\}\(\\mu\-\\mu\_\{\\star\}\),\\quad\\operatorname\{sgn\}\(\\eta\)=\\operatorname\{sgn\}\\left\(\\frac\{1\}\{2\}\\left\(\\mu\-\\mu\_\{\\star\}\\right\)^\{2\}\+\\sigma^\{2\}\-\\sigma\_\{\\star\}^\{2\}\\right\)\.\(75\)

##### Case 2:ξ=0\\xi=0,β≠0\\beta\\neq 0

Ifξ=0\\xi=0, then automatically we have thatμ=μ⋆\\mu=\\mu\_\{\\star\}and thatβ​v⋆=\|η\|\\beta v\_\{\\star\}=\|\\eta\|\. We therefore obtain

v=v⋆​1cosh⁡β−sgn⁡\(η\)​sinh⁡β=v⋆​exp⁡\(sgn⁡\(η\)​β\),v=v\_\{\\star\}\\frac\{1\}\{\\cosh\\beta\-\\operatorname\{sgn\}\(\\eta\)\\sinh\\beta\}=v\_\{\\star\}\\exp\\left\(\\operatorname\{sgn\}\(\\eta\)\\beta\\right\),\(76\)and thus \(given thatβ≥0\\beta\\geq 0\)

β=\|log⁡\(σσ⋆\)\|,\\beta=\\left\|\\log\\left\(\\frac\{\\sigma\}\{\\sigma\_\{\\star\}\}\\right\)\\right\|,\(77\)whereη\>0\\eta\>0forσ\>σ⋆\\sigma\>\\sigma\_\{\\star\}andη<0\\eta<0forσ<σ⋆\\sigma<\\sigma\_\{\\star\}\. Note that this is identical to[Eq\.˜74](https://arxiv.org/html/2606.23838#A4.E74)forμ=μ⋆\\mu=\\mu\_\{\\star\}\.

##### Case 3:β=0\\beta=0

It looks like there are two remaining cases to consider:ξ≠0\\xi\\neq 0withβ=0\\beta=0, andξ=0\\xi=0,β=0\\beta=0, but from the definition ofβ\\beta, the first of these is impossible\. We also note that the second of these means thatξ=η=0\\xi=\\eta=0\. We already noted that this corresponds to the pointμ=μ⋆\\mu=\\mu\_\{\\star\}andσ=σ⋆\\sigma=\\sigma\_\{\\star\}, so there is nothing else to add here\.

##### Summary

In summary, to obtain\(μ,σ\)\(\\mu,\\sigma\)from\(ξ,η\)\(\\xi,\\eta\):

1. 1\.Ifμ=μ⋆\\mu=\\mu\_\{\\star\}andσ=σ⋆\\sigma=\\sigma\_\{\\star\}, thenξ=η=0\\xi=\\eta=0\.
2. 2\.Otherwise,β\\betacan be obtained from evaluating[Eq\.˜74](https://arxiv.org/html/2606.23838#A4.E74)\. The signs ofξ\\xiandη\\etaare given by[Eq\.˜75](https://arxiv.org/html/2606.23838#A4.E75)\. To obtain the magnitudes ofξ\\xiandη\\eta, one can use the results: 1. \(a\)Ifμ=μ⋆\\mu=\\mu\_\{\\star\}, then\|η\|=v⋆​β\|\\eta\|=v\_\{\\star\}\\betaandξ=0\\xi=0\. 2. \(b\)Otherwise, use[Eq\.˜72](https://arxiv.org/html/2606.23838#A4.E72)

### D\.6Metric and Jacobian for Geodesic Normal Coordinates

Now we have functions foruuandvvas a function ofξ\\xiandη\\eta, let us find the metric in these new coordinates\. We first find the Jacobian matrix although we do not write the explicit components here, as they are not particularly illuminating\. One important property, however, is its determinant, which is

detJ=v⋆2​β​sinh⁡β\(β​v⋆​cosh⁡β−η​sinh⁡β\)2=μ−μ⋆σ⋆​σσ⋆​1ξ\.\\det J=\\frac\{v\_\{\\star\}^\{2\}\\beta\\sinh\\beta\}\{\\left\(\\beta v\_\{\\star\}\\cosh\\beta\-\\eta\\sinh\\beta\\right\)^\{2\}\}=\\frac\{\\mu\-\\mu\_\{\\star\}\}\{\\sigma\_\{\\star\}\}\\frac\{\\sigma\}\{\\sigma\_\{\\star\}\}\\frac\{1\}\{\\xi\}\.\(78\)
We see here that, unlessμ=μ⋆\\mu=\\mu\_\{\\star\},ξ=0\\xi=0orξ→∞\\xi\\to\\infty, we will definitely have a finite and non\-zero determinant for finite values ofμ\\muandσ\\sigma\(providedσ≠0\\sigma\\neq 0andσ⋆≠0\\sigma\_\{\\star\}\\neq 0, but we always have this case\)\. Let us investigate the caseμ=μ⋆\\mu=\\mu\_\{\\star\}to see if this is problematic\. This case corresponds toξ=0\\xi=0\(so our first two problematic cases are identical\), and thus we must consider the limit of\(u−u⋆\)/ξ\(u\-u\_\{\\star\}\)/\\xiasξ→0\\xi\\to 0\. In this limitβ=η/v⋆\\beta=\\eta/v\_\{\\star\}, and hence

limξ→0u−u⋆ξ=v⋆​sinh⁡\(η/v⋆\)η​cosh⁡\(η/v⋆\)−η​sinh⁡\(η/v⋆\)=\[ηv⋆​\(coth⁡\(ηv⋆\)−1\)\]−1\.\\lim\_\{\\xi\\to 0\}\\frac\{u\-u\_\{\\star\}\}\{\\xi\}=\\frac\{v\_\{\\star\}\\sinh\\left\(\\eta/v\_\{\\star\}\\right\)\}\{\\eta\\cosh\\left\(\\eta/v\_\{\\star\}\\right\)\-\\eta\\sinh\\left\(\\eta/v\_\{\\star\}\\right\)\}=\\left\[\\frac\{\\eta\}\{v\_\{\\star\}\}\\left\(\\coth\\left\(\\frac\{\\eta\}\{v\_\{\\star\}\}\\right\)\-1\\right\)\\right\]^\{\-1\}\.\(79\)The functiony=\(x​\(coth⁡x−1\)\)−1y=\(x\(\\coth x\-1\)\)^\{\-1\}has no roots and is always positive, hence we find that the Jacobian is always positive at the pointu=u⋆u=u\_\{\\star\}\. Therefore, unlessξ\\xi,μ−μ⋆\\mu\-\\mu\_\{\\star\}orσ\\sigmais infinite, then we always have a finite and non\-zero Jacobian, and thus we can locally invert the coordinate transformation\.

#### D\.6\.1Metric

We can use the Jacobian to obtain the metric \(i\.e\. the Fisher\) in our new coordinates, which is

F~a​b=g~a​b=12​β4​\(η2​sinh2⁡β\+ξ2​β2η​ξ​\(β2−sinh2⁡β\)η​ξ​\(β2−sinh2⁡β\)η2​β2\+ξ2​sinh2⁡β\),\\tilde\{F\}\_\{ab\}=\\tilde\{g\}\_\{ab\}=\\frac\{1\}\{2\\beta^\{4\}\}\\begin\{pmatrix\}\\eta^\{2\}\\sinh^\{2\}\\beta\+\\xi^\{2\}\\beta^\{2\}&\\eta\\xi\\left\(\\beta^\{2\}\-\\sinh^\{2\}\\beta\\right\)\\\\ \\eta\\xi\\left\(\\beta^\{2\}\-\\sinh^\{2\}\\beta\\right\)&\\eta^\{2\}\\beta^\{2\}\+\\xi^\{2\}\\sinh^\{2\}\\beta\\end\{pmatrix\},\(80\)where we have used thatv⋆=2v\_\{\\star\}=\\sqrt\{2\}\. To verify that this has the desired behaviour, let us expand the metric around the pointξ=η=0\\xi=\\eta=0\. In this case, we obtain

g~a​b≈𝟭\+16​\(η2−η​ξ−η​ξξ2\)\+𝒪​\(\|x\|3\),\\tilde\{g\}\_\{ab\}\\approx\\bm\{\\mathsf\{1\}\}\+\\frac\{1\}\{6\}\\begin\{pmatrix\}\\eta^\{2\}&\-\\eta\\xi\\\\ \-\\eta\\xi&\\xi^\{2\}\\end\{pmatrix\}\+\\mathcal\{O\}\\left\(\|x\|^\{3\}\\right\),\(81\)so indeed, in these coordinates, the metric is flat at zeroth and linear order, with the correction occurring at quadratic order\. Using the series expansion of[Eq\.˜68](https://arxiv.org/html/2606.23838#A4.E68), this can be written as

g~a​b≈𝟭\+16​\(\(v−v⋆\)2−\(u−u⋆\)​\(v−v⋆\)−\(u−u⋆\)​\(v−v⋆\)\(u−u⋆\)2\)\+…\\tilde\{g\}\_\{ab\}\\approx\\bm\{\\mathsf\{1\}\}\+\\frac\{1\}\{6\}\\begin\{pmatrix\}\\left\(v\-v\_\{\\star\}\\right\)^\{2\}&\-\\left\(u\-u\_\{\\star\}\\right\)\\left\(v\-v\_\{\\star\}\\right\)\\\\ \-\\left\(u\-u\_\{\\star\}\\right\)\\left\(v\-v\_\{\\star\}\\right\)&\\left\(u\-u\_\{\\star\}\\right\)^\{2\}\\end\{pmatrix\}\+\\ldots\(82\)which is identical to[Eq\.˜50](https://arxiv.org/html/2606.23838#A4.E50)\(sincev⋆=2v\_\{\\star\}=\\sqrt\{2\}\), as required\.

### D\.7Loss function for neural network

In the neural network search for our coordinates, we compute the Frobenius norm \([Eq\.˜8](https://arxiv.org/html/2606.23838#S4.E8)\) between the metric and the identity\. For the geodesic normal coordinates we find that this takes a particularly nice form

‖g~a​b−δa​b‖ℱ2=\(sinh2⁡ββ2−1\)2≈β49\+𝒪​\(β6\),\\big\\\|\\tilde\{g\}\_\{ab\}\-\\delta\_\{ab\}\\big\\\|\_\{\\mathcal\{F\}\}^\{2\}=\\left\(\\frac\{\\sinh^\{2\}\\beta\}\{\\beta^\{2\}\}\-1\\right\)^\{2\}\\approx\\frac\{\\beta^\{4\}\}\{9\}\+\\mathcal\{O\}\(\\beta^\{6\}\),\(83\)where the first equality is exact, and the approximation is valid forβ≪1\\beta\\ll 1\. The quartic behaviour in the square of the Frobenius norm is consistent with geodesic normal coordinates having quadratic deviations from flatness near the fiducial point\.

## Appendix EExperiment Details

##### Default Distillery setup\.

The distillery comes with controls that users can tune for specific problems\. We calibrated our setup on the Rosenbrock problem, and made slight modifications \(such as changing network size or depth\) for each application\. A fixed simulation budget is chosen in advance\. The simulations are5050–5050split into Fishnets training and validation datasets\. Once the Fisher matrices are learned, the flattening network is fit on thevalidationset of true parameters and ensemble of learned Fishers\(θ,\{𝗙\}j\)\(\\theta,\\\{\\bm\{\\mathsf\{F\}\}\\\}\_\{j\}\)\(the train split is discarded\)\.SR data augmentation\.Once the flattening ensemble is trained, the map𝜼​\(𝜽\)\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}\(\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\bm\{\\theta\}\}\)no longer depends on the underlying simulator; we therefore draw a fresh, dense, uniform sample from the prior support \(typically∼5\{\\sim\}5–10×10\\timesthe Fishnet validation count, e\.g\.∼2,000\{\\sim\}2\{,\}000Rosenbrock draws against∼250\{\\sim\}250validation simulations\) and pass it through the ensemble to obtain the augmented triple\(𝜽,𝜼¯,δ​𝜼\)\(\{\\color\[rgb\]\{0\.62890625,0\.3515625,0\.0390625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.62890625,0\.3515625,0\.0390625\}\\bm\{\\theta\}\},\\bar\{\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}\},\\delta\{\\color\[rgb\]\{0\.1171875,0\.3515625,0\.78515625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.1171875,0\.3515625,0\.78515625\}\\bm\{\\eta\}\}\)used as SR inputs\. Empirically this dense, well\-conditioned coverage of the prior is the single most reliable lever for clean Pareto fronts and recovery of the analytic coordinates—using only the∼250\{\\sim\}250original validation points routinely yields noisier expressions and slower convergence\. Each ensemble member is aligned to a reference member using the Procrustes alignment scheme\. Symbolic regression is carried out for 120 seconds per component, and expressions are pruned with an importance\-weighted Frobenius loss threshold of0\.10\.1\.

##### Downstream neural posterior estimation\.

For each of the four scientific applications and the Rosenbrock validation, we train a Masked Autoregressive Flow \(MAF\) as a Neural Posterior Estimator \(NPE\) usingltu\-iliwith thelampebackend\[[34](https://arxiv.org/html/2606.23838#bib.bib40)\]\. We use55MAF transforms,5050hidden features per transform, two MAF repeats, learning rate10−410^\{\-4\}, and up to1 0001\\,000epochs, with early stopping by smoothed validation log\-probability over a1010\-epoch moving window\. The simulation\-budget sweeps span\{100,500,1 000,5 000,10 000\}\\\{100,500,1\\,000,5\\,000,10\\,000\\\}training simulations with corresponding batch sizes\[8,22,32,32,32\]\[8,22,32,32,32\], and average over three random seeds\. Crucially, the simulations are generated uniformly inθ\\theta\-space and, for theη\\eta\-coordinate inference, converted via the discovered coordinate transformation\. This ensures that prior volumes do not influence the inference results\. Log\-losses inη\\eta\-space are converted toθ\\theta\-space using the meanlog⁡\|det\(∂θ/∂η\)\|\\log\\\!\\big\|\\\!\\det\\\!\\big\(\\partial\\theta/\\partial\\eta\\big\)\\big\|over the prior, estimated by central finite differences on512512samples\.

### E\.1Rosenbrock

Consider a Gaussian in two dimensions controlled by means𝜼=\(ν0,ν1\)\{\\bm\{\\eta\}\}=\(\\nu\_\{0\},\\nu\_\{1\}\), which we would like to infer from data, and fixed diagonal covariance𝚺=diag​\(σ02,σ12\)\\bm\{\\Sigma\}=\\text\{diag\}\(\\sigma\_\{0\}^\{2\},\\sigma\_\{1\}^\{2\}\)\. Assuming a uniform prior onθ\\theta, the posterior distribution can be expressed as

log⁡p​\(𝜼\|\{xi\}\)=−12​\(ν0−⟨x0⟩\)2\(σ0/nd\)2−12​\(ν1−⟨x1⟩\)2\(σ1/nd\)2,\\log p\(\{\\bm\{\\eta\}\}\|\\\{\\textbf\{x\}^\{i\}\\\}\)=\-\\frac\{1\}\{2\}\\frac\{\(\\nu\_\{0\}\-\\langle x\_\{0\}\\rangle\)^\{2\}\}\{\(\\sigma\_\{0\}/\\sqrt\{n\_\{d\}\}\)^\{2\}\}\-\\frac\{1\}\{2\}\\frac\{\(\\nu\_\{1\}\-\\langle x\_\{1\}\\rangle\)^\{2\}\}\{\(\\sigma\_\{1\}/\\sqrt\{n\_\{d\}\}\)^\{2\}\},\(84\)where⟨a⟩=1nd​∑i=1ndai\\langle a\\rangle=\\frac\{1\}\{n\_\{d\}\}\\sum\_\{i=1\}^\{n\_\{d\}\}a^\{i\}, which is also Gaussian, andndn\_\{\\rm d\}is the number of data points\. This likelihood has a known Fisher information matrix:𝗙′=nd​diag⁡\(1/σ02,1/σ12\)\\bm\{\\mathsf\{F\}\}^\{\\prime\}=n\_\{d\}\\operatorname\{diag\}\(1/\\sigma\_\{0\}^\{2\},1/\\sigma\_\{1\}^\{2\}\)\. Hence, in this coordinate system, by a simple rescaling we can make the Fisher information the identity\.

Let us now consider a new coordinate system for which the Fisher information is not so clearly related to the identity\. Let us define𝜽=\(μ0,μ1\)\\bm\{\\theta\}=\(\\mu\_\{0\},\\mu\_\{1\}\)such that

μ0=ν0,μ1=ν1\+ν02,⇔ν0=μ0,ν1=μ1−μ02\.\\mu\_\{0\}=\\nu\_\{0\},\\quad\\mu\_\{1\}=\\nu\_\{1\}\+\\nu\_\{0\}^\{2\},\\quad\\Leftrightarrow\\quad\\nu\_\{0\}=\\mu\_\{0\},\\quad\\nu\_\{1\}=\\mu\_\{1\}\-\\mu\_\{0\}^\{2\}\.\(85\)Applying the Jacobian of this transformation \(which has determinant11\), the posterior in the new coordinates is given as

log⁡p​\(𝜽\|\{xi\}\)=−12​\(μ0−⟨x0⟩\)2\(σ0/nd\)2−12​\(μ1−μ02−⟨x1⟩\)2\(σ1/nd\)2,\\log p\(\\bm\{\\theta\}\|\\\{\\textbf\{x\}^\{i\}\\\}\)=\-\\frac\{1\}\{2\}\\frac\{\(\\mu\_\{0\}\-\\langle x\_\{0\}\\rangle\)^\{2\}\}\{\(\\sigma\_\{0\}/\\sqrt\{n\_\{d\}\}\)^\{2\}\}\-\\frac\{1\}\{2\}\\frac\{\(\\mu\_\{1\}\-\\mu\_\{0\}^\{2\}\-\\langle x\_\{1\}\\rangle\)^\{2\}\}\{\(\\sigma\_\{1\}/\\sqrt\{n\_\{d\}\}\)^\{2\}\},\(86\)which is a non\-convex Rosenbrock\-like function\[e\.g\.[38](https://arxiv.org/html/2606.23838#bib.bib32)\]\. In these coordinates, the Fisher matrix is given by

𝗙=nd​\(1σ02\+4​μ02σ12−2​μ0σ12−2​μ0σ121σ12\)\.\\bm\{\\mathsf\{F\}\}=n\_\{d\}\\begin\{pmatrix\}\\frac\{1\}\{\\sigma^\{2\}\_\{0\}\}\+\\frac\{4\\mu\_\{0\}^\{2\}\}\{\\sigma\_\{1\}^\{2\}\}&\\frac\{\-2\\mu\_\{0\}\}\{\\sigma\_\{1\}^\{2\}\}\\\\ \\frac\{\-2\\mu\_\{0\}\}\{\\sigma\_\{1\}^\{2\}\}&\\frac\{1\}\{\\sigma\_\{1\}^\{2\}\}\\end\{pmatrix\}\.\(87\)
We have therefore shown that a Rosenbrock\-like posterior \([Eq\.˜86](https://arxiv.org/html/2606.23838#A5.E86)\), which has a complicated Fisher matrix, can be transformed into a Gaussian posterior \([Eq\.˜84](https://arxiv.org/html/2606.23838#A5.E84)\) with a simple Fisher matrix\. Since this result is exact, we use it as a benchmark to determine whether our methodology can find the coordinates𝜼\{\\bm\{\\eta\}\}given simulations performed in the coordinates𝜽\\bm\{\\theta\}\.

To test the algorithm, we generate500500simulations over a fixed uniform prior\(μ0,μ1\)∼𝒰​\[−3,3\]×\[−3,3\]\(\\mu\_\{0\},\\mu\_\{1\}\)\\sim\\mathcal\{U\}\[\-3,3\]\\times\[\-3,3\], splitting 50\-50 into train and test sets\. These coordinates are then transformed to\(ν0,ν1\)\(\\nu\_\{0\},\\nu\_\{1\}\)via[Eq\.˜85](https://arxiv.org/html/2606.23838#A5.E85)\. We then generatend=50n\_\{\\rm d\}=50two\-dimensional data points from a multivariate normal\{x\}i=1nd∼𝒩​\(\[ν0,ν1\],𝚺\)\\\{\\textbf\{x\}\\\}\_\{i=1\}^\{n\_\{d\}\}\\sim\\mathcal\{N\}\(\[\\nu\_\{0\},\\nu\_\{1\}\],\\bm\{\\Sigma\}\)with fixed, diagonal covariance𝚺=diag​\(1\.0,2\.0\)\\bm\{\\Sigma\}=\\mathrm\{diag\}\(\{1\.0,2\.0\}\)\. For inference scaling we investigate𝚺=diag​\(4\.0,1\.0\)\\bm\{\\Sigma\}=\\mathrm\{diag\}\(\{4\.0,1\.0\}\), which exacerbates the “banana” shape of the posterior, deviating from a Gaussian approximation\.

![Refer to caption](https://arxiv.org/html/2606.23838v1/x6.png)Figure 6:Top panels:Fishnets provides an unbiased estimate of a\) unique Cholesky factors and b\) normalised shape of the true Rosenbrock Fisher Matrix\.Bottom panel:Learned flat coordinatesη\\etaand their ensembled\-estimated uncertaintiesδ​η\\delta\\etaapproximately follow a Gaussian sampling distribution\.
### E\.2Gaussian with unknown variance

We generate10 00010\\,000simulations viaμ∼𝒰​\[−5,5\]\\mu\\sim\\mathcal\{U\}\[\-5,5\]andσ∼𝒰​\[0\.2,7\.0\]\\sigma\\sim\\mathcal\{U\}\[0\.2,7\.0\], and generatex∼𝒩​\(μ,σ\)x\\sim\\mathcal\{N\}\(\\mu,\\sigma\)withdim\(x\)=50\\dim\(x\)=50\. We adopt the same network setup calibrated on the Rosenbrock example\.Results\.Using the same distillery settings as the Rosenbrock example, we recover the MDL expressions

η1\\displaystyle\\eta\_\{1\}=μ​\(0\.4​σ−exp⁡\(0\.14​σ\)\)​\(0\.06​μ2−0\.02​exp⁡\(0\.95​σ\)−3\.0\)\+7\.0,\\displaystyle=\\mu\\,\(0\.4\\,\\sigma\-\\exp\(0\.14\\,\\sigma\)\)\\,\(0\.06\\,\\mu^\{2\}\-0\.02\\,\\exp\(0\.95\\,\\sigma\)\-3\.0\)\+7\.0,η2\\displaystyle\\eta\_\{2\}=0\.8​σ\+\(0\.4​μ2\+1\.0​σ\)exp⁡\(−0\.13​σ\)−0\.9\\displaystyle=0\.8\\,\\sigma\+\(0\.4\\,\\mu^\{2\}\+1\.0\\,\\sigma\)^\{\\exp\(\-0\.13\\,\\sigma\)\}\-0\.9and Frobenius expressions

η1\\displaystyle\\eta\_\{1\}=\(μ​\(0\.06​μ2−3\.0\)​\(0\.02​σ​exp⁡\(0\.28​σ\)−1\.0\)\+\(0\.3​μ\+7\.0\)​exp⁡\(0\.28​σ\)\)​exp⁡\(−0\.28​σ\),\\displaystyle=\\bigl\(\\mu\\,\(0\.06\\,\\mu^\{2\}\-3\.0\)\\,\(0\.02\\,\\sigma\\,\\exp\(0\.28\\,\\sigma\)\-1\.0\)\+\(0\.3\\,\\mu\+7\.0\)\\,\\exp\(0\.28\\,\\sigma\)\\bigr\)\\,\\exp\(\-0\.28\\,\\sigma\),η2\\displaystyle\\eta\_\{2\}=0\.8​σ\+\(0\.4​μ2−0\.04​μ\+σ\)exp⁡\(−0\.12​σ\)−0\.9\.\\displaystyle=0\.8\\,\\sigma\+\(0\.4\\,\\mu^\{2\}\-0\.04\\,\\mu\+\\sigma\)^\{\\exp\(\-0\.12\\,\\sigma\)\}\-0\.9\.

### E\.3SIR Simulation Details

![Refer to caption](https://arxiv.org/html/2606.23838v1/x7.png)Figure 7:Example SIR epidemic \(R0\>1\.0R\_\{0\}\>1\.0\) time\-series data and parameter values, coloured byR0R\_\{0\}\.We use the classical susceptible–infectious–removed \(SIR\) epidemic model\[[23](https://arxiv.org/html/2606.23838#bib.bib66),[16](https://arxiv.org/html/2606.23838#bib.bib67)\], in which a closed population of sizeNNis partitioned into susceptibleS​\(t\)S\(t\), infectiousI​\(t\)I\(t\), and removed/recoveredR​\(t\)R\(t\)compartments:

d​Sd​t\\displaystyle\\frac\{\{\\rm d\}S\}\{\{\\rm d\}t\}=−β​S​IN,\\displaystyle=\-\\beta\\frac\{SI\}\{N\},d​Id​t\\displaystyle\\frac\{\{\\rm d\}I\}\{\{\\rm d\}t\}=β​S​IN−γ​I,\\displaystyle=\\beta\\frac\{SI\}\{N\}\-\\gamma I,d​Rd​t\\displaystyle\\frac\{\{\\rm d\}R\}\{\{\\rm d\}t\}=γ​I,\\displaystyle=\\gamma I,\(88\)whereβ\\betais the transmission rate,γ\\gammathe recovery rate, and the basic reproduction number isR0=β/γR\_\{0\}=\\beta/\\gamma\.

![Refer to caption](https://arxiv.org/html/2606.23838v1/x8.png)Figure 8:Supplementary details for the SIR experiment\.Left:Discovered coordinates require fewer simulations in noisier settings to achieve the same log\-probability scores in downstream NDE posterior estimation inθ\\thetaspace\.Middle:3D posteriors \(here forNsim=1000N\_\{\\rm sim\}=1000\) show good agreement with simulation truths and tighterη\\etaconstraints\.Right:bothη\\etaandθ\\thetaposteriors indicate underconfident \(conservative\) coverage across test simulations\.Simulations useN=5000N=5000with initial conditionsI​\(0\)∼Pois​\(λ=8\)I\(0\)\\sim\\mathrm\{Pois\}\(\\lambda=8\),S​\(0\)=N−I​\(0\)S\(0\)=N\-I\(0\),R​\(0\)=0R\(0\)=0\. We observe the normalised infectious trajectoryI​\(t\)/NI\(t\)/Nat3030evenly spaced times overt∈\[0,50\]t\\in\[0,50\]with additive Gaussian noise of standard deviationσobs=0\.05\\sigma\_\{\\rm obs\}=0\.05\. Parameters are sampled independently asβ∼𝒰​\(0\.1,1\.0\)\\beta\\sim\\mathcal\{U\}\(0\.1,1\.0\),γ∼𝒰​\(0\.05,0\.5\)\\gamma\\sim\\mathcal\{U\}\(0\.05,0\.5\), with the rescaled initial conditionI0/10I\_\{0\}/10entering as a third inference coordinate; we generate2 5002\\,500training and2 5002\\,500test simulations\. For the epidemic\-regime analysis we retain samples withR0≥1\+δR\_\{0\}\\geq 1\+\\deltawithδ=0\.15\\delta=0\.15, and discard the band0\.85≤R0<1\.150\.85\\leq R\_\{0\}<1\.15; the corresponding non\-epidemic subset isR0<0\.85R\_\{0\}<0\.85\. We display example simulations and parameters in Fig\.[7](https://arxiv.org/html/2606.23838#A5.F7)\.

To demonstrate information capture in the identifiedR0R\_\{0\}direction, we grade posterior values ofR0R\_\{0\}, computed from posterior chains in\(β,γ\)\(\\beta,\\gamma\)\. For each test observationx\(i\)x^\{\(i\)\}with true reproduction numberR0\(i\)=β\(i\)/γ\(i\)R\_\{0\}^\{\(i\)\}=\\beta^\{\(i\)\}/\\gamma^\{\(i\)\}, we score the posteriorq​\(R0∣x\(i\)\)q\(R\_\{0\}\\mid x^\{\(i\)\}\)via the continuous ranked probability score \(CRPS\)

CRPS​\(q,R0\(i\)\)=𝔼R∼q​\|R−R0\(i\)\|−12​𝔼R,R′∼q​\|R−R′\|,\\mathrm\{CRPS\}\\\!\\left\(q,\\,R\_\{0\}^\{\(i\)\}\\right\)\\;=\\;\\mathbb\{E\}\_\{R\\sim q\}\\,\\bigl\|R\-R\_\{0\}^\{\(i\)\}\\bigr\|\\;\-\\;\\tfrac\{1\}\{2\}\\,\\mathbb\{E\}\_\{R,R^\{\\prime\}\\sim q\}\\,\\bigl\|R\-R^\{\\prime\}\\bigr\|,\(89\)and report the mean over the test set,CRPS¯​\(R0\)=N−1​∑iCRPS​\(q,R0\(i\)\)\\overline\{\\mathrm\{CRPS\}\}\(R\_\{0\}\)=N^\{\-1\}\\sum\_\{i\}\\mathrm\{CRPS\}\(q,R\_\{0\}^\{\(i\)\}\)\. CRPS is a strictly proper scoring rule with units ofR0R\_\{0\}: it reduces to the absolute error whenqqis a point mass, and it is uniquely minimized in expectation whenqqmatches the true posterior overR0R\_\{0\}\[[14](https://arxiv.org/html/2606.23838#bib.bib90)\]\. We estimate \([89](https://arxiv.org/html/2606.23838#A5.E89)\) fromMMposterior samples\{Rj\}j=1M\\\{R\_\{j\}\\\}\_\{j=1\}^\{M\}using the unbiasedUU\-statistic

CRPS^=1M​∑j\|Rj−R0\(i\)\|−12​M​\(M−1\)​∑j≠k\|Rj−Rk\|,\\widehat\{\\mathrm\{CRPS\}\}\\;=\\;\\frac\{1\}\{M\}\\sum\_\{j\}\\bigl\|R\_\{j\}\-R\_\{0\}^\{\(i\)\}\\bigr\|\\;\-\\;\\frac\{1\}\{2M\(M\{\-\}1\)\}\\sum\_\{j\\neq k\}\\bigl\|R\_\{j\}\-R\_\{k\}\\bigr\|,\(90\)withRj=βj/γjR\_\{j\}=\\beta\_\{j\}/\\gamma\_\{j\}obtained from posterior draws\(βj,γj\)∼q\(⋅∣x\(i\)\)\(\\beta\_\{j\},\\gamma\_\{j\}\)\\sim q\(\\,\\cdot\\mid x^\{\(i\)\}\)\.

### E\.4Weak Gravitational Lensing

For our weak lensing example we adopt the5 0005\\,000noisy tomographic WL simulations fromMakinenet al\.\[[31](https://arxiv.org/html/2606.23838#bib.bib50)\], generated with thepmwdparticle\-mesh code\[[29](https://arxiv.org/html/2606.23838#bib.bib85)\]and collected into four redshift bins to form convergence\-image data of shape\(128,128,4\)\(128,128,4\)\. Simulations span a wide uniform priorp​\(Ωm,S8\)=𝒰​\(\[0\.15,0\.7\]×\[0\.35,1\.52\]\)p\(\\Omega\_\{m\},S\_\{8\}\)=\\mathcal\{U\}\(\[0\.15,0\.7\]\\times\[0\.35,1\.52\]\)and additionally vary the initial\-condition seed; followingMakinenet al\.\[[31](https://arxiv.org/html/2606.23838#bib.bib50)\]we histogram all auto\- and cross\-power spectra per redshift bin into a 60\-component summary per simulation\. Additive shape noise with variance\[0\.00045021,0\.00087473,0\.00134725,0\.00183411\]\[0\.00045021,0\.00087473,0\.00134725,0\.00183411\]per tomographic bin is injected for downstream neural posterior estimation \(Fig\.[9](https://arxiv.org/html/2606.23838#A5.F9)\) Thesameunderlying dark matter simulations are used in the distillery and inference steps\.

![Refer to caption](https://arxiv.org/html/2606.23838v1/figs/wl_experiment/wl_ex1.png)
![Refer to caption](https://arxiv.org/html/2606.23838v1/figs/wl_experiment/wl_ex2.png)

Figure 9:Example NPE posteriors using learned versus conventional weak lensing coordinates\. Learned coordinates trace the conventional coordinate constraints, and outperform rawθ\\thetasampling\.
### E\.5Heater experiment

We expand on the rank\-1 industrial\-heater experiment of[Section˜5\.2](https://arxiv.org/html/2606.23838#S5.SS2.SSS0.Px1), repeating the simulator and prior for self\-containment and providing the hyperparameter and ablation details deferred from the main text\.

We construct a minimal synthetic experiment to demonstrate that the Distillery can perform genuine*dimensionality reduction*, leading to immense performance gains in high dimensions\. A first\-order resistive\-heater plant is driven by an applied voltage and currentθ=\(V,I\)\\theta=\(V,I\), with observable 20\-point time series of the temperature step response \(see e\.g\.\[[20](https://arxiv.org/html/2606.23838#bib.bib86)\]\),

y​\(t\)=\(V⋅I\)​\(1−e−t/τ\)\+ε​\(t\),ε​\(t\)​∼iid​𝒩​\(0,σ2\)\.y\(t\)\\;=\\;\(V\\cdot I\)\\,\(1\-e^\{\-t/\\tau\}\)\\;\+\\;\\varepsilon\(t\),\\qquad\\varepsilon\(t\)\\overset\{\\text\{iid\}\}\{\\sim\}\\mathcal\{N\}\(0,\\sigma^\{2\}\)\.\(91\)The plant dynamics depend on\(V,I\)\(V,I\)onlythrough the dissipated powerP=V⋅IP=V\\cdot I, so the Fisher information matrixFθ​\(θ\)=\(‖k‖2/σ2\)​\(θ2,θ1\)⊤​\(θ2,θ1\)F\_\{\\theta\}\(\\theta\)=\(\\\|k\\\|^\{2\}/\\sigma^\{2\}\)\\,\(\\theta\_\{2\},\\theta\_\{1\}\)^\{\\top\}\(\\theta\_\{2\},\\theta\_\{1\}\)is structurally rank\-1*everywhere*:θ\\thetaparameterises a 2\-D box but the data manifold is exactly 1\-D\. This is a common occurrence in industrial control problems where two physical dials multiply together to produce a single output\.

##### Scaling Impact\.

Density estimation becomes increasingly difficult in high dimensions, requiringN​\(d,ε\)∼ε−\(2​β\+d\)/\(2​β\)N\(d,\\varepsilon\)\\;\\sim\\;\\varepsilon^\{\-\(2\\beta\+d\)/\(2\\beta\)\}simulations to reach an error toleranceε\\varepsilon\(Stone’s relation;\[[39](https://arxiv.org/html/2606.23838#bib.bib78)\]\)\. If the data manifold has intrinsic dimensiond∗<dd^\{\*\}<d, training on the distilled coordinates costs onlyN​\(d∗,ε\)N\(d^\{\*\},\\varepsilon\), giving an exponential\-in\-kksaving

N​\(d,ε\)N​\(d∗,ε\)=ε−k/\(2​β\),k≡d−d∗\.\\frac\{N\(d,\\varepsilon\)\}\{N\(d^\{\*\},\\varepsilon\)\}\\;=\\;\\varepsilon^\{\-k/\(2\\beta\)\},\\qquad k\\equiv d\-d^\{\*\}\.\(92\)Forβ=1\\beta=1\(Lipschitz\) andε=10−2\\varepsilon=10^\{\-2\}this varies as10k/210^\{k/2\}\. Explicit symbolic \(or neural\) distillation of theintrinsicparameter combinations delivers this exponential gain in density estimation by effectively shrinking the parameter space\.Empirical scaling sweep\.To make this explicit in the context of neural posterior estimation \(NPE;\[[34](https://arxiv.org/html/2606.23838#bib.bib40),[2](https://arxiv.org/html/2606.23838#bib.bib11)\]\), we extend the heater problem to a chain\-product simulator

y​\(t\)=\(∏i=1dθi\)​\(1−e−t/τ\)\+ε​\(t\),θi∼𝒰​\[1,2\],y\(t\)\\;=\\;\\Big\(\\prod\_\{i=1\}^\{d\}\\theta\_\{i\}\\Big\)\\,\(1\-e^\{\-t/\\tau\}\)\\;\+\\;\\varepsilon\(t\),\\qquad\\theta\_\{i\}\\sim\\mathcal\{U\}\[1,2\],\(93\)in which the intrinsic dimension is fixed at one for everydd\. Figure[3](https://arxiv.org/html/2606.23838#S5.F3)shows the best validation log\-probability of two neural density estimators trained on a fixed budget ofNsim=103N\_\{\\text\{sim\}\}=10^\{3\}simulations: the raw flow with targetθ∈ℝd\\theta\\in\\mathbb\{R\}^\{d\}and the distilled flow with targetη=∏iθi∈ℝ\\eta=\\prod\_\{i\}\\theta\_\{i\}\\in\\mathbb\{R\}\. The raw flow’s inference quality degrades monotonically withdd, tracing Stone’s relation, while the distilled flow is, by construction, insensitive todd\. The gap widens at the rate predicted by Eq\. \([92](https://arxiv.org/html/2606.23838#A5.E92)\), despite the underlying observed information being identical at everydd\.

##### Simulator and prior\.

The simulator is the resistive\-heater step response[Eq\.˜10](https://arxiv.org/html/2606.23838#S5.E10)with thermal time constantτ=1\\tau=1, observation horizont∈\[0,4\]t\\in\[0,4\]sampled atnt=20n\_\{t\}=20uniformly spaced points, and per\-sample noise standard deviationσ=0\.2\\sigma=0\.2\(σ2=0\.04\\sigma^\{2\}=0\.04in temperature units\)\. The prior isθ=\(V,I\)∼𝒰​\[1,2\]2\\theta=\(V,I\)\\sim\\mathcal\{U\}\[1,2\]^\{2\}—a tight factor\-of\-two box around the operating point—withNsim=103N\_\{\\text\{sim\}\}=10^\{3\}i\.i\.d\. training draws and an independent10310^\{3\}for validation\. The Fisher information matrixFθ​\(θ\)=\(‖k‖2/σ2\)​\(θ2,θ1\)⊤​\(θ2,θ1\)F\_\{\\theta\}\(\\theta\)=\(\\\|k\\\|^\{2\}/\\sigma^\{2\}\)\\,\(\\theta\_\{2\},\\theta\_\{1\}\)^\{\\top\}\(\\theta\_\{2\},\\theta\_\{1\}\)is structurally rank\-1 everywhere because the dynamics depend on\(V,I\)\(V,I\)only through the dissipated powerP=V⋅IP=V\\cdot I\.

##### Fishnet ensemble\.

We use an ensemble of1010Fishnets, each a33\-layer MLP with hidden widths drawn uniformly in\[64,128\]\[64,128\], trained for200200epochs at learning rate10−310^\{\-3\}and batch size6464withpatience=30\\mathrm\{patience\}=30on the validation MLE\+Fisher objective\. The ensemble\-mean Fisher at the prior centre is

F¯θ≈\(0\.620\.580\.580\.62\),eigvals​\(F¯θ\)≈\(0\.033,1\.20\),\\bar\{F\}\_\{\\theta\}\\;\\approx\\;\\begin\{pmatrix\}0\.62&0\.58\\\\ 0\.58&0\.62\\end\{pmatrix\},\\qquad\\mathrm\{eigvals\}\(\\bar\{F\}\_\{\\theta\}\)\\;\\approx\\;\(0\.033,\\;1\.20\),giving an empirical condition number of∼36\\sim 36– consistent with a structurally rank\-1 Fisher whose minor eigenvalue is set by the finite\-data noise floor at thisσ\\sigma\.

##### Flattening flow\.

We fit a single flattening networkη=fϕ​\(θ\)\\eta=f\_\{\\phi\}\(\\theta\)with the log\-Frobenius loss, trained for200200epochs at learning rate10−310^\{\-3\}on the full10310^\{3\}\-sample training set\. The asymptotic lossℒ≈0\.79\\mathcal\{L\}\\approx 0\.79is the structural rank\-1 floor oflog⁡‖Q−1−I‖F\\log\\\|Q^\{\-1\}\-I\\\|\_\{F\}: the unidentifiable direction has eigenvalue zero exactly, so a higher\-rank target would drive the loss to zero\. The Jacobian determinant\|detJ\|≈0\.04\|\\det J\|\\approx 0\.04indicates that the 2D problem contains only one meaningful direction\.

##### Symbolic regression\.

Discovered coordinates are post\-processed with the standard Distillery pipeline\. The two SR\-recovered expressions are

ηnuis​\(V,I\)\\displaystyle\\eta\_\{\\text\{nuis\}\}\(V,I\)≈\+0\.10​V−0\.09​I\+0\.4≈0\.10​\(V−I\)\+0\.4,\\displaystyle\\;\\approx\\;\+0\.10\\,V\\;\-\\;0\.09\\,I\\;\+\\;0\.4\\;\\approx\\;0\.10\\,\(V\-I\)\+0\.4,\(94\)ηiden​\(V,I\)\\displaystyle\\eta\_\{\\text\{iden\}\}\(V,I\)≈−0\.60​V−0\.60​I\+6\.0=−0\.60​\(V\+I\)\+6\.0,\\displaystyle\\;\\approx\\;\-0\.60\\,V\\;\-\\;0\.60\\,I\\;\+\\;6\.0\\;=\\;\-0\.60\\,\(V\+I\)\+6\.0,\(95\)with\|r\|​\(ηiden,V\+I\)=1\.00\|r\|\(\\eta\_\{\\text\{iden\}\},V\{\+\}I\)=1\.00and\|r\|​\(ηnuis,V−I\)=1\.00\|r\|\(\\eta\_\{\\text\{nuis\}\},V\{\-\}I\)=1\.00\(Table[1](https://arxiv.org/html/2606.23838#A5.T1)\); cross\-talk into the identifiable subspace is below0\.100\.10\.

Table 1:Pearson\|r\|\|r\|of the discovered Distillery coordinatesηnuis,ηiden\\eta\_\{\\text\{nuis\}\},\\eta\_\{\\text\{iden\}\}against ground\-truth physics axes for the rank\-1 heater experiment, evaluated over the validation set\. Boldface marks the highest correlation in each row\. The two Distillery coordinates partition the rank\-1 manifold into a clean identifiable\-and\-nuisance pair with cross\-talk\|r\|<0\.10\|r\|<0\.10\.The recoveredV\+IV\+Icoordinate is a local approximation toV⋅IV\\cdot Iwithin the prior box\[1,2\]2\[1,2\]^\{2\}\.

##### NPE\-MAF training \(sweep\)\.

For the chain\-product sweep \(Fig\.[3](https://arxiv.org/html/2606.23838#S5.F3)\), we train a single MAF\[[35](https://arxiv.org/html/2606.23838#bib.bib81)\]with55transforms and hidden width5050for raw coordinates, and a 4\-component MDN for 1D distilled coordinates using theltu\-ili\[[17](https://arxiv.org/html/2606.23838#bib.bib84)\]lampebackend with learning rate10−410^\{\-4\}, batch size6464, patience3030, max epochs10001000, and33independent random seeds per ambient dimensiond∈\{2,3,4,6,8,10,12\}d\\in\\\{2,3,4,6,8,10,12\\\}\. Both raw and distilled flows see identical\(θ,y\)\(\\theta,y\)pairs and identical training budgets\. The only difference is the regression target \(θ∈ℝd\\theta\\in\\mathbb\{R\}^\{d\}vsη=∏iθi∈ℝ\\eta=\\prod\_\{i\}\\theta\_\{i\}\\in\\mathbb\{R\}\) and NPE head parametrisation\.

### E\.6Gravitational\-wave inspiral waveforms \(TaylorF2 warm\-up\)

We use TaylorF2 inspiral waveforms with non\-spinning 2PN phase\. The frequency grid runs fromflow=20f\_\{\\rm low\}=20Hz to the lightest\-system ISCO frequencyfISCO=\(63/2​π​\(m1\+m2\)​M⊙\)−1f\_\{\\rm ISCO\}=\(6^\{3/2\}\\pi\(m\_\{1\}\+m\_\{2\}\)M\_\{\\odot\}\)^\{\-1\}in steps ofΔ​f=0\.5\\Delta f=0\.5Hz, capped at1 0241\\,024Hz\. Waveforms are whitened with the analytic Advanced\-LIGO design PSDSn​\(f\)=10−49​\(x−4\.14\+2\+2​x2\)S\_\{n\}\(f\)=10^\{\-49\}\(x^\{\-4\.14\}\+2\+2x^\{2\}\),x=f/215​Hzx=f/215\\,\{\\rm Hz\}, forf≥10f\\geq 10Hz, with whitening factor4​Δ​f/Sn​\(f\)\\sqrt\{4\\,\\Delta f/S\_\{n\}\(f\)\}\. Real and imaginary parts of the whitened waveform are concatenated, and PCA is fit on a bank of1 0001\\,000noiseless whitened waveforms with4040retained components\. Unit Gaussian noise is added in the PCA\-compressed space\. Mass priors arem1,m2∼𝒰​\(5,50\)​M⊙m\_\{1\},m\_\{2\}\\sim\\mathcal\{U\}\(5,50\)\\,M\_\{\\odot\}subject tom1≥m2m\_\{1\}\\geq m\_\{2\}\.

##### Conventional baseline\.

For comparison we also train a neural posterior estimator in the conventional\(ℳc,q\)\(\\mathcal\{M\}\_\{c\},q\)coordinates,

ℳc=\(m1​m2\)3/5\(m1\+m2\)1/5,q=m2/m1,\\mathcal\{M\}\_\{c\}=\\frac\{\(m\_\{1\}m\_\{2\}\)^\{3/5\}\}\{\(m\_\{1\}\+m\_\{2\}\)^\{1/5\}\},\\qquad q=m\_\{2\}/m\_\{1\},with their analytic inverse used to convert posterior samples back to\(m1,m2\)\(m\_\{1\},m\_\{2\}\)for evaluation\.

![Refer to caption](https://arxiv.org/html/2606.23838v1/x9.png)Figure 10:Benchmark GW result,TaylorF2\(inspiral\-only, 2PN, non\-spinning\)\. Same panel layout as[Fig\.˜5](https://arxiv.org/html/2606.23838#S5.F5)\. The learned coordinates collapse onto a smooth bijection of the conventional chirp\-mass / mass\-ratio combination\(ℳc,q\)\(\\mathcal\{M\}\_\{c\},q\); posterior contours and PP\-plots inη\\etaand\(ℳc,q\)\(\\mathcal\{M\}\_\{c\},q\)are consistent with the analytic argument that the inspiral phase is dominated byℳc\\mathcal\{M\}\_\{c\}in this mass range\.
##### Benchmarking results\.

For masses rescaled to\[0\.1,1\.0\]\[0\.1,1\.0\], the distillery returns the symbolic coordinates of[Section˜5\.2](https://arxiv.org/html/2606.23838#S5.SS2),η1=0\.2​m1\\eta\_\{1\}=0\.2\\,m\_\{1\},η2=5\.5​exp⁡\(−0\.08​\(m1\+m2\)2\)\\eta\_\{2\}=5\.5\\,\\exp\(\-0\.08\\,\(m\_\{1\}\+m\_\{2\}\)^\{2\}\), with analytic inversem1=5​η1m\_\{1\}=5\\eta\_\{1\},m2=−5​η1\+2\.5​3\.41−2​log⁡\|η2\|m\_\{2\}=\-5\\eta\_\{1\}\+2\.5\\sqrt\{3\.41\-2\\log\|\\eta\_\{2\}\|\}used for downstream sampling\. Posterior estimation inη\\eta\-space matches or exceeds the chirp\-mass coordinates at all simulation budgets tested, with the largest gains at low budgets\.

### E\.7Gravitational\-wave inspiral–merger–ringdown \(IMRPhenomD\)

For the main GW result we replace the inspiral\-only TaylorF2 model with the phenomenological inspiral–merger–ringdown waveformIMRPhenomD\[[24](https://arxiv.org/html/2606.23838#bib.bib47),[19](https://arxiv.org/html/2606.23838#bib.bib48)\], generated throughpycbc\[[6](https://arxiv.org/html/2606.23838#bib.bib49)\]on the same uniform\(m1,m2\)\(m\_\{1\},m\_\{2\}\)prior and Advanced\-LIGO design PSD as the TaylorF2 warm\-up\. The frequency grid is extended tofmax=2 048f\_\{\\rm max\}=2\\,048Hz to capture the merger and ringdown of the heaviest systems; the lower cutoffflow=20f\_\{\\rm low\}=20Hz, frequency stepΔ​f=0\.5\\Delta f=0\.5Hz, whitening, and PCA pipeline \(40 components fit on5 0005\\,000noiseless waveforms, unit Gaussian noise added in the PCA basis\) are identical to[Section˜E\.6](https://arxiv.org/html/2606.23838#A5.SS6)\.

##### Why the chirp\-mass scaling fails\.

At the heavy end of the prior, the merger amplitude and ringdown frequency \(∝1/\(m1\+m2\)\\propto 1/\(m\_\{1\}\+m\_\{2\}\)\) carry a substantial fraction of the in\-band signal\-to\-noise, so theℳc∝\(m1​m2\)3/5​\(m1\+m2\)−1/5\\mathcal\{M\}\_\{c\}\\propto\(m\_\{1\}m\_\{2\}\)^\{3/5\}\(m\_\{1\}\+m\_\{2\}\)^\{\-1/5\}scaling that dominates inspiral\-only Fisher information is no longer the principal degeneracy direction\. The distillery exploits this empirically: the symbolic outputs are functions ofM=m1\+m2M=m\_\{1\}\+m\_\{2\}andΔ​m=m1−m2\\Delta m=m\_\{1\}\-m\_\{2\}with noℳc\\mathcal\{M\}\_\{c\}factor, and we verified that running SR on\(M,ℳc\)\(M,\\mathcal\{M\}\_\{c\}\)inputs reproduces a similar functional form, confirming the chirp\-mass dependence is genuinely absent from the post\-merger Fisher rather than hidden by a coordinate symmetry\.

##### Calibration\.

The IMR sweep \(Nsims∈\{500,1 000,2 500,5 000,10 000\}N\_\{\\rm sims\}\\in\\\{500,1\\,000,2\\,500,5\\,000,10\\,000\\\}\) shows the learnedη\\etacoordinates achieve TARP coverage within the1​σ1\\sigmabootstrap band across the fullα∈\[0,1\]\\alpha\\in\[0,1\]range, whereas both\(m1,m2\)\(m\_\{1\},m\_\{2\}\)and\(ℳc,q\)\(\\mathcal\{M\}\_\{c\},q\)NPEs under\-cover at highα\\alpha\.

Similar Articles

Mechanisms of Misgeneralization in Physical Sequence Modeling

arXiv cs.LG

This paper identifies and analyzes 'physical misgeneralization' in generative sequence models, where individual trajectories appear plausible but the aggregate distribution over physical quantities is incorrect, and proposes a kernel-informed mitigation.

Model Merging as Probabilistic Inference in Fine-Tuning Parameter Space

arXiv cs.LG

This paper frames model merging as probabilistic inference under a product-of-experts scenario, showing that existing methods are special cases and proposing a heavy-tailed Cauchy expert design that better captures real residual behavior, achieving significant improvements over state-of-the-art baselines.

Physics-based Digital Twins for Integrated Thermal Energy Systems Using Active Learning

arXiv cs.LG

This paper proposes an active learning framework to couple high-fidelity Modelica simulations with simpler surrogate models (SINDyC, FNN, GRU) for creating efficient digital twins of thermal energy distribution systems. The approach significantly reduces the number of simulation trajectories needed while maintaining predictive accuracy and enabling uncertainty quantification.