Fisher Width: A Geometric Measure of Complexity on Statistical Manifolds
Summary
Introduces Fisher width, a Riemannian analogue of Gaussian width for statistical manifolds, which captures local statistical curvature and is invariant under reparameterization. The paper develops its theory, proves generalization bounds for Fisher-Lipschitz classes, and demonstrates computable estimators on MNIST.
# A Geometric Measure of Complexity on Statistical Manifolds
Source: [https://arxiv.org/html/2606.18306](https://arxiv.org/html/2606.18306)
###### Abstract
Gaussian width is a central geometric complexity measure in high-dimensional probability, compressed sensing, convex optimization, and learning theory. It quantifies the average extent of a set along random directions, thereby capturing the effective dimension of constraint sets, hypothesis classes, and descent cones. However, this notion is intrinsically Euclidean. Statistical models instead carry a natural Riemannian geometry induced by the Fisher information metric, where directions are scaled according to statistical distinguishability rather than ambient Euclidean length.
We introduce Fisher width, a Fisher-geometric analogue of Gaussian width for statistical manifolds. At a parameter point $\theta$, Fisher width replaces the Euclidean identity by the local metric tensor $G(\theta)^{1/2}$, measuring the Gaussian width of the Fisher-rescaled set. This makes the resulting quantity sensitive to local statistical curvature and invariant under smooth reparameterizations.
We develop the basic theory of Fisher width, showing that it retains key structural features of Gaussian width, including concentration, metric perturbation stability, and spectral comparison bounds with the Euclidean baseline, while also capturing anisotropic geometric effects invisible to Euclidean measures. As an application, we prove a generalization bound for Fisher-Lipschitz hypothesis classes and propose computable estimators, which we evaluate empirically on MNIST across three model classes.
Fisher width is to statistical manifolds what Gaussian width is to Euclidean convex bodies. This work lays the foundation for studying complexity and learning on curved statistical manifolds.
## 1 Introduction
### 1.1 Complexity on Statistical Manifolds
A central problem in high-dimensional geometry and statistical learning is to quantify the effective complexity of a model class. One of the most successful notions for this purpose is the Gaussian width

$$w(T)=\mathbb{E}_{g\sim\mathcal{N}(0,I_d)}\left[\sup_{v\in T}\langle g,v\rangle\right],$$

which plays a fundamental role in compressed sensing Chandrasekaran et al. ([2012](https://arxiv.org/html/2606.18306#bib.bib12)), phase transitions in convex recovery Amelunxen et al. ([2014](https://arxiv.org/html/2606.18306#bib.bib13)), and empirical process theory Bartlett and Mendelson ([2002](https://arxiv.org/html/2606.18306#bib.bib14)). Here $T$ typically represents a geometric object encoding the complexity of a problem, such as a hypothesis class, a tangent cone, or a feasible region. Its power rests on a simple geometric insight: the average extent of $T\subset\mathbb{R}^d$ in random directions captures its effective size in the ambient Euclidean space.
Modern statistical models, however, are rarely governed by Euclidean geometry alone. The parameters of an exponential family, a neural network, or a variational autoencoder naturally form a *statistical manifold* $(\Theta,g_F)$ endowed with the Fisher information metric

$$G(\theta)_{ij}=\mathbb{E}_{x\sim p_\theta}\left[\frac{\partial\log p_\theta(x)}{\partial\theta_i}\,\frac{\partial\log p_\theta(x)}{\partial\theta_j}\right].$$

On such manifolds, statistical distinguishability is governed by the local Fisher metric rather than by ambient Euclidean distance. A unit displacement in parameter space may change the model dramatically in one direction and leave it almost unchanged in another, depending entirely on the local geometry encoded by $G(\theta)$. Classical Gaussian width, which treats all directions as equally significant, does not capture this information-geometric structure. This observation motivates the search for a notion of geometric complexity that is intrinsic to Fisher geometry in the same way that Gaussian width is intrinsic to Euclidean geometry.
###### Definition (Fisher Width).
Let $\theta_0\in\Theta$ and $T\subset\mathbb{R}^d$. The *Fisher width* of $T$ at $\theta_0$ is

$$w_G(T;\theta_0)=\mathbb{E}_{g\sim\mathcal{N}(0,I_d)}\left[\sup_{v\in T}\langle g,G(\theta_0)^{1/2}v\rangle\right].$$

The factor $G(\theta_0)^{1/2}$ reweights directions by their local Fisher curvature, so that statistically sensitive directions contribute more to the width than insensitive ones. Thus Fisher width is a local quantity: it depends on the base point $\theta_0$ and varies across the statistical manifold. When $G(\theta_0)=I_d$, it reduces to the classical Gaussian width.
### 1.2 Main Contributions
The main contributions of this paper are as follows.
- (C1) We introduce Fisher width as a local Fisher-geometric analogue of Gaussian width on statistical manifolds. We establish the lifting identity $w_G(T;\theta)=w(G(\theta)^{1/2}T)$ and prove its invariance under smooth reparameterizations.
- (C2) We develop the structural theory of Fisher width, including concentration inequalities, algebraic properties, spectral comparison bounds, and stability under metric perturbations. We further prove an empirical Fisher stability theorem, showing that Fisher width can be consistently approximated from score samples under operator-norm concentration of the empirical Fisher matrix.
- (C3) We prove a generalization bound for Fisher-Lipschitz hypothesis classes, showing that the uniform deviation is controlled by $w_G(T-T;\theta_0)/\sqrt{n}$. We also show that this scale is sharp, up to constants, for local exponential-family likelihood models. Thus Fisher width appears in Fisher-geometric learning bounds in a role analogous to Gaussian width and Rademacher complexity in Euclidean settings.
- (C4) We develop practical estimators based on empirical Fisher information, randomized low-rank approximation, and score-based sampling, and validate these estimators on MNIST across logistic regression, softmax regression, and ridge regression models.
### 1.3 Related Work
#### Gaussian width and high-dimensional geometry.
Gaussian width is a classical complexity measure in asymptotic convex geometry and high-dimensional probability. Gordon’s escape-through-a-mesh theorem Gordon ([1988](https://arxiv.org/html/2606.18306#bib.bib7)) established its central role in the geometry of random projections. It has since become a standard tool in compressed sensing and convex recovery Chandrasekaran et al. ([2012](https://arxiv.org/html/2606.18306#bib.bib12)), where sharp phase transitions are often described through the closely related statistical dimension of convex cones Amelunxen et al. ([2014](https://arxiv.org/html/2606.18306#bib.bib13)). Gaussian-process and chaining methods provide a complementary analytic viewpoint, beginning with Dudley’s entropy bound Dudley ([1967](https://arxiv.org/html/2606.18306#bib.bib11)) and continuing through the modern high-dimensional probability literature Ledoux and Talagrand ([1991](https://arxiv.org/html/2606.18306#bib.bib8)); Vershynin ([2018](https://arxiv.org/html/2606.18306#bib.bib9)); Wainwright ([2019](https://arxiv.org/html/2606.18306#bib.bib10)). Our work follows this geometric tradition, but replaces the ambient Euclidean metric by the local Fisher metric induced by a statistical model.
#### Complexity measures in statistical learning.
Gaussian and Rademacher complexities are fundamental measures of hypothesis-class richness in learning theory Bartlett and Mendelson ([2002](https://arxiv.org/html/2606.18306#bib.bib14)). Localized empirical-process methods and oracle inequalities further refine these complexity measures by adapting them to the geometry of the function class and the data distribution Koltchinskii ([2011](https://arxiv.org/html/2606.18306#bib.bib15)). More recent work has developed norm- and architecture-dependent complexity bounds for neural networks, including size-independent Rademacher bounds Golowich et al. ([2018](https://arxiv.org/html/2606.18306#bib.bib37)). Fisher width belongs to this family of width-type complexity measures: it is a functional controlling uniform deviations, but the underlying geometry is Fisher rather than Euclidean.
#### Information geometry and Fisher metrics.
The Fisher information metric is the canonical Riemannian metric on many statistical model spaces. It underlies the classical theory of statistical manifolds, dual connections, and information-geometric curvature Amari and Nagaoka ([2000](https://arxiv.org/html/2606.18306#bib.bib1)); Amari ([2016](https://arxiv.org/html/2606.18306#bib.bib2)); Ay et al. ([2017](https://arxiv.org/html/2606.18306#bib.bib4)); Murray and Rice ([1993](https://arxiv.org/html/2606.18306#bib.bib5)). Statistical curvature was introduced by Efron ([1975](https://arxiv.org/html/2606.18306#bib.bib6)), and the local metric expansions used in this paper are standard in Riemannian geometry do Carmo ([1992](https://arxiv.org/html/2606.18306#bib.bib29)). Existing work in information geometry primarily studies divergences, geodesics, projections, curvature, and optimization. In contrast, Fisher width introduces a Gaussian-width-type complexity functional on statistical manifolds, thereby connecting information geometry with high-dimensional geometric complexity.
#### Fisher geometry in optimization and deep learning.
Fisher geometry has long been used in optimization through natural gradient descent Amari ([1998](https://arxiv.org/html/2606.18306#bib.bib3)). In modern machine learning, it appears in Riemannian metrics for neural networks Ollivier ([2015](https://arxiv.org/html/2606.18306#bib.bib18)), scalable curvature approximations such as K-FAC Martens and Grosse ([2015](https://arxiv.org/html/2606.18306#bib.bib16)), and analyses of natural-gradient methods Martens ([2020](https://arxiv.org/html/2606.18306#bib.bib17)). The empirical Fisher is computationally convenient but is not, in general, equivalent to the population Fisher or the Hessian, and can exhibit distinct limitations Kunstner et al. ([2019](https://arxiv.org/html/2606.18306#bib.bib20)). The spectrum of the Fisher information matrix in deep networks can also be highly anisotropic, with a small number of large directions and many near-flat directions Karakida et al. ([2019](https://arxiv.org/html/2606.18306#bib.bib19)). Fisher width is complementary to this optimization literature: instead of studying the dynamics of an algorithm, it uses the Fisher metric to measure the effective geometric size of a set.
#### Fisher–Rao complexity and related generalization perspectives.
The closest prior work is the Fisher–Rao norm of neural networks Liang et al. ([2019](https://arxiv.org/html/2606.18306#bib.bib27)), which uses the Fisher metric to define a parameter-space complexity measure. The distinction is structural: the Fisher–Rao norm measures the length of a single parameter vector, whereas Fisher width measures the size of an entire set after Fisher deformation. Thus Fisher width is a set-complexity measure in the tradition of Gaussian width, rather than a parameter norm.
PAC-Bayes theory provides another information-sensitive approach to generalization, relating risk bounds to divergences between posterior and prior distributions McAllester ([1999](https://arxiv.org/html/2606.18306#bib.bib24)); Seeger ([2002](https://arxiv.org/html/2606.18306#bib.bib25)); Dziugaite and Roy ([2017](https://arxiv.org/html/2606.18306#bib.bib26)); see Alquier ([2024](https://arxiv.org/html/2606.18306#bib.bib36)) for a recent survey. For exponential-family models, such divergences admit local quadratic approximations involving the Fisher metric. Flatness-based explanations of generalization are also related, beginning with flat minima Hochreiter and Schmidhuber ([1997](https://arxiv.org/html/2606.18306#bib.bib21)) and continuing through sharp-minima phenomena Keskar et al. ([2017](https://arxiv.org/html/2606.18306#bib.bib22)), sharpness-aware optimization Foret et al. ([2021](https://arxiv.org/html/2606.18306#bib.bib23)), and empirical studies of generalization measures Jiang et al. ([2020](https://arxiv.org/html/2606.18306#bib.bib35)). Fisher width provides a different but compatible viewpoint: it measures the Fisher-geometric size of a class or perturbation set, rather than the sharpness of a single trained solution.
To the best of our knowledge, no prior work has introduced a Gaussian-width-type complexity functional on statistical manifolds or developed its structural, statistical, and computational properties under the Fisher information metric.
## 2 Fisher Width
### 2.1 Statistical Manifolds and Fisher Geometry
Let $\mathcal{X}$ be a sample space and $\mathcal{P}=\{p_\theta:\theta\in\Theta\}$ a smooth parametric family with $\Theta\subset\mathbb{R}^d$. The *Fisher information matrix* at $\theta$ is

$$G(\theta)=\mathbb{E}_{x\sim p_\theta}\left[\nabla_\theta\log p_\theta(x)\,\nabla_\theta\log p_\theta(x)^\top\right].$$

We assume $G(\theta)\succ 0$ throughout. The *Fisher metric* $g_F(u,v)|_\theta=u^\top G(\theta)v$ makes $\Theta$ a Riemannian manifold. The Fisher metric characterizes the local statistical distinguishability of nearby distributions Amari and Nagaoka ([2000](https://arxiv.org/html/2606.18306#bib.bib1)); Amari ([2016](https://arxiv.org/html/2606.18306#bib.bib2)). Specifically, for sufficiently small $\Delta\theta$, the Kullback–Leibler divergence

$$D_{\mathrm{KL}}(p\,\|\,q)=\int p(x)\log\frac{p(x)}{q(x)}\,dx$$

admits the local expansion

$$D_{\mathrm{KL}}\left(p_\theta\,\|\,p_{\theta+\Delta\theta}\right)=\frac{1}{2}\Delta\theta^\top G(\theta)\,\Delta\theta+o(\|\Delta\theta\|^2)$$

(see Amari and Nagaoka, [2000](https://arxiv.org/html/2606.18306#bib.bib1), Chapter 2). Thus, the Fisher metric provides the local quadratic approximation of statistical distinguishability independently of the ambient Euclidean structure on $\mathbb{R}^d$.
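The local expansion can be checked numerically on a one-parameter model. The sketch below is an illustration rather than part of the paper: it assumes a Poisson family parameterized by its rate $\lambda$, for which the KL divergence has the closed form $a\log(a/b)-a+b$ and the Fisher information is $1/\lambda$. The function names are hypothetical.

```python
import math

def kl_poisson(lam: float, delta: float) -> float:
    """Exact KL( Poisson(lam) || Poisson(lam + delta) )."""
    # KL(Pois(a) || Pois(b)) = a*log(a/b) - a + b, rewritten with log1p for accuracy.
    return -lam * math.log1p(delta / lam) + delta

def fisher_quadratic(lam: float, delta: float) -> float:
    """Second-order term 0.5 * delta * G(lam) * delta with G(lam) = 1/lam."""
    return 0.5 * delta * delta / lam

lam = 2.0
for delta in (0.5, 0.1, 0.01):
    exact = kl_poisson(lam, delta)
    approx = fisher_quadratic(lam, delta)
    print(f"delta={delta:5.2f}  KL={exact:.3e}  quadratic={approx:.3e}  ratio={exact / approx:.4f}")
```

The ratio of the exact divergence to the quadratic term approaches 1 as the displacement shrinks, which is the content of the $o(\|\Delta\theta\|^2)$ remainder.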
### 2.2 Fisher Width
We begin by recalling the classical Gaussian width. For a set $T\subset\mathbb{R}^d$, the *Gaussian width* is defined by

$$w(T)=\mathbb{E}_{g\sim\mathcal{N}(0,I_d)}\left[\sup_{v\in T}\langle g,v\rangle\right].$$

The following standard properties will serve as reference points for the Fisher-width theory developed later.
###### Proposition 2.1 (Basic Properties of Gaussian Width).
For compact sets $T,T_1,T_2\subset\mathbb{R}^d$ and scalar $a\in\mathbb{R}$:
1. If $T_1\subseteq T_2$, then $w(T_1)\leq w(T_2)$.
2. $w(aT)=|a|\,w(T)$.
3. $w(\operatorname{conv}(T))=w(T)$.
4. $w(T_1+T_2)\leq w(T_1)+w(T_2)$.

We now introduce Fisher width, the Fisher-geometric analogue of Gaussian width.
###### Definition 2.2 (Fisher Width).
Let $\theta_0\in\Theta$ and $G(\theta_0)\succ 0$. The *Fisher width* of $T\subset\mathbb{R}^d$ at $\theta_0$ is

$$w_G(T;\theta_0)=\mathbb{E}_{g\sim\mathcal{N}(0,I_d)}\left[\sup_{v\in T}\langle g,\,G(\theta_0)^{1/2}v\rangle\right].$$

When $\theta_0$ is fixed or clear from context, we simply write $w_G(T)$.
The map

$$v\longmapsto G(\theta_0)^{1/2}v$$

rescales directions according to the local Fisher geometry. Directions with large Fisher scaling contribute more strongly to the width, while statistically insensitive directions contribute less. Consequently, Fisher width measures the effective size of $T$ after reweighting directions by their local statistical distinguishability.
When $G(\theta_0)=I_d$, Fisher width reduces to the classical Gaussian width:

$$w_G(T;\theta_0)=w(T).$$

In general, $w_G(T;\theta)$ varies across the statistical manifold, in contrast to Gaussian width, which depends only on the set $T$ and not on any base point.
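For a finite set $T$, Definition 2.2 can be approximated by plain Monte Carlo. The following is a minimal sketch, not the estimator developed in the paper; the names `cholesky` and `fisher_width`, the sample size, and the matrix `G` are illustrative assumptions. Instead of forming $G^{1/2}$, it samples $h=Lg$ with $LL^\top=G$. Since $\langle g,G^{1/2}v\rangle=\langle G^{1/2}g,v\rangle$ and both $G^{1/2}g$ and $Lg$ have law $\mathcal{N}(0,G)$, the expectation is unchanged.

```python
import math
import random

def cholesky(G):
    """Lower-triangular L with L L^T = G, for a symmetric positive definite G."""
    d = len(G)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(G[i][i] - s)
            else:
                L[i][j] = (G[i][j] - s) / L[j][j]
    return L

def fisher_width(T, G, n_samples=20000, seed=0):
    """Monte Carlo estimate of w_G(T) for a finite list of vectors T."""
    rng = random.Random(seed)
    d = len(G)
    L = cholesky(G)
    total = 0.0
    for _ in range(n_samples):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # h = L g has covariance G, the same law as G^{1/2} g.
        h = [sum(L[i][k] * g[k] for k in range(i + 1)) for i in range(d)]
        total += max(sum(hi * vi for hi, vi in zip(h, v)) for v in T)
    return total / n_samples

T = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
G = [[4.0, 1.0], [1.0, 2.0]]  # illustrative Fisher matrix, not taken from the paper
print("Gaussian width w(T):", fisher_width(T, identity))
print("Fisher width w_G(T):", fisher_width(T, G))
```

With `identity` in place of `G` the routine estimates the classical Gaussian width, which for this particular $T$ equals $\mathbb{E}\max(|g_1|,|g_2|)=2/\sqrt{\pi}$.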
### 2.3 Equivalent Forms
###### Proposition 2.3 (Lifting Identity).
Let $T\subset\mathbb{R}^d$ be compact and let $G\succ 0$. Then

$$w_G(T)=w\left(G^{1/2}T\right),\quad\text{where}\quad G^{1/2}T=\{G^{1/2}v:\ v\in T\}.$$

###### Proof.
By definition,

$$w_G(T)=\mathbb{E}_g\left[\sup_{v\in T}\langle g,G^{1/2}v\rangle\right].$$

Since $u=G^{1/2}v$ ranges over $G^{1/2}T$, we have

$$\sup_{v\in T}\langle g,G^{1/2}v\rangle=\sup_{u\in G^{1/2}T}\langle g,u\rangle.$$

Therefore

$$w_G(T)=\mathbb{E}_g\left[\sup_{u\in G^{1/2}T}\langle g,u\rangle\right]=w\left(G^{1/2}T\right).$$

∎
The lifting identity

$$w_G(T;\theta_0)=w\left(G(\theta_0)^{1/2}T\right)$$

is the central structural observation underlying the theory. It shows that, at a fixed base point, Fisher width is precisely the Gaussian width of the Fisher-rescaled set. Consequently, many classical properties of Gaussian width can be transferred to the Fisher setting through the local metric deformation induced by the Fisher information matrix.
The geometric content arises because this deformation depends on the base point:

$$T\longmapsto G(\theta)^{1/2}T.$$

Thus, as $\theta$ varies across the statistical manifold, the same Euclidean set $T$ may acquire different effective widths under the local information geometry. Fisher width therefore measures not only the shape of $T$, but also how that shape is seen through the Fisher geometry of the model.
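To illustrate the dependence on the base point, the sketch below evaluates the width of one fixed set at several parameter values. It assumes, purely for illustration, a product of independent Bernoulli variables in natural parameters, whose Fisher matrix is diagonal with entries $\sigma(\theta_i)(1-\sigma(\theta_i))$, and takes $T=\{\pm e_i\}$, for which the supremum is $\max_i\sqrt{G_{ii}}\,|g_i|$. The function names are hypothetical.

```python
import math
import random

def bernoulli_fisher_diag(theta):
    """Diagonal of G(theta) for independent Bernoullis in natural parameters."""
    out = []
    for t in theta:
        p = 1.0 / (1.0 + math.exp(-t))
        out.append(p * (1.0 - p))  # variance of the sufficient statistic x_i
    return out

def width_signed_basis(theta, n_samples=20000, seed=0):
    """Monte Carlo w_G(T; theta) for T = {+e_i, -e_i}: E max_i sqrt(G_ii) |g_i|."""
    rng = random.Random(seed)
    scale = [math.sqrt(x) for x in bernoulli_fisher_diag(theta)]
    total = 0.0
    for _ in range(n_samples):
        total += max(s * abs(rng.gauss(0.0, 1.0)) for s in scale)
    return total / n_samples

# Same set T, three base points: the width shrinks as coordinates saturate.
for theta in ([0.0, 0.0, 0.0], [0.0, 2.0, 4.0], [4.0, 4.0, 4.0]):
    print(theta, round(width_signed_basis(theta), 4))
```

Because the same seed is reused, the three estimates share their Gaussian draws, so their ordering reflects the change in $G(\theta)$ rather than sampling noise.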
### 2.4 Examples
###### Example 2.4 (Euclidean Ball).
Let

$$T=rB_2^d=\{v\in\mathbb{R}^d:\|v\|_2\leq r\}.$$

By the Cauchy–Schwarz inequality,

$$\langle g,G^{1/2}v\rangle=\langle G^{1/2}g,v\rangle\leq\|G^{1/2}g\|_2\,\|v\|_2,$$

with equality when $v$ is parallel to $G^{1/2}g$. Therefore

$$\sup_{\|v\|_2\leq r}\langle g,G^{1/2}v\rangle=r\,\|G^{1/2}g\|_2,$$

and hence $w_G(T)=r\,\mathbb{E}\|G^{1/2}g\|_2$. Since $g\sim\mathcal{N}(0,I_d)$, we have

$$\mathbb{E}[gg^\top]=I_d.$$

Thus

$$\mathbb{E}\|G^{1/2}g\|_2^2=\mathbb{E}[g^\top Gg]=\operatorname{Tr}\left(G\,\mathbb{E}[gg^\top]\right)=\operatorname{Tr}(G).$$

By Jensen’s inequality,

$$\mathbb{E}\|G^{1/2}g\|_2=\mathbb{E}\sqrt{\|G^{1/2}g\|_2^2}\leq\sqrt{\mathbb{E}\|G^{1/2}g\|_2^2}=\sqrt{\operatorname{Tr}(G)}.$$

Consequently,

$$w_G(rB_2^d)\leq r\sqrt{\operatorname{Tr}(G)}.$$

Thus the Fisher width of a Euclidean ball is controlled by the total Fisher information $\operatorname{Tr}(G)$.
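The gap in the Jensen step can be observed numerically. The sketch below assumes a diagonal $G$, which loses no generality for the ball because $g$ is rotationally invariant and only the eigenvalues of $G$ matter. The eigenvalues and the function name `ball_fisher_width` are illustrative choices.

```python
import math
import random

def ball_fisher_width(G_diag, r=1.0, n_samples=20000, seed=0):
    """Monte Carlo estimate of w_G(r B_2^d) = r * E || G^{1/2} g ||_2 for diagonal G."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += math.sqrt(sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in G_diag))
    return r * total / n_samples

G_diag = [4.0, 1.0, 0.25]        # eigenvalues of an illustrative Fisher matrix
estimate = ball_fisher_width(G_diag)
bound = math.sqrt(sum(G_diag))   # r * sqrt(Tr(G)) with r = 1
print(f"estimate = {estimate:.3f}, trace bound = {bound:.3f}")
```

The estimate falls strictly below $\sqrt{\operatorname{Tr}(G)}$; the gap tends to widen as the spectrum becomes more anisotropic.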
###### Example 2.5 (Sparse Vectors).
Consider

$$T=\{v\in\mathbb{R}^d:\|v\|_0\leq s,\ \|v\|_2\leq 1\},$$

the set of $s$-sparse vectors in the Euclidean unit ball. By Cauchy–Schwarz, the supremum $\sup_{v\in T}\langle u,v\rangle$ for any fixed $u\in\mathbb{R}^d$ is attained by concentrating the support of $v$ on the $s$ largest-magnitude coordinates of $u$, giving

$$\sup_{\|v\|_0\leq s,\,\|v\|_2\leq 1}\langle u,v\rangle=\|u\|_{(s)}:=\Bigl(\sum_{i=1}^{s}u_{(i)}^2\Bigr)^{1/2},$$

where $|u_{(1)}|\geq\cdots\geq|u_{(d)}|$. Applying this with $u=G^{1/2}g$,

$$w_G(T)=\mathbb{E}\,\|G^{1/2}g\|_{(s)}.$$

When $G=I_d$ this reduces to $w(T)=\mathbb{E}\,\|g\|_{(s)}$. The Fisher metric replaces the isotropic vector $g$ by the anisotropically weighted $G^{1/2}g$, so two equally sparse parameter vectors may contribute very differently to the width if they are aligned with directions carrying different amounts of Fisher information.
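The top-$s$ norm $\|u\|_{(s)}$ and a maximizing sparse vector are straightforward to compute. The helper names below are hypothetical, and the sketch assumes $u\neq 0$.

```python
import math

def top_s_norm(u, s):
    """Euclidean norm of the s largest-magnitude coordinates of u."""
    largest = sorted((abs(x) for x in u), reverse=True)[:s]
    return math.sqrt(sum(x * x for x in largest))

def best_sparse_vector(u, s):
    """An s-sparse unit vector v attaining <u, v> = top_s_norm(u, s); assumes u != 0."""
    support = sorted(range(len(u)), key=lambda i: abs(u[i]), reverse=True)[:s]
    norm = top_s_norm(u, s)
    v = [0.0] * len(u)
    for i in support:
        v[i] = u[i] / norm
    return v

u = [3.0, -4.0, 1.0, 0.5]
v = best_sparse_vector(u, 2)
inner = sum(a * b for a, b in zip(u, v))
print(top_s_norm(u, 2), inner)
```

Here the two largest-magnitude coordinates are $-4$ and $3$, so $\|u\|_{(2)}=5$ and the maximizer is supported on those two coordinates. Averaging `top_s_norm` over draws of $u=G^{1/2}g$ gives a Monte Carlo estimate of $w_G(T)$.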
###### Example 2.6 (Exponential Families).
Consider an exponential family

$$p_\theta(x)=h(x)\exp\bigl(\theta^\top\phi(x)-A(\theta)\bigr),$$

with sufficient statistic $\phi(x)$. Its Fisher information matrix is

$$G(\theta)=\nabla^2A(\theta)=\mathbb{E}_{p_\theta}\left[\bigl(\phi(x)-\mathbb{E}_{p_\theta}[\phi(x)]\bigr)\bigl(\phi(x)-\mathbb{E}_{p_\theta}[\phi(x)]\bigr)^\top\right]$$

(Amari and Nagaoka, [2000](https://arxiv.org/html/2606.18306#bib.bib1)). Thus the Fisher matrix, and hence Fisher width, is determined entirely by the covariance structure of the sufficient statistics. For the Euclidean ball $T=rB_2^d$,

$$w_G(T)\leq r\sqrt{\operatorname{Tr}(G(\theta))}=r\sqrt{\operatorname{Tr}(\operatorname{Cov}_{p_\theta}[\phi(x)])},$$

that is, $r$ times the square root of the total variance of the sufficient statistic under $p_\theta$. Fisher width is thus bounded above by a quantity that concentrates on directions of high statistical curvature; directions along which $\phi(x)$ is nearly deterministic under $p_\theta$ contribute little to this bound.
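The identity $G(\theta)=\nabla^2A(\theta)=\operatorname{Cov}_{p_\theta}[\phi(x)]$ can be verified on the simplest case. The sketch below assumes a Bernoulli family in its natural parameter, with $A(\theta)=\log(1+e^\theta)$ and $\phi(x)=x$, and compares a finite-difference second derivative of $A$ with the exact variance. The step size and function names are illustrative.

```python
import math

def log_partition(theta):
    """A(theta) = log(1 + exp(theta)) for a Bernoulli family in its natural parameter."""
    return math.log1p(math.exp(theta))

def hessian_fd(theta, eps=1e-4):
    """Central finite-difference approximation of A''(theta)."""
    return (log_partition(theta + eps) - 2.0 * log_partition(theta)
            + log_partition(theta - eps)) / eps ** 2

def variance_of_statistic(theta):
    """Var[phi(x)] with phi(x) = x in {0, 1} and P(x = 1) = sigmoid(theta)."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return p * (1.0 - p)

for theta in (-2.0, 0.0, 1.5):
    print(theta, hessian_fd(theta), variance_of_statistic(theta))
```

The two columns agree up to finite-difference error. The variance, and with it the Fisher information, decays as $|\theta|$ grows and the statistic becomes nearly deterministic.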
###### Example 2.7 (Linear subspaces).
Let $V\subset\mathbb{R}^d$ be a $k$-dimensional linear subspace and let $T=V\cap B_2^d$. Denote by $P_V$ the orthogonal projector onto $V$. Since

$$\sup_{\substack{v\in V\\ \|v\|_2\leq 1}}\langle g,G^{1/2}v\rangle=\|P_VG^{1/2}g\|_2,$$

we have

$$w_G(T)=\mathbb{E}\|P_VG^{1/2}g\|_2\leq\sqrt{\mathbb{E}\|P_VG^{1/2}g\|_2^2}\quad\text{(by Jensen's inequality)}.$$

Note first that

$$\|P_VG^{1/2}g\|_2^2=(P_VG^{1/2}g)^\top(P_VG^{1/2}g)=g^\top G^{1/2}P_V^\top P_VG^{1/2}g=g^\top G^{1/2}P_VG^{1/2}g,$$

where we use $P_V^2=P_V=P_V^\top$. Hence

$$\begin{aligned}
\mathbb{E}\|P_VG^{1/2}g\|_2^2&=\mathbb{E}\bigl[g^\top G^{1/2}P_VG^{1/2}g\bigr]\\
&=\operatorname{Tr}(G^{1/2}P_VG^{1/2})\\
&=\operatorname{Tr}(GP_V)\\
&=\operatorname{Tr}(P_VGP_V).
\end{aligned}$$

Here the second identity uses the quadratic-form trace identity $\mathbb{E}[g^\top Ag]=\operatorname{Tr}(A)$ for $g\sim\mathcal{N}(0,I_d)$, while the last two identities follow from cyclicity of the trace and $P_V^2=P_V$. Therefore

$$w_G(T)\leq\sqrt{\operatorname{Tr}(P_VGP_V)}.$$

Equivalently, for any orthonormal basis $\{u_1,\ldots,u_k\}$ of $V$,

$$\operatorname{Tr}(P_VGP_V)=\sum_{i=1}^{k}u_i^\top Gu_i.$$

Thus Fisher width is controlled by the total Fisher information carried by the directions of $V$. This is the subspace analogue of the full-ball bound controlled by $\operatorname{Tr}(G)$: only the Fisher energy restricted to $V$ contributes to the width.
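The trace identity can be confirmed on a small instance. The matrix `G` and the two-dimensional subspace below are illustrative assumptions; the helpers `matmul` and `trace` are written out so the snippet needs no linear-algebra library.

```python
import math

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

G = [[4.0, 1.0, 0.0],
     [1.0, 2.0, 0.5],
     [0.0, 0.5, 1.0]]                      # illustrative positive definite matrix
c = 1.0 / math.sqrt(2.0)
basis = [[1.0, 0.0, 0.0], [0.0, c, c]]     # orthonormal basis of a 2-dimensional V

# Orthogonal projector P_V = sum_i u_i u_i^T.
P = [[sum(u[i] * u[j] for u in basis) for j in range(3)] for i in range(3)]

projected_trace = trace(matmul(matmul(P, G), P))
fisher_energy = sum(u[i] * G[i][j] * u[j]
                    for u in basis for i in range(3) for j in range(3))
print(projected_trace, fisher_energy, math.sqrt(fisher_energy), math.sqrt(trace(G)))
```

Both expressions evaluate to 6 up to rounding, so the subspace bound is $\sqrt{6}$, below the full-ball bound $\sqrt{\operatorname{Tr}(G)}=\sqrt{7}$.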
### 2.5 Basic Properties
###### Lemma 2.8 (Monotonicity and Homogeneity).
1. Let $T_1,T_2\subset\mathbb{R}^d$ be compact. If $T_1\subseteq T_2$, then $w_G(T_1)\leq w_G(T_2)$.
2. Let $T\subset\mathbb{R}^d$ be compact and symmetric. Then for every $\alpha\in\mathbb{R}$, $w_G(\alpha T)=|\alpha|\,w_G(T)$.

The lemma follows directly from the definition.
###### Lemma 2.9 (Algebraic properties).
Let $G\succ 0$. For nonempty compact sets $T,T_1,T_2\subset\mathbb{R}^d$,

$$\begin{aligned}
w_G(T_1+T_2)&\leq w_G(T_1)+w_G(T_2),\\
w_G(\operatorname{conv}(T))&=w_G(T),\\
w_G(T_1\cap T_2)&\leq\min\{w_G(T_1),w_G(T_2)\}.
\end{aligned}$$

###### Proof.
By the lifting identity,

$$w_G(T)=w(G^{1/2}T).$$

Since $G^{1/2}$ is linear and injective,

$$G^{1/2}(T_1+T_2)=G^{1/2}T_1+G^{1/2}T_2,\qquad G^{1/2}\operatorname{conv}(T)=\operatorname{conv}(G^{1/2}T),$$

and

$$G^{1/2}(T_1\cap T_2)=G^{1/2}T_1\cap G^{1/2}T_2.$$

The first two claims follow from the corresponding algebraic properties of Gaussian width, and the third from monotonicity applied to $T_1\cap T_2\subseteq T_i$ for $i=1,2$. ∎
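A per-sample check makes the first two properties concrete for finite sets. For each draw, the supremum over a Minkowski sum separates into the two individual suprema, so the subadditivity bound is attained with equality sample by sample, and adding convex combinations never raises the supremum. The sets, the diagonal $G^{1/2}$, and the helper `support` below are illustrative assumptions.

```python
import itertools
import random

def support(h, T):
    """sup over v in T of <h, v>, for a finite set T."""
    return max(sum(a * b for a, b in zip(h, v)) for v in T)

rng = random.Random(0)
G_sqrt_diag = [2.0, 0.5]                   # illustrative diagonal G^{1/2}
T1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
T2 = [[0.5, 0.5], [-0.5, 0.25]]
minkowski_sum = [[a + b for a, b in zip(v1, v2)] for v1, v2 in itertools.product(T1, T2)]
midpoints = [[0.5 * (a + b) for a, b in zip(v, w)] for v, w in itertools.combinations(T1, 2)]
T1_with_midpoints = T1 + midpoints         # list concatenation, not a Minkowski sum

max_gap = 0.0      # largest deviation from additivity over T1 + T2
max_excess = 0.0   # largest amount by which midpoints exceed the supremum over T1
for _ in range(1000):
    h = [s * rng.gauss(0.0, 1.0) for s in G_sqrt_diag]   # h = G^{1/2} g
    gap = abs(support(h, minkowski_sum) - support(h, T1) - support(h, T2))
    max_gap = max(max_gap, gap)
    max_excess = max(max_excess, support(h, T1_with_midpoints) - support(h, T1))
print("additivity gap:", max_gap, " convex-combination excess:", max_excess)
```

Both quantities are zero up to floating-point rounding, consistent with the Minkowski-sum and convex-hull statements of the lemma.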
###### Lemma 2.10 (Spectral bounds).
Let $T\subset\mathbb{R}^d$ be compact. Then

$$\sqrt{\lambda_{\min}(G)}\,w(T)\leq w_G(T)\leq\sqrt{\lambda_{\max}(G)}\,w(T).$$

###### Proof.
Since $G^{1/2}$ is symmetric, for $g\sim\mathcal{N}(0,I_d)$ we may write

$$\langle g,G^{1/2}v\rangle=\langle G^{1/2}g,v\rangle.$$

Thus, with $h=G^{1/2}g\sim\mathcal{N}(0,G)$,

$$w_G(T)=\mathbb{E}_h\sup_{v\in T}\langle h,v\rangle.$$

Define the centered Gaussian processes

$$X_v=\langle h,v\rangle,\qquad Y_v=\sqrt{\lambda_{\max}(G)}\,\langle g,v\rangle,\qquad v\in T.$$

For all $v,v'\in T$,

$$\mathbb{E}|X_v-X_{v'}|^2=(v-v')^\top G(v-v')\leq\lambda_{\max}(G)\|v-v'\|_2^2=\mathbb{E}|Y_v-Y_{v'}|^2.$$

By the Sudakov–Fernique comparison inequality (Talagrand, [2014](https://arxiv.org/html/2606.18306#bib.bib32), Theorem 2.2.3),

$$w_G(T)=\mathbb{E}\sup_{v\in T}X_v\leq\mathbb{E}\sup_{v\in T}Y_v=\sqrt{\lambda_{\max}(G)}\,w(T).$$

For the lower bound, define

$$Z_v=\sqrt{\lambda_{\min}(G)}\,\langle g,v\rangle.$$

Then

$$\mathbb{E}|Z_v-Z_{v'}|^2=\lambda_{\min}(G)\|v-v'\|_2^2\leq(v-v')^\top G(v-v')=\mathbb{E}|X_v-X_{v'}|^2.$$

Applying Sudakov–Fernique again gives

$$\sqrt{\lambda_{\min}(G)}\,w(T)=\mathbb{E}\sup_{v\in T}Z_v\leq\mathbb{E}\sup_{v\in T}X_v=w_G(T).$$

∎
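The two-sided bound can be observed by Monte Carlo on a small non-diagonal example. The sketch assumes a $2\times 2$ matrix $G$ with illustrative entries, uses the closed-form eigenvalues of a symmetric $2\times 2$ matrix, and samples $h\sim\mathcal{N}(0,G)$ through a Cholesky factor. Unlike the proof, which compares expectations, the check here is only up to sampling error.

```python
import math
import random

a, b, c = 4.0, 1.0, 2.0                    # G = [[a, b], [b, c]], illustrative values
mean, radius = 0.5 * (a + c), math.hypot(0.5 * (a - c), b)
lam_min, lam_max = mean - radius, mean + radius   # eigenvalues of a symmetric 2x2 matrix

# Cholesky factor of G, so that h = L g ~ N(0, G).
l11 = math.sqrt(a)
l21 = b / l11
l22 = math.sqrt(c - l21 * l21)

T = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
rng = random.Random(0)
n = 20000
w_euclid = w_fisher = 0.0
for _ in range(n):
    g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    h1, h2 = l11 * g1, l21 * g1 + l22 * g2
    w_euclid += max(g1 * x + g2 * y for x, y in T)
    w_fisher += max(h1 * x + h2 * y for x, y in T)
w_euclid /= n
w_fisher /= n
lower, upper = math.sqrt(lam_min) * w_euclid, math.sqrt(lam_max) * w_euclid
print(f"{lower:.3f} <= {w_fisher:.3f} <= {upper:.3f}")
```

For this $G$ the eigenvalues are $3\pm\sqrt{2}$, so the Fisher width is sandwiched between roughly $1.26$ and $2.10$ times the Gaussian width.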
###### Theorem 2.11 (Gaussian Concentration).
Let $R(T)=\sup_{v\in T}\|v\|_2<\infty$. Define

$$\varphi(g)=\sup_{v\in T}\langle g,G^{1/2}v\rangle.$$

Then $\varphi$ is $\sqrt{\lambda_{\max}(G)}\,R(T)$-Lipschitz and satisfies

$$\Pr\left(\varphi(g)-w_G(T)>t\right)\leq\exp\left(-\frac{t^2}{2\,\lambda_{\max}(G)\,R(T)^2}\right).$$

###### Proof.
For $g,h\in\mathbb{R}^d$,

$$\begin{aligned}
|\varphi(g)-\varphi(h)|&\leq\sup_{v\in T}|\langle g-h,\,G^{1/2}v\rangle|\\
&\leq\|g-h\|_2\sup_{v\in T}\|G^{1/2}v\|_2\\
&\leq\sqrt{\lambda_{\max}(G)}\,R(T)\,\|g-h\|_2.
\end{aligned}$$

Thus $\varphi$ is $\sqrt{\lambda_{\max}(G)}\,R(T)$-Lipschitz. The concentration inequality follows from the Gaussian concentration theorem Ledoux and Talagrand ([1991](https://arxiv.org/html/2606.18306#bib.bib8)). ∎
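An empirical tail comparison gives a sense of how conservative the bound is. The sketch below assumes $T$ is the unit Euclidean ball, so that $\varphi(g)=\|G^{1/2}g\|_2$ and $R(T)=1$, takes an illustrative diagonal $G$, and uses the sample mean as a stand-in for $w_G(T)$. The function names are hypothetical.

```python
import math
import random

G_diag = [4.0, 1.0]                 # illustrative diagonal Fisher matrix
lam_max, R = max(G_diag), 1.0       # T is the unit Euclidean ball, so R(T) = 1
rng = random.Random(0)

# For the unit ball, phi(g) = || G^{1/2} g ||_2.
samples = [math.sqrt(sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in G_diag))
           for _ in range(50000)]
w_hat = sum(samples) / len(samples)  # Monte Carlo stand-in for w_G(T)

def tail_fraction(t):
    """Empirical frequency of the event phi(g) - w_hat > t."""
    return sum(phi - w_hat > t for phi in samples) / len(samples)

def gaussian_bound(t):
    """Right-hand side exp(-t^2 / (2 lambda_max R^2)) of the concentration inequality."""
    return math.exp(-t * t / (2.0 * lam_max * R * R))

for t in (1.0, 2.0, 3.0):
    print(f"t={t}: empirical tail {tail_fraction(t):.4f}, bound {gaussian_bound(t):.4f}")
```

In this example the empirical tails sit well below the bound, as expected from an inequality that depends on $T$ only through $R(T)$ and on $G$ only through $\lambda_{\max}(G)$.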
###### Lemma 2\.12\(Reparameterization invariance\)\.
Letφ:Θ~→Θ\\varphi:\\widetilde\{\\Theta\}\\to\\Thetabe a smooth reparameterization with invertible Jacobian
J=Jφ\(θ~0\)\.J=J\_\{\\varphi\}\(\\tilde\{\\theta\}\_\{0\}\)\.Letθ0=φ\(θ~0\)\\theta\_\{0\}=\\varphi\(\\tilde\{\\theta\}\_\{0\}\), letG=G\(θ0\)G=G\(\\theta\_\{0\}\), and let
G~=J⊤GJ\\widetilde\{G\}=J^\{\\top\}GJbe the pullback Fisher metric atθ~0\\tilde\{\\theta\}\_\{0\}\. ForT⊂ℝdT\\subset\\mathbb\{R\}^\{d\}, define its pullback
T~=J−1T=\{v~∈ℝd:Jv~∈T\}\.\\widetilde\{T\}=J^\{\-1\}T=\\\{\\tilde\{v\}\\in\\mathbb\{R\}^\{d\}:\\ J\\tilde\{v\}\\in T\\\}\.Then
wG~\(T~\)=wG\(T\)\.w\_\{\\widetilde\{G\}\}\(\\widetilde\{T\}\)=w\_\{G\}\(T\)\.
###### Proof\.
Letg∼𝒩\(0,Id\)g\\sim\\mathcal\{N\}\(0,I\_\{d\}\)\. Since
Cov\(J⊤G1/2g\)=J⊤GJ=G~,\\operatorname\{Cov\}\(J^\{\\top\}G^\{1/2\}g\)=J^\{\\top\}GJ=\\widetilde\{G\},the centered Gaussian vectorsJ⊤G1/2gJ^\{\\top\}G^\{1/2\}gandG~1/2g\\widetilde\{G\}^\{1/2\}ghave the same distribution\. Hence
wG~\(T~\)\\displaystyle w\_\{\\widetilde\{G\}\}\(\\widetilde\{T\}\)=𝔼supv~∈T~⟨G~1/2g,v~⟩\\displaystyle=\\mathbb\{E\}\\sup\_\{\\tilde\{v\}\\in\\widetilde\{T\}\}\\langle\\widetilde\{G\}^\{1/2\}g,\\tilde\{v\}\\rangle=𝔼supv~∈T~⟨J⊤G1/2g,v~⟩\\displaystyle=\\mathbb\{E\}\\sup\_\{\\tilde\{v\}\\in\\widetilde\{T\}\}\\langle J^\{\\top\}G^\{1/2\}g,\\tilde\{v\}\\rangle=𝔼supv~:Jv~∈T⟨G1/2g,Jv~⟩\\displaystyle=\\mathbb\{E\}\\sup\_\{\\tilde\{v\}:\\,J\\tilde\{v\}\\in T\}\\langle G^\{1/2\}g,J\\tilde\{v\}\\rangle=𝔼supv∈T⟨G1/2g,v⟩\\displaystyle=\\mathbb\{E\}\\sup\_\{v\\in T\}\\langle G^\{1/2\}g,v\\rangle=wG\(T\),\\displaystyle=w\_\{G\}\(T\),where the fourth equality uses the bijectionv=Jv~v=J\\tilde\{v\}betweenT~\\widetilde\{T\}andTT\. ∎
The equality is distributional rather than pointwise: in general,
\(J⊤GJ\)1/2≠J⊤G1/2\.\(J^\{\\top\}GJ\)^\{1/2\}\\neq J^\{\\top\}G^\{1/2\}\.Thus Fisher width is invariant under reparameterization only when both the metric and the tangent set are transformed by pullback\.
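The distinction between distributional and pointwise equality can be checked numerically. The following is a minimal sketch, assuming a hypothetical diagonal Fisher matrix `G` and a hypothetical invertible Jacobian `J` chosen only for illustration; the helper `psd_sqrt` is not from the paper. It confirms that $J^{\top}G^{1/2}$ and $(J^{\top}GJ)^{1/2}$ induce the same covariance while differing as matrices, and that the change of variables $v=J\tilde{v}$ leaves each supremum unchanged for a finite set $T$.

```python
import numpy as np

def psd_sqrt(G):
    """Symmetric PSD square root via an eigendecomposition."""
    w, V = np.linalg.eigh(G)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

# Hypothetical Fisher matrix and Jacobian, chosen only for illustration.
G = np.diag([1.0, 4.0, 9.0])
J = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])          # det(J) = 7, so J is invertible
G_tilde = J.T @ G @ J                    # pullback metric

left = J.T @ psd_sqrt(G)                 # J^T G^{1/2}
right = psd_sqrt(G_tilde)                # (J^T G J)^{1/2}

# Same covariance (hence same Gaussian law), but different matrices.
same_covariance = np.allclose(left @ left.T, right @ right.T)
same_matrix = np.allclose(left, right)

# Change of variables used in the proof, checked draw by draw:
# <J^T G^{1/2} g, J^{-1} v> = <G^{1/2} g, v> for every v in a finite T.
rng = np.random.default_rng(0)
T = rng.standard_normal((20, 3))         # rows are the points v of T
T_tilde = np.linalg.solve(J, T.T).T      # rows are J^{-1} v
g = rng.standard_normal((1000, 3))
sup_pullback = (g @ left.T @ T_tilde.T).max(axis=1)
sup_original = (g @ psd_sqrt(G) @ T.T).max(axis=1)

print(same_covariance, same_matrix, np.allclose(sup_pullback, sup_original))
```

Replacing `left` by `right` in `sup_pullback` would change the individual draws but not their expectation, which is the content of the lemma.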
## 3Structural Properties
The identitywG\(T\)=w\(G1/2T\)w\_\{G\}\(T\)=w\(G^\{1/2\}T\)allows structural properties of Gaussian width to transfer directly to the Fisher setting\. We focus on the stability under metric perturbations and subspace geometry\.
### 3\.1Estimation Stability
In practice, the Fisher matrix is rarely available exactly. Instead, one typically works with empirical Fisher matrices, low-rank approximations, diagonal approximations, or Kronecker-factored approximations. A natural question is therefore whether Fisher width is stable under perturbations of the underlying metric. The following result shows that an operator-norm error in the square root of the Fisher metric translates linearly into an error in Fisher width.
###### Theorem 3\.1\(Estimation stability\)\.
Let $G_{1},G_{2}\succeq 0$. If
$$\|G_{1}^{1/2}-G_{2}^{1/2}\|_{\mathrm{op}}\leq\varepsilon,$$
then for every compact set $T\subset\mathbb{R}^{d}$,
$$\bigl|w_{G_{1}}(T)-w_{G_{2}}(T)\bigr|\leq\varepsilon\,w(T).$$
###### Proof\.
Let
M=G11/2−G21/2\.M=G\_\{1\}^\{1/2\}\-G\_\{2\}^\{1/2\}\.For eachgg,
supv∈T⟨g,G11/2v⟩−supv∈T⟨g,G21/2v⟩≤supv∈T⟨g,Mv⟩\.\\sup\_\{v\\in T\}\\langle g,G\_\{1\}^\{1/2\}v\\rangle\-\\sup\_\{v\\in T\}\\langle g,G\_\{2\}^\{1/2\}v\\rangle\\leq\\sup\_\{v\\in T\}\\langle g,Mv\\rangle\.Taking expectations gives
wG1\(T\)−wG2\(T\)≤w\(MT\)\.w\_\{G\_\{1\}\}\(T\)\-w\_\{G\_\{2\}\}\(T\)\\leq w\(MT\)\.We now boundw\(MT\)w\(MT\)by Sudakov–Fernique\. Consider the centered Gaussian processes
Xv=⟨g,Mv⟩,Yv=ε⟨g,v⟩,v∈T\.X\_\{v\}=\\langle g,Mv\\rangle,\\qquad Y\_\{v\}=\\varepsilon\\langle g,v\\rangle,\\qquad v\\in T\.Since‖M‖op≤ε\\\|M\\\|\_\{\\mathrm\{op\}\}\\leq\\varepsilon,
𝔼\|Xu−Xv\|2=‖M\(u−v\)‖22≤ε2‖u−v‖22=𝔼\|Yu−Yv\|2\\mathbb\{E\}\|X\_\{u\}\-X\_\{v\}\|^\{2\}=\\\|M\(u\-v\)\\\|\_\{2\}^\{2\}\\leq\\varepsilon^\{2\}\\\|u\-v\\\|\_\{2\}^\{2\}=\\mathbb\{E\}\|Y\_\{u\}\-Y\_\{v\}\|^\{2\}for allu,v∈Tu,v\\in T\. Hence, by the Sudakov–Fernique comparison inequality,
w\(MT\)≤εw\(T\)\.w\(MT\)\\leq\\varepsilon w\(T\)\.Therefore
wG1\(T\)−wG2\(T\)≤εw\(T\)\.w\_\{G\_\{1\}\}\(T\)\-w\_\{G\_\{2\}\}\(T\)\\leq\\varepsilon w\(T\)\.InterchangingG1G\_\{1\}andG2G\_\{2\}gives the reverse bound
wG2\(T\)−wG1\(T\)≤εw\(T\)\.w\_\{G\_\{2\}\}\(T\)\-w\_\{G\_\{1\}\}\(T\)\\leq\\varepsilon w\(T\)\.Combining the two inequalities yields
\|wG1\(T\)−wG2\(T\)\|≤εw\(T\)\.\\bigl\|w\_\{G\_\{1\}\}\(T\)\-w\_\{G\_\{2\}\}\(T\)\\bigr\|\\leq\\varepsilon w\(T\)\.∎
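As a sanity check of Theorem 3.1, the sketch below specializes to the unit ball $T=B_{2}$, where $w_{G}(T)=\mathbb{E}\|G^{1/2}g\|_{2}$ and the bound even holds draw by draw. The matrices `G1` and `G2`, the sample size, and the helper names are assumptions made for illustration. Both widths are estimated with the same Gaussian draws.

```python
import numpy as np

def psd_sqrt(G):
    """Symmetric PSD square root via an eigendecomposition."""
    w, V = np.linalg.eigh(G)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def ball_width_draws(G, g, rho=1.0):
    """Per-draw values rho * ||G^{1/2} g||_2; their mean estimates w_G(rho * B_2)."""
    return rho * np.linalg.norm(g @ psd_sqrt(G), axis=1)

rng = np.random.default_rng(1)
d, n_draws = 5, 20000
g = rng.standard_normal((n_draws, d))

# Hypothetical metric and a perturbed version of it.
A = rng.standard_normal((d, d))
E = rng.standard_normal((d, d))
G1 = A @ A.T / d
G2 = G1 + 0.05 * (E @ E.T) / d

eps = np.linalg.norm(psd_sqrt(G1) - psd_sqrt(G2), ord=2)   # ||G1^{1/2} - G2^{1/2}||_op
w1 = ball_width_draws(G1, g).mean()
w2 = ball_width_draws(G2, g).mean()
w_euclid = np.linalg.norm(g, axis=1).mean()                # estimate of w(B_2)

gap, bound = abs(w1 - w2), eps * w_euclid
print(f"|w_G1 - w_G2| = {gap:.4f} <= eps * w(T) = {bound:.4f}")
```

For a general compact $T$ the inequality is only guaranteed in expectation, so a Monte Carlo check would need a tolerance for sampling error.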
We next apply the stability theorem to empirical Fisher matrices\. Suppose that the Fisher information atθ0\\theta\_\{0\}is estimated from independent samples
z1,…,zn∼pθ0\.z\_\{1\},\\ldots,z\_\{n\}\\sim p\_\{\\theta\_\{0\}\}\.Define the score vector by
sθ0\(z\)=∇θlogpθ\(z\)\|θ=θ0,s\_\{\\theta\_\{0\}\}\(z\)=\\nabla\_\{\\theta\}\\log p\_\{\\theta\}\(z\)\\big\|\_\{\\theta=\\theta\_\{0\}\},and write
si=sθ0\(zi\),i=1,…,n\.s\_\{i\}=s\_\{\\theta\_\{0\}\}\(z\_\{i\}\),\\qquad i=1,\\ldots,n\.Under the standard regularity conditions for Fisher information,
G=G\(θ0\)=𝔼z∼pθ0\[sθ0\(z\)sθ0\(z\)⊤\]=𝔼\[sisi⊤\]\.G=G\(\\theta\_\{0\}\)=\\mathbb\{E\}\_\{z\\sim p\_\{\\theta\_\{0\}\}\}\\bigl\[s\_\{\\theta\_\{0\}\}\(z\)s\_\{\\theta\_\{0\}\}\(z\)^\{\\top\}\\bigr\]=\\mathbb\{E\}\[s\_\{i\}s\_\{i\}^\{\\top\}\]\.The empirical Fisher matrix is then
G^n=1n∑i=1nsisi⊤\.\\widehat\{G\}\_\{n\}=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}s\_\{i\}s\_\{i\}^\{\\top\}\.We start with a concentration bound forG^n\\widehat\{G\}\_\{n\}\.
###### Lemma 3\.2\(Empirical Fisher concentration\)\.
Assume
$$\|s_{\theta_{0}}(z)\|_{2}\leq B\qquad\text{almost surely}.$$
Then, with probability at least $1-\delta$,
$$\|\widehat{G}_{n}-G\|_{\mathrm{op}}\leq B\sqrt{\frac{2\|G\|_{\mathrm{op}}\log(2d/\delta)}{n}}+\frac{4B^{2}}{3}\frac{\log(2d/\delta)}{n}.$$
###### Proof\.
For eachii, write
si=sθ0\(zi\),Xi=sisi⊤−G\.s\_\{i\}=s\_\{\\theta\_\{0\}\}\(z\_\{i\}\),\\qquad X\_\{i\}=s\_\{i\}s\_\{i\}^\{\\top\}\-G\.Then𝔼Xi=0\\mathbb\{E\}X\_\{i\}=0and
G^n−G=1n∑i=1nXi\.\\widehat\{G\}\_\{n\}\-G=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}X\_\{i\}\.
We first bound the almost\-sure operator norm ofXiX\_\{i\}\. Since
‖sisi⊤‖op=‖si‖22≤B2\\\|s\_\{i\}s\_\{i\}^\{\\top\}\\\|\_\{\\mathrm\{op\}\}=\\\|s\_\{i\}\\\|\_\{2\}^\{2\}\\leq B^\{2\}and
‖G‖op=‖𝔼\[sisi⊤\]‖op≤𝔼‖sisi⊤‖op≤B2,\\\|G\\\|\_\{\\mathrm\{op\}\}=\\bigl\\\|\\mathbb\{E\}\[s\_\{i\}s\_\{i\}^\{\\top\}\]\\bigr\\\|\_\{\\mathrm\{op\}\}\\leq\\mathbb\{E\}\\\|s\_\{i\}s\_\{i\}^\{\\top\}\\\|\_\{\\mathrm\{op\}\}\\leq B^\{2\},we have
‖Xi‖op≤2B2almost surely\.\\\|X\_\{i\}\\\|\_\{\\mathrm\{op\}\}\\leq 2B^\{2\}\\qquad\\text\{almost surely\}\.
Next, we bound the variance proxy\. ExpandingXi2X\_\{i\}^\{2\}gives
$$
\begin{aligned}
\mathbb{E}X_{i}^{2} &= \mathbb{E}\bigl[(s_{i}s_{i}^{\top}-G)^{2}\bigr] \\
&= \mathbb{E}[(s_{i}s_{i}^{\top})^{2}]-\mathbb{E}[s_{i}s_{i}^{\top}]G-G\,\mathbb{E}[s_{i}s_{i}^{\top}]+G^{2} \\
&= \mathbb{E}[(s_{i}s_{i}^{\top})^{2}]-G^{2}.
\end{aligned}
$$
Since $G^{2}\succeq 0$,
𝔼Xi2⪯𝔼\[\(sisi⊤\)2\]\.\\mathbb\{E\}X\_\{i\}^\{2\}\\preceq\\mathbb\{E\}\[\(s\_\{i\}s\_\{i\}^\{\\top\}\)^\{2\}\]\.Moreover,
\(sisi⊤\)2=‖si‖22sisi⊤⪯B2sisi⊤\.\(s\_\{i\}s\_\{i\}^\{\\top\}\)^\{2\}=\\\|s\_\{i\}\\\|\_\{2\}^\{2\}s\_\{i\}s\_\{i\}^\{\\top\}\\preceq B^\{2\}s\_\{i\}s\_\{i\}^\{\\top\}\.Taking expectations yields
𝔼Xi2⪯B2𝔼\[sisi⊤\]=B2G\.\\mathbb\{E\}X\_\{i\}^\{2\}\\preceq B^\{2\}\\mathbb\{E\}\[s\_\{i\}s\_\{i\}^\{\\top\}\]=B^\{2\}G\.Therefore
‖∑i=1n𝔼Xi2‖op≤nB2‖G‖op\.\\left\\\|\\sum\_\{i=1\}^\{n\}\\mathbb\{E\}X\_\{i\}^\{2\}\\right\\\|\_\{\\mathrm\{op\}\}\\leq nB^\{2\}\\\|G\\\|\_\{\\mathrm\{op\}\}\.
By the matrix Bernstein inequality\(Tropp,[2015](https://arxiv.org/html/2606.18306#bib.bib34), Theorem 6\.1\.1\),
Pr\(‖∑i=1nXi‖op≥t\)≤2dexp\(−t2/2σ2\+Lt/3\),\\Pr\\left\(\\left\\\|\\sum\_\{i=1\}^\{n\}X\_\{i\}\\right\\\|\_\{\\mathrm\{op\}\}\\geq t\\right\)\\leq 2d\\exp\\left\(\-\\frac\{t^\{2\}/2\}\{\\sigma^\{2\}\+Lt/3\}\\right\),where
L=2B2,σ2=‖∑i=1n𝔼Xi2‖op≤nB2‖G‖op\.L=2B^\{2\},\\qquad\\sigma^\{2\}=\\left\\\|\\sum\_\{i=1\}^\{n\}\\mathbb\{E\}X\_\{i\}^\{2\}\\right\\\|\_\{\\mathrm\{op\}\}\\leq nB^\{2\}\\\|G\\\|\_\{\\mathrm\{op\}\}\.Taking
u=log\(2dδ\),t=2σ2u\+23Lu,u=\\log\(\\frac\{2d\}\{\\delta\}\),\\qquad t=\\sqrt\{2\\sigma^\{2\}u\}\+\\frac\{2\}\{3\}Lu,gives, with probability at least1−δ1\-\\delta,
‖∑i=1nXi‖op≤2σ2log\(2dδ\)\+23Llog\(2dδ\)\.\\left\\\|\\sum\_\{i=1\}^\{n\}X\_\{i\}\\right\\\|\_\{\\mathrm\{op\}\}\\leq\\sqrt\{2\\sigma^\{2\}\\log\(\\frac\{2d\}\{\\delta\}\)\}\+\\frac\{2\}\{3\}L\\log\(\\frac\{2d\}\{\\delta\}\)\.Therefore
‖∑i=1nXi‖op≤B2n‖G‖oplog\(2dδ\)\+43B2log\(2dδ\)\.\\left\\\|\\sum\_\{i=1\}^\{n\}X\_\{i\}\\right\\\|\_\{\\mathrm\{op\}\}\\leq B\\sqrt\{2n\\\|G\\\|\_\{\\mathrm\{op\}\}\\log\(\\frac\{2d\}\{\\delta\}\)\}\+\\frac\{4\}\{3\}B^\{2\}\\log\(\\frac\{2d\}\{\\delta\}\)\.Dividing bynn, we obtain
‖G^n−G‖op≤B2‖G‖oplog\(2dδ\)n\+4B23log\(2dδ\)n\.\\\|\\widehat\{G\}\_\{n\}\-G\\\|\_\{\\mathrm\{op\}\}\\leq B\\sqrt\{\\frac\{2\\\|G\\\|\_\{\\mathrm\{op\}\}\\log\(\\frac\{2d\}\{\\delta\}\)\}\{n\}\}\+\\frac\{4B^\{2\}\}\{3\}\\frac\{\\log\(\\frac\{2d\}\{\\delta\}\)\}\{n\}\.∎
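The bound of Lemma 3.2 can be compared with a simulated deviation. This is a minimal sketch under assumptions that are not from the paper: the "scores" are synthetic vectors drawn uniformly from the sphere of radius $B$, so that $\|s_{i}\|_{2}=B$ exactly and $G=(B^{2}/d)I$ is known in closed form. The function name `eta_n` is hypothetical.

```python
import numpy as np

def eta_n(B, G_op, d, n, delta):
    """Right-hand side of the empirical Fisher concentration bound (Lemma 3.2)."""
    u = np.log(2.0 * d / delta)
    return B * np.sqrt(2.0 * G_op * u / n) + (4.0 * B**2 / 3.0) * u / n

rng = np.random.default_rng(2)
d, n, B, delta = 4, 4000, 1.0, 0.05

# Synthetic stand-ins for score vectors: uniform on the sphere of radius B.
g = rng.standard_normal((n, d))
scores = B * g / np.linalg.norm(g, axis=1, keepdims=True)

G = (B**2 / d) * np.eye(d)            # E[s s^T] for this synthetic law
G_hat = scores.T @ scores / n         # empirical Fisher matrix
deviation = np.linalg.norm(G_hat - G, ord=2)
bound = eta_n(B, B**2 / d, d, n, delta)

print(f"||G_hat - G||_op = {deviation:.4f}, bound eta_n(delta) = {bound:.4f}")
```

The bound is a worst-case guarantee, so the observed deviation is typically well below it.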
###### Lemma 3\.3\(Square\-root perturbation\)\.
Let $A,B\succ 0$. Suppose
$$A\succeq aI,\qquad B\succeq bI$$
for some $a,b>0$. Then
$$\|A^{1/2}-B^{1/2}\|_{\mathrm{op}}\leq\frac{\|A-B\|_{\mathrm{op}}}{\sqrt{a}+\sqrt{b}}.$$
###### Proof\.
We use the standard integral representation for the matrix square root\(Bhatia,[1997](https://arxiv.org/html/2606.18306#bib.bib33), Chapter V\):
A1/2−B1/2=1π∫0∞t1/2\(A\+tI\)−1\(A−B\)\(B\+tI\)−1𝑑t\.A^\{1/2\}\-B^\{1/2\}=\\frac\{1\}\{\\pi\}\\int\_\{0\}^\{\\infty\}t^\{1/2\}\(A\+tI\)^\{\-1\}\(A\-B\)\(B\+tI\)^\{\-1\}\\,dt\.Taking operator norms and using the triangle inequality for Bochner integrals gives
‖A1/2−B1/2‖op≤1π∫0∞t1/2‖\(A\+tI\)−1\(A−B\)\(B\+tI\)−1‖op𝑑t\.\\\|A^\{1/2\}\-B^\{1/2\}\\\|\_\{\\mathrm\{op\}\}\\leq\\frac\{1\}\{\\pi\}\\int\_\{0\}^\{\\infty\}t^\{1/2\}\\left\\\|\(A\+tI\)^\{\-1\}\(A\-B\)\(B\+tI\)^\{\-1\}\\right\\\|\_\{\\mathrm\{op\}\}\\,dt\.By submultiplicativity of the operator norm,
‖\(A\+tI\)−1\(A−B\)\(B\+tI\)−1‖op≤‖\(A\+tI\)−1‖op‖A−B‖op‖\(B\+tI\)−1‖op\.\\left\\\|\(A\+tI\)^\{\-1\}\(A\-B\)\(B\+tI\)^\{\-1\}\\right\\\|\_\{\\mathrm\{op\}\}\\leq\\\|\(A\+tI\)^\{\-1\}\\\|\_\{\\mathrm\{op\}\}\\\|A\-B\\\|\_\{\\mathrm\{op\}\}\\\|\(B\+tI\)^\{\-1\}\\\|\_\{\\mathrm\{op\}\}\.SinceA⪰aIA\\succeq aI, every eigenvalue ofA\+tIA\+tIis at leasta\+ta\+t\. Hence
‖\(A\+tI\)−1‖op=1λmin\(A\+tI\)≤1a\+t\.\\\|\(A\+tI\)^\{\-1\}\\\|\_\{\\mathrm\{op\}\}=\\frac\{1\}\{\\lambda\_\{\\min\}\(A\+tI\)\}\\leq\\frac\{1\}\{a\+t\}\.Similarly,
‖\(B\+tI\)−1‖op≤1b\+t\.\\\|\(B\+tI\)^\{\-1\}\\\|\_\{\\mathrm\{op\}\}\\leq\\frac\{1\}\{b\+t\}\.Therefore
‖A1/2−B1/2‖op≤‖A−B‖opπ∫0∞t1/2\(a\+t\)\(b\+t\)𝑑t\.\\\|A^\{1/2\}\-B^\{1/2\}\\\|\_\{\\mathrm\{op\}\}\\leq\\frac\{\\\|A\-B\\\|\_\{\\mathrm\{op\}\}\}\{\\pi\}\\int\_\{0\}^\{\\infty\}\\frac\{t^\{1/2\}\}\{\(a\+t\)\(b\+t\)\}\\,dt\.
It remains to compute the scalar integral. With the substitution $t=u^{2}$,
$$\frac{1}{\pi}\int_{0}^{\infty}\frac{t^{1/2}}{(a+t)(b+t)}\,dt=\frac{2}{\pi}\int_{0}^{\infty}\frac{u^{2}}{(a+u^{2})(b+u^{2})}\,du.$$
If $a\neq b$, then
$$\frac{u^{2}}{(a+u^{2})(b+u^{2})}=\frac{1}{a-b}\left(\frac{a}{a+u^{2}}-\frac{b}{b+u^{2}}\right).$$
Using
$$\int_{0}^{\infty}\frac{c}{c+u^{2}}\,du=\frac{\pi\sqrt{c}}{2},\qquad c>0,$$
we obtain
$$\frac{2}{\pi}\int_{0}^{\infty}\frac{u^{2}}{(a+u^{2})(b+u^{2})}\,du=\frac{\sqrt{a}-\sqrt{b}}{a-b}=\frac{1}{\sqrt{a}+\sqrt{b}}.$$
The case $a=b$ follows by continuity, or directly from the same integral formula. Thus
$$\frac{1}{\pi}\int_{0}^{\infty}\frac{t^{1/2}}{(a+t)(b+t)}\,dt=\frac{1}{\sqrt{a}+\sqrt{b}}.$$
Substituting this into the previous estimate gives
‖A1/2−B1/2‖op≤‖A−B‖opa\+b,\\\|A^\{1/2\}\-B^\{1/2\}\\\|\_\{\\mathrm\{op\}\}\\leq\\frac\{\\\|A\-B\\\|\_\{\\mathrm\{op\}\}\}\{\\sqrt\{a\}\+\\sqrt\{b\}\},as claimed\. ∎
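The square-root perturbation bound is easy to test numerically. The sketch below takes $a=\lambda_{\min}(A)$ and $b=\lambda_{\min}(B)$, the largest admissible constants, and compares both sides for random positive definite matrices. The matrices and helper names are illustrative assumptions. The $1\times 1$ case $A=4$, $B=1$ shows that the bound can hold with equality.

```python
import numpy as np

def psd_sqrt(G):
    """Symmetric PSD square root via an eigendecomposition."""
    w, V = np.linalg.eigh(G)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def sqrt_perturbation_sides(A, B):
    """Return (lhs, rhs) of ||A^{1/2} - B^{1/2}||_op <= ||A - B||_op / (sqrt(a) + sqrt(b))."""
    a = np.linalg.eigvalsh(A)[0]      # smallest eigenvalue of A
    b = np.linalg.eigvalsh(B)[0]      # smallest eigenvalue of B
    lhs = np.linalg.norm(psd_sqrt(A) - psd_sqrt(B), ord=2)
    rhs = np.linalg.norm(A - B, ord=2) / (np.sqrt(a) + np.sqrt(b))
    return lhs, rhs

rng = np.random.default_rng(3)
d = 6
X = rng.standard_normal((d, d))
Y = rng.standard_normal((d, d))
A = X @ X.T + 0.5 * np.eye(d)         # positive definite by construction
B = Y @ Y.T + 0.5 * np.eye(d)

lhs, rhs = sqrt_perturbation_sides(A, B)
lhs_1d, rhs_1d = sqrt_perturbation_sides(np.array([[4.0]]), np.array([[1.0]]))
print(lhs, rhs, lhs_1d, rhs_1d)
```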
We can now combine empirical Fisher concentration, square\-root perturbation, and estimation stability to obtain a high\-probability stability bound for Fisher width computed from the empirical Fisher matrix\.
###### Corollary 3\.4\(Empirical Fisher stability\)\.
Assume $\|s_{\theta_{0}}(z)\|_{2}\leq B$ almost surely and $G\succeq\mu I$ for some $\mu>0$. Let $T\subset\mathbb{R}^{d}$ be compact, and define
$$\eta_{n}(\delta)=B\sqrt{\frac{2\|G\|_{\mathrm{op}}\log(2d/\delta)}{n}}+\frac{4B^{2}}{3}\frac{\log(2d/\delta)}{n}.$$
If $n$ is large enough that $\eta_{n}(\delta)<\mu$, then with probability at least $1-\delta$,
$$\bigl|w_{\widehat{G}_{n}}(T)-w_{G}(T)\bigr|\leq\frac{\eta_{n}(\delta)}{\sqrt{\mu}}\,w(T).$$
###### Proof\.
Applying Weyl’s inequality, we have, for everyk=1,…,dk=1,\\ldots,d,
λk\(G\)−λk\(G^n\)≤‖G−G^n‖op\.\\lambda\_\{k\}\(G\)\-\\lambda\_\{k\}\(\\widehat\{G\}\_\{n\}\)\\leq\\\|G\-\\widehat\{G\}\_\{n\}\\\|\_\{\\mathrm\{op\}\}\.By Lemma[3\.2](https://arxiv.org/html/2606.18306#S3.Thmtheorem2), with probability at least1−δ1\-\\delta,
‖G−G^n‖op≤ηn\(δ\)\.\\\|G\-\\widehat\{G\}\_\{n\}\\\|\_\{\\mathrm\{op\}\}\\leq\\eta\_\{n\}\(\\delta\)\.Hence, on this event,
λk\(G\)−λk\(G^n\)≤ηn\(δ\),k=1,…,d\.\\lambda\_\{k\}\(G\)\-\\lambda\_\{k\}\(\\widehat\{G\}\_\{n\}\)\\leq\\eta\_\{n\}\(\\delta\),\\qquad k=1,\\ldots,d\.Thus
λk\(G^n\)≥λk\(G\)−ηn\(δ\)\.\\lambda\_\{k\}\(\\widehat\{G\}\_\{n\}\)\\geq\\lambda\_\{k\}\(G\)\-\\eta\_\{n\}\(\\delta\)\.SinceG⪰μIG\\succeq\\mu I, we haveλk\(G\)≥μ\\lambda\_\{k\}\(G\)\\geq\\mufor everykk\. Therefore
λk\(G^n\)≥μ−ηn\(δ\),k=1,…,d\.\\lambda\_\{k\}\(\\widehat\{G\}\_\{n\}\)\\geq\\mu\-\\eta\_\{n\}\(\\delta\),\\qquad k=1,\\ldots,d\.It follows that
G^n⪰\(μ−ηn\(δ\)\)I\.\\widehat\{G\}\_\{n\}\\succeq\\bigl\(\\mu\-\\eta\_\{n\}\(\\delta\)\\bigr\)I\.Sinceηn\(δ\)<μ\\eta\_\{n\}\(\\delta\)<\\mu, the empirical Fisher matrix is positive definite\.
Applying Lemma[3\.3](https://arxiv.org/html/2606.18306#S3.Thmtheorem3)with
A=G,B=G^n,a=μ,b=μ−ηn\(δ\),A=G,\\qquad B=\\widehat\{G\}\_\{n\},\\qquad a=\\mu,\\qquad b=\\mu\-\\eta\_\{n\}\(\\delta\),we obtain
‖G^n1/2−G1/2‖op≤‖G^n−G‖opμ\+μ−ηn\(δ\)≤ηn\(δ\)μ\.\\\|\\widehat\{G\}\_\{n\}^\{1/2\}\-G^\{1/2\}\\\|\_\{\\mathrm\{op\}\}\\leq\\frac\{\\\|\\widehat\{G\}\_\{n\}\-G\\\|\_\{\\mathrm\{op\}\}\}\{\\sqrt\{\\mu\}\+\\sqrt\{\\mu\-\\eta\_\{n\}\(\\delta\)\}\}\\leq\\frac\{\\eta\_\{n\}\(\\delta\)\}\{\\sqrt\{\\mu\}\}\.Finally, Theorem[3\.1](https://arxiv.org/html/2606.18306#S3.Thmtheorem1)yields
\|wG^n\(T\)−wG\(T\)\|≤‖G^n1/2−G1/2‖opw\(T\)≤ηn\(δ\)μw\(T\)\.\\bigl\|w\_\{\\widehat\{G\}\_\{n\}\}\(T\)\-w\_\{G\}\(T\)\\bigr\|\\leq\\\|\\widehat\{G\}\_\{n\}^\{1/2\}\-G^\{1/2\}\\\|\_\{\\mathrm\{op\}\}w\(T\)\\leq\\frac\{\\eta\_\{n\}\(\\delta\)\}\{\\sqrt\{\\mu\}\}\\,w\(T\)\.∎
## 4Curvature Corrections to Gaussian Width
Theorem [3.1](https://arxiv.org/html/2606.18306#S3.Thmtheorem1) quantifies sensitivity to errors in the metric representation $G(\theta_{0})^{1/2}$, such as those arising from empirical or structured Fisher approximations. It is an estimation result: it assumes the base point $\theta_{0}$ is fixed and measures how width changes as the metric changes. A complementary question is how Fisher width varies as $\theta_{0}$ itself moves across the statistical manifold. This variation is governed by curvature and is the subject of this section.
The definition
wG\(θ0\)\(T\)=w\(G\(θ0\)1/2T\)w\_\{G\(\\theta\_\{0\}\)\}\(T\)=w\\\!\\left\(G\(\\theta\_\{0\}\)^\{1/2\}T\\right\)shows that Fisher width is a metric deformation of Gaussian width\. Rather than measuring the Euclidean size ofTT, it measures the size ofTTafter the directions have been rescaled by the local Fisher information\.
This is the first point where the Riemannian nature of the Fisher metric enters explicitly\. Up to this point, Fisher width only uses the positive semidefinite matrixG\(θ0\)G\(\\theta\_\{0\}\)at a fixed parameter value\. In contrast, the following argument uses how the metric tensorG\(θ\)G\(\\theta\)varies in a neighborhood ofθ0\\theta\_\{0\}\. Normal coordinates eliminate the zeroth\- and first\-order Euclidean effects, so the leading nontrivial correction is second order and is controlled by curvature\.
Throughout this section, we regard
T⊂Tθ0ℳT\\subset T\_\{\\theta\_\{0\}\}\\mathcal\{M\}as a compact set of tangent directions, represented in normal coordinates atθ0\\theta\_\{0\}\. In these coordinates,
G\(θ0\)=I,∇G\(θ0\)=0\.G\(\\theta\_\{0\}\)=I,\\qquad\\nabla G\(\\theta\_\{0\}\)=0\.
### 4\.1Local Metric Expansion
Assume that, in a normal\-coordinate neighborhood ofθ0\\theta\_\{0\}, the sectional curvatures of the Fisher manifold are uniformly bounded in absolute value\. That is, there exists a constantκ\>0\\kappa\>0such that
\|Kθ\(Π\)\|≤κ\|K\_\{\\theta\}\(\\Pi\)\|\\leq\\kappafor every pointθ\\thetain this neighborhood and every two\-dimensional tangent planeΠ⊂Tθℳ\\Pi\\subset T\_\{\\theta\}\\mathcal\{M\}\. In normal coordinates atθ0\\theta\_\{0\}, write
Gθ:=G\(θ0\+Δθ\)\.G\_\{\\theta\}:=G\(\\theta\_\{0\}\+\\Delta\\theta\)\.Since
G\(θ0\)=I,∇G\(θ0\)=0,G\(\\theta\_\{0\}\)=I,\\qquad\\nabla G\(\\theta\_\{0\}\)=0,the standard normal\-coordinate expansion of the Riemannian metric\(do Carmo,[1992](https://arxiv.org/html/2606.18306#bib.bib29)\)gives
Gθ=I\+ΔG,ΔG:=Gθ−I,with‖ΔG‖op=O\(κ‖Δθ‖2\)\.G\_\{\\theta\}=I\+\\Delta G,\\qquad\\Delta G:=G\_\{\\theta\}\-I,\\text\{ with \}\\\|\\Delta G\\\|\_\{\\mathrm\{op\}\}=O\\\!\\left\(\\kappa\\\|\\Delta\\theta\\\|^\{2\}\\right\)\.Thus the first nontrivial departure from Euclidean geometry appears only at second order in the local radius\.
Expanding the matrix square root around the identity gives
Gθ1/2=\(I\+ΔG\)1/2=I\+12ΔG\+R,G\_\{\\theta\}^\{1/2\}=\(I\+\\Delta G\)^\{1/2\}=I\+\\frac\{1\}\{2\}\\Delta G\+R,where, for‖ΔG‖op\\\|\\Delta G\\\|\_\{\\mathrm\{op\}\}sufficiently small,
‖R‖op≤C‖ΔG‖op2\.\\\|R\\\|\_\{\\mathrm\{op\}\}\\leq C\\\|\\Delta G\\\|\_\{\\mathrm\{op\}\}^\{2\}\.
The next result quantifies the resulting deviation between Fisher width and Gaussian width\.
###### Proposition 4\.1\(Curvature perturbation bound\)\.
Let $T\subset T_{\theta_{0}}\mathcal{M}$ be compact and represented in normal coordinates at $\theta_{0}$. For a nearby point $\theta=\theta_{0}+\Delta\theta$, write $G_{\theta}=G(\theta)=I+\Delta G$. Assume
$$\|\Delta G\|_{\mathrm{op}}\leq c_{0}<\frac{1}{2}.$$
Then there exists a constant $C>0$, depending only on $c_{0}$, such that
$$\bigl|w_{G_{\theta}}(T)-w(T)\bigr|\leq\left(\frac{1}{2}\|\Delta G\|_{\mathrm{op}}+C\|\Delta G\|_{\mathrm{op}}^{2}\right)w(T).$$
In particular, if the absolute sectional curvature is bounded by $\kappa$ in the normal-coordinate neighborhood, then
$$\bigl|w_{G_{\theta}}(T)-w(T)\bigr|=O\!\left(\kappa\|\Delta\theta\|^{2}\,w(T)\right).$$
###### Proof\.
Since‖ΔG‖op≤c0<1/2\\\|\\Delta G\\\|\_\{\\mathrm\{op\}\}\\leq c\_\{0\}<1/2, the matrix square\-root map admits the first\-order expansion around the identity
\(I\+ΔG\)1/2=I\+12ΔG\+R,‖R‖op≤C‖ΔG‖op2\.\(I\+\\Delta G\)^\{1/2\}=I\+\\frac\{1\}\{2\}\\Delta G\+R,\\qquad\\\|R\\\|\_\{\\mathrm\{op\}\}\\leq C\\\|\\Delta G\\\|\_\{\\mathrm\{op\}\}^\{2\}\.Consequently,
‖\(I\+ΔG\)1/2−I‖op≤12‖ΔG‖op\+C‖ΔG‖op2\.\\\|\(I\+\\Delta G\)^\{1/2\}\-I\\\|\_\{\\mathrm\{op\}\}\\leq\\frac\{1\}\{2\}\\\|\\Delta G\\\|\_\{\\mathrm\{op\}\}\+C\\\|\\Delta G\\\|\_\{\\mathrm\{op\}\}^\{2\}\.
Applying Theorem[3\.1](https://arxiv.org/html/2606.18306#S3.Thmtheorem1)with
G1=Gθ=I\+ΔG,G2=I,G\_\{1\}=G\_\{\\theta\}=I\+\\Delta G,\\qquad G\_\{2\}=I,we obtain
\|wGθ\(T\)−w\(T\)\|≤\(12‖ΔG‖op\+C‖ΔG‖op2\)w\(T\)\.\\bigl\|w\_\{G\_\{\\theta\}\}\(T\)\-w\(T\)\\bigr\|\\leq\\left\(\\frac\{1\}\{2\}\\\|\\Delta G\\\|\_\{\\mathrm\{op\}\}\+C\\\|\\Delta G\\\|\_\{\\mathrm\{op\}\}^\{2\}\\right\)w\(T\)\.
Using the normal\-coordinate curvature estimate stated above,
‖ΔG‖op=O\(κ‖Δθ‖2\),\\\|\\Delta G\\\|\_\{\\mathrm\{op\}\}=O\\\!\\left\(\\kappa\\\|\\Delta\\theta\\\|^\{2\}\\right\),we conclude that
\|wGθ\(T\)−w\(T\)\|=O\(κ‖Δθ‖2w\(T\)\)\.\\bigl\|w\_\{G\_\{\\theta\}\}\(T\)\-w\(T\)\\bigr\|=O\\\!\\left\(\\kappa\\\|\\Delta\\theta\\\|^\{2\}\\,w\(T\)\\right\)\.∎
Proposition[4\.1](https://arxiv.org/html/2606.18306#S4.Thmtheorem1)shows that Fisher width and Gaussian width coincide to first order\. The difference between the two quantities is controlled by the local curvature of the Fisher metric\.
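The square-root expansion behind Proposition 4.1 can be illustrated numerically. The sketch below uses a random symmetric direction `S` (an illustrative assumption) and measures the remainder $R=(I+\Delta G)^{1/2}-I-\tfrac{1}{2}\Delta G$ for shrinking $\|\Delta G\|_{\mathrm{op}}$. The reference constant $2^{-3/2}$ is also an assumption of this sketch: it comes from the Lagrange remainder of $\sqrt{1+x}$ on $|x|\leq 1/2$ for symmetric $\Delta G$, and is not a value stated in the text.

```python
import numpy as np

def psd_sqrt(G):
    """Symmetric PSD square root via an eigendecomposition."""
    w, V = np.linalg.eigh(G)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def sqrt_remainder_norm(dG):
    """Operator norm of R = (I + dG)^{1/2} - I - dG / 2."""
    I = np.eye(dG.shape[0])
    R = psd_sqrt(I + dG) - I - 0.5 * dG
    return np.linalg.norm(R, ord=2)

rng = np.random.default_rng(4)
S = rng.standard_normal((4, 4))
S = 0.5 * (S + S.T)
S /= np.linalg.norm(S, ord=2)          # symmetric direction with operator norm 1

sizes = [0.4, 0.2, 0.1, 0.05]          # values of ||dG||_op, all below 1/2
ratios = [sqrt_remainder_norm(s * S) / s**2 for s in sizes]
C_taylor = 2.0 ** -1.5                 # scalar Taylor-remainder constant (assumption)

for s, r in zip(sizes, ratios):
    print(f"||dG||_op = {s:.2f}: ||R||_op / ||dG||_op^2 = {r:.4f}")
```

The ratios stay bounded and approach $1/8$, the second-order Taylor coefficient of $\sqrt{1+x}$ in absolute value, consistent with a quadratic remainder.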
### 4\.2Statistical model geometries
The preceding analysis explains Fisher width as a local metric deformation of Gaussian width\. We now illustrate this deformation in concrete statistical models, where the Fisher metricG\(θ\)G\(\\theta\)reflects noise, confidence, and sufficient\-statistic covariance\. Throughout, the base pointθ0\\theta\_\{0\}is fixed andT=ρB2T=\\rho B\_\{2\}denotes a Euclidean ball of radiusρ\\rho\.
###### Example 4\.2\(Bernoulli family\)\.
Let
pθ\(x\)=θx\(1−θ\)1−x,x∈\{0,1\},p\_\{\\theta\}\(x\)=\\theta^\{x\}\(1\-\\theta\)^\{1\-x\},\\qquad x\\in\\\{0,1\\\},with parameterθ∈\(0,1\)\\theta\\in\(0,1\)\. The Fisher information is
G\(θ\)=1θ\(1−θ\)\.G\(\\theta\)=\\frac\{1\}\{\\theta\(1\-\\theta\)\}\.
In one dimension, $T=\rho B_{2}^{1}=[-\rho,\rho]$. Hence
$$w_{G(\theta_{0})}(T)=w\!\left(\frac{1}{\sqrt{\theta_{0}(1-\theta_{0})}}[-\rho,\rho]\right).$$
Therefore
$$w_{G(\theta_{0})}(T)=\frac{\rho}{\sqrt{\theta_{0}(1-\theta_{0})}}\,\mathbb{E}|g|=\rho\sqrt{\frac{2}{\pi\,\theta_{0}(1-\theta_{0})}},$$
where $g\sim N(0,1)$.
Fisher width is minimized at
θ0=12,\\theta\_\{0\}=\\frac\{1\}\{2\},where the Bernoulli distribution has maximal entropy and
wG\(1/2\)\(T\)=ρ8π\.w\_\{G\(1/2\)\}\(T\)=\\rho\\sqrt\{\\frac\{8\}\{\\pi\}\}\.By contrast, asθ0→0\+\\theta\_\{0\}\\to 0^\{\+\}orθ0→1−\\theta\_\{0\}\\to 1^\{\-\}, the Fisher information diverges and
wG\(θ0\)\(T\)→∞\.w\_\{G\(\\theta\_\{0\}\)\}\(T\)\\to\\infty\.
Thus Fisher width detects the boundary singularity of the Bernoulli family: although the Euclidean perturbation setT=\[−ρ,ρ\]T=\[\-\\rho,\\rho\]is fixed, its statistical size diverges as the base distribution approaches a deterministic Bernoulli law\.
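The closed form in this example can be evaluated directly. The sketch below implements $w_{G(\theta_{0})}(T)=\rho\sqrt{2/(\pi\,\theta_{0}(1-\theta_{0}))}$; the function name `bernoulli_fisher_width` is hypothetical.

```python
import math

def bernoulli_fisher_width(theta0, rho=1.0):
    """Fisher width of T = [-rho, rho] at a Bernoulli parameter theta0 in (0, 1)."""
    if not 0.0 < theta0 < 1.0:
        raise ValueError("theta0 must lie strictly between 0 and 1")
    return rho * math.sqrt(2.0 / (math.pi * theta0 * (1.0 - theta0)))

# The width is smallest at theta0 = 1/2 and grows toward the boundary.
for theta0 in (0.5, 0.1, 0.01, 0.001):
    print(theta0, bernoulli_fisher_width(theta0))
```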
###### Example 4\.3\(Gaussian location\-scale family\)\.
Let
X∼N\(μ,τ\),μ∈ℝ,τ=σ2\>0,X\\sim N\(\\mu,\\tau\),\\qquad\\mu\\in\\mathbb\{R\},\\qquad\\tau=\\sigma^\{2\}\>0,with parameter
θ=\(μ,τ\)∈ℝ×\(0,∞\)\.\\theta=\(\\mu,\\tau\)\\in\\mathbb\{R\}\\times\(0,\\infty\)\.The log\-likelihood is
ℓ\(μ,τ;x\)=−12log\(2πτ\)−\(x−μ\)22τ\.\\ell\(\\mu,\\tau;x\)=\-\\frac\{1\}\{2\}\\log\(2\\pi\\tau\)\-\\frac\{\(x\-\\mu\)^\{2\}\}\{2\\tau\}\.A direct computation gives
G\(μ,τ\)=\(1τ0012τ2\)\.G\(\\mu,\\tau\)=\\begin\{pmatrix\}\\dfrac\{1\}\{\\tau\}&0\\\\ 0&\\dfrac\{1\}\{2\\tau^\{2\}\}\\end\{pmatrix\}\.Thus the location and variance directions have different statistical scales: forv=\(vμ,vτ\)v=\(v\_\{\\mu\},v\_\{\\tau\}\),
‖v‖G\(μ,τ\)2=1τvμ2\+12τ2vτ2\.\\\|v\\\|\_\{G\(\\mu,\\tau\)\}^\{2\}=\\frac\{1\}\{\\tau\}v\_\{\\mu\}^\{2\}\+\\frac\{1\}\{2\\tau^\{2\}\}v\_\{\\tau\}^\{2\}\.
Now take
T=ρB22⊂ℝ2\.T=\\rho B\_\{2\}^\{2\}\\subset\\mathbb\{R\}^\{2\}\.Since
$$G(\mu,\tau)^{1/2}=\begin{pmatrix}\tau^{-1/2}&0\\ 0&(\sqrt{2}\tau)^{-1}\end{pmatrix},$$
we obtain
$$w_{G(\mu,\tau)}(T)=\rho\,\mathbb{E}\|G(\mu,\tau)^{1/2}g\|_{2},\qquad g\sim N(0,I_{2}).$$
Equivalently,
$$w_{G(\mu,\tau)}(T)=\rho\,\mathbb{E}\left[\left(\frac{1}{\tau}g_{1}^{2}+\frac{1}{2\tau^{2}}g_{2}^{2}\right)^{1/2}\right],$$
where $g=(g_{1},g_{2})\sim N(0,I_{2})$.
The spectral sandwich bound gives
min\{1τ,12τ2\}w\(T\)≤wG\(μ,τ\)\(T\)≤max\{1τ,12τ2\}w\(T\)\.\\sqrt\{\\min\\left\\\{\\frac\{1\}\{\\tau\},\\frac\{1\}\{2\\tau^\{2\}\}\\right\\\}\}\\,w\(T\)\\leq w\_\{G\(\\mu,\\tau\)\}\(T\)\\leq\\sqrt\{\\max\\left\\\{\\frac\{1\}\{\\tau\},\\frac\{1\}\{2\\tau^\{2\}\}\\right\\\}\}\\,w\(T\)\.
Thus Fisher width is not merely the Euclidean size of the perturbation set\. AlthoughT=ρB22T=\\rho B\_\{2\}^\{2\}is a round Euclidean ball in the\(μ,τ\)\(\\mu,\\tau\)\-space, the Fisher metric transforms it into an anisotropic ellipsoid\. The mean direction is rescaled byτ−1/2\\tau^\{\-1/2\}, while the variance direction is rescaled by\(2τ\)−1\(\\sqrt\{2\}\\tau\)^\{\-1\}\. These scales vary with the noise levelτ\\tau\.
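The expectation in this example has no elementary closed form for general $\tau$, but it is simple to estimate. The following is a minimal Monte Carlo sketch; the function name `location_scale_width`, the sample size, and the values of $\tau$ are illustrative assumptions. It prints the estimate between the two sides of the spectral sandwich bound, all computed from the same Gaussian draws.

```python
import numpy as np

def location_scale_width(tau, rho, g):
    """Monte Carlo estimate of w_G(rho * B_2^2) = rho * E||G(mu, tau)^{1/2} g||_2."""
    root_diag = np.array([tau ** -0.5, 1.0 / (np.sqrt(2.0) * tau)])  # diagonal of G^{1/2}
    return rho * np.linalg.norm(g * root_diag, axis=1).mean()

rng = np.random.default_rng(5)
g = rng.standard_normal((50000, 2))
rho = 1.0
w_ball = rho * np.linalg.norm(g, axis=1).mean()    # estimate of the Euclidean width w(T)

for tau in (0.25, 0.5, 1.0, 4.0):
    lam = np.array([1.0 / tau, 1.0 / (2.0 * tau**2)])   # eigenvalues of G(mu, tau)
    w = location_scale_width(tau, rho, g)
    print(tau, np.sqrt(lam.min()) * w_ball, w, np.sqrt(lam.max()) * w_ball)
```

At $\tau=1/2$ the two eigenvalues coincide, so the Fisher metric is isotropic there and the sandwich bound collapses to an equality.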
###### Example 4\.4\(Logistic regression\)\.
LetX∈ℝdX\\in\\mathbb\{R\}^\{d\}be a covariate with distributionPXP\_\{X\}, and letY∈\{0,1\}Y\\in\\\{0,1\\\}satisfy the conditional Bernoulli model
ℙθ\(Y=1∣X=x\)=σ\(θ⊤x\),σ\(t\)=11\+e−t\.\\mathbb\{P\}\_\{\\theta\}\(Y=1\\mid X=x\)=\\sigma\(\\theta^\{\\top\}x\),\\qquad\\sigma\(t\)=\\frac\{1\}\{1\+e^\{\-t\}\}\.Equivalently,
ℙθ\(Y=y∣X=x\)=σ\(θ⊤x\)y\(1−σ\(θ⊤x\)\)1−y,y∈\{0,1\}\.\\mathbb\{P\}\_\{\\theta\}\(Y=y\\mid X=x\)=\\sigma\(\\theta^\{\\top\}x\)^\{y\}\\bigl\(1\-\\sigma\(\\theta^\{\\top\}x\)\\bigr\)^\{1\-y\},\\qquad y\\in\\\{0,1\\\}\.
The log\-likelihood for one observation\(X,Y\)\(X,Y\)is
logℙθ\(Y∣X\)=Yθ⊤X−log\(1\+eθ⊤X\),\\log\\mathbb\{P\}\_\{\\theta\}\(Y\\mid X\)=Y\\theta^\{\\top\}X\-\\log\(1\+e^\{\\theta^\{\\top\}X\}\),and differentiating with respect toθ\\thetagives the score
∇θlogℙθ\(Y∣X\)=\(Y−σ\(θ⊤X\)\)X\.\\nabla\_\{\\theta\}\\log\\mathbb\{P\}\_\{\\theta\}\(Y\\mid X\)=\\bigl\(Y\-\\sigma\(\\theta^\{\\top\}X\)\\bigr\)X\.
Therefore the Fisher information matrix is
G\(θ\)=𝔼X∼PX\[σ\(θ⊤X\)\(1−σ\(θ⊤X\)\)XX⊤\]\.G\(\\theta\)=\\mathbb\{E\}\_\{X\\sim P\_\{X\}\}\\left\[\\sigma\(\\theta^\{\\top\}X\)\\bigl\(1\-\\sigma\(\\theta^\{\\top\}X\)\\bigr\)XX^\{\\top\}\\right\]\.
The scalar factor
$$\sigma(\theta^{\top}X)\bigl(1-\sigma(\theta^{\top}X)\bigr)$$
is the conditional variance of $Y\mid X$. It is maximized at
$$\theta^{\top}X=0,$$
where it equals $1/4$, and decreases to zero as
$$|\theta^{\top}X|\to\infty.$$
Suppose that, along a sequence of parametersθm\\theta\_\{m\},
\|θm⊤X\|→∞PX\-almost surely,\|\\theta\_\{m\}^\{\\top\}X\|\\to\\infty\\qquad P\_\{X\}\\text\{\-almost surely\},and assume
𝔼‖X‖22<∞\.\\mathbb\{E\}\\\|X\\\|\_\{2\}^\{2\}<\\infty\.Then
σ\(θm⊤X\)\(1−σ\(θm⊤X\)\)XX⊤→0PX\-almost surely\.\\sigma\(\\theta\_\{m\}^\{\\top\}X\)\\bigl\(1\-\\sigma\(\\theta\_\{m\}^\{\\top\}X\)\\bigr\)XX^\{\\top\}\\to 0\\qquad P\_\{X\}\\text\{\-almost surely\}\.Moreover,
0⪯σ\(θm⊤X\)\(1−σ\(θm⊤X\)\)XX⊤⪯14XX⊤\.0\\preceq\\sigma\(\\theta\_\{m\}^\{\\top\}X\)\\bigl\(1\-\\sigma\(\\theta\_\{m\}^\{\\top\}X\)\\bigr\)XX^\{\\top\}\\preceq\\frac\{1\}\{4\}XX^\{\\top\}\.By dominated convergence,
G\(θm\)→0\.G\(\\theta\_\{m\}\)\\to 0\.In particular,
λmax\(G\(θm\)\)→0\.\\lambda\_\{\\max\}\(G\(\\theta\_\{m\}\)\)\\to 0\.
The spectral sandwich bound gives
0≤wG\(θm\)\(T\)≤λmax\(G\(θm\)\)w\(T\)⟶0\.0\\leq w\_\{G\(\\theta\_\{m\}\)\}\(T\)\\leq\\sqrt\{\\lambda\_\{\\max\}\(G\(\\theta\_\{m\}\)\)\}\\,w\(T\)\\longrightarrow 0\.
Thus Fisher width vanishes as $|\theta_{m}^{\top}X|\to\infty$. In this regime, the model assigns probabilities close to $0$ or $1$, the Bernoulli variance collapses, and the Fisher metric contracts.
This example suggests a connection between Fisher width and model confidence\. The experiments in Section[7](https://arxiv.org/html/2606.18306#S7)examine related Fisher\-width behavior in trained logistic, softmax, and ridge models\.
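The contraction described in this example can be seen on synthetic data. In the sketch below, the expectation over $P_{X}$ is replaced by an average over simulated Gaussian covariates, and the parameter is scaled along a fixed direction. The data, the direction, and the function name `logistic_fisher` are assumptions made for illustration, not the experimental setup of the paper.

```python
import numpy as np

def logistic_fisher(theta, X):
    """Sample average of sigma(t) * (1 - sigma(t)) * x x^T with t = theta^T x."""
    t = X @ theta
    # sigma(t) * (1 - sigma(t)) = (1 - tanh(t / 2)^2) / 4, which avoids overflow.
    var = 0.25 * (1.0 - np.tanh(0.5 * t) ** 2)
    return (X * var[:, None]).T @ X / X.shape[0]

rng = np.random.default_rng(6)
n, d = 5000, 3
X = rng.standard_normal((n, d))               # synthetic covariates
direction = np.array([1.0, 0.0, 0.0])
scales = [0.0, 1.0, 4.0, 16.0, 64.0]

# Largest eigenvalue of G(m * direction) for growing m.
lam_max = [np.linalg.eigvalsh(logistic_fisher(m * direction, X))[-1] for m in scales]
for m, lam in zip(scales, lam_max):
    print(f"scale {m:5.1f}: lambda_max = {lam:.5f}")
```

Because the conditional variance decreases in $|\theta^{\top}x|$ for every sample, the largest eigenvalue is nonincreasing along the ray, and the upper bound $\sqrt{\lambda_{\max}(G(\theta_{m}))}\,w(T)$ shrinks with it.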
#### Summary\.
These examples illustrate three qualitatively distinct Fisher geometries\. The Bernoulli family exhibits a singular geometry near the boundary of the parameter space, where Fisher width diverges\. The Gaussian location\-scale family shows that even a simple Euclidean perturbation set can be distorted anisotropically: mean and variance directions are measured on different statistical scales, and these scales vary with the noise level\. Logistic regression exhibits a data\-dependent geometry whose local Fisher structure contracts as predictions become more confident\.
Together, these examples show that Fisher width captures geometric structure that classical Gaussian width treats uniformly: boundary singularities in Bernoulli models, anisotropic noise\-dependent scaling in Gaussian location\-scale models, and confidence\-driven contraction in logistic regression\.
## 5Generalization Bound
### 5\.1Setup and assumptions
LetT⊂ℝdT\\subset\\mathbb\{R\}^\{d\}be a compact parameter set, and letℓ\(θ;z\)\\ell\(\\theta;z\)be a loss function indexed byθ∈T\\theta\\in T, wherez=\(x,y\)z=\(x,y\)\. We assume that
0∈T,ℓ\(0;z\)=00\\in T,\\qquad\\ell\(0;z\)=0for allzz\. Define the population and empirical risks by
R\(θ\)=𝔼z∼𝒟\[ℓ\(θ;z\)\],R^n\(θ\)=1n∑i=1nℓ\(θ;zi\),R\(\\theta\)=\\mathbb\{E\}\_\{z\\sim\\mathcal\{D\}\}\[\\ell\(\\theta;z\)\],\\qquad\\widehat\{R\}\_\{n\}\(\\theta\)=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\\ell\(\\theta;z\_\{i\}\),wherez1,…,znz\_\{1\},\\ldots,z\_\{n\}are independent samples from𝒟\\mathcal\{D\}\.
###### Assumption 5\.2\(Boundedness\)\.
There existsB\>0B\>0such that0≤ℓ\(θ;z\)≤B0\\leq\\ell\(\\theta;z\)\\leq Bfor allθ∈T\\theta\\in Tand all sampleszz\.
###### Assumption 5\.3\(Base\-point Fisher\-Lipschitzness\)\.
Letθ0∈Θ\\theta\_\{0\}\\in\\Thetabe fixed and write
G0:=G\(θ0\)\.G\_\{0\}:=G\(\\theta\_\{0\}\)\.There existsL\>0L\>0such that, for every samplezzand everyθ1,θ2∈T\\theta\_\{1\},\\theta\_\{2\}\\in T,
\|ℓ\(θ1;z\)−ℓ\(θ2;z\)\|≤L‖θ1−θ2‖G0,\|\\ell\(\\theta\_\{1\};z\)\-\\ell\(\\theta\_\{2\};z\)\|\\leq L\\\|\\theta\_\{1\}\-\\theta\_\{2\}\\\|\_\{G\_\{0\}\},where
‖h‖G0=h⊤G0h\.\\\|h\\\|\_\{G\_\{0\}\}=\\sqrt\{h^\{\\top\}G\_\{0\}h\}\.
The base\-point Fisher\-Lipschitz condition in Assumption[5\.3](https://arxiv.org/html/2606.18306#S5.Thmtheorem3)is stated using the fixed metricG0=G\(θ0\)G\_\{0\}=G\(\\theta\_\{0\}\)\. In applications, however, gradient bounds are often more naturally expressed in the local Fisher metricG\(θ\)G\(\\theta\), which varies with the parameter\. The following proposition shows that such a local bound implies base\-point Fisher\-Lipschitzness whenever the metric does not vary too much over the parameter setTT\.
###### Proposition 5\.5\(Metric comparability implies base\-point Fisher\-Lipschitzness\)\.
Let $T\subset\mathbb{R}^{d}$ be convex and let $G_{0}=G(\theta_{0})\succ 0$. Suppose that, for some $\varepsilon\in(0,1)$,
$$G(\theta)\preceq(1+\varepsilon)G_{0}$$
for all $\theta\in T$. Assume also that, for every sample $z$ and every $\theta\in T$,
$$\|\nabla_{\theta}\ell(\theta;z)\|_{G(\theta)^{-1}}\leq L.$$
Then, for every sample $z$ and every $\theta_{1},\theta_{2}\in T$,
$$|\ell(\theta_{1};z)-\ell(\theta_{2};z)|\leq L\sqrt{1+\varepsilon}\,\|\theta_{1}-\theta_{2}\|_{G_{0}}.$$
In particular, Assumption [5.3](https://arxiv.org/html/2606.18306#S5.Thmtheorem3) holds with constant $L\sqrt{1+\varepsilon}$.
###### Proof\.
Fixzzand letθ1,θ2∈T\\theta\_\{1\},\\theta\_\{2\}\\in T\. Define
γt=\(1−t\)θ1\+tθ2,t∈\[0,1\]\.\\gamma\_\{t\}=\(1\-t\)\\theta\_\{1\}\+t\\theta\_\{2\},\\qquad t\\in\[0,1\]\.SinceTTis convex,γt∈T\\gamma\_\{t\}\\in Tfor allt∈\[0,1\]t\\in\[0,1\]\. By the fundamental theorem of calculus,
ℓ\(θ2;z\)−ℓ\(θ1;z\)=∫01⟨∇θℓ\(γt;z\),θ2−θ1⟩𝑑t\.\\ell\(\\theta\_\{2\};z\)\-\\ell\(\\theta\_\{1\};z\)=\\int\_\{0\}^\{1\}\\left\\langle\\nabla\_\{\\theta\}\\ell\(\\gamma\_\{t\};z\),\\theta\_\{2\}\-\\theta\_\{1\}\\right\\rangle\\,dt\.Applying Cauchy–Schwarz in the Fisher metricG\(γt\)G\(\\gamma\_\{t\}\), we obtain
\|ℓ\(θ2;z\)−ℓ\(θ1;z\)\|≤∫01‖∇θℓ\(γt;z\)‖G\(γt\)−1‖θ2−θ1‖G\(γt\)𝑑t\.\|\\ell\(\\theta\_\{2\};z\)\-\\ell\(\\theta\_\{1\};z\)\|\\leq\\int\_\{0\}^\{1\}\\\|\\nabla\_\{\\theta\}\\ell\(\\gamma\_\{t\};z\)\\\|\_\{G\(\\gamma\_\{t\}\)^\{\-1\}\}\\\|\\theta\_\{2\}\-\\theta\_\{1\}\\\|\_\{G\(\\gamma\_\{t\}\)\}\\,dt\.By assumption,
‖∇θℓ\(γt;z\)‖G\(γt\)−1≤L\.\\\|\\nabla\_\{\\theta\}\\ell\(\\gamma\_\{t\};z\)\\\|\_\{G\(\\gamma\_\{t\}\)^\{\-1\}\}\\leq L\.Moreover, since
G\(γt\)⪯\(1\+ε\)G0,G\(\\gamma\_\{t\}\)\\preceq\(1\+\\varepsilon\)G\_\{0\},we have
‖θ2−θ1‖G\(γt\)≤1\+ε‖θ2−θ1‖G0\.\\\|\\theta\_\{2\}\-\\theta\_\{1\}\\\|\_\{G\(\\gamma\_\{t\}\)\}\\leq\\sqrt\{1\+\\varepsilon\}\\\|\\theta\_\{2\}\-\\theta\_\{1\}\\\|\_\{G\_\{0\}\}\.Therefore
\|ℓ\(θ2;z\)−ℓ\(θ1;z\)\|≤L1\+ε‖θ2−θ1‖G0\.\|\\ell\(\\theta\_\{2\};z\)\-\\ell\(\\theta\_\{1\};z\)\|\\leq L\\sqrt\{1\+\\varepsilon\}\\\|\\theta\_\{2\}\-\\theta\_\{1\}\\\|\_\{G\_\{0\}\}\.∎
The Fisher\-Lipschitz assumption is model\-dependent, since it requires a uniform bound on the dual Fisher norm of the loss gradients\. For the classification models used in the experiments, this condition follows from standard Fisher\-leverage bounds\. In particular, logistic regression and multiclass softmax regression satisfy Assumption[5\.3](https://arxiv.org/html/2606.18306#S5.Thmtheorem3)under bounded\-feature and Fisher nondegeneracy conditions; the precise statements and proofs are given in Appendix[A](https://arxiv.org/html/2606.18306#A1)\.
### 5\.2Main Theorem
###### Theorem 5\.6\(Local Fisher\-width generalization bound\)\.
Let
G0:=G\(θ0\)\.G\_\{0\}:=G\(\\theta\_\{0\}\)\.Under Assumptions[5\.2](https://arxiv.org/html/2606.18306#S5.Thmtheorem2)and[5\.3](https://arxiv.org/html/2606.18306#S5.Thmtheorem3), for everyδ∈\(0,1\)\\delta\\in\(0,1\), with probability at least1−δ1\-\\delta,
supθ∈T\|R\(θ\)−R^n\(θ\)\|≤CLwG0\(T−T\)n\+Blog\(1/δ\)2n\.\\sup\_\{\\theta\\in T\}\\bigl\|R\(\\theta\)\-\\widehat\{R\}\_\{n\}\(\\theta\)\\bigr\|\\leq\\frac\{CL\\,w\_\{G\_\{0\}\}\(T\-T\)\}\{\\sqrt\{n\}\}\+B\\sqrt\{\\frac\{\\log\(1/\\delta\)\}\{2n\}\}\.
###### Proof\.
The proof proceeds in four steps\.
Step 1: Symmetrization\.By the standard symmetrization lemma,
𝔼supθ∈T\|R\(θ\)−R^n\(θ\)\|≤2n𝔼𝝈,𝐳supθ∈T\|∑i=1nσiℓ\(θ;zi\)\|,\\mathbb\{E\}\\sup\_\{\\theta\\in T\}\\bigl\|R\(\\theta\)\-\\widehat\{R\}\_\{n\}\(\\theta\)\\bigr\|\\leq\\frac\{2\}\{n\}\\mathbb\{E\}\_\{\\boldsymbol\{\\sigma\},\\mathbf\{z\}\}\\sup\_\{\\theta\\in T\}\\left\|\\sum\_\{i=1\}^\{n\}\\sigma\_\{i\}\\ell\(\\theta;z\_\{i\}\)\\right\|,where
σi∼i\.i\.d\.Unif\{−1,\+1\}\\sigma\_\{i\}\\overset\{\\mathrm\{i\.i\.d\.\}\}\{\\sim\}\\operatorname\{Unif\}\\\{\-1,\+1\\\}are independent Rademacher variables andzi=\(xi,yi\)z\_\{i\}=\(x\_\{i\},y\_\{i\}\)\.
For fixed data, define the Rademacher process
Xθ=∑i=1nσiℓ\(θ;zi\),θ∈T\.X\_\{\\theta\}=\\sum\_\{i=1\}^\{n\}\\sigma\_\{i\}\\ell\(\\theta;z\_\{i\}\),\\qquad\\theta\\in T\.We first control
𝔼𝝈supθ∈TXθ\.\\mathbb\{E\}\_\{\\boldsymbol\{\\sigma\}\}\\sup\_\{\\theta\\in T\}X\_\{\\theta\}\.
Step 2: Subgaussian increments in the Fisher metric\.For fixed data and forθ1,θ2∈T\\theta\_\{1\},\\theta\_\{2\}\\in T,
Xθ1−Xθ2=∑i=1nσi\(ℓ\(θ1;zi\)−ℓ\(θ2;zi\)\)\.X\_\{\\theta\_\{1\}\}\-X\_\{\\theta\_\{2\}\}=\\sum\_\{i=1\}^\{n\}\\sigma\_\{i\}\\bigl\(\\ell\(\\theta\_\{1\};z\_\{i\}\)\-\\ell\(\\theta\_\{2\};z\_\{i\}\)\\bigr\)\.By the standard subgaussian bound for Rademacher sums,
‖Xθ1−Xθ2‖ψ2≤C\(∑i=1n\|ℓ\(θ1;zi\)−ℓ\(θ2;zi\)\|2\)1/2\.\\\|X\_\{\\theta\_\{1\}\}\-X\_\{\\theta\_\{2\}\}\\\|\_\{\\psi\_\{2\}\}\\leq C\\left\(\\sum\_\{i=1\}^\{n\}\|\\ell\(\\theta\_\{1\};z\_\{i\}\)\-\\ell\(\\theta\_\{2\};z\_\{i\}\)\|^\{2\}\\right\)^\{1/2\}\.Using Assumption[5\.3](https://arxiv.org/html/2606.18306#S5.Thmtheorem3), we obtain
‖Xθ1−Xθ2‖ψ2≤CLn‖θ1−θ2‖G0\.\\\|X\_\{\\theta\_\{1\}\}\-X\_\{\\theta\_\{2\}\}\\\|\_\{\\psi\_\{2\}\}\\leq CL\\sqrt\{n\}\\,\\\|\\theta\_\{1\}\-\\theta\_\{2\}\\\|\_\{G\_\{0\}\}\.Thus\(Xθ\)θ∈T\(X\_\{\\theta\}\)\_\{\\theta\\in T\}is a subgaussian process with respect to the metric
d\(θ1,θ2\)=Ln‖θ1−θ2‖G0\.d\(\\theta\_\{1\},\\theta\_\{2\}\)=L\\sqrt\{n\}\\,\\\|\\theta\_\{1\}\-\\theta\_\{2\}\\\|\_\{G\_\{0\}\}\.
Step 3: Chaining and Fisher width\.By the generic chaining bound for subgaussian processes,
𝔼𝝈supθ∈TXθ≤CLnγ2\(T,∥⋅∥G0\)\.\\mathbb\{E\}\_\{\\boldsymbol\{\\sigma\}\}\\sup\_\{\\theta\\in T\}X\_\{\\theta\}\\leq CL\\sqrt\{n\}\\,\\gamma\_\{2\}\(T,\\\|\\cdot\\\|\_\{G\_\{0\}\}\)\.Since
‖θ1−θ2‖G0=‖G01/2θ1−G01/2θ2‖2,\\\|\\theta\_\{1\}\-\\theta\_\{2\}\\\|\_\{G\_\{0\}\}=\\\|G\_\{0\}^\{1/2\}\\theta\_\{1\}\-G\_\{0\}^\{1/2\}\\theta\_\{2\}\\\|\_\{2\},we identifyTTwith the Fisher\-rescaled set
T~:=G01/2T\.\\widetilde\{T\}:=G\_\{0\}^\{1/2\}T\.Hence
γ2\(T,∥⋅∥G0\)=γ2\(T~,∥⋅∥2\)\.\\gamma\_\{2\}\(T,\\\|\\cdot\\\|\_\{G\_\{0\}\}\)=\\gamma\_\{2\}\(\\widetilde\{T\},\\\|\\cdot\\\|\_\{2\}\)\.By Talagrand’s majorizing measure theorem\(Talagrand,[2014](https://arxiv.org/html/2606.18306#bib.bib32)\),
γ2\(T~,∥⋅∥2\)≤Cw\(T~\)\.\\gamma\_\{2\}\(\\widetilde\{T\},\\\|\\cdot\\\|\_\{2\}\)\\leq C\\,w\(\\widetilde\{T\}\)\.Since0∈T0\\in T, we have
T⊆T−T,T\\subseteq T\-T,and therefore
T~=G01/2T⊂G01/2\(T−T\)\.\\widetilde\{T\}=G\_\{0\}^\{1/2\}T\\subset G\_\{0\}^\{1/2\}\(T\-T\)\.Thus
w\(T~\)≤w\(G01/2\(T−T\)\)=wG0\(T−T\)\.w\(\\widetilde\{T\}\)\\leq w\\bigl\(G\_\{0\}^\{1/2\}\(T\-T\)\\bigr\)=w\_\{G\_\{0\}\}\(T\-T\)\.Consequently,
𝔼𝝈supθ∈TXθ≤CLnwG0\(T−T\)\.\\mathbb\{E\}\_\{\\boldsymbol\{\\sigma\}\}\\sup\_\{\\theta\\in T\}X\_\{\\theta\}\\leq CL\\sqrt\{n\}\\,w\_\{G\_\{0\}\}\(T\-T\)\.
We now apply the same argument to the process\(−Xθ\)θ∈T\(\-X\_\{\\theta\}\)\_\{\\theta\\in T\}\. For everyθ1,θ2∈T\\theta\_\{1\},\\theta\_\{2\}\\in T,
\(−Xθ1\)−\(−Xθ2\)=−\(Xθ1−Xθ2\),\(\-X\_\{\\theta\_\{1\}\}\)\-\(\-X\_\{\\theta\_\{2\}\}\)=\-\(X\_\{\\theta\_\{1\}\}\-X\_\{\\theta\_\{2\}\}\),and hence
\|\(−Xθ1\)−\(−Xθ2\)\|=\|Xθ1−Xθ2\|\.\|\(\-X\_\{\\theta\_\{1\}\}\)\-\(\-X\_\{\\theta\_\{2\}\}\)\|=\|X\_\{\\theta\_\{1\}\}\-X\_\{\\theta\_\{2\}\}\|\.Thus\(−Xθ\)θ∈T\(\-X\_\{\\theta\}\)\_\{\\theta\\in T\}has the same increment size as\(Xθ\)θ∈T\(X\_\{\\theta\}\)\_\{\\theta\\in T\}, so the same chaining bound gives
𝔼𝝈supθ∈T\(−Xθ\)≤CLnwG0\(T−T\)\.\\mathbb\{E\}\_\{\\boldsymbol\{\\sigma\}\}\\sup\_\{\\theta\\in T\}\(\-X\_\{\\theta\}\)\\leq CL\\sqrt\{n\}\\,w\_\{G\_\{0\}\}\(T\-T\)\.Since
supθ∈T\|Xθ\|≤supθ∈TXθ\+supθ∈T\(−Xθ\),\\sup\_\{\\theta\\in T\}\|X\_\{\\theta\}\|\\leq\\sup\_\{\\theta\\in T\}X\_\{\\theta\}\+\\sup\_\{\\theta\\in T\}\(\-X\_\{\\theta\}\),we conclude that
𝔼𝝈supθ∈T\|Xθ\|≤CLnwG0\(T−T\)\.\\mathbb\{E\}\_\{\\boldsymbol\{\\sigma\}\}\\sup\_\{\\theta\\in T\}\|X\_\{\\theta\}\|\\leq CL\\sqrt\{n\}\\,w\_\{G\_\{0\}\}\(T\-T\)\.Combining this estimate with Step 1 yields
𝔼supθ∈T\|R\(θ\)−R^n\(θ\)\|≤CLwG0\(T−T\)n\.\\mathbb\{E\}\\sup\_\{\\theta\\in T\}\\bigl\|R\(\\theta\)\-\\widehat\{R\}\_\{n\}\(\\theta\)\\bigr\|\\leq\\frac\{CL\\,w\_\{G\_\{0\}\}\(T\-T\)\}\{\\sqrt\{n\}\}\.
Step 4: Concentration\.Define
f\(z1,…,zn\)=supθ∈T\|R\(θ\)−R^n\(θ\)\|\.f\(z\_\{1\},\\ldots,z\_\{n\}\)=\\sup\_\{\\theta\\in T\}\\bigl\|R\(\\theta\)\-\\widehat\{R\}\_\{n\}\(\\theta\)\\bigr\|\.Changing one sampleziz\_\{i\}changesR^n\(θ\)\\widehat\{R\}\_\{n\}\(\\theta\)by at mostB/nB/n, uniformly overθ\\theta, becauseℓ\(θ;z\)∈\[0,B\]\\ell\(\\theta;z\)\\in\[0,B\]\. Henceffsatisfies the bounded differences condition with constants
ci=Bn,i=1,…,n\.c\_\{i\}=\\frac\{B\}\{n\},\\qquad i=1,\\ldots,n\.By McDiarmid’s inequality,
ℙ\(f−𝔼f≥t\)≤exp\(−2nt2B2\)\.\\mathbb\{P\}\\left\(f\-\\mathbb\{E\}f\\geq t\\right\)\\leq\\exp\\left\(\-\\frac\{2nt^\{2\}\}\{B^\{2\}\}\\right\)\.Taking
t=Blog\(1/δ\)2nt=B\\sqrt\{\\frac\{\\log\(1/\\delta\)\}\{2n\}\}and adding this concentration term to the expectation bound proves the claim\. ∎
### 5\.3Tightness for Exponential Families
Theorem[5\.6](https://arxiv.org/html/2606.18306#S5.Thmtheorem6)gives an upper bound controlled by Fisher width\. We now show that this scale is sharp in a canonical setting: exponential families with negative log\-likelihood loss\. The result should be understood as a sharpness example rather than a matching lower bound for all Fisher\-Lipschitz losses\.
Let
pθ\(z\)=h\(z\)exp\(θ⊤ϕ\(z\)−A\(θ\)\),θ∈Θ⊆ℝd,p\_\{\\theta\}\(z\)=h\(z\)\\exp\\bigl\(\\theta^\{\\top\}\\phi\(z\)\-A\(\\theta\)\\bigr\),\\qquad\\theta\\in\\Theta\\subseteq\\mathbb\{R\}^\{d\},be a regular exponential family, and fix a base pointθ0∈Θ\\theta\_\{0\}\\in\\Theta\. WriteG0=G\(θ0\)=Covpθ0\(ϕ\(Z\)\)G\_\{0\}=G\(\\theta\_\{0\}\)=\\operatorname\{Cov\}\_\{p\_\{\\theta\_\{0\}\}\}\(\\phi\(Z\)\)and assumeG0≻0G\_\{0\}\\succ 0\. Foru∈ℝdu\\in\\mathbb\{R\}^\{d\}, define the local perturbation loss
ℓ~\(u;z\)=−logpθ0\+u\(z\)\+logpθ0\(z\),\\widetilde\{\\ell\}\(u;z\)=\-\\log p\_\{\\theta\_\{0\}\+u\}\(z\)\+\\log p\_\{\\theta\_\{0\}\}\(z\),and assumeθ0\+u∈Θ\\theta\_\{0\}\+u\\in\\Thetafor allu∈Tu\\in T\. We takeT=rB2dT=rB\_\{2\}^\{d\}\.
###### Lemma 5\.8\(Exact gap for exponential\-family NLL\)\.
LetZ1,…,Zn∼i\.i\.d\.pθ0Z\_\{1\},\\ldots,Z\_\{n\}\\overset\{\\mathrm\{i\.i\.d\.\}\}\{\\sim\}p\_\{\\theta\_\{0\}\}, and define
ϕ¯n=1n∑i=1nϕ\(Zi\),Δn=ϕ¯n−𝔼pθ0ϕ\(Z\)\.\\overline\{\\phi\}\_\{n\}=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\\phi\(Z\_\{i\}\),\\qquad\\Delta\_\{n\}=\\overline\{\\phi\}\_\{n\}\-\\mathbb\{E\}\_\{p\_\{\\theta\_\{0\}\}\}\\phi\(Z\)\.Then
supu∈rB2d\|R\(u\)−R^n\(u\)\|=r‖Δn‖2,\\sup\_\{u\\in rB\_\{2\}^\{d\}\}\\bigl\|R\(u\)\-\\widehat\{R\}\_\{n\}\(u\)\\bigr\|=r\\\|\\Delta\_\{n\}\\\|\_\{2\},whereRRandR^n\\widehat\{R\}\_\{n\}are the population and empirical risks associated withℓ~\(u;z\)\\widetilde\{\\ell\}\(u;z\)\. The proof is given in Appendix[B](https://arxiv.org/html/2606.18306#A2)\.
###### Theorem 5\.9\(Sharp Fisher\-width scale for exponential families\)\.
Assume𝔼pθ0‖ϕ\(Z\)‖22<∞\\mathbb\{E\}\_\{p\_\{\\theta\_\{0\}\}\}\\\|\\phi\(Z\)\\\|\_\{2\}^\{2\}<\\infty\. Then
limn→∞n𝔼\[supu∈rB2d\|R\(u\)−R^n\(u\)\|\]=wG0\(rB2d\)\.\\lim\_\{n\\to\\infty\}\\sqrt\{n\}\\;\\mathbb\{E\}\\\!\\left\[\\sup\_\{u\\in rB\_\{2\}^\{d\}\}\\bigl\|R\(u\)\-\\widehat\{R\}\_\{n\}\(u\)\\bigr\|\\right\]=w\_\{G\_\{0\}\}\(rB\_\{2\}^\{d\}\)\.Consequently, for everyε∈\(0,1\)\\varepsilon\\in\(0,1\)there existsNεN\_\{\\varepsilon\}such that for alln≥Nεn\\geq N\_\{\\varepsilon\},
𝔼\[supu∈rB2d\|R\(u\)−R^n\(u\)\|\]≥\(1−ε\)wG0\(rB2d\)n\.\\mathbb\{E\}\\\!\\left\[\\sup\_\{u\\in rB\_\{2\}^\{d\}\}\\bigl\|R\(u\)\-\\widehat\{R\}\_\{n\}\(u\)\\bigr\|\\right\]\\geq\\frac\{\(1\-\\varepsilon\)\\,w\_\{G\_\{0\}\}\(rB\_\{2\}^\{d\}\)\}\{\\sqrt\{n\}\}\.The proof is given in Appendix[B](https://arxiv.org/html/2606.18306#A2)\.
### 5\.4Fisher Width of Euclidean Balls
Theorem[5\.6](https://arxiv.org/html/2606.18306#S5.Thmtheorem6)shows that local generalization is governed by Fisher width\. We next evaluate this quantity for a canonical parameter set, the Euclidean ballT=rB2d\.T=rB\_\{2\}^\{d\}\.As observed in Example 2\.4,
wG\(rB2d\)=r𝔼‖G1/2g‖2≤rTr\(G\)\.w\_\{G\}\(rB\_\{2\}^\{d\}\)=r\\,\\mathbb\{E\}\\\|G^\{1/2\}g\\\|\_\{2\}\\leq r\\sqrt\{\\operatorname\{Tr\}\(G\)\}\.The following result shows that this upper bound is sharp up to universal constants\. Thus, for Euclidean balls, Fisher width is controlled precisely by the total Fisher informationTr\(G\)\\operatorname\{Tr\}\(G\), giving an intrinsic effective dimension for the local model\.
###### Theorem 5\.13\(Fisher width of Euclidean balls\)\.
LetG⪰0G\\succeq 0and letT=rB2d\.T=rB\_\{2\}^\{d\}\.Then
2πrTr\(G\)≤wG\(T\)≤rTr\(G\)\.\\sqrt\{\\frac\{2\}\{\\pi\}\}\\,r\\sqrt\{\\operatorname\{Tr\}\(G\)\}\\leq w\_\{G\}\(T\)\\leq r\\sqrt\{\\operatorname\{Tr\}\(G\)\}\.Consequently,
wG\(rB2d\)≍rTr\(G\),w\_\{G\}\(rB\_\{2\}^\{d\}\)\\asymp r\\sqrt\{\\operatorname\{Tr\}\(G\)\},where the implied constants are universal\.
###### Proof\.
The upper bound follows from Example 2\.4, so it remains only to prove the lower bound\. Letλ1,…,λd\\lambda\_\{1\},\\ldots,\\lambda\_\{d\}be the eigenvalues ofGG\. By rotational invariance of the standard Gaussian,
wG\(rB2d\)=r𝔼\(∑i=1dλigi2\)1/2,gi∼i\.i\.d\.N\(0,1\)\.w\_\{G\}\(rB\_\{2\}^\{d\}\)=r\\,\\mathbb\{E\}\\Bigl\(\\sum\_\{i=1\}^\{d\}\\lambda\_\{i\}g\_\{i\}^\{2\}\\Bigr\)^\{1/2\},\\qquad g\_\{i\}\\overset\{\\mathrm\{i\.i\.d\.\}\}\{\\sim\}N\(0,1\)\.IfTr\(G\)=0\\mathrm\{Tr\}\(G\)=0, thenG=0G=0and the claim is trivial\. Otherwise, set
pi=λiTr\(G\)\.p\_\{i\}=\\frac\{\\lambda\_\{i\}\}\{\\mathrm\{Tr\}\(G\)\}\.Thenp=\(p1,…,pd\)p=\(p\_\{1\},\\ldots,p\_\{d\}\)lies in the probability simplex, and
wG\(rB2d\)=rTr\(G\)𝔼\(∑i=1dpigi2\)1/2\.w\_\{G\}\(rB\_\{2\}^\{d\}\)=r\\sqrt\{\\mathrm\{Tr\}\(G\)\}\\,\\mathbb\{E\}\\Bigl\(\\sum\_\{i=1\}^\{d\}p\_\{i\}g\_\{i\}^\{2\}\\Bigr\)^\{1/2\}\.It remains to lower bound the last expectation uniformly over all probability vectorspp\.
For each fixed realization ofgg, the map
p↦\(∑i=1dpigi2\)1/2p\\mapsto\\Bigl\(\\sum\_\{i=1\}^\{d\}p\_\{i\}g\_\{i\}^\{2\}\\Bigr\)^\{1/2\}is concave on the simplex\. Sincep=∑ipieip=\\sum\_\{i\}p\_\{i\}e\_\{i\}, concavity gives
\(∑i=1dpigi2\)1/2≥∑i=1dpi\|gi\|\.\\Bigl\(\\sum\_\{i=1\}^\{d\}p\_\{i\}g\_\{i\}^\{2\}\\Bigr\)^\{1/2\}\\geq\\sum\_\{i=1\}^\{d\}p\_\{i\}\|g\_\{i\}\|\.Taking expectations,
𝔼\(∑i=1dpigi2\)1/2≥∑i=1dpi𝔼\|gi\|=2π\.\\mathbb\{E\}\\Bigl\(\\sum\_\{i=1\}^\{d\}p\_\{i\}g\_\{i\}^\{2\}\\Bigr\)^\{1/2\}\\geq\\sum\_\{i=1\}^\{d\}p\_\{i\}\\,\\mathbb\{E\}\|g\_\{i\}\|=\\sqrt\{\\frac\{2\}\{\\pi\}\}\.Therefore
wG\(rB2d\)≥2πrTr\(G\)\.w\_\{G\}\(rB\_\{2\}^\{d\}\)\\geq\\sqrt\{\\frac\{2\}\{\\pi\}\}\\,r\\sqrt\{\\mathrm\{Tr\}\(G\)\}\.This proves the desired lower bound\. ∎
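The two-sided bound of Theorem 5.13 is easy to probe numerically. The following sketch estimates the width from the eigenvalue representation used in the proof and reports its ratio to the upper scale `r * sqrt(Tr(G))`. The three spectra are hypothetical test cases chosen to show the rank-one extreme, an intermediate decay, and the isotropic regime.

```python
import numpy as np

def fisher_width_ball_from_eigs(eigs, r, num_draws, rng):
    """Monte Carlo estimate of r * E (sum_i lambda_i g_i^2)^{1/2}."""
    g = rng.standard_normal((num_draws, eigs.size))
    return r * np.sqrt((g ** 2) @ eigs).mean()

rng = np.random.default_rng(1)
r = 1.5
spectra = {
    "rank-one": np.array([4.0, 0.0, 0.0, 0.0]),
    "decaying": 1.0 / np.arange(1, 51) ** 2,
    "isotropic": np.ones(100),
}

results = {}
for name, eigs in spectra.items():
    w = fisher_width_ball_from_eigs(eigs, r, num_draws=20000, rng=rng)
    upper = r * np.sqrt(eigs.sum())  # r * sqrt(Tr(G))
    results[name] = w / upper        # should lie in [sqrt(2/pi), 1]
    print(f"{name:10s} ratio = {results[name]:.4f}")

print(f"lower constant sqrt(2/pi) = {np.sqrt(2 / np.pi):.4f}")
```

Up to Monte Carlo error, the rank-one ratio should sit at the lower constant and the isotropic ratio close to one, with the decaying spectrum in between.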
### 5\.5Effective Fisher Dimension
Theorem[5\.13](https://arxiv.org/html/2606.18306#S5.Thmtheorem13)suggests a natural way to separate the scale of the Fisher metric from its effective dimensionality\. For a nonzero matrixG⪰0G\\succeq 0, define
dF\(G\):=Tr\(G\)λmax\(G\),d\_\{F\}\(G\):=\\frac\{\\mathrm\{Tr\}\(G\)\}\{\\lambda\_\{\\max\}\(G\)\},which we call the*effective Fisher dimension*\.
The normalization byλmax\(G\)\\lambda\_\{\\max\}\(G\)removes the largest local Fisher scale\. What remains is a dimension\-like quantity measuring how many Fisher\-relevant directions contribute to the trace\. Indeed,
Tr\(G\)=λmax\(G\)dF\(G\)\.\\mathrm\{Tr\}\(G\)=\\lambda\_\{\\max\}\(G\)d\_\{F\}\(G\)\.Substituting this identity into Theorem[5\.13](https://arxiv.org/html/2606.18306#S5.Thmtheorem13)immediately yields the following characterization\.
###### Corollary 5\.14\(Effective Fisher dimension\)\.
LetG⪰0G\\succeq 0be nonzero and letT=rB2dT=rB\_\{2\}^\{d\}\. Then
2πdF\(G\)≤wG\(T\)2r2λmax\(G\)≤dF\(G\)\.\\frac\{2\}\{\\pi\}\\,d\_\{F\}\(G\)\\leq\\frac\{w\_\{G\}\(T\)^\{2\}\}\{r^\{2\}\\lambda\_\{\\max\}\(G\)\}\\leq d\_\{F\}\(G\)\.Equivalently,
wG\(rB2d\)2≍r2λmax\(G\)dF\(G\),w\_\{G\}\(rB\_\{2\}^\{d\}\)^\{2\}\\asymp r^\{2\}\\lambda\_\{\\max\}\(G\)d\_\{F\}\(G\),with universal constants\.
As a direct consequence, the local generalization bound can be expressed in terms of the effective Fisher dimension\. Indeed, ifT⊆rB2dT\\subseteq rB\_\{2\}^\{d\}, then
T−T⊆2rB2d\.T\-T\\subseteq 2rB\_\{2\}^\{d\}\.Hence
wG0\(T−T\)≤wG0\(2rB2d\)≲rλmax\(G0\)dF\(G0\)\.w\_\{G\_\{0\}\}\(T\-T\)\\leq w\_\{G\_\{0\}\}\(2rB\_\{2\}^\{d\}\)\\lesssim r\\sqrt\{\\lambda\_\{\\max\}\(G\_\{0\}\)d\_\{F\}\(G\_\{0\}\)\}\.Substituting this estimate into Theorem[5\.6](https://arxiv.org/html/2606.18306#S5.Thmtheorem6)gives, with probability at least1−δ1\-\\delta,
supθ∈T\|R\(θ\)−R^n\(θ\)\|≲Lrλmax\(G0\)dF\(G0\)n\+Blog\(1/δ\)2n\.\\sup\_\{\\theta\\in T\}\|R\(\\theta\)\-\\widehat\{R\}\_\{n\}\(\\theta\)\|\\lesssim\\frac\{Lr\\sqrt\{\\lambda\_\{\\max\}\(G\_\{0\}\)d\_\{F\}\(G\_\{0\}\)\}\}\{\\sqrt\{n\}\}\+B\\sqrt\{\\frac\{\\log\(1/\\delta\)\}\{2n\}\}\.
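As a small illustration of the definition, the effective Fisher dimension can be computed directly from the spectrum. The helper name and the two example matrices below are hypothetical; the sketch only assumes a symmetric positive semidefinite input.

```python
import numpy as np

def effective_fisher_dimension(G):
    """Return Tr(G) / lambda_max(G) for a nonzero symmetric PSD matrix G."""
    eigs = np.linalg.eigvalsh(G)  # eigenvalues in ascending order
    lam_max = eigs[-1]
    if lam_max <= 0:
        raise ValueError("G must be nonzero and positive semidefinite")
    return np.trace(G) / lam_max

G_iso = 3.0 * np.eye(10)                   # all directions equally informative
G_spiked = np.diag([100.0] + [1.0] * 9)    # one dominant Fisher direction

d_iso = effective_fisher_dimension(G_iso)
d_spiked = effective_fisher_dimension(G_spiked)
print(d_iso, d_spiked)
```

The isotropic matrix has effective dimension equal to the ambient dimension, while the spiked matrix has effective dimension close to one, even though both live in the same ten-dimensional parameter space.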
## 6Computational Estimation of Fisher Width
Fisher width depends on the Fisher information matrixG=G\(θ0\)G=G\(\\theta\_\{0\}\), which is typically unavailable in closed form\. In practice, one can replaceGGby a computable positive semidefinite approximationG^⪰0\\widehat\{G\}\\succeq 0, such as an empirical, diagonal, low\-rank, or structured Fisher approximation\.
The basic stability principle is
\|wG^\(T\)−wG\(T\)\|≤‖G^1/2−G1/2‖opw\(T\),\\bigl\|w\_\{\\widehat\{G\}\}\(T\)\-w\_\{G\}\(T\)\\bigr\|\\leq\\\|\\widehat\{G\}^\{1/2\}\-G^\{1/2\}\\\|\_\{\\mathrm\{op\}\}\\,w\(T\),as established in Theorem[3\.1](https://arxiv.org/html/2606.18306#S3.Thmtheorem1)\. Thus Fisher width can be approximated by
wG^\(T\)=w\(G^1/2T\),w\_\{\\widehat\{G\}\}\(T\)=w\(\\widehat\{G\}^\{1/2\}T\),with an error controlled by the square\-root approximation error of the Fisher matrix\.
Assume that the support function
hT\(c\)=supv∈T⟨c,v⟩h\_\{T\}\(c\)=\\sup\_\{v\\in T\}\\langle c,v\\ranglecan be evaluated efficiently\. Then
wG^\(T\)=𝔼ghT\(G^1/2g\),g∼N\(0,Id\),w\_\{\\widehat\{G\}\}\(T\)=\\mathbb\{E\}\_\{g\}h\_\{T\}\(\\widehat\{G\}^\{1/2\}g\),\\qquad g\\sim N\(0,I\_\{d\}\),can be estimated by Monte Carlo sampling\. The estimators below differ in howG^1/2\\widehat\{G\}^\{1/2\}is constructed\.
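The Monte Carlo recipe above can be sketched generically once a support function is available. The code below is a minimal illustration, not the paper's implementation: it forms a symmetric square root by eigendecomposition and averages the support function over Gaussian draws. The two support functions use the standard facts that the support function of the Euclidean ball of radius `r` is `r * ||c||_2` and that of the l1 ball is `r * ||c||_inf`. All names and the test matrix are hypothetical.

```python
import numpy as np

def psd_sqrt(G):
    """Symmetric PSD square root; tiny negative eigenvalues are clipped."""
    eigs, U = np.linalg.eigh(G)
    return (U * np.sqrt(np.clip(eigs, 0.0, None))) @ U.T

def mc_fisher_width(G_hat, support_fn, num_draws, rng):
    """Return (estimate, standard error) of E h_T(G_hat^{1/2} g)."""
    root = psd_sqrt(G_hat)
    g = rng.standard_normal((num_draws, G_hat.shape[0]))
    # Row b of (g @ root) equals root @ g_b because root is symmetric.
    values = support_fn(g @ root)
    return values.mean(), values.std(ddof=1) / np.sqrt(num_draws)

def support_l2_ball(c, r=1.0):
    return r * np.linalg.norm(c, axis=-1)

def support_l1_ball(c, r=1.0):
    return r * np.max(np.abs(c), axis=-1)

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
G_hat = A @ A.T / 6  # hypothetical PSD Fisher approximation

w2, se2 = mc_fisher_width(G_hat, support_l2_ball, 20000, rng)
w1, se1 = mc_fisher_width(G_hat, support_l1_ball, 20000, rng)
print(f"l2 ball: {w2:.4f} +/- {se2:.4f}")
print(f"l1 ball: {w1:.4f} +/- {se1:.4f}")
```

Returning the standard error alongside the estimate makes it possible to choose the number of draws so that the Monte Carlo error is small relative to the Fisher approximation error.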
### 6\.1Full Empirical Fisher Estimator
Given i\.i\.d\. samplesz1,…,znz\_\{1\},\\ldots,z\_\{n\}, define the empirical Fisher matrix
G^n=1n∑i=1nsθ0\(zi\)sθ0\(zi\)⊤,\\widehat\{G\}\_\{n\}=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}s\_\{\\theta\_\{0\}\}\(z\_\{i\}\)s\_\{\\theta\_\{0\}\}\(z\_\{i\}\)^\{\\top\},where
sθ0\(z\)=∇θlogpθ\(z\)\|θ=θ0s\_\{\\theta\_\{0\}\}\(z\)=\\nabla\_\{\\theta\}\\log p\_\{\\theta\}\(z\)\\big\|\_\{\\theta=\\theta\_\{0\}\}is the score\. The corresponding Monte Carlo estimator is
w^G^nMC\(T\)=1B∑b=1BhT\(G^n1/2gb\),gb∼i\.i\.d\.N\(0,Id\)\.\\widehat\{w\}\_\{\\widehat\{G\}\_\{n\}\}^\{\\mathrm\{MC\}\}\(T\)=\\frac\{1\}\{B\}\\sum\_\{b=1\}^\{B\}h\_\{T\}\(\\widehat\{G\}\_\{n\}^\{1/2\}g\_\{b\}\),\\qquad g\_\{b\}\\overset\{\\mathrm\{i\.i\.d\.\}\}\{\\sim\}N\(0,I\_\{d\}\)\.
The total error decomposes into Monte Carlo error and Fisher estimation error:
\|w^G^nMC\(T\)−wG\(T\)\|≤\|w^G^nMC\(T\)−wG^n\(T\)\|⏟MonteCarloerror\+\|wG^n\(T\)−wG\(T\)\|⏟Fisherestimationerror\.\\bigl\|\\widehat\{w\}\_\{\\widehat\{G\}\_\{n\}\}^\{\\mathrm\{MC\}\}\(T\)\-w\_\{G\}\(T\)\\bigr\|\\leq\\underbrace\{\\bigl\|\\widehat\{w\}\_\{\\widehat\{G\}\_\{n\}\}^\{\\mathrm\{MC\}\}\(T\)\-w\_\{\\widehat\{G\}\_\{n\}\}\(T\)\\bigr\|\}\_\{\\mathrm\{Monte\\ Carlo\\ error\}\}\+\\underbrace\{\\bigl\|w\_\{\\widehat\{G\}\_\{n\}\}\(T\)\-w\_\{G\}\(T\)\\bigr\|\}\_\{\\mathrm\{Fisher\\ estimation\\ error\}\}\.The Monte Carlo term vanishes asB→∞B\\to\\inftyby the law of large numbers\. The second term is controlled by stability\. In particular, on any event where
‖G^n1/2−G1/2‖op≤εn\(δ\),\\\|\\widehat\{G\}\_\{n\}^\{1/2\}\-G^\{1/2\}\\\|\_\{\\mathrm\{op\}\}\\leq\\varepsilon\_\{n\}\(\\delta\),one has
\|wG^n\(T\)−wG\(T\)\|≤εn\(δ\)w\(T\)\.\\bigl\|w\_\{\\widehat\{G\}\_\{n\}\}\(T\)\-w\_\{G\}\(T\)\\bigr\|\\leq\\varepsilon\_\{n\}\(\\delta\)\\,w\(T\)\.For dense matrices, the dominant costs are formingG^n\\widehat\{G\}\_\{n\}and computing a spectral factorization, which scale as
O\(nd2\+d3\)\.O\(nd^\{2\}\+d^\{3\}\)\.
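The full empirical Fisher pipeline can be sketched end to end for binary logistic regression, where the score of sample `(x, y)` with label in `{0, 1}` is `(y - sigmoid(x^T theta)) * x`. The synthetic data, the parameter `theta0`, and the function names below are assumptions made for illustration; they are not the MNIST setup used in the experiments.

```python
import numpy as np

def logistic_scores(X, y, theta):
    """Per-sample scores (y_i - sigmoid(x_i^T theta)) * x_i, shape (n, d)."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return (y - p)[:, None] * X

def empirical_fisher(scores):
    """Average of score outer products; costs O(n d^2)."""
    return scores.T @ scores / scores.shape[0]

def mc_width_ball(G_hat, r, num_draws, rng):
    """Monte Carlo Fisher width of r * B_2^d; the eigh call costs O(d^3)."""
    eigs, U = np.linalg.eigh(G_hat)
    root = (U * np.sqrt(np.clip(eigs, 0.0, None))) @ U.T
    g = rng.standard_normal((num_draws, G_hat.shape[0]))
    return r * np.linalg.norm(g @ root, axis=1).mean()

rng = np.random.default_rng(3)
n, d = 2000, 20
X = rng.standard_normal((n, d))                    # synthetic features
theta0 = rng.standard_normal(d) / np.sqrt(d)       # hypothetical base point
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ theta0)))).astype(float)

S = logistic_scores(X, y, theta0)
G_n = empirical_fisher(S)
w_hat = mc_width_ball(G_n, r=1.0, num_draws=5000, rng=rng)
print(f"estimated Fisher width: {w_hat:.4f}")
print(f"score upper scale:      {np.sqrt(np.trace(G_n)):.4f}")
```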
### 6\.2Low\-Rank Fisher Approximation
When the empirical Fisher spectrum decays rapidly, one may replace the full spectral factorization by a rank\-kkapproximation\. Let
G^n=UΛU⊤\\widehat\{G\}\_\{n\}=U\\Lambda U^\{\\top\}be the eigendecomposition of the empirical Fisher matrix, and let
G^k=UkΛkUk⊤\\widehat\{G\}\_\{k\}=U\_\{k\}\\Lambda\_\{k\}U\_\{k\}^\{\\top\}be its rank\-kktruncation\. The corresponding low\-rank approximation is
wG^k\(T\)=𝔼ghT\(G^k1/2g\)\.w\_\{\\widehat\{G\}\_\{k\}\}\(T\)=\\mathbb\{E\}\_\{g\}h\_\{T\}\(\\widehat\{G\}\_\{k\}^\{1/2\}g\)\.
###### Proposition 6\.1\(Low\-rank approximation error\)\.
LetG^k\\widehat\{G\}\_\{k\}be the rank\-kktruncation ofG^n\\widehat\{G\}\_\{n\}\. Then, for everyT⊆ℝdT\\subseteq\\mathbb\{R\}^\{d\},
\|wG^k\(T\)−wG^n\(T\)\|≤λk\+1\(G^n\)w\(T\),\\bigl\|w\_\{\\widehat\{G\}\_\{k\}\}\(T\)\-w\_\{\\widehat\{G\}\_\{n\}\}\(T\)\\bigr\|\\leq\\sqrt\{\\lambda\_\{k\+1\}\(\\widehat\{G\}\_\{n\}\)\}\\,w\(T\),with the convention thatλk\+1\(G^n\)=0\\lambda\_\{k\+1\}\(\\widehat\{G\}\_\{n\}\)=0ifk≥rank\(G^n\)k\\geq\\operatorname\{rank\}\(\\widehat\{G\}\_\{n\}\)\. Consequently, if
\|wG^n\(T\)−wG\(T\)\|≤εn\(δ\)w\(T\)\\bigl\|w\_\{\\widehat\{G\}\_\{n\}\}\(T\)\-w\_\{G\}\(T\)\\bigr\|\\leq\\varepsilon\_\{n\}\(\\delta\)\\,w\(T\)with probability at least1−δ1\-\\delta, then
\|wG^k\(T\)−wG\(T\)\|≤\(λk\+1\(G^n\)\+εn\(δ\)\)w\(T\)\\bigl\|w\_\{\\widehat\{G\}\_\{k\}\}\(T\)\-w\_\{G\}\(T\)\\bigr\|\\leq\\Bigl\(\\sqrt\{\\lambda\_\{k\+1\}\(\\widehat\{G\}\_\{n\}\)\}\+\\varepsilon\_\{n\}\(\\delta\)\\Bigr\)w\(T\)with probability at least1−δ1\-\\delta\.
###### Proof\.
By the stability principle,
\|wG^k\(T\)−wG^n\(T\)\|≤‖G^k1/2−G^n1/2‖opw\(T\)\.\\bigl\|w\_\{\\widehat\{G\}\_\{k\}\}\(T\)\-w\_\{\\widehat\{G\}\_\{n\}\}\(T\)\\bigr\|\\leq\\\|\\widehat\{G\}\_\{k\}^\{1/2\}\-\\widehat\{G\}\_\{n\}^\{1/2\}\\\|\_\{\\mathrm\{op\}\}\\,w\(T\)\.SinceG^k\\widehat\{G\}\_\{k\}is obtained by truncating the spectrum ofG^n\\widehat\{G\}\_\{n\},
G^n−G^k=∑i\>kλi\(G^n\)uiui⊤\.\\widehat\{G\}\_\{n\}\-\\widehat\{G\}\_\{k\}=\\sum\_\{i\>k\}\\lambda\_\{i\}\(\\widehat\{G\}\_\{n\}\)u\_\{i\}u\_\{i\}^\{\\top\}\.Therefore,
‖G^k1/2−G^n1/2‖op=λk\+1\(G^n\)\.\\\|\\widehat\{G\}\_\{k\}^\{1/2\}\-\\widehat\{G\}\_\{n\}^\{1/2\}\\\|\_\{\\mathrm\{op\}\}=\\sqrt\{\\lambda\_\{k\+1\}\(\\widehat\{G\}\_\{n\}\)\}\.This proves the first inequality\. The second follows by the triangle inequality\. ∎
Thus the computational error is controlled by the discarded Fisher spectrum\. If the eigenvalues ofG^n\\widehat\{G\}\_\{n\}decay rapidly, a small number of leading Fisher directions suffices\. Randomized SVD or sketching methods can computeG^k\\widehat\{G\}\_\{k\}, without explicitly forming the dense empirical Fisher matrix, at substantially lower cost than a full eigendecomposition, typically on the order of
O\(ndk\+dk2\)O\(ndk\+dk^\{2\}\)when applied to score samples\(Halkoet al\.,[2011](https://arxiv.org/html/2606.18306#bib.bib28)\)\.
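The bound of Proposition 6.1 can be checked on synthetic scores. This sketch uses a full eigendecomposition rather than randomized SVD, purely for clarity, and takes `T` to be the Euclidean ball so that the support function is a norm. The score matrix with decaying column scales is a hypothetical stand-in for real score samples. The same Gaussian draws are reused for every rank, so the inequality holds for the Monte Carlo averages themselves and not only in expectation.

```python
import numpy as np

def rank_k_root(eigs_desc, U_desc, k):
    """Square root of the rank-k truncation U_k Lambda_k U_k^T."""
    Uk = U_desc[:, :k]
    return (Uk * np.sqrt(eigs_desc[:k])) @ Uk.T

rng = np.random.default_rng(4)
n, d, r = 500, 30, 1.0
# Hypothetical score samples whose column scales decay like 1/j.
S = rng.standard_normal((n, d)) * (1.0 / np.arange(1, d + 1))
G_n = S.T @ S / n

eigs, U = np.linalg.eigh(G_n)
eigs, U = np.clip(eigs[::-1], 0.0, None), U[:, ::-1]  # descending order

g = rng.standard_normal((20000, d))              # common random numbers
w_ball = r * np.linalg.norm(g, axis=1).mean()    # estimate of w(r * B_2^d)
w_full = r * np.linalg.norm(g @ rank_k_root(eigs, U, d), axis=1).mean()

gaps = {}
for k in (1, 2, 5, 10, 20):
    w_k = r * np.linalg.norm(g @ rank_k_root(eigs, U, k), axis=1).mean()
    bound = np.sqrt(eigs[k]) * w_ball  # eigs[k] is lambda_{k+1} (0-indexed)
    gaps[k] = (abs(w_k - w_full), bound)
    print(f"k={k:2d}  error={gaps[k][0]:.5f}  bound={bound:.5f}")
```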
### 6\.3Score\-Norm Estimator for Euclidean Balls
For Euclidean balls, Fisher width admits an even cheaper estimator\. Let
T=rB2d\.T=rB\_\{2\}^\{d\}\.The identity
Tr\(G\)=𝔼‖sθ0\(z\)‖22\\mathrm\{Tr\}\(G\)=\\mathbb\{E\}\\\|s\_\{\\theta\_\{0\}\}\(z\)\\\|\_\{2\}^\{2\}suggests estimating the trace by the empirical score norm
τ^n=1n∑i=1n‖sθ0\(zi\)‖22\.\\widehat\{\\tau\}\_\{n\}=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\\\|s\_\{\\theta\_\{0\}\}\(z\_\{i\}\)\\\|\_\{2\}^\{2\}\.This gives the score\-norm estimator
w^score=rτ^n\.\\widehat\{w\}\_\{\\mathrm\{score\}\}=r\\sqrt\{\\widehat\{\\tau\}\_\{n\}\}\.It requires neither matrix formation, spectral factorization, nor Monte Carlo integration, and its cost isO\(nd\)O\(nd\)\.
By Theorem[5\.13](https://arxiv.org/html/2606.18306#S5.Thmtheorem13),
2πrTr\(G\)≤wG\(rB2d\)≤rTr\(G\)\.\\sqrt\{\\frac\{2\}\{\\pi\}\}\\,r\\sqrt\{\\mathrm\{Tr\}\(G\)\}\\leq w\_\{G\}\(rB\_\{2\}^\{d\}\)\\leq r\\sqrt\{\\mathrm\{Tr\}\(G\)\}\.Since
τ^n→Tr\(G\)\\widehat\{\\tau\}\_\{n\}\\to\\mathrm\{Tr\}\(G\)almost surely by the law of large numbers, the estimatorw^score\\widehat\{w\}\_\{\\mathrm\{score\}\}converges almost surely to the upper scalerTr\(G\)r\\sqrt\{\\mathrm\{Tr\}\(G\)\}\. Hence it provides a constant\-factor approximation ofwG\(rB2d\)w\_\{G\}\(rB\_\{2\}^\{d\}\), with sharp universal factor2/π\\sqrt\{2/\\pi\}\.
The constant is sharp: the lower factor2/π\\sqrt\{2/\\pi\}is attained in the rank\-one case\. In the isotropic caseG=λIdG=\\lambda I\_\{d\},
wG\(rB2d\)rTr\(G\)=𝔼‖g‖2d→1asd→∞\.\\frac\{w\_\{G\}\(rB\_\{2\}^\{d\}\)\}\{r\\sqrt\{\\mathrm\{Tr\}\(G\)\}\}=\\frac\{\\mathbb\{E\}\\\|g\\\|\_\{2\}\}\{\\sqrt\{d\}\}\\to 1\\qquad\\text\{as \}d\\to\\infty\.
Using the decomposition
Tr\(G\)=λmax\(G\)dF\(G\),\\mathrm\{Tr\}\(G\)=\\lambda\_\{\\max\}\(G\)\\,d\_\{F\}\(G\),the score\-norm estimator may equivalently be viewed as estimating the scale
rλmax\(G\)dF\(G\)\.r\\sqrt\{\\lambda\_\{\\max\}\(G\)d\_\{F\}\(G\)\}\.
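The score-norm estimator needs only the per-sample scores. The sketch below, with a hypothetical score matrix `S` and function name, computes it in a single pass and confirms that the empirical score norm equals the trace of the empirical Fisher matrix without ever forming that matrix in the estimator itself.

```python
import numpy as np

def score_norm_width(scores, r):
    """r * sqrt(mean squared score norm); costs O(n d), no d x d matrix."""
    tau_hat = np.mean(np.sum(scores ** 2, axis=1))
    return r * np.sqrt(tau_hat)

rng = np.random.default_rng(5)
n, d, r = 1000, 50, 2.0
S = rng.standard_normal((n, d)) * np.linspace(0.1, 1.0, d)  # synthetic scores

w_score = score_norm_width(S, r)

# Cross-check only: the trace of the empirical Fisher matrix gives the same scale.
trace_check = r * np.sqrt(np.trace(S.T @ S / n))

# By Theorem 5.13, the width under the empirical Fisher lies in this bracket.
bracket = (np.sqrt(2 / np.pi) * w_score, w_score)
print(w_score, trace_check, bracket)
```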
Table 1: Comparison of Fisher width estimators\.
These comparisons show that Fisher width is not only a theoretical complexity measure, but also a computable quantity whose approximation can be adapted to the available spectral and geometric structure\.
## 7Experiments
The experiments provide empirical support for three aspects of the theory across three model classes and two tasks\. Section[7\.2](https://arxiv.org/html/2606.18306#S7.SS2)shows that Fisher width is a nontrivial, computable quantity on trained models\. Section[7\.3](https://arxiv.org/html/2606.18306#S7.SS3)evaluates the estimation methods and stability bounds of Section[6](https://arxiv.org/html/2606.18306#S6)\. Section[7\.4](https://arxiv.org/html/2606.18306#S7.SS4)examines the predictedO\(1/n\)O\(1/\\sqrt\{n\}\)scaling from Theorem[5\.6](https://arxiv.org/html/2606.18306#S5.Thmtheorem6)\. All experiments useT=B2pT=B\_\{2\}^\{p\}, are averaged over 10 independent random seeds, and report error bars of one standard deviation\. Code is provided in the supplementary material\.
### 7\.1Experimental Setup
#### Dataset\.
All experiments use MNIST\(LeCunet al\.,[1998](https://arxiv.org/html/2606.18306#bib.bib38)\)\. Sections[7\.2](https://arxiv.org/html/2606.18306#S7.SS2)and[7\.3](https://arxiv.org/html/2606.18306#S7.SS3)use the binary subset \(digits 0 vs\. 1,13,00713\{,\}007samples total\)\. Section[7\.4](https://arxiv.org/html/2606.18306#S7.SS4)uses all 10 classes \(70,00070\{,\}000samples\)\. Features are standardized to zero mean and unit variance\.
#### Models and Fisher estimation\.
Three model classes are used throughout\.
*Model A: Binary logistic regression*\(p=784p=784\)\. The full empirical Fisher
G^n=1n∑i=1nsθ^\(xi\)sθ^\(xi\)⊤\\widehat\{G\}\_\{n\}=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}s\_\{\\hat\{\\theta\}\}\(x\_\{i\}\)s\_\{\\hat\{\\theta\}\}\(x\_\{i\}\)^\{\\top\}is computed exactly\. Fisher width is estimated by Monte Carlo withB=5,000B=5\{,\}000samples\.
*Model B: Softmax regression, 10 classes*\(p=7,840p=7\{,\}840\)\. The diagonal empirical Fisher
γ^k,j=1n∑i=1np^ik\(1−p^ik\)xij2\\widehat\{\\gamma\}\_\{k,j\}=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\\widehat\{p\}\_\{ik\}\(1\-\\widehat\{p\}\_\{ik\}\)x\_\{ij\}^\{2\}is used\. Fisher width is estimated withB=10,000B=10\{,\}000Monte Carlo samples\.
*Model C: Ridge regression*\(p=784p=784\)\. The Fisher matrix
G\(θ\)=σ−2ΣX,ΣX=1nX⊤X,G\(\\theta\)=\\sigma^\{\-2\}\\Sigma\_\{X\},\\qquad\\Sigma\_\{X\}=\\frac\{1\}\{n\}X^\{\\top\}X,does not depend onθ\\theta, providing an analytic ground truth for Fisher width via the eigenvalues ofΣX\\Sigma\_\{X\}\.
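The diagonal Fisher used for Model B can be computed with two matrix products. The following sketch implements the stated formula for the diagonal entries and then estimates the Fisher width of the unit ball under the resulting diagonal metric. The synthetic features, the weight matrix `W` of shape `(K, d)`, and the function names are assumptions for illustration and do not reproduce the MNIST configuration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def diagonal_fisher_softmax(X, W):
    """gamma[k, j] = (1/n) * sum_i p_ik * (1 - p_ik) * x_ij^2."""
    P = softmax(X @ W.T)                            # (n, K) class probabilities
    return (P * (1.0 - P)).T @ (X ** 2) / X.shape[0]  # (K, d)

def width_unit_ball_diag(gamma, num_draws, rng):
    """Monte Carlo Fisher width of the unit ball for a diagonal metric."""
    diag = gamma.ravel()
    g = rng.standard_normal((num_draws, diag.size))
    return np.linalg.norm(np.sqrt(diag) * g, axis=1).mean()

rng = np.random.default_rng(6)
n, d, K = 300, 8, 3
X = rng.standard_normal((n, d))        # synthetic standardized features
W = 0.5 * rng.standard_normal((K, d))  # hypothetical fitted weights

gamma = diagonal_fisher_softmax(X, W)
w_diag = width_unit_ball_diag(gamma, num_draws=10000, rng=rng)
print(gamma.shape, w_diag, np.sqrt(gamma.sum()))
```

Because the metric is diagonal, no matrix square root or eigendecomposition is needed, which is what makes this approximation practical at the larger parameter count of the softmax model.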
#### Regularization\.
We varyL2L\_\{2\}regularization
λ∈\{0,10−4,10−3,10−2,10−1,1,5\}\\lambda\\in\\\{0,10^\{\-4\},10^\{\-3\},10^\{\-2\},10^\{\-1\},1,5\\\}in Sections[7\.2](https://arxiv.org/html/2606.18306#S7.SS2)–[7\.3](https://arxiv.org/html/2606.18306#S7.SS3), and
λ∈\{10−3,10−2,10−1,1\}\\lambda\\in\\\{10^\{\-3\},10^\{\-2\},10^\{\-1\},1\\\}in Section[7\.4](https://arxiv.org/html/2606.18306#S7.SS4)\.
### 7\.2Fisher Width in Trained Models
#### Goal\.
The first experiment examines whether empirical Fisher width is substantially different from the Euclidean baselinew\(B2p\)≈pw\(B\_\{2\}^\{p\}\)\\approx\\sqrt\{p\}, whether it satisfies the theoretical spectral and score bounds, and how it changes under regularization across the three model classes\.
#### Setup\.
We usen=10,000n=10\{,\}000, 10 seeds, and
λ∈\{0,10−4,10−3,10−2,10−1,1,5\}\.\\lambda\\in\\\{0,10^\{\-4\},10^\{\-3\},10^\{\-2\},10^\{\-1\},1,5\\\}\.
#### Observations\.
*Fisher width is substantially below the Euclidean baseline\.*Figure[1](https://arxiv.org/html/2606.18306#S7.F1)and Table[2](https://arxiv.org/html/2606.18306#S7.T2)report the results\. For Model A, the normalized ratio
w^G\(B2784\)/w\(B2784\)\\widehat\{w\}\_\{G\}\(B\_\{2\}^\{784\}\)/w\(B\_\{2\}^\{784\}\)ranges from0\.0010\.001atλ=0\\lambda=0to0\.2950\.295atλ=5\\lambda=5\. For Model B, the ratio ranges from0\.0300\.030to0\.2660\.266over the larger parameter spacep=7,840p=7\{,\}840\. For Model C, the ratio is approximately0\.8650\.865, reflecting the near\-isotropic Fisher geometry of the Gaussian covariate model\.
*The theoretical bounds are satisfied\.*The spectral sandwich bound of Lemma[2\.10](https://arxiv.org/html/2606.18306#S2.Thmtheorem10)holds across all 70 runs for each model, corresponding to 7 regularization values and 10 random seeds\. For Model A,
minλ,s\(w^G\(s\)−λmin\(G^\)w\(B2p\)\)=0\.021\>0\.\\min\_\{\\lambda,s\}\\left\(\\widehat\{w\}\_\{G\}^\{\(s\)\}\-\\sqrt\{\\lambda\_\{\\min\}\(\\widehat\{G\}\)\}\\,w\(B\_\{2\}^\{p\}\)\\right\)=0\.021\>0\.The score upper bound of Theorem[5\.13](https://arxiv.org/html/2606.18306#S5.Thmtheorem13)is satisfied in every run for Models A and C, with
min\(Tr\(G^\)−w^G\)=2\.8×10−4\.\\min\\left\(\\sqrt\{\\mathrm\{Tr\}\(\\widehat\{G\}\)\}\-\\widehat\{w\}\_\{G\}\\right\)=2\.8\\times 10^\{\-4\}\.
*For the models considered here, Fisher width increases with regularization\.*For all three models,w^G\\widehat\{w\}\_\{G\}is increasing inλ\\lambdaacross all seeds; see Figure[1](https://arxiv.org/html/2606.18306#S7.F1)\(a\),\(b\)\. In Model A, the damped condition numberκε\(G^\)\\kappa\_\{\\varepsilon\}\(\\widehat\{G\}\)also increases withλ\\lambda, growing from approximately10210^\{2\}to10710^\{7\}\. This appears to occur because the fitted model changes the Fisher spectrum by increasing the leading eigenvalues while the smallest eigenvalues remain close to zero\.
*The score upper bound is tight for near\-isotropic models\.*The ratio between the score upper scale and the empirical Fisher width is1\.0081\.008for Model A and1\.0061\.006for Model C, consistent with near\-isotropic Fisher geometry\. The ratio is larger for Model B\.
*The Monte Carlo estimator is validated against analytic ground truth in Model C\.*For ridge regression, the analytic Fisher width
wGanalytic=24\.38w\_\{G\}^\{\\mathrm\{analytic\}\}=24\.38is available from the eigenvalues ofΣX\\Sigma\_\{X\}\. The Monte Carlo estimator withB=5,000B=5\{,\}000achieves a relative error of0\.13%0\.13\\%, uniformly across all regularization values, directly supporting the estimation procedure of Section[6](https://arxiv.org/html/2606.18306#S6)\.
Table 2: Selected Fisher width values from Experiment 1\.w\(B2p\)=27\.99w\(B\_\{2\}^\{p\}\)=27\.99forp=784p=784and88\.5488\.54forp=7,840p=7\{,\}840\. Means over 10 seeds;±\\pmstd shown\.
Figure 1: Experiment 1: Fisher width in trained models\. Three model classes on MNIST,n=10,000n=10\{,\}000, 10 seeds,λ∈\{0,10−4,10−3,10−2,10−1,1,5\}\\lambda\\in\\\{0,10^\{\-4\},10^\{\-3\},10^\{\-2\},10^\{\-1\},1,5\\\}\.*Row A*: Binary logistic regression \(p=784p=784, full Fisher\)\.*Row B*: Softmax regression \(p=7,840p=7\{,\}840, diagonal Fisher\)\.*Row C*: Ridge regression \(p=784p=784, analytic Fisher\)\. Left column: empirical Fisher widthw^G\\widehat\{w\}\_\{G\}with score upper boundTr\(G^\)\\sqrt\{\\mathrm\{Tr\}\(\\widehat\{G\}\)\}and spectral sandwich\. The dotted line is the Euclidean baselinew\(B2p\)w\(B\_\{2\}^\{p\}\)\. Middle column: normalized ratiow^G/w\(B2p\)\\widehat\{w\}\_\{G\}/w\(B\_\{2\}^\{p\}\)\. Right column: damped condition numberκε\(G^\)\\kappa\_\{\\varepsilon\}\(\\widehat\{G\}\)on a log scale for Models A and B; Monte Carlo convergence against analytic ground truth for Model C, confirming theO\(1/B\)O\(1/\\sqrt\{B\}\)rate\.
### 7\.3Estimator Accuracy and Stability
#### Goal\.
The second experiment evaluates the low\-rank approximation error from Proposition[6\.1](https://arxiv.org/html/2606.18306#S6.Thmtheorem1), the data convergence predicted by Corollary[3\.4](https://arxiv.org/html/2606.18306#S3.Thmtheorem4), and the stability of structured approximations given by Theorem[3\.1](https://arxiv.org/html/2606.18306#S3.Thmtheorem1)\.
#### Setup\.
We use Model A withλ=0\.01\\lambda=0\.01,n=10,000n=10\{,\}000, and 10 seeds\. The reference value is
w^Gref=1\.462±0\.039\\widehat\{w\}\_\{G\}^\{\\mathrm\{ref\}\}=1\.462\\pm 0\.039computed withB=10,000B=10\{,\}000Monte Carlo samples\.
#### Observations\.
*Rank\-kkerror tracksλk\+1\\sqrt\{\\lambda\_\{k\+1\}\}\.*Figure[2](https://arxiv.org/html/2606.18306#S7.F2)\(a\) reports
\|w^G−w^G,k\|\\left\|\\widehat\{w\}\_\{G\}\-\\widehat\{w\}\_\{G,k\}\\right\|for
k∈\{1,2,5,10,20,30,50,100,200,392\}\.k\\in\\\{1,2,5,10,20,30,50,100,200,392\\\}\.The bound of Proposition[6\.1](https://arxiv.org/html/2606.18306#S6.Thmtheorem1)is satisfied at everykk, with
maxk\(error−bound\)=−0\.262<0\.\\max\_\{k\}\(\\text\{error\}\-\\text\{bound\}\)=\-0\.262<0\.On the log\-log scale, the actual error is parallel to the bound across allkk, supporting the predictedΘ\(λk\+1\)\\Theta\(\\sqrt\{\\lambda\_\{k\+1\}\}\)scaling\. The Fisher spectrum decays rapidly: byk=100k=100, the rank\-kkerror is0\.0960\.096, and byk=392k=392, it reaches0\.0330\.033\.
*Data convergence is consistent with anO\(1/n\)O\(1/\\sqrt\{n\}\)rate\.*Figure[2](https://arxiv.org/html/2606.18306#S7.F2)\(b\) shows the error
\|w^G\(n\)−w^Gref\|\\left\|\\widehat\{w\}\_\{G\}^\{\(n\)\}\-\\widehat\{w\}\_\{G\}^\{\\mathrm\{ref\}\}\\right\|as a function ofnn\. The log\-log slope is−0\.60\-0\.60, close to the−0\.5\-0\.5rate predicted by Corollary[3\.4](https://arxiv.org/html/2606.18306#S3.Thmtheorem4)\. Relative error falls from47\.7%47\.7\\%atn=100n=100to2\.0%2\.0\\%atn=10,000≈12\.8dn=10\{,\}000\\approx 12\.8d\. Errors are larger forn<dn<d, where the empirical Fisher is severely rank\-deficient\.
*The diagonal approximation is accurate in this setting\.*Figure[2](https://arxiv.org/html/2606.18306#S7.F2)\(c\) compares the full Fisher, diagonal Fisher, and score upper bound\. Measured against the full Fisher width, the diagonal Fisher incurs a relative error of0\.56%0\.56\\%and the score upper bound a relative error of0\.64%0\.64\\%\. Both approximations consistently upper\-bound the full estimate across all seeds, as expected from the stability principle\.
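A log-log slope such as the one reported for the data-convergence experiment is typically obtained by a least-squares line fit in logarithmic coordinates. The sketch below shows one way to do this with `numpy.polyfit`. The error values are synthetic, generated with an exact inverse-square-root decay, and are not the paper's measurements; the function name is hypothetical.

```python
import numpy as np

def loglog_slope(ns, errors):
    """Least-squares slope of log(error) against log(n)."""
    slope, _intercept = np.polyfit(np.log(ns), np.log(errors), deg=1)
    return slope

ns = np.array([100, 200, 500, 1000, 2000, 5000, 10000])
synthetic_errors = 3.0 / np.sqrt(ns)  # exact n^{-1/2} decay, illustration only

slope = loglog_slope(ns, synthetic_errors)
print(f"fitted slope: {slope:.3f}")
```

On real measurements the fitted slope will deviate from the nominal rate because of finite-sample noise and pre-asymptotic effects at small sample sizes.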
Figure 2:Experiment 2: Estimator accuracy and stability\.Model A \(binary logistic,λ=0\.01\\lambda=0\.01,n=10,000n=10\{,\}000,d=784d=784, 10 seeds\)\.\(a\)Rank\-kkapproximation error\|w^G−w^G,k\|\|\\widehat\{w\}\_\{G\}\-\\widehat\{w\}\_\{G,k\}\|versus the theoretical boundλk\+1\(G^\)w\(T\)\\sqrt\{\\lambda\_\{k\+1\}\(\\widehat\{G\}\)\}\\,w\(T\)on a log\-log scale\. The bound is satisfied at allkk, with parallel scalingΘ\(λk\+1\)\\Theta\(\\sqrt\{\\lambda\_\{k\+1\}\}\)\(Proposition[6\.1](https://arxiv.org/html/2606.18306#S6.Thmtheorem1)\)\.\(b\)Data convergence:\|w^G\(n\)−w^Gref\|\|\\widehat\{w\}\_\{G\}^\{\(n\)\}\-\\widehat\{w\}\_\{G\}^\{\\mathrm\{ref\}\}\|versusnn, with anO\(1/n\)O\(1/\\sqrt\{n\}\)reference line; log\-log slope=−0\.60=\-0\.60\(Corollary[3\.4](https://arxiv.org/html/2606.18306#S3.Thmtheorem4)\)\.\(c\)Structured approximations: full Fisher, diagonal Fisher \(\+0\.56%\+0\.56\\%\), and score upper bound \(\+0\.64%\+0\.64\\%\) \(Theorem[3\.1](https://arxiv.org/html/2606.18306#S3.Thmtheorem1)\)\.
### 7\.4Fisher Width and Generalization
#### Goal\.
The third experiment examines the $O(1/\sqrt{n})$ scaling predicted by Theorem [5.6](https://arxiv.org/html/2606.18306#S5.Thmtheorem6) across training sizes and regularization levels.
#### Setup\.
We use Model B, the 10-class softmax model, with $N_{\mathrm{test}}=5{,}000$ and 10 seeds. We vary $\lambda\in\{10^{-3},10^{-2},10^{-1},1\}$ and $n\in\{200,500,1{,}000,2{,}000,5{,}000,10{,}000,20{,}000\}$.
Although the implementation uses the full $K\times d$ parametrization ($p=7{,}840$), the softmax model is identifiable only modulo common shifts of all class weights. The Fisher-Lipschitz verification in Appendix [A.3](https://arxiv.org/html/2606.18306#A1.SS3) is stated in an identifiable baseline parametrization; the full-parametrization case is interpreted through the quotient-space formulation of Remark [A.5](https://arxiv.org/html/2606.18306#A1.Thmtheorem5).
#### Observations\.
*The generalization gap scales approximately as $1/\sqrt{n}$.* For each fixed $\lambda$, the generalization gap is approximately linear in $1/\sqrt{n}$ across all seven training sizes; see Figure [3](https://arxiv.org/html/2606.18306#S7.F3)(a). The linear fits achieve $R^{2}\in\{0.935,0.943,0.987,0.976\}$ for $\lambda\in\{10^{-3},10^{-2},10^{-1},1\}$, respectively, consistent with the rate predicted by Theorem [5.6](https://arxiv.org/html/2606.18306#S5.Thmtheorem6).
*Fisher width increases with training data and is stabilized by regularization.* For fixed $\lambda$, $\widehat{w}_{G}(B_{2}^{p};\hat{\theta}_{n})$ grows with $n$, as shown in Figure [3](https://arxiv.org/html/2606.18306#S7.F3)(b). Stronger regularization stabilizes this growth: the ratio $\widehat{w}_{G}(n=20{,}000)/\widehat{w}_{G}(n=200)$ is $5.97$ at $\lambda=10^{-3}$ and $1.19$ at $\lambda=1$.
*The observed gaps are consistent with Fisher-width scaling.* Figure [3](https://arxiv.org/html/2606.18306#S7.F3)(c) plots the generalization gap against $\widehat{w}_{G}/\sqrt{n}$ over all $(\lambda,n)$ pairs. All 28 averaged data points lie below a fitted linear envelope $CL\cdot\widehat{w}_{G}/\sqrt{n}$ with $CL=2.672$. This supports the scaling predicted by Theorem [5.6](https://arxiv.org/html/2606.18306#S5.Thmtheorem6), although the constant is calibrated empirically.
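The text does not say how the envelope constant is fitted. One plausible reading, sketched here as an assumption, is that $CL$ is the smallest slope for which every point lies on or below the line through the origin, that is, the largest ratio of gap to $\widehat{w}_{G}/\sqrt{n}$. The function name `envelope_constant` and the numbers are hypothetical.

```python
import numpy as np

def envelope_constant(gaps, widths, n_values):
    """Smallest c such that gap_i <= c * width_i / sqrt(n_i) for every point."""
    x = np.asarray(widths, dtype=float) / np.sqrt(np.asarray(n_values, dtype=float))
    return float(np.max(np.asarray(gaps, dtype=float) / x))

# Hypothetical (gap, width, n) triples, not taken from the paper.
gaps = np.array([0.10, 0.05])
widths = np.array([2.0, 4.0])
n_values = np.array([100, 400])

c = envelope_constant(gaps, widths, n_values)
x = widths / np.sqrt(n_values)
print(f"envelope constant: {c:.3f}")
print("all points under envelope:", bool(np.all(gaps <= c * x + 1e-12)))
```

A constant obtained this way is fitted to the data rather than predicted, which matches the caveat that the constant is calibrated empirically.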
Figure 3: Experiment 3: Fisher width and generalization. Model B (MNIST 10-class softmax), 10 seeds, $N_{\mathrm{test}}=5{,}000$. (a) Generalization gap versus $1/\sqrt{n}$ for four regularization levels; dashed lines are linear fits. All four fits achieve $R^{2}\geq 0.935$, consistent with the $O(1/\sqrt{n})$ rate of Theorem [5.6](https://arxiv.org/html/2606.18306#S5.Thmtheorem6). (b) Fisher width $\widehat{w}_{G}$ versus training set size $n$ on a log scale for fixed $\lambda$. Stronger regularization stabilizes $\widehat{w}_{G}$ across $n$. (c) Generalization gap versus $\widehat{w}_{G}/\sqrt{n}$ for all $(\lambda,n)$ pairs. All 28 averaged points lie below the fitted envelope $CL\cdot\widehat{w}_{G}/\sqrt{n}$ with $CL=2.672$.
## 8Discussion
### 8\.1Complexity Depends on Geometry
A central message of this paper is that complexity is not intrinsic to a hypothesis class: it depends on the metric used to measure it\.
Classical Gaussian width $w(T)$ implicitly assumes the Euclidean metric $G=I$, treating all parameter directions as equally significant. Fisher width $w_{G}(T)=w(G^{1/2}T)$ replaces this assumption with the local statistical geometry of the model: directions of high Fisher curvature are amplified, statistically insensitive directions are suppressed, and the resulting complexity measure reflects what is locally distinguishable from data.
This perspective has a concrete consequence. Two hypothesis classes with identical Euclidean complexity may have very different Fisher complexities if their Fisher geometries differ. It is Fisher complexity, rather than Euclidean complexity alone, that appears in the generalization bound of Theorem [5.6](https://arxiv.org/html/2606.18306#S5.Thmtheorem6).
Gaussian width is therefore not replaced by Fisher width; it appears as the flat, isotropic limit of a more general information\-geometric complexity viewpoint\.
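To make the contrast concrete, the sketch below estimates the Fisher width of a Euclidean ball by Monte Carlo, using the identity $w_{G}(rB_{2}^{d})=r\,\mathbb{E}\|G^{1/2}g\|_{2}$ with $g\sim\mathcal{N}(0,I_{d})$. The function name `fisher_width_ball`, the sample size, and the two example metrics are illustrative assumptions, not the paper's estimator.

```python
import numpy as np

def fisher_width_ball(G, r=1.0, num_samples=20000, seed=0):
    """Monte Carlo estimate of w_G(r * B_2^d) = r * E || G^{1/2} g ||_2."""
    rng = np.random.default_rng(seed)
    # Symmetric PSD square root via the eigendecomposition of G.
    eigvals, eigvecs = np.linalg.eigh(G)
    eigvals = np.clip(eigvals, 0.0, None)
    sqrt_G = (eigvecs * np.sqrt(eigvals)) @ eigvecs.T
    g = rng.standard_normal((num_samples, G.shape[0]))
    # sqrt_G is symmetric, so each row of g @ sqrt_G equals (sqrt_G g)^T.
    norms = np.linalg.norm(g @ sqrt_G, axis=1)
    return float(r * norms.mean())

d = 50
G_flat = np.eye(d)                      # Euclidean case G = I
G_aniso = np.zeros((d, d))
G_aniso[0, 0] = 4.0                     # a single informative direction

w_flat = fisher_width_ball(G_flat)
w_aniso = fisher_width_ball(G_aniso)
print(f"w_G for G = I:        {w_flat:.3f}")
print(f"w_G for rank-one G:   {w_aniso:.3f}")
```

The same unit ball has width close to $\sqrt{d}$ under $G=I$ but only $2\sqrt{2/\pi}$ under the rank-one metric, which illustrates how sets of equal Euclidean complexity can differ sharply in Fisher complexity.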
### 8\.2Effective Fisher Dimension
A central consequence of Fisher width is the emergence of an intrinsic effective dimension determined by the Fisher geometry itself. The quantity
$$d_{F}(G)=\frac{\mathrm{Tr}(G)}{\lambda_{\max}(G)}$$
appears naturally from the sharp characterization
$$w_{G}(rB_{2}^{d})^{2}\asymp r^{2}\lambda_{\max}(G)\,d_{F}(G)$$
of Theorem [5.13](https://arxiv.org/html/2606.18306#S5.Thmtheorem13). For nonzero $G$,
$$1\leq d_{F}(G)\leq\operatorname{rank}(G)\leq d.$$
Thus $d_{F}(G)$ measures how many directions carry substantial Fisher information, interpolating between the rank-one, maximally anisotropic case and the isotropic case $G=\lambda I_{d}$, where $d_{F}(G)=d$.
The significance of $d_{F}(G)$ is that it can replace raw parameter count as a more intrinsic notion of local model size. For Euclidean balls with comparable radius and largest Fisher scale, a model with $d=10^{6}$ but $d_{F}(G)=50$ has the same Fisher-width scaling as a 50-dimensional isotropic model. The experiments of Section [7](https://arxiv.org/html/2606.18306#S7) show that Fisher geometry varies substantially across model classes and regularization levels, suggesting that $d_{F}(G)$ captures structure beyond the raw parameter count $d$. A systematic comparison between $d_{F}(G)$ and $d$ as predictors of generalization remains for future work.
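The definition $d_{F}(G)=\mathrm{Tr}(G)/\lambda_{\max}(G)$ is straightforward to evaluate for a given symmetric positive semidefinite matrix. The following minimal sketch (the function name `effective_fisher_dimension` and the test matrices are hypothetical) checks the two extreme cases mentioned above and one intermediate case.

```python
import numpy as np

def effective_fisher_dimension(G):
    """d_F(G) = Tr(G) / lambda_max(G) for a nonzero symmetric PSD matrix G."""
    eigvals = np.linalg.eigvalsh(G)      # ascending order for symmetric input
    lam_max = eigvals[-1]
    if lam_max <= 0:
        raise ValueError("G must be nonzero and positive semidefinite")
    return float(np.trace(G) / lam_max)

d_iso = effective_fisher_dimension(3.0 * np.eye(5))            # isotropic: d_F = d
u = np.array([1.0, 2.0, 2.0])
d_rank_one = effective_fisher_dimension(np.outer(u, u))         # rank one: d_F = 1
d_mixed = effective_fisher_dimension(np.diag([1.0, 0.5, 0.5]))  # (1 + 0.5 + 0.5) / 1

print(d_iso, d_rank_one, d_mixed)
```

For large models one would presumably avoid a full eigendecomposition and estimate the trace and top eigenvalue separately; that variant is not covered by this sketch.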
### 8\.3Limitations and Future Directions
Several limitations bound the scope of the present results\.
Locality. Fisher width is defined at a fixed base point $\theta_{0}$. Extending the theory to globally varying Fisher metrics on overparameterized manifolds remains open. Such an extension would require controlling how $G(\theta)$, and therefore $w_{G(\theta)}(T)$, changes along the manifold.
Static metric. The theory assumes $G(\theta_{0})$ fixed. In modern learning systems, the Fisher geometry evolves throughout training. Whether the static bounds of this paper remain informative along training trajectories is not yet understood. This motivates a dynamic theory of Fisher complexity, in which Fisher width is studied as a time-dependent geometric quantity along optimization paths.
Perturbative curvature. The curvature analysis of Section [4](https://arxiv.org/html/2606.18306#S4) is local and perturbative. A global geometric characterization of Fisher complexity on curved statistical manifolds remains an open problem.
Sharpness beyond exponential families. Section [5.3](https://arxiv.org/html/2606.18306#S5.SS3) shows that the Fisher-width scale is sharp for local perturbations in exponential families with negative log-likelihood loss. Whether a matching lower bound holds for general Fisher-Lipschitz losses under a suitable nondegeneracy condition remains open. Establishing such a result would confirm that Fisher width is not merely sufficient, but also necessary, as a complexity measure for statistical learning on manifolds.
Fisher-Lipschitz assumption. The generalization bound relies on Assumption [5.3](https://arxiv.org/html/2606.18306#S5.Thmtheorem3). Appendix [A](https://arxiv.org/html/2606.18306#A1) verifies this assumption for logistic and softmax regression under bounded Fisher-leverage conditions. Extending such verification to neural networks and other highly nonconvex model classes remains open.
Population versus empirical Fisher. The theory is formulated in terms of the population Fisher matrix $G(\theta_{0})$, whereas practical estimation relies on the empirical approximation $\widehat{G}_{n}$. Section [6](https://arxiv.org/html/2606.18306#S6) quantifies the resulting width error, but understanding the full statistical and algorithmic consequences of this replacement, particularly in high-dimensional, misspecified, or overparameterized settings, remains an important direction for future work.
Recovery and duality. Finally, Fisher width suggests a possible primal–dual extension of the present theory. By analogy with Gaussian width in conic integral geometry, a dual Fisher width such as $w_{G^{-1}}(T)$, or $w_{G^{\dagger}}(T)$ in singular settings, may be relevant for Fisher-scaled recovery thresholds. This points toward a picture in which $w_{G}$ measures statistical sensitivity, while a dual width measures estimation uncertainty or recovery difficulty. We leave this direction for future work.
## Appendix AVerification of the Fisher\-Lipschitz Assumption
This appendix verifies Assumption [5.3](https://arxiv.org/html/2606.18306#S5.Thmtheorem3) for the classification models used in the experiments. The common mechanism is a Fisher-gradient bound: if the loss gradient has uniformly bounded dual norm with respect to the base-point Fisher metric, then the loss is Lipschitz in the corresponding Fisher norm.
### A\.1A General Fisher\-Gradient Criterion
###### Lemma A\.1\(Fisher\-gradient criterion\)\.
Let $T\subset\mathbb{R}^{d}$ be convex and let $G_{0}\succ 0$. Suppose that $\ell(\theta;z)$ is differentiable in $\theta$ on $T$, and that
$$\sup_{\theta\in T}\sup_{z}\left\|G_{0}^{-1/2}\nabla_{\theta}\ell(\theta;z)\right\|_{2}\leq L.$$
Then $\ell$ is Fisher-Lipschitz on $T$ with respect to $G_{0}$:
$$|\ell(\theta_{1};z)-\ell(\theta_{2};z)|\leq L\|\theta_{1}-\theta_{2}\|_{G_{0}},\qquad\theta_{1},\theta_{2}\in T.$$
###### Proof\.
Fix $\theta_{1},\theta_{2}\in T$, and define
$$\theta_{t}=\theta_{2}+t(\theta_{1}-\theta_{2}),\qquad t\in[0,1].$$
By convexity of $T$, $\theta_{t}\in T$. The fundamental theorem of calculus gives
$$\ell(\theta_{1};z)-\ell(\theta_{2};z)=\int_{0}^{1}\left\langle\nabla_{\theta}\ell(\theta_{t};z),\theta_{1}-\theta_{2}\right\rangle\,dt.$$
For each $t$, Cauchy–Schwarz in the Fisher metric gives
$$\begin{aligned}\left|\left\langle\nabla_{\theta}\ell(\theta_{t};z),\theta_{1}-\theta_{2}\right\rangle\right|&=\left|\left\langle G_{0}^{-1/2}\nabla_{\theta}\ell(\theta_{t};z),G_{0}^{1/2}(\theta_{1}-\theta_{2})\right\rangle\right|\\&\leq\left\|G_{0}^{-1/2}\nabla_{\theta}\ell(\theta_{t};z)\right\|_{2}\,\|\theta_{1}-\theta_{2}\|_{G_{0}}.\end{aligned}$$
Using the assumed uniform bound and integrating over $t$, we obtain
$$|\ell(\theta_{1};z)-\ell(\theta_{2};z)|\leq L\|\theta_{1}-\theta_{2}\|_{G_{0}}.$$
∎
### A\.2Logistic Regression
###### Proposition A\.3\(Fisher\-Lipschitz condition for logistic regression\)\.
Consider binary logistic regression
$$\mathbb{P}_{\theta}(Y=1\mid X=x)=\sigma(\theta^{\top}x),\qquad\sigma(t)=\frac{1}{1+e^{-t}},$$
with cross-entropy loss
$$\ell(\theta;(x,y))=-y\,\theta^{\top}x+\log(1+e^{\theta^{\top}x}).$$
Fix a base point $\theta_{0}$, and let $G_{0}=G(\theta_{0})\succ 0$. Suppose that $T\subset\mathbb{R}^{d}$ is convex and
$$\sup_{x}\|G_{0}^{-1/2}x\|_{2}\leq L.$$
Then Assumption [5.3](https://arxiv.org/html/2606.18306#S5.Thmtheorem3) holds on $T$ with constant $L$.
###### Proof\.
For logistic regression,
$$\nabla_{\theta}\ell(\theta;(x,y))=(\sigma(\theta^{\top}x)-y)\,x.$$
Since $y\in\{0,1\}$ and $\sigma(\theta^{\top}x)\in[0,1]$,
$$|\sigma(\theta^{\top}x)-y|\leq 1.$$
Therefore
$$\left\|G_{0}^{-1/2}\nabla_{\theta}\ell(\theta;(x,y))\right\|_{2}\leq\|G_{0}^{-1/2}x\|_{2}\leq L.$$
The claim follows from Lemma [A.1](https://arxiv.org/html/2606.18306#A1.Thmtheorem1). ∎
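The proposition can be checked numerically on synthetic data. The sketch below is only an illustration under stated assumptions: the features are random, the base-point Fisher matrix is replaced by its empirical version $\frac{1}{n}\sum_i \sigma_i(1-\sigma_i)x_ix_i^{\top}$ on the same sample, and $L$ is taken as the largest Fisher leverage $\|G_{0}^{-1/2}x_i\|_{2}$ over that sample. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.uniform(-1.0, 1.0, size=(n, d))      # synthetic bounded features
y = rng.integers(0, 2, size=n)               # synthetic binary labels
theta0 = rng.normal(size=d)                  # hypothetical base point

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(theta, X, y):
    """Per-example cross-entropy: -y * theta^T x + log(1 + exp(theta^T x))."""
    z = X @ theta
    return -y * z + np.logaddexp(0.0, z)

# Empirical Fisher matrix at theta0 (stand-in for the population G(theta0)).
p0 = sigmoid(X @ theta0)
G0 = (X * (p0 * (1.0 - p0))[:, None]).T @ X / n

# L = max_i ||G0^{-1/2} x_i||_2 = max_i sqrt(x_i^T G0^{-1} x_i).
G0_inv = np.linalg.inv(G0)
L = float(np.sqrt(np.max(np.einsum("ij,jk,ik->i", X, G0_inv, X))))

# Compare |loss difference| with L * ||theta1 - theta2||_{G0} on every example.
theta1, theta2 = rng.normal(size=d), rng.normal(size=d)
diff = theta1 - theta2
fisher_dist = float(np.sqrt(diff @ G0 @ diff))
lhs = np.abs(loss(theta1, X, y) - loss(theta2, X, y))

print("max |loss diff|:", lhs.max())
print("L * Fisher distance:", L * fisher_dist)
```

Because Lemma A.1 holds for any $G_{0}\succ 0$, the inequality is satisfied on every sampled example even though the empirical Fisher matrix is used in place of the population one.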
### A\.3Softmax Regression
We next consider $K$-class softmax regression. To avoid the non-identifiability of the full $K\times d$ parametrization, we use the standard baseline parametrization $W\in\mathbb{R}^{(K-1)\times d}$, where class $K$ is the baseline.
For $x\in\mathbb{R}^{d}$, define
$$p_{W}(k\mid x)=\frac{\exp(w_{k}^{\top}x)}{1+\sum_{\ell=1}^{K-1}\exp(w_{\ell}^{\top}x)},\qquad k=1,\ldots,K-1,$$
and
$$p_{W}(K\mid x)=\frac{1}{1+\sum_{\ell=1}^{K-1}\exp(w_{\ell}^{\top}x)}.$$
Write
$$\pi_{W}(x)=(p_{W}(1\mid x),\ldots,p_{W}(K-1\mid x))\in\mathbb{R}^{K-1}.$$
The cross-entropy loss is
$$\ell(W;(x,y))=-\log p_{W}(y\mid x).$$
###### Proposition A\.4\(Fisher\-Lipschitz condition for softmax regression\)\.
Fix a base point $W_{0}$, and let $G_{0}=G(W_{0})\succ 0$ be the Fisher matrix in the identifiable baseline parametrization. Suppose $T\subset\mathbb{R}^{(K-1)\times d}$ is convex.
If
$$\sup_{W\in T}\sup_{(x,y)}\left\|G_{0}^{-1/2}\nabla_{W}\ell(W;(x,y))\right\|_{F}\leq L,$$
then Assumption [5.3](https://arxiv.org/html/2606.18306#S5.Thmtheorem3) holds on $T$ with constant $L$.
In particular, if $\|x\|_{2}\leq R$ almost surely and
$$G_{0}\succeq\mu I$$
for some $\mu>0$, then Assumption [5.3](https://arxiv.org/html/2606.18306#S5.Thmtheorem3) holds with
$$L=\frac{\sqrt{2}\,R}{\sqrt{\mu}}.$$
###### Proof\.
The first statement follows directly from Lemma [A.1](https://arxiv.org/html/2606.18306#A1.Thmtheorem1), identifying $W$ with its vectorization and using the Frobenius inner product.
It remains to prove the explicit bound. Differentiating the cross-entropy loss gives
$$\nabla_{W}\ell(W;(x,y))=(\pi_{W}(x)-e_{y})\,x^{\top},$$
where $e_{y}\in\mathbb{R}^{K-1}$ is the $y$-th standard basis vector for $y\in\{1,\ldots,K-1\}$, and $e_{K}=0$ for the baseline class.
Since this is a rank-one matrix,
$$\|\nabla_{W}\ell(W;(x,y))\|_{F}=\|\pi_{W}(x)-e_{y}\|_{2}\,\|x\|_{2}.$$
Let
$$v=\pi_{W}(x)-e_{y}.$$
Then $\|v\|_{\infty}\leq 1$. Moreover,
$$\|v\|_{1}\leq\|\pi_{W}(x)\|_{1}+\|e_{y}\|_{1}\leq 1+1=2,$$
because $\sum_{k=1}^{K-1}p_{W}(k\mid x)\leq 1$ and $\|e_{y}\|_{1}\leq 1$. Hence
$$\|v\|_{2}^{2}\leq\|v\|_{\infty}\|v\|_{1}\leq 2,$$
so
$$\|\pi_{W}(x)-e_{y}\|_{2}\leq\sqrt{2}.$$
Therefore, if $\|x\|_{2}\leq R$,
$$\|\nabla_{W}\ell(W;(x,y))\|_{F}\leq\sqrt{2}\,R.$$
Finally, since $G_{0}\succeq\mu I$,
$$\left\|G_{0}^{-1/2}\nabla_{W}\ell(W;(x,y))\right\|_{F}\leq\frac{1}{\sqrt{\mu}}\|\nabla_{W}\ell(W;(x,y))\|_{F}\leq\frac{\sqrt{2}\,R}{\sqrt{\mu}}.$$
The result follows from Lemma [A.1](https://arxiv.org/html/2606.18306#A1.Thmtheorem1). ∎
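The two elementary facts used in this proof, the rank-one identity for the gradient norm and the bound $\|\pi_{W}(x)-e_{y}\|_{2}\leq\sqrt{2}$, can be spot-checked numerically. The sketch below uses random weights and inputs; the function names and the chosen sizes ($K=10$, $d=20$) are assumptions for illustration.

```python
import numpy as np

def baseline_softmax_probs(W, x):
    """Probabilities of classes 1..K-1 in the baseline parametrization."""
    logits = W @ x                              # shape (K-1,)
    m = max(0.0, float(logits.max()))           # shift for numerical stability
    exp_logits = np.exp(logits - m)
    return exp_logits / (np.exp(-m) + exp_logits.sum())

def loss_gradient(W, x, y):
    """Gradient (pi_W(x) - e_y) x^T for label y in {1, ..., K}; y = K is the baseline."""
    e_y = np.zeros(W.shape[0])
    if y <= W.shape[0]:
        e_y[y - 1] = 1.0
    v = baseline_softmax_probs(W, x) - e_y
    return np.outer(v, x), v

rng = np.random.default_rng(0)
K, d = 10, 20
max_v_norm = 0.0
frobenius_identity_holds = True
for _ in range(200):
    W = rng.normal(scale=3.0, size=(K - 1, d))
    x = rng.normal(size=d)
    for y in range(1, K + 1):
        grad, v = loss_gradient(W, x, y)
        max_v_norm = max(max_v_norm, float(np.linalg.norm(v)))
        # Frobenius norm of a rank-one matrix v x^T equals ||v||_2 * ||x||_2.
        frobenius_identity_holds &= bool(
            np.isclose(np.linalg.norm(grad), np.linalg.norm(v) * np.linalg.norm(x))
        )

print("largest ||pi - e_y||_2 observed:", max_v_norm)
print("sqrt(2) bound:", np.sqrt(2.0))
```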
## Appendix BProofs for Section[5\.3](https://arxiv.org/html/2606.18306#S5.SS3)
###### Proof of Lemma[5\.8](https://arxiv.org/html/2606.18306#S5.Thmtheorem8)\.
Since $-\log p_{\theta}(z)=A(\theta)-\theta^{\top}\phi(z)-\log h(z)$,
$$\widetilde{\ell}(u;z)=\bigl(A(\theta_{0}+u)-A(\theta_{0})\bigr)-u^{\top}\phi(z).$$
The deterministic term $A(\theta_{0}+u)-A(\theta_{0})$ cancels in the difference between population and empirical risk:
$$\begin{aligned}R(u)-\widehat{R}_{n}(u)&=\mathbb{E}_{p_{\theta_{0}}}[\widetilde{\ell}(u;Z)]-\frac{1}{n}\sum_{i=1}^{n}\widetilde{\ell}(u;Z_{i})\\&=-u^{\top}\mathbb{E}_{p_{\theta_{0}}}[\phi(Z)]+u^{\top}\overline{\phi}_{n}=u^{\top}\Delta_{n}.\end{aligned}$$
Therefore
$$\sup_{u\in rB_{2}^{d}}\bigl|R(u)-\widehat{R}_{n}(u)\bigr|=\sup_{\|u\|_{2}\leq r}|u^{\top}\Delta_{n}|=r\|\Delta_{n}\|_{2},$$
where the last equality follows from
$$\sup_{\|u\|_{2}\leq r}|u^{\top}v|=r\|v\|_{2}.$$
∎
###### Proof of Theorem[5\.9](https://arxiv.org/html/2606.18306#S5.Thmtheorem9)\.
By Lemma [5.8](https://arxiv.org/html/2606.18306#S5.Thmtheorem8),
$$\sup_{u\in rB_{2}^{d}}\bigl|R(u)-\widehat{R}_{n}(u)\bigr|=r\|\Delta_{n}\|_{2}.$$
Set $W_{n}:=\sqrt{n}\,\Delta_{n}$. By the multivariate CLT,
$$W_{n}\xrightarrow{d}W,\qquad W\sim\mathcal{N}(0,G_{0}).$$
Since
$$\mathbb{E}\|W_{n}\|_{2}^{2}=n\,\mathbb{E}\|\Delta_{n}\|_{2}^{2}=\operatorname{Tr}(G_{0})<\infty,$$
we have $\sup_{n\geq 1}\mathbb{E}\|W_{n}\|_{2}^{2}=\operatorname{Tr}(G_{0})<\infty$. Hence the family $\{\|W_{n}\|_{2}:n\geq 1\}$ is uniformly integrable. Convergence in distribution together with uniform integrability gives
$$\mathbb{E}\|W_{n}\|_{2}\longrightarrow\mathbb{E}\|W\|_{2}=\mathbb{E}\|G_{0}^{1/2}g\|_{2},\qquad g\sim\mathcal{N}(0,I_{d}).$$
By the Euclidean-ball formula for Fisher width (Definition [2.2](https://arxiv.org/html/2606.18306#S2.Thmtheorem2) and Example [2.4](https://arxiv.org/html/2606.18306#S2.Thmtheorem4)),
$$w_{G_{0}}(rB_{2}^{d})=r\,\mathbb{E}\|G_{0}^{1/2}g\|_{2}.$$
Combining,
$$\sqrt{n}\;\mathbb{E}\!\left[\sup_{u\in rB_{2}^{d}}|R(u)-\widehat{R}_{n}(u)|\right]=r\,\mathbb{E}\|W_{n}\|_{2}\;\longrightarrow\;w_{G_{0}}(rB_{2}^{d}).$$
The finite-$n$ lower bound follows immediately from the convergence. ∎
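The limit in this proof can be illustrated by simulation. The sketch below assumes a product-Bernoulli exponential family with sufficient statistic $\phi(z)=z$, for which $G_{0}=\operatorname{diag}(p_j(1-p_j))$; the means `p`, the sample size, and the replicate counts are hypothetical choices, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.2, 0.9])            # hypothetical Bernoulli means E[phi(Z)]
G0 = np.diag(p * (1.0 - p))              # Fisher matrix in the natural parameter
r, n, reps = 1.0, 1000, 4000

# phi_bar_n per replicate: a sum of n Bernoulli draws is a binomial count.
phi_bar = rng.binomial(n, p, size=(reps, p.size)) / n
W_n = np.sqrt(n) * (phi_bar - p)         # W_n = sqrt(n) * Delta_n

# Left side: sqrt(n) * E[ sup_u |R(u) - R_hat_n(u)| ] = r * E ||W_n||_2.
lhs = float(r * np.linalg.norm(W_n, axis=1).mean())

# Right side: w_{G0}(r B_2^d) = r * E ||G0^{1/2} g||_2, estimated by Monte Carlo.
g = rng.standard_normal((200_000, p.size))
width = float(r * np.linalg.norm(g * np.sqrt(np.diag(G0)), axis=1).mean())

print(f"sqrt(n) * expected uniform deviation: {lhs:.4f}")
print(f"Fisher width of the ball:             {width:.4f}")
print(f"Jensen upper bound sqrt(Tr G0):       {np.sqrt(np.trace(G0)):.4f}")
```

At $n=1000$ the two estimates agree up to Monte Carlo error, and both sit below $r\sqrt{\operatorname{Tr}(G_{0})}$, the bound implied by Jensen's inequality.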
## Acknowledgments
The author(s) acknowledge the use of ChatGPT and Claude to assist in the preparation of this manuscript. Specifically, AI tools were used to refine the language and structure of the draft, to brainstorm and explore strategies for mathematical proofs, and to assist in generating code for empirical validation. The author(s) meticulously reviewed the manuscript, rigorously verified the mathematical logic step by step, and thoroughly debugged the source code. The author(s) take full and sole responsibility for the originality, correctness, and final content of this work.
## References
- P\. Alquier \(2024\)User\-friendly introduction to PAC\-bayes bounds\.Foundations and Trends in Machine Learning17\(2\),pp\. 174–303\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px5.p2.1)\.
- S\. Amari and H\. Nagaoka \(2000\)Methods of information geometry\.Translations of Mathematical Monographs, Vol\.191,American Mathematical Society,Providence, RI\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px3.p1.1),[§2\.1](https://arxiv.org/html/2606.18306#S2.SS1.p1.8),[§2\.1](https://arxiv.org/html/2606.18306#S2.SS1.p1.9),[Example 2\.6](https://arxiv.org/html/2606.18306#S2.Thmtheorem6.p1.2)\.
- S\. Amari \(1998\)Natural gradient works efficiently in learning\.Neural Computation10\(2\),pp\. 251–276\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px4.p1.1)\.
- S\. Amari \(2016\)Information geometry and its applications\.Applied Mathematical Sciences, Vol\.194,Springer,Tokyo\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px3.p1.1),[§2\.1](https://arxiv.org/html/2606.18306#S2.SS1.p1.8)\.
- D\. Amelunxen, M\. Lotz, M\. B\. McCoy, and J\. A\. Tropp \(2014\)Living on the edge: phase transitions in convex programs with random data\.Information and Inference3\(3\),pp\. 224–294\.Cited by:[§1\.1](https://arxiv.org/html/2606.18306#S1.SS1.p1.2),[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px1.p1.1)\.
- N\. Ay, J\. Jost, H\. V\. Lê, and L\. Schwachhöfer \(2017\)Information geometry\.Ergebnisse der Mathematik und ihrer Grenzgebiete, Vol\.64,Springer,Cham\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px3.p1.1)\.
- P\. L\. Bartlett and S\. Mendelson \(2002\)Rademacher and Gaussian complexities: risk bounds and structural results\.Journal of Machine Learning Research3,pp\. 463–482\.Cited by:[§1\.1](https://arxiv.org/html/2606.18306#S1.SS1.p1.2),[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px2.p1.1)\.
- R\. Bhatia \(1997\)Matrix analysis\.Graduate Texts in Mathematics, Vol\.169,Springer,New York\.Cited by:[§3\.1](https://arxiv.org/html/2606.18306#S3.SS1.6.p1.4)\.
- V\. Chandrasekaran, B\. Recht, P\. A\. Parrilo, and A\. S\. Willsky \(2012\)The convex geometry of linear inverse problems\.Foundations of Computational Mathematics12\(6\),pp\. 805–849\.Cited by:[§1\.1](https://arxiv.org/html/2606.18306#S1.SS1.p1.2),[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px1.p1.1)\.
- M\. P\. do Carmo \(1992\)Riemannian geometry\.Birkhäuser,Boston\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.18306#S4.SS1.p1.7)\.
- R\. M\. Dudley \(1967\)The sizes of compact subsets of Hilbert space and continuity of Gaussian processes\.Journal of Functional Analysis1\(3\),pp\. 290–330\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px1.p1.1)\.
- G\. K\. Dziugaite and D\. M\. Roy \(2017\)Computing nonvacuous generalization bounds for deep \(stochastic\) neural networks with many parameters\.InProceedings of the Thirty\-Third Conference on Uncertainty in Artificial Intelligence,Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px5.p2.1)\.
- B\. Efron \(1975\)Defining the curvature of a statistical problem \(with applications to second order efficiency\)\.The Annals of Statistics3\(6\),pp\. 1189–1242\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px3.p1.1)\.
- P\. Foret, A\. Kleiner, H\. Mobahi, and B\. Neyshabur \(2021\)Sharpness\-aware minimization for efficiently improving generalization\.InInternational Conference on Learning Representations,Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px5.p2.1)\.
- N\. Golowich, A\. Rakhlin, and O\. Shamir \(2018\)Size\-independent sample complexity of neural networks\.InProceedings of the 31st Conference on Learning Theory,Proceedings of Machine Learning Research, Vol\.75,pp\. 297–299\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px2.p1.1)\.
- Y\. Gordon \(1988\)On Milman’s inequality and random subspaces which escape through a mesh inℝn\\mathbb\{R\}^\{n\}\.Geometric Aspects of Functional Analysis1317,pp\. 84–106\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px1.p1.1)\.
- N\. Halko, P\. Martinsson, and J\. A\. Tropp \(2011\)Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions\.SIAM Review53\(2\),pp\. 217–288\.Cited by:[§6\.2](https://arxiv.org/html/2606.18306#S6.SS2.p2.3)\.
- S\. Hochreiter and J\. Schmidhuber \(1997\)Flat minima\.Neural Computation9\(1\),pp\. 1–42\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px5.p2.1)\.
- Y\. Jiang, B\. Neyshabur, H\. Mobahi, D\. Krishnan, and S\. Bengio \(2020\)Fantastic generalization measures and where to find them\.InInternational Conference on Learning Representations,Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px5.p2.1)\.
- R\. Karakida, S\. Akaho, and S\. Amari \(2019\)Universal statistics of Fisher information in deep neural networks: mean field approach\.InProceedings of the 22nd International Conference on Artificial Intelligence and Statistics,Proceedings of Machine Learning Research, Vol\.89,pp\. 1032–1041\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px4.p1.1)\.
- N\. S\. Keskar, D\. Mudigere, J\. Nocedal, M\. Smelyanskiy, and P\. T\. P\. Tang \(2017\)On large\-batch training for deep learning: generalization gap and sharp minima\.InInternational Conference on Learning Representations,Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px5.p2.1)\.
- V\. Koltchinskii \(2011\)Oracle inequalities in empirical risk minimization and sparse recovery problems\.Lecture Notes in Mathematics, Vol\.2033,Springer,Berlin\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px2.p1.1)\.
- F\. Kunstner, L\. Balles, and P\. Hennig \(2019\)Limitations of the empirical Fisher approximation for natural gradient descent\.InAdvances in Neural Information Processing Systems,Vol\.32\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px4.p1.1)\.
- Y\. LeCun, L\. Bottou, Y\. Bengio, and P\. Haffner \(1998\)Gradient\-based learning applied to document recognition\.Proceedings of the IEEE86\(11\),pp\. 2278–2324\.Cited by:[§7\.1](https://arxiv.org/html/2606.18306#S7.SS1.SSS0.Px1.p1.2)\.
- M\. Ledoux and M\. Talagrand \(1991\)Probability in Banach spaces: isoperimetry and processes\.Ergebnisse der Mathematik und ihrer Grenzgebiete \(3\), Vol\.23,Springer,Berlin\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px1.p1.1),[§2\.5](https://arxiv.org/html/2606.18306#S2.SS5.5.p1.3)\.
- T\. Liang, T\. Poggio, A\. Rakhlin, and J\. Stokes \(2019\)Fisher\-Rao metric, geometry, and complexity of neural networks\.InProceedings of the 22nd International Conference on Artificial Intelligence and Statistics,Proceedings of Machine Learning Research, Vol\.89,pp\. 888–896\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px5.p1.1)\.
- J\. Martens and R\. Grosse \(2015\)Optimizing neural networks with Kronecker\-factored approximate curvature\.InProceedings of the 32nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.37,pp\. 2408–2417\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px4.p1.1)\.
- J\. Martens \(2020\)New insights and perspectives on the natural gradient method\.Journal of Machine Learning Research21\(146\),pp\. 1–76\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px4.p1.1)\.
- D\. A\. McAllester \(1999\)Some PAC\-Bayesian theorems\.InProceedings of the Eleventh Annual Conference on Computational Learning Theory,pp\. 230–234\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px5.p2.1)\.
- M\. K\. Murray and J\. W\. Rice \(1993\)Differential geometry and statistics\.Monographs on Statistics and Applied Probability, Vol\.48,Chapman & Hall,London\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px3.p1.1)\.
- Y\. Ollivier \(2015\)Riemannian metrics for neural networks I: feedforward networks\.Information and Inference4\(2\),pp\. 108–153\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px4.p1.1)\.
- M\. Seeger \(2002\)PAC\-Bayesian generalisation error bounds for Gaussian process classification\.Journal of Machine Learning Research3,pp\. 233–269\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px5.p2.1)\.
- M\. Talagrand \(2014\)Upper and lower bounds for stochastic processes\.Ergebnisse der Mathematik und ihrer Grenzgebiete \(3\), Vol\.60,Springer,Berlin\.Cited by:[§2\.5](https://arxiv.org/html/2606.18306#S2.SS5.3.p2.3),[§5\.2](https://arxiv.org/html/2606.18306#S5.SS2.5.p5.6)\.
- J\. A\. Tropp \(2015\)An introduction to matrix concentration inequalities\.Foundations and Trends in Machine Learning, Vol\.8,Now Publishers\.Cited by:[§3\.1](https://arxiv.org/html/2606.18306#S3.SS1.5.p4.3)\.
- R\. Vershynin \(2018\)High\-dimensional probability: an introduction with applications in data science\.Cambridge Series in Statistical and Probabilistic Mathematics,Cambridge University Press\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px1.p1.1)\.
- M\. J\. Wainwright \(2019\)High\-dimensional statistics: a non\-asymptotic viewpoint\.Cambridge Series in Statistical and Probabilistic Mathematics,Cambridge University Press\.Cited by:[§1\.3](https://arxiv.org/html/2606.18306#S1.SS3.SSS0.Px1.p1.1)\.