Are Flat Minima an Illusion?
Summary
This paper challenges the common belief that flat minima cause better generalization in neural networks, arguing that 'weakness'—a reparameterization-invariant measure of function simplicity—is the true driver. Empirical results on MNIST and Fashion-MNIST show that weakness predicts generalization while sharpness anticorrelates, and the large-batch generalization advantage vanishes as training data increases.
# Are Flat Minima an Illusion?
Source: [https://arxiv.org/html/2605.05209](https://arxiv.org/html/2605.05209)
Michael Timothy Bennett, School of Computing, The Australian National University, michael.bennett@anu.edu.au
###### Abstract
Neural networks that land in flat regions of the loss landscape tend to generalise better than those in sharp regions. Sharpness-Aware Minimisation exploits this to improve generalisation. But function-preserving reparameterisation can inflate the Hessian of any minimum by two orders of magnitude without changing a single prediction. If the geometry of weight space can be manufactured from nothing, it cannot be the cause of anything. In other words, flat is simple and simplicity depends on encoding. Here I show that the actual driver is weakness, the volume of completions compatible with the learned function in the learner's embodied language. Weakness is reparameterisation-invariant because it is defined over what the network *does*, not how it is parameterised. I prove weakness is minimax-optimal under exchangeable demands, and that PAC-Bayes bounds work because they correlate with it. On MNIST, the large-batch generalisation advantage *vanishes* as training data grows, from $+1.6\%$ at $n=2{,}000$ to $+0.02\%$ at $n=60{,}000$. A quantity whose predictive power depends on how much data you have is not a cause but a confounder. I run head-to-heads on 100 networks with identical architecture and training. For MNIST, weakness predicts generalisation ($\rho=+0.374$, $p=0.00012$), sharpness anticorrelates ($\rho=-0.226$) and simplicity predicts nothing ($p=0.848$). The weakness result replicates on Fashion-MNIST ($\rho=+0.384$, $p=8.15\times 10^{-5}$), though simplicity is at least somewhat predictive there. Simplicity is dataset dependent, whereas weakness is invariant. Flat minima were never the answer.
## 1 Introduction
Flat minima generalise better than sharp ones. At least, that is the orthodox position. Hochreiter and Schmidhuber ([1997](https://arxiv.org/html/2605.05209#bib.bib5)) argued it in 1997, and Keskar et al. ([2017](https://arxiv.org/html/2605.05209#bib.bib309)) demonstrated it empirically in 2017. Sharpness-Aware Minimisation (Foret et al., [2021](https://arxiv.org/html/2605.05209#bib.bib10)) explicitly seeks flat minima and improves generalisation.
The correlation is not in dispute, but the causation is. Dinh et al. ([2017](https://arxiv.org/html/2605.05209#bib.bib6)) gave reasons to doubt it. They constructed a reparameterisation that transforms sharp minima into flat ones without changing the function of the network. The Hessian eigenvalues and loss landscape geometry change, while the function and predictions do not.
Several authors have proposed parameterisation-invariant sharpness measures. Tsuzuku et al. ([2020](https://arxiv.org/html/2605.05209#bib.bib16)) normalised by parameter scale. Petzka et al. ([2021](https://arxiv.org/html/2605.05209#bib.bib15)) used a Fisher information metric. Kwon et al. ([2021](https://arxiv.org/html/2605.05209#bib.bib17)) introduced adaptive sharpness. They still do not answer the question "what property of the learned *function* makes it generalise?"
This paper offers three contributions\.
1. *The vanishing advantage.* The generalisation advantage of large-batch over small-batch training vanishes as the training set grows. On MNIST with a 3-layer ReLU MLP, large-batch networks generalise better at 500, 2,000, 6,000, and 12,000 training points. At 24,000 the gap is small ($+0.08$ pp, $p=0.002$). At 60,000 it is negligible ($+0.02$ pp). A quantity that correlates at some scales and not others is not a cause. It is a confound.
2. *Reparameterisation invariance.* I prove that weakness (Bennett, [2023](https://arxiv.org/html/2605.05209#bib.bib65), [2025a](https://arxiv.org/html/2605.05209#bib.bib75)) is reparameterisation-invariant by construction. I confirm non-invariance of the Hessian trace on trained networks, with changes of up to $99\times$ while generalisation stays exactly constant.
3. *Weakness beats sharpness.* I construct two formal vocabularies for frozen-partition ReLU networks. The region-class vocabulary (Appendix [H](https://arxiv.org/html/2605.05209#A8)) is an approximation that treats each activation region independently. The feature-classifier vocabulary (Appendix [I](https://arxiv.org/html/2605.05209#A9)) respects the shared-weight constraint exactly and reduces each extension check to a linear program. The lack of a direct neural-network analogue of formal weakness was identified as an open problem in Bennett ([2026](https://arxiv.org/html/2605.05209#bib.bib83)); the present work provides a concrete partial operationalisation via linear feasibility and pair-proxy measures. I measure weakness via linear feasibility on 100 networks with the same architecture, data, and training. Weakness outperforms sharpness at model selection ($\rho=+0.374$, $p=0.00012$ vs $\rho=-0.226$, $p=0.024$). Simplicity does not predict at all ($p=0.848$). The result replicates on Fashion-MNIST at nearly identical strength ($\rho\approx+0.38$, $p\approx 10^{-4}$).
Buffer size ($k_{\mathrm{free}}$) is a significant but modest predictor ($\rho=+0.117$, $p=0.043$). The region-class vocabulary is an approximation. The feature-classifier vocabulary (Appendix [I](https://arxiv.org/html/2605.05209#A9)) solves the shared-weight problem and computes single-point extensions exactly, but the pair proxy is a sum of marginals, not the full extension count.
The structure is as follows. Section [2](https://arxiv.org/html/2605.05209#S2) reviews the definitions from Stack Theory. Section [3](https://arxiv.org/html/2605.05209#S3) extends weakness to continuous domains. Section [4](https://arxiv.org/html/2605.05209#S4) presents the reparameterisation invariance proof, the PAC-Bayes connection, and the confounding diagnosis. Section [5](https://arxiv.org/html/2605.05209#S5) presents the experiments. Section [6](https://arxiv.org/html/2605.05209#S6) locates continuous weakness in the landscape of generalisation theory. Section [7](https://arxiv.org/html/2605.05209#S7) discusses implications, limitations, and future work. Appendix [H](https://arxiv.org/html/2605.05209#A8) constructs the region-class vocabulary. Appendix [I](https://arxiv.org/html/2605.05209#A9) constructs the feature-classifier vocabulary, which respects the shared-weight constraint and reduces each extension check to a linear program.
The proofs are self-contained, but the definitions used here are drawn from the broader Stack Theory literature (Bennett, [2025b](https://arxiv.org/html/2605.05209#bib.bib87)). The finite-case predecessors of these continuous results were proved in Bennett ([2025a](https://arxiv.org/html/2605.05209#bib.bib75)).
## 2 Background
###### Definition 1 (Environment and programs).
An *environment* is a nonempty set $\Phi$ of mutually exclusive states. A *program* is any subset $p\subseteq\Phi$. Write $\mathcal{P}=2^{\Phi}$ for the set of all programs. A *vocabulary* is any set of programs $\mathfrak{v}\subseteq\mathcal{P}$.
Think of $\Phi$ as the set of all possible configurations of the world. Each state is one complete configuration. A program is a constraint that holds in some configurations and not others. A vocabulary is the set of constraints a system can express. For a neural network, the vocabulary is the set of input-output behaviours the architecture can realise.
###### Definition 2 (Embodied language and statements).
A vocabulary $\mathfrak{v}$ induces an *embodied language*

$$L_{\mathfrak{v}}=\left\{l\subseteq\mathfrak{v}\;\middle|\;\bigcap_{p\in l}p\neq\emptyset\right\}.$$

Elements $l\in L_{\mathfrak{v}}$ are called *statements*. The *truth set* of $l$ is $T(l)=\bigcap_{p\in l}p$.
A statement is a consistent conjunction of programs corresponding to a subset of possible physical states. For example, raising one's arm is a statement. It is a constraint on spatial position, musculature and so on. The embodied language $L_{\mathfrak{v}}$ is *not* the full powerset $2^{\mathfrak{v}}$. Most combinations of programs are mutually exclusive. You cannot raise and lower your arm at the same time. The satisfiability constraint $\bigcap_{p\in l}p\neq\emptyset$ is what makes the structure interesting.
###### Definition 3 (Extensions and weakness).
A *completion* of $x\in L_{\mathfrak{v}}$ is any $y\in L_{\mathfrak{v}}$ with $x\subseteq y$. The *extension* of $x$ is

$$\mathrm{Ext}(x)=\{y\in L_{\mathfrak{v}}\mid x\subseteq y\}.$$

For a set of statements $X\subseteq L_{\mathfrak{v}}$, write $\mathrm{Ext}(X)=\bigcup_{x\in X}\mathrm{Ext}(x)$. The *weakness* of $x$ is $w(x)=|\mathrm{Ext}(x)|$.
Weakness measures how non\-committal a statement is\. The more commitments you can still make, the weaker the statement\.
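For a finite vocabulary, Definitions 1 to 3 can be computed by brute force. The sketch below is only an illustration under assumed names: the four-state environment and the programs `arm_up`, `arm_down`, and `eyes_open` are hypothetical and do not come from the paper. It enumerates the embodied language and counts completions.

```python
from itertools import combinations

def truth_set(statement, vocabulary, states):
    """Intersect the programs named in a statement; the empty statement holds everywhere."""
    result = set(states)
    for name in statement:
        result &= vocabulary[name]
    return result

def embodied_language(vocabulary, states):
    """All subsets of the vocabulary whose programs share at least one state."""
    names = sorted(vocabulary)
    language = []
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            statement = frozenset(subset)
            if truth_set(statement, vocabulary, states):
                language.append(statement)
    return language

def weakness(statement, language):
    """Number of completions: statements in the language that contain this one."""
    return sum(1 for candidate in language if statement <= candidate)

# Hypothetical toy environment with four states and three programs.
states = {0, 1, 2, 3}
vocabulary = {"arm_up": {0, 1}, "arm_down": {2, 3}, "eyes_open": {0, 2}}
language = embodied_language(vocabulary, states)
for statement in language:
    print(sorted(statement), weakness(statement, language))
```

In this toy case `arm_up` and `arm_down` never co-occur, so the language has six statements instead of eight, and the statement `{eyes_open}` is weaker (three completions) than `{arm_up}` (two completions).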
###### Definition 4 (Tasks, correctness, and learning).
A *$\mathfrak{v}$-task* is a pair $\alpha=\langle I_{\alpha},O_{\alpha}\rangle$ where $I_{\alpha}\subseteq L_{\mathfrak{v}}$ and $O_{\alpha}\subseteq\mathrm{Ext}(I_{\alpha})$. A *policy* $\pi\in L_{\mathfrak{v}}$ is *correct for $\alpha$* if $\mathrm{Ext}(I_{\alpha})\cap\mathrm{Ext}(\pi)=O_{\alpha}$. The set of correct policies is $\Pi_{\alpha}$. Task $\alpha$ is a *child* of $\omega$, written $\alpha\sqsubset\omega$, if $I_{\alpha}\subsetneq I_{\omega}$ and $O_{\alpha}\subseteq O_{\omega}$.
A task specifies inputs and correct outputs. A policy constrains how the system completes inputs. Correctness means the policy produces exactly the right outputs on the given inputs. A child task is a task with fewer examples. Learning means generalising from a child task to its parent.
The foundational result is that under an exchangeable prior over parent tasks, weakness maximisation is both necessary and sufficient for optimal generalisation (Bennett, [2023](https://arxiv.org/html/2605.05209#bib.bib65), [2025a](https://arxiv.org/html/2605.05209#bib.bib75)). Among all correct policies for the child, the ones most likely to also be correct for the unknown parent are the weakest. Minimising description length is neither necessary nor sufficient (Bennett, [2024](https://arxiv.org/html/2605.05209#bib.bib69)).
## 3 Extending Weakness to Continuous Domains
When the vocabulary is finite, weakness is a count\. You enumerate completions and take the cardinality\. When the vocabulary is infinite, every uncountable extension has the same cardinality, so counting breaks\. This section replaces counting with measuring\.
###### Definition 5 (Measurable vocabulary).
A *measurable vocabulary* is a triple $(\mathfrak{v},\mathcal{A}_{L},\mu)$ where $\mathfrak{v}\subseteq\mathcal{P}$ is a vocabulary (possibly infinite), $\mathcal{A}_{L}$ is a $\sigma$-algebra on $L_{\mathfrak{v}}$, and $\mu$ is a $\sigma$-finite measure on $(L_{\mathfrak{v}},\mathcal{A}_{L})$. I require that $\mathrm{Ext}(l)$ is $\mathcal{A}_{L}$-measurable for every $l\in L_{\mathfrak{v}}$. Since $\mathrm{Ext}(X)=\bigcup_{x\in X}\mathrm{Ext}(x)$ for any $X\subseteq L_{\mathfrak{v}}$, and a countable union of measurable sets is measurable, $\mathrm{Ext}(I_{\alpha})$ is measurable whenever $I_{\alpha}$ is countable. All tasks in this paper have finite input sets.
The measure $\mu$ does for infinite vocabularies what counting measure does for finite ones. It assigns a size to sets of statements so that extensions can be compared. When $\mathfrak{v}$ is finite, setting $\mu$ to counting measure recovers every existing result.
###### Definition 6 (Continuous weakness).
The *$\mu$-weakness* of $l\in L_{\mathfrak{v}}$ is $w_{\mu}(l)=\mu(\mathrm{Ext}(l))$.
###### Definition 7 (Continuous extension model).
Fix a measurable vocabulary $(\mathfrak{v},\mathcal{A}_{L},\mu)$ and a task $\alpha$ with output region $O_{\alpha}$. Define the *unseen region* $U=L_{\mathfrak{v}}\setminus\mathrm{Ext}(I_{\alpha})$, where $0<\mu(U)<\infty$ and $\mu(O_{\alpha})<\infty$. Assume $(U,\mathcal{A}_{L}|_{U},\mu|_{U})$ is an atomless standard measure space (Appendix [A](https://arxiv.org/html/2605.05209#A1)). A *continuous extension model* is a probability space $(\Omega,\mathcal{F},P)$ with a random set $S\colon\Omega\to 2^{U}$ such that $\{S\subseteq B\}$ and $\{S\subseteq B,\,S\cap A\neq\emptyset\}$ are $\mathcal{F}$-measurable for every $A,B\in\mathcal{A}_{L}|_{U}$. The *buffer* of a correct policy $\pi$ is $B_{\pi}=\mathrm{Ext}(\pi)\cap U$. The *$P$-weakness* is $w_{P}(\pi)=P(S\subseteq B_{\pi})$.
The parent task will additionally demand some set $S$ of outputs. The policy survives if and only if every demand falls inside its buffer.
###### Definition 8 ($\mu$-exchangeability and nondegeneracy).
A continuous extension model is *$\mu$-exchangeable* if for every measure-preserving bijection $\sigma\colon(U,\mu)\to(U,\mu)$ and every measurable $B\subseteq U$, $P(S\subseteq B)=P(S\subseteq\sigma(B))$. It is *nondegenerate* if for every measurable $A\subseteq U$ with $\mu(A)>0$, we have $P(\emptyset\neq S\subseteq A)>0$. (This event is measurable because $\{\emptyset\neq S\subseteq A\}=\{S\subseteq A\}\setminus\{S\subseteq\emptyset\}$.)
Write $G(\pi,P)=w_{P}(\pi)=P(S\subseteq B_{\pi})$ for the *generalisation probability* of policy $\pi$ under extension model $P$.
###### Lemma 1 (Buffer measure determines $\mu$-weakness for correct policies).
For any correct $\pi\in\Pi_{\alpha}$, $w_{\mu}(\pi)=\mu(O_{\alpha})+\mu(B_{\pi})$, where $O_{\alpha}=\mathrm{Ext}(I_{\alpha})\cap\mathrm{Ext}(\pi)$ is the output region (fixed for all $\pi\in\Pi_{\alpha}$ by the correctness condition) and $B_{\pi}=\mathrm{Ext}(\pi)\cap U$. The $\mu$-weakness ranking and the buffer-measure ranking agree among correct policies.
###### Proof.
By the correctness condition, $\mathrm{Ext}(\pi)=O_{\alpha}\sqcup B_{\pi}$ where $O_{\alpha}\subseteq\mathrm{Ext}(I_{\alpha})$ and $B_{\pi}\subseteq U=L_{\mathfrak{v}}\setminus\mathrm{Ext}(I_{\alpha})$. These sets are disjoint and both $\mathcal{A}_{L}$-measurable (by the measurability requirement on extensions in Definition [5](https://arxiv.org/html/2605.05209#Thmdefinition5)). So $w_{\mu}(\pi)=\mu(\mathrm{Ext}(\pi))=\mu(O_{\alpha})+\mu(B_{\pi})$. Since $\mu(O_{\alpha})$ is the same for all $\pi\in\Pi_{\alpha}$, ranking by $w_{\mu}(\pi)$ is equivalent to ranking by $\mu(B_{\pi})$. ∎
The continuous theory produces four main theorems\. Full proofs are in Appendices[C](https://arxiv.org/html/2605.05209#A3)–[F](https://arxiv.org/html/2605.05209#A6)\.
###### Theorem 2 (Sufficiency).
If $P$ is $\mu$-exchangeable, then $w_{P}(\pi)=f(\mu(B_{\pi}))$ for some nondecreasing $f$. Among correct policies, $\mu$-weakness determines generalisation probability.
###### Theorem 3 (Strict necessity).
If $P$ is $\mu$-exchangeable and nondegenerate, and $\mu(B_{\pi_{1}})>\mu(B_{\pi_{2}})$, then $w_{P}(\pi_{1})>w_{P}(\pi_{2})$.
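One concrete model makes Theorems 2 and 3 tangible. The sketch below assumes a specific extension model that is my choice, not the paper's: $U=[0,1)$ with Lebesgue measure, and $S$ a set of `num_demands` independent uniform points. Under that assumption $P(S\subseteq B)=\mu(B)^{m}$, which depends on $B$ only through its measure and is strictly increasing in it. The buffer is taken to be an interval purely for convenience.

```python
import random

def closed_form_weakness(buffer_measure, num_demands):
    """P(S within B) when S is num_demands i.i.d. uniform points on U = [0, 1)."""
    return buffer_measure ** num_demands

def simulated_weakness(buffer_measure, num_demands, trials, rng):
    """Monte Carlo estimate of P(S within B) with B = [0, buffer_measure)."""
    survived = 0
    for _ in range(trials):
        # The policy survives only if every demanded point lands in its buffer.
        if all(rng.random() < buffer_measure for _ in range(num_demands)):
            survived += 1
    return survived / trials

rng = random.Random(0)  # fixed seed so the run is repeatable
estimate_small = simulated_weakness(0.5, 3, 20000, rng)
estimate_large = simulated_weakness(0.8, 3, 20000, rng)
print(closed_form_weakness(0.5, 3), estimate_small)
print(closed_form_weakness(0.8, 3), estimate_large)
```

The policy with the larger buffer survives more often, whatever the number of demands, which is the ordering the theorems assert for every exchangeable model.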
###### Theorem 4 (Unified optimality).
Let $\mathcal{E}$ be the class of all $\mu$-exchangeable models.
1. *Sufficiency.* $\mu(B_{\pi_{1}})\geq\mu(B_{\pi_{2}})\implies G(\pi_{1},P)\geq G(\pi_{2},P)$ for all $P\in\mathcal{E}$.
2. *Strict necessity.* If $P$ is nondegenerate, $\mu(B_{\pi_{1}})>\mu(B_{\pi_{2}})\implies G(\pi_{1},P)>G(\pi_{2},P)$.
3. *Uniqueness.* All optimal policies have the same buffer measure $\mu(B_{\pi})$, hence the same $\mu$-weakness.
###### Theorem 5 (Minimax optimality).
If a $\mu$-weakest correct policy $\pi^{*}$ exists,

$$\max_{\pi\in\Pi_{\alpha}}\inf_{P\in\mathcal{E}}G(\pi,P)=\inf_{P\in\mathcal{E}}\max_{\pi\in\Pi_{\alpha}}G(\pi,P),$$

both achieved by $\pi^{*}$.
This result is a corollary of sufficiency. Because $\pi^{*}$ maximises $G(\pi,P)$ for every $P\in\mathcal{E}$ simultaneously, it trivially achieves the max-min and the minimax equality holds. The learner has a dominant strategy. No adversary can make weakness maximisation suboptimal.
## 4 The Flat Minima Resolution
The continuous theory resolves the flat minima puzzle. The resolution has three parts.
### 4.1 Reparameterisation invariance of weakness
###### Definition 9 (Policy map and function-preserving reparameterisation).
Fix a measurable vocabulary defined over input-output behaviour. Let $\Theta\subseteq\mathbb{R}^{d}$ be a parameter space and let $F(\theta)$ denote the function computed by the network with parameters $\theta$. A program $p\in\mathfrak{v}$ is *satisfied* by $F(\theta)$ if the input-output behaviour of $F(\theta)$ lies in $p$ (viewed as a subset of $\Phi$). The *policy map* is

$$\Pi_{F}(\theta)=\{p\in\mathfrak{v}\mid F(\theta)\text{ satisfies }p\}.$$

A *function-preserving reparameterisation* is a diffeomorphism $\phi\colon\Theta\to\Theta$ with $F(\phi(\theta))=F(\theta)$ for all $\theta$.
###### Theorem 6 (Reparameterisation invariance of $\mu$-weakness).
If $\phi$ is function-preserving, then $\Pi_{F}(\phi(\theta))=\Pi_{F}(\theta)$, hence $w_{\mu}(\Pi_{F}(\phi(\theta)))=w_{\mu}(\Pi_{F}(\theta))$.
###### Proof.
$F(\phi(\theta))=F(\theta)$ by assumption. The policy map depends only on which programs $F(\theta)$ satisfies. Same function, same programs satisfied, same policy, same extension, same weakness. ∎
A three-sentence proof. The definitions do most of the work. The vocabulary lives in function space. The reparameterisation does not move you in function space, so weakness is invariant.
### 4.2 PAC-Bayes bounds track weakness
###### Theorem 7 (PAC-Bayes ordering agrees with $\mu$-weakness).
Fix a measurable vocabulary with $\mu(L_{\mathfrak{v}})<\infty$. Set $P_{0}=\mu/\mu(L_{\mathfrak{v}})$ (the normalised prior) and $Q_{\pi}=\mu|_{B_{\pi}}/\mu(B_{\pi})$ (uniform posterior on the buffer). Then for correct $\pi$ with $\mu(B_{\pi})>0$,

$$\mathrm{KL}(Q_{\pi}\|P_{0})=\log\frac{\mu(L_{\mathfrak{v}})}{\mu(B_{\pi})},$$

strictly decreasing in $\mu(B_{\pi})$.
###### Proof.
The Radon-Nikodym derivative is $dQ_{\pi}/dP_{0}=\mu(L_{\mathfrak{v}})/\mu(B_{\pi})\cdot\mathbf{1}_{B_{\pi}}$. Computing the KL-divergence,

$$\mathrm{KL}(Q_{\pi}\|P_{0})=\int\log\frac{dQ_{\pi}}{dP_{0}}\,dQ_{\pi}=\log\frac{\mu(L_{\mathfrak{v}})}{\mu(B_{\pi})}.$$

Since $\mu(O_{\alpha})$ is fixed for all $\pi\in\Pi_{\alpha}$ and $\mu(\mathrm{Ext}(\pi))=\mu(O_{\alpha})+\mu(B_{\pi})$, the KL is strictly decreasing in $\mu(B_{\pi})$, hence strictly decreasing in $\mu$-weakness $w_{\mu}(\pi)=\mu(\mathrm{Ext}(\pi))$. ∎
Since the standard PAC-Bayes bound is increasing in $\mathrm{KL}(Q_{\pi}\|P_{0})$ for fixed empirical risk and sample size, and correct policies have zero empirical risk by definition, the $\mu$-weakest correct policy minimises the PAC-Bayes bound. PAC-Bayes bounds work because they track weakness. The KL-divergence is a proxy for a proxy. The fundamental quantity is the buffer measure.
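The closed form in Theorem 7 is easy to check in the finite case. The sketch below assumes counting measure on a language of `total_size` statements, with the buffer a subset of `buffer_size` of them; both sizes are arbitrary illustrative numbers. The posterior puts no mass outside the buffer, so only buffer atoms contribute to the sum.

```python
import math

def kl_uniform_buffer(total_size, buffer_size):
    """KL(Q || P0) for Q uniform on buffer_size atoms and P0 uniform on total_size atoms."""
    q = 1.0 / buffer_size
    p0 = 1.0 / total_size
    # Q is zero outside the buffer, so the sum runs over buffer atoms only.
    return sum(q * math.log(q / p0) for _ in range(buffer_size))

total = 1000
kl_values = {size: kl_uniform_buffer(total, size) for size in (10, 100, 500)}
for size, value in kl_values.items():
    print(size, value, math.log(total / size))
```

Each computed value matches $\log(\text{total}/\text{buffer})$, and the divergence shrinks as the buffer grows.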
### 4.3 Diagnosis of confounding
The preceding results assemble the diagnosis.
1. Weakness is reparameterisation-invariant (Theorem [6](https://arxiv.org/html/2605.05209#Thmtheorem6)).
2. Weakness determines generalisation probability under exchangeable demands (Theorems [2](https://arxiv.org/html/2605.05209#Thmtheorem2)–[5](https://arxiv.org/html/2605.05209#Thmtheorem5)).
3. Sharpness is not reparameterisation-invariant (Dinh et al., [2017](https://arxiv.org/html/2605.05209#bib.bib6)).
4. Therefore sharpness is not the driver.
## 5 Experiments
### 5.1 Setup
The cross-regime experiments (Sections [5.2](https://arxiv.org/html/2605.05209#S5.SS2)–[5.4](https://arxiv.org/html/2605.05209#S5.SS4)) use a 3-layer ReLU MLP ($784\to 256\to 128\to 10$) trained on subsets of MNIST with no regularisation. The model-selection experiments (Section [5.6](https://arxiv.org/html/2605.05209#S5.SS6) and Section [5.7](https://arxiv.org/html/2605.05209#S5.SS7)) use smaller architectures as specified in each section. I train two regimes per data scale. The *small-batch* regime uses batch size 64 with learning rate 0.01–0.05. The *large-batch* regime uses batch size equal to the training set size at 500 and 2,000, and batch size 4,096 at 6,000 and above, with learning rate 0.2–0.5. All networks are trained to $>99.5\%$ training accuracy. At most scales, all networks reach exactly 100%. Exceptions are one network at 2,000 (99.95%), the 24,000 large-batch regime (99.98–100%), and 4 networks at 60,000 (99.998%). There are 25 networks per regime, 50 per data scale.
For the reparameterisation test, I use $T_{\beta}(\theta)=(\beta W_{1},\beta b_{1},W_{2}/\beta,b_{2},W_{3},b_{3})$ between layers 1 and 2, and $T_{\gamma}(\theta)=(W_{1},b_{1},\gamma W_{2},\gamma b_{2},W_{3}/\gamma,b_{3})$ between layers 2 and 3. By the positive homogeneity of ReLU, both preserve the function computed by the network for any $\beta,\gamma>0$. The 3-layer architecture provides two independent reparameterisation axes. For each network I measure:
1. *Hessian trace.* Estimated via Hutchinson's method with 50 Rademacher vectors and autodifferentiation Hessian-vector products.
2. *Layer-1 activation patterns.* The number of distinct binary activation patterns $\mathbf{1}[W_{1}x+b_{1}>0]$ across all unseen points.
3. *Layer-2 activation patterns.* The number of distinct binary activation patterns $\mathbf{1}[W_{2}h_{1}+b_{2}>0]$ across all unseen points.
4. *Ensemble agreement.* For each network, among the unseen points it misclassifies, the fraction correctly classified by other networks in the same experiment.
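Hutchinson's method from the first item can be sketched without any autodiff library. In the snippet below the Hessian-vector product is replaced by multiplication with a small explicit matrix; `make_hvp` is a hypothetical stand-in for the autodifferentiation call used in the experiments, and the two matrices are made-up examples. The estimator averages $v^{\top}Hv$ over random sign vectors $v$.

```python
import random

def hutchinson_trace(hvp, dim, num_samples, rng):
    """Estimate trace(H) as the mean of v^T (H v) over Rademacher vectors v."""
    total = 0.0
    for _ in range(num_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hv = hvp(v)
        total += sum(vi * hvi for vi, hvi in zip(v, hv))
    return total / num_samples

def make_hvp(matrix):
    """Hypothetical stand-in for an autodiff Hessian-vector product."""
    def hvp(v):
        return [sum(a * x for a, x in zip(row, v)) for row in matrix]
    return hvp

# Illustrative symmetric matrices, both with trace 10.
diagonal = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 5.0]]
coupled = [[2.0, 0.5, 0.5], [0.5, 3.0, 0.5], [0.5, 0.5, 5.0]]

rng = random.Random(0)  # fixed seed so the run is repeatable
exact_estimate = hutchinson_trace(make_hvp(diagonal), 3, 50, rng)
noisy_estimate = hutchinson_trace(make_hvp(coupled), 3, 5000, rng)
print(exact_estimate, noisy_estimate)
```

For a diagonal matrix every sample equals the trace exactly, since the squared signs are all one. Off-diagonal entries add zero-mean noise, which is why the estimate is averaged over many vectors.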
### 5.2 The vanishing advantage
Table [1](https://arxiv.org/html/2605.05209#S5.T1) presents test accuracy by regime across six training set sizes. At 500, 2,000, 6,000, and 12,000 training points, the large-batch regime generalises better. At 24,000 the gap is small but statistically significant ($+0.08$ pp, Welch $p=0.002$). At 60,000 it is negligible ($+0.02$ pp). The large-batch advantage peaks at 2,000 training points and shrinks thereafter.
Table 1: Test accuracy by training regime across data scales. All rows use the 3-layer architecture. $\dagger$ Large-batch networks that did not reach exactly 100% training accuracy (one at 2,000 at 99.95%, 23 at 24,000 at $>99.98\%$, 4 at 60,000 at 99.998%). All other networks reached exactly 100%.

At every data scale, the large-batch regime has *lower* Hessian trace (Table [1](https://arxiv.org/html/2605.05209#S5.T1)). The standard association between large batch size and sharp minima (Keskar et al., [2017](https://arxiv.org/html/2605.05209#bib.bib309)) is reversed at every scale tested. The Hessian tracks the generalisation gap but does not explain it.
### 5.3 Reparameterisation invariance on trained networks
I apply $T_{\beta}$ and $T_{\gamma}$ to trained networks at three data scales (500, 2,000, 6,000) and measure all four quantities. The Hessian trace changes by up to $99\times$ under $T_{\beta}$ and $54\times$ under $T_{\gamma}$, while test accuracy, L1 and L2 activation patterns, and ensemble agreement are invariant up to floating-point precision. This holds for all 12 networks tested across all three data scales. The two independent reparameterisation axes ($T_{\beta}$ between layers 1–2, $T_{\gamma}$ between layers 2–3) demonstrate that the argument generalises to arbitrary depth. Full results are in Appendix [G](https://arxiv.org/html/2605.05209#A7).
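The effect is visible even in a network with a single hidden unit. The sketch below is a toy analogue, not the experimental code: it uses a scalar version of $T_{\beta}$, one made-up data point, and a finite-difference Hessian trace in place of Hutchinson's estimator. All parameter values are illustrative assumptions.

```python
def relu(z):
    return max(0.0, z)

def forward(params, x):
    """One-hidden-unit ReLU network: f(x) = w2 * relu(w1 * x + b1) + b2."""
    w1, b1, w2, b2 = params
    return w2 * relu(w1 * x + b1) + b2

def loss(params, x, y):
    return 0.5 * (forward(params, x) - y) ** 2

def rescale(params, beta):
    """Scalar analogue of T_beta: scale layer 1 by beta and the layer-2 weight by 1/beta."""
    w1, b1, w2, b2 = params
    return (beta * w1, beta * b1, w2 / beta, b2)

def hessian_trace(params, x, y, step=1e-3):
    """Sum of central second differences of the loss along each parameter axis."""
    base = loss(params, x, y)
    trace = 0.0
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += step
        minus[i] -= step
        trace += (loss(plus, x, y) - 2.0 * base + loss(minus, x, y)) / step ** 2
    return trace

theta = (1.0, 0.5, 2.0, 0.0)        # illustrative parameters
theta_scaled = rescale(theta, 10.0)  # beta = 10
x, y = 1.0, 2.5                      # one made-up data point

trace_before = hessian_trace(theta, x, y)
trace_after = hessian_trace(theta_scaled, x, y)
print(forward(theta, x), forward(theta_scaled, x))
print(trace_before, trace_after)
```

The two parameter settings give the same output, yet the trace grows roughly twentyfold. Any quantity defined on the function alone cannot register that change.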
### 5.4 Correlations with generalisation
Table [2](https://arxiv.org/html/2605.05209#S5.T2) presents Spearman rank correlations between each measure and test accuracy across all 50 networks at each data scale.
Table 2: Spearman correlations ($\rho$) with test accuracy. L1 and L2 are layer-1 and layer-2 activation pattern counts on unseen data. EA is ensemble agreement. Bold indicates $|\rho|>0.7$. The unseen set comprises leftover training data plus the test set at 500–12,000, leftover training data only at 24,000, and the test set only at 60,000.

Hessian trace and L2 activation pattern count both predict generalisation at $|\rho|\approx 0.75$ across 500–12,000. At 24,000, the Hessian collapses ($\rho=+0.01$) while L2 remains significant ($\rho=+0.39$, $p=0.005$). L1 saturates. EA is confounded by error count.
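For readers who want to reproduce the statistic, Spearman's $\rho$ is the Pearson correlation of the ranks. The sketch below is a plain standard-library version with average ranks for ties; the measure values and accuracies are invented placeholders, not data from Table 2.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Invented placeholder values: one measure and one accuracy per network.
measure = [3, 1, 4, 1, 5]
accuracy = [0.90, 0.85, 0.93, 0.86, 0.97]
print(spearman(measure, accuracy))
```

Because only ranks enter, the statistic is unchanged by any monotone rescaling of either quantity, which suits comparisons between measures on very different scales such as Hessian trace and pattern counts.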
The key comparison is between the Hessian and L2 columns. Both predict. But the Hessian is *not* reparameterisation-invariant. L2 is exactly invariant. Two quantities predict equally well. One is an artefact of the parameterisation. The other is a property of the function.
### 5.5 Connecting L2 diversity to weakness
L2 activation pattern diversity is not, on its face, a weakness measure. More patterns means more commitments at layer 2, not fewer. The connection to weakness runs through the training data and the formal construction in Appendix [H](https://arxiv.org/html/2605.05209#A8).
Freeze layers 1 and 2. This fixes the L2 partition of input space. Within each L2 region, $h(x)$ is affine in $x$, so the output $W_{3}h(x)+b_{3}$ is also affine in $x$. Different inputs in the same region can in general receive different classifications. The region-class vocabulary in Appendix [H](https://arxiv.org/html/2605.05209#A8) approximates this by treating each region as assigning a single class. This approximation is exact when each region contains at most one data point. In the cross-regime experiments at smaller scales, L2 regions outnumber training points but not unseen points. At 60,000, L2 regions (approximately 9,500) are far fewer than training points. The approximation is therefore inexact in all experiments reported here. Define a *region-class vocabulary* (Definition [10](https://arxiv.org/html/2605.05209#Thmdefinition10)) where programs are of the form "region $r$ is classified as class $c$." The embodied language is the set of partial functions from regions to classes (Proposition [11](https://arxiv.org/html/2605.05209#Thmtheorem11)).
An L2 region is *free* if it contains unseen data but no training data. In these regions, the network has committed to a classification, but the child task provides no constraint on what that classification should be. Theorem [12](https://arxiv.org/html/2605.05209#Thmtheorem12) proves that the weakness of the training-label policy is $w(\pi)=11^{k}$ where $k$ is the number of free regions. Under the region-class vocabulary (an approximation), every partial classification of free regions extends the training-label policy without contradiction. In the actual network, the shared weight matrix $W_{3}$ couples the classifications across regions, so the choices are not independent. See Section [H.6](https://arxiv.org/html/2605.05209#A8.SS6) for details.
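Counting free regions needs only the binary activation pattern of each point. The sketch below assumes patterns are already available as tuples of 0s and 1s, and the example patterns are invented. The base 11 reflects my reading of the partial-function language: each free region is either left unassigned or given one of the 10 classes.

```python
def count_free_regions(train_patterns, unseen_patterns):
    """Count activation regions that contain unseen points but no training points."""
    occupied_by_training = set(train_patterns)
    return len(set(unseen_patterns) - occupied_by_training)

def region_class_weakness(num_free_regions, num_classes=10):
    """Each free region is either unassigned or given one of num_classes labels."""
    return (num_classes + 1) ** num_free_regions

# Invented layer-2 activation patterns, one tuple per data point.
train_patterns = [(1, 0, 1), (1, 1, 0), (1, 0, 1)]
unseen_patterns = [(1, 0, 1), (0, 1, 1), (0, 0, 1), (0, 1, 1)]

k_free = count_free_regions(train_patterns, unseen_patterns)
weakness = region_class_weakness(k_free)
print(k_free, weakness)
```

Since $11^{k}$ is monotone in $k$, ranking networks by weakness under this vocabulary is the same as ranking them by the free-region count.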
At 6,000 training points, the small-batch networks have $\sim$54,000 L2 regions and the large-batch networks have $\sim$63,000. With 6,000 points spread across this many regions, the vast majority of regions contain zero training points. The large-batch networks have $\sim$9,000 more L2 regions. Under the region-class approximation, more L2 regions means more free regions and therefore higher weakness (Corollary [13](https://arxiv.org/html/2605.05209#Thmtheorem13)). The cross-regime L2 diversity correlation ($\rho=0.73$–$0.78$, Table [2](https://arxiv.org/html/2605.05209#S5.T2)) is consistent with this prediction, though L2 pattern count and free-region count are not identical quantities.
There is a caveat (Section [H.6](https://arxiv.org/html/2605.05209#A8.SS6)). The region-class vocabulary treats each free region independently. In reality, the shared weight matrix $W_{3}$ creates correlations between regions. The vocabulary overstates the freedom of the network. Whether the ranking by free regions is preserved under the shared-weight constraint is an open question that the experiments test empirically.
### 5\.6Model selection by weakness
The cross\-regime results in Tables[1](https://arxiv.org/html/2605.05209#S5.T1)–[2](https://arxiv.org/html/2605.05209#S5.T2)compare networks from different training regimes\. A stronger test is model selection*within*a single regime\. I train 300 networks with the same architecture \(784→\\to64→\\to16→\\to10\), the same 250 training points, the same batch size, the same learning rate, and different random seeds\. All memorise the training data\. I measure six quantities per network\. Three are not reparameterisation\-invariant \(Hessian trace, weight L2 norm, weight L1 norm\)\. Three are invariant \(L2 unseen patterns, free parameters \(defined in Table[3](https://arxiv.org/html/2605.05209#S5.T3)caption\), unseen\-only regions=kfree=k\_\{\\mathrm\{free\}\}\)\.
Table 3: Model selection within a single regime ($n=300$ networks, same config, different seeds). Spearman $\rho$ with test accuracy. Bold indicates statistically significant. The two significant predictors are both reparameterisation-invariant. No sharpness or simplicity measure predicts. $^{*}$Free parameters $=\sum_{r}\max(0,\,D_{2}\cdot 10+10-n_{r})$, where $n_{r}$ is the number of training points in region $r$.

Unseen-only regions ($k_{\mathrm{free}}$) is statistically significant ($\rho=+0.117$, $p=0.043$). This is the quantity for which Theorem [12](https://arxiv.org/html/2605.05209#Thmtheorem12) proves the weakness is $11^{k}$ under the region-class vocabulary (an approximation to the actual network; see Section [H.5](https://arxiv.org/html/2605.05209#A8.SS5)). The sufficiency theorem predicts that weaker policies generalise better within a fixed vocabulary. The data confirms this prediction empirically, though comparing across networks involves different vocabularies (see Appendix [H](https://arxiv.org/html/2605.05209#A8) for discussion). L2 unseen patterns is also significant ($p=0.050$). The effect is small ($\rho\approx 0.12$), but the direction is correct and backed by a theorem. Neither sharpness nor simplicity predicts generalisation here.
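The free-parameter count from the Table 3 caption can be worked through on a toy input. The sketch below is illustrative only: the function name and argument layout are hypothetical, and which regions enter the sum (for example, whether regions with $n_{r}=0$ are included) is left to the caller, since the caption does not spell it out. The term $D_{2}\cdot 10+10$ equals the number of entries in $(W_{3},b_{3})$ for a $D_{2}$-dimensional hidden layer and 10 classes.

```python
def free_parameters(points_per_region, d2, num_classes=10):
    """Sum over regions of max(0, d2*num_classes + num_classes - n_r).

    points_per_region: iterable of n_r, the number of training points in
    each region the caller chooses to include (an assumption, see lead-in).
    """
    layer3_params = d2 * num_classes + num_classes  # entries of W3 plus b3
    return sum(max(0, layer3_params - n_r) for n_r in points_per_region)


# With d2 = 16 the budget per region is 16*10 + 10 = 170.
# Regions holding 1, 3 and 200 training points contribute 169, 167 and 0.
example = free_parameters([1, 3, 200], d2=16)
assert example == 336
```

A region holding more training points than there are layer-3 parameters contributes nothing, which is what the `max(0, ...)` clamp encodes.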
### 5.7 Pair proxy
A separate experiment tests the pair proxy directly. I train 100 networks (784$\to$64$\to$8$\to$10, 250 training points, same batch size, different seeds). For each network, I freeze layers 1–2 and solve a linear feasibility problem for each of 100 unseen inputs $\times$ 10 classes. The linear program asks whether there exists a setting of layer-3 weights that correctly classifies all training data and also assigns the unseen input to the candidate class. The count of feasible pairs is the pair proxy.
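One way to write the feasibility question for a single pair (unseen input $u$, candidate class $c^{*}$) is sketched below. This is an illustrative formulation rather than the exact program used in the experiments. Here $h(\cdot)$ denotes the frozen layer-2 features, $w_{c}$ and $b_{c}$ denote the row of $W_{3}$ and the entry of $b_{3}$ for class $c$, and the margin of 1 is an assumption that turns the strict argmax inequalities into closed linear constraints. Because the constraints are homogeneous in $(W_{3},b_{3})$ and finite in number, any strictly feasible solution can be rescaled to meet this margin.

```latex
\text{find } (W_{3}, b_{3}) \text{ such that }
\begin{aligned}
(w_{y_{i}} - w_{c})^{\top} h(x_{i}) + (b_{y_{i}} - b_{c}) &\ge 1
  && \text{for all } i \in \{1,\ldots,n\},\ c \ne y_{i}, \\
(w_{c^{*}} - w_{c})^{\top} h(u) + (b_{c^{*}} - b_{c}) &\ge 1
  && \text{for all } c \ne c^{*}.
\end{aligned}
```

Under this reading, the pair proxy for one network is the number of pairs $(u,c^{*})$ for which the system is feasible, so with 100 unseen inputs and 10 classes it lies between 0 and 1,000.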
Table 4: Pair proxy experiment ($n=100$ networks, 784$\to$64$\to$8$\to$10, 250 training points). Weakness outperforms sharpness. Simplicity does not predict.

The pair proxy is highly significant ($\rho=+0.374$, $p=0.00012$). It outperforms sharpness ($\rho=-0.226$, $p=0.024$). Simplicity does not predict at all ($p=0.848$) for MNIST, and is inconsistent across datasets when it does work. The pair proxy is reparameterisation-invariant while the Hessian trace is not. This is a head-to-head comparison on the same 100 networks. Same architecture, same data, same training, different random seeds. As you might expect, the answer is weakness.
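Table 3 reports Spearman $\rho$. Assuming the $\rho$ values in this section are computed the same way, the statistic can be sketched without dependencies as the Pearson correlation of rank vectors, with tied values sharing their mean rank. The function names are hypothetical, and the sketch does not compute $p$-values.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5
```

Because only ranks enter the computation, any strictly increasing transformation of a predictor (for example taking the logarithm of a Hessian trace) leaves $\rho$ unchanged.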
### 5.8 Replication on Fashion-MNIST
I replicate the pair proxy experiment on Fashion-MNIST (Xiao et al., [2017](https://arxiv.org/html/2605.05209#bib.bib80)) with the same architecture (784$\to$64$\to$8$\to$10), same number of training points (250), same number of networks (100), and same number of unseen probe points (100). The *only* difference is the dataset. Fashion-MNIST uses 28$\times$28 grayscale images of clothing items instead of handwritten digits. It is harder than MNIST (mean test accuracy 71.8% vs $\sim$80% on MNIST with this architecture and training set size).
Table 5: Fashion-MNIST pair proxy experiment ($n=100$ networks, 784$\to$64$\to$8$\to$10, 250 training points). Same setup as Table [4](https://arxiv.org/html/2605.05209#S5.T4).

The pair proxy predicts generalisation at $\rho=+0.384$ ($p=8.15\times 10^{-5}$), closely matching the MNIST result ($\rho=+0.374$, $p=0.00012$). Same architecture, same optimisation, same evaluation protocol, different data, same answer.
Weight norm is also significant on Fashion-MNIST ($\rho=-0.226$, $p=0.024$), but the effect is modest and dataset-dependent. On MNIST it does not predict at all ($p=0.848$). By contrast, the pair proxy is consistent across both datasets and is reparameterisation-invariant by construction. This is consistent with the theoretical prediction that simplicity is not the driver (Bennett, [2024](https://arxiv.org/html/2605.05209#bib.bib69)).
## 6 Where $\mu$-Weakness Sits
Generalisation theory has produced many complexity measures\. Most of them answer the wrong question, or the right question at the wrong level\.
VC dimension (Vapnik and Chervonenkis, [1971](https://arxiv.org/html/2605.05209#bib.bib12)) measures the expressiveness of a hypothesis *class*. Weakness measures the committedness of a single *hypothesis*. VC dimension cannot distinguish two hypotheses within the same class.
Algorithmic stability (Bousquet and Elisseeff, [2002](https://arxiv.org/html/2605.05209#bib.bib13)) measures sensitivity of the learning *algorithm* to perturbation of a single training point. Weakness is a property of the output, not the algorithm that produced it.
PAC-Bayes bounds (McAllester, [1999](https://arxiv.org/html/2605.05209#bib.bib7)) operate at the right level. Theorem [7](https://arxiv.org/html/2605.05209#Thmtheorem7) shows the KL term is a monotone function of weakness among correct policies. The difference is that weakness is defined in terms of the embodied language, which makes reparameterisation invariance obvious.
Minimum description length (Rissanen, [1978](https://arxiv.org/html/2605.05209#bib.bib443)) is a property of form, not function. Description length is encoding-dependent and can be gamed. Weakness cannot. If a persisting object happens to be simple, that is incidental unless it is also weak.
Sharpness (Hochreiter and Schmidhuber, [1997](https://arxiv.org/html/2605.05209#bib.bib5); Keskar et al., [2017](https://arxiv.org/html/2605.05209#bib.bib309)) is a geometric property of weight space. It is not reparameterisation-invariant. The vanishing advantage (Section [5.2](https://arxiv.org/html/2605.05209#S5.SS2)) shows it is not even a consistent correlate of generalisation.
AIXI (Hutter, [2005](https://arxiv.org/html/2605.05209#bib.bib19)) selects policies by minimising description length. Weakness maximisation is Bayes optimal under exchangeable task growth (Bennett, [2024](https://arxiv.org/html/2605.05209#bib.bib69), [2025a](https://arxiv.org/html/2605.05209#bib.bib75)). AIXI’s Occam rule is not.
Maximum entropy. Under a uniform prior over parent tasks, weakness maximisation is optimal (Proposition [8](https://arxiv.org/html/2605.05209#Thmtheorem8)). A uniform prior over tasks is a maximum-entropy prior over which parent will appear. So weakness maximisation is the optimal response to a maximum-entropy assumption about tasks. But tasks and outputs are different things. A uniform prior over tasks implies a uniform distribution over outputs only if every combination of output constraints is satisfiable, which requires $L_{\mathfrak{v}}=2^{\mathfrak{v}}$. This is the degenerate case where weakness collapses to a cardinality proxy and maximum entropy over outputs coincides with weakness maximisation (Bennett, [2024](https://arxiv.org/html/2605.05209#bib.bib69)). In the non-degenerate cases ($L_{\mathfrak{v}}\neq 2^{\mathfrak{v}}$), the satisfiability constraint filters the uniform prior over tasks into a non-uniform distribution over achievable outputs. Hence maximum entropy and weakness differ whenever the vocabulary has non-trivial satisfiability structure, which is almost always.
## 7 Discussion
Limitations. The pair proxy outperforms sharpness but explains a modest fraction of within-regime variance ($\rho\approx 0.37$ on both datasets). The region-class vocabulary is an approximation. The feature-classifier vocabulary (Appendix [I](https://arxiv.org/html/2605.05209#A9)) solves the shared-weight problem but the pair proxy sums single-point marginals, not the full extension. Isolated networks did not reach exactly 100% training accuracy (see Table [1](https://arxiv.org/html/2605.05209#S5.T1) caption). The experiments use MLPs on MNIST and Fashion-MNIST. The continuous weakness theory requires a reference measure $\mu$ not present in the finite case.
Future work. Computing the full extension $|\mathrm{Ext}\!\left(\pi\right)|$ in the feature-classifier vocabulary (not just single-point marginals) and extending the sufficiency theorem to account for extension quality (not just count) are the main open problems. This work partially addresses the operationalisation gap identified in Bennett ([2026](https://arxiv.org/html/2605.05209#bib.bib83)), but a full characterisation of $|\mathrm{Ext}\!\left(\pi\right)|$ for continuous function classes remains open. Extending the pair proxy to CNNs and transformers is the empirical priority.
## References
- M. T. Bennett (2023) The optimal choice of hypothesis is the weakest, not the shortest. In 16th International Conference on Artificial General Intelligence, Lecture Notes in Computer Science, pp. 42–51. External Links: [Document](https://dx.doi.org/10.1007/978-3-031-33469-6%5F5), [Link](https://doi.org/10.1007/978-3-031-33469-6_5)
- M. T. Bennett (2024) Is complexity an illusion? In 17th International Conference on Artificial General Intelligence, Lecture Notes in Computer Science. External Links: [Document](https://dx.doi.org/10.1007/978-3-031-65572-2%5F2), [Link](https://doi.org/10.1007/978-3-031-65572-2_2)
- M. T. Bennett (2025a) How to build conscious machines. Ph.D. Thesis, The Australian National University. External Links: [Document](https://dx.doi.org/10.25911/TA58-P428), [Link](https://hdl.handle.net/1885/733782452)
- M. T. Bennett (2025b) Technical appendices. Note: Archived release on Zenodo. Source repository: <https://github.com/ViscousLemming/Technical-Appendices>. External Links: [Document](https://dx.doi.org/10.5281/zenodo.7641741), [Link](https://doi.org/10.5281/zenodo.7641741)
- M. T. Bennett (2026) Regret is weighted forgetting.
- O. Bousquet and A. Elisseeff (2002) Stability and generalization. Journal of Machine Learning Research 2, pp. 499–526. External Links: [Document](https://dx.doi.org/10.1162/153244302760200704)
- L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio (2017) Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, pp. 1019–1028. External Links: [Link](https://proceedings.mlr.press/v70/dinh17b.html)
- P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur (2021) Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=6Tm1mposlrM)
- S. Hochreiter and J. Schmidhuber (1997) Flat minima. Neural Computation 9(1), pp. 1–42. External Links: [Document](https://dx.doi.org/10.1162/neco.1997.9.1.1), [Link](https://doi.org/10.1162/neco.1997.9.1.1)
- M. Hutter (2005) Universal artificial intelligence: sequential decisions based on algorithmic probability. Springer, Berlin. External Links: [Document](https://dx.doi.org/10.1007/b138233)
- N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2017) On large-batch training for deep learning: generalization gap and sharp minima. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=H1oyRlYgg)
- J. Kwon, J. Kim, H. Park, and I. K. Choi (2021) ASAM: adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 139, pp. 5905–5914. External Links: [Link](https://proceedings.mlr.press/v139/kwon21b.html)
- D. A. McAllester (1999) PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pp. 164–170. External Links: [Document](https://dx.doi.org/10.1145/307400.307435), [Link](https://doi.org/10.1145/307400.307435)
- H. Petzka, M. Kamp, L. Adilova, C. Sminchisescu, and M. Boley (2021) Relative flatness and generalization. In Advances in Neural Information Processing Systems, Vol. 34, pp. 18420–18432.
- J. Rissanen (1978) Modeling by shortest data description. Automatica 14(5), pp. 465–471. External Links: [Document](https://dx.doi.org/10.1016/0005-1098%2878%2990005-5), [Link](https://doi.org/10.1016/0005-1098(78)90005-5)
- Y. Tsuzuku, I. Sato, and M. Sugiyama (2020) Normalized flat minima: exploring scale invariant definition of flat minima for neural networks using PAC-Bayesian analysis. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 119, pp. 9636–9647.
- V. N. Vapnik and A. Ya. Chervonenkis (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications 16(2), pp. 264–280. External Links: [Document](https://dx.doi.org/10.1137/1116025)
- H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. External Links: 1708.07747, [Link](https://arxiv.org/abs/1708.07747)
## Appendix A Technical Assumption
Throughout the proofs below, fix a measurable vocabulary $(\mathfrak{v},\mathcal{A}_{L},\mu)$ with $0<\mu(U)<\infty$. Assume $(U,\mathcal{A}_{L}|_{U},\mu|_{U})$ is an *atomless standard measure space*, meaning it is isomorphic (as a measure space) to a Lebesgue interval. This guarantees the existence of measure-preserving bijections between measurable subsets of equal measure, by Carathéodory’s isomorphism theorem. The atomless requirement excludes pathological cases where atoms of equal measure but different demand probability would violate sufficiency. This assumption is satisfied by Lebesgue measure on $\mathbb{R}^{d}$, which covers all cases of practical interest for neural networks. The finite case, where counting measure has atoms, is handled separately by the finite proofs in Appendix [B](https://arxiv.org/html/2605.05209#A2).
## Appendix B Finite Sufficiency and Necessity (Prior Results)
These results are from Bennett \[[2023](https://arxiv.org/html/2605.05209#bib.bib65)\] and Bennett \[[2025a](https://arxiv.org/html/2605.05209#bib.bib75)\]. I include the proofs for self-containment.
###### Proposition 8 (Finite sufficiency).
Let $\alpha\sqsubset\omega$ with the parent’s additional demands drawn from the maximally uninformative extension model (uniform over $S\subseteq U$ where $U=L_{\mathfrak{v}}\setminus\mathrm{Ext}\!\left(I_{\alpha}\right)$), and let $\pi\in\Pi_{\alpha}$. Then $P(\pi\in\Pi_{\omega})=2^{|B_{\pi}|}/2^{|U|}$ where $B_{\pi}=\mathrm{Ext}\!\left(\pi\right)\cap U$.
###### Proof.
The parent adds demands by selecting $S\subseteq U$ uniformly from $2^{U}$. The policy $\pi$ survives if and only if $S\subseteq B_{\pi}$. There are $2^{|U|}$ equally likely subsets of $U$ and $2^{|B_{\pi}|}$ of them lie inside $B_{\pi}$. For correct policies, $|B_{\pi}|=|\mathrm{Ext}\!\left(\pi\right)|-|O_{\alpha}|$ and $|O_{\alpha}|$ is fixed, so maximising $|\mathrm{Ext}\!\left(\pi\right)|$ is equivalent to maximising $|B_{\pi}|$ and therefore maximises the generalisation probability. ∎
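The counting argument in the proof can be checked by brute force on a small universe. The sketch below is purely illustrative: the function name and the toy sets are hypothetical, and the elements stand in for members of $U$ without modelling the vocabulary itself.

```python
from fractions import Fraction
from itertools import combinations


def survival_probability(universe, buffer):
    """Brute-force P(S is a subset of buffer), S uniform over subsets of universe."""
    universe, buffer = set(universe), set(buffer)
    if not buffer <= universe:
        raise ValueError("buffer must be a subset of universe")
    items = sorted(universe)
    total = 0
    surviving = 0
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            total += 1
            if set(subset) <= buffer:
                surviving += 1
    return Fraction(surviving, total)


U = {"a", "b", "c", "d", "e"}
B = {"a", "b", "c"}
# Matches the closed form 2^|B| / 2^|U| = 8/32 = 1/4.
assert survival_probability(U, B) == Fraction(2 ** len(B), 2 ** len(U))
```

Adding one element to the buffer doubles the number of surviving subsets, which is why the probability is strictly increasing in $|B_{\pi}|$.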
###### Proposition 9 (Finite necessity).
If $|\mathrm{Ext}\!\left(\pi_{1}\right)|>|\mathrm{Ext}\!\left(\pi_{2}\right)|$, then $P(\pi_{1}\in\Pi_{\omega})>P(\pi_{2}\in\Pi_{\omega})$.
###### Proof.
Immediate from Proposition [8](https://arxiv.org/html/2605.05209#Thmtheorem8). ∎
###### Theorem 10 (Finite exchangeable generalisation).
Under any exchangeable prior over parent tasks, $|\mathrm{Ext}\!\left(\pi_{1}\right)|\geq|\mathrm{Ext}\!\left(\pi_{2}\right)|$ implies $P(\pi_{1}\in\Pi_{\omega})\geq P(\pi_{2}\in\Pi_{\omega})$. Under nondegeneracy, the inequality is strict when the extension sizes differ.
###### Proof.
Under exchangeability, the prior treats all completions symmetrically. A larger extension captures at least as many completions under any exchangeable measure. Nondegeneracy guarantees that the extra completions in the larger extension have positive probability. ∎
## Appendix C Proof of Theorem [2](https://arxiv.org/html/2605.05209#Thmtheorem2) (Sufficiency Under $\mu$-Exchangeable Demands)
###### Proof.
Let $\mu(B_{\pi_{1}})\leq\mu(B_{\pi_{2}})$.
Step 1. Choose a measurable $B_{2}^{\prime}\subseteq B_{\pi_{2}}$ with $\mu(B_{2}^{\prime})=\mu(B_{\pi_{1}})$ (exists by the atomless standard measure space assumption, Appendix [A](https://arxiv.org/html/2605.05209#A1)). Build a measure-preserving bijection $\sigma_{1}\colon B_{\pi_{1}}\to B_{2}^{\prime}$ and a measure-preserving bijection $\sigma_{2}\colon U\setminus B_{\pi_{1}}\to U\setminus B_{2}^{\prime}$ (both pairs have equal measure). Define $\sigma=\sigma_{1}$ on $B_{\pi_{1}}$ and $\sigma=\sigma_{2}$ on $U\setminus B_{\pi_{1}}$. Then $\sigma\colon U\to U$ is a measure-preserving bijection with $\sigma(B_{\pi_{1}})=B_{2}^{\prime}\subseteq B_{\pi_{2}}$.
Step 2. If $S\subseteq B_{\pi_{1}}$, then $\sigma(S)\subseteq\sigma(B_{\pi_{1}})=B_{2}^{\prime}\subseteq B_{\pi_{2}}$.
Step 3. By $\mu$-exchangeability, $P(S\subseteq B_{\pi_{1}})=P(S\subseteq\sigma(B_{\pi_{1}}))=P(S\subseteq B_{2}^{\prime})$. Since $B_{2}^{\prime}\subseteq B_{\pi_{2}}$, $P(S\subseteq B_{2}^{\prime})\leq P(S\subseteq B_{\pi_{2}})$.
Step 4. Therefore $w_{P}(\pi_{1})\leq w_{P}(\pi_{2})$. Set $f(t)=w_{P}(\pi)$ for any $\pi$ with $\mu(B_{\pi})=t$. ∎
## Appendix D Proof of Theorem [3](https://arxiv.org/html/2605.05209#Thmtheorem3) (Strict Necessity)
###### Proof.
Step 1. Since $\mu(B_{\pi_{1}})>\mu(B_{\pi_{2}})$, choose $B_{1}^{\prime}\subseteq B_{\pi_{1}}$ with $\mu(B_{1}^{\prime})=\mu(B_{\pi_{2}})$. Let $D=B_{\pi_{1}}\setminus B_{1}^{\prime}$, so $\mu(D)>0$. By the two-piece construction in Theorem [2](https://arxiv.org/html/2605.05209#Thmtheorem2), any two buffers of equal measure give equal generalisation probability. So $P(S\subseteq B_{\pi_{2}})=P(S\subseteq B_{1}^{\prime})$.
Step 2. Since $B_{1}^{\prime}\subsetneq B_{\pi_{1}}$,

$$P(S\subseteq B_{\pi_{1}})=P(S\subseteq B_{1}^{\prime})+P(S\subseteq B_{\pi_{1}},\,S\cap D\neq\emptyset).$$

Step 3. The second term is at least $P(\emptyset\neq S\subseteq D)$. By nondegeneracy and $\mu(D)>0$, this is strictly positive. Therefore $P(S\subseteq B_{\pi_{1}})>P(S\subseteq B_{1}^{\prime})=P(S\subseteq B_{\pi_{2}})$, giving $w_{P}(\pi_{1})>w_{P}(\pi_{2})$. ∎
## Appendix E Proof of Theorem [4](https://arxiv.org/html/2605.05209#Thmtheorem4) (Unified Optimality)
###### Proof.
Part (1) is Theorem [2](https://arxiv.org/html/2605.05209#Thmtheorem2). Part (2) is Theorem [3](https://arxiv.org/html/2605.05209#Thmtheorem3). For Part (3), suppose $\pi_{1},\pi_{2}\in\Pi_{\alpha}$ both maximise $G(\pi,P)$ over $\Pi_{\alpha}$ for every nondegenerate $P\in\mathcal{E}$. If $\mu(B_{\pi_{1}})\neq\mu(B_{\pi_{2}})$, say $\mu(B_{\pi_{1}})>\mu(B_{\pi_{2}})$, then Part (2) gives $G(\pi_{1},P)>G(\pi_{2},P)$ for every nondegenerate $P$, so $\pi_{2}$ does not maximise $G(\cdot,P)$. Contradiction. So $\mu(B_{\pi_{1}})=\mu(B_{\pi_{2}})$. ∎
## Appendix F Proof of Theorem [5](https://arxiv.org/html/2605.05209#Thmtheorem5) (Minimax Optimality)
###### Proof.
By Theorem [4](https://arxiv.org/html/2605.05209#Thmtheorem4)(1), $G(\pi^{*},P)\geq G(\pi,P)$ for all $\pi$ and all $P\in\mathcal{E}$. So $\max_{\pi}G(\pi,P)=G(\pi^{*},P)$ for every $P$, giving

$$\inf_{P\in\mathcal{E}}\max_{\pi}G(\pi,P)=\inf_{P\in\mathcal{E}}G(\pi^{*},P).$$

Sufficiency (Theorem [4](https://arxiv.org/html/2605.05209#Thmtheorem4)(1)) gives $G(\pi^{*},P)\geq G(\pi,P)$ for all $\pi$ and all $P\in\mathcal{E}$, so

$$\inf_{P\in\mathcal{E}}G(\pi^{*},P)\geq\inf_{P\in\mathcal{E}}G(\pi,P)\quad\text{for all }\pi.$$

Therefore $\pi^{*}$ achieves $\max_{\pi}\inf_{P}G(\pi,P)$. The minimax equality follows from the chain

$$\max_{\pi}\inf_{P}G(\pi,P)\geq\inf_{P}G(\pi^{*},P)=\inf_{P}\max_{\pi}G(\pi,P)\geq\max_{\pi}\inf_{P}G(\pi,P).$$

∎
## Appendix G Reparameterisation Results
Table [6](https://arxiv.org/html/2605.05209#A7.T6) shows detailed results for one small-batch network at 6,000 training points. In every case, test accuracy, L1, L2, and EA are exactly invariant under both $T_{\beta}$ and $T_{\gamma}$. Only the Hessian trace changes.
Table 6: Reparameterisation invariance test. Net 0 (small-batch regime, 6,000 training points). Hessian trace changes by up to 99$\times$. All other quantities are exactly invariant.

Figure 1: Reparameterisation invariance under $T_{\beta}$ at 6,000 training points. (a) Hessian trace changes by two orders of magnitude. (b) Test accuracy is exactly invariant. (c) Layer-2 activation pattern count is exactly invariant. (d) Ensemble agreement is exactly invariant.

500 training points. Net 0 (small-batch, test=0.8536). Hessian at $\beta=1$: 48.9. Hessian at $\beta=20$: 4,751 (97$\times$). Test accuracy: 0.8536 at all $\beta$. Net 25 (large-batch, test=0.8619). Hessian at $\beta=1$: 22.4. Hessian at $\beta=20$: 1,787 (80$\times$). Test accuracy: 0.8619 at all $\beta$.
2,000 training points. Net 0 (small-batch, test=0.9050). Hessian at $\beta=1$: 61.9. Hessian at $\beta=20$: 5,789 (93$\times$). Test accuracy: 0.9050 at all $\beta$. Net 25 (large-batch, test=0.9160). Hessian at $\beta=1$: 24.5. Hessian at $\beta=20$: 1,671 (68$\times$). Test accuracy: 0.9160 at all $\beta$.
6,000 training points. See Table [6](https://arxiv.org/html/2605.05209#A7.T6) for Net 0. Net 25 (large-batch, test=0.9476). Hessian at $\beta=1$: 23.5. Hessian at $\beta=20$: 1,328 (56$\times$). Test accuracy: 0.9476 at all $\beta$. Net 30 (large-batch, test=0.9498). Hessian at $\beta=1$: 22.6. Hessian at $\beta=20$: 1,318 (58$\times$). Test accuracy: 0.9498 at all $\beta$.
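The mechanism behind these invariances can be illustrated on a toy network. The sketch below assumes a layer-wise rescaling of the kind discussed by Dinh et al. (2017): multiply the first layer by $\beta>0$ and divide the next layer's weights by $\beta$. This is an assumption made for illustration. The definitions of $T_{\beta}$ and $T_{\gamma}$ are given elsewhere in the paper and may differ, and the tiny two-layer network and all names here are hypothetical.

```python
import math


def relu(v):
    return [max(0.0, z) for z in v]


def affine(W, b, x):
    """Compute W @ x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def forward(params, x):
    """Return (output, hidden activation pattern) of a 2-layer ReLU net."""
    W1, b1, W2, b2 = params
    pre1 = affine(W1, b1, x)
    pattern = tuple(z > 0 for z in pre1)
    return affine(W2, b2, relu(pre1)), pattern


def rescale(params, beta):
    """Assumed rescaling: layer 1 times beta, layer-2 weights divided by beta."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    W1, b1, W2, b2 = params
    return (
        [[beta * w for w in row] for row in W1],
        [beta * b for b in b1],
        [[w / beta for w in row] for row in W2],
        list(b2),
    )


params = ([[1.0, -2.0], [0.5, 1.5]], [0.25, -1.0], [[2.0, -1.0]], [0.5])
scaled = rescale(params, beta=20.0)

for x in ([1.0, 0.5], [-1.0, 0.0], [0.0, 2.0]):
    out, pattern = forward(params, x)
    out_s, pattern_s = forward(scaled, x)
    assert pattern == pattern_s  # same activation region
    assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(out, out_s))  # same output
```

The weights themselves change by a factor of 20 in one layer and 1/20 in the other, so any quantity computed from derivatives with respect to the weights is free to change even though every prediction and every activation pattern is preserved.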
Figure 2: Scatter plots of test accuracy against (a) Hessian trace and (b) L2 activation pattern count at $n=500$. Both correlate with generalisation at similar strength. Only L2 is reparameterisation-invariant.
## Appendix H Weakness for Frozen-Partition ReLU Networks
This appendix constructs an approximate vocabulary for a ReLU network with frozen early layers and derives the weakness of the training-label policy under that vocabulary. The approximation is exact when each L2 region contains at most one data point and inexact otherwise (see Section [H.5](https://arxiv.org/html/2605.05209#A8.SS5) for details).
### H.1 Setup
Fix a 3-layer ReLU MLP $f_{\theta}(x)=W_{3}\sigma(W_{2}\sigma(W_{1}x+b_{1})+b_{2})+b_{3}$ where $\sigma$ is the ReLU activation. Partition the parameters as $\theta=(\theta_{1:2},\theta_{3})$ where $\theta_{1:2}=(W_{1},b_{1},W_{2},b_{2})$ are the first two layers and $\theta_{3}=(W_{3},b_{3})$ is the third.
Freeze $\theta_{1:2}$. This fixes the *L2 partition*. The function $h(x)=\sigma(W_{2}\sigma(W_{1}x+b_{1})+b_{2})$ maps each input $x$ to a $D_{2}$-dimensional hidden representation. The binary activation pattern $\mathbf{1}[W_{2}\sigma(W_{1}x+b_{1})+b_{2}>0]$ determines a *region* of input space within which $h$ is affine in $x$. Within a single region, the output $W_{3}h(x)+b_{3}$ is also affine in $x$, so different inputs in the same region can in general receive different classifications. The region-class vocabulary defined below is therefore an *approximation* that treats each region as assigning a single class. This is exact when there is at most one data point per region and approximate otherwise. In all experiments reported here, multiple data points share L2 regions, so the approximation is inexact.
Let $\mathcal{R}=\{r_{1},\ldots,r_{R}\}$ be the set of L2 regions that contain at least one data point (training or unseen).
### H.2 The vocabulary
###### Definition 10 (Region-class vocabulary).
Define the environment $\Phi=\{0,\ldots,9\}^{\mathcal{R}}$, the set of all total classification functions from regions to classes. For each region $r\in\mathcal{R}$ and class $c\in\{0,\ldots,9\}$, define the program

$$p_{r,c}=\{g\in\Phi:g(r)=c\}.$$

The vocabulary is $\mathfrak{v}_{\mathrm{net}}=\{p_{r,c}:r\in\mathcal{R},\,c\in\{0,\ldots,9\}\}$.
Each program says “region $r$ is classified as $c$.” For a given region, the 10 programs $p_{r,0},\ldots,p_{r,9}$ are mutually exclusive. No region can be assigned two classes. This is the satisfiability constraint that makes $L_{\mathfrak{v}_{\mathrm{net}}}\neq 2^{\mathfrak{v}_{\mathrm{net}}}$.
###### Proposition 11 (Structure of the embodied language).
The embodied language $L_{\mathfrak{v}_{\mathrm{net}}}$ is the set of all partial functions from $\mathcal{R}$ to $\{0,\ldots,9\}$, encoded as consistent subsets of $\mathfrak{v}_{\mathrm{net}}$. Its cardinality is $|L_{\mathfrak{v}_{\mathrm{net}}}|=11^{R}$ (10 classes plus “unspecified” for each of $R$ regions).
###### Proof.
A subset $l\subseteq\mathfrak{v}_{\mathrm{net}}$ is consistent (i.e., $\bigcap_{p\in l}p\neq\emptyset$) if and only if no region is assigned two distinct classes. This is exactly the condition for $l$ to encode a partial function from $\mathcal{R}$ to $\{0,\ldots,9\}$. Each of the $R$ regions is either assigned to one of 10 classes or left unspecified, giving $11^{R}$ elements. This count includes the empty set $\emptyset\in L_{\mathfrak{v}_{\mathrm{net}}}$ (no region classified). ∎
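The count in Proposition 11 can be sanity-checked by enumeration on a scaled-down vocabulary. The sketch below is illustrative: it represents each program $p_{r,c}$ by the hypothetical pair `(r, c)` and uses fewer classes than the paper's 10 to keep the enumeration small, in which case the expected count is $(C+1)^{R}$ for $C$ classes.

```python
from itertools import combinations


def consistent_subsets(num_regions, num_classes):
    """Count subsets of {(r, c)} that assign at most one class per region."""
    vocabulary = [(r, c) for r in range(num_regions) for c in range(num_classes)]
    count = 0
    for size in range(len(vocabulary) + 1):
        for subset in combinations(vocabulary, size):
            regions = [r for r, _ in subset]
            # Distinct pairs sharing a region must differ in class, so a
            # repeated region means two classes were assigned to it.
            if len(regions) == len(set(regions)):
                count += 1
    return count


# 2 regions and 3 classes: 2^6 = 64 subsets in total, (3 + 1)^2 = 16 consistent.
assert consistent_subsets(2, 3) == (3 + 1) ** 2
```

The gap between 64 and 16 in this toy case is the satisfiability structure that separates $L_{\mathfrak{v}_{\mathrm{net}}}$ from the full power set.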
### H.3 Policy and weakness
Let $X_{\mathrm{train}}=\{x_{1},\ldots,x_{n}\}$ be the training inputs with labels $y_{1},\ldots,y_{n}$. Let $\mathcal{R}_{\mathrm{train}}\subseteq\mathcal{R}$ be the set of regions containing at least one training point. For each $r\in\mathcal{R}_{\mathrm{train}}$, let $y_{r}$ be the majority training label among points in $r$. When the region contains a single training point, $y_{r}$ is that point’s label. When it contains multiple training points with possibly different labels, the region-class vocabulary is an approximation. In the model-selection experiments (784$\to$64$\to$16$\to$10, 250 training points), many regions contain multiple points, so the approximation is inexact. The pair proxy experiment (Section [5.7](https://arxiv.org/html/2605.05209#S5.SS7)) avoids this issue by using per-input linear feasibility rather than the region-class vocabulary.
###### Definition 11 (Training-label policy).
The training-label policy is $\pi=\{p_{r,y_{r}}:r\in\mathcal{R}_{\mathrm{train}}\}$. It assigns each training region to the label $y_{r}$ and is silent on all other regions. When a region contains training points with conflicting labels, $y_{r}$ is the majority label and the policy is an approximation of the network’s actual behaviour in that region.
###### Definition 12 (Free regions).
Let $\mathcal{R}_{\mathrm{free}}=\mathcal{R}\setminus\mathcal{R}_{\mathrm{train}}$ be the set of L2 regions containing unseen data but no training data. Write $k=|\mathcal{R}_{\mathrm{free}}|$.
These are the regions where the network has committed (it classifies inputs landing there) but the training data provides no constraint on what the classification should be. The network’s behaviour in these regions is determined by the specific setting of $\theta_{3}$, not by the child task.
###### Theorem 12 (Weakness of the training-label policy).
Under the vocabulary $\mathfrak{v}_{\mathrm{net}}$, the weakness of the training-label policy $\pi$ is

$$w(\pi)=|\mathrm{Ext}\!\left(\pi\right)|=11^{k},$$

where $k=|\mathcal{R}_{\mathrm{free}}|$ is the number of free regions.
###### Proof.
The extension $\mathrm{Ext}\!\left(\pi\right)=\{l\in L_{\mathfrak{v}_{\mathrm{net}}}:\pi\subseteq l\}$ consists of all partial classifications that agree with $\pi$ on training regions. For each training region $r\in\mathcal{R}_{\mathrm{train}}$, the class is fixed to $y_{r}$. For each free region $r\in\mathcal{R}_{\mathrm{free}}$, there are 11 independent choices (assign to one of 10 classes, or leave unspecified). So $|\mathrm{Ext}\!\left(\pi\right)|=11^{k}$. ∎
###### Corollary 13 (Weakness ranking by free regions).
Consider two frozen-partition networks that both memorise the same training data. Let $k_{1}$ and $k_{2}$ be their respective numbers of free L2 regions. If $k_{1}>k_{2}$, then $w(\pi_{1})>w(\pi_{2})$.
###### Proof.
$11^{k_{1}}>11^{k_{2}}$ if and only if $k_{1}>k_{2}$. ∎
The quantity $k=|\mathcal{R}_{\mathrm{free}}|$ is the "unseen-only regions" measure from the experiments in Section [5](https://arxiv.org/html/2605.05209#S5). It is reparameterisation-invariant by Theorem [6](https://arxiv.org/html/2605.05209#Thmtheorem6), because the L2 activation patterns do not change under the function-preserving reparameterisations $T_{\beta}$ or $T_{\gamma}$.
### H.4 Connection to generalisation
The sufficiency theorem (Theorem [2](https://arxiv.org/html/2605.05209#Thmtheorem2)) and the finite sufficiency proposition (Proposition [8](https://arxiv.org/html/2605.05209#Thmtheorem8)) predict that policies with higher weakness generalise better. Corollary [13](https://arxiv.org/html/2605.05209#Thmtheorem13) says that networks with more free regions have higher weakness. Combining these yields the prediction that networks with more free regions generalise better.
There is a subtlety. The sufficiency theorem compares policies *within the same vocabulary*. Two networks with different L2 partitions define different region-class vocabularies. In the strictest reading, the theorem does not directly compare policies across different vocabularies.
The experimental test in Section [5](https://arxiv.org/html/2605.05209#S5) bypasses this subtlety. It measures $k_{\mathrm{free}}$ for each network and correlates it with test accuracy across 300 networks. The result ($\rho=+0.117$, $p=0.043$) confirms that the prediction holds empirically even across vocabularies. The pair proxy experiment in Section [5.7](https://arxiv.org/html/2605.05209#S5.SS7) provides a stronger test ($\rho=+0.374$, $p=0.00012$) using linear feasibility rather than region counting.
### H.5 Approximation caveats
The single-class-per-region approximation. The region-class vocabulary assigns one class per region. In reality, $W_{3}h(x)+b_{3}$ is affine in $x$ within a region, so different inputs in the same region can be classified differently. The vocabulary would be exact if each region contained at most one data point. In all experiments reported here, multiple data points share regions, so the approximation is inexact. The pair proxy experiment (Section [5.7](https://arxiv.org/html/2605.05209#S5.SS7)) avoids this issue entirely by checking per-input linear feasibility rather than assigning one class per region.
### H.6 The shared-weight caveat
Theorem [12](https://arxiv.org/html/2605.05209#Thmtheorem12) counts the extension in the region-class vocabulary $\mathfrak{v}_{\mathrm{net}}$, where each free region's classification is an independent choice. But the actual network shares a single weight matrix $W_{3}$ across all regions. A setting of $W_{3}$ that correctly classifies training regions also determines the classification of free regions. The achievable total classifications of free regions are the cells of a hyperplane arrangement in the polytope of valid $W_{3}$ values. These cells are not independent across regions.
This means the region-class vocabulary overstates the freedom of the network. The feature-classifier vocabulary (Appendix [I](https://arxiv.org/html/2605.05209#A9)) solves this problem. It defines programs over the shared parameter space rather than per-region classifications, so every element of its embodied language is realisable by a single $(W,b)$. The pair proxy experiment (Section [5.7](https://arxiv.org/html/2605.05209#S5.SS7)) uses this vocabulary. The pair proxy computes single-point extensions exactly (Theorem [15](https://arxiv.org/html/2605.05209#Thmtheorem15)). It is a sum of marginals, not the full extension count $|\mathrm{Ext}(\pi)|$. Computing the full extension exactly remains intractable.
## Appendix I The Feature-Classifier Vocabulary
The region-class vocabulary (Definition [10](https://arxiv.org/html/2605.05209#Thmdefinition10)) approximates the network by assigning one class per region and treating regions as independent. Both are false in general. The shared weight matrix $W_{3}$ couples the regions, and multiple inputs per region can receive different classes. The pair proxy experiment avoids both approximations by checking per-input linear feasibility. This appendix proves that the pair proxy computes single-point extensions of the training policy exactly in a well-defined vocabulary that respects the shared-weight constraint.
Fix a feature map $\varphi\colon X\to\mathbb{R}^{d}$ obtained by freezing all layers up to and including a ReLU activation. For an MLP with frozen layers 1 and 2, $\varphi(x)=\sigma(W_{2}\sigma(W_{1}x+b_{1})+b_{2})$. For a CNN with a frozen bottleneck, $\varphi(x)=\sigma(W_{\mathrm{neck}}\,\mathrm{conv}(x)+b_{\mathrm{neck}})$. In both cases, $\varphi$ is fixed and the final classifier is $\operatorname*{arg\,max}_{c}\;(W_{c}\cdot\varphi(x)+b_{c})$ with free parameters $(W,b)\in\mathbb{R}^{Kd+K}$.
Let $S\subseteq X$ be a finite set of inputs (training plus unseen probe points) and $K$ the number of classes.
###### Definition 13 (Feature-classifier vocabulary).
Define the environment $\Phi=\mathbb{R}^{Kd+K}$, the set of all parameter settings $(W,b)$ for the final linear layer. For each input $x\in S$ and class $c\in\{0,\ldots,K-1\}$, define the program
$$p_{x,c}=\bigl\{(W,b)\in\Phi:W_{c}\cdot\varphi(x)+b_{c}>W_{c'}\cdot\varphi(x)+b_{c'}\;\text{ for all }c'\neq c\bigr\}.$$
The *feature-classifier vocabulary* is $\mathfrak{v}_{\varphi}=\{p_{x,c}:x\in S,\,c\in\{0,\ldots,K-1\}\}$.
Each program says "input $x$ is classified as $c$ by the final linear layer, given the frozen features $\varphi(x)$." Each $p_{x,c}$ is an open subset of $\mathbb{R}^{Kd+K}$ defined by finitely many strict linear inequalities. For fixed $x$, the $K$ programs $p_{x,0},\ldots,p_{x,K-1}$ are pairwise disjoint. No parameter vector can assign $x$ to two classes simultaneously.
The embodied language $L_{\mathfrak{v}_{\varphi}}$ is the set of all consistent subsets $l\subseteq\mathfrak{v}_{\varphi}$, where consistency means $\bigcap_{p\in l}p\neq\emptyset$. That is, some parameter setting satisfies all programs in $l$ simultaneously. This follows the standard Stack Theory definition of $L_{\mathfrak{v}}$. The pairwise disjointness of $p_{x,0},\ldots,p_{x,K-1}$ ensures that every $l\in L_{\mathfrak{v}_{\varphi}}$ assigns at most one class per input.
###### Proposition 14 (Consistency is LP feasibility).
A subset $l\subseteq\mathfrak{v}_{\varphi}$ is in $L_{\mathfrak{v}_{\varphi}}$ if and only if the following system of strict linear inequalities is feasible. For each $p_{x_{i},c_{i}}\in l$ and each $c\neq c_{i}$,
$$W_{c_{i}}\cdot\varphi(x_{i})+b_{c_{i}}>W_{c}\cdot\varphi(x_{i})+b_{c}.$$
Equivalently, $l\in L_{\mathfrak{v}_{\varphi}}$ if and only if for some $\varepsilon>0$, the LP with constraints $(W_{c_{i}}-W_{c})\cdot\varphi(x_{i})+(b_{c_{i}}-b_{c})\geq\varepsilon$ for all $p_{x_{i},c_{i}}\in l$ and $c\neq c_{i}$ is feasible.
###### Proof.
By Definition [13](https://arxiv.org/html/2605.05209#Thmdefinition13), $\bigcap_{p\in l}p$ is the set of all $(W,b)$ satisfying the strict inequalities for every program in $l$. If $l$ assigns two classes to the same input (say $p_{x,c_{1}}$ and $p_{x,c_{2}}$ with $c_{1}\neq c_{2}$), the system contains $W_{c_{1}}\cdot\varphi(x)+b_{c_{1}}>W_{c_{2}}\cdot\varphi(x)+b_{c_{2}}$ and its reverse, which is infeasible. Otherwise, $l\in L_{\mathfrak{v}_{\varphi}}$ if and only if the conjunction of strict inequalities has a solution.
For the LP equivalence, suppose $(W^{*},b^{*})$ satisfies all strict inequalities. Define $\varepsilon^{*}=\min_{i,\,c\neq c_{i}}\bigl[(W^{*}_{c_{i}}-W^{*}_{c})\cdot\varphi(x_{i})+(b^{*}_{c_{i}}-b^{*}_{c})\bigr]$. The minimum is over finitely many strictly positive terms, so $\varepsilon^{*}>0$. Then $(W^{*},b^{*})$ is LP-feasible for any $\varepsilon\leq\varepsilon^{*}$. Conversely, any LP solution with $\varepsilon>0$ satisfies all strict inequalities. ∎
Checking whether a set of input-class assignments is simultaneously achievable by some linear classifier is an LP feasibility problem. The LP does not approximate the vocabulary. It checks consistency in $\mathfrak{v}_{\varphi}$ exactly, provided the margin $\varepsilon$ is small enough not to exclude feasible solutions. In the Fashion-MNIST replication, the adaptive margin procedure finds the maximum feasible $\varepsilon$ and uses half of it, guaranteeing this condition. In the original MNIST pair-proxy experiment, the LP instead uses a fixed small margin, which serves the same role provided it is chosen below the maximum feasible margin.
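A consistency check of this kind might be sketched as follows. This is not the paper's implementation: it assumes NumPy and SciPy are installed, the function name `is_consistent` and the variable layout ($Kd$ weights followed by $K$ biases) are choices made here, and the margin is a plain fixed `eps`. Because the variables are left unbounded in this sketch, scaling a solution rescales its margin, so any positive `eps` gives the same answer; the small-margin proviso would matter if the parameters were bounded.

```python
import numpy as np
from scipy.optimize import linprog

def is_consistent(assignments, num_classes, eps=1.0):
    """True if some shared (W, b) classifies every (phi, c) pair as c
    with margin at least eps. `assignments` is a list of (phi, c)."""
    if not assignments:
        return True  # the empty set of programs is trivially consistent
    d = len(assignments[0][0])
    n_vars = num_classes * d + num_classes  # K*d weights, then K biases
    rows = []
    for phi, c_i in assignments:
        phi = np.asarray(phi, dtype=float)
        for c in range(num_classes):
            if c == c_i:
                continue
            # Encodes (W_ci - W_c) . phi + (b_ci - b_c) >= eps
            row = np.zeros(n_vars)
            row[c_i * d:(c_i + 1) * d] += phi
            row[c * d:(c + 1) * d] -= phi
            row[num_classes * d + c_i] += 1.0
            row[num_classes * d + c] -= 1.0
            rows.append(-row)  # linprog expects A_ub @ z <= b_ub
    if not rows:
        return True  # a single class imposes no constraints
    res = linprog(np.zeros(n_vars), A_ub=np.vstack(rows),
                  b_ub=-eps * np.ones(len(rows)),
                  bounds=(None, None), method="highs")
    return res.status == 0  # 0: a feasible point was found

# Toy check with K = 2 classes and 1-D features (hypothetical data).
print(is_consistent([([0.0], 0), ([1.0], 1)], num_classes=2))
print(is_consistent([([0.0], 0), ([1.0], 1), ([2.0], 0)], num_classes=2))
```

The objective is zero because only feasibility matters; `bounds=(None, None)` overrides `linprog`'s default non-negativity constraint on every variable.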
###### Definition 14 (Training policy in $\mathfrak{v}_{\varphi}$).
Given training data $(x_{1},y_{1}),\ldots,(x_{n},y_{n})$ with $x_{i}\in S$, the training policy is
$$\pi=\{p_{x_{i},y_{i}}:i=1,\ldots,n\}.$$
Assume $\pi\in L_{\mathfrak{v}_{\varphi}}$. This holds whenever the trained parameters $(W^{*},b^{*})$ classify every training input $x_{i}$ as $y_{i}$ with a strict margin, because then $(W^{*},b^{*})\in\bigcap_{p\in\pi}p$.
###### Definition 15 (Pair proxy as extension count).
For an unseen input $x_{j}\in S\setminus\{x_{1},\ldots,x_{n}\}$, the *pointwise extension count* is
$$\mathrm{ext}(x_{j})=|\{c\in\{0,\ldots,K-1\}:\pi\cup\{p_{x_{j},c}\}\in L_{\mathfrak{v}_{\varphi}}\}|.$$
The *pair proxy* is
$$\mathrm{PP}(\varphi)=\sum_{j}\mathrm{ext}(x_{j}).$$
Each term $\mathrm{ext}(x_{j})$ counts the number of classes to which $x_{j}$ can be assigned while keeping the training constraints satisfiable by some shared $(W,b)$. By Proposition [14](https://arxiv.org/html/2605.05209#Thmtheorem14), each such check is an LP feasibility problem.
The pair proxy is not the full weakness $|\mathrm{Ext}(\pi)|$. The full weakness counts all consistent partial classifications extending $\pi$, including multi-point extensions and partial assignments. The pair proxy counts single-point extensions, summed across unseen inputs. Computing $|\mathrm{Ext}(\pi)|$ exactly would require enumerating an exponential number of joint assignments. The pair proxy is a tractable quantity derived from the single-point marginals of the extension set.
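To make the definition concrete, here is a toy instance with $K=2$ classes and scalar features ($d=1$), a special case chosen for this illustration and not taken from the experiments. Writing $a=W_{0}-W_{1}$ and $\beta=b_{0}-b_{1}$, the program $p_{x,0}$ becomes $a\,\varphi(x)+\beta>0$ and $p_{x,1}$ becomes $a\,\varphi(x)+\beta<0$, so consistency reduces to strict separability on the line and needs no LP solver. The names `separable` and `pair_proxy` are hypothetical.

```python
def separable(points):
    """Exact consistency check for K=2, d=1: is there a shared (a, beta) with
    a*phi + beta > 0 on class 0 and < 0 on class 1 for all (phi, c) pairs?"""
    zeros = [phi for phi, c in points if c == 0]
    ones = [phi for phi, c in points if c == 1]
    if not zeros or not ones:
        return True  # a constant classifier (a = 0) suffices
    return max(zeros) < min(ones) or max(ones) < min(zeros)

def pair_proxy(train, unseen):
    """Return (PP, [ext(x_j)]): per unseen input, the number of classes that
    keep the training policy satisfiable, and the sum of those counts."""
    assert separable(train), "the training policy must itself be consistent"
    ext = [sum(separable(train + [(phi, c)]) for c in (0, 1))
           for phi in unseen]
    return sum(ext), ext

# Hypothetical features: class 0 at 0.0 and 1.0, class 1 at 4.0.
train = [(0.0, 0), (1.0, 0), (4.0, 1)]
unseen = [-2.0, 0.5, 2.5, 6.0]
total, ext = pair_proxy(train, unseen)
print(total, ext)  # 5 [1, 1, 2, 1]
```

Only the probe at 2.5, which falls in the gap between the two training classes, can still take either label; the other three probes are pinned by the training constraints.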
###### Theorem 15 (Feature-classifier reduction).
Fix a feature map $\varphi$, training data $(x_{i},y_{i})$ with $\pi\in L_{\mathfrak{v}_{\varphi}}$, and the feature-classifier vocabulary $\mathfrak{v}_{\varphi}$.
1. For each unseen $x_{j}$, $\mathrm{ext}(x_{j})=|\{c:p_{x_{j},c}\text{ appears in some }l\in\mathrm{Ext}(\pi)\}|$. The pointwise extension count equals the number of classes that appear in at least one completion of the training policy.
2. $\mathrm{PP}(\varphi)$ is computable by solving $K\cdot|S_{\mathrm{unseen}}|$ LP feasibility problems.
3. $\mathrm{PP}$ is invariant under invertible affine reparameterisation of the feature space. Let $A\in\mathbb{R}^{d\times d}$ be invertible and define $\varphi'(x)=A\,\varphi(x)$. Then $\mathrm{PP}(\varphi')=\mathrm{PP}(\varphi)$. In particular, $\mathrm{PP}$ is invariant under the scaling reparameterisation $T_{\beta}$ (where $A=\beta I$).
4. $\mathfrak{v}_{\varphi}$ respects the shared-weight constraint by construction. Every $l\in L_{\mathfrak{v}_{\varphi}}$ is realisable by some single $(W,b)$. The region-class vocabulary does not have this property.
###### Proof.
Claim 1. I prove both directions.
($\Leftarrow$) Suppose $\pi\cup\{p_{x_{j},c}\}$ is consistent. Then $\pi\cup\{p_{x_{j},c}\}\in L_{\mathfrak{v}_{\varphi}}$. Since $\pi\subseteq\pi\cup\{p_{x_{j},c}\}$, this set is in $\mathrm{Ext}(\pi)$ and contains $p_{x_{j},c}$.
($\Rightarrow$) Suppose $p_{x_{j},c}\in l$ for some $l\in\mathrm{Ext}(\pi)$. Then $\pi\cup\{p_{x_{j},c}\}\subseteq l$ (because $\pi\subseteq l$ and $p_{x_{j},c}\in l$). Since $l$ is consistent and consistency is inherited by subsets ($l'\subseteq l$ implies $\bigcap_{p\in l'}p\supseteq\bigcap_{p\in l}p\neq\emptyset$), $\pi\cup\{p_{x_{j},c}\}$ is consistent.
Claim 2. By Proposition [14](https://arxiv.org/html/2605.05209#Thmtheorem14), each consistency check $\pi\cup\{p_{x_{j},c}\}\in L_{\mathfrak{v}_{\varphi}}$ reduces to LP feasibility. There are $K$ classes and $|S_{\mathrm{unseen}}|$ unseen inputs.
Claim 3. Let $\varphi'(x)=A\,\varphi(x)$ for invertible $A$, and let $\mathfrak{v}_{\varphi'}$ be the corresponding vocabulary. Define the bijection $\Psi\colon\mathbb{R}^{Kd+K}\to\mathbb{R}^{Kd+K}$ by $\Psi(W,b)=(WA^{-1},b)$. For any $x\in S$ and class $c$,
$$\begin{aligned}(W,b)\in p_{x,c}&\iff W_{c}\cdot\varphi(x)+b_{c}>W_{c'}\cdot\varphi(x)+b_{c'}\text{ for all }c'\neq c\\&\iff(WA^{-1})_{c}\cdot A\,\varphi(x)+b_{c}>(WA^{-1})_{c'}\cdot A\,\varphi(x)+b_{c'}\text{ for all }c'\neq c\\&\iff\Psi(W,b)\in p'_{x,c},\end{aligned}$$
where $p'_{x,c}$ is the program for $x,c$ in $\mathfrak{v}_{\varphi'}$. The second line uses $WA^{-1}\cdot A\varphi(x)=W\cdot\varphi(x)$.
So $\Psi$ maps $p_{x,c}$ bijectively to $p'_{x,c}$ for every $x\in S$ and $c$. It follows that $l\in L_{\mathfrak{v}_{\varphi}}$ if and only if $\Psi(l)\in L_{\mathfrak{v}_{\varphi'}}$ (where $\Psi(l)$ replaces each program by its image). Since $\Psi$ maps $\pi$ to $\pi'$ and preserves consistency, $\mathrm{ext}(x_{j})$ is the same under $\varphi$ and $\varphi'$, and $\mathrm{PP}(\varphi)=\mathrm{PP}(\varphi')$.
Claim 4. By definition, $l\in L_{\mathfrak{v}_{\varphi}}$ means $\bigcap_{p\in l}p\neq\emptyset$. Each $p\in l$ is a subset of $\Phi=\mathbb{R}^{Kd+K}$, so a nonempty intersection contains a parameter setting $(W,b)$ satisfying all constraints in $l$ simultaneously. In contrast, the region-class vocabulary defines $\Phi=\{0,\ldots,9\}^{\mathcal{R}}$ and consistency means a compatible partial function exists, regardless of whether any single $(W,b)$ realises it. ∎
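The identity behind Claim 3, $WA^{-1}\cdot A\varphi(x)=W\cdot\varphi(x)$, can be verified on a small example. The sketch below uses exact rational arithmetic with hypothetical values for $W$, $b$, $\varphi(x)$ and $A$ (here $K=3$, $d=2$); the helper names are introduced only for this illustration.

```python
from fractions import Fraction as F

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(M, N):
    cols = list(zip(*N))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in M]

def inverse_2x2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    assert det != 0, "A must be invertible"
    return [[d / det, -b / det], [-c / det, a / det]]

def scores(W, b, phi):
    """Class scores W_c . phi + b_c for every class c."""
    return [s + bc for s, bc in zip(matvec(W, phi), b)]

W = [[F(1), F(-2)], [F(0), F(3)], [F(2), F(1)]]  # K = 3 classes, d = 2
b = [F(1), F(0), F(-1)]
phi = [F(2), F(5)]
A = [[F(2), F(1)], [F(1), F(1)]]                 # invertible (det = 1)

W_new = matmul(W, inverse_2x2(A))  # Psi(W, b) = (W A^{-1}, b)
phi_new = matvec(A, phi)           # phi'(x) = A phi(x)

# Every class score is unchanged, so every program membership is preserved.
assert scores(W, b, phi) == scores(W_new, b, phi_new)
```

Since all $K$ scores agree exactly, the arg max and every strict inequality defining $p_{x,c}$ carry over to $p'_{x,c}$.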
The key distinction from the region-class vocabulary is Claim 4. The region-class vocabulary has $|L_{\mathfrak{v}_{\mathrm{net}}}|=11^{R}$ and overstates the extension because it treats each region's classification as an independent degree of freedom. In the feature-classifier vocabulary, every element of $L_{\mathfrak{v}_{\varphi}}$ must be realisable by a single shared parameter vector. This is exactly the shared-weight vocabulary identified in Section [H.6](https://arxiv.org/html/2605.05209#A8.SS6).
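The size of this overstatement shows up even in a toy setting. The sketch below, an illustration and not an experiment from the paper, takes five distinct scalar features with $K=2$ classes and compares the number of total labellings under independent choices with the number realisable by one shared threshold classifier. It counts total labellings only, not partial ones, and `realisable` is a hypothetical helper.

```python
from itertools import product

def realisable(features, labels):
    """K=2, d=1: is this labelling produced by a single shared (a, beta)?"""
    zeros = [f for f, c in zip(features, labels) if c == 0]
    ones = [f for f, c in zip(features, labels) if c == 1]
    if not zeros or not ones:
        return True
    return max(zeros) < min(ones) or max(ones) < min(zeros)

features = [0.0, 1.0, 2.0, 3.0, 4.0]  # hypothetical, pairwise distinct
all_labellings = list(product((0, 1), repeat=len(features)))
shared = [l for l in all_labellings if realisable(features, l)]
print(len(all_labellings), len(shared))  # 32 10
```

Independent choices give $2^{m}$ labellings of $m$ points, while a shared threshold realises only $2m$ of them, so the gap widens exponentially with the number of points.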
###### Corollary 16 (Architecture-independence).
Let $\varphi_{1}$ and $\varphi_{2}$ be two frozen feature maps (from potentially different architectures) that produce the same feature vectors on all inputs in $S$. That is, $\varphi_{1}(x)=\varphi_{2}(x)$ for all $x\in S$. Then $\mathfrak{v}_{\varphi_{1}}=\mathfrak{v}_{\varphi_{2}}$ and $\mathrm{PP}(\varphi_{1})=\mathrm{PP}(\varphi_{2})$.
###### Proof.
The programs $p_{x,c}$ depend on $\varphi$ only through the values $\varphi(x)$ for $x\in S$. Equal feature vectors produce identical programs, hence identical vocabularies, languages, extensions, and pair proxies. ∎
The pair proxy depends only on the feature representation, not on the architecture that produced it. An MLP with hidden representation $\varphi(x)\in\mathbb{R}^{8}$ and a CNN with an 8-dimensional bottleneck define the same vocabulary if their features agree on $S$. For a bottleneck width of $d=8$ and $K=10$ classes, the LP has $Kd+K=90$ variables. This matches the scale of the MNIST and Fashion-MNIST pair proxy experiments (784$\to$64$\to$8$\to$10 gives $d=8$).