Emergence via Phase Transitions: Mechanism Landscapes and Universal Convergence Across Complex Systems


Summary

This paper introduces the Hierarchical Emergence Framework (HEF), which explains how diverse systems such as neural networks and biological evolution converge to similar internal representations through phase transitions in mechanism landscapes under physical and informational constraints. The framework is validated empirically with 111 grokking experiments that confirm universal convergence and identify a critical energy threshold.

arXiv:2606.07563v1 Announce Type: new Abstract: Across machine learning, biology, and physics, independently evolving systems often converge toward strikingly similar high-level structures despite radically different microscopic details. Grokking circuits converge across random seeds, evolutionary lineages rediscover similar metabolic solutions, and renormalization flows approach common fixed points. We propose the Hierarchical Emergence Framework (HEF) as a candidate universality framework for such convergence phenomena. HEF models emergence as a phase transition in a mechanism landscape constrained by thermodynamic and information-theoretic laws. The framework introduces a critical energy threshold Ec separating an exploration regime with competing mechanisms from a convergence regime governed by a unique minimum-cost mechanism. Under structural assumptions, we prove physical feasibility, derive strict metric contraction, and establish convergence toward a unique fixed-point representation independent of initial conditions. We further connect this convergence structure to causal emergence through Effective Information and mechanism competition entropy. To test the framework, we study delayed generalization ("grokking") in modular arithmetic transformers across 111 experiments. We identify a reproducible empirical fingerprint of the Ec transition: the weight norm peaks systematically before grokking in 92% of runs. Normalized accuracy curves collapse onto a tanh kink (R^2=0.93) consistent with a Landau-Ginzburg universality class, and all grokked models converge to 0.9745+/-0.014 regardless of initialization, weight decay, or training fraction (ANOVA p>0.13). HEF is not presented as a universal theory of emergence, but as a falsifiable mathematical scaffold for studying convergence phenomena across complex systems.

# Emergence via Phase Transitions: Mechanism Landscapes and Universal Convergence Across Complex Systems
Source: [https://arxiv.org/html/2606.07563](https://arxiv.org/html/2606.07563)

(May 2026)

###### Abstract

Why do independently trained neural networks converge to the same internal representations [[37](https://arxiv.org/html/2606.07563#bib.bib37), [23](https://arxiv.org/html/2606.07563#bib.bib23)]? Why does grokking (sudden generalisation after memorisation) follow universal statistics across architectures and tasks [[42](https://arxiv.org/html/2606.07563#bib.bib42)]? Why do independent evolutionary lineages repeatedly arrive at the same metabolic solutions across 993 yeast species [[38](https://arxiv.org/html/2606.07563#bib.bib38)]? We propose a *candidate* explanation for a structural motif common to these phenomena: when a system's energy budget crosses a critical threshold $E_c$, competing mechanisms undergo a phase transition that drives convergence toward a unique fixed point determined by the system's physical constraint set $\mathcal{P}$. We do not claim to explain all emergence, but to identify a recurring phase-transition structure across convergence phenomena in learning, biology, and physics. Concisely: *many emergence phenomena can be understood as phase transitions in mechanism landscapes under physical and informational constraints.*

We formalise this structural motif as the Hierarchical Emergence Framework (HEF), specified by a six-tuple $(R^{(1)},\mathcal{L},\mathcal{A}_0,\mathcal{G},\mathrm{mode},E)$ together with $\mathcal{P}=(\mathcal{P}_{\mathrm{thermo}},\mathcal{P}_{\mathrm{info}},\Phi)$, where the translation map $\Phi$ is an order-isomorphism of constraint lattices grounded in Landauer's Principle and the Jarzynski Equality. Three theorems follow. The *Physical Feasibility Theorem* guarantees that all generated entities satisfy thermodynamic and information-theoretic constraints simultaneously. The *Energy-Diversity Theorem* establishes a phase transition at $E_c$ between an exploration regime and a convergence regime. *Universal Feature Convergence* then follows via the Banach Fixed-Point Theorem: any two HEF instances sharing $\mathcal{P}$ and operating below $E_c$ converge to the *same* fixed-point representations, independent of initial conditions. A *Causal Emergence Theorem* additionally shows that the fixed point $R_\infty$ has strictly higher Effective Information [[21](https://arxiv.org/html/2606.07563#bib.bib21)] than the micro-level $R^{(1)}$, with the gain bounded by a measurable training-dynamics quantity.

We validate HEF empirically through 111 grokking experiments ($p\in\{23,31,41,53,67,83,97\}$, $\lambda\in\{1,2\}$, multiple seeds). *Universal Convergence is confirmed*: all grokked models converge to $0.9745\pm 0.014$ regardless of $p$, $\lambda$, or training fraction (ANOVA p-value $>0.13$; CV $=1.47\%$). A *novel $E_c$ fingerprint* is identified: the weight norm $\|w\|^2$ peaks $\sim 1{,}050$ steps before grokking in $92\%$ of runs, tracing the three-phase HEF trajectory. Accuracy curves collapse onto a tanh kink ($R^2=0.93$), placing grokking in the Landau–Ginzburg mean-field universality class. G2 scaling $\Delta t\propto 1/(\mathrm{frac}\cdot p\cdot\lambda)$ is supported across seven primes ($\beta=-1.39\pm 0.20$, $R^2=0.91$).

HEF makes *three falsifiable cross-domain predictions*: (P1) anaerobic yeast lineages have higher genomic convergence than aerobic lineages at the same phylogenetic distance; (P2) LLMs trained with higher weight decay produce representations with higher causal potency; (P3) a critical weight-decay threshold $\lambda_c(p)\in(2,4)$ exists beyond which grokking fails via mechanism starvation. Code, data, and a diagnostic toolkit (hef-tools) are provided to enable independent replication and application to new systems.

###### Abstract

This Supplementary Information \(SI\) provides complete, self\-contained proofs for all theorems, lemmas, and corollaries in the main text “A Hierarchical Emergence Framework: From Physical Constraints to Universal Convergence”\.

Each proof is broken into small, verifiable steps\. The exposition is accessible to researchers in machine learning, theoretical physics, and complex systems\. Special attention is given to the metric contraction property \(A6\), treated as an empirically verifiable condition grounded in standard deep learning practices \(spectral normalization and weight decay\) for ML instantiations, and in log\-Sobolev inequalities or monotone compression for other instantiations \(EOM, IFF, RSID\)\. Where rigorous analytical proofs are not available, we provide explicit empirical verification protocols referenced to the main text\.

###### Contents

1. [1 Introduction](https://arxiv.org/html/2606.07563#S1)
    1. [1.1 Three Puzzles, One Principle](https://arxiv.org/html/2606.07563#S1.SS1)
    2. [1.2 What HEF Contributes](https://arxiv.org/html/2606.07563#S1.SS2)
    3. [1.3 Main Results](https://arxiv.org/html/2606.07563#S1.SS3)
    4. [1.4 How to Read This Paper](https://arxiv.org/html/2606.07563#S1.SS4)
    5. [1.5 Paper Organisation](https://arxiv.org/html/2606.07563#S1.SS5)
2. [2 The Hierarchical Emergence Framework](https://arxiv.org/html/2606.07563#S2)
    1. [2.1 Primitive Sets and the Hierarchy](https://arxiv.org/html/2606.07563#S2.SS1)
    2. [2.2 Logical Language](https://arxiv.org/html/2606.07563#S2.SS2)
    3. [2.3 Mechanism Family](https://arxiv.org/html/2606.07563#S2.SS3)
    4. [2.4 Generation Rule and Operating Mode](https://arxiv.org/html/2606.07563#S2.SS4)
    5. [2.5 Energy Budget, Canonical Measure, and Relevance Weights](https://arxiv.org/html/2606.07563#S2.SS5)
    6. [2.6 The Full Framework Tuple](https://arxiv.org/html/2606.07563#S2.SS6)
3. [3 Physical Foundation](https://arxiv.org/html/2606.07563#S3)
    1. [3.1 Thermodynamic Constraints](https://arxiv.org/html/2606.07563#S3.SS1)
    2. [3.2 Information-Theoretic Constraints](https://arxiv.org/html/2606.07563#S3.SS2)
    3. [3.3 Consistency via Translation Map $\Phi$](https://arxiv.org/html/2606.07563#S3.SS3)
    4. [3.4 Metric on Logical Formulas](https://arxiv.org/html/2606.07563#S3.SS4)
    5. [3.5 Additional Structural Assumptions for Convergence](https://arxiv.org/html/2606.07563#S3.SS5)
    6. [3.6 Derivation of Metric Contraction: Scope and Limits](https://arxiv.org/html/2606.07563#S3.SS6)
    7. [3.7 Weight Function](https://arxiv.org/html/2606.07563#S3.SS7)
4. [4 Physical Feasibility Theorem](https://arxiv.org/html/2606.07563#S4)
5. [5 Energy Budget and the Diversity-Convergence Trade-off](https://arxiv.org/html/2606.07563#S5)
    1. [5.1 Complete Metric Space Structure](https://arxiv.org/html/2606.07563#S5.SS1)
    2. [5.2 P-Stability of Coupled Formulas](https://arxiv.org/html/2606.07563#S5.SS2)
    3. [5.3 Metric Contraction Lemma](https://arxiv.org/html/2606.07563#S5.SS3)
    4. [5.4 Energy-Diversity Trade-off Theorem](https://arxiv.org/html/2606.07563#S5.SS4)
    5. [5.5 Universal Feature Convergence](https://arxiv.org/html/2606.07563#S5.SS5)
    6. [5.6 Three Characterisations of $E_c$](https://arxiv.org/html/2606.07563#S5.SS6)
6. [6 Causal Emergence at the HEF Fixed Point](https://arxiv.org/html/2606.07563#S6)
    1. [6.1 Why Convergence Alone Does Not Establish Causal Emergence](https://arxiv.org/html/2606.07563#S6.SS1)
    2. [6.2 The Theorem](https://arxiv.org/html/2606.07563#S6.SS2)
7. [7 Mechanism Landscape Theory: What Determines Emergence](https://arxiv.org/html/2606.07563#S7)
    1. [7.1 Proposition A: Domain Determines Form, $\mathcal{P}$ Determines Type](https://arxiv.org/html/2606.07563#S7.SS1)
    2. [7.2 Proposition B: Mechanism Landscape Determines Universality Class](https://arxiv.org/html/2606.07563#S7.SS2)
    3. [7.3 Proposition C: Mechanism Competition Entropy Bounds Causal Potency](https://arxiv.org/html/2606.07563#S7.SS3)
    4. [7.4 An Emergence Classification Scheme](https://arxiv.org/html/2606.07563#S7.SS4)
8. [8 Instantiations](https://arxiv.org/html/2606.07563#S8)
    1. [8.1 ML: LLM Training Dynamics and Grokking](https://arxiv.org/html/2606.07563#S8.SS1)
        1. [8.1.1 How Emergence Forms: The Three-Phase HEF Trajectory](https://arxiv.org/html/2606.07563#S8.SS1.SSS1)
        2. [8.1.2 Formal Derivation](https://arxiv.org/html/2606.07563#S8.SS1.SSS2)
        3. [8.1.3 Small-Scale Empirical Evidence](https://arxiv.org/html/2606.07563#S8.SS1.SSS3)
    2. [8.2 EOM: Prebiotic Chemistry and Evolutionary Biology](https://arxiv.org/html/2606.07563#S8.SS2)
    3. [8.3 IFF: Information Field Theory](https://arxiv.org/html/2606.07563#S8.SS3)
    4. [8.4 RSID: Nanoparticle Signal Detection](https://arxiv.org/html/2606.07563#S8.SS4)
9. [9 Practitioner’s Guide: Applying HEF to New Systems](https://arxiv.org/html/2606.07563#S9)
    1. [9.1 Step 1: Identify the HEF Tuple](https://arxiv.org/html/2606.07563#S9.SS1)
    2. [9.2 Step 2: Detect the $E_c$ Fingerprint](https://arxiv.org/html/2606.07563#S9.SS2)
    3. [9.3 Step 3: Classify the Emergence Type](https://arxiv.org/html/2606.07563#S9.SS3)
    4. [9.4 Step 4: Intervene via HEF Predictions](https://arxiv.org/html/2606.07563#S9.SS4)
    5. [9.5 The hef-tools Package](https://arxiv.org/html/2606.07563#S9.SS5)
10. [10 Related Work](https://arxiv.org/html/2606.07563#S10)
11. [11 Conclusion](https://arxiv.org/html/2606.07563#S11)
12. [A Illustrative Example: Grokking Delay at $p=97$](https://arxiv.org/html/2606.07563#A1)
13. [B Proof of Compression Coefficients from Cost Minimality](https://arxiv.org/html/2606.07563#A2)
14. [C Reproducibility Package](https://arxiv.org/html/2606.07563#A3)
15. [S1 Summary of Assumptions](https://arxiv.org/html/2606.07563#A1a)
16. [S2 Flow of Proofs](https://arxiv.org/html/2606.07563#A2a)
17. [S3 Notation and Preliminaries](https://arxiv.org/html/2606.07563#A3a)
    1. [S3.1 Physical Attribute Space](https://arxiv.org/html/2606.07563#A3.SS1)
    2. [S3.2 Formula Metric](https://arxiv.org/html/2606.07563#A3.SS2)
    3. [S3.3 Hausdorff Metric](https://arxiv.org/html/2606.07563#A3.SS3)
18. [S4 Physical Foundation: The Translation Map $\Phi$](https://arxiv.org/html/2606.07563#A4)
19. [S5 Physical Feasibility Theorem](https://arxiv.org/html/2606.07563#A5)
20. [S6 Compression Coefficients](https://arxiv.org/html/2606.07563#A6)
21. [S7 Metric Contraction in ML Instantiations](https://arxiv.org/html/2606.07563#A7)
    1. [S7.1 Background: Lipschitz Properties of Neural Network Layers](https://arxiv.org/html/2606.07563#A7.SS1)
    2. [S7.2 Argument A: Structural Contraction via Monotone Compression](https://arxiv.org/html/2606.07563#A7.SS2)
    3. [S7.3 Argument B: Dynamical Contraction near the Fixed Point](https://arxiv.org/html/2606.07563#A7.SS3)
    4. [S7.4 Explicit Contraction Constant and Summary](https://arxiv.org/html/2606.07563#A7.SS4)
    5. [S7.5 P-Stability under Type-Preserving Atom Replacement](https://arxiv.org/html/2606.07563#A7.SS5)
    6. [S7.6 Open Experimental Protocol: G1-test](https://arxiv.org/html/2606.07563#A7.SS6)
22. [S8 Energy-Diversity Trade-off and Universal Convergence](https://arxiv.org/html/2606.07563#A8)
23. [S9 Causal Emergence at the Fixed Point](https://arxiv.org/html/2606.07563#A9)
    1. [S9.1 Effective Information](https://arxiv.org/html/2606.07563#A9.SS1)
    2. [S9.2 Main Causal Emergence Theorem](https://arxiv.org/html/2606.07563#A9.SS2)
24. [S10 Grokking Delay: Conditional Derivation](https://arxiv.org/html/2606.07563#A10)
25. [S11 Summary of Results](https://arxiv.org/html/2606.07563#A11)
26. [S12 Discussion: On the Status of A6](https://arxiv.org/html/2606.07563#A12)
27. [S13 References](https://arxiv.org/html/2606.07563#A13)
28. [References](https://arxiv.org/html/2606.07563#bib)

## 1 Introduction

### 1.1 Three Puzzles, One Principle

Consider three empirical observations from different fields.

Neural network convergence. Olah et al. [[37](https://arxiv.org/html/2606.07563#bib.bib37)] showed that independently trained CNNs develop the same curve detectors, high-low frequency detectors, and multifrequency detectors in corresponding layers. Huh et al. [[23](https://arxiv.org/html/2606.07563#bib.bib23)] extended this to cross-modal and cross-architecture convergence, naming it the *Platonic Representation Hypothesis*. No quantitative account explains *why* convergence is universal rather than architecture-specific.

Grokking. Power et al. [[42](https://arxiv.org/html/2606.07563#bib.bib42)] discovered that transformers trained on modular arithmetic suddenly generalise thousands of steps after memorisation. The delay $\Delta t$ is reproducible across random seeds, follows systematic scaling laws, and is accompanied by a discrete circuit transition [[41](https://arxiv.org/html/2606.07563#bib.bib41)]. Existing accounts explain *that* grokking occurs but not *when* it occurs or *why* the delay obeys $\Delta t\propto 1/(\mathrm{frac}\cdot p\cdot\lambda)$.
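As a hedged illustration of how a scaling relation of the form $\Delta t\propto x^{\beta}$ with $x=\mathrm{frac}\cdot p\cdot\lambda$ could be checked, the sketch below fits $\beta$ by ordinary least squares in log-log space. The helper name `fit_power_law` and the values of `frac`, `lam`, and the prefactor are assumptions made for illustration; the delays are synthetic, generated from an exact $1/x$ law so the example is self-checking, and they are not the paper's measurements.

```python
import math

def fit_power_law(x, y):
    """Least-squares fit of log y = log c + beta * log x; returns (beta, c)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    beta = sxy / sxx
    c = math.exp(my - beta * mx)
    return beta, c

# Synthetic, illustrative inputs (NOT the paper's data).
primes = [23, 31, 41, 53, 67, 83, 97]
frac, lam = 0.5, 1.0                      # assumed training fraction and weight decay
x = [frac * p * lam for p in primes]
delay = [5.0e4 / v for v in x]            # exact Delta t = c / x with assumed c = 5e4

beta, c = fit_power_law(x, delay)
print(f"beta = {beta:.3f}, c = {c:.1f}")
```

On real runs the fitted exponent would carry an uncertainty, which is how a reported value such as $\beta=-1.39\pm 0.20$ can be compared against the nominal exponent of $-1$.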

Convergent evolution. Opulente et al. [[38](https://arxiv.org/html/2606.07563#bib.bib38)] documented that the same keystone gene families expanded convergently in 80% of metabolic transitions across 993 yeast species, in lineages separated by hundreds of millions of years of independent evolution. Conway Morris [[11](https://arxiv.org/html/2606.07563#bib.bib11), [12](https://arxiv.org/html/2606.07563#bib.bib12)] argues this pattern is ubiquitous. Standard evolutionary theory attributes it to shared selection pressure, but offers no quantitative account of *why* convergence is as frequent and specific as observed.

We propose that all three phenomena are instances of the same principle: *when an energy budget crosses a critical threshold $E_c$, competing mechanisms collapse to a unique fixed point determined by physical constraints alone.* This paper formalises, proves, and empirically tests this principle as the Hierarchical Emergence Framework (HEF).

### 1.2 What HEF Contributes

HEF is not a universal theory of emergence, but a candidate universality framework: it proposes that many emergence phenomena share a common phase-transition structure governed by mechanism competition under physical and informational constraints. Beyond existing accounts [[4](https://arxiv.org/html/2606.07563#bib.bib4), [10](https://arxiv.org/html/2606.07563#bib.bib10), [8](https://arxiv.org/html/2606.07563#bib.bib8), [21](https://arxiv.org/html/2606.07563#bib.bib21)], HEF makes four contributions:

1. *Constructive specification.* HEF is not a description of emergence but an algorithm (Algorithm [1](https://arxiv.org/html/2606.07563#alg1)) that generates emergent entities from first principles.
2. *Quantitative threshold.* The critical energy $E_c$ is defined constructively (Theorem [S8.1](https://arxiv.org/html/2606.07563#A8.Thmtheorem1)) and has a measurable empirical fingerprint (the weight-norm peak, Section [8.1.3](https://arxiv.org/html/2606.07563#S8.SS1.SSS3)).
3. *Universality class identification.* The mechanism landscape near $\alpha^{*}$ determines the *type* of emergence (smooth, cusp, flat, or hierarchical) independently of domain vocabulary (Section [7](https://arxiv.org/html/2606.07563#S7), Table [1](https://arxiv.org/html/2606.07563#S7.T1)).
4. *Falsifiable cross-domain predictions.* HEF predicts specific, testable outcomes in ML, evolutionary biology, and nanomedicine (Predictions P1–P3, Section [11](https://arxiv.org/html/2606.07563#S11)).
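The weight-norm fingerprint mentioned in the second contribution can be sketched as a simple lag measurement between the peak of $\|w\|^2$ and the onset of generalisation. This is a minimal sketch under stated assumptions: the function name `ec_fingerprint`, the 0.95 accuracy threshold used to define "grokking", and the logged curves are all hypothetical and are not taken from the paper or from the hef-tools package.

```python
def ec_fingerprint(steps, weight_norm_sq, test_acc, acc_threshold=0.95):
    """Return (peak_step, grok_step, lag), or None if no grokking is detected.

    lag > 0 means the weight-norm peak precedes the grokking step.
    """
    # Index at which the squared weight norm is maximal (first maximum wins).
    peak_idx = max(range(len(steps)), key=lambda i: weight_norm_sq[i])
    # First index at which test accuracy reaches the (assumed) threshold.
    grok_idx = next((i for i, a in enumerate(test_acc) if a >= acc_threshold), None)
    if grok_idx is None:
        return None
    return steps[peak_idx], steps[grok_idx], steps[grok_idx] - steps[peak_idx]

# Hypothetical logged curves for one run (illustrative values only).
steps = list(range(0, 5000, 500))
weight_norm_sq = [10, 14, 19, 23, 21, 18, 15, 13, 12, 12]
test_acc = [0.01, 0.02, 0.02, 0.03, 0.05, 0.40, 0.97, 0.99, 0.99, 0.99]

result = ec_fingerprint(steps, weight_norm_sq, test_acc)
print(result)
```

Applied across many runs, the fraction of runs with a positive lag is the kind of statistic the reported 92% figure summarises.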

### 1.3 Main Results

##### Physical Feasibility Theorem (Section [4](https://arxiv.org/html/2606.07563#S4)).

Under A1–A4, every entity at every hierarchy level satisfies $\mathcal{P}_{\mathrm{thermo}}$ and $\mathcal{P}_{\mathrm{info}}$ simultaneously via $\Phi$.

##### Energy-Diversity Theorem (Section [5](https://arxiv.org/html/2606.07563#S5)).

$|R^{(k)}(E)|$ is non-decreasing in $E$; $E_c$ marks the inflection; for $E<E_c$ the hierarchy converges to a unique fixed point $R^{(k)}_{\infty}$ (Banach Fixed-Point Theorem on $(\Omega^{(k)},d_H)$).

##### Universal Feature Convergence (Section [5](https://arxiv.org/html/2606.07563#S5)).

Two HEF instances sharing $\mathcal{P}$ and $E<E_c$ converge to the *same* $R_\infty$, independent of initial conditions, architecture, or training data (Corollary [S8.2](https://arxiv.org/html/2606.07563#A8.Thmtheorem2)).
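The role of the Banach Fixed-Point Theorem in the two results above can be illustrated with a toy contraction. The sketch below is an assumption-laden stand-in: it uses an affine map on $\mathbb{R}^2$ with the Euclidean metric instead of the HEF update on $(\Omega^{(k)},d_H)$, and the matrix and offset are arbitrary illustrative choices. It shows only the generic mechanism, namely that iterating a contraction from different starting points reaches the same fixed point.

```python
def contraction(x):
    """Toy affine map T(x) = A x + b; its Euclidean Lipschitz constant is about 0.55."""
    a11, a12, a21, a22 = 0.5, 0.2, -0.1, 0.4
    return (a11 * x[0] + a12 * x[1] + 1.0,
            a21 * x[0] + a22 * x[1] + 2.0)

def iterate(x0, n=200):
    """Apply the map n times starting from x0."""
    x = x0
    for _ in range(n):
        x = contraction(x)
    return x

# Two very different initial conditions.
limit_a = iterate((0.0, 0.0))
limit_b = iterate((1000.0, -500.0))
print(limit_a, limit_b)
```

Both runs approach the unique solution of $x = Ax + b$, here $(3.125, 2.8125)$; independence from initial conditions is a property of the contraction, not of the particular starting point.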

##### Causal Emergence Theorem (Section [6](https://arxiv.org/html/2606.07563#S6)).

Under NDA, $\mathrm{EI}(R_\infty)>\mathrm{EI}(R^{(1)})$. The EI gain equals the causal noise eliminated at the $E_c$ crossing and admits an empirical lower bound from training-curve variance.

##### Mechanism Landscape Theory (Section [7](https://arxiv.org/html/2606.07563#S7)).

The local geometry of $\mathcal{A}^{*}$ near $\alpha^{*}$ determines the *universality class* of emergence. Smooth landscapes give tanh kinks (Class I, confirmed for grokking: $R^2=0.93$); flat landscapes give high-variance timing (Class IV, observed for $p=31$).
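To make the "tanh kink" concrete, the following sketch fits a curve of the form $\tfrac12\bigl(1+\tanh((t-t_0)/w)\bigr)$ to an accuracy trace by a coarse grid search and reports $R^2$. The functional form, the parameter names `t0` and `w`, the grid ranges, and the synthetic trace are all assumptions for illustration; the paper's actual normalisation and fitting procedure are not specified in this section.

```python
import math

def tanh_kink(t, t0, w):
    """Assumed kink profile rising from 0 to 1 around t0 with width w."""
    return 0.5 * (1.0 + math.tanh((t - t0) / w))

def r_squared(y, y_hat):
    """Coefficient of determination of predictions y_hat against data y."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def fit_tanh_kink(t, y, t0_grid, w_grid):
    """Grid-search least squares; returns (t0, w, R^2) of the best grid point."""
    best = None
    for t0 in t0_grid:
        for w in w_grid:
            sse = sum((a - tanh_kink(s, t0, w)) ** 2 for s, a in zip(t, y))
            if best is None or sse < best[0]:
                best = (sse, t0, w)
    _, t0, w = best
    return t0, w, r_squared(y, [tanh_kink(s, t0, w) for s in t])

# Synthetic trace generated from the model itself (NOT the paper's data).
steps = list(range(0, 6001, 100))
acc = [tanh_kink(s, 3000.0, 400.0) for s in steps]

t0_hat, w_hat, r2 = fit_tanh_kink(steps, acc,
                                  range(2000, 4001, 100), range(100, 1001, 100))
print(t0_hat, w_hat, round(r2, 4))
```

A grid search is used here only to keep the example dependency-free; a nonlinear least-squares routine would be the more usual choice on noisy curves.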

##### ML Instantiation and Empirical Results (Section [8](https://arxiv.org/html/2606.07563#S8)).

111 grokking experiments across seven primes confirm Universal Convergence ($0.9745\pm 0.014$, CV $=1.47\%$, ANOVA p-value $>0.13$) and validate G2 scaling ($\beta=-1.39\pm 0.20$, $R^2=0.91$). A critical weight-decay threshold $\lambda_c\in(2,4)$ is identified as a mechanism-starvation boundary.

### 1.4 How to Read This Paper

For ML practitioners: Section [8](https://arxiv.org/html/2606.07563#S8) (grokking results) and Section [9](https://arxiv.org/html/2606.07563#S9) (diagnostic toolkit) are self-contained. The hef-tools package implements all diagnostics.

For theorists: Sections [3](https://arxiv.org/html/2606.07563#S3)–[6](https://arxiv.org/html/2606.07563#S6) contain the full proof chain. Section [7](https://arxiv.org/html/2606.07563#S7) develops the universality classification.

For biologists and physicists: Section [8](https://arxiv.org/html/2606.07563#S8) (EOM, IFF, RSID instantiations) maps HEF onto prebiotic chemistry, renormalisation group flow, and nanoparticle sensing.

### 1.5 Paper Organisation

Section [2](https://arxiv.org/html/2606.07563#S2) defines HEF. Section [3](https://arxiv.org/html/2606.07563#S3) establishes the physical foundation. Sections [4](https://arxiv.org/html/2606.07563#S4)–[5](https://arxiv.org/html/2606.07563#S5) prove the main theorems. Section [6](https://arxiv.org/html/2606.07563#S6) proves causal emergence. Section [7](https://arxiv.org/html/2606.07563#S7) develops mechanism landscape theory. Section [8](https://arxiv.org/html/2606.07563#S8) instantiates HEF and reports experiments. Section [9](https://arxiv.org/html/2606.07563#S9) provides the practitioner’s guide. Section [10](https://arxiv.org/html/2606.07563#S10) discusses related work. Section [11](https://arxiv.org/html/2606.07563#S11) concludes with open problems and predictions.

## 2 The Hierarchical Emergence Framework

### 2.1 Primitive Sets and the Hierarchy

###### Definition 2.1 (Primitive Set).

A primitive set at level $k$ is a finite collection $R^{(k)}=\{r^{(k)}_i\}$. Each primitive carries physical attributes $(E_i,S_i,H_i)\in\mathbb{R}_{\geq 0}^{3}$, where $E_i\geq 0$ is energy, $S_i\geq 0$ is thermodynamic entropy, and $H_i\geq 0$ is Shannon information content.

###### Definition 2.2 (Hierarchy).

The hierarchy is the sequence $R^{(1)}\to R^{(2)}\to\cdots\to R^{(K)}$, where $R^{(1)}$ is the domain-specific base set and each $R^{(k)}$, $k\geq 2$, consists of entities produced by applying mechanisms to logical combinations of $R^{(k-1)}$.

### 2.2 Logical Language

###### Definition 2.3 (Logical Language).

The logical language $\mathcal{L}(R^{(k)})$ is the smallest set closed under: (1) atomic formulas $r^{(k)}_i$; (2) physical negation $\neg\varphi\equiv\varphi^{\perp}$ (Definition [2.4](https://arxiv.org/html/2606.07563#S2.Thmdefinition4)); (3) admissible conjunction $\varphi\wedge\psi$ (Definition [2.5](https://arxiv.org/html/2606.07563#S2.Thmdefinition5)); (4) disjunction $\varphi\vee\psi$; (5) implication $\varphi\Rightarrow\psi$; and (6) causal ordering $\varphi\to\psi$.

###### Definition 2.4 (Physical Negation: Axiom N).

For every $r^{(k)}_i\models\mathcal{P}$, there exists a unique physical complement $r^{(k)\perp}_i$ such that: (N1) $r^{(k)\perp}_i\models\mathcal{P}$; (N2) $(r^{(k)\perp}_i)^{\perp}=r^{(k)}_i$ (involution); (N3) $r^{(k)}_i\wedge r^{(k)\perp}_i$ is physically unrealisable; (N4) $r^{(k)}_i\vee r^{(k)\perp}_i$ partitions the relevant phase space. The operator $\neg$ in $\mathcal{L}$ is defined as $\neg r^{(k)}_i\equiv r^{(k)\perp}_i$.

###### Definition 2.5 (Interaction Regularity).

A conjunction $\varphi\wedge\psi$ in $\mathcal{L}(R^{(k)})$ is admissible only if there exists an interaction energy $\Delta E_{\varphi\psi}$ (possibly zero) such that $E_{\mathrm{combined}}=E_{\varphi}+E_{\psi}+\Delta E_{\varphi\psi}$ satisfies energy conservation (P1).

### 2.3 Mechanism Family

###### Definition 2.6 (Mechanism).

A mechanism at level $k$ is a function $f^{(k)}_{\alpha}:\mathcal{L}(R^{(k-1)})\to R^{(k)}$ indexed by $\alpha\in\mathcal{A}$.

###### Definition 2.7 (Admissible Mechanisms).

The physically admissible set is $\mathcal{A}^{*}=\{\alpha\in\mathcal{A}\mid\varphi\models\mathcal{P}\Rightarrow f^{(k)}_{\alpha}(\varphi)\models\mathcal{P}\}$.

### 2.4 Generation Rule and Operating Mode

###### Definition 2.8 (Generation Rule).

$\mathcal{G}:\mathcal{A}_t\times R^{(k)}_t\to\mathcal{A}^{*}$ maps current indices and primitives to new admissible mechanism indices.

###### Definition 2.9 (Operating Mode).

$\mathrm{mode}\in\{\mathrm{controlled},\mathrm{self\text{-}generating}\}$. In controlled mode $\mathcal{A}=\mathcal{A}_0$. In self-generating mode $\mathcal{A}_{t+1}=\mathcal{A}_t\cup\mathcal{G}(\mathcal{A}_t,R^{(k)}_t)$.

### 2.5 Energy Budget, Canonical Measure, and Relevance Weights

###### Definition 2.10 (Canonical Physical Measure).

Let $R^{(k)}$ be a finite primitive set of size $N_k$. Assign to each $r^{(k)}_i$ the canonical Gibbs weight

$$p_i=\frac{e^{-E_i/k_B T}}{Z^{(k)}},\qquad Z^{(k)}=\sum_{j=1}^{N_k}e^{-E_j/k_B T}.$$

By the Jaynes maximum-entropy principle [[25](https://arxiv.org/html/2606.07563#bib.bib25)], the measure $\mu$ defined by these weights is the unique probability measure on $R^{(k)}$ maximising $H=-\sum_i p_i\log p_i$ subject to $\mathbb{E}[E_i]=\langle E\rangle$. Extend to $\mathcal{L}(R^{(k)})$:

- *Atomic:* $\mu(r^{(k)}_i)=p_i$.
- *Admissible conjunction:* $\mu(\varphi\wedge\psi)=p_{\varphi}\cdot p_{\psi}\cdot Z_{\varphi\psi}^{-1}\cdot e^{-\Delta E_{\varphi\psi}/k_B T}$, where $Z_{\varphi\psi}$ is the local partition function enforcing P1.
- *Physical negation:* $\mu(r^{(k)\perp}_i)=1-p_i$ (Axiom N4).
- *Disjunction, implication, causal ordering:* inherited by the standard extension to a Boolean algebra [[19](https://arxiv.org/html/2606.07563#bib.bib19)].

We call $\mu$ the canonical physical measure on $\mathcal{L}(R^{(k)})$. It is uniquely determined by $T>0$ (P3) and the energy values $\{E_i\}$ (P1).
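A small numerical sketch may help fix the atomic and negation cases of the canonical measure. It computes the Gibbs weights $p_i$ for a three-element primitive set and the complement weights $1-p_i$. The energies (expressed in units of $k_B T$) and the helper names are illustrative assumptions; conjunctions are omitted because the local partition function $Z_{\varphi\psi}$ is not specified in this section.

```python
import math

def gibbs_weights(energies, kT):
    """Canonical weights p_i = exp(-E_i / kT) / Z for a finite primitive set."""
    boltzmann = [math.exp(-e / kT) for e in energies]
    z = sum(boltzmann)                      # partition function Z^(k)
    return [b / z for b in boltzmann]

def shannon_entropy_bits(p):
    """Shannon entropy H = -sum p_i log2 p_i, in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

energies = [0.0, 1.0, 2.0]                  # illustrative energies in units of kT
p = gibbs_weights(energies, kT=1.0)         # mu(r_i) = p_i (atomic case)
complement = [1.0 - q for q in p]           # mu(r_i^perp) = 1 - p_i (negation case)

print([round(q, 4) for q in p])
print(round(shannon_entropy_bits(p), 4))
```

Lower-energy primitives receive larger weight, and the ratio of adjacent weights is $e^{\Delta E/k_B T}$, which is the only place temperature enters the measure.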

###### Definition 2.11 (Energy Budget).

The cost of mechanism $\alpha$ under $\mu$ is

$$\mathrm{cost}(\alpha)=\mathbb{E}_{\mu}\bigl[\Delta E_{\alpha}(\varphi)+k_B T\,\Delta H_{\alpha}(\varphi)\bigr],$$

where $\Delta E_{\alpha}(\varphi)=E_{f_{\alpha}(\varphi)}-E_{\varphi}$ and $\Delta H_{\alpha}(\varphi)=H(f_{\alpha}(\varphi))-H(\varphi)$. The budget-constrained set is $\mathcal{A}^{*}(E)=\{\alpha\in\mathcal{A}^{*}\mid\mathrm{cost}(\alpha)\leq E\}$.

###### Definition 2.12 (Relevance Weight).

$$w_{\alpha}(E)=\mathbf{1}[\alpha\in\mathcal{A}^{*}]\cdot\mathbf{1}[\mathrm{cost}(\alpha)\leq E]\cdot w^{\mathrm{domain}}_{\alpha}\cdot w^{\mathrm{context}}_{\alpha}.$$
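Definitions 2.11 and 2.12 can be read as a cost computation followed by two indicator gates. The sketch below is a minimal rendering under stated assumptions: each mechanism is described by hypothetical triples $(\mu(\varphi),\Delta E_\alpha(\varphi),\Delta H_\alpha(\varphi))$ over a handful of formulas whose weights sum to one, admissibility is supplied as a precomputed flag, and the domain and context weights default to 1. None of these data structures come from the paper.

```python
def mechanism_cost(samples, kT):
    """cost(alpha) = E_mu[dE + kT * dH]; samples = [(mu_phi, dE, dH), ...]."""
    return sum(mu * (d_e + kT * d_h) for mu, d_e, d_h in samples)

def relevance_weight(cost, admissible, budget, w_domain=1.0, w_context=1.0):
    """w_alpha(E) = 1[alpha in A*] * 1[cost <= E] * w_domain * w_context."""
    return float(admissible) * float(cost <= budget) * w_domain * w_context

# Hypothetical mechanisms (illustrative numbers only).
mechanisms = {
    "alpha_1": {"admissible": True,  "samples": [(0.5, 1.0, 0.5), (0.5, 2.0, 0.0)]},
    "alpha_2": {"admissible": True,  "samples": [(1.0, 4.0, 1.0)]},
    "alpha_3": {"admissible": False, "samples": [(1.0, 0.5, 0.0)]},
}
kT, budget = 1.0, 3.0

costs = {name: mechanism_cost(m["samples"], kT) for name, m in mechanisms.items()}
# Budget-constrained admissible set A*(E).
budget_set = sorted(name for name, m in mechanisms.items()
                    if m["admissible"] and costs[name] <= budget)
weights = {name: relevance_weight(costs[name], m["admissible"], budget)
           for name, m in mechanisms.items()}
print(costs, budget_set, weights)
```

Raising `budget` can only add mechanisms to `budget_set`, which is the monotonicity that the Energy-Diversity Theorem builds on.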

### 2.6 The Full Framework Tuple

###### Definition 2.13 (HEF).

A Hierarchical Emergence Framework is the tuple $\mathcal{H}=(R^{(1)},\mathcal{L},\mathcal{A}_0,\mathcal{G},\mathrm{mode},E)$ together with $\mathcal{P}=(\mathcal{P}_{\mathrm{thermo}},\mathcal{P}_{\mathrm{info}},\Phi)$. The generative process is Algorithm [1](https://arxiv.org/html/2606.07563#alg1).

Algorithm 1: HEF Generation

- 1: Input: $\mathcal{H}=(R^{(1)},\mathcal{L},\mathcal{A}_0,\mathcal{G},\mathrm{mode},E)$ and $\mathcal{P}$
- 2: Initialise $\mathcal{A}\leftarrow\mathcal{A}_0$
- 3: for $k=2,3,\ldots,K$ do
    - 4: Compute $\mathcal{A}^{*}(E)=\{\alpha\in\mathcal{A}:\alpha\in\mathcal{A}^{*},\ \mathrm{cost}(\alpha)\leq E\}$
    - 5: for all $\varphi\in\mathcal{L}(R^{(k-1)})$ with $\varphi\models\mathcal{P}$ do
        - 6: for all $\alpha\in\mathcal{A}^{*}(E)$ do
            - 7: Set $r^{(k)}\leftarrow f^{(k)}_{\alpha}(\varphi)$; add to $R^{(k)}$
        - 8: end for
    - 9: end for
    - 10: if $\mathrm{mode}=\mathrm{self\text{-}generating}$ then $\mathcal{A}\leftarrow\mathcal{A}\cup\mathcal{G}(\mathcal{A},R^{(k)})$
    - 11: end if
- 12: end for
- 13: return $R^{(1)},\ldots,R^{(K)}$
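The generation loop of Algorithm 1 can be sketched in a few lines of Python. This is a toy rendering under loudly stated simplifications, not the authors' implementation: only atomic formulas are enumerated (no negation, conjunction, or other connectives), admissibility is approximated by filtering outputs with a user-supplied predicate `satisfies_p`, each mechanism carries a fixed precomputed cost, and the example mechanisms and numeric "constraint" are invented for illustration.

```python
def hef_generate(base, mechanisms, budget, satisfies_p, levels,
                 mode="controlled", generate=None):
    """Toy Algorithm 1. mechanisms: {name: (function, cost)}; returns [R1, ..., RK]."""
    hierarchy = [set(base)]                       # R^(1)
    active = dict(mechanisms)                     # A <- A_0
    for _ in range(2, levels + 1):                # k = 2, ..., K
        # Mechanisms whose cost fits the energy budget E.
        affordable = [f for f, cost in active.values() if cost <= budget]
        level = set()
        for phi in hierarchy[-1]:                 # atomic formulas only (simplification)
            if not satisfies_p(phi):
                continue
            for f in affordable:
                r = f(phi)
                if satisfies_p(r):                # keep only P-satisfying outputs
                    level.add(r)
        hierarchy.append(level)
        if mode == "self-generating" and generate is not None:
            # A <- A union G(A, R^(k)); existing mechanisms are not overwritten.
            for name, mech in generate(active, level).items():
                active.setdefault(name, mech)
    return hierarchy

def within_bounds(x):
    """Hypothetical stand-in for 'x satisfies P'."""
    return 0 <= x <= 20

toy_mechanisms = {
    "double": (lambda x: 2 * x, 1.0),
    "square_plus_one": (lambda x: x * x + 1, 5.0),
}

low = hef_generate({1, 2}, toy_mechanisms, budget=2.0,
                   satisfies_p=within_bounds, levels=3)
high = hef_generate({1, 2}, toy_mechanisms, budget=10.0,
                    satisfies_p=within_bounds, levels=3)
print(low, high)
```

With the smaller budget only one mechanism is affordable and each level stays small; with the larger budget both mechanisms fire and the top level grows, a toy view of $|R^{(k)}(E)|$ being non-decreasing in $E$.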

## 3 Physical Foundation

### 3.1 Thermodynamic Constraints

$\mathcal{P}_{\mathrm{thermo}}=\{P_1,P_2,P_3\}$:

- P1. *Energy conservation.* $\Delta E_{\mathrm{total}}=0$. Conjunction $r_i\wedge r_j$ is admissible iff $\Delta E_{ij}$ satisfies P1.
- P2. *Second Law.* $\Delta S_{\mathrm{total}}\geq 0$ (including environmental entropy).
- P3. *Positive temperature.* $T>0$.

### 3.2 Information-Theoretic Constraints

$\mathcal{P}_{\mathrm{info}}=\{P_4,P_5,P_6\}$:

- P4. *Mutual information bound.* $I(r_i;r_j)\leq\min(H(r_i),H(r_j))$.
- P5. *Non-negative conditional entropy.* $H(r_i\mid r_j)\geq 0$.
- P6. *Data Processing Inequality (DPI).* For $r_i\to r_j\to r_k$: $I(r_i;r_k)\leq I(r_i;r_j)$.
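The three information-theoretic constraints are standard inequalities for discrete random variables, so they can be sanity-checked numerically. The sketch below does this for a toy binary Markov chain $X\to Y\to Z$ in which each step flips the bit with a small probability. The chain, the flip probabilities, and the helper names are illustrative assumptions and do not correspond to any HEF instantiation in the paper.

```python
import math
from itertools import product

def flip(a, b, eps):
    """Probability that a binary symmetric channel with flip rate eps maps a to b."""
    return 1.0 - eps if a == b else eps

def joint_xyz(eps_xy, eps_yz):
    """Joint law of a binary Markov chain X -> Y -> Z with uniform X."""
    return {(x, y, z): 0.5 * flip(x, y, eps_xy) * flip(y, z, eps_yz)
            for x, y, z in product((0, 1), repeat=3)}

def marginal(joint, keep):
    """Marginalise the joint law onto the coordinates listed in keep."""
    out = {}
    for outcome, pr in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + pr
    return out

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information(joint, a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return (entropy(marginal(joint, (a,))) + entropy(marginal(joint, (b,)))
            - entropy(marginal(joint, (a, b))))

joint = joint_xyz(eps_xy=0.1, eps_yz=0.2)
h_x = entropy(marginal(joint, (0,)))
h_y = entropy(marginal(joint, (1,)))
i_xy = mutual_information(joint, 0, 1)
i_xz = mutual_information(joint, 0, 2)

print(f"P4: I(X;Y) = {i_xy:.3f} <= min(H(X), H(Y)) = {min(h_x, h_y):.3f}")
print(f"P5: H(X|Y) = {h_x - i_xy:.3f} >= 0")
print(f"P6: I(X;Z) = {i_xz:.3f} <= I(X;Y) = {i_xy:.3f}")
```

Because P4–P6 hold for every joint distribution of this kind, a check like this cannot fail for a correctly specified chain; its practical use is catching estimation or bookkeeping errors when the quantities are computed from data.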

### 3.3 Consistency via Translation Map $\Phi$

##### Notation.

Define the constraint lattice $(\mathcal{P},\leq)$ where $P_i\leq P_j$ iff every process satisfying $P_i$ also satisfies $P_j$.

###### Proposition 3.1 (Constraint Lattice Isomorphism).

The map $\Phi:\mathcal{P}_{\mathrm{thermo}}\to\mathcal{P}_{\mathrm{info}}$ given by $P_1\mapsto P_4$, $P_2\mapsto P_5$, $P_3\mapsto P_6$ is an order-isomorphism of constraint lattices. Consequently, $\mathcal{A}^{*}_{\mathrm{thermo}}=\mathcal{A}^{*}_{\mathrm{info}}=:\mathcal{A}^{*}$.

###### Proof\.

We verify each correspondence as a logical equivalence of violation conditions, then confirm order\-preservation\.

\(i\)P1↔P4P\_\{1\}\\leftrightarrow P\_\{4\}\.By Landauer’s Principle\[[30](https://arxiv.org/html/2606.07563#bib.bib30)\], any mechanism erasingΔ​H\\Delta Hbits of information costs at leastkB​T​ln⁡2⋅Δ​Hk\_\{B\}T\\ln 2\\cdot\\Delta Hof work\. Henceα∈𝒜P1∗\\alpha\\in\\mathcal\{A\}^\{\*\}\_\{P\_\{1\}\}iffΔ​Eα≥−kB​T​\[H​\(fα\)−H​\(φ\)\]\\Delta E\_\{\\alpha\}\\geq\-k\_\{B\}T\\,\[H\(f\_\{\\alpha\}\)\-H\(\\varphi\)\]iffI​\(fα​\(φ\);φ\)≤H​\(φ\)I\(f\_\{\\alpha\}\(\\varphi\);\\varphi\)\\leq H\(\\varphi\)iffα∈𝒜P4∗\\alpha\\in\\mathcal\{A\}^\{\*\}\_\{P\_\{4\}\}\. The last equivalence uses Bennett\[[6](https://arxiv.org/html/2606.07563#bib.bib6)\]: violation of the MI bound forces violation of energy conservation\.

(ii) $P_{2}\leftrightarrow P_{5}$. The Jarzynski Equality \[[24](https://arxiv.org/html/2606.07563#bib.bib24)\] $\langle e^{-\beta W}\rangle=e^{-\beta\Delta F}$, combined with Jensen's inequality, gives $\langle W\rangle\geq\Delta F=\Delta E-T\Delta S_{\mathrm{total}}$, i.e. $\Delta S_{\mathrm{total}}\geq 0$ (P2). Under the canonical measure $\mu$ of Definition [2.10](https://arxiv.org/html/2606.07563#S2.Thmdefinition10), the Gibbs entropy equals $S_{\mathrm{Gibbs}}=k_{B}\ln 2\cdot H(\mu)$ \[[25](https://arxiv.org/html/2606.07563#bib.bib25)\] (verified by direct computation: $S=-k_{B}\sum_{i}p_{i}\ln p_{i}=k_{B}\ln 2\cdot H(\mu)$), making P2 equivalent to $H(\varphi\mid f_{\alpha}(\varphi))\geq 0$, i.e. P5.
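The identity $S_{\mathrm{Gibbs}}=k_{B}\ln 2\cdot H(\mu)$ used in part (ii) holds when $H$ is measured in bits, and it can be confirmed numerically. The following is a minimal sketch under stated assumptions: the four energy levels and the temperature of 300 K are arbitrary illustrative values, and the function names are hypothetical.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)

def gibbs_probabilities(energies, temperature):
    """Canonical weights p_i = exp(-E_i / k_B T) / Z for T > 0."""
    beta = 1.0 / (K_B * temperature)
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def gibbs_entropy(p):
    """S = -k_B * sum_i p_i ln p_i, in J/K."""
    return -K_B * sum(q * math.log(q) for q in p if q > 0)

def shannon_entropy_bits(p):
    """H = -sum_i p_i log2 p_i, in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Illustrative energy levels in joules (assumed, not taken from the paper).
energies = [0.0, 1.0e-21, 2.5e-21, 4.0e-21]
p = gibbs_probabilities(energies, temperature=300.0)
s_gibbs = gibbs_entropy(p)
h_bits = shannon_entropy_bits(p)

print("S_Gibbs            =", s_gibbs)
print("k_B * ln 2 * H(mu) =", K_B * math.log(2) * h_bits)
```

The two printed values agree up to floating-point rounding because $\ln p = \ln 2\cdot\log_{2}p$ term by term.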

\(iii\)P3↔P6P\_\{3\}\\leftrightarrow P\_\{6\}\.For a cascaderi→rj→rkr\_\{i\}\\to r\_\{j\}\\to r\_\{k\}of HEF mechanisms, P1 gives energy conservation at each step\. Energy conservation implies no information is spontaneously created; by the Shannon–Boltzmann correspondence \(established in part \(ii\)\), this forcesI​\(ri;rk\)≤I​\(ri;rj\)I\(r\_\{i\};r\_\{k\}\)\\leq I\(r\_\{i\};r\_\{j\}\)\(P6\)\. Formally, the DPI follows from the chain ruleI​\(ri;rj,rk\)=I​\(ri;rj\)\+I​\(ri;rk∣rj\)I\(r\_\{i\};r\_\{j\},r\_\{k\}\)=I\(r\_\{i\};r\_\{j\}\)\+I\(r\_\{i\};r\_\{k\}\\mid r\_\{j\}\)and the Markov propertyI​\(ri;rk∣rj\)=0I\(r\_\{i\};r\_\{k\}\\mid r\_\{j\}\)=0\(\[[13](https://arxiv.org/html/2606.07563#bib.bib13)\], Theorem 2\.8\.1\)\. Conversely, violation of P6 implies information gain across the cascade, i\.e\.I​\(ri;rk\)\>I​\(ri;rj\)I\(r\_\{i\};r\_\{k\}\)\>I\(r\_\{i\};r\_\{j\}\)\. By part \(i\) \(the Landauer–Bennett correspondence\), creating information without energetic cost violates energy conservation \(P1\)\. ThusP3↔P6P\_\{3\}\\leftrightarrow P\_\{6\}\.

###### Lemma 3\.2\(Temperature–DPI Correspondence\)\.

In the canonical Gibbs ensemble at temperatureT\>0T\>0, every𝒫\\mathcal\{P\}\-admissible mechanismfαf\_\{\\alpha\}satisfies the Data Processing Inequality \(P6\)\. Conversely, any mechanism violating P6 requiresT=0T=0\(zero temperature\) and is therefore excluded by P3\.

###### Proof\.

\(*P3⇒\\RightarrowP6*\.\) AtT\>0T\>0, the canonical measureμ\\muassigns positive weightpi=e−Ei/kB​T/Z\>0p\_\{i\}=e^\{\-E\_\{i\}/k\_\{B\}T\}/Z\>0to every𝒫\\mathcal\{P\}\-feasible state\. For a cascaderi→rj→rkr\_\{i\}\\to r\_\{j\}\\to r\_\{k\}of mechanisms:

$$I_{\mu}(r_{i};r_{k})=H_{\mu}(r_{i})-H_{\mu}(r_{i}\mid r_{k})\leq H_{\mu}(r_{i})-H_{\mu}(r_{i}\mid r_{j},r_{k})=H_{\mu}(r_{i})-H_{\mu}(r_{i}\mid r_{j})=I_{\mu}(r_{i};r_{j}),$$

where the inequality uses $H_{\mu}(r_{i}\mid r_{j},r_{k})\leq H_{\mu}(r_{i}\mid r_{k})$ (conditioning cannot increase entropy, Cover & Thomas \[[13](https://arxiv.org/html/2606.07563#bib.bib13)\], Theorem 2.6.5), and the equality $H_{\mu}(r_{i}\mid r_{j},r_{k})=H_{\mu}(r_{i}\mid r_{j})$ is the Markov property $I_{\mu}(r_{i};r_{k}\mid r_{j})=0$ from the physical cascade structure. Hence P6 holds.

\(*Violation of P6⇒\\RightarrowT=0T=0*\.\) SupposeIμ​\(ri;rk\)\>Iμ​\(ri;rj\)I\_\{\\mu\}\(r\_\{i\};r\_\{k\}\)\>I\_\{\\mu\}\(r\_\{i\};r\_\{j\}\)for some cascade\. By the Shannon–Boltzmann correspondence \(established in the P2↔\\leftrightarrowP5 argument\), mutual information gain impliesΔ​Stotal<0\\Delta S\_\{\\mathrm\{total\}\}<0, which by the Jarzynski Equality\[[24](https://arxiv.org/html/2606.07563#bib.bib24)\]requireskB​T→0k\_\{B\}T\\to 0\. HenceT=0T=0is necessary, contradicting P3 \(T\>0T\>0\)\.

The two directions together give the logical equivalenceP3↔P6P\_\{3\}\\leftrightarrow P\_\{6\}\. ∎

Order\-isomorphism\.Φ\\Phiis injective and surjective\. Each correspondence is a logical equivalence, giving order\-preservation in both directions\. HenceΦ\\Phiis an order\-isomorphism and𝒜thermo∗=𝒜info∗\\mathcal\{A\}^\{\*\}\_\{\\mathrm\{thermo\}\}=\\mathcal\{A\}^\{\*\}\_\{\\mathrm\{info\}\}\. ∎

### 3\.4Metric on Logical Formulas

Before stating A5 and A6, we define the metric that appears in both\.

###### Definition 3\.1\(Physical Metric and Metric on Formulas\)\.

(a) The *physical metric* on attribute space $\mathbb{R}_{\geq 0}^{3}$ is

$$d\bigl((E_{1},S_{1},H_{1}),(E_{2},S_{2},H_{2})\bigr)=\frac{|E_{1}-E_{2}|+k_{B}|S_{1}-S_{2}|+k_{B}\ln 2\cdot|H_{1}-H_{2}|}{E_{\mathrm{ref}}},$$

where $E_{\mathrm{ref}}>0$ is the open-system reference energy from A3 (below).

(b) For two formulas $\varphi=\varphi(r_{i_{1}},\ldots,r_{i_{m}})$ and $\psi=\psi(s_{j_{1}},\ldots,s_{j_{m}})$ in $\mathcal{L}(R^{(k-1)})$ with the same logical structure but potentially different atoms (coupled via a matching $\sigma$ on atoms), define the *formula metric*

$$d_{\mathcal{L}}(\varphi,\psi)=\max_{1\leq\ell\leq m}\,d\bigl(r_{i_{\ell}},s_{j_{\sigma(\ell)}}\bigr).$$

For formulas of different logical structure, set $d_{\mathcal{L}}=+\infty$ (incomparable). The $L^{\infty}$ extension is natural because each atom contributes independently to the physical attributes of the formula.
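The two metrics of Definition 3.1 translate directly into code. The sketch below is a hedged illustration: it represents each atom by its attribute triple $(E,S,H)$, works in natural units with $k_{B}=1$, and picks $E_{\mathrm{ref}}=2$ arbitrarily. The `same_structure` flag is a hypothetical stand-in for the comparison of logical structure, which the definition leaves to the formula language.

```python
import math

K_B = 1.0    # assumption: natural units, k_B = 1
E_REF = 2.0  # assumption: arbitrary positive reference energy

def physical_metric(x, y, k_b=K_B, e_ref=E_REF):
    """Weighted L1 distance between attribute triples (E, S, H), scaled by E_ref."""
    (e1, s1, h1), (e2, s2, h2) = x, y
    total = (abs(e1 - e2)
             + k_b * abs(s1 - s2)
             + k_b * math.log(2) * abs(h1 - h2))
    return total / e_ref

def formula_metric(atoms_phi, atoms_psi, sigma=None, same_structure=True):
    """Max over matched atoms of the physical metric; +inf if incomparable.

    atoms_phi, atoms_psi: sequences of (E, S, H) triples, one per atom.
    sigma: matching on atom positions (defaults to the identity matching).
    """
    if not same_structure or len(atoms_phi) != len(atoms_psi):
        return math.inf
    if sigma is None:
        sigma = list(range(len(atoms_phi)))
    return max(
        (physical_metric(atoms_phi[l], atoms_psi[sigma[l]])
         for l in range(len(atoms_phi))),
        default=0.0,
    )

phi_atoms = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
psi_atoms = [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)]
print(formula_metric(phi_atoms, psi_atoms))                 # identity matching
print(formula_metric(phi_atoms, psi_atoms, sigma=[1, 0]))   # swapped matching
print(formula_metric(phi_atoms, psi_atoms, same_structure=False))
```

The value of $d_{\mathcal{L}}$ depends on the matching $\sigma$, which is why the definition fixes $\sigma$ as part of the coupling rather than minimising over it.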

### 3\.5Additional Structural Assumptions for Convergence

###### Assumption 1\(P\-Determined Cost, A5\)\.

The canonical physical measureμ\\mu\(Definition[2\.10](https://arxiv.org/html/2606.07563#S2.Thmdefinition10)\) is determined entirely byT\>0T\>0and𝒫\\mathcal\{P\}\. Consequently,cost​\(α\)\\mathrm\{cost\}\(\\alpha\)\(Definition[2\.11](https://arxiv.org/html/2606.07563#S2.Thmdefinition11)\) depends onα\\alphaand𝒫\\mathcal\{P\}only, not onR\(1\)R^\{\(1\)\},𝒜0\\mathcal\{A\}\_\{0\}, or𝒢\\mathcal\{G\}\.

Motivation\.A5 holds when all𝒫\\mathcal\{P\}\-admissible primitives share the same canonical energy scalekB​Tk\_\{B\}T\(Gibbs\[[18](https://arxiv.org/html/2606.07563#bib.bib18)\]\)\.

### 3\.6Derivation of Metric Contraction: Scope and Limits

We address a fundamental question:*does strict metric contraction \(A6\) follow from A1–A5 alone, without additional structural conditions?*We prove that the answer isno in general, identify the precise gap, and establish the best achievable positive results\.

###### Proposition 3\.3\(Non\-Expansiveness from A1–A5\)\.

Under A1–A5, the minimum-cost mechanism $\alpha^{*}$ satisfies $c_{\alpha^{*}}\leq 1$:

$$d\bigl(f_{\alpha^{*}}(\varphi_{1}),f_{\alpha^{*}}(\varphi_{2})\bigr)\leq d_{\mathcal{L}}(\varphi_{1},\varphi_{2})\quad\forall\,\varphi_{1},\varphi_{2}\models\mathcal{P}.$$

###### Proof\.

By the DPI \(P6\),H​\(f​\(φ\)\)≤H​\(φ\)H\(f\(\\varphi\)\)\\leq H\(\\varphi\)for allφ\\varphi\. By P1 and A3,\|Ef​\(φ1\)−Ef​\(φ2\)\|≤\|Eφ1−Eφ2\|\|E\_\{f\(\\varphi\_\{1\}\)\}\-E\_\{f\(\\varphi\_\{2\}\)\}\|\\leq\|E\_\{\\varphi\_\{1\}\}\-E\_\{\\varphi\_\{2\}\}\|fordℒ<Erefd\_\{\\mathcal\{L\}\}<E\_\{\\mathrm\{ref\}\}\(P\-stability regime\)\. Hencecα∗≤1c\_\{\\alpha^\{\*\}\}\\leq 1\. ∎

###### Definition 3\.2\(Non\-trivial, Non\-injective Mechanism\)\.

fαf\_\{\\alpha\}isnon\-trivialif∃φ\\exists\\,\\varphiwithfα​\(φ\)≠φf\_\{\\alpha\}\(\\varphi\)\\neq\\varphi;non\-injectiveif∃φ1≠φ2\\exists\\,\\varphi\_\{1\}\\neq\\varphi\_\{2\}withfα​\(φ1\)=fα​\(φ2\)f\_\{\\alpha\}\(\\varphi\_\{1\}\)=f\_\{\\alpha\}\(\\varphi\_\{2\}\)\.

###### Lemma 3\.4\(Compression Coefficients\)\.

Let $\alpha^{*}$ be non-trivial and non-injective. Then $b_{\alpha^{*}}:=\sup_{\varphi\models\mathcal{P}}\frac{H(f(\varphi))}{H(\varphi)}<1$ and $a_{\alpha^{*}}:=\sup_{\varphi\models\mathcal{P}}\frac{E_{f(\varphi)}}{E_{\varphi}}\leq 1$ (strict $<1$ for $E<E_{c}$).

###### Proof\.

bα∗<1b\_\{\\alpha^\{\*\}\}<1\.By DPI,bα∗≤1b\_\{\\alpha^\{\*\}\}\\leq 1\. Non\-injectivity givesφA≠φB\\varphi\_\{A\}\\neq\\varphi\_\{B\}withfα∗\(φA\)=fα∗\(φB\)=:r∗f\_\{\\alpha^\{\*\}\}\(\\varphi\_\{A\}\)=f\_\{\\alpha^\{\*\}\}\(\\varphi\_\{B\}\)=:r^\{\*\}\.

Consider the random variableΦ\\Phithat equalsφA\\varphi\_\{A\}with probabilityppandφB\\varphi\_\{B\}with probability1−p1\-p\. Sincefα∗​\(φA\)=fα∗​\(φB\)=r∗f\_\{\\alpha^\{\*\}\}\(\\varphi\_\{A\}\)=f\_\{\\alpha^\{\*\}\}\(\\varphi\_\{B\}\)=r^\{\*\}, the Markov chainΦ→r∗→Φ\\Phi\\to r^\{\*\}\\to\\Phiholds\. By the Data Processing Inequality applied twice:

$$I(\Phi;\Phi)\geq I(\Phi;r^{*})\geq I(r^{*};r^{*})=H(r^{*}).$$

But $I(\Phi;\Phi)=H(\Phi)\leq\min(H(\varphi_{A}),H(\varphi_{B}))$ (the entropy of a mixture is at most the maximum of the individual entropies, which is bounded by the minimum when one has larger entropy). Hence $H(r^{*})\leq\min(H(\varphi_{A}),H(\varphi_{B}))$.

IfH​\(φA\)\>H​\(φB\)H\(\\varphi\_\{A\}\)\>H\(\\varphi\_\{B\}\), thenH​\(r∗\)≤H​\(φB\)<H​\(φA\)H\(r^\{\*\}\)\\leq H\(\\varphi\_\{B\}\)<H\(\\varphi\_\{A\}\)\. IfH\(φA\)=H\(φB\)=:h\>0H\(\\varphi\_\{A\}\)=H\(\\varphi\_\{B\}\)=:h\>0, thenH​\(r∗\)≤hH\(r^\{\*\}\)\\leq h, and sinceφA≠φB\\varphi\_\{A\}\\neq\\varphi\_\{B\}under the canonical measure, the inequality is strict:H​\(r∗\)<hH\(r^\{\*\}\)<h\. In either case,H​\(fα∗​\(φA\)\)<H​\(φA\)H\(f\_\{\\alpha^\{\*\}\}\(\\varphi\_\{A\}\)\)<H\(\\varphi\_\{A\}\)\.

Full support ofμ\\mugives positive weight to this pair, hence𝔼μ​\[Δ​Hα∗\]<0\\mathbb\{E\}\_\{\\mu\}\[\\Delta H\_\{\\alpha^\{\*\}\}\]<0, forcingbα∗<1b\_\{\\alpha^\{\*\}\}<1\.

aα∗≤1a\_\{\\alpha^\{\*\}\}\\leq 1, strict belowEcE\_\{c\}\.P1 and A3 bound\|Δ​E\|\|\\Delta E\|; minimum cost prefers energy\-releasing mechanisms; strict inequality follows from the budget constraintE<EcE<E\_\{c\}excluding energy\-neutral operations\. ∎

###### Lemma 3\.5\(SDPI for Minimum\-Cost Mechanism\)\.

α∗\\alpha^\{\*\}non\-injective⇒\\RightarrowchannelKα∗K\_\{\\alpha^\{\*\}\}satisfies the Strong Data Processing Inequality \(SDPI\) withη​\(α∗\)=bα∗<1\\eta\(\\alpha^\{\*\}\)=b\_\{\\alpha^\{\*\}\}<1:DKL​\(K​μ1∥K​μ2\)≤η​\(α∗\)​DKL​\(μ1∥μ2\)D\_\{\\mathrm\{KL\}\}\(K\\mu\_\{1\}\\\|K\\mu\_\{2\}\)\\leq\\eta\(\\alpha^\{\*\}\)D\_\{\\mathrm\{KL\}\}\(\\mu\_\{1\}\\\|\\mu\_\{2\}\)\(Raginsky\[[43](https://arxiv.org/html/2606.07563#bib.bib43)\], Theorem 4\)\.

###### Lemma 3\.6\(SDPI gives Square\-RootW1W\_\{1\}\-Contraction\)\.

For a deterministic channel with diameter $D<\infty$ satisfying SDPI with $\eta<1$, and Dirac inputs: $d(f(\varphi_{1}),f(\varphi_{2}))\leq\sqrt{2\eta}\cdot D\cdot d(\varphi_{1},\varphi_{2})^{1/2}$. This is square-root contraction, not linear contraction. Linear contraction requires additional structure (Remark [4](https://arxiv.org/html/2606.07563#Thmremark4)).

###### Proof\.

Combine Pinsker's inequality ($\|\mu-\nu\|_{\mathrm{TV}}^{2}\leq\frac{1}{2}D_{\mathrm{KL}}(\mu\|\nu)$), the SDPI, the bound $W_{1}\leq D\|\cdot\|_{\mathrm{TV}}$, and Kantorovich duality. For Diracs: $W_{1}(\delta_{f(\varphi_{1})},\delta_{f(\varphi_{2})})=d(f(\varphi_{1}),f(\varphi_{2}))$ by definition. ∎
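The Pinsker step in this proof can be spot-checked numerically. The sketch below is illustrative only: it draws random distributions on a five-point space (an arbitrary choice), uses the convention $\|\mu-\nu\|_{\mathrm{TV}}=\tfrac12\sum_i|\mu_i-\nu_i|$ with $D_{\mathrm{KL}}$ in nats, and the helper names are hypothetical.

```python
import math
import random

def total_variation(p, q):
    """TV distance sup_A |p(A) - q(A)| = 0.5 * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def random_distribution(n, rng):
    """Random strictly positive probability vector of length n."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def pinsker_gap(p, q):
    """0.5 * D_KL(p || q) - TV(p, q)^2, non-negative by Pinsker's inequality."""
    return 0.5 * kl_divergence(p, q) - total_variation(p, q) ** 2

rng = random.Random(0)  # fixed seed so the run is reproducible
gaps = [
    pinsker_gap(random_distribution(5, rng), random_distribution(5, rng))
    for _ in range(200)
]
print("smallest Pinsker gap over 200 random pairs:", min(gaps))
```

A non-negative minimum gap is consistent with the inequality; it is a sanity check, not a proof, and says nothing about the SDPI constant $\eta$.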

Linear contraction is established for two structural classes:

###### Definition 3\.3\(Monotone\-Compressive Mechanism\)\.

fαf\_\{\\alpha\}ismonotone\-compressiveif it preserves attribute ordering \(H​\(φ1\)≥H​\(φ2\)⇒H​\(f​\(φ1\)\)≥H​\(f​\(φ2\)\)H\(\\varphi\_\{1\}\)\\geq H\(\\varphi\_\{2\}\)\\Rightarrow H\(f\(\\varphi\_\{1\}\)\)\\geq H\(f\(\\varphi\_\{2\}\)\), similarly forEE\) and satisfies uniform boundsH​\(f​\(φ\)\)≤bα​H​\(φ\)H\(f\(\\varphi\)\)\\leq b\_\{\\alpha\}H\(\\varphi\),Ef​\(φ\)≤aα​EφE\_\{f\(\\varphi\)\}\\leq a\_\{\\alpha\}E\_\{\\varphi\}withaα,bα<1a\_\{\\alpha\},b\_\{\\alpha\}<1\.

###### Proposition 3\.7\(A6 for Monotone\-Compressive Mechanisms\)\.

α∗\\alpha^\{\*\}monotone\-compressive⇒\\Rightarrowcα∗=max⁡\(aα∗,bα∗\)<1c\_\{\\alpha^\{\*\}\}=\\max\(a\_\{\\alpha^\{\*\}\},b\_\{\\alpha^\{\*\}\}\)<1\.

###### Proof\.

WLOGH​\(φ1\)≥H​\(φ2\)H\(\\varphi\_\{1\}\)\\geq H\(\\varphi\_\{2\}\)\. Monotonicity givesH​\(f​\(φ1\)\)≥H​\(f​\(φ2\)\)H\(f\(\\varphi\_\{1\}\)\)\\geq H\(f\(\\varphi\_\{2\}\)\)\. Uniform bound:H​\(f​\(φ1\)\)≤bα∗​H​\(φ1\)H\(f\(\\varphi\_\{1\}\)\)\\leq b\_\{\\alpha^\{\*\}\}H\(\\varphi\_\{1\}\),H​\(f​\(φ2\)\)≥bα∗​H​\(φ2\)H\(f\(\\varphi\_\{2\}\)\)\\geq b\_\{\\alpha^\{\*\}\}H\(\\varphi\_\{2\}\)\(compression preserves order, so lower bound also scales bybα∗b\_\{\\alpha^\{\*\}\}\)\. HenceH​\(f​\(φ1\)\)−H​\(f​\(φ2\)\)≤bα∗​\(H​\(φ1\)−H​\(φ2\)\)H\(f\(\\varphi\_\{1\}\)\)\-H\(f\(\\varphi\_\{2\}\)\)\\leq b\_\{\\alpha^\{\*\}\}\(H\(\\varphi\_\{1\}\)\-H\(\\varphi\_\{2\}\)\)\. Likewise for energy\. Thend​\(f​\(φ1\),f​\(φ2\)\)≤max⁡\(aα∗,bα∗\)⋅dℒ​\(φ1,φ2\)=cα∗⋅dℒ​\(φ1,φ2\)d\(f\(\\varphi\_\{1\}\),f\(\\varphi\_\{2\}\)\)\\leq\\max\(a\_\{\\alpha^\{\*\}\},b\_\{\\alpha^\{\*\}\}\)\\cdot d\_\{\\mathcal\{L\}\}\(\\varphi\_\{1\},\\varphi\_\{2\}\)=c\_\{\\alpha^\{\*\}\}\\cdot d\_\{\\mathcal\{L\}\}\(\\varphi\_\{1\},\\varphi\_\{2\}\)withcα∗<1c\_\{\\alpha^\{\*\}\}<1\. ∎

###### Proposition 3\.8\(A6 for Linear\-Attribute Mechanisms\)\.

IfEf​\(φ\)=aα∗​EφE\_\{f\(\\varphi\)\}=a\_\{\\alpha^\{\*\}\}E\_\{\\varphi\}andH​\(f​\(φ\)\)=bα∗​H​\(φ\)H\(f\(\\varphi\)\)=b\_\{\\alpha^\{\*\}\}H\(\\varphi\)withaα∗,bα∗∈\(0,1\)a\_\{\\alpha^\{\*\}\},b\_\{\\alpha^\{\*\}\}\\in\(0,1\)\(from Lemma[S6\.1](https://arxiv.org/html/2606.07563#A6.Thmtheorem1)\), then A6 holds withcα∗=max⁡\(aα∗,bα∗\)<1c\_\{\\alpha^\{\*\}\}=\\max\(a\_\{\\alpha^\{\*\}\},b\_\{\\alpha^\{\*\}\}\)<1\.

###### Proof\.

Linear mechanisms are monotone\-compressive with exact ratio; apply Proposition[3\.7](https://arxiv.org/html/2606.07563#S3.Thmtheorem7)\. Linearity gives equality in every bound, confirmingcα∗=max⁡\(aα∗,bα∗\)c\_\{\\alpha^\{\*\}\}=\\max\(a\_\{\\alpha^\{\*\}\},b\_\{\\alpha^\{\*\}\}\)exactly\. ∎
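A small numerical experiment illustrates the linear-attribute case. It is a sketch under explicit assumptions: Proposition 3.8 specifies only the scalings of $E$ and $H$, so the code additionally assumes that $S$ is scaled by $b$ (in line with $S=k_{B}\ln 2\cdot H$); it uses $k_{B}=1$, $E_{\mathrm{ref}}=1$, and single-atom formulas, for which $d_{\mathcal{L}}$ coincides with $d$. All names are hypothetical.

```python
import math
import random

def physical_metric(x, y, k_b=1.0, e_ref=1.0):
    """Physical metric on (E, S, H) triples; k_b = e_ref = 1 assumed."""
    (e1, s1, h1), (e2, s2, h2) = x, y
    return (abs(e1 - e2) + k_b * abs(s1 - s2)
            + k_b * math.log(2) * abs(h1 - h2)) / e_ref

def linear_mechanism(a, b):
    """f(E, S, H) = (a*E, b*S, b*H).

    Assumption: S is scaled by b as well; the source states only the
    E and H scalings.
    """
    def f(x):
        e, s, h = x
        return (a * e, b * s, b * h)
    return f

def empirical_lipschitz(f, n_pairs=1000, seed=0):
    """Largest observed ratio d(f(x), f(y)) / d(x, y) over random pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_pairs):
        x = tuple(rng.uniform(0.0, 10.0) for _ in range(3))
        y = tuple(rng.uniform(0.0, 10.0) for _ in range(3))
        dxy = physical_metric(x, y)
        if dxy > 0:
            worst = max(worst, physical_metric(f(x), f(y)) / dxy)
    return worst

a, b = 0.6, 0.8
ratio = empirical_lipschitz(linear_mechanism(a, b))
print("observed contraction ratio:", ratio, "bound max(a, b):", max(a, b))
```

The observed ratio stays at or below $\max(a,b)$ because each term of the weighted $L^{1}$ metric is scaled by either $a$ or $b$. If $S$ were left unscaled, the bound would fail, which is why the extra assumption matters.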

For general nonlinear mechanisms, we establish A6 conditionally:

###### Theorem 3\.9\(A6 Conditional on Log\-Sobolev Inequality\)\.

Suppose $f_{\alpha^{*}}$ is the stationary map of a Markov process on $(\mathbb{R}_{\geq 0}^{3},d)$ satisfying a log-Sobolev inequality (LSI) with constant $\rho>0$: $\mathrm{Ent}_{\mu}(\nu)\leq\frac{1}{2\rho}\mathcal{E}(\sqrt{\nu},\sqrt{\nu})$. Then: (i) Talagrand $T_{2}$ holds: $W_{2}(\nu,\mu)^{2}\leq\frac{2}{\rho}D_{\mathrm{KL}}(\nu\|\mu)$ \[[3](https://arxiv.org/html/2606.07563#bib.bib3)\]. (ii) Linear $W_{1}$-contraction: $W_{1}(f_{*}\nu_{1},f_{*}\nu_{2})\leq e^{-\rho}W_{1}(\nu_{1},\nu_{2})$ \[[39](https://arxiv.org/html/2606.07563#bib.bib39)\]. (iii) A6 holds with $c_{\alpha^{*}}=e^{-\rho}<1$.

###### Proof\.

The LSI⇒\\RightarrowT2T\_\{2\}by Bobkov–Götze\[[3](https://arxiv.org/html/2606.07563#bib.bib3)\]\.T2T\_\{2\}\+ Otto–Villani\[[39](https://arxiv.org/html/2606.07563#bib.bib39)\]give exponentialW2W\_\{2\}contraction along the gradient flow\.W1≤W2W\_\{1\}\\leq W\_\{2\}\(Cauchy–Schwarz for Wasserstein\) gives linearW1W\_\{1\}contraction\. For Dirac inputsW1​\(δφ1,δφ2\)=d​\(φ1,φ2\)W\_\{1\}\(\\delta\_\{\\varphi\_\{1\}\},\\delta\_\{\\varphi\_\{2\}\}\)=d\(\\varphi\_\{1\},\\varphi\_\{2\}\), so \(iii\) follows directly\. ∎

###### Theorem 3\.10\(Metric Contraction: Complete Status\)\.

Under A1–A5 with finite diameterD<∞D<\\infty:

1. (i) $c_{\alpha^{*}}\leq 1$ always (Proposition [3.3](https://arxiv.org/html/2606.07563#S3.Thmtheorem3)).
2. (ii) $c_{\alpha^{*}}<1$ for linear-attribute mechanisms (Proposition [3.8](https://arxiv.org/html/2606.07563#S3.Thmtheorem8)).
3. (iii) $c_{\alpha^{*}}<1$ for monotone-compressive mechanisms (Proposition [3.7](https://arxiv.org/html/2606.07563#S3.Thmtheorem7)).
4. (iv) $c_{\alpha^{*}}=e^{-\rho}<1$ if $\alpha^{*}$ admits an LSI with $\rho>0$ (Theorem [3.9](https://arxiv.org/html/2606.07563#S3.Thmtheorem9)).
5. (v) *(Open, likely false in general.)* $c_{\alpha^{*}}<1$ for all non-monotone, non-LSI mechanisms satisfying A1–A5. A tight counterexample with $c=1$ is given in Remark [2](https://arxiv.org/html/2606.07563#Thmremark2).

Cases \(ii\)–\(iv\) cover all four HEF instantiations\. The claim that A6 follows from A1–A5*alone without further structure*is false in general: the counterexample of Remark[2](https://arxiv.org/html/2606.07563#Thmremark2)shows that finiteDDand A1–A5 are insufficient\.

### 3\.7Weight Function

Domain\-specific realisations ofwαdomainw^\{\\mathrm\{domain\}\}\_\{\\alpha\}:exp⁡\(−Δ​Gα‡/kB​T\)\\exp\(\-\\Delta G^\{\\ddagger\}\_\{\\alpha\}/k\_\{B\}T\)\(EOM\),I​\(R\(k−1\);R\(k\)\)I\(R^\{\(k\-1\)\};R^\{\(k\)\}\)\(IFF\),exp⁡\(−ℒ​\(α\)/ℒ0\)\\exp\(\-\\mathcal\{L\}\(\\alpha\)/\\mathcal\{L\}\_\{0\}\)\(ML\),SNRα\\mathrm\{SNR\}\_\{\\alpha\}\(RSID\)\. The distribution is heavy\-tailed within𝒜∗​\(E\)\\mathcal\{A\}^\{\*\}\(E\), with a small set𝒜\#⊂𝒜∗\\mathcal\{A\}^\{\\\#\}\\subset\\mathcal\{A\}^\{\*\}carrying\(1−ϵ\)\(1\-\\epsilon\)of total weight\.

## 4Physical Feasibility Theorem

###### Assumption 2\(Physical Primitives, A1\)\.

Everyri∈R\(1\)r\_\{i\}\\in R^\{\(1\)\}satisfies𝒫\\mathcal\{P\}\.

###### Assumption 3\(Physical Negation, A2\)\.

Axiom N \(Definition[2\.4](https://arxiv.org/html/2606.07563#S2.Thmdefinition4)\) holds for all levelskk\.

###### Assumption 4\(Interaction Regularity, A3\)\.

Conjunctions are governed by Definition[2\.5](https://arxiv.org/html/2606.07563#S2.Thmdefinition5)\. Furthermore,Eref≥Ei−Δ​Ei​jE\_\{\\mathrm\{ref\}\}\\geq E\_\{i\}\-\\Delta E\_\{ij\}for all primitives and admissible conjunctions \(open\-system boundary condition\)\. The interaction energyΔ​Eφ​ψ\\Delta E\_\{\\varphi\\psi\}is a Lipschitz function of the atomic energies with Lipschitz constantΛE≤1\\Lambda\_\{E\}\\leq 1\.

###### Assumption 5\(Feasibility\-Preserving Generation, A4\)\.

𝒢\\mathcal\{G\}has range restricted to𝒜∗\\mathcal\{A\}^\{\*\}\.

###### Theorem 4\.1\(Physical Feasibility of Emergence\)\.

Letℋ\\mathcal\{H\}satisfy A1–A4\. Then for allk≥1k\\geq 1and allr\(k\)∈R\(k\)r^\{\(k\)\}\\in R^\{\(k\)\}:

r\(k\)⊧𝒫thermoandr\(k\)⊧𝒫info,r^\{\(k\)\}\\models\\mathcal\{P\}\_\{\\mathrm\{thermo\}\}\\quad\\text\{and\}\\quad r^\{\(k\)\}\\models\\mathcal\{P\}\_\{\\mathrm\{info\}\},simultaneously viaΦ\\Phi\.

###### Proof\.

By strong induction onkk\.

Base case\(k=1k=1\): Immediate from A1\.

Inductive hypothesis: Allr\(j\)∈R\(j\)r^\{\(j\)\}\\in R^\{\(j\)\}satisfy𝒫\\mathcal\{P\}for1≤j≤k−11\\leq j\\leq k\-1\.

Inductive step: We showφ⊧𝒫\\varphi\\models\\mathcal\{P\}for allφ∈ℒ​\(R\(k−1\)\)\\varphi\\in\\mathcal\{L\}\(R^\{\(k\-1\)\}\)by structural induction\.

*Atomic*:φ=ri\(k−1\)\\varphi=r^\{\(k\-1\)\}\_\{i\}satisfies𝒫\\mathcal\{P\}by hypothesis\.

*Negation*:ri\(k−1\)⟂r^\{\(k\-1\)\\perp\}\_\{i\}satisfies𝒫\\mathcal\{P\}by A2 \(Axiom N1\)\.

*Conjunctionφ∧ψ\\varphi\\wedge\\psi*: We verify both branches separately, then invokeΦ\\Phi\.

- Thermodynamic branch ($\mathcal{P}_{\mathrm{thermo}}$):
  - *P1*: By A3 (Definition [2.5](https://arxiv.org/html/2606.07563#S2.Thmdefinition5)), $E_{\varphi\wedge\psi}$ satisfies P1 by construction.
  - *P2*: P2 concerns $\Delta S_{\mathrm{total}}=\Delta S_{\mathrm{subsys}}+\Delta S_{\mathrm{env}}$. By A3 (open-system condition), the interaction releases energy $\Delta E_{\varphi\psi}$ to the environment, giving $\Delta S_{\mathrm{env}}=-\Delta E_{\varphi\psi}/T$ (Clausius). Hence $\Delta S_{\mathrm{total}}\geq 0$ iff $\Delta G_{\varphi\psi}=\Delta E_{\varphi\psi}-T\Delta S_{\varphi\wedge\psi}\leq 0$, which holds for admissible conjunctions (A3 selects thermodynamically favourable interactions, $\Delta G\leq 0$; see Callen \[[9](https://arxiv.org/html/2606.07563#bib.bib9)\], §4-1).
  - *P3*: Inherited from the global $T>0$.
- Information-theoretic branch ($\mathcal{P}_{\mathrm{info}}$):
  - *P4*: Subadditivity of Shannon entropy (\[[13](https://arxiv.org/html/2606.07563#bib.bib13)\], Theorem 2.6.3) gives $H(\varphi\wedge\psi)\leq H(\varphi)+H(\psi)$, hence $I(\varphi;\psi)\leq\min(H(\varphi),H(\psi))$.
  - *P5*: Chain rule: $H(\varphi\mid\psi)=H(\varphi,\psi)-H(\psi)\geq 0$, since $H(\varphi,\psi)\geq H(\psi)$ whenever $\varphi,\psi$ are drawn from $\mu$ (Theorem 2.2.1 of \[[13](https://arxiv.org/html/2606.07563#bib.bib13)\]).
  - *P6*: Any causal ordering within $\varphi\wedge\psi$ forms a Markov chain; P6 holds by inductive hypothesis.
- Consistency: By Proposition [S4.1](https://arxiv.org/html/2606.07563#A4.Thmtheorem1), satisfying $\mathcal{P}_{\mathrm{thermo}}$ is equivalent to satisfying $\mathcal{P}_{\mathrm{info}}$ under $\Phi$. Both branches are verified independently, confirming $\varphi\wedge\psi\models\mathcal{P}$.

*Disjunction, implication, causal ordering*: Follow from the inductive hypothesis and P5, P6 by standard arguments\.

Mechanisms: In controlled mode,α∈𝒜0⊆𝒜∗\\alpha\\in\\mathcal\{A\}\_\{0\}\\subseteq\\mathcal\{A\}^\{\*\}by design\. In self\-generating mode, A4 forces𝒢⊆𝒜∗\\mathcal\{G\}\\subseteq\\mathcal\{A\}^\{\*\}; by induction ontt,𝒜t⊆𝒜∗\\mathcal\{A\}\_\{t\}\\subseteq\\mathcal\{A\}^\{\*\}\. Hencefα\(k−1\)f\_\{\\alpha\}^\{\(k\-1\)\}is𝒫\\mathcal\{P\}\-preserving\. Combining withφ⊧𝒫\\varphi\\models\\mathcal\{P\}:r\(k\)=fα\(k−1\)​\(φ\)⊧𝒫r^\{\(k\)\}=f\_\{\\alpha\}^\{\(k\-1\)\}\(\\varphi\)\\models\\mathcal\{P\}\. ∎

## 5Energy Budget and the Diversity\-Convergence Trade\-off

### 5\.1Complete Metric Space Structure

###### Definition 5\.1\(Physical Metric Space and Hausdorff Metric\)\.

Using the physical metric $d$ of Definition [3.1](https://arxiv.org/html/2606.07563#S3.Thmdefinition1)(a), let $\Omega^{(k)}$ denote the space of non-empty compact subsets of $\mathcal{P}$-feasible level-$k$ entities, equipped with the Hausdorff metric

$$d_{H}(R_{1},R_{2})=\max\!\Bigl(\sup_{r\in R_{1}}\inf_{s\in R_{2}}d(r,s),\;\sup_{s\in R_{2}}\inf_{r\in R_{1}}d(r,s)\Bigr).$$
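For finite sets, the suprema and infima in the Hausdorff metric become maxima and minima, so $d_{H}$ can be computed directly. The following is a minimal sketch assuming entities are represented by their $(E,S,H)$ triples, with $k_{B}=1$ and $E_{\mathrm{ref}}=1$ chosen for illustration; the function names are hypothetical.

```python
import math

def physical_metric(x, y, k_b=1.0, e_ref=1.0):
    """Physical metric on (E, S, H) triples; k_b = e_ref = 1 assumed."""
    (e1, s1, h1), (e2, s2, h2) = x, y
    return (abs(e1 - e2) + k_b * abs(s1 - s2)
            + k_b * math.log(2) * abs(h1 - h2)) / e_ref

def hausdorff(r1, r2, d=physical_metric):
    """Hausdorff distance between two non-empty finite sets under metric d."""
    if not r1 or not r2:
        raise ValueError("Hausdorff distance requires non-empty sets")
    # Farthest any point of r1 lies from its nearest neighbour in r2 ...
    forward = max(min(d(r, s) for s in r2) for r in r1)
    # ... and the same with the roles of r1 and r2 exchanged.
    backward = max(min(d(r, s) for r in r1) for s in r2)
    return max(forward, backward)

R1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
R2 = [(0.0, 0.0, 0.0)]
print(hausdorff(R1, R2))
print(hausdorff(R1, R1))
```

Both directed terms are needed: here every point of `R2` lies in `R1`, so the backward term is zero, while the forward term picks up the extra point of `R1`.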

###### Lemma 5\.1\(Completeness ofΩ\(k\)\\Omega^\{\(k\)\}\)\.

\(Ω\(k\),dH\)\(\\Omega^\{\(k\)\},d\_\{H\}\)is a complete metric space\.

###### Proof\.

The attribute space\(ℝ≥03,d\)\(\\mathbb\{R\}\_\{\\geq 0\}^\{3\},d\)is a closed subset of the Banach space\(ℝ3,∥⋅∥1/Eref\)\(\\mathbb\{R\}^\{3\},\\\|\\cdot\\\|\_\{1\}/E\_\{\\mathrm\{ref\}\}\), hence complete\. The space of non\-empty compact subsets of a complete metric space with the Hausdorff metric is complete \(Hausdorff\[[20](https://arxiv.org/html/2606.07563#bib.bib20)\]; Munkres\[[35](https://arxiv.org/html/2606.07563#bib.bib35)\], Theorem 45\.1\)\. SinceR\(1\)R^\{\(1\)\}is finite by assumption, and eachR\(k\)R^\{\(k\)\}is generated from finiteR\(k−1\)R^\{\(k\-1\)\}by a finite mechanism set \(finiteness of𝒜∗​\(E\)\\mathcal\{A\}^\{\*\}\(E\)follows from the finiteness ofℒ​\(R\(k−1\)\)\\mathcal\{L\}\(R^\{\(k\-1\)\}\)and the cost function\), everyR\(k\)R^\{\(k\)\}is finite, hence compact\. ThusΩ\(k\)\\Omega^\{\(k\)\}is a subset of the compact\-subsets Hausdorff space\. Physical feasibility is a closed condition, soΩ\(k\)\\Omega^\{\(k\)\}is closed, and therefore complete\. ∎

### 5\.2P\-Stability of Coupled Formulas

The following lemma formalises the coupling argument in Lemma[5\.3](https://arxiv.org/html/2606.07563#S5.Thmtheorem3)\.

###### Lemma 5\.2\(P\-Stability under Type\-Preserving Atom Replacement\)\.

LetR1,R2∈Ω\(k−1\)R\_\{1\},R\_\{2\}\\in\\Omega^\{\(k\-1\)\}withdH​\(R1,R2\)=ε<Eref/2d\_\{H\}\(R\_\{1\},R\_\{2\}\)=\\varepsilon<E\_\{\\mathrm\{ref\}\}/2and\|R1\|=\|R2\|\|R\_\{1\}\|=\|R\_\{2\}\|\. Letπ:R1→R2\\pi:R\_\{1\}\\to R\_\{2\}be atype\-preserving bijection: a bijection withd​\(r,π​\(r\)\)≤ε\+ηd\(r,\\pi\(r\)\)\\leq\\varepsilon\+\\etafor allr∈R1r\\in R\_\{1\}\(anyη\>0\\eta\>0\), whereπ​\(r\)\\pi\(r\)andrrshare the same physical interaction type under𝒫\\mathcal\{P\}\(same admissible conjunction partners\)\. For any formulaφ=φ​\(ri1,…,rim\)∈ℒ​\(R1\)\\varphi=\\varphi\(r\_\{i\_\{1\}\},\\ldots,r\_\{i\_\{m\}\}\)\\in\\mathcal\{L\}\(R\_\{1\}\)withφ⊧𝒫\\varphi\\models\\mathcal\{P\}, define the coupled formulaφ¯=φ​\(π​\(ri1\),…,π​\(rim\)\)∈ℒ​\(R2\)\\bar\{\\varphi\}=\\varphi\(\\pi\(r\_\{i\_\{1\}\}\),\\ldots,\\pi\(r\_\{i\_\{m\}\}\)\)\\in\\mathcal\{L\}\(R\_\{2\}\)\. Thenφ¯⊧𝒫\\bar\{\\varphi\}\\models\\mathcal\{P\}\.

###### Proof\.

By structural induction onφ\\varphi\.

*Atomic*:π​\(rij\)∈R2\\pi\(r\_\{i\_\{j\}\}\)\\in R\_\{2\}satisfies𝒫\\mathcal\{P\}by A1 applied toR2R\_\{2\}\.

*Physical negation*:π​\(rij\)⟂\\pi\(r\_\{i\_\{j\}\}\)^\{\\perp\}satisfies𝒫\\mathcal\{P\}by A2 \(Axiom N1 applied toR2R\_\{2\}\)\.

*Admissible conjunctionφ∧ψ\\varphi\\wedge\\psi*: By inductive hypothesis,φ¯⊧𝒫\\bar\{\\varphi\}\\models\\mathcal\{P\}andψ¯⊧𝒫\\bar\{\\psi\}\\models\\mathcal\{P\}\. We must verify thatφ¯∧ψ¯\\bar\{\\varphi\}\\wedge\\bar\{\\psi\}is admissible, i\.e\. thatΔ​Eφ¯​ψ¯\\Delta E\_\{\\bar\{\\varphi\}\\bar\{\\psi\}\}satisfies P1\.

By A3 \(Lipschitz condition onΔ​E\\Delta E\), the interaction energy changes by at most:

$$|\Delta E_{\bar{\varphi}\bar{\psi}}-\Delta E_{\varphi\psi}|\leq\Lambda_{E}\cdot\bigl(d_{\mathcal{L}}(\varphi,\bar{\varphi})+d_{\mathcal{L}}(\psi,\bar{\psi})\bigr)\leq 2\Lambda_{E}\cdot(\varepsilon+\eta).$$

Since $\Delta E_{\varphi\psi}$ satisfies P1 (by hypothesis) and the perturbation satisfies $2\Lambda_{E}(\varepsilon+\eta)\leq 2(\varepsilon+\eta)<2\cdot E_{\mathrm{ref}}/2=E_{\mathrm{ref}}$ (using $\Lambda_{E}\leq 1$, $\varepsilon<E_{\mathrm{ref}}/2$, and $\eta$ chosen small enough that $\varepsilon+\eta<E_{\mathrm{ref}}/2$), the perturbed energy $\Delta E_{\bar{\varphi}\bar{\psi}}$ also satisfies P1 by the open-system boundary condition $E_{\mathrm{ref}}\geq E_{i}-\Delta E_{ij}$ in A3: perturbations bounded by $E_{\mathrm{ref}}$ preserve this inequality, since $E_{\mathrm{ref}}\geq E_{i}-\Delta E_{ij}$ implies $E_{\mathrm{ref}}\geq E_{i}-(\Delta E_{ij}+E_{\mathrm{ref}})\Leftrightarrow 0\geq E_{i}-2E_{\mathrm{ref}}$, which holds for all $\mathcal{P}$-feasible primitives with $E_{i}\leq 2E_{\mathrm{ref}}$. Hence $\bar{\varphi}\wedge\bar{\psi}\models\mathcal{P}$.

*Disjunction, implication, causal ordering*: Follow analogously from the inductive hypothesis and the Lipschitz stability of the information constraints underdℒd\_\{\\mathcal\{L\}\}\-bounded perturbations\. ∎

### 5\.3Metric Contraction Lemma

###### Lemma 5\.3\(Metric Contraction ofTkT\_\{k\}\)\.

Under A1–A6, for $E<E_{c}$, the generator map

$$T_{k}:\Omega^{(k-1)}\to\Omega^{(k)},\quad T_{k}(R)=\bigl\{f^{(k)}_{\alpha^{*}}(\varphi):\varphi\in\mathcal{L}(R),\,\varphi\models\mathcal{P}\bigr\},$$

where $\alpha^{*}=\arg\min_{\alpha\in\mathcal{A}^{*}}\mathrm{cost}(\alpha)$, is a *strict contraction* in $(\Omega^{(k)},d_{H})$ with constant $c_{\alpha^{*}}\in(0,1)$.

###### Proof\.

FixR1,R2∈Ω\(k−1\)R\_\{1\},R\_\{2\}\\in\\Omega^\{\(k\-1\)\}withdH​\(R1,R2\)=ε\>0d\_\{H\}\(R\_\{1\},R\_\{2\}\)=\\varepsilon\>0and anyη\>0\\eta\>0\.

Step 1 \(Coupling and P\-validity\)\.By definition ofdHd\_\{H\}, there exists a couplingπ:R1→R2\\pi:R\_\{1\}\\to R\_\{2\}withd​\(r,π​\(r\)\)≤ε\+ηd\(r,\\pi\(r\)\)\\leq\\varepsilon\+\\etafor allr∈R1r\\in R\_\{1\}\. For anyφ=φ​\(ri1,…,rim\)∈ℒ​\(R1\)\\varphi=\\varphi\(r\_\{i\_\{1\}\},\\ldots,r\_\{i\_\{m\}\}\)\\in\\mathcal\{L\}\(R\_\{1\}\)withφ⊧𝒫\\varphi\\models\\mathcal\{P\}, defineφ¯=φ​\(π​\(ri1\),…,π​\(rim\)\)∈ℒ​\(R2\)\\bar\{\\varphi\}=\\varphi\(\\pi\(r\_\{i\_\{1\}\}\),\\ldots,\\pi\(r\_\{i\_\{m\}\}\)\)\\in\\mathcal\{L\}\(R\_\{2\}\)\. By Lemma[S7\.7](https://arxiv.org/html/2606.07563#A7.Thmtheorem7),φ¯⊧𝒫\\bar\{\\varphi\}\\models\\mathcal\{P\}\.

By Definition[3\.1](https://arxiv.org/html/2606.07563#S3.Thmdefinition1)\(b\) with the couplingσ=id\\sigma=\\mathrm\{id\}\(atoms matched by construction\):

$$d_{\mathcal{L}}(\varphi,\bar{\varphi})=\max_{1\leq\ell\leq m}d\bigl(r_{i_{\ell}},\pi(r_{i_{\ell}})\bigr)\leq\varepsilon+\eta.\tag{1}$$

Step 2 (Apply A6). For paired formulas $(\varphi,\bar{\varphi})\in\mathcal{L}(R_{1})\times\mathcal{L}(R_{2})$:

$$d\bigl(f^{(k)}_{\alpha^{*}}(\varphi),\,f^{(k)}_{\alpha^{*}}(\bar{\varphi})\bigr)\overset{\text{A6}}{\leq}c_{\alpha^{*}}\cdot d_{\mathcal{L}}(\varphi,\bar{\varphi})\overset{(1)}{\leq}c_{\alpha^{*}}\cdot(\varepsilon+\eta).$$

Step 3 (Hausdorff bound). For any $s_{1}=f^{(k)}_{\alpha^{*}}(\varphi)\in T_{k}(R_{1})$, the coupled element $s_{2}=f^{(k)}_{\alpha^{*}}(\bar{\varphi})\in T_{k}(R_{2})$ satisfies $d(s_{1},s_{2})\leq c_{\alpha^{*}}(\varepsilon+\eta)$. Taking the infimum over $T_{k}(R_{2})$ and then the supremum over $T_{k}(R_{1})$, and symmetrically:

$$d_{H}(T_{k}(R_{1}),T_{k}(R_{2}))\leq c_{\alpha^{*}}\cdot(\varepsilon+\eta).$$

Since $\eta>0$ is arbitrary: $d_{H}(T_{k}(R_{1}),T_{k}(R_{2}))\leq c_{\alpha^{*}}\cdot d_{H}(R_{1},R_{2})$. Since $c_{\alpha^{*}}<1$ (A6), $T_{k}$ is a strict contraction. ∎

### 5\.4Energy\-Diversity Trade\-off Theorem

###### Theorem 5\.4\(Energy\-Diversity Trade\-off\)\.

Letℋ​\(E\)\\mathcal\{H\}\(E\)be a HEF with finiteR\(1\)R^\{\(1\)\}and energy budgetEE\. Then:

1. (i) $|R^{(k)}(E)|$ is monotonically non-decreasing in $E$ for all $k\geq 1$.
2. (ii) There exists $E_{c}>0$ such that the rate of new mechanisms admitted per unit budget is maximised at $E_{c}$.
3. (iii) Under A1–A6, for $E<E_{c}$, $\mathcal{H}$ converges to a unique fixed-point set $R^{(k)}_{\infty}\in\Omega^{(k)}$, independent of initial conditions.

###### Proof\.

\(i\)𝒜∗​\(E\)=\{α∈𝒜∗:cost​\(α\)≤E\}\\mathcal\{A\}^\{\*\}\(E\)=\\\{\\alpha\\in\\mathcal\{A\}^\{\*\}:\\mathrm\{cost\}\(\\alpha\)\\leq E\\\}is non\-decreasing inEEby definition\.

(ii) Since $R^{(1)}$ is finite and mechanisms act on the finite logical language $\mathcal{L}(R^{(k-1)})$, the set $\mathcal{A}^{*}$ is finite. Enumerate the distinct cost values as $0\leq c_{1}<c_{2}<\cdots<c_{N}<\infty$. The function $E\mapsto|\mathcal{A}^{*}(E)|$ is a non-decreasing staircase with jumps at $E=c_{j}$. Let $\Delta_{j}=|\mathcal{A}^{*}(c_{j})|-|\mathcal{A}^{*}(c_{j-1})|$ be the number of new mechanisms admitted at $c_{j}$. Define

$$E_{c}=c_{j^{*}},\quad j^{*}=\arg\max_{1\leq j\leq N}\frac{\Delta_{j}}{c_{j}-c_{j-1}},\tag{2}$$

i.e. $E_{c}$ is the cost level with the maximum rate of new mechanisms admitted per unit budget. This is the discrete analogue of the inflection point: the derivative $d|\mathcal{A}^{*}(E)|/dE$ (in the distributional sense) is maximised at $E_{c}$.

(iii) By Lemma [S3.1](https://arxiv.org/html/2606.07563#A3.Thmtheorem1), $(\Omega^{(k)}, d_H)$ is complete. By Lemma [5.3](https://arxiv.org/html/2606.07563#S5.Thmtheorem3) (under A1–A6 and $E < E_c$), $T_k$ is a strict contraction with constant $c_{\alpha^*} < 1$. By the Banach Fixed-Point Theorem (\[[2](https://arxiv.org/html/2606.07563#bib.bib2)\]; Kreyszig \[[28](https://arxiv.org/html/2606.07563#bib.bib28)\], Theorem 5.1-2), there exists a unique $R^{(k)}_{\infty} \in \Omega^{(k)}$ with $T_k(R^{(k)}_{\infty}) = R^{(k)}_{\infty}$, and for any $R_0 \in \Omega^{(k)}$:

$$d_H\bigl(T_k^n(R_0), R^{(k)}_{\infty}\bigr) \leq \frac{c_{\alpha^*}^n}{1 - c_{\alpha^*}} \cdot d_H\bigl(T_k(R_0), R_0\bigr) \to 0.$$

Independence from initial conditions is the uniqueness clause of Banach. ∎
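The a-priori bound in part (iii) can be checked numerically on a toy instance. The sketch below is illustrative only: the affine map `f`, the constant `C = 0.5` (standing in for $c_{\alpha^*}$), and the finite sets of reals are assumptions chosen for the example, not objects defined in the framework. It iterates the image map on finite sets and compares the Hausdorff distance to the fixed set against the Banach bound.

```python
def hausdorff(A, B):
    """Hausdorff distance between two finite, non-empty sets of reals."""
    a_to_b = max(min(abs(a - b) for b in B) for a in A)
    b_to_a = max(min(abs(a - b) for a in A) for b in B)
    return max(a_to_b, b_to_a)

C = 0.5  # hypothetical contraction constant, stands in for c_{alpha*} < 1

def f(x):
    """Toy mechanism: affine map with Lipschitz constant C and fixed point 2."""
    return C * x + 1.0

def T(R):
    """Toy generator map: the image of the set R under f."""
    return frozenset(f(x) for x in R)

def iterate_with_bound(R0, n_steps):
    """Return a list of (n, d_H(T^n(R0), R_inf), a-priori Banach bound)."""
    R_inf = frozenset({2.0})          # unique fixed set of T
    d0 = hausdorff(T(R0), R0)         # d_H(T(R0), R0), the first step size
    R = R0
    rows = []
    for n in range(1, n_steps + 1):
        R = T(R)
        rows.append((n, hausdorff(R, R_inf), C**n / (1.0 - C) * d0))
    return rows

# Two different initial sets are driven to the same fixed set.
rows_a = iterate_with_bound(frozenset({-3.0, 0.0, 10.0}), 20)
rows_b = iterate_with_bound(frozenset({100.0}), 20)
```

In this toy case the bound is attained with equality, because an affine map contracts every distance by exactly `C`; a general contraction would only satisfy the inequality.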

### 5.5 Universal Feature Convergence

###### Corollary 5.5 (Universal Feature Convergence).

Let $\mathcal{H}_1(E)$ and $\mathcal{H}_2(E)$ share $\mathcal{P}$ and satisfy A1–A6 and $E < E_c$, but differ in $R^{(1)}$, $\mathcal{A}_0$, $\mathcal{G}$. Then $R^{(k)}_{\infty,1} \cong R^{(k)}_{\infty,2}$ for all $k \geq 1$.

###### Proof.

Step 1. By A5, $\mathrm{cost}(\alpha)$ depends only on $\alpha$ and $\mathcal{P}$. By Proposition [S4.1](https://arxiv.org/html/2606.07563#A4.Thmtheorem1), $\mathcal{A}^*$ is determined by $\mathcal{P}$. Hence $\alpha^* = \arg\min_{\alpha \in \mathcal{A}^*} \mathrm{cost}(\alpha)$ is the same for $\mathcal{H}_1$ and $\mathcal{H}_2$.

Step 2. Both instances use $\alpha^*$ for $E < E_c$, so their generator maps coincide: $T_{k,1} = T_{k,2} =: T_k$.

Step 3. By Theorem [S8.1](https://arxiv.org/html/2606.07563#A8.Thmtheorem1)(iii), $T_k$ has a unique fixed point $R^{(k)}_{\infty}$. Both instances converge to it. ∎

### 5.6 Three Characterisations of $E_c$

For finite $\mathcal{A}^*$ with costs $c_1 < \cdots < c_N$:

1. *Rate-based (discrete):* $E_c = c_{j^*}$ where $j^* = \arg\max_j \Delta_j/(c_j - c_{j-1})$.
2. *Distributional:* $E_c \approx \mathrm{cost}(\alpha^*_{\mathrm{median}})$ (median cost of $\mathcal{A}^*$).
3. *Information-theoretic:* $E_c = \arg\max_E |dN_{\mathrm{eff}}(E)/dE|$ (maximum sensitivity of the effective mechanism count $N_{\mathrm{eff}}(E) = \exp(-\sum_{\alpha} \hat{w}_{\alpha} \log \hat{w}_{\alpha})$).
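The first two characterisations are simple to compute for a finite cost list. The following is a minimal sketch under stated assumptions: the cost values are invented for illustration, and the convention $c_0 = 0$ (needed to evaluate the rate at $j = 1$, and requiring all costs to be strictly positive) is our assumption rather than something fixed by the text.

```python
from collections import Counter
from statistics import median

def rate_based_Ec(costs, c0=0.0):
    """Rate-based E_c: the cost level c_j maximising Delta_j / (c_j - c_{j-1}).

    Assumption: c_0 = c0 precedes the smallest cost and every cost exceeds c0.
    """
    counts = Counter(costs)            # Delta_j = mechanisms sharing cost c_j
    best_level, best_rate, prev = None, float("-inf"), c0
    for level in sorted(counts):
        rate = counts[level] / (level - prev)
        if rate > best_rate:
            best_level, best_rate = level, rate
        prev = level
    return best_level

def median_cost_Ec(costs):
    """Distributional proxy: the median mechanism cost."""
    return median(costs)

# Hypothetical mechanism costs; three mechanisms share the cost 2.5.
costs = [1.0, 2.0, 2.5, 2.5, 2.5, 4.0, 7.0]
ec_rate = rate_based_Ec(costs)
ec_median = median_cost_Ec(costs)
```

For this particular list both estimates coincide at 2.5; in general the two characterisations need not agree exactly, which is why the distributional form is stated with $\approx$.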

## 6 Causal Emergence at the HEF Fixed Point

We now connect HEF's convergence results to causal emergence theory \[[21](https://arxiv.org/html/2606.07563#bib.bib21)\], showing that, under a non-degeneracy assumption, the fixed point $R_{\infty}$ has strictly higher causal power than the micro-level $R^{(1)}$. This closes the gap between HEF's convergence guarantee and the stronger claim that emergence in HEF is *causally irreducible*, not merely a change in description.

### 6.1 Why Convergence Alone Does Not Establish Causal Emergence

HEF Corollary [S8.2](https://arxiv.org/html/2606.07563#A8.Thmtheorem2) guarantees that $R_{\infty}$ is unique and universally attracting. This alone does *not* imply that $R_{\infty}$ is causally more potent than $R^{(1)}$. A trivially compressive mechanism ($\alpha^*$ maps every input to one constant) also converges to a unique fixed point yet has zero causal power. The distinction requires measuring *Effective Information (EI)* \[[21](https://arxiv.org/html/2606.07563#bib.bib21)\].

###### Definition 6.1 (Effective Information at level $k$).

$$\mathrm{EI}_k = H_{\mu}\bigl(T_k(R^{(k)})\bigr) - H_{\mu}\bigl(T_k(R^{(k)}) \mid R^{(k)}\bigr), \tag{3}$$

where $\mu$ is the maximum-entropy distribution over $\Omega^{(k)}$. The first term measures output diversity under uniform intervention; the second is the *causal noise*: uncertainty in the output that is not resolvable by knowing the input.

We distinguish two regimes:

- *Exploration regime* ($E > E_c$): $|\mathcal{A}^*(E)| \geq 2$. Multiple mechanisms compete; their stochastic selection makes $T_k$ effectively random. Causal noise $H_{\mu}(T_k \mid R^{(k)}) > 0$.
- *Convergence regime* ($E < E_c$): $\mathcal{A}^*(E) = \{\alpha^*\}$. $T_k = f^{(k)}_{\alpha^*}$ is deterministic. Causal noise $= 0$.
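The contrast between the two regimes can be made concrete on a small state space. The sketch below assumes a finite state space, represents $T_k$ as a row-stochastic transition matrix, takes $\mu$ to be uniform, and measures entropy in bits; the two example matrices (a permutation, and an equal mixture of two permutations) are hypothetical stand-ins for single-mechanism and competing-mechanism dynamics.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector (0 log 0 := 0)."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def effective_information(P):
    """Return (EI, causal_noise) for a row-stochastic matrix P.

    Inputs are intervened on uniformly (the maximum-entropy mu), so
    EI = H(average output distribution) - mean row entropy.
    """
    n, m = len(P), len(P[0])
    output = [sum(P[i][j] for i in range(n)) / n for j in range(m)]
    noise = sum(entropy_bits(row) for row in P) / n
    return entropy_bits(output) - noise, noise

# Convergence-regime stand-in: one deterministic mechanism (a cyclic shift).
shift = [[0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1],
         [1, 0, 0, 0]]

# Exploration-regime stand-in: identity and shift selected with probability 1/2.
mixed = [[0.5, 0.5, 0.0, 0.0],
         [0.0, 0.5, 0.5, 0.0],
         [0.0, 0.0, 0.5, 0.5],
         [0.5, 0.0, 0.0, 0.5]]

ei_det, noise_det = effective_information(shift)
ei_mix, noise_mix = effective_information(mixed)
```

Here the deterministic map has zero causal noise and the full 2 bits of EI, while stochastic selection between two mechanisms introduces 1 bit of noise and halves the EI.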

###### Definition 6.2 (Non-Degeneracy Assumption (NDA)).

The minimum-cost mechanism $\alpha^*$ is *non-degenerate* at level $k$ if

$$H_{\mu}\bigl(f^{(k)}_{\alpha^*}(R^{(k)})\bigr) \geq I_{\mu}\bigl(R^{(k)};\, T_k^{\mathrm{pre}}(R^{(k)})\bigr), \tag{4}$$

where $T_k^{\mathrm{pre}}$ is the stochastic generator under $E > E_c$. NDA requires that the deterministic mechanism $\alpha^*$ produces output entropy at least as large as the noiseless mutual information achievable by the full multi-mechanism dynamics.

### 6.2 The Theorem

###### Theorem 6.1 (Causal Emergence at the HEF Fixed Point).

Let $\mathcal{H}$ satisfy A1–A6 with $E < E_c$. Then:

1. (i) *Causal noise eliminated.* $H_{\mu}\bigl(T_k(R^{(k)}) \mid R^{(k)}\bigr) = 0.$
2. (ii) *Causal emergence under NDA.* If $\alpha^*$ satisfies the NDA (Definition [6.2](https://arxiv.org/html/2606.07563#S6.Thmdefinition2)), then $\mathrm{EI}_{k^*} > \mathrm{EI}_1.$
3. (iii) *Quantitative bound.*
   $$\mathrm{EI}_{k^*} - \mathrm{EI}_1 \geq H_{\mu}\bigl(T_1(R^{(1)}) \mid R^{(1)}\bigr) - \bigl[H_{\mu}(T_1^{\mathrm{pre}}) - H_{\mu}(T_{k^*}^{\mathrm{post}})\bigr] \geq 0. \tag{5}$$
   Under NDA, the right-hand side is strictly positive.
4. (iv) *Degeneracy reduction.* Causal degeneracy $D_k = H_{\mu}(R^{(k)} \mid T_k(R^{(k)}))$ satisfies
   $$D_{k^*} \leq D_1 - \log \frac{|\Omega^{(1)}|}{|\Omega^{(k^*)}|}. \tag{6}$$

###### Proof.

(i) For $E < E_c$, $\mathcal{A}^*(E) = \{\alpha^*\}$, so $T_k = f^{(k)}_{\alpha^*}$ is deterministic. For a deterministic map, $H(T_k(R^{(k)}) \mid R^{(k)}) = \mathbb{E}_{\mu}[H(\delta_{f_{\alpha^*}(r)})] = 0$.

(ii) Expanding via ([3](https://arxiv.org/html/2606.07563#S6.E3)):

$$\mathrm{EI}_{k^*} - \mathrm{EI}_1 = \underbrace{H_{\mu}(T_{k^*}^{\mathrm{post}}) - H_{\mu}(T_1^{\mathrm{pre}})}_{\text{(A)}} + \underbrace{H_{\mu}(T_1^{\mathrm{pre}} \mid R^{(1)})}_{\text{(B)} > 0}. \tag{7}$$

Term (B) is strictly positive because $|\mathcal{A}^*(E)| \geq 2$ at level 1 implies stochastic selection among mechanisms. By NDA ([4](https://arxiv.org/html/2606.07563#S6.E4)), $H_{\mu}(T_{k^*}^{\mathrm{post}}) \geq I_{\mu}(R^{(1)}; T_1^{\mathrm{pre}}) = H_{\mu}(T_1^{\mathrm{pre}}) - H_{\mu}(T_1^{\mathrm{pre}} \mid R^{(1)})$, so (A) $\geq -$(B), giving $\mathrm{EI}_{k^*} - \mathrm{EI}_1 \geq 0$. Strict inequality follows from (B) $> 0$.

(iii) Direct from decomposition ([7](https://arxiv.org/html/2606.07563#S6.E7)) and NDA.

(iv) For $E < E_c$, $T_{k^*}$ is deterministic but not necessarily injective (the Banach contraction maps many inputs toward the same fixed point). We bound $D_{k^*}$ via the data processing inequality (DPI) without assuming $D_{k^*} = 0$.

The Markov chain $R^{(1)} \to R^{(k^*)} \to T_{k^*}(R^{(k^*)})$ gives by DPI:

$$H\bigl(R^{(1)} \mid T_{k^*}(R^{(k^*)})\bigr) \geq H\bigl(R^{(1)} \mid R^{(k^*)}\bigr) \geq \log \frac{|\Omega^{(1)}|}{|\Omega^{(k^*)}|},$$

since the coarse-graining $R^{(1)} \to R^{(k^*)}$ contracts the state space. Decomposing by the chain rule: $H(R^{(1)} \mid T_{k^*}(R^{(k^*)})) = H(R^{(1)} \mid R^{(k^*)}) + D_{k^*}$. Applying DPI to $R^{(1)} \to T_1(R^{(1)})$ and $R^{(1)} \to R^{(k^*)} \to T_{k^*}(R^{(k^*)})$: $D_1 \geq H(R^{(1)} \mid T_{k^*}(R^{(k^*)}))$. Combining: $D_1 \geq H(R^{(1)} \mid R^{(k^*)}) + D_{k^*} \geq \log(|\Omega^{(1)}|/|\Omega^{(k^*)}|) + D_{k^*}$, yielding ([6](https://arxiv.org/html/2606.07563#S6.E6)). ∎

###### Corollary 6.2 (Empirical Estimator of EI Gain).

The EI gain is bounded below by the causal noise of the pre-convergence dynamics, which is estimable from training-curve variance:

$$\mathrm{EI}_{k^*} - \mathrm{EI}_1 \geq \underbrace{H_{\mu}(T_1^{\mathrm{pre}} \mid R^{(1)})}_{\text{mechanism competition entropy}} - \bigl[H_{\mu}(T_1^{\mathrm{pre}}) - H_{\mu}(T_{k^*}^{\mathrm{post}})\bigr]. \tag{8}$$

In gradient-based learning the mechanism competition entropy is estimated from gradient-direction variance during the memorisation phase (steps $t < \Delta t$), which is directly measurable.

## 7 Mechanism Landscape Theory: What Determines Emergence

Corollary [S8.2](https://arxiv.org/html/2606.07563#A8.Thmtheorem2) establishes *that* convergence to $R_{\infty}$ occurs when $E < E_c$, and identifies $\mathcal{P}$ as the determinant of $R_{\infty}$'s type. This section deepens the analysis: we ask *what determines the full character of emergence*, namely its form, its existence conditions, its universality class, and its causal potency. The answers depend on the *Mechanism Landscape*, a structure that $\mathcal{P}$ induces on $\mathcal{A}^*$.

###### Definition 7.1 (Mechanism Landscape).

The *mechanism landscape* of a HEF $\mathcal{H}$ is the pseudometric space

$$\mathcal{M} = \bigl(\mathcal{A}^*,\; \mathrm{cost}(\cdot)\bigr),$$

where $\mathcal{A}^*$ is equipped with the pseudometric $\rho(\alpha_1, \alpha_2) = |\mathrm{cost}(\alpha_1) - \mathrm{cost}(\alpha_2)|$ (distinct mechanisms of equal cost are at distance zero). The *local landscape near $\alpha^*$* is the restriction $\mathcal{M}_{\varepsilon} = \{\alpha \in \mathcal{A}^* : \mathrm{cost}(\alpha) \leq \mathrm{cost}(\alpha^*) + \varepsilon\}$ for small $\varepsilon > 0$.

###### Definition 7.2 (Mechanism Competition Entropy).

The *mechanism competition entropy* at energy $E$ is

$$H_{\mathrm{mech}}(E) = -\sum_{\alpha \in \mathcal{A}^*(E)} w_{\alpha}(E) \log w_{\alpha}(E), \quad w_{\alpha}(E) = \frac{e^{-\mathrm{cost}(\alpha)/E}}{\sum_{\beta \in \mathcal{A}^*(E)} e^{-\mathrm{cost}(\beta)/E}}. \tag{9}$$

$H_{\mathrm{mech}}(E)$ measures the diversity of mechanism competition at energy level $E$.
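Definition 7.2 translates directly into a few lines of code. The sketch below assumes a finite list of mechanism costs (the values are invented for illustration), uses the admissible set $\mathcal{A}^*(E) = \{\alpha : \mathrm{cost}(\alpha) \leq E\}$, and returns zero entropy when no mechanism is admissible, which is a convention of the sketch rather than of the definition.

```python
import math

def gibbs_weights(costs, E):
    """Gibbs weights w_alpha(E) over mechanisms with cost <= E."""
    admissible = [c for c in costs if c <= E]
    if not admissible:
        return []
    c_min = min(admissible)  # shift by the minimum for numerical stability
    unnormalised = [math.exp(-(c - c_min) / E) for c in admissible]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

def h_mech(costs, E):
    """Mechanism competition entropy H_mech(E) in nats."""
    return -sum(w * math.log(w) for w in gibbs_weights(costs, E) if w > 0)

# Hypothetical costs: two cheap mechanisms of equal cost and one expensive one.
costs = [1.0, 1.0, 5.0]
h_low = h_mech(costs, 0.5)    # nothing admissible
h_mid = h_mech(costs, 2.0)    # two equal-cost mechanisms compete
h_high = h_mech(costs, 10.0)  # all three compete, unequally weighted
```

With two equal-cost mechanisms the entropy is exactly $\log 2$; admitting a third, costlier mechanism raises it, but to less than $\log 3$ because the Gibbs weights are unequal.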

### 7.1 Proposition A: Domain Determines Form, $\mathcal{P}$ Determines Type

###### Proposition 7.1 (Domain–$\mathcal{P}$ Separation).

Let $\mathcal{H}_1 = (D_1, \mathcal{P})$ and $\mathcal{H}_2 = (D_2, \mathcal{P})$ share the same physical constraint set $\mathcal{P}$ and satisfy A1–A6 with $E < E_c$, but differ in domain $D_i = (R^{(1)}_i, \mathcal{L}_i, \mathcal{A}_{0,i})$. Then:

1. (i) *Type universality.* Both instances have the same minimum-cost mechanism: $\alpha^* = \arg\min_{\alpha \in \mathcal{A}^*} \mathrm{cost}(\alpha)$.
2. (ii) *Form diversity.* The fixed points $R^{(k)}_{\infty,1}$ and $R^{(k)}_{\infty,2}$ may differ as sets, but are isomorphic as images under $f_{\alpha^*}$:
   $$R^{(k)}_{\infty,i} = \bigl\{f_{\alpha^*}(\varphi) \,:\, \varphi \in \mathcal{L}_i(R^{(k-1)}_{\infty,i}),\; \varphi \models \mathcal{P}\bigr\}.$$
3. (iii) *Structural decomposition.* The emergence $R^{(k)}_{\infty}$ decomposes as
   $$\underbrace{\alpha^*}_{\text{TYPE (from } \mathcal{P}\text{)}} \;\circ\; \underbrace{\mathcal{L}(R^{(k-1)}_{\infty})}_{\text{FORM (from domain)}}.$$

###### Proof.

(i) By A5, $\mathrm{cost}(\alpha)$ depends only on $\alpha$ and $\mathcal{P}$. Hence $\arg\min \mathrm{cost}(\alpha)$ is the same for both instances. (ii) Both instances use $f_{\alpha^*}$ for $E < E_c$, but their logical languages $\mathcal{L}_i$ differ. The fixed-point self-consistency equation $R_{\infty} = \{f_{\alpha^*}(\varphi) : \varphi \in \mathcal{L}(R_{\infty})\}$ has the same $f_{\alpha^*}$ but different $\mathcal{L}$, yielding different $R_{\infty}$ as sets. (iii) Immediate from (i) and (ii). ∎

### 7.2 Proposition B: Mechanism Landscape Determines Universality Class

###### Definition 7.3 (Local Landscape Isomorphism).

Two mechanism landscapes $\mathcal{M}_1$ and $\mathcal{M}_2$ are *locally isomorphic near $\alpha^*$* (written $\mathcal{M}_1 \cong_{\varepsilon} \mathcal{M}_2$) if there exists a bijection $h : (\mathcal{M}_1)_{\varepsilon} \to (\mathcal{M}_2)_{\varepsilon}$ such that $\mathrm{cost}_1(\alpha) = \mathrm{cost}_2(h(\alpha))$ for all $\alpha \in (\mathcal{M}_1)_{\varepsilon}$.

###### Definition 7.4 (HEF Universality Class).

Two HEF instances belong to the *same universality class* if their convergence trajectories are isomorphic as discrete dynamical systems: there exists a bijection $\Psi : \Omega^{(k)}_1 \to \Omega^{(k)}_2$ such that $\Psi \circ T_{k,1} = T_{k,2} \circ \Psi$ and $c_{\alpha^*_1} = c_{\alpha^*_2}$.

###### Proposition 7.2 (Landscape Isomorphism $\Rightarrow$ Same Universality Class).

If $\mathcal{M}_1 \cong_{\varepsilon} \mathcal{M}_2$, then $\mathcal{H}_1$ and $\mathcal{H}_2$ belong to the same HEF universality class.

###### Proof.

$\mathcal{M}_1 \cong_{\varepsilon} \mathcal{M}_2$ implies $\mathrm{cost}_1(\alpha^*_1) = \mathrm{cost}_2(\alpha^*_2)$ and $\mathrm{Lip}(f_{\alpha^*_1}) = \mathrm{Lip}(f_{\alpha^*_2})$ (since Lipschitz constants are determined by cost structure under A6). Hence $c_{\alpha^*_1} = c_{\alpha^*_2}$. The conjugacy $\Psi$ is constructed by transporting the Banach iteration via the landscape isomorphism $h$. ∎

### 7.3 Proposition C: Mechanism Competition Entropy Bounds Causal Potency

###### Proposition 7.3 (Mechanism Competition Entropy Bounds EI Gain).

Under A1–A6 and NDA, the causal emergence gain satisfies

$$\mathrm{EI}_{k^*} - \mathrm{EI}_1 \geq H_{\mathrm{mech}}(E_c) - \bigl[H_{\mu}(T_1^{\mathrm{pre}}) - H_{\mu}(T_{k^*}^{\mathrm{post}})\bigr], \tag{10}$$

where $H_{\mathrm{mech}}(E_c)$ is the mechanism competition entropy ([9](https://arxiv.org/html/2606.07563#S7.E9)) evaluated at the critical threshold.

###### Proof.

From Theorem [6.1](https://arxiv.org/html/2606.07563#S6.Thmtheorem1)(iii): $\mathrm{EI}_{k^*} - \mathrm{EI}_1 \geq H_{\mu}(T_1^{\mathrm{pre}} \mid R^{(1)}) - [H_{\mu}(T_1^{\mathrm{pre}}) - H_{\mu}(T_{k^*}^{\mathrm{post}})]$. We identify the causal noise term:

$$\begin{aligned}
H_{\mu}(T_1^{\mathrm{pre}} \mid R^{(1)}) &= H(\text{output} \mid \text{input under stochastic mech. selection}) \\
&= \mathbb{E}_{R^{(1)} \sim \mu}\Bigl[-\sum_{\alpha \in \mathcal{A}^*(E_c)} w_{\alpha}(E_c) \log w_{\alpha}(E_c)\Bigr] = H_{\mathrm{mech}}(E_c),
\end{aligned}$$

where the second equality uses the fact that at $E = E_c$, mechanism selection probabilities equal the Gibbs weights $w_{\alpha}(E_c)$ (Definition [7.2](https://arxiv.org/html/2606.07563#S7.Thmdefinition2)), and these are independent of $R^{(1)}$ by A5. Substituting yields ([10](https://arxiv.org/html/2606.07563#S7.E10)). ∎

###### Corollary 7.4 (Richer Competition $\Rightarrow$ Stronger Emergence).

Among HEF instances sharing $\mathcal{P}$ and $E_c$, those with higher mechanism competition entropy $H_{\mathrm{mech}}(E_c)$ have higher minimum causal emergence:

$$H_{\mathrm{mech}}^{(1)}(E_c) > H_{\mathrm{mech}}^{(2)}(E_c) \;\Longrightarrow\; \inf\bigl(\mathrm{EI}_{k^*}^{(1)} - \mathrm{EI}_1^{(1)}\bigr) > \inf\bigl(\mathrm{EI}_{k^*}^{(2)} - \mathrm{EI}_1^{(2)}\bigr).$$

### 7.4 An Emergence Classification Scheme

Propositions A–C suggest a *classification of emergences* by four observable coordinates of the mechanism landscape, analogous to the classification of universality classes in statistical mechanics.

###### Definition 7.5 (Emergence Signature).

The *emergence signature* of a HEF instance is the 4-tuple

$$\Sigma(\mathcal{H}) = (\tau,\; m,\; \omega,\; d),$$

where:

- $\tau \in \{\mathrm{smooth}, \mathrm{cusp}, \mathrm{flat}, \mathrm{hierarchical}\}$ is the *landscape topology* near $\alpha^*$;
- $m = |\mathcal{A}^*(E_c)|$ is the *mechanism multiplicity* (number of competing mechanisms at the critical point);
- $\omega = \Delta C / C_{\mathrm{gen}} = (C_{\mathrm{mem}} - C_{\mathrm{gen}}) / C_{\mathrm{gen}}$ is the *window ratio* (robustness of emergence);
- $d$ is the *hierarchy depth* ($k^*$, the level at which $E < E_c$ first holds).

Table 1: Emergence Classification Table (HEF). Each row is an emergence type identified by its signature $\Sigma = (\tau, m, \omega, d)$. Observable signatures allow inference of mechanism class from data.

## 8 Instantiations

### 8.1 ML: LLM Training Dynamics and Grokking

In ML, $R^{(1)}$ consists of token embeddings; $R^{(k)}$ is the layer-$k$ representation; $f^{(k)}_{\alpha}$ is an attention head with FFN; $\varphi \in \mathcal{L}(R^{(k-1)})$ is the attention pattern. The domain weight is $w^{\mathrm{domain}}_{\alpha} = \exp(-\mathcal{L}(f^{(k)}_{\alpha}) / \mathcal{L}_0)$. Theorem [S5.1](https://arxiv.org/html/2606.07563#A5.Thmtheorem1) recovers the information bottleneck \[[46](https://arxiv.org/html/2606.07563#bib.bib46)\] via P6.

#### 8.1.1 How Emergence Forms: The Three-Phase HEF Trajectory

Before deriving grokking delay, we trace how emergence unfolds in the HEF hierarchy for a single training run. This gives the intuition for all formal results.

##### Phase 1: Exploration regime ($E > E_c$, steps $0 \to t_{\mathrm{mem}}$).

$\mathcal{A}^*(E)$ contains both $\alpha_{\mathrm{mem}}$ and $\alpha_{\mathrm{gen}}$. The cost of memorisation $C_{\mathrm{mem}}$ is within budget; the cost of the generalising circuit $c_1 n / \lambda$ is also within budget. The Gibbs measure assigns comparable weight to both. The representation $R^{(2)}$ is a *superposition* \[[16](https://arxiv.org/html/2606.07563#bib.bib16)\] of memorised lookup-table features and nascent generalising features. Effective Information is low: causal noise is high because many mechanisms compete ($H_{\mu}(T_k \mid R^{(k)}) > 0$, Theorem [6.1](https://arxiv.org/html/2606.07563#S6.Thmtheorem1)(i)). Training accuracy rises rapidly (memorisation is cheaper); test accuracy stays near chance.

##### Phase 2: Grok gap ($t_{\mathrm{mem}} \to \Delta t$).

The model has fully memorised ($t_{\mathrm{train}} = 1$). The energy budget continues to tighten as weight decay erodes $\|w\|^2$. This is the *exploration regime above $E_c$*: the generalising circuit exists in $\mathcal{A}^*(E)$ but has not yet dominated. The weight norm $\|w\|^2$ peaks near the $E_c$ crossing (Section [8.1.3](https://arxiv.org/html/2606.07563#S8.SS1.SSS3)), which is the empirical fingerprint of the phase boundary. The system is “choosing” between circuits, with the generalising one slowly accumulating weight. Test accuracy rises slowly (the “shoulder” observed in Figure [1](https://arxiv.org/html/2606.07563#S8.F1)).

##### Phase 3: Convergence regime ($E < E_c$, steps $> \Delta t$).

Weight decay has eroded $\|w\|^2$ below the $E_c$ threshold. $\mathcal{A}^*(E) = \{\alpha^*_{\mathrm{gen}}\}$. By Theorem [S8.1](https://arxiv.org/html/2606.07563#A8.Thmtheorem1)(iii), the Banach Fixed-Point Theorem forces convergence to $R_{\infty}$ at rate $c_{\alpha^*}^n < 1$ per step. Test accuracy jumps sharply (the kink) and plateaus at $R_{\infty}$. *This is emergence*: $R^{(3)}$ (Fourier features over $\mathbb{Z}_p$, \[[41](https://arxiv.org/html/2606.07563#bib.bib41)\]) has causal properties absent from $R^{(1)}$ (token embeddings); Theorem [6.1](https://arxiv.org/html/2606.07563#S6.Thmtheorem1) formalises this as $\mathrm{EI}(R_{\infty}) > \mathrm{EI}(R^{(1)})$ under NDA.

#### 8.1.2 Formal Derivation

###### Assumption 6 (Gradient Energy Decay, G1).

$E_{\mathrm{step}}(t) = E_0 / (1 + \lambda t)$, where $E_0 = \eta \|\nabla \mathcal{L}\|^2 |_{t=0}$ and $\lambda > 0$ is weight decay. Status: physically motivated by the AdamW weight-norm dynamics \[[31](https://arxiv.org/html/2606.07563#bib.bib31)\]. Empirical validation (Section [8.1.3](https://arxiv.org/html/2606.07563#S8.SS1.SSS3)) finds the three-phase $\|w\|^2$ trajectory is consistent with G1 (weight norm peaks ${\sim}1{,}050$ steps before grokking, 92% of runs). Full AIC-based model comparison is identified as Open Experimental Protocol (G1-test).
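As a worked consequence of G1 alone, one can solve for the step at which the decaying budget reaches the threshold. This assumes, as a reading of the framework rather than a statement in the text, that the regime change is identified with the first step $t_c$ at which $E_{\mathrm{step}}(t_c) = E_c$, and that $E_0 > E_c$:

$$\frac{E_0}{1 + \lambda t_c} = E_c \quad\Longleftrightarrow\quad t_c = \frac{1}{\lambda}\left(\frac{E_0}{E_c} - 1\right).$$

Under this reading the crossing step scales as $1/\lambda$ at fixed $E_0 / E_c$, so doubling the weight decay halves $t_c$.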

###### Assumption 7 (Circuit Assembly Time, G2 – Empirically Revised).

The generalising circuit requires

$$t_{\mathrm{conv}} \;\propto\; \frac{1}{(n / p_{\mathrm{modes}}) \cdot \lambda} \;=\; \frac{1}{\mathrm{frac} \cdot p \cdot \lambda}, \tag{11}$$

where $p_{\mathrm{modes}} = p - 1 \approx p$ is the number of Fourier modes in the grokked circuit \[[41](https://arxiv.org/html/2606.07563#bib.bib41)\]. The effective training signal per circuit component, $n / p_{\mathrm{modes}} = \mathrm{frac} \cdot p$, determines how quickly each Fourier mode acquires sufficient signal to form.

Empirical status. Across $p \in \{23, 31, 41, 53, 67, 83, 97\}$ at $\mathrm{frac} = 0.40$, $\lambda = 2.0$ (Section [8.1.3](https://arxiv.org/html/2606.07563#S8.SS1.SSS3)): log-log slope $\beta = -1.39 \pm 0.20$ ($R^2 = 0.91$). The hypothesis $\beta = -1$ is not rejected at the 5% significance level ($p$-value $0.075$), although it would be rejected at the 10% level. The original formula $c_1 n / \lambda$ (slope $+2$) is falsified.

Regime constraint. The revised scaling holds only for *moderate* $\lambda$. At $p = 97$, $\lambda \in \{1, 2\}$ grok reliably; $\lambda = 4$ fails in 1/3 seeds (chaotic oscillation). A critical threshold $\lambda_c(p) \in (2, 4)$ exists beyond which weight decay destroys gradient signal faster than the circuit forms. $\lambda_c$ is identified as a new HEF-predictable quantity (Open Protocol 1b).

###### Proposition 8.1 (Grokking Delay, Conditional on G1 and Revised G2).

Under G1 and the revised G2 ([11](https://arxiv.org/html/2606.07563#S8.E11)), for moderate $\lambda < \lambda_c(p)$:

$$\Delta t \;\sim\; \frac{K}{\mathrm{frac} \cdot p \cdot \lambda} \quad \text{for large } p, \tag{12}$$

where $K > 0$ is a fitted constant. Grokking delay is *inversely proportional to prime $p$* at fixed coverage: larger primes provide more training signal per Fourier mode. For $\lambda \geq \lambda_c(p)$, weight decay drives $\|w\|^2$ below $C_{\mathrm{mem}}$ before the generalising circuit forms, causing oscillatory failure (Result E4, Figure [3](https://arxiv.org/html/2606.07563#S8.F3)b).

###### Proof.

From G1 and G2, $\Delta t = t^* + t_{\mathrm{conv}}$ where $t^*$ is the memorisation step and $t_{\mathrm{conv}} \propto 1 / (\mathrm{frac} \cdot p \cdot \lambda)$ by the revised G2 ansatz. ∎
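The scaling in (12) predicts a log-log slope of $-1$ for grokking delay against $p$ at fixed $\mathrm{frac}$ and $\lambda$. The sketch below shows how such a slope can be estimated by ordinary least squares in log space. The delays are synthetic, generated from (12) itself with a hypothetical constant `K`; they are not the measured delays, so the recovered slope of $-1$ only confirms that the estimator behaves as expected.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope and intercept of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

primes = [23, 31, 41, 53, 67, 83, 97]   # the primes used in the validation sweep
FRAC, LAM = 0.40, 2.0                   # fixed training fraction and weight decay
K = 1.0e5                               # hypothetical constant, for illustration only

# Synthetic delays drawn exactly from Delta t = K / (frac * p * lambda).
synthetic_delays = [K / (FRAC * p * LAM) for p in primes]

slope, intercept = loglog_slope(primes, synthetic_delays)
```

Applied to measured delays, the same routine would return the empirical exponent $\beta$; the intercept then estimates $\log\bigl(K / (\mathrm{frac} \cdot \lambda)\bigr)$.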

##### Double Descent as a 2D Phase Surface.

$L_{\mathrm{test}}(N, E_{\mathrm{step}})$ has two thresholds: $E^{(1)}_c$ (capacity) and $E^{(2)}_c = C_{\mathrm{mem}}$ (budget). Model-wise, epoch-wise \[[36](https://arxiv.org/html/2606.07563#bib.bib36)\], and sample-wise non-monotonicity are projections of this surface.

#### 8.1.3 Small-Scale Empirical Evidence

We report results from 90 grokking experiments on modular addition $(a + b) \bmod p$ using a 2-layer transformer (128 dimensions, 4 heads, full-batch AdamW, constant lr $= 10^{-3}$; following Power et al. \[[42](https://arxiv.org/html/2606.07563#bib.bib42)\]).

##### Setup.

Primary experiments (v3): $p \in \{23, 31, 41\}$, training fraction $\mathrm{frac} \in \{0.40, 0.50, 0.60\}$, weight decay $\lambda \in \{1.0, 2.0\}$, seeds $\{0, 1, 2, 3, 4\}$ (90 runs total).

Validation experiments (v2): $p \in \{53, 67, 83, 97\}$, $\mathrm{frac} = 0.40$, $\lambda = 2.0$, seeds $\{0, 1, 2\}$ (12 runs); plus $\lambda$-validation at $p = 97$, $\mathrm{frac} = 0.40$, $\lambda \in \{1.0, 2.0, 4.0\}$, seeds $\{0, 1, 2\}$ (9 runs). Total validation: 21 runs; 17 of 21 grokked (81%); the 4 non-grokking runs are all $\lambda = 4.0$, consistent with the $\lambda_c$ regime transition (Result E4).

Both sets use: `check_every=50`; grokking detected as the first step where test accuracy exceeds 95% for two consecutive evaluations; per-step gradient energy $\|\nabla \mathcal{L}\|^2$, weight norm $\|w\|^2$, and accuracy logged throughout. All data and code are available in the reproducibility package (Appendix [C](https://arxiv.org/html/2606.07563#A3)).
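The detection rule can be sketched as follows. This is an illustrative reading, not the code from the reproducibility package: the function name `detect_grokking` is hypothetical, evaluation `i` is assumed to happen at step `i * check_every`, and the reported step is assumed to be the first of the two consecutive evaluations above the threshold.

```python
def detect_grokking(test_acc, check_every=50, threshold=0.95, patience=2):
    """Return the step at which grokking is detected, or None.

    test_acc[i] is the test accuracy at evaluation i, assumed to be taken at
    training step i * check_every. Detection requires `patience` consecutive
    evaluations strictly above `threshold`; the step of the first evaluation
    in that run is returned (an assumption about the reporting convention).
    """
    run_length = 0
    for i, acc in enumerate(test_acc):
        run_length = run_length + 1 if acc > threshold else 0
        if run_length == patience:
            return (i - patience + 1) * check_every
    return None

# A single spike above 95% at evaluation 1 is ignored; the run starting at
# evaluation 3 (step 150) is the first to persist for two evaluations.
example_curve = [0.10, 0.96, 0.50, 0.96, 0.97, 0.99]
grok_step = detect_grokking(example_curve)
```

Requiring two consecutive evaluations guards against transient spikes in test accuracy being counted as grokking.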

##### Result E1: Universal Convergence confirmed.

89 of 90 runs grokked (98.9%). All grokked models converged to test accuracy $0.9745 \pm 0.014$ (mean $\pm$ std), with coefficient of variation $1.47\%$. One-way ANOVA finds no factor ($p$, frac, $\lambda$, seed) that significantly predicts final accuracy (prime $p$: $F_{2,86} = 2.06$, $p$-value $0.134$; $\lambda$: $F_{1,87} = 0.48$, $p$-value $0.490$; frac: $F_{2,86} = 0.85$, $p$-value $0.431$). This directly confirms Corollary [S8.2](https://arxiv.org/html/2606.07563#A8.Thmtheorem2): all instances sharing $\mathcal{P}$ below $E_c$ converge to the same $R_{\infty}$, independent of $p$, training data, and weight decay.
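For reference, the one-way ANOVA $F$ statistic used above can be computed from grouped final accuracies as in the sketch below. The three groups are made-up toy values, not the experimental accuracies, and the $p$-value (which needs the $F$ distribution) is left to a statistics library.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Toy data: three groups of three values each (illustrative only).
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
```

With three groups and 89 grokked runs the degrees of freedom are $(3 - 1,\, 89 - 3) = (2, 86)$, and with two groups $(1, 87)$, matching the subscripts of the reported statistics.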

##### Result E2: Weight-norm $E_c$ fingerprint.

In 92.1% of runs, $\|w\|^2$ peaks *before* grokking with median lead of $1{,}050$ steps ($\lambda = 1.0$: $1{,}170$ steps; $\lambda = 2.0$: $890$ steps). The trajectory follows the three-phase HEF narrative: (1) $\|w\|^2$ rises during memorisation (exploration regime, $E > E_c$); (2) peaks at the phase boundary ($E \approx E_c$); (3) decays as the convergence regime takes over ($E < E_c$). To our knowledge, this three-phase weight-norm trajectory has not been reported previously. It provides a model-agnostic empirical signature for $E_c$ crossings.
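The lead statistic can be computed per run from the logged weight norms, as in this minimal sketch. The function names and the toy numbers are hypothetical; `log_every=1` reflects the per-step logging described in the setup, and a positive lead means the peak came before grokking.

```python
from statistics import median

def weight_norm_lead(weight_norms, grok_step, log_every=1):
    """Steps between the weight-norm peak and grokking (positive: peak first)."""
    peak_index = max(range(len(weight_norms)), key=weight_norms.__getitem__)
    return grok_step - peak_index * log_every

def summarise_leads(leads):
    """Return (fraction of runs whose peak precedes grokking, median lead)."""
    fraction_before = sum(1 for lead in leads if lead > 0) / len(leads)
    return fraction_before, median(leads)

# Toy run: the norm peaks at step 2 and grokking is detected at step 10.
toy_lead = weight_norm_lead([1.0, 2.0, 5.0, 4.0, 3.0], grok_step=10)

# Toy collection of per-run leads (illustrative values only).
fraction_before, median_lead = summarise_leads([8, -1, 3])
```

Applied to all runs, `summarise_leads` yields the two quantities reported in Result E2: the share of runs whose peak precedes grokking and the median lead in steps.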

##### Result E3: Landau-Ginzburg data collapse.

Normalising all 89 accuracy curves to $[0, 1]$ and rescaling time by $\tau = 500$ steps, the curves collapse onto the tanh kink function $\sigma(t) = \frac{1}{2}\bigl(1 + \tanh((t - \Delta t)/\tau)\bigr)$ with $R^2 = 0.93$ per run and $R^2 = 0.79$ on the mean collapse. This identifies grokking as an instance of the Landau-Ginzburg universality class: the tanh kink is the exact domain-wall solution of the $\phi^4$ field equation, representing a topological transition between two ordered phases. The residual 21% from the mean collapse ($R^2 = 0.79$) corresponds to the pre-grokking “shoulder”, the slow rise of test accuracy during the grok gap, consistent with the gradual circuit formation predicted by Phase 2 of the HEF narrative.
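The per-run goodness of fit can be evaluated as sketched below: normalise an accuracy curve to $[0, 1]$, evaluate the kink $\sigma(t)$ at the logged steps, and compute $R^2$. The curve here is synthetic (an exact kink with a hypothetical floor of 0.02 and amplitude of 0.95), so the near-perfect $R^2$ only checks the procedure and says nothing about the measured curves.

```python
import math

def kink(t, delta_t, tau):
    """Tanh kink: sigma(t) = (1 + tanh((t - delta_t) / tau)) / 2."""
    return 0.5 * (1.0 + math.tanh((t - delta_t) / tau))

def normalise(curve):
    """Rescale a curve linearly so that its minimum is 0 and its maximum is 1."""
    lo, hi = min(curve), max(curve)
    return [(c - lo) / (hi - lo) for c in curve]

def r_squared(observed, predicted):
    """Coefficient of determination of `predicted` against `observed`."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

TAU = 500.0                      # time scale quoted in the text
DELTA_T = 2500.0                 # hypothetical grokking step for the toy curve
steps = list(range(0, 5001, 50))

# Synthetic accuracy curve: chance-level floor plus a scaled kink.
synthetic_acc = [0.02 + 0.95 * kink(t, DELTA_T, TAU) for t in steps]

fit_quality = r_squared(normalise(synthetic_acc),
                        [kink(t, DELTA_T, TAU) for t in steps])
```

For real runs, $\Delta t$ would come from the detection rule for each run, and a slow pre-grokking rise in test accuracy would show up as a lower $R^2$.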

![Refer to caption](https://arxiv.org/html/2606.07563v1/x1.png)

Figure 1: Empirical evidence for HEF’s three-phase energy trajectory and universality class. (a) Weight-norm $E_c$ fingerprint (Result E2). The normalised weight norm $\|w\|^2/\|w_0\|^2$ traces the three-phase HEF trajectory: rising during exploration ($E>E_c$), peaking near the phase boundary (dotted, median lead $1{,}050$ steps before grokking), then falling during convergence ($E<E_c$). The peak precedes grokking in $92.1\%$ of runs, providing a model-agnostic fingerprint of $E_c$. (b) Landau–Ginzburg data collapse (Result E3). All 89 normalised accuracy curves collapse onto $\sigma(t)=\tfrac{1}{2}\left(1+\tanh((t-\Delta t)/\tau)\right)$ (per-run $R^2=0.93$; collapse $R^2=0.79$). The tanh domain-wall solution places grokking in the mean-field / Ising-1D universality class (Class I, Table [1](https://arxiv.org/html/2606.07563#S7.T1)), consistent with a smooth mechanism landscape near $\alpha^*$ (Proposition [7.2](https://arxiv.org/html/2606.07563#S7.Thmtheorem2)). Shaded band: $\pm 1$ standard deviation.

![Refer to caption](https://arxiv.org/html/2606.07563v1/x2.png)

Figure 2: Universal Feature Convergence confirms Corollary [S8.2](https://arxiv.org/html/2606.07563#A8.Thmtheorem2) (Result E1). All 89 grokked models converge to final test accuracy $0.9745\pm 0.014$ (CV $=1.47\%$), independent of initial conditions. (a) Distribution across all 89 runs. (b) One-way ANOVA by prime $p$: $F_{2,86}=2.06$, $p=0.134$, no significant effect. (c) By weight decay $\lambda$: $F_{1,87}=0.48$, $p=0.490$, no significant effect.

Convergence to the same $R_\infty$ regardless of $p$, training fraction, $\lambda$, and random seed directly supports the prediction that two HEF instances sharing $\mathcal{P}$ and operating below $E_c$ converge to the same fixed point (Corollary [S8.2](https://arxiv.org/html/2606.07563#A8.Thmtheorem2)).

##### Result E4: Scaling law and $\lambda_c$ regime transition.

We combine $p\in\{23,31,41\}$ (v3, check\_every $=50$) with $p\in\{53,67,83,97\}$ (v2, check\_every $=50$, same architecture) for a uniform-precision 7-point scaling curve (Figure [3](https://arxiv.org/html/2606.07563#S8.F3)a).

*P-scaling.* Log-log slope $\beta=-1.39\pm 0.20$ ($R^2=0.91$). The hypothesis $H_0:\beta=-1$, i.e. $\Delta t\propto p^{-1}$, is not rejected at the 5% level ($p=0.075$), although it would be rejected at the 10% level. The original G2 ($\beta=+2$) is falsified.
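The slope estimate and the test against a reference slope can be sketched as an ordinary least-squares fit in log-log space followed by a two-sided t-test. The helper name `loglog_slope_test` and the synthetic delays are assumptions for illustration; the paper's analysis script may use a different procedure.

```python
import numpy as np
from scipy import stats

def loglog_slope_test(p, delta_t, beta0=-1.0):
    """OLS fit of log(delta_t) on log(p); two-sided t-test of H0: slope == beta0."""
    res = stats.linregress(np.log(p), np.log(delta_t))
    dof = len(p) - 2                                  # residual degrees of freedom
    t_stat = (res.slope - beta0) / res.stderr
    p_value = 2.0 * stats.t.sf(abs(t_stat), dof)
    return res.slope, res.stderr, res.rvalue**2, p_value

# Synthetic delays (NOT the paper's measurements): delta_t = 4e5 * p^(-1.4) with noise
rng = np.random.default_rng(1)
primes = np.array([23, 31, 41, 53, 67, 83, 97], dtype=float)
delta_t = 4e5 * primes ** (-1.4) * np.exp(rng.normal(0, 0.1, primes.size))

slope, stderr, r2, p_value = loglog_slope_test(primes, delta_t)
print(f"beta={slope:.2f} +/- {stderr:.2f}, R^2={r2:.2f}, p(H0: beta=-1)={p_value:.3f}")
```

The null slope is rejected at significance level $\alpha$ when the returned p-value falls below $\alpha$. With only seven points the residual degrees of freedom are five, so the t-distribution rather than a normal approximation is the appropriate reference.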

*$\lambda_c$ regime transition (Figure [3](https://arxiv.org/html/2606.07563#S8.F3)b).* At $p=97$: $\lambda=1.0$ and $\lambda=2.0$ grok reliably (3/3 seeds); $\lambda=4.0$ fails in 1/3 seeds (seed 2 oscillates for 200,000 steps without reaching train accuracy $\geq 99\%$, CV of train accuracy $=0.21$). The ratio $\Delta t(\lambda{=}1)/\Delta t(\lambda{=}2)=1.56$ is broadly consistent with the $\lambda^{-1}$ prediction (factor 2.0), but $\lambda=4.0$ breaks the monotonicity. We identify a critical threshold $\lambda_c(p{=}97)\in(2,4)$ beyond which weight decay destroys gradient signal faster than the generalising circuit forms. This is a new HEF prediction: from Proposition [8.1](https://arxiv.org/html/2606.07563#S8.Thmtheorem1), $\lambda_c$ is the value at which $t_{\mathrm{conv}}$ diverges (circuit formation time exceeds the training horizon). Empirical determination of $\lambda_c(p)$ as a function of $p$ is Open Experimental Protocol 1b.

*Phase structure vs $p$ (Figure [3](https://arxiv.org/html/2606.07563#S8.F3)c).* Test accuracy at the memorisation step rises from $\approx 0.38$ at $p\in\{23,31,41\}$ to $\approx 0.76$ at $p=97$: the classic two-phase grokking gradually collapses as each Fourier mode receives richer training signal, consistent with the HEF three-phase energy trajectory.

![Refer to caption](https://arxiv.org/html/2606.07563v1/x3.png)

Figure 3: G2 scaling validation and $\lambda_c$ regime transition (Result E4; Open Protocols 1a–c). (a) Grokking delay $\Delta t$ vs prime $p$ (log–log), $\lambda=2.0$, frac $=0.40$. The original G2 prediction ($\Delta t\propto n/\lambda$, slope $+2$) is falsified; the observed slope $\beta=-1.39\pm 0.20$ ($R^2=0.91$) is compatible with the revised G2, $\Delta t\propto 1/(\mathrm{frac}\cdot p\cdot\lambda)$, whose slope $-1$ is not rejected at the 5% level ($p=0.075$). Error bars: 95% CI. (b) $\lambda$-dependence at $p=97$. $\lambda\in\{1,2\}$ grok reliably; $\lambda=4$ fails in $1/3$ seeds (oscillatory regime). A critical threshold $\lambda_c(p{=}97)\in(2,4)$ identifies *mechanism starvation* (Class VI, Table [1](https://arxiv.org/html/2606.07563#S7.T1)). (c) Test accuracy at memorisation step vs $p$: transition from classic two-phase grokking ($\approx 38\%$ at $p\leq 41$) to simultaneous learning ($\approx 76\%$ at $p=97$), consistent with richer training signal per Fourier mode.

##### Interpretation\.

Results E1–E3 provide empirical support for the core theoretical predictions of HEF (Universal Convergence, $E_c$ phase boundary, Landau-Ginzburg transition). Result E4 identifies a limitation of Proposition [8.1](https://arxiv.org/html/2606.07563#S8.Thmtheorem1) at small scale and motivates a refinement of G2. The honest summary is: *the phase transition structure of grokking is confirmed; the specific $n/\lambda$ scaling formula is not yet confirmed and requires larger-scale experiments.*

### 8.2 EOM: Prebiotic Chemistry and Evolutionary Biology

In EOM, $R^{(1)}$ consists of prebiotic molecules. The framework was introduced in Truong and Truong \[[47](https://arxiv.org/html/2606.07563#bib.bib47)\], which establishes $\nabla\Sigma$ and $\nabla\Phi_I$ as linearly independent forces off equilibrium. The hierarchy spans molecules $\to$ oligomers $\to$ autocatalytic sets $\to$ protocells $\to$ Darwinian units, with domain weight $w^{\mathrm{domain}}_{\alpha}=\exp(-\Delta G^{\ddagger}_{\alpha}/k_B T)$.

Convergent evolution \[[11](https://arxiv.org/html/2606.07563#bib.bib11)\] follows from Corollary [S8.2](https://arxiv.org/html/2606.07563#A8.Thmtheorem2): metabolic constraints enforce $E<E_c$, guaranteeing convergence across independent lineages. The Cambrian explosion corresponds to $E_{\mathrm{metabolic}}$ transiently crossing $E_c$ during oxygenation, with timescale $\tau_{\mathrm{Cambrian}}\propto|dE/dt|^{-1}$.

### 8.3 IFF: Information Field Theory

In IFF, $R^{(1)}$ consists of Fourier modes of a physical field. The domain weight $w^{\mathrm{domain}}_{\alpha}=I(R^{(k-1)};f_{\alpha}(R^{(k-1)}))\propto|d\alpha/d\ell|_{\mathrm{RG}}$ recovers RG relevance \[[49](https://arxiv.org/html/2606.07563#bib.bib49), [26](https://arxiv.org/html/2606.07563#bib.bib26)\]. As $E=k_B T$ decreases through $E_c$ values, successive phase transitions eliminate mechanism classes.

### 8.4 RSID: Nanoparticle Signal Detection

In RSID, $R^{(1)}$ consists of nanoparticle–target binding configurations with $E_{ij}=\Delta G^{\circ}_{\mathrm{bind}}(i,j)$. AND-NOT logic maps to $r_{iA}\wedge r^{\perp}_{iB}$ in $\mathcal{L}$. The Hill coefficient equals the conjunction arity \[[34](https://arxiv.org/html/2606.07563#bib.bib34)\]. Testable prediction: the false-positive rate increases sharply near $T_c=E_c/k_B$.

## 9 Practitioner’s Guide: Applying HEF to New Systems

This section provides a self-contained diagnostic toolkit for applying HEF to a new system without engaging the full theoretical apparatus. All diagnostics are implemented in the hef-tools Python package (Section [9.5](https://arxiv.org/html/2606.07563#S9.SS5)). The workflow proceeds in four steps.

### 9.1 Step 1: Identify the HEF Tuple

Map your system to the six-tuple $\mathcal{H}=(R^{(1)},\mathcal{L},\mathcal{A}_0,\mathcal{G},\mathrm{mode},E)$:

Practical check. If you cannot identify all six components, HEF may still apply: start with $E$ and $R^{(1)}$, which are sufficient for Step 2.

### 9.2 Step 2: Detect the $E_c$ Fingerprint

The $E_c$ crossing produces a universal signature in the *energy proxy* of your system. For ML systems, the energy proxy is the weight norm $\|w\|^2$.

1. Log your energy proxy at regular intervals throughout training or system evolution.
2. Look for a peak: the proxy should rise, reach a maximum, and then fall. If no peak exists, your system may be operating in a single regime (always above or always below $E_c$).
3. Measure the lead time: the interval between the energy peak and the emergence event (grokking, phase transition, speciation event). In our experiments this was $1{,}050\pm 420$ steps.
4. Interpret: the peak is the empirical $E_c$. Systems that never peak have not undergone HEF-type emergence; they are operating in the Class VI (mechanism starvation) or flat (Class IV) regime.

hef-tools command

```
from hef_tools import detect_ec_fingerprint

result = detect_ec_fingerprint(weight_norm_series, emergence_step)
# Returns: peak_step, lead_time, phase_class
```
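For readers who want to see what the peak-and-lead-time measurement in Step 2 involves, the following is a minimal NumPy sketch. It is not the hef-tools implementation: the function name `ec_fingerprint`, its arguments, the smoothing window, and the synthetic trajectory are all assumptions made for illustration.

```python
import numpy as np

def ec_fingerprint(steps, energy_proxy, emergence_step, window=5):
    """Locate the energy-proxy peak and its lead time before emergence.

    Hypothetical helper for illustration; not the hef-tools implementation.
    `window` must be odd so the smoothed series keeps its length.
    """
    steps = np.asarray(steps)
    e = np.asarray(energy_proxy, dtype=float)
    half = window // 2
    # Edge-padded moving average: avoids artificial dips at the series ends
    padded = np.pad(e, half, mode="edge")
    smooth = np.convolve(padded, np.ones(window) / window, mode="valid")
    peak_idx = int(np.argmax(smooth))
    peak_step = int(steps[peak_idx])
    # A rise-then-fall trajectory requires the maximum to be interior
    has_peak = 0 < peak_idx < len(e) - 1
    return {
        "peak_step": peak_step,
        "lead_time": int(emergence_step) - peak_step,
        "has_peak": has_peak,
    }

# Synthetic trajectory (illustrative): norm rises, peaks at step 2000, then decays
steps = np.arange(0, 5000, 10)
w2 = 1.0 + 2.0 * np.exp(-(((steps - 2000) / 800.0) ** 2))
result = ec_fingerprint(steps, w2, emergence_step=3050)
print(result)
```

A positive `lead_time` means the proxy peaked before the emergence event. A monotone series yields `has_peak=False`, which corresponds to the single-regime case described in item 2.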

### 9.3 Step 3: Classify the Emergence Type

Once the $E_c$ fingerprint is identified, compute the four-component *emergence signature* $\Sigma(\mathcal{H})=(\tau,m,\omega,d)$ (Definition [7.5](https://arxiv.org/html/2606.07563#S7.Thmdefinition5)):

Match your signature to Table [1](https://arxiv.org/html/2606.07563#S7.T1) to identify the *universality class* and associated predictions.

hef-tools command

```
from hef_tools import classify_emergence

sig = classify_emergence(
    acc_curve=test_acc,
    weight_norm=wnorm,
    delta_t=grok_step,
)
# Returns: Sigma(tau='smooth', m=2, omega=150, d=3) -> Class I
```

### 9.4 Step 4: Intervene via HEF Predictions

HEF provides actionable predictions for each universality class:

### 9.5 The hef-tools Package

All diagnostics above are implemented in hef-tools, a lightweight Python package requiring only numpy, pandas, and matplotlib. The package provides:

- detect\_ec\_fingerprint: detects the energy-proxy peak and measures lead time.
- classify\_emergence: computes $\Sigma(\tau,m,\omega,d)$ and returns the universality class.
- fit\_tanh\_collapse: fits and plots the Landau–Ginzburg data collapse.
- plot\_hef\_trajectory: generates the three-phase trajectory plot (as in Figure [1](https://arxiv.org/html/2606.07563#S8.F1)).
- predict\_delta\_t: predicts grokking delay from hyperparameters under the revised G2 formula.

Installation

```
pip install hef-tools
```

The package source, documentation, and worked examples are available at[https://github\.com/ClevixLab/hef\-tools](https://github.com/ClevixLab/hef-tools)and in the reproducibility package accompanying this submission\.

## 10 Related Work

##### Emergence theory.

Bedau \[[4](https://arxiv.org/html/2606.07563#bib.bib4)\]: weak emergence as simulation-irreducibility; Algorithm [1](https://arxiv.org/html/2606.07563#alg1) provides the constructive simulation. Hoel et al. \[[21](https://arxiv.org/html/2606.07563#bib.bib21)\]: causal emergence via effective information; HEF provides the generative mechanism. Deutsch and Marletto \[[15](https://arxiv.org/html/2606.07563#bib.bib15)\]: constructor theory; HEF’s $\mathcal{A}^*$ is the set of possible constructors, $\mathcal{G}$ the meta-constructor.

##### Feature convergence.

Huh et al. \[[23](https://arxiv.org/html/2606.07563#bib.bib23)\]: empirical convergence of neural representations across modalities, consistent with Corollary [S8.2](https://arxiv.org/html/2606.07563#A8.Thmtheorem2). Olah et al. \[[37](https://arxiv.org/html/2606.07563#bib.bib37)\]: universal features across independently trained CNNs, also consistent. Boix-Adsera et al. \[[7](https://arxiv.org/html/2606.07563#bib.bib7)\]: FACT proves a self-consistency equation at convergence; weight decay in FACT plays a role analogous to $E_c$ in HEF (see Open Problem 2, Section [11](https://arxiv.org/html/2606.07563#S11)).

##### Grokking.

Power et al. \[[42](https://arxiv.org/html/2606.07563#bib.bib42)\]: discovery. Miller et al. \[[33](https://arxiv.org/html/2606.07563#bib.bib33)\]: empirical confirmation as phase transition. Doshi et al. \[[14](https://arxiv.org/html/2606.07563#bib.bib14)\]: circuit decomposition into memorisation and generalisation, consistent with $\mathcal{A}_{\mathrm{mem}}/\mathcal{A}_{\mathrm{gen}}$. Xu \[[51](https://arxiv.org/html/2606.07563#bib.bib51)\]: weight decay as compression pressure. Truong \[[48](https://arxiv.org/html/2606.07563#bib.bib48)\]: first-passage law for grokking delay, the companion to Proposition [8.1](https://arxiv.org/html/2606.07563#S8.Thmtheorem1).

##### Double descent.

Belkin et al. \[[5](https://arxiv.org/html/2606.07563#bib.bib5)\] and Nakkiran et al. \[[36](https://arxiv.org/html/2606.07563#bib.bib36)\]: empirical documentation. HEF provides a unified 2D phase surface interpretation.

##### Convergent evolution.

Conway Morris \[[11](https://arxiv.org/html/2606.07563#bib.bib11), [12](https://arxiv.org/html/2606.07563#bib.bib12)\]. Opulente et al. \[[38](https://arxiv.org/html/2606.07563#bib.bib38)\]: same gene families in 80% of convergent metabolic cases across 993 yeast species, consistent with Corollary [S8.2](https://arxiv.org/html/2606.07563#A8.Thmtheorem2). HEF’s prediction of stronger convergence in anaerobic lineages is novel.

##### Thermodynamics and information.

Landauer \[[30](https://arxiv.org/html/2606.07563#bib.bib30)\], Bennett \[[6](https://arxiv.org/html/2606.07563#bib.bib6)\]: thermodynamics of computation. Jarzynski \[[24](https://arxiv.org/html/2606.07563#bib.bib24)\]: free energy from non-equilibrium work. Tishby et al. \[[46](https://arxiv.org/html/2606.07563#bib.bib46)\]: information bottleneck recovered from P6.

##### SDPI.

Raginsky \[[43](https://arxiv.org/html/2606.07563#bib.bib43)\]: Strong Data Processing Inequality; used in Lemmas [3.5](https://arxiv.org/html/2606.07563#S3.Thmtheorem5)–[5.3](https://arxiv.org/html/2606.07563#S5.Thmtheorem3).

##### EOM-IFF.

Truong and Truong \[[47](https://arxiv.org/html/2606.07563#bib.bib47)\]: the foundation generalised by HEF.

## 11 Conclusion

HEF is a constructive mathematical framework for a recurring pattern in convergence phenomena. Rather than claiming to explain all emergence, it specifies, for systems exhibiting this pattern: *when* a phase transition occurs (when $E$ crosses $E_c$), *why* convergence is universal (Banach contraction under physical constraints $\mathcal{P}$), and *what emerges* ($R_\infty$, the unique fixed point, up to the limitations noted below). We summarise the status of all claims.

##### What is proven \(no additional assumptions required\)\.

Theorem [S5.1](https://arxiv.org/html/2606.07563#A5.Thmtheorem1) (Physical Feasibility) is proven under A1–A4 with separate thermodynamic and information-theoretic branches. Theorem [S8.1](https://arxiv.org/html/2606.07563#A8.Thmtheorem1) (Energy-Diversity) is proven under A1–A6. Corollary [S8.2](https://arxiv.org/html/2606.07563#A8.Thmtheorem2) (Universal Convergence) follows in three steps under A5 and A6. Theorem [6.1](https://arxiv.org/html/2606.07563#S6.Thmtheorem1) (Causal Emergence) is proven under A1–A6 and the Non-Degeneracy Assumption (NDA), which is necessary and satisfied in all four instantiations. A6 is an empirically verifiable condition; for linear-attribute and monotone-compressive mechanisms it follows directly from the compression coefficients (Propositions 3.6–3.7), and for mechanisms admitting a log-Sobolev inequality it holds with $c_{\alpha^*}=e^{-\rho}$ (Theorem 3.8). For ML instantiations, A6 is verified empirically via spectral normalization and weight decay (see Supplementary Information, Section 7).

##### What is empirically validated\.

From 89 grokking experiments ($p\in\{23,31,41\}$, five seeds):

- E1 Universal Convergence: all grokked models converge to $0.9745\pm 0.014$, ANOVA $p>0.13$ for all factors.
- E2 $E_c$ fingerprint: $\|w\|^2$ peaks ${\sim}1{,}050$ steps before grokking in 92% of runs, tracing the HEF three-phase trajectory.
- E3 Landau-Ginzburg data collapse: $R^2=0.93$ per run.

Findings E1–E3 are consistent with the core theoretical structure; they constitute supporting evidence, not complete validation.

##### What is assumed \(G1, G2\) and their status\.

G1 is consistent with the three-phase $\|w\|^2$ trajectory; full validation requires dense-gradient logging (Open Protocol G1-test). G2 (original: $\propto n/\lambda$) is revised: the slope $\beta=-1.39\pm 0.20$ ($R^2=0.91$) is compatible with $\Delta t\propto 1/(\mathrm{frac}\cdot p\cdot\lambda)$, in that $\beta=-1$ is not rejected at the 5% level ($p=0.075$). A new empirical finding is the $\lambda_c(p)$ regime transition: for $\lambda\geq\lambda_c\in(2,4)$ at $p=97$, grokking is no longer reliable (at $\lambda=4.0$, 1 of 3 seeds enters an oscillatory regime and fails to grok). This is a novel HEF-predictable threshold. The frac-dependence and the $\lambda_c(p)$ scaling remain Open Protocols 1a–c.

##### What is a retrospective consistency check\.

The IFF instantiation \(cosmological phase transitions\) and the Cambrian explosion timescale estimate are consistency checks with established physics and palaeontology, not new predictions\. They illustrate HEF’s scope without constituting independent evidence\.

##### Open problems\.

1. A6 for non-LSI mechanisms. Linear $W_1$-contraction beyond the monotone and LSI classes; Poincaré-inequality-based approach as a candidate.
2. G2 from circuit complexity. Derive $t_{\mathrm{conv}}(n,p,\lambda)$ from first principles. FACT \[[7](https://arxiv.org/html/2606.07563#bib.bib7)\] gives partial progress.
3. EI gain upper bound. Tighten Corollary [6.2](https://arxiv.org/html/2606.07563#S6.Thmtheorem2) using mechanistic interpretability of the grokked circuit \[[41](https://arxiv.org/html/2606.07563#bib.bib41)\].
4. Continuous E-diversity inflection. Extend Theorem [S8.1](https://arxiv.org/html/2606.07563#A8.Thmtheorem1)(ii) to $|\mathcal{A}^*|\to\infty$.

##### Open experimental protocols\.

1. G2 revised formula and $\lambda_c$ boundary. Current data: slope $\beta=-1.39\pm 0.20$ ($p\in\{23,\ldots,97\}$, $\lambda=2.0$; $\beta=-1$ is not rejected at the 5% level, $p=0.075$).
   - 1a. Confirm with frac $\in\{0.30,0.50\}$ at $p\in\{97,113\}$ to separate frac-dependence.
   - 1b. Pin down $\lambda_c(p{=}97)$: run $\lambda\in\{2.5,3.0,3.5\}$, 3 seeds each (9 runs). $\lambda_c$ is the value below which all 3 seeds grok within 50,000 steps. Decisive: $\lambda_c<3$ or $\lambda_c>3$ distinguishes competing mechanistic hypotheses.
   - 1c. Test $\lambda_c(p)$ dependence: does $\lambda_c$ grow with $p$? HEF predicts $\lambda_c\propto\mathrm{frac}\cdot p$ (same scaling as $n/p_{\mathrm{modes}}$).
2. G1 model comparison. Dense gradient logging ($p=23$, log every step) and AIC comparison of $E_0/(1+\lambda t)$ vs $E_0 e^{-\gamma t}$ vs power law. Critical for the foundation of Proposition [8.1](https://arxiv.org/html/2606.07563#S8.Thmtheorem1).
3. Convergent evolution. Anaerobic vs aerobic yeast lineages (Opulente et al. \[[38](https://arxiv.org/html/2606.07563#bib.bib38)\] data): test whether lower metabolic $E$ predicts stronger genomic convergence. Falsifiable: Spearman $\rho(\text{ATP yield},\text{convergence score})<-0.3$, $p<0.05$.
4. RSID temperature sensitivity. Nanoparticle assays at five temperatures to detect the $T_c=E_c/k_B$ false-positive spike.
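The falsification criterion in protocol 3 can be written as a short check with a rank correlation. This is a sketch under assumptions: the function name `convergence_test`, the variable names, and the synthetic lineages are hypothetical, and how ATP yield and a convergence score would be extracted from the yeast data is not specified here.

```python
import numpy as np
from scipy.stats import spearmanr

def convergence_test(atp_yield, convergence_score, rho_max=-0.3, alpha=0.05):
    """Return (rho, p, passed) for the criterion rho < rho_max with p < alpha."""
    rho, p = spearmanr(atp_yield, convergence_score)   # two-sided p-value by default
    return rho, p, bool(rho < rho_max and p < alpha)

# Synthetic lineages (illustrative only): convergence falls as ATP yield rises
rng = np.random.default_rng(2)
atp = rng.uniform(2, 32, size=60)
score = 1.0 - 0.02 * atp + rng.normal(0, 0.1, size=60)

rho, p, passed = convergence_test(atp, score)
print(f"rho={rho:.2f}, p={p:.2g}, criterion met: {passed}")
```

Spearman correlation depends only on ranks, so the check is insensitive to any monotone rescaling of the convergence score, which matters when the score has no natural units.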

HEF is offered as a mathematical scaffold and a source of falsifiable predictions\. Its value depends on whether the protocols above confirm, refine, or refute the theoretical predictions\. We welcome independent replication; all experiment code, data, and proofs are provided in the reproducibility package \(Appendix[C](https://arxiv.org/html/2606.07563#A3)\)\.

## Appendix A Illustrative Example: Grokking Delay at $p=97$

We illustrate Proposition [8.1](https://arxiv.org/html/2606.07563#S8.Thmtheorem1) with parameters following Power et al. \[[42](https://arxiv.org/html/2606.07563#bib.bib42)\] at prime $p=97$.

##### Note on parameter regime.

Power et al. \[[42](https://arxiv.org/html/2606.07563#bib.bib42)\] use weight decay $\lambda=10^{-2}$, whereas our experiments (Section [8.1.3](https://arxiv.org/html/2606.07563#S8.SS1.SSS3)) use $\lambda\in\{1,2\}$. These are different training regimes. The calculation below uses Power et al.’s parameters ($\lambda=10^{-2}$) to match their reported $\Delta t\approx 10^4$ steps; our experimental data test the *scaling with $p$ and $\lambda$* in the larger-$\lambda$ regime.

##### Scope and limitation.

The parameters $C_{\mathrm{mem}}$ and $c_1$ below are fitted to a single empirical data point ($p=97$, $n=0.4p^2$, $\lambda=10^{-2}$). This example is therefore an *illustration of the formula’s structure*, not a predictive validation. The falsifiable content of Proposition [8.1](https://arxiv.org/html/2606.07563#S8.Thmtheorem1) lies exclusively in the *joint scaling* $\Delta t\propto 1/(\mathrm{frac}\cdot p\cdot\lambda)$ when $p$, $\lambda$, and frac are varied, which is partially covered by Section [8.1.3](https://arxiv.org/html/2606.07563#S8.SS1.SSS3) (Open Experimental Protocols 1a–c).

##### Parameter choices\.

- $n=\lfloor 0.4\times p^2\rfloor=\lfloor 0.4\times 9409\rfloor=3763$ training examples.
- $\lambda=10^{-2}$ (weight decay, following \[[42](https://arxiv.org/html/2606.07563#bib.bib42)\]).
- $\eta=10^{-3}$ (learning rate); $\|\nabla\mathcal{L}\|_0\approx 1$, giving $E_0=\eta\|\nabla\mathcal{L}\|_0^2=10^{-3}$.
- $C_{\mathrm{mem}}=10^{-4}$ (memorisation cost threshold). This is calibrated as follows. Under G1, memorisation activates when $E_{\mathrm{step}}(t^*)=C_{\mathrm{mem}}$, i.e. after $t^*=(E_0/C_{\mathrm{mem}}-1)/\lambda$ steps. Empirically, modular arithmetic memorisation is observed within $\sim 100$ steps at these parameters \[[42](https://arxiv.org/html/2606.07563#bib.bib42)\], giving $100\approx(10^{-3}/C_{\mathrm{mem}}-1)/10^{-2}$, hence $C_{\mathrm{mem}}\approx 10^{-3}/(1+100\times 10^{-2})=10^{-3}/2=5\times 10^{-4}$. We use $C_{\mathrm{mem}}=10^{-4}$ as a conservative lower estimate that yields $t^*=900$ steps, a reasonable order of magnitude for the memorisation onset. $\Delta t$ depends on $C_{\mathrm{mem}}$ only through the subdominant term $E_0/(C_{\mathrm{mem}}\lambda)$, which moves from 1000 to 200 steps as $C_{\mathrm{mem}}$ goes from $10^{-4}$ to $5\times 10^{-4}$; against the dominant term $c_1 n/\lambda\approx 9780$, a factor-of-5 uncertainty in $C_{\mathrm{mem}}$ therefore changes the prediction by less than $10\%$.
- $c_1=0.026$ (circuit assembly constant; calibrated so that $\Delta t\approx 10^4$ steps matches the empirical grokking delay at $p=97$, $n=0.4p^2$, $\lambda=10^{-2}$ reported in \[[42](https://arxiv.org/html/2606.07563#bib.bib42)\]).
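The $C_{\mathrm{mem}}$ calibration above is a two-line inversion, sketched here so the numbers can be checked. The sketch assumes the G1 energy schedule has the form $E_{\mathrm{step}}(t)=E_0/(1+\lambda t)$, which is the form consistent with the stated expression for $t^*$; the function names are hypothetical.

```python
def memorisation_onset(e0, c_mem, lam):
    """t* solving E0 / (1 + lam * t*) = C_mem, i.e. t* = (E0 / C_mem - 1) / lam."""
    return (e0 / c_mem - 1.0) / lam

def c_mem_from_onset(e0, t_star, lam):
    """Inverse relation: the threshold C_mem implied by an observed onset t*."""
    return e0 / (1.0 + lam * t_star)

E0, LAM = 1e-3, 1e-2   # values from the parameter list above

c_mem_fit = c_mem_from_onset(E0, t_star=100, lam=LAM)      # onset observed at ~100 steps
t_star_used = memorisation_onset(E0, c_mem=1e-4, lam=LAM)  # onset implied by C_mem = 1e-4

print(f"C_mem implied by t*=100: {c_mem_fit:.1e}")
print(f"t* implied by C_mem=1e-4: {t_star_used:.0f} steps")
```

The two directions make the gap explicit: an onset of about 100 steps implies $C_{\mathrm{mem}}=5\times 10^{-4}$, while the adopted $C_{\mathrm{mem}}=10^{-4}$ corresponds to an onset of about 900 steps.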

##### Prediction\.

From equation \([12](https://arxiv.org/html/2606.07563#S8.E12)\):

$$\Delta t=\frac{E_0}{C_{\mathrm{mem}}\lambda}+\frac{c_1 n}{\lambda}=\frac{10^{-3}}{10^{-4}\times 10^{-2}}+\frac{0.026\times 3763}{10^{-2}}=\frac{10^{-3}}{10^{-6}}+\frac{97.8}{10^{-2}}=1000+9780=10{,}780\text{ steps}.$$

##### Interpretation\.

The prediction $\Delta t\approx 10{,}780$ steps is consistent with the empirically observed grokking delay of $\sim 10^4$ steps at these parameters \[[42](https://arxiv.org/html/2606.07563#bib.bib42)\], as expected given that $c_1$ was calibrated to this data point. The dominant term is $c_1 n/\lambda\approx 9780$ (circuit assembly), which illustrates the $n/\lambda$ structure of the formula in this parameter range. The formula predicts that increasing $n$ by a factor of 1.75 (to $n=0.7p^2\approx 6586$) at fixed $\lambda$ gives $\Delta t\approx 1000+17{,}124=18{,}124$ steps, a $\sim 68\%$ increase; doubling $\lambda$ (to $\lambda=0.02$) at fixed $n$ gives $\Delta t\approx 500+4890=5390$ steps, a $\sim 50\%$ decrease.
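The arithmetic in this appendix can be re-run directly. The sketch below evaluates the original G2 form $\Delta t=E_0/(C_{\mathrm{mem}}\lambda)+c_1 n/\lambda$ at the stated parameters and at the two perturbed settings; the function name `grokking_delay` is hypothetical and the constants are the calibrated values quoted above.

```python
def grokking_delay(e0, c_mem, c1, n, lam):
    """Original G2 form used in this appendix: E0 / (C_mem * lam) + c1 * n / lam."""
    return e0 / (c_mem * lam) + c1 * n / lam

E0, C_MEM, C1 = 1e-3, 1e-4, 0.026

base = grokking_delay(E0, C_MEM, C1, n=3763, lam=1e-2)        # p = 97, n = floor(0.4 p^2)
more_data = grokking_delay(E0, C_MEM, C1, n=6586, lam=1e-2)   # n = 0.7 p^2, same lambda
double_lam = grokking_delay(E0, C_MEM, C1, n=3763, lam=2e-2)  # lambda doubled, same n
alt_c_mem = grokking_delay(E0, 5e-4, C1, n=3763, lam=1e-2)    # C_mem five times larger

print(f"base:       {base:8.0f} steps")
print(f"n=0.7p^2:   {more_data:8.0f} steps ({100 * (more_data / base - 1):+.0f}%)")
print(f"lambda x2:  {double_lam:8.0f} steps ({100 * (double_lam / base - 1):+.0f}%)")
print(f"C_mem x5:   {alt_c_mem:8.0f} steps ({100 * (alt_c_mem / base - 1):+.0f}%)")
```

Without intermediate rounding the base value is about 10,784 steps, which the text rounds to 10,780. Both terms scale as $1/\lambda$, so doubling $\lambda$ halves the delay exactly, whereas changing $n$ or $C_{\mathrm{mem}}$ affects only one term.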

Note that the calculation above uses the original G2 formula ($c_1 n/\lambda$) calibrated to Power et al. \[[42](https://arxiv.org/html/2606.07563#bib.bib42)\] at $\lambda=10^{-2}$. The revised G2 formula $K/(\mathrm{frac}\cdot p\cdot\lambda)$ applies in our experimental regime $\lambda\in\{1,2\}$ and is tested in Section [8.1.3](https://arxiv.org/html/2606.07563#S8.SS1.SSS3), where it is only partially covered (Open Experimental Protocols 1a–c).

## Appendix B Proof of Compression Coefficients from Cost Minimality

This appendix provides the detailed proof that the minimum-cost mechanism $\alpha^*$ satisfies $a_{\alpha^*},b_{\alpha^*}<1$, formalising Lemma [S6.1](https://arxiv.org/html/2606.07563#A6.Thmtheorem1) of Section 3.5.

###### Lemma B.1 (Formal proof of $b_{\alpha^*}<1$).

Let $\alpha^*=\arg\min_{\alpha\in\mathcal{A}^*}\mathrm{cost}(\alpha)$ be a non-trivial, non-injective mechanism (Definition [S6.1](https://arxiv.org/html/2606.07563#A6.Thmdefinition1)). Then $b_{\alpha^*}=\sup_{\varphi\models\mathcal{P},\,H(\varphi)>0}H(f_{\alpha^*}(\varphi))/H(\varphi)<1$.

###### Proof\.

Step 1 (Upper bound $\leq 1$). By the DPI (P6) applied to the deterministic map $f_{\alpha^*}$: for any $\varphi\models\mathcal{P}$, $H(f_{\alpha^*}(\varphi))\leq H(\varphi)$, hence $b_{\alpha^*}\leq 1$.

Step 2 (Non-injectivity gives strict compression for some $\varphi$). Since $\alpha^*$ is non-injective, there exist $\varphi_A\neq\varphi_B$ with $f_{\alpha^*}(\varphi_A)=f_{\alpha^*}(\varphi_B)=:r^*$. Consider the random variable $\Phi$ that equals $\varphi_A$ with probability $p$ and $\varphi_B$ with probability $1-p$. Since $f_{\alpha^*}(\varphi_A)=f_{\alpha^*}(\varphi_B)=r^*$, the Markov chain $\Phi\to r^*\to\Phi$ holds. By the Data Processing Inequality applied twice:

$$I(\Phi;\Phi)\geq I(\Phi;r^*)\geq I(r^*;r^*)=H(r^*).$$

But $I(\Phi;\Phi)=H(\Phi)\leq\min(H(\varphi_A),H(\varphi_B))$ (the entropy of a mixture is at most the maximum of the individual entropies, which is bounded by the minimum when one has larger entropy). Hence $H(r^*)\leq\min(H(\varphi_A),H(\varphi_B))$.

If $H(\varphi_A)>H(\varphi_B)$, then $H(r^*)\leq H(\varphi_B)<H(\varphi_A)$. If $H(\varphi_A)=H(\varphi_B)=:h>0$, then $H(r^*)\leq h$, and since $\varphi_A\neq\varphi_B$ under the canonical measure, the inequality is strict: $H(r^*)<h$. In either case, $H(f_{\alpha^*}(\varphi_A))<H(\varphi_A)$.

Step 3 ($b_{\alpha^*}<1$ from full support of $\mu$). Under the canonical Gibbs measure $\mu$ of Definition [2.10](https://arxiv.org/html/2606.07563#S2.Thmdefinition10) (full support on $\mathcal{P}$-feasible states), the pair $(\varphi_A,\varphi_B)$ occurs with positive measure. Let $\beta:=1-H(r^*)/H(\varphi_A)>0$ be the compression gap at $\varphi_A$.

For any $\varphi$ in the support of $\mu$, the compression ratio satisfies $H(f(\varphi))/H(\varphi)\leq\max(b^*,1-\beta')$ for some $\beta'>0$ on a $\mu$-positive set. Hence $b_{\alpha^*}\leq 1-\beta'<1$, where $\beta'>0$ follows from the strict compression at $\varphi_A$.

More precisely: $b_{\alpha^*}=\sup_{\varphi}H(f(\varphi))/H(\varphi)$. If $b_{\alpha^*}=1$, then the supremum is approached, implying a sequence $\varphi_n$ with $H(f(\varphi_n))/H(\varphi_n)\to 1$. But the non-injective pair $(\varphi_A,\varphi_B)$ always gives $H(f(\varphi_A))/H(\varphi_A)\leq 1-\beta<1$ for fixed $\beta>0$. Since the canonical measure gives positive weight to both the sequence $\varphi_n$ and the pair, and the DPI is strict for non-injective maps, $b_{\alpha^*}<1$ must hold. (Formally: the supremum of a set that excludes a positive-measure region below $1-\beta$ is itself $<1$.) ∎

###### Lemma B.2 (Formal proof of $a_{\alpha^*}\leq 1$ and conditions for $<1$).

Under the same conditions, $a_{\alpha^*}\leq 1$. Strict inequality $a_{\alpha^*}<1$ holds when the minimum-cost mechanism does not increase subsystem energy on average, which is guaranteed when $E<E_c$.

###### Proof\.

$a_{\alpha^*}\leq 1$: The minimum-cost mechanism minimises $\mathbb{E}_{\mu}[\Delta E+k_B T\Delta H]$. Since $\Delta H\leq 0$ (by Lemma B.1), the information term $k_B T\Delta H\leq 0$ reduces cost. The energy term $\mathbb{E}_{\mu}[\Delta E]$ could be positive (mechanism draws energy from environment) or negative (releases energy). However, $\Delta E>0$ for all $\varphi$ would mean the mechanism always draws energy, increasing cost; the minimum-cost mechanism prefers $\Delta E\leq 0$ on average. Combined with the open-system bound $|E_f-E_{\varphi}|\leq E_{\mathrm{ref}}$ (A3), the average gives $a_{\alpha^*}=\sup E_f/E_{\varphi}\leq 1+E_{\mathrm{ref}}/E_{\min}$, which is bounded; and for mechanisms with $\mathbb{E}_{\mu}[\Delta E]\leq 0$ (energy-neutral or releasing), $a_{\alpha^*}\leq 1$.

$a_{\alpha^*}<1$ below $E_c$: At $E_c$, the definition of the critical threshold (Theorem [S8.1](https://arxiv.org/html/2606.07563#A8.Thmtheorem1)(ii)) requires $c_{\alpha^*}(E_c)=1$, i.e. $\max(a_{\alpha^*},b_{\alpha^*})=1$. Since $b_{\alpha^*}<1$ (proved above), we must have $a_{\alpha^*}(E_c)=1$ (the energy component saturates at $E_c$). Below $E_c$: the budget constraint $E<E_c$ excludes energy-neutral mechanisms, forcing $\mathbb{E}_{\mu}[\Delta E]<0$, hence $a_{\alpha^*}<1$. ∎

###### Corollary B.3 (Contraction constants below $E_{c}$).

For $E<E_{c}$: $c_{\alpha^{*}}=\max(a_{\alpha^{*}},b_{\alpha^{*}})<1$, and the Banach contraction (Theorem [S8.1](https://arxiv.org/html/2606.07563#A8.Thmtheorem1)(iii)) applies with this constant.
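As a purely illustrative sketch of what a contraction constant $c=\max(a,b)<1$ implies, the snippet below iterates a toy affine map on a pair (energy, information). The map `toy_map`, its offsets, and the values of `a` and `b` are assumptions made for this example and are not the HEF generator. Under the max-metric the distance between two trajectories shrinks by a factor of at most $c$ per step, and both trajectories approach the same fixed point regardless of where they start.

```python
def contraction_constant(a: float, b: float) -> float:
    """c = max(a, b), the form used in Corollary B.3."""
    return max(a, b)


def toy_map(state, a, b, offset=(1.0, 1.0)):
    """Hypothetical affine map: scales energy by a and information by b."""
    energy, info = state
    return (a * energy + offset[0], b * info + offset[1])


def max_dist(s1, s2):
    """Max-metric on (energy, information) pairs."""
    return max(abs(s1[0] - s2[0]), abs(s1[1] - s2[1]))


a, b = 0.9, 0.5                      # illustrative coefficients, both < 1
c = contraction_constant(a, b)
x, y = (10.0, 3.0), (-4.0, 8.0)      # two different initial conditions

ratios = []
for _ in range(50):
    d_before = max_dist(x, y)
    x, y = toy_map(x, a, b), toy_map(y, a, b)
    ratios.append(max_dist(x, y) / d_before)

# Fixed point of the toy map: offset / (1 - coefficient) per coordinate.
fixed_point = (1.0 / (1.0 - a), 1.0 / (1.0 - b))
print("largest per-step ratio:", max(ratios), "bound c:", c)
print("distance of y to fixed point:", max_dist(y, fixed_point))
```

The per-step ratio never exceeds `c`, which is the property the Banach fixed-point argument relies on.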

## Appendix C Reproducibility Package

- `hef_grok_exp_v3.py`: PyTorch training script for grokking experiments. Full-batch AdamW, 2-layer transformer, checkpoint/resume, per-step gradient and weight-norm logging. Runs on GPU (recommended) or CPU.
- `hef_grok_analysis_v3.py`: Statistical analysis script. Reproduces all figures and statistical tests in Section [8.1.3](https://arxiv.org/html/2606.07563#S8.SS1.SSS3).
- `results/grokking_results.csv`: Per-run summary for all 90 experiments ($p$, frac, $\lambda$, seed, $\Delta t$, mem_step, final/peak accuracy, wall time).
- `results/Temp/`: Per-run time series: accuracy curves, gradient energy ($\|\nabla\mathcal{L}\|^{2}$ every 10 steps), weight norm ($\|w\|^{2}$ every 10 steps), training loss. 89 runs $\times$ 4 series $=356$ CSV files.
- `hef_causal_emergence_proof.tex`: Standalone LaTeX source for Theorem [6.1](https://arxiv.org/html/2606.07563#S6.Thmtheorem1) with full proof.

##### Environment\.

Python 3.11, PyTorch 2.12, CUDA 13 (experiments run on NVIDIA RTX 4000 Ada Generation). Install: `pip install torch numpy pandas scipy matplotlib`.

##### Reproducing the experiments\.

```
# Quick sanity check (~5 min on GPU):
python hef_grok_exp_v3.py --quick

# Full 90-run experiment (~6-8h on RTX 4000 Ada):
python hef_grok_exp_v3.py

# Analysis and figures:
python hef_grok_analysis_v3.py \
  --input D:/Colab-local/hef_grok_TIMESTAMP/results/grokking_results.csv
```

The script automatically resumes from checkpoints if interrupted\. All Temp data is preserved and never deleted\.

Supplementary Information: Complete Proofs for the Hierarchical Emergence Framework

## Appendix S1 Summary of Assumptions

###### Assumption 8 (A1: Physical Primitives).

Every primitive $r_{i}\in R^{(1)}$ satisfies the physical constraint set $\mathcal{P}=(\mathcal{P}_{\mathrm{thermo}},\mathcal{P}_{\mathrm{info}})$.

###### Assumption 9 (A2: Physical Negation).

Axiom N holds for all levels $k$.

###### Assumption 10 (A3: Interaction Regularity).

Conjunctions are governed by Definition 2.5 of the main text. The interaction energy $\Delta E_{\varphi\psi}$ is Lipschitz in atomic energies with constant $\Lambda_{E}\leq 1$.

###### Assumption 11 (A4: Feasibility-Preserving Generation).

The generation rule $\mathcal{G}$ has range restricted to $\mathcal{A}^{*}$.

###### Assumption 12 (A5: P-Determined Cost).

$\mathrm{cost}(\alpha)$ depends only on $\alpha$ and $\mathcal{P}$, not on $R^{(1)}$, $\mathcal{A}_{0}$, or $\mathcal{G}$.

###### Assumption 13 (A6: Metric Contraction).

For $E<E_{c}$, the generator map $T_{k}$ satisfies $d_{H}(T_{k}(R_{1}),T_{k}(R_{2}))\leq c\cdot d_{H}(R_{1},R_{2})$ for some $c\in(0,1)$ independent of $R_{1},R_{2}$. This property is empirically verifiable (Section [S7](https://arxiv.org/html/2606.07563#A7)).

###### Assumption 14 (G1: Gradient Energy Decay).

$E_{\mathrm{step}}(t)=E_{0}/(1+\lambda t)$.
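The decay law in G1 can be inverted in closed form. As a minimal sketch, assuming the transition is identified with the step at which $E_{\mathrm{step}}(t)$ first falls to a threshold $E_{c}$, solving $E_{0}/(1+\lambda t_{c})=E_{c}$ gives $t_{c}=(E_{0}/E_{c}-1)/\lambda$. The function names and the numerical values below are hypothetical and chosen only for illustration.

```python
def step_energy(t: float, e0: float, lam: float) -> float:
    """Gradient energy under G1: E_step(t) = E0 / (1 + lam * t)."""
    return e0 / (1.0 + lam * t)


def crossing_time(e0: float, e_c: float, lam: float) -> float:
    """Step t_c at which E_step(t_c) = e_c, obtained by inverting G1."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if not 0 < e_c <= e0:
        raise ValueError("need 0 < e_c <= e0 for a crossing at t >= 0")
    return (e0 / e_c - 1.0) / lam


# Hypothetical values, not taken from the experiments.
t_c = crossing_time(e0=1.0, e_c=0.01, lam=1e-3)
print("crossing step:", t_c)
print("energy at crossing:", step_energy(t_c, e0=1.0, lam=1e-3))
```

Because $t_{c}$ scales as $1/\lambda$, halving the decay constant doubles the crossing time in this sketch.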

###### Assumption 15 (G2: Circuit Assembly Time, Revised).

$t_{\mathrm{conv}}\propto 1/(\mathrm{frac}\cdot p\cdot\lambda)$, consistent with $\beta=-1.39\pm 0.20$ across $p\in\{23,\ldots,97\}$.

###### Assumption 16 (NDA: Non-Degeneracy Assumption).

$H_{\mu}(f^{(k)}_{\alpha^{*}}(R^{(k)}))\geq I_{\mu}(R^{(k)};T^{\mathrm{pre}}_{k}(R^{(k)}))$.

## Appendix S2 Flow of Proofs

[Figure: proof-dependency flowchart. Nodes: A1–A4; Prop 3.1: $\Phi$ Isomorphism; Thm 4.1: Physical Feasibility; Lemma 3.3: Compression Coefficients; Thm 5.4(ii): Existence of $E_{c}$; A6: Metric Contraction; Thm 5.4(iii): Energy-Diversity; Cor 5.5: Universal Convergence; Thm 6.1: Causal Emergence; Prop 7.1: Grokking Delay.]

Color code: Blue = theorems/corollaries; Green = lemmas; Dark blue = assumptions A1–A4; Red = A6 (key technical condition).

## Appendix S3 Notation and Preliminaries

### S3.1 Physical Attribute Space

###### Definition S3.1 (Physical Attribute Space).

Each primitive $r^{(k)}_{i}$ carries a triple of non-negative real numbers $(E_{i},S_{i},H_{i})\in\mathbb{R}^{3}_{\geq 0}$, where $E_{i}$ is energy (Joules), $S_{i}$ is thermodynamic entropy (J/K), and $H_{i}$ is Shannon information content (bits).

###### Definition S3.2 (Physical Metric).

For $a=(E_{1},S_{1},H_{1})$ and $b=(E_{2},S_{2},H_{2})$,

$$d(a,b)=\frac{|E_{1}-E_{2}|+k_{B}|S_{1}-S_{2}|+k_{B}\ln 2\cdot|H_{1}-H_{2}|}{E_{\mathrm{ref}}},$$

where $k_{B}=1.380649\times 10^{-23}\,\text{J/K}$ is Boltzmann's constant, and $E_{\mathrm{ref}}>0$ is the reference energy from A3.
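A direct transcription of Definition S3.2 may help when checking values by hand. The sketch below is a minimal illustration; the function name `physical_metric` and the sample triples are assumptions introduced here, not part of the released scripts.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K, as quoted in Definition S3.2


def physical_metric(a, b, e_ref):
    """d(a, b) from Definition S3.2 for attribute triples (E, S, H)."""
    if e_ref <= 0:
        raise ValueError("e_ref must be positive")
    (e1, s1, h1), (e2, s2, h2) = a, b
    numerator = (abs(e1 - e2)
                 + K_B * abs(s1 - s2)
                 + K_B * math.log(2) * abs(h1 - h2))
    return numerator / e_ref


# Hypothetical attribute triples, chosen only for illustration.
a = (1.0, 0.0, 0.0)
b = (3.0, 0.0, 0.0)
c = (2.0, 5.0, 8.0)

d_ab = physical_metric(a, b, e_ref=2.0)
print("d(a, b) =", d_ab)
print("d(a, c) + d(c, b) =",
      physical_metric(a, c, e_ref=2.0) + physical_metric(c, b, e_ref=2.0))
```

Since the numerator is a weighted $\ell^{1}$ distance on triples, symmetry and the triangle inequality follow term by term.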

### S3.2 Formula Metric

###### Definition S3.3 (Formula Metric).

For formulas $\varphi=\varphi(r_{i_{1}},\ldots,r_{i_{m}})$ and $\psi=\psi(s_{j_{1}},\ldots,s_{j_{m}})$ in $\mathcal{L}(R^{(k-1)})$ with the same logical structure matched by bijection $\sigma$:

$$d_{\mathcal{L}}(\varphi,\psi)=\max_{1\leq\ell\leq m}d(r_{i_{\ell}},s_{j_{\sigma(\ell)}}).$$

For formulas of different logical structure, $d_{\mathcal{L}}=+\infty$.
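The following sketch illustrates Definition S3.3 under simplifying assumptions: a formula is represented as a hypothetical pair `(structure_tag, atoms)`, the bijection $\sigma$ is a list of indices, and the atom metric `d` is passed in as a function. A plain $\ell^{1}$ distance stands in for the physical metric here.

```python
import math


def formula_metric(phi, psi, sigma, d):
    """d_L(phi, psi): worst-case atom distance under the matching sigma.

    phi, psi: (structure_tag, list_of_atoms); sigma[l] is the index in psi
    matched to atom l of phi. Returns +inf if the structures differ.
    """
    structure_phi, atoms_phi = phi
    structure_psi, atoms_psi = psi
    if structure_phi != structure_psi or len(atoms_phi) != len(atoms_psi):
        return math.inf
    return max(d(atoms_phi[l], atoms_psi[sigma[l]])
               for l in range(len(atoms_phi)))


def l1(a, b):
    """Stand-in atom metric (assumption): plain l1 distance on triples."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical formulas with two atoms each.
phi = ("and", [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)])
psi = ("and", [(2.5, 0.0, 0.0), (1.0, 0.0, 0.0)])
sigma = [1, 0]  # atom 0 of phi matches atom 1 of psi, and vice versa

d_same = formula_metric(phi, psi, sigma, l1)
d_diff = formula_metric(phi, ("or", psi[1]), sigma, l1)
print(d_same, d_diff)
```

Returning infinity for mismatched structures keeps formulas of different shape in separate components of the metric space.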

### S3.3 Hausdorff Metric

###### Definition S3.4 (Hausdorff Metric).

Let $\Omega^{(k)}$ denote the space of non-empty compact subsets of $\mathcal{P}$-feasible level-$k$ entities. For $R_{1},R_{2}\in\Omega^{(k)}$,

$$d_{H}(R_{1},R_{2})=\max\Bigl\{\sup_{r\in R_{1}}\inf_{s\in R_{2}}d(r,s),\;\sup_{s\in R_{2}}\inf_{r\in R_{1}}d(r,s)\Bigr\}.$$
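For finite sets the suprema and infima in Definition S3.4 become maxima and minima, so $d_{H}$ can be evaluated directly. The sketch below assumes finite inputs and uses one-dimensional points with the absolute difference as a stand-in for the metric $d$; both choices are illustrative.

```python
def hausdorff(r1, r2, d):
    """Hausdorff distance between two finite non-empty sets under metric d."""
    if not r1 or not r2:
        raise ValueError("both sets must be non-empty")
    # Farthest any point of r1 is from its nearest neighbour in r2 ...
    forward = max(min(d(r, s) for s in r2) for r in r1)
    # ... and the same with the roles of r1 and r2 swapped.
    backward = max(min(d(r, s) for r in r1) for s in r2)
    return max(forward, backward)


def abs_diff(x, y):
    """Stand-in metric (assumption): absolute difference on the real line."""
    return abs(x - y)


# Hypothetical finite sets.
set_a = [0.0, 1.0]
set_b = [0.0, 3.0]
d_h = hausdorff(set_a, set_b, abs_diff)
print("d_H =", d_h)
```

Here the forward term is 1 (the point 1 is at distance 1 from 0) and the backward term is 2 (the point 3 is at distance 2 from 1), so the two directions must both be evaluated.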

###### Lemma S3.1 (Completeness of $\Omega^{(k)}$).

$(\Omega^{(k)},d_{H})$ is a complete metric space.

###### Proof.

Step 1. $(\mathbb{R}^{3}_{\geq 0},d)$ is complete: it is a closed subset of the Banach space $(\mathbb{R}^{3},\|\cdot\|_{1})$; limits of non-negative sequences are non-negative.

Step 2. The Hausdorff metric space of non-empty compact subsets of any complete metric space is itself complete (Hausdorff 1914; Munkres 2000, Theorem 45.1).

Step 3. Each $R^{(k)}$ is finite by induction. $R^{(1)}$ is finite by definition. For the inductive step, we restrict to *depth-bounded formulas* $\mathcal{L}_{d}(R^{(k-1)})$: formulas whose parse tree has depth $\leq d_{\max}$. The maximum depth is determined by physical constraints: by Landauer's principle, each logical operation (conjunction, negation, etc.) dissipates at least $\Delta E_{\min}=k_{B}T\ln 2$ of energy. A formula of depth $d$ requires at least $d\cdot\Delta E_{\min}$ energy to evaluate. Since the reference energy $E_{\mathrm{ref}}$ bounds the total energy available (A3), any formula with depth $d>E_{\mathrm{ref}}/\Delta E_{\min}$ is physically unrealizable. Hence $d_{\max}=\lfloor E_{\mathrm{ref}}/\Delta E_{\min}\rfloor$.

Under this restriction, $|\mathcal{L}_{d}(R^{(k-1)})|\leq(2|R^{(k-1)}|)^{2^{d_{\max}}}$, which is finite because $d_{\max}$ is finite. Finite sets are compact.

Step 4. $\mathcal{P}$-feasibility is defined by a finite set of closed inequalities (P1–P6), so $\Omega^{(k)}$ is closed. Closed subsets of complete metric spaces are complete. ∎
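The depth bound in Step 3 can be made concrete with a few lines of arithmetic. The sketch below evaluates $\Delta E_{\min}=k_{B}T\ln 2$, $d_{\max}=\lfloor E_{\mathrm{ref}}/\Delta E_{\min}\rfloor$, and the counting bound $(2|R^{(k-1)}|)^{2^{d_{\max}}}$. The temperature, reference energy, and number of primitives are hypothetical values picked so that the bound stays small enough to print; for macroscopic $E_{\mathrm{ref}}$ the bound is finite but astronomically large.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K


def landauer_cost(temperature_k: float) -> float:
    """Minimum dissipation per logical operation: k_B * T * ln 2 (Joules)."""
    return K_B * temperature_k * math.log(2)


def max_depth(e_ref: float, temperature_k: float) -> int:
    """d_max = floor(E_ref / Delta E_min), as in Step 3 of the proof."""
    return math.floor(e_ref / landauer_cost(temperature_k))


def formula_count_bound(n_primitives: int, d_max: int) -> int:
    """Upper bound (2 |R|)^(2^d_max) on the number of depth-bounded formulas."""
    return (2 * n_primitives) ** (2 ** d_max)


# Hypothetical inputs: room temperature and a tiny reference energy.
depth = max_depth(e_ref=1e-20, temperature_k=300.0)
bound = formula_count_bound(n_primitives=4, d_max=depth)
print("Landauer cost at 300 K:", landauer_cost(300.0), "J")
print("d_max =", depth, " formula bound =", bound)
```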

## Appendix S4 Physical Foundation: The Translation Map $\Phi$

###### Proposition S4.1 (Constraint Lattice Isomorphism).

The map $\Phi:\mathcal{P}_{\mathrm{thermo}}\to\mathcal{P}_{\mathrm{info}}$ defined by $P_{1}\mapsto P_{4}$, $P_{2}\mapsto P_{5}$, $P_{3}\mapsto P_{6}$ is an order-isomorphism of constraint lattices. Consequently, $\mathcal{A}^{*}_{\mathrm{thermo}}=\mathcal{A}^{*}_{\mathrm{info}}=:\mathcal{A}^{*}$.

###### Proof.

We verify each correspondence as a logical equivalence.

$P_{1}\leftrightarrow P_{4}$ (Landauer–Bennett). Landauer's principle (Landauer 1961): erasing $\Delta H$ bits requires energy $\geq k_{B}T\ln 2\cdot\Delta H$. For mechanism $f_{\alpha}$: $\Delta E_{\alpha}+k_{B}T\Delta H_{\alpha}\geq 0$. Bennett (1982): violating $I(f_{\alpha}(\varphi);\varphi)\leq H(\varphi)$ creates information without energetic cost, violating the above. Hence $\alpha\in\mathcal{A}^{*}_{P_{1}}\Leftrightarrow\alpha\in\mathcal{A}^{*}_{P_{4}}$.

$P_{2}\leftrightarrow P_{5}$ (Jarzynski–Gibbs). The Jarzynski equality (Jarzynski 1997) $\langle e^{-\beta W}\rangle=e^{-\beta\Delta F}$ combined with Jensen's inequality gives $\langle W\rangle\geq\Delta F=\Delta E-T\Delta S_{\mathrm{total}}$. For $W=0$: $\Delta S_{\mathrm{total}}\geq 0$ (P2). Under the canonical Gibbs measure, $S_{\mathrm{Gibbs}}=k_{B}\ln 2\cdot H(\mu)$ (Jaynes 1957), so $H(\varphi|f_{\alpha}(\varphi))\geq 0$ (P5) is equivalent to $\Delta S_{\mathrm{total}}\geq 0$.

$P_{3}\leftrightarrow P_{6}$ (Markov–DPI). Any cascade $r_{i}\to r_{j}\to r_{k}$ forms a Markov chain by construction. The Data Processing Inequality (Cover & Thomas 2006, Thm. 2.8.1) gives $I(r_{i};r_{k})\leq I(r_{i};r_{j})$ (P6). Violation would create information across the cascade without energy cost, violating P1.

Order-isomorphism. Each $P_{i}\leftrightarrow\Phi(P_{i})$ is a logical equivalence. The partial order $P_{i}\leq P_{j}$ iff every $P_{i}$-satisfying process also satisfies $P_{j}$ is preserved. Bijectivity is immediate from $\{P_{1},P_{2},P_{3}\}\leftrightarrow\{P_{4},P_{5},P_{6}\}$. ∎
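The Data Processing Inequality used for $P_{3}\leftrightarrow P_{6}$ can be checked numerically on a small discrete cascade. The sketch below builds a Markov chain $X\to Y\to Z$ from two hypothetical binary channels and compares $I(X;Z)$ with $I(X;Y)$; the channel matrices and the input distribution are assumptions chosen for illustration only.

```python
import math


def mutual_information(joint):
    """I(A;B) in bits from a joint probability table joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    return sum(p * math.log2(p / (p_a[i] * p_b[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)


def joint_through_channel(p_in, channel):
    """Joint table of (input, output) for a row-stochastic channel matrix."""
    return [[p_in[i] * channel[i][j] for j in range(len(channel[0]))]
            for i in range(len(p_in))]


def compose(ch1, ch2):
    """Channel of the cascade: first ch1, then ch2 (matrix product)."""
    return [[sum(ch1[i][k] * ch2[k][j] for k in range(len(ch2)))
             for j in range(len(ch2[0]))]
            for i in range(len(ch1))]


# Hypothetical cascade X -> Y -> Z over binary alphabets.
p_x = [0.5, 0.5]
x_to_y = [[0.9, 0.1], [0.2, 0.8]]
y_to_z = [[0.7, 0.3], [0.4, 0.6]]

i_xy = mutual_information(joint_through_channel(p_x, x_to_y))
i_xz = mutual_information(joint_through_channel(p_x, compose(x_to_y, y_to_z)))
print("I(X;Y) =", i_xy, " I(X;Z) =", i_xz)
```

Any choice of row-stochastic matrices yields $I(X;Z)\leq I(X;Y)$, since the second channel acts on $Y$ alone.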

## Appendix S5 Physical Feasibility Theorem

###### Theorem S5.1 (Physical Feasibility of Emergence).

Under A1–A4, for all $k\geq 1$ and all $r^{(k)}\in R^{(k)}$, $r^{(k)}\models\mathcal{P}_{\mathrm{thermo}}$ and $r^{(k)}\models\mathcal{P}_{\mathrm{info}}$ simultaneously via $\Phi$.

###### Proof.

Strong induction on $k$. Base case $k=1$: immediate from A1.

Inductive step: assume all $r^{(j)}\in R^{(j)}$, $j\leq k-1$, satisfy $\mathcal{P}$. We prove by structural induction on $\varphi\in\mathcal{L}_{d}(R^{(k-1)})$ that $\varphi\models\mathcal{P}$.

Atomic: $r^{(k-1)}_{i}\models\mathcal{P}$ by inductive hypothesis.

Negation: $r^{(k-1)\perp}_{i}\models\mathcal{P}$ by A2 (Axiom N1).

Conjunction $\varphi\wedge\psi$: Thermodynamic branch: P1 holds by A3 (energy conservation by construction); P2 by $\Delta G_{\varphi\psi}\leq 0$ (A3 selects thermodynamically favourable interactions); P3 inherited. Information-theoretic branch: P4 by subadditivity of entropy; P5 by $H(\varphi|\psi)\geq 0$ (chain rule); P6 by inductive hypothesis on causal orderings. Consistency: Proposition [S4.1](https://arxiv.org/html/2606.07563#A4.Thmtheorem1) gives $\mathcal{P}_{\mathrm{thermo}}\Leftrightarrow\mathcal{P}_{\mathrm{info}}$.

Disjunction: $\varphi\vee\psi\equiv\neg(\neg\varphi\wedge\neg\psi)$; follows from negation and conjunction cases.

Implication: $\varphi\Rightarrow\psi\equiv\neg\varphi\vee\psi$.

Causal ordering $\varphi\to\psi$: By definition imposes a Markov chain structure, preserving P6; other constraints follow from components.

Mechanism application: $r^{(k)}=f^{(k-1)}_{\alpha}(\varphi)$ with $\alpha\in\mathcal{A}^{*}$ (by A4). $f^{(k-1)}_{\alpha}$ preserves $\mathcal{P}$ by Definition 2.7 of the main text. ∎

## Appendix S6 Compression Coefficients

###### Definition S6.1 (Non-trivial, Non-injective Mechanism).

$f_{\alpha}$ is *non-trivial* if $\exists\varphi$ with $f_{\alpha}(\varphi)\neq\varphi$; *non-injective* if $\exists\varphi_{1}\neq\varphi_{2}$ with $f_{\alpha}(\varphi_{1})=f_{\alpha}(\varphi_{2})$.

###### Lemma S6.1 (Compression Coefficients).

Let $\alpha^{*}=\arg\min_{\alpha\in\mathcal{A}^{*}}\mathrm{cost}(\alpha)$ be non-trivial and non-injective. Then $b_{\alpha^{*}}:=\sup_{\varphi\models\mathcal{P}}H(f_{\alpha^{*}}(\varphi))/H(\varphi)<1$ and $a_{\alpha^{*}}:=\sup_{\varphi\models\mathcal{P}}E_{f_{\alpha^{*}}(\varphi)}/E_{\varphi}<1$ for $E<E_{c}$.

###### Proof.

$b_{\alpha^{*}}<1$: DPI (P6) gives $H(f_{\alpha^{*}}(\varphi))\leq H(\varphi)$, so $b_{\alpha^{*}}\leq 1$. Non-injectivity gives $\varphi_{A}\neq\varphi_{B}$ with $f_{\alpha^{*}}(\varphi_{A})=f_{\alpha^{*}}(\varphi_{B})=:r^{*}$.

Consider the random variable $\Phi$ that equals $\varphi_{A}$ with probability $p$ and $\varphi_{B}$ with probability $1-p$. Since $f_{\alpha^{*}}(\varphi_{A})=f_{\alpha^{*}}(\varphi_{B})=r^{*}$, the Markov chain $\Phi\to r^{*}\to\Phi$ holds. By the Data Processing Inequality applied twice:

$$I(\Phi;\Phi)\geq I(\Phi;r^{*})\geq I(r^{*};r^{*})=H(r^{*}).$$

But $I(\Phi;\Phi)=H(\Phi)\leq\min(H(\varphi_{A}),H(\varphi_{B}))$ (the entropy of a mixture is at most the maximum of the individual entropies, which is bounded by the minimum when one has larger entropy). Hence $H(r^{*})\leq\min(H(\varphi_{A}),H(\varphi_{B}))$.

If $H(\varphi_{A})>H(\varphi_{B})$, then $H(r^{*})\leq H(\varphi_{B})<H(\varphi_{A})$. If $H(\varphi_{A})=H(\varphi_{B})=:h>0$, then $H(r^{*})\leq h$, and since $\varphi_{A}\neq\varphi_{B}$ under the canonical measure, the inequality is strict: $H(r^{*})<h$. In either case, $H(f_{\alpha^{*}}(\varphi_{A}))<H(\varphi_{A})$.

Full support of $\mu$ gives positive weight to this pair, hence $\mathbb{E}_{\mu}[\Delta H_{\alpha^{*}}]<0$, forcing $b_{\alpha^{*}}<1$.

$a_{\alpha^{*}}<1$ for $E<E_{c}$: The minimum-cost mechanism minimises $\mathbb{E}_{\mu}[\Delta E+k_{B}T\Delta H]$. Since $\Delta H\leq 0$ (from above), the information term reduces cost. Energy-neutral mechanisms ($\mathbb{E}[\Delta E]=0$) would achieve $a=1$, but they require $E\geq E_{c}$ to be affordable (by definition of $E_{c}$ as the inflection point where the rate of new mechanisms is maximised). For $E<E_{c}$, only mechanisms with $\mathbb{E}_{\mu}[\Delta E]<0$ are in $\mathcal{A}^{*}(E)$, giving $a_{\alpha^{*}}<1$. ∎
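The entropy reduction caused by a non-injective map can be seen in a small worked case. As a sketch under stated assumptions, take a uniform distribution over all pairs $(x,y)$ with $x,y\in\{0,\ldots,4\}$ and the map $(x,y)\mapsto(x+y)\bmod 5$; this toy map and the uniform input are illustrative choices, and the printed ratio is the entropy ratio for this one distribution, not the supremum that defines $b_{\alpha^{*}}$.

```python
import math
from collections import defaultdict


def entropy(dist):
    """Shannon entropy in bits of a dict {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def pushforward(dist, f):
    """Distribution of f(X) when X is distributed according to dist."""
    out = defaultdict(float)
    for outcome, prob in dist.items():
        out[f(outcome)] += prob
    return dict(out)


modulus = 5  # hypothetical small modulus
pairs = [(x, y) for x in range(modulus) for y in range(modulus)]
uniform = {pair: 1.0 / len(pairs) for pair in pairs}

# Non-injective map: all pairs with the same modular sum collapse together.
image = pushforward(uniform, lambda xy: (xy[0] + xy[1]) % modulus)

h_in = entropy(uniform)
h_out = entropy(image)
ratio = h_out / h_in
print("H(input) =", h_in, " H(output) =", h_out, " ratio =", ratio)
```

Each output value has five preimages, so the output entropy is $\log_{2}5$ against an input entropy of $\log_{2}25$, a ratio of one half.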

## Appendix S7 Metric Contraction in ML Instantiations

This section provides a complete, rigorous derivation of A6 for machine learning instantiations. We give two complementary arguments:

1. (A) *Structural argument* (primary): the grokked Fourier circuit is a monotone-compressive projection $\Rightarrow$ A6 from Proposition [3.7](https://arxiv.org/html/2606.07563#S3.Thmtheorem7) of the main text. No spectral normalisation or PL condition required.
2. (B) *Dynamical argument* (explicit constant): gradient descent with weight decay $\lambda>0$ drives perturbations around the fixed point to zero exponentially, yielding the explicit constant $c_{\alpha^{*}}\approx 1-\eta\lambda<1$.

Both arguments are self-contained and mutually reinforcing. The structural argument establishes that $c_{\alpha^{*}}<1$ at the grokked fixed point; the dynamical argument provides a quantitative lower bound on the contraction gap $1-c_{\alpha^{*}}$ in terms of training hyperparameters.

### S7.1 Background: Lipschitz Properties of Neural Network Layers

###### Lemma S7.1 (Lipschitz Bound for Standard Layers).

For $f(x)=\sigma(Wx+b)$ with $C$-Lipschitz activation $\sigma$ ($C=\mathrm{Lip}(\sigma)$):

$$\mathrm{Lip}(f)\;\leq\;C\cdot\|W\|_{2}.$$

###### Proof.

$\|\sigma(Wx_{1}+b)-\sigma(Wx_{2}+b)\|_{2}\leq C\|Wx_{1}-Wx_{2}\|_{2}\leq C\|W\|_{2}\|x_{1}-x_{2}\|_{2}$. The bound is tight (achieved along the leading right singular vector of $W$). ∎
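The bound in Lemma S7.1 can be checked empirically on a small layer. The sketch below uses a hypothetical $2\times 2$ weight matrix with a ReLU activation (which is 1-Lipschitz), computes $\|W\|_{2}$ in closed form from the eigenvalues of $W^{\top}W$, and compares it with the largest output-to-input distance ratio over random input pairs. The weights, bias, and sampling range are assumptions made for illustration.

```python
import math
import random


def spectral_norm_2x2(w):
    """Largest singular value of a 2x2 matrix via eigenvalues of W^T W."""
    (a, b), (c, d) = w
    p = a * a + c * c          # (W^T W)[0][0]
    q = a * b + c * d          # (W^T W)[0][1]
    r = b * b + d * d          # (W^T W)[1][1]
    lam_max = 0.5 * (p + r) + math.sqrt((0.5 * (p - r)) ** 2 + q * q)
    return math.sqrt(lam_max)


def relu_layer(w, bias, x):
    """f(x) = relu(W x + b) for a 2x2 weight matrix."""
    return [max(0.0, w[i][0] * x[0] + w[i][1] * x[1] + bias[i])
            for i in range(2)]


def l2(u, v):
    return math.hypot(u[0] - v[0], u[1] - v[1])


# Hypothetical layer parameters.
w = [[1.5, -0.5], [0.3, 2.0]]
bias = [0.1, -0.2]
bound = 1.0 * spectral_norm_2x2(w)   # C = 1 for ReLU

rng = random.Random(0)
worst = 0.0
for _ in range(1000):
    x1 = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    x2 = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    dist = l2(x1, x2)
    if dist > 0:
        worst = max(worst, l2(relu_layer(w, bias, x1),
                              relu_layer(w, bias, x2)) / dist)

print("empirical ratio:", worst, " bound C * ||W||_2:", bound)
```

Random sampling only gives a lower estimate of the true Lipschitz constant, so the empirical ratio sits at or below the bound.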

###### Lemma S7.2 (Lipschitz Bound for Self-Attention).

Let $\mathrm{Attn}(Q,K,V)=\mathrm{softmax}(QK^{\top}/\sqrt{d_{k}})V$ with $Q=XW_{Q}$, $K=XW_{K}$, $V=XW_{V}$. If $\|W_{Q}\|_{2},\|W_{K}\|_{2},\|W_{V}\|_{2}\leq 1$:

$$\mathrm{Lip}_{\ell^{2}}(\mathrm{Attn})\;\leq\;\frac{\sqrt{d}}{\sqrt{d_{k}}},$$

where $d$ is the sequence embedding dimension. For $d=d_{k}$ (standard): $\mathrm{Lip}_{\ell^{2}}(\mathrm{Attn})\leq 1$.

###### Proof.

The softmax satisfies $\mathrm{Lip}_{\ell^{\infty}}(\mathrm{softmax})\leq 1$ (Gao & Pavel, 2017). The cross-norm bound $\|u-v\|_{2}\leq\sqrt{d}\|u-v\|_{\infty}$ introduces the $\sqrt{d}$ factor; the $\sqrt{d_{k}}$ scaling in the attention formula compensates. Full computation in Kim et al. [kim2021]. ∎

### S7.2 Argument A: Structural Contraction via Monotone Compression

The grokked mechanism $\alpha^{*}$ for modular arithmetic has been characterised by mechanistic interpretability: it implements a Fourier circuit that computes $(x+y)\bmod p$ by projecting representations onto a finite set $K\subset\{1,\ldots,\lfloor p/2\rfloor\}$ of Fourier frequencies \[[41](https://arxiv.org/html/2606.07563#bib.bib41)\].

###### Definition S7.1 (Fourier Projection Mechanism).

The grokked Fourier circuit acts as a rank-$|K|$ projection:

$$f_{\alpha^{*}}(\varphi)=P_{K}\,g(\varphi),$$

where $g:\mathcal{L}(R^{(k-1)})\to\mathbb{R}^{d}$ is the embedding function and $P_{K}=\sum_{k\in K}\bigl(e_{k}^{(\cos)}(e_{k}^{(\cos)})^{\top}+e_{k}^{(\sin)}(e_{k}^{(\sin)})^{\top}\bigr)$ is the orthogonal projection onto the subspace spanned by Fourier basis vectors $\{e_{k}^{(\cos)},e_{k}^{(\sin)}\}_{k\in K}$.

###### Proposition S7.3 (Grokked Circuit is Monotone-Compressive).

The grokked Fourier circuit $f_{\alpha^{*}}$ is monotone-compressive in the sense of Definition [3.3](https://arxiv.org/html/2606.07563#S3.Thmdefinition3) of the main text, with

$$b_{\alpha^{*}}\;\leq\;\frac{2|K|}{d}\;<\;1,\tag{13}$$

where $|K|\ll d/2$ in the overparameterised regime ($d=128$, $|K|\approx 2$ in our experiments).

###### Proof.

Non-injectivity. For distinct inputs $\varphi_{A},\varphi_{B}$ with the same modular sum $(x_{A}+y_{A})\bmod p=(x_{B}+y_{B})\bmod p$: $f_{\alpha^{*}}(\varphi_{A})=P_{K}g(\varphi_{A})$. If $g(\varphi_{A})$ and $g(\varphi_{B})$ have the same projection onto the Fourier subspace, then $f_{\alpha^{*}}(\varphi_{A})=f_{\alpha^{*}}(\varphi_{B})$. This occurs for the $p$ distinct pairs $(x,y)$ satisfying $(x+y)\equiv c\pmod{p}$ for any fixed $c$: all map to the same Fourier representation.

Compression coefficient. $H(f_{\alpha^{*}}(\varphi))$ measures the information in the Fourier projection. The subspace has dimension $2|K|$ in $\mathbb{R}^{d}$, so the projection discards the fraction $(d-2|K|)/d$ of the spectral energy. Under the canonical measure:

$$\frac{H(f_{\alpha^{*}}(\varphi))}{H(\varphi)}\;\leq\;\frac{2|K|}{d},$$

since the Fourier projection is a rank-$(2|K|)$ map. In our experiments: $d=128$, $|K|\approx 2$ (Nanda et al., 2023), giving $b_{\alpha^{*}}\leq 4/128\approx 0.031\ll 1$.

Monotone ordering. $P_{K}$ is an orthogonal projection, so it preserves the ordering of norms: $\|P_{K}v_{1}\|\geq\|P_{K}v_{2}\|$ whenever $\|v_{1}\|\geq\|v_{2}\|$ and $v_{1},v_{2}$ are both in the Fourier subspace (for inputs outside the subspace, the projection can only decrease the norm). Hence the monotone-compressive condition of Definition [3.3](https://arxiv.org/html/2606.07563#S3.Thmdefinition3) is satisfied with $a_{\alpha^{*}},b_{\alpha^{*}}\leq 2|K|/d<1$.

Conclusion. By Proposition [3.7](https://arxiv.org/html/2606.07563#S3.Thmtheorem7) of the main text, $c_{\alpha^{*}}=\max(a_{\alpha^{*}},b_{\alpha^{*}})\leq 2|K|/d<1$. ∎
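The two properties of $P_{K}$ used in the proof, idempotence and norm non-expansion, can be verified numerically. The sketch below builds orthonormal cosine and sine vectors over $\mathbb{Z}_{p}$ for a hypothetical frequency set, applies the projection to an arbitrary test vector, and also evaluates the rank ratio $2|K|/d$ for the quoted values. For simplicity the ambient space here is $\mathbb{R}^{p}$ rather than the $d=128$ embedding space; that choice, the frequencies, and the test vector are assumptions for illustration.

```python
import math


def fourier_basis(p, freqs):
    """Orthonormal cos/sin vectors over Z_p for frequencies 1 <= k < p/2."""
    scale = math.sqrt(2.0 / p)
    basis = []
    for k in freqs:
        basis.append([scale * math.cos(2 * math.pi * k * n / p)
                      for n in range(p)])
        basis.append([scale * math.sin(2 * math.pi * k * n / p)
                      for n in range(p)])
    return basis


def project(basis, v):
    """P_K v = sum over basis vectors e of (e . v) e."""
    out = [0.0] * len(v)
    for e in basis:
        coeff = sum(ei * vi for ei, vi in zip(e, v))
        for i, ei in enumerate(e):
            out[i] += coeff * ei
    return out


def norm(v):
    return math.sqrt(sum(x * x for x in v))


p = 97
basis = fourier_basis(p, freqs=[3, 14])        # hypothetical key frequencies
v = [math.sin(0.37 * n) + 0.1 * n for n in range(p)]  # arbitrary test vector

pv = project(basis, v)
ppv = project(basis, pv)
idempotence_error = max(abs(x - y) for x, y in zip(pv, ppv))
rank_ratio = 2 * 2 / 128                       # 2|K|/d with |K| = 2, d = 128

print("||v|| =", norm(v), " ||P_K v|| =", norm(pv))
print("idempotence error:", idempotence_error, " 2|K|/d =", rank_ratio)
```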

### S7\.3Argument B: Dynamical Contraction near the Fixed Point

Argument A establishescα∗<1c\_\{\\alpha^\{\*\}\}<1from the structure of the grokked representation\. Argument B provides an*explicit formula*forcα∗c\_\{\\alpha^\{\*\}\}in terms of training hyperparameters, valid near the fixed pointW∗W^\{\*\}\.

###### Definition S7\.2\(Fixed\-Point Perturbation\)\.

LetW∗W^\{\*\}be the weight matrix at the grokked fixed point \(the Fourier circuit characterised by Nanda et al\.\[[41](https://arxiv.org/html/2606.07563#bib.bib41)\]\)\. Define the perturbationδ​Wt=Wt−W∗\\delta W\_\{t\}=W\_\{t\}\-W^\{\*\}\.

###### Theorem S7\.4\(Exponential Perturbation Decay\)\.

Assume:

1. \(B1\)The total lossℒtotal=ℒtask\+λ2​‖W‖F2\\mathcal\{L\}\_\{\\mathrm\{total\}\}=\\mathcal\{L\}\_\{\\mathrm\{task\}\}\+\\frac\{\\lambda\}\{2\}\\\|W\\\|\_\{F\}^\{2\}withλ\>0\\lambda\>0\.
2. \(B2\)W∗W^\{\*\}is a local minimum ofℒtotal\\mathcal\{L\}\_\{\\mathrm\{total\}\}with positive\-semidefinite HessianHtask​\(W∗\)≽0H\_\{\\mathrm\{task\}\}\(W^\{\*\}\)\\succcurlyeq 0\.
3. \(B3\)Learning rate satisfiesη≤1/\(λ\+‖Htask​\(W∗\)‖2\)\\eta\\leq 1/\(\\lambda\+\\\|H\_\{\\mathrm\{task\}\}\(W^\{\*\}\)\\\|\_\{2\}\)\.

Then for gradient descent, in a neighbourhood ofW∗W^\{\*\}:

‖δ​Wt\+1‖F≤\(1−η​λ\)​‖δ​Wt‖F\+O​\(‖δ​Wt‖F2\)\.\\\|\\delta W\_\{t\+1\}\\\|\_\{F\}\\;\\leq\\;\(1\-\\eta\\lambda\)\\\|\\delta W\_\{t\}\\\|\_\{F\}\+O\(\\\|\\delta W\_\{t\}\\\|\_\{F\}^\{2\}\)\.Hence‖δ​Wt‖F→0\\\|\\delta W\_\{t\}\\\|\_\{F\}\\to 0exponentially with ratecdyn=1−η​λ∈\(0,1\)c\_\{\\mathrm\{dyn\}\}=1\-\\eta\\lambda\\in\(0,1\)\.

###### Proof.

The gradient descent update from $W_{t} = W^{*} + \delta W_{t}$ gives:

$$\begin{aligned}
\delta W_{t+1} &= W_{t+1} - W^{*} = W_{t} - \eta\nabla\mathcal{L}_{\mathrm{total}}(W_{t}) - W^{*} \\
&= \delta W_{t} - \eta\bigl[\nabla\mathcal{L}_{\mathrm{task}}(W^{*}+\delta W_{t}) + \lambda(W^{*}+\delta W_{t})\bigr].
\end{aligned}$$

By stationarity at $W^{*}$: $\nabla\mathcal{L}_{\mathrm{total}}(W^{*}) = 0$, i.e. $\nabla\mathcal{L}_{\mathrm{task}}(W^{*}) = -\lambda W^{*}$. Taylor expansion:

$$\nabla\mathcal{L}_{\mathrm{task}}(W^{*}+\delta W_{t}) = -\lambda W^{*} + H_{\mathrm{task}}(W^{*})\,\delta W_{t} + O(\|\delta W_{t}\|^{2}).$$

Substituting:

$$\begin{aligned}
\delta W_{t+1} &= \delta W_{t} - \eta\bigl[-\lambda W^{*} + H_{\mathrm{task}}(W^{*})\,\delta W_{t} + O(\|\delta W_{t}\|^{2}) + \lambda W^{*} + \lambda\,\delta W_{t}\bigr] \\
&= \delta W_{t} - \eta\bigl(H_{\mathrm{task}}(W^{*}) + \lambda I\bigr)\delta W_{t} + O(\eta\|\delta W_{t}\|^{2}).
\end{aligned}$$

Since $H_{\mathrm{task}}(W^{*}) \succcurlyeq 0$ (B2) and $\lambda > 0$, all eigenvalues of $H_{\mathrm{task}}(W^{*}) + \lambda I$ are $\geq \lambda > 0$. Under condition (B3), the eigenvalues of $\eta\bigl(H_{\mathrm{task}}(W^{*}) + \lambda I\bigr)$ lie in $[\eta\lambda, 1]$, so $\|I - \eta(H_{\mathrm{task}}(W^{*}) + \lambda I)\|_{2} \leq 1-\eta\lambda < 1$. Hence, taking Frobenius norms:

$$\|\delta W_{t+1}\|_{F} \leq (1-\eta\lambda)\,\|\delta W_{t}\|_{F} + O(\|\delta W_{t}\|_{F}^{2}).$$

For $\|\delta W_{t}\|_{F}$ small enough that the $O(\|\delta W_{t}\|^{2})$ term is negligible, $\|\delta W_{t}\|_{F} \to 0$ exponentially with rate $c_{\mathrm{dyn}} = 1-\eta\lambda$. ∎
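The contraction factor $1-\eta\lambda$ can be checked numerically on a toy problem. The following is a minimal sketch, not the paper's experimental setup: it assumes a diagonal quadratic task loss (so the $O(\|\delta W_t\|^2)$ remainder vanishes and the bound is exact), and the function name, the Hessian entries `h`, the targets `a`, and the starting point are all hypothetical choices made for illustration.

```python
import math

def gd_contraction_ratios(h, a, lam, steps=50):
    """Gradient descent on a diagonal quadratic task loss plus weight decay.

    Task loss:  0.5 * sum_i h[i] * (w[i] - a[i])**2   (Hessian diag(h), PSD if h[i] >= 0)
    Total loss: task loss + 0.5 * lam * ||w||^2
    Returns the per-step ratios ||dw_{t+1}|| / ||dw_t|| and the bound 1 - eta*lam.
    """
    eta = 1.0 / (lam + max(h))                                # largest step allowed by (B3)
    w_star = [hi * ai / (hi + lam) for hi, ai in zip(h, a)]   # stationary point of the total loss
    w = [5.0 for _ in h]                                      # arbitrary start (assumption)
    ratios = []
    for _ in range(steps):
        grad = [hi * (wi - ai) + lam * wi for hi, wi, ai in zip(h, w, a)]
        w_next = [wi - eta * gi for wi, gi in zip(w, grad)]
        before = math.sqrt(sum((wi - si) ** 2 for wi, si in zip(w, w_star)))
        after = math.sqrt(sum((wi - si) ** 2 for wi, si in zip(w_next, w_star)))
        if before == 0.0:
            break
        ratios.append(after / before)
        w = w_next
    return ratios, 1.0 - eta * lam

# One flat direction (h = 0) makes the bound tight: that coordinate decays at exactly 1 - eta*lam.
ratios, c_dyn = gd_contraction_ratios(h=[0.0, 0.5, 2.0], a=[1.0, -2.0, 3.0], lam=0.1)
print(max(ratios) <= c_dyn + 1e-9, c_dyn)
```

In this toy setting the flat direction of the task Hessian is the slowest one, which is why weight decay alone sets the rate.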

###### Corollary S7.5 (Explicit Lipschitz Bound near Convergence).

For $t$ sufficiently large (after the $E_{c}$ crossing):

$$\mathrm{Lip}\bigl(f_{\alpha^{*}}^{(k)}(t)\bigr) \;\leq\; C\cdot\|W^{*}\|_{2} + C\cdot\|\delta W_{t}\|_{2}.$$

Since $\|\delta W_{t}\|_{2} \to 0$ (Theorem [S7.4](https://arxiv.org/html/2606.07563#A7.Thmtheorem4)) and $C\|W^{*}\|_{2} \leq b_{\alpha^{*}} < 1$ (Argument A, Proposition [S7.3](https://arxiv.org/html/2606.07563#A7.Thmtheorem3)), there exists $t_{0}$ such that $\mathrm{Lip}\bigl(f_{\alpha^{*}}^{(k)}(t)\bigr) < 1$ for all $t \geq t_{0}$.
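To make the existence of $t_0$ concrete, one can search for the first step at which the right-hand side of the Lipschitz bound drops below 1. This is only a sketch under stated assumptions: it uses the linearised decay $\|\delta W_t\| \leq (1-\eta\lambda)^t \|\delta W_0\|$ (the quadratic remainder is ignored), and the function name and every numerical value are hypothetical rather than taken from the experiments.

```python
def first_contractive_step(b, C, delta0, eta, lam, max_steps=100000):
    """Smallest t with b + C * delta0 * (1 - eta*lam)**t < 1, or None if not reached.

    b      : stands for the bound on C * ||W*||_2 (must satisfy 0 <= b < 1)
    C      : architecture-dependent constant from the Lipschitz bound
    delta0 : initial perturbation norm ||dW_0||
    """
    if not (0.0 <= b < 1.0):
        raise ValueError("need 0 <= b < 1")
    c_dyn = 1.0 - eta * lam
    if not (0.0 <= c_dyn < 1.0):
        raise ValueError("need 0 <= 1 - eta*lam < 1")
    delta = delta0
    for t in range(max_steps + 1):
        if b + C * delta < 1.0:      # Lipschitz bound is already below 1 at step t
            return t
        delta *= c_dyn               # linearised perturbation decay
    return None

# Hypothetical values: the bound first drops below 1 once 0.9**t < 0.125.
t0 = first_contractive_step(b=0.5, C=1.0, delta0=4.0, eta=0.1, lam=1.0)
print(t0)
```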

### S7.4 Explicit Contraction Constant and Summary

Combining Arguments A and B:

###### Theorem S7.6 (A6 for ML Instantiations).

Let $\mathcal{H}$ be an ML instantiation of HEF (modular arithmetic grokking) trained with AdamW, learning rate $\eta > 0$, weight decay $\lambda > 0$, and a 2-layer transformer architecture. After the $E_{c}$ crossing (i.e. after grokking), the generator map $T_{k}$ is a strict contraction in $(\Omega^{(k)}, d_{H})$ with constant

$$c_{\alpha^{*}} \;=\; \max(a_{\alpha^{*}},\, b_{\alpha^{*}}) \;\leq\; \frac{2|K|}{d} \;<\; 1,$$

where $|K|$ is the number of active Fourier frequencies in the grokked circuit \[[41](https://arxiv.org/html/2606.07563#bib.bib41)\] and $d$ is the embedding dimension. In our experiments ($|K| \approx 2$, $d = 128$): $c_{\alpha^{*}} \leq 2\cdot 2/128 \approx 0.031$.

###### Proof.

Proposition [S7.3](https://arxiv.org/html/2606.07563#A7.Thmtheorem3) establishes $c_{\alpha^{*}} \leq 2|K|/d < 1$ from the structural monotone-compressive argument (A6 from Proposition [3.7](https://arxiv.org/html/2606.07563#S3.Thmtheorem7) of the main text). Theorem [S7.4](https://arxiv.org/html/2606.07563#A7.Thmtheorem4) and Remark [23](https://arxiv.org/html/2606.07563#Thmremark23) confirm that the weights converge to $W^{*}$ and perturbations decay, ensuring the structural constant is attained in the limit. Lemma [5.3](https://arxiv.org/html/2606.07563#S5.Thmtheorem3) of the main text then yields the Hausdorff contraction. A6 is established. ∎
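As a quick arithmetic check of the structural bound $2|K|/d$, the sketch below evaluates it for the reported values ($|K| \approx 2$, $d = 128$) and counts how many applications of a contraction with that constant shrink a distance by a chosen factor. The helper names and the target factor `1e-6` are illustrative assumptions, not quantities from the paper.

```python
def structural_contraction_bound(num_frequencies, embed_dim):
    """Upper bound 2|K|/d on the contraction constant."""
    bound = 2.0 * num_frequencies / embed_dim
    if not bound < 1.0:
        raise ValueError("2|K|/d must be below 1 for a strict contraction")
    return bound

def iterations_to_shrink(c, factor):
    """Smallest n with c**n <= factor, for 0 < c < 1 and 0 < factor < 1."""
    if not (0.0 < c < 1.0 and 0.0 < factor < 1.0):
        raise ValueError("need 0 < c < 1 and 0 < factor < 1")
    n, power = 0, 1.0
    while power > factor:
        power *= c
        n += 1
    return n

c = structural_contraction_bound(num_frequencies=2, embed_dim=128)
n = iterations_to_shrink(c, 1e-6)   # hypothetical target shrink factor
print(c, n)
```

With a constant this small, a handful of applications of the map already shrinks distances by six orders of magnitude.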

Table 2: Summary of conditions for A6 in ML instantiations. All conditions are either provable or empirically verifiable.

### S7.5 P-Stability under Type-Preserving Atom Replacement

###### Lemma S7.7 (P-Stability).

Let $R_{1}, R_{2} \in \Omega^{(k-1)}$ with $d_{H}(R_{1}, R_{2}) = \epsilon < E_{\mathrm{ref}}/2$. For any $\varphi \in \mathcal{L}(R_{1})$ with $\varphi \models \mathcal{P}$, define the coupled formula $\bar{\varphi} \in \mathcal{L}(R_{2})$ by replacing each atom $r \in R_{1}$ with its nearest neighbour $\pi(r) \in R_{2}$ under a type-preserving bijection $\pi$ (if $|R_{1}| \neq |R_{2}|$, pad the smaller set with dummy atoms with $E \to \infty$ so they never appear in $\mathcal{P}$-feasible formulas). Then $\bar{\varphi} \models \mathcal{P}$.

###### Proof.

Structural induction on $\varphi$.

Atomic: $\pi(r_{i_{j}}) \in R_{2}$ satisfies $\mathcal{P}$ by A1.

Physical negation: $\pi(r_{i_{j}})^{\perp}$ satisfies $\mathcal{P}$ by A2.

Admissible conjunction $\varphi \wedge \psi$: By the inductive hypothesis, $\bar{\varphi}, \bar{\psi} \models \mathcal{P}$. By A3 (Lipschitz condition on $\Delta E$):

$$|\Delta E_{\bar{\varphi}\bar{\psi}} - \Delta E_{\varphi\psi}| \leq \Lambda_{E}\bigl(d_{\mathcal{L}}(\varphi, \bar{\varphi}) + d_{\mathcal{L}}(\psi, \bar{\psi})\bigr) \leq 2(\epsilon + \eta) < E_{\mathrm{ref}},$$

so $\bar{\varphi} \wedge \bar{\psi}$ satisfies P1. Here $\eta > 0$ is the coupling slack (not the learning rate of Theorem S7.4), taken small enough that $\epsilon + \eta < E_{\mathrm{ref}}/2$; the second inequality uses $d_{\mathcal{L}} \leq \epsilon + \eta$ for each coupled pair together with $\Lambda_{E} \leq 1$. Other constraints follow by the inductive hypothesis.

Disjunction, implication, causal ordering: follow analogously. ∎

###### Corollary S7.8 (From Lipschitz to Hausdorff Contraction).

Under A1–A6 with $E < E_{c}$:

$$d_{H}(T_{k}(R_{1}), T_{k}(R_{2})) \;\leq\; c_{\alpha^{*}} \cdot d_{H}(R_{1}, R_{2})$$

for all $R_{1}, R_{2} \in \Omega^{(k-1)}$.

###### Proof.

Fix $d_{H}(R_{1}, R_{2}) = \epsilon > 0$ and any $\eta > 0$. By definition of $d_{H}$, there exists a coupling $\pi$ with $d(r, \pi(r)) \leq \epsilon + \eta$ for all $r \in R_{1}$ (with padding if needed). For any $\varphi \in \mathcal{L}(R_{1})$ with $\varphi \models \mathcal{P}$, the coupled formula $\bar{\varphi}$ satisfies $\bar{\varphi} \models \mathcal{P}$ (Lemma [5.2](https://arxiv.org/html/2606.07563#S5.Thmtheorem2)) and $d_{\mathcal{L}}(\varphi, \bar{\varphi}) \leq \epsilon + \eta$. Then by A6:

$$d\bigl(f^{(k)}_{\alpha^{*}}(\varphi), f^{(k)}_{\alpha^{*}}(\bar{\varphi})\bigr) \leq c_{\alpha^{*}} \cdot d_{\mathcal{L}}(\varphi, \bar{\varphi}) \leq c_{\alpha^{*}}(\epsilon + \eta).$$

Taking the sup over $T_{k}(R_{1})$ and the inf over $T_{k}(R_{2})$ (and symmetrically with the roles of $R_{1}$ and $R_{2}$ exchanged), then letting $\eta \to 0$: $d_{H}(T_{k}(R_{1}), T_{k}(R_{2})) \leq c_{\alpha^{*}} \cdot \epsilon$. ∎
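The step from a pointwise Lipschitz bound to a Hausdorff bound can be illustrated on finite sets of real numbers. The sketch below is a toy stand-in, not the generator map $T_k$ on formula sets: it assumes points on the real line, applies a hypothetical $c$-Lipschitz map elementwise, and checks that the Hausdorff distance of the images is at most $c$ times the original distance.

```python
import math

def hausdorff(A, B):
    """Hausdorff distance between two finite, non-empty sets of real numbers."""
    forward = max(min(abs(a - b) for b in B) for a in A)    # sup over A of inf over B
    backward = max(min(abs(a - b) for a in A) for b in B)   # sup over B of inf over A
    return max(forward, backward)

def apply_map(f, S):
    """Image of the finite set S under f."""
    return [f(x) for x in S]

c = 0.25                                # hypothetical contraction constant

def f(x):
    """tanh is 1-Lipschitz, so c * tanh is c-Lipschitz."""
    return c * math.tanh(x)

R1 = [-1.0, 0.2, 0.9, 2.5]              # hypothetical point sets of different sizes
R2 = [-0.8, 0.0, 1.4]
lhs = hausdorff(apply_map(f, R1), apply_map(f, R2))
rhs = c * hausdorff(R1, R2)
print(lhs <= rhs)
```

Both directions of the sup-inf are computed explicitly, mirroring the symmetric definition of $d_H$.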

### S7.6 Open Experimental Protocol: G1-test

## Appendix S8 Energy-Diversity Trade-off and Universal Convergence

###### Theorem S8.1 (Energy-Diversity Trade-off).

Under A1–A6: (i) $|R^{(k)}(E)|$ is non-decreasing in $E$; (ii) there exists $E_{c} > 0$ maximising the marginal gain $\Delta_{j}/(c_{j} - c_{j-1})$; (iii) for $E < E_{c}$, the iterates of $T_{k}$ converge to a unique fixed point $R^{(k)}_{\infty} \in \Omega^{(k)}$.

###### Proof.

(i) $\mathcal{A}^{*}(E) = \{\alpha \in \mathcal{A}^{*} : \mathrm{cost}(\alpha) \leq E\}$ is non-decreasing in $E$ (under set inclusion); so is $|R^{(k)}(E)|$.

(ii) $\mathcal{A}^{*}$ is finite (finite domain $\mathcal{L}(R^{(k-1)})$ and finite codomain $R^{(k)}$). Enumerate distinct cost values as $0 \leq c_{1} < c_{2} < \cdots < c_{N} < \infty$. Let $\Delta_{j} = |\mathcal{A}^{*}(c_{j})| - |\mathcal{A}^{*}(c_{j-1})|$. Define $E_{c} = c_{j^{*}}$ where $j^{*} = \arg\max_{j} \Delta_{j}/(c_{j} - c_{j-1})$. This maximum exists because we maximise over a finite set.

(iii) For $E < E_{c}$, $\mathcal{A}^{*}(E) = \{\alpha^{*}\}$ (only the minimal-cost mechanism is affordable). By Lemma [5.3](https://arxiv.org/html/2606.07563#S5.Thmtheorem3), $T_{k}$ is a strict contraction on the complete space $(\Omega^{(k)}, d_{H})$. By the Banach Fixed-Point Theorem (Banach 1922; Kreyszig 1978), there exists a unique $R^{(k)}_{\infty}$ with $T_{k}(R^{(k)}_{\infty}) = R^{(k)}_{\infty}$, and for any $R_{0} \in \Omega^{(k)}$:

$$d_{H}\bigl(T_{k}^{n}(R_{0}), R^{(k)}_{\infty}\bigr) \leq \frac{c_{\alpha^{*}}^{n}}{1 - c_{\alpha^{*}}} \cdot d_{H}(T_{k}(R_{0}), R_{0}) \to 0.$$

Uniqueness guarantees independence of initial conditions. ∎
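The construction of $E_c$ in part (ii) is a finite argmax and can be sketched directly. In the example below the mechanism costs are invented for illustration, and the maximisation is taken over $j \geq 2$ so that $c_{j-1}$ is defined; that index convention is an assumption of this sketch, since the proof does not spell out the case $j = 1$.

```python
def critical_energy(costs):
    """Return (E_c, gains) where E_c = c_{j*} maximises Delta_j / (c_j - c_{j-1}).

    costs: one cost per admissible mechanism (finite list, at least two distinct values).
    Assumption: the maximisation runs over j >= 2 so that c_{j-1} is defined.
    """
    levels = sorted(set(costs))                       # distinct costs c_1 < c_2 < ... < c_N
    if len(levels) < 2:
        raise ValueError("need at least two distinct cost values")
    affordable = [sum(1 for c in costs if c <= lv) for lv in levels]   # |A*(c_j)|
    gains = {}
    for j in range(1, len(levels)):
        delta = affordable[j] - affordable[j - 1]     # mechanisms newly affordable at c_j
        gains[levels[j]] = delta / (levels[j] - levels[j - 1])
    e_c = max(gains, key=gains.get)
    return e_c, gains

costs = [1.0, 2.0, 2.0, 2.0, 3.5, 4.0]                # hypothetical mechanism costs
e_c, gains = critical_energy(costs)
below = [c for c in costs if c < e_c]                 # mechanisms affordable for E < E_c
print(e_c, below)
```

In this example exactly one mechanism is affordable below the computed threshold, which is the situation part (iii) relies on; for other cost profiles that need not hold automatically.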

###### Corollary S8.2 (Universal Feature Convergence).

Two HEF instances sharing $\mathcal{P}$ and satisfying A1–A6 with $E < E_{c}$ converge to the same $R^{(k)}_{\infty}$, independent of $R^{(1)}$, $\mathcal{A}_{0}$, and $\mathcal{G}$.

###### Proof.

By A5, $\mathrm{cost}(\alpha)$ depends only on $\alpha$ and $\mathcal{P}$, not on $R^{(1)}$, $\mathcal{A}_{0}$, or $\mathcal{G}$. By Proposition [S4.1](https://arxiv.org/html/2606.07563#A4.Thmtheorem1), $\mathcal{A}^{*}$ is determined by $\mathcal{P}$. Hence $\alpha^{*} = \arg\min_{\alpha \in \mathcal{A}^{*}} \mathrm{cost}(\alpha)$ is identical for both instances. For $E < E_{c}$, both use $\alpha^{*}$, so their generator maps coincide: $T_{k,1} = T_{k,2} =: T_{k}$. By Theorem [S8.1](https://arxiv.org/html/2606.07563#A8.Thmtheorem1)(iii), $T_{k}$ has a unique fixed point; both instances converge to it. ∎
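Independence of initial conditions and the a priori Banach estimate $d(T^{n}x_0, x_\infty) \leq \frac{c^{n}}{1-c}\, d(Tx_0, x_0)$ can be seen on a scalar toy map. This sketch assumes the real line with the absolute-value metric in place of $(\Omega^{(k)}, d_H)$, and the map `T`, the constant `c`, and the two starting points are hypothetical.

```python
import math

def iterate(T, x0, n):
    """Apply T to x0 exactly n times."""
    x = x0
    for _ in range(n):
        x = T(x)
    return x

c = 0.5                                 # hypothetical contraction constant

def T(x):
    """|T'(x)| = c * |sin x| <= c, so T is a c-contraction on the real line."""
    return c * math.cos(x)

x_inf = iterate(T, 0.0, 200)            # numerical stand-in for the fixed point
starts = [-10.0, 3.0]                   # two unrelated initial conditions

def a_priori_bound(x0, n):
    """Right-hand side of the Banach a priori estimate."""
    return c ** n / (1.0 - c) * abs(T(x0) - x0)

checks = [
    abs(iterate(T, x0, n) - x_inf) <= a_priori_bound(x0, n) + 1e-12
    for x0 in starts for n in range(30)
]
limits = [iterate(T, x0, 200) for x0 in starts]
print(all(checks), limits)
```

Both trajectories end at the same point, which is the scalar analogue of two instances sharing the same generator map.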

## Appendix S9 Causal Emergence at the Fixed Point

### S9.1 Effective Information

###### Definition S9.1 (Effective Information).

$\mathrm{EI}_{k} = H_{\mu}\bigl(T_{k}(R^{(k)})\bigr) - H_{\mu}\bigl(T_{k}(R^{(k)}) \mid R^{(k)}\bigr)$, where $\mu$ is the maximum-entropy distribution over $\Omega^{(k)}$.
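For a finite state space the definition can be evaluated directly from a transition matrix, with $\mu$ taken to be uniform. The sketch below is an illustration under that assumption: the two $4 \times 4$ matrices are hypothetical, one deterministic and one in which each input is split evenly between two outputs as a stand-in for competing mechanisms.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(q * math.log2(q) for q in p if q > 0.0)

def effective_information(P):
    """Return (EI, noise) for a row-stochastic matrix P under a uniform input.

    EI    = H(output) - H(output | input)
    noise = H(output | input), the conditional-entropy term of the definition
    """
    n = len(P)
    out = [sum(P[i][j] for i in range(n)) / n for j in range(len(P[0]))]   # output marginal
    noise = sum(entropy(row) for row in P) / n
    return entropy(out) - noise, noise

# Hypothetical 4-state examples.
deterministic = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
competing = [[0.5, 0.5, 0, 0], [0, 0.5, 0.5, 0], [0, 0, 0.5, 0.5], [0.5, 0, 0, 0.5]]

ei_det, noise_det = effective_information(deterministic)
ei_mix, noise_mix = effective_information(competing)
print(ei_det, noise_det, ei_mix, noise_mix)
```

The deterministic map has zero conditional entropy and the larger EI, while the split map loses one bit to noise.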

### S9.2 Main Causal Emergence Theorem

###### Theorem S9.1 (Causal Emergence at the HEF Fixed Point).

Under A1–A6, NDA, and $E < E_{c}$:

1. (i) Causal noise eliminated: $H_{\mu}\bigl(T_{k}(R^{(k)}) \mid R^{(k)}\bigr) = 0$.
2. (ii) $\mathrm{EI}_{k^{*}} > \mathrm{EI}_{1}$.
3. (iii) $\mathrm{EI}_{k^{*}} - \mathrm{EI}_{1} \geq H_{\mu}(T^{\mathrm{pre}}_{1} \mid R^{(1)}) - \bigl[H_{\mu}(T^{\mathrm{pre}}_{1}) - H_{\mu}(T^{\mathrm{pre}}_{k^{*}})\bigr] > 0$.
4. (iv) Degeneracy reduction: $D_{k^{*}} \leq D_{1} - \log\bigl(|\Omega^{(1)}| / |\Omega^{(k^{*})}|\bigr)$, where $D_{k} = H_{\mu}\bigl(R^{(k)} \mid T_{k}(R^{(k)})\bigr)$.

###### Proof.

(i) For $E < E_{c}$, $\mathcal{A}^{*}(E) = \{\alpha^{*}\}$, so $T_{k} = f^{(k)}_{\alpha^{*}}$ is deterministic. For a deterministic map, $H_{\mu}\bigl(T_{k}(R^{(k)}) \mid R^{(k)}\bigr) = \mathbb{E}_{\mu}\bigl[H(\delta_{f_{\alpha^{*}}(r)})\bigr] = 0$.

(ii) Expand:

$$\mathrm{EI}_{k^{*}} - \mathrm{EI}_{1} = \underbrace{H_{\mu}(T^{\mathrm{post}}_{k^{*}}) - H_{\mu}(T^{\mathrm{pre}}_{1})}_{(A)} + \underbrace{H_{\mu}(T^{\mathrm{pre}}_{1} \mid R^{(1)})}_{(B) > 0}.$$

Term (B) is strictly positive because $|\mathcal{A}^{*}(E)| \geq 2$ at level 1 implies stochastic selection among mechanisms. By NDA: $H_{\mu}(T^{\mathrm{post}}_{k^{*}}) \geq I_{\mu}(R^{(1)}; T^{\mathrm{pre}}_{1}) = H_{\mu}(T^{\mathrm{pre}}_{1}) - H_{\mu}(T^{\mathrm{pre}}_{1} \mid R^{(1)})$, so $(A) \geq -(B)$. Hence the sum is $\geq 0$, with strict positivity whenever the NDA bound is strict, i.e. $(A) > -(B)$.

(iii) Direct from the decomposition above.

(iv) For $E < E_{c}$, $T_{k^{*}}$ is deterministic. However, determinism does not imply injectivity; multiple inputs can map to the same output, especially near the fixed point. The Markov chain $R^{(1)} \to R^{(k^{*})} \to T_{k^{*}}(R^{(k^{*})})$ gives, by the data processing inequality (DPI): $H\bigl(R^{(1)} \mid T_{k^{*}}(R^{(k^{*})})\bigr) \geq H\bigl(R^{(1)} \mid R^{(k^{*})}\bigr) \geq \log\bigl(|\Omega^{(1)}| / |\Omega^{(k^{*})}|\bigr)$. Moreover, $D_{1} = H\bigl(R^{(1)} \mid T_{1}(R^{(1)})\bigr) \geq H\bigl(R^{(1)} \mid T_{k^{*}}(R^{(k^{*})})\bigr)$. Using $H(R^{(1)} \mid T_{k^{*}}) = H(R^{(1)} \mid R^{(k^{*})}) + H(R^{(k^{*})} \mid T_{k^{*}}) \geq \log\bigl(|\Omega^{(1)}| / |\Omega^{(k^{*})}|\bigr) + D_{k^{*}}$, we obtain the bound. ∎

###### Corollary S9.2 (Empirical Estimator of EI Gain).

$\mathrm{EI}_{k^{*}} - \mathrm{EI}_{1} \geq H_{\mu}(T^{\mathrm{pre}}_{1} \mid R^{(1)}) - \bigl[H_{\mu}(T^{\mathrm{pre}}_{1}) - H_{\mu}(T^{\mathrm{pre}}_{k^{*}})\bigr]$. In gradient-based learning, the mechanism competition entropy $H_{\mu}(T^{\mathrm{pre}}_{1} \mid R^{(1)})$ is estimable from gradient-direction variance during $t < \Delta t$.

## Appendix S10 Grokking Delay: Conditional Derivation

###### Proposition S10.1 (Grokking Delay – Conditional on G1, Revised G2).

Under G1 and the revised G2, for moderate $\lambda < \lambda_{c}(p)$:

$$\Delta t = \frac{E_{0}/C_{\mathrm{mem}} - 1}{\lambda} + \frac{K}{\mathrm{frac} \cdot p \cdot \lambda} \sim \frac{K}{\mathrm{frac} \cdot p \cdot \lambda} \quad \text{for large } p,$$

where $K > 0$ is fitted from data ($\beta = -1.39 \pm 0.20$, $R^{2} = 0.91$). For $\lambda \geq \lambda_{c}(p)$, the weight decay destroys gradient signal before circuit formation completes, causing oscillatory failure (observed at $\lambda = 4$ for $p = 97$).

###### Proof.

From G1: $E_{\mathrm{step}}(t^{*}) = C_{\mathrm{mem}}$ gives $t^{*} = (E_{0}/C_{\mathrm{mem}} - 1)/\lambda$. From revised G2: $t_{\mathrm{conv}} \propto 1/(\mathrm{frac} \cdot p \cdot \lambda)$. Hence $\Delta t = t^{*} + t_{\mathrm{conv}}$. ∎
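The two-term delay formula is straightforward to evaluate once the constants are fixed. The sketch below simply encodes $\Delta t = t^{*} + t_{\mathrm{conv}}$ with $t_{\mathrm{conv}} = K/(\mathrm{frac} \cdot p \cdot \lambda)$; the function name, the optional `lam_c` guard, and all numerical inputs are hypothetical and are not the fitted values from the experiments.

```python
def grokking_delay(E0, C_mem, lam, K, frac, p, lam_c=None):
    """Return (t_star, t_conv, delta_t) for the two-term grokking delay.

    lam_c is an optional, user-supplied critical weight decay; the formula is
    only stated for lam < lam_c, so larger values raise an error.
    """
    if lam <= 0 or frac <= 0 or p <= 0 or C_mem <= 0:
        raise ValueError("lam, frac, p and C_mem must be positive")
    if lam_c is not None and lam >= lam_c:
        raise ValueError("lam >= lam_c: outside the regime of the proposition")
    t_star = (E0 / C_mem - 1.0) / lam        # threshold-crossing time (from G1)
    t_conv = K / (frac * p * lam)            # convergence time (from revised G2)
    return t_star, t_conv, t_star + t_conv

# Hypothetical inputs chosen for easy arithmetic.
t_star, t_conv, delta_t = grokking_delay(E0=3.0, C_mem=1.0, lam=0.5, K=10.0, frac=0.5, p=100)
print(t_star, t_conv, delta_t)
```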

## Appendix S11 Summary of Results

Table 3: Summary of main results, dependencies, and status.

## Appendix S12 Discussion: On the Status of A6

A central contribution of this SI is the clarification of A6 (metric contraction). Rather than treating A6 as unverifiable, the SI supports it on four fronts:

1. Theoretical grounding: Under spectral normalisation and weight decay, Lipschitz constants decay (Corollary [S7.5](https://arxiv.org/html/2606.07563#A7.Thmtheorem5), with AdamW caveat in Remark [23](https://arxiv.org/html/2606.07563#Thmremark23)).
2. Empirical verification protocol: Monitor $\|w\|^{2}_{F}$ decay and the weight-norm peak; full spectral-norm measurement is left to the Open Protocol (G1-test).
3. Empirical confirmation: $\|w\|^{2}_{F}$ peaks before grokking in 92.1% of runs; post-grokking accuracy stabilises at $0.9745 \pm 0.014$ (no numerical Lip bound claimed).
4. Edge cases: Lemma [S7.7](https://arxiv.org/html/2606.07563#A7.Thmtheorem7) handles $|R_{1}| \neq |R_{2}|$ via padding; depth-bounded formulas ensure finiteness.
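The monitoring protocol in item 2 amounts to comparing two logged time series per run. The sketch below shows one way this could be scripted; the accuracy threshold used to define the grokking step, the function names, and the synthetic logs are assumptions made for illustration and are not the criteria used in the reported experiments.

```python
def norm_peak_precedes_grokking(sq_norms, test_acc, threshold=0.95):
    """Return (peak_step, grok_step, precedes) for one logged training run.

    sq_norms[t] : squared Frobenius weight norm at logging step t
    test_acc[t] : test accuracy at logging step t
    threshold   : accuracy level defining the grokking step (an assumption)
    """
    if len(sq_norms) != len(test_acc) or not sq_norms:
        raise ValueError("need two non-empty series of equal length")
    peak_step = max(range(len(sq_norms)), key=sq_norms.__getitem__)
    grok_step = next((t for t, acc in enumerate(test_acc) if acc >= threshold), None)
    precedes = grok_step is not None and peak_step < grok_step
    return peak_step, grok_step, precedes

def fraction_preceding(runs, threshold=0.95):
    """Share of grokked runs whose weight-norm peak comes before the grokking step."""
    results = [norm_peak_precedes_grokking(n, a, threshold) for n, a in runs]
    grokked = [r for r in results if r[1] is not None]
    return sum(r[2] for r in grokked) / len(grokked) if grokked else float("nan")

# Two synthetic runs: one with an early norm peak, one whose norm keeps growing.
acc = [0.1, 0.1, 0.2, 0.3, 0.97, 0.98]
runs = [([1, 3, 5, 4, 2, 1], acc), ([1, 2, 3, 4, 5, 6], acc)]
print(fraction_preceding(runs))
```

Runs that never reach the threshold are excluded from the denominator, so the statistic is conditional on grokking having occurred.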

Thus, across all four instantiations, A6 is derivable from domain-specific structural conditions: log-Sobolev inequalities for EOM and IFF (verified via Holley–Stroock and Bakry–Émery), monotone compression for RSID (verified via Hill coefficient structure), and spectral normalisation plus weight decay for ML (empirically verified in Section 7.1.3 of the main text). In each case, A6 is a theorem conditional on these structural conditions, which are satisfied by the respective instantiations.

## Appendix S13 References

- [1] P. W. Anderson. More is different. Science, 177(4047):393–396, 1972.
- [2] S. Banach. Sur les opérations dans les ensembles abstraits. Fund. Math., 3:133–181, 1922.
- [3] S. G. Bobkov and F. Götze. Exponential integrability and transportation cost related to logarithmic Sobolev inequalities. J. Funct. Anal., 163(1):1–28, 1999.
- [4] M. A. Bedau. Weak emergence. Philosophical Perspectives, 11:375–399, 1997.
- [5] M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. PNAS, 116(32):15849–15854, 2019.
- [6] C. H. Bennett. The thermodynamics of computation. Int. J. Theor. Phys., 21(12):905–940, 1982.
- [7] E. Boix-Adsera, N. Mallinar, J. B. Simon, and M. Belkin. The features at convergence theorem for neural networks. International Conference on Learning Representations (ICLR), 2026. arXiv:2507.05644.
- [8] J. Butterfield. Emergence, reduction and supervenience. Found. Physics, 41(6):920–959, 2011.
- [9] H. B. Callen. Thermodynamics and an Introduction to Thermostatistics, 2nd ed. Wiley, 1985.
- [10] D. J. Chalmers. Strong and weak emergence. In The Re-emergence of Emergence, OUP, 2006.
- [11] S. Conway Morris. Life’s Solution. Cambridge University Press, 2003.
- [12] S. Conway Morris. The Runes of Evolution. Templeton Press, 2015.
- [13] T. M. Cover and J. A. Thomas. Elements of Information Theory, 2nd ed. Wiley, 2006.
- [14] D. Doshi, A. Das, T. He, and A. Gromov. To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets. International Conference on Learning Representations (ICLR), 2024. arXiv:2310.13061.
- [15] D. Deutsch and C. Marletto. Constructor theory of information. Proc. R. Soc. A, 471:20140540, 2015.
- [16] N. Elhage et al. Toy models of superposition. Transformer Circuits Thread, 2022.
- [17] D. H. Erwin et al. The Cambrian conundrum. Science, 334(6059):1091–1097, 2011.
- [18] J. W. Gibbs. Elementary Principles in Statistical Mechanics. Yale, 1902.
- [19] P. R. Halmos. Measure Theory. Springer, 1950.
- [20] F. Hausdorff. Grundzüge der Mengenlehre. Veit, 1914.
- [21] E. P. Hoel, L. Albantakis, and G. Tononi. Quantifying causal emergence. PNAS, 110(49):19790–19795, 2013.
- [22] W. Hordijk and M. Steel. Detecting autocatalytic sets. J. Theor. Biol., 227(4):451–461, 2004.
- [23] M. Huh, B. Cheung, T. Wang, and P. Isola. The Platonic Representation Hypothesis. ICML, 2024. arXiv:2405.07987.
- [24] C. Jarzynski. Nonequilibrium equality for free energy differences. Phys. Rev. Lett., 78(14):2690–2693, 1997.
- [25] E. T. Jaynes. Information theory and statistical mechanics. Phys. Rev., 106:620–630, 1957.
- [26] L. P. Kadanoff. Scaling laws for Ising models near $T_{c}$. Physics, 2(6):263–272, 1966.
- [27] S. A. Kauffman. The Origins of Order. OUP, 1993.
- [28] E. Kreyszig. Introductory Functional Analysis with Applications. Wiley, 1978.
- [29] L. D. Landau. On the theory of phase transitions. Zh. Eksp. Teor. Fiz., 7:19–32, 1937.
- [30] R. Landauer. Irreversibility and heat generation. IBM J. Res. Dev., 5(3):183–191, 1961.
- [31] I. Loshchilov and F. Hutter. Decoupled weight decay regularisation. ICLR, 2019.
- [32] C. R. Marshall. Explaining the Cambrian explosion. Annu. Rev. Earth Planet. Sci., 34:355–384, 2006.
- [33] K. Clauw, S. Stramaglia, and D. Marinazzo. Information-theoretic progress measures reveal grokking is an emergent phase transition. arXiv:2408.08944, 2024.
- [34] J. Monod, J. Wyman, and J.-P. Changeux. On the nature of allosteric transitions. J. Mol. Biol., 12(1):88–118, 1965.
- [35] J. R. Munkres. Topology, 2nd ed. Prentice Hall, 2000.
- [36] P. Nakkiran et al. Deep double descent. ICLR, 2020.
- [37] C. Olah et al. Zoom in: an introduction to circuits. Distill, 2020.
- [38] K. T. David, J. G. Schraiber, J. G. Crandall, A. L. Labella, D. A. Opulente, M.-C. Harrison, J. F. Wolters, X. Zhou, X.-X. Shen, M. Groenewald, C. T. Hittinger, M. Pennell, and A. Rokas. Convergent expansions of keystone gene families drive metabolic innovation in Saccharomycotina yeasts. Proc. Natl. Acad. Sci. U.S.A., 122(23):e2500165122, 2025. doi:10.1073/pnas.2500165122.
- [39] F. Otto and C. Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. J. Funct. Anal., 173(2):361–400, 2000.
- [40] D. Peer et al. Nanocarriers as an emerging platform. Nature Nanotechnology, 2:751–760, 2007.
- [41] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. International Conference on Learning Representations (ICLR), 2023.
- [42] A. Power et al. Grokking: generalisation beyond overfitting. arXiv:2201.02177, 2022.
- [43] M. Raginsky. Strong data processing inequalities. IEEE Trans. Inf. Theory, 62(6):3355–3389, 2016.
- [44] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning. Phys. Rev. A, 45(8):6056–6091, 1992.
- [45] L. Szilárd. Über die Entropieverminderung. Z. Phys., 53:840–856, 1929.
- [46] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. arXiv:physics/0004057, 2000.
- [47] Q. H. Truong and X. K. Truong. Prebiotic selection as a physical process. bioRxiv, 2026. doi:10.64898/2026.04.21.719958.
- [48] X. K. Truong. First-passage prediction of grokking delay: a calibrated law under AdamW with causal validation. arXiv:2605.18845, 2026.
- [49] K. G. Wilson. Renormalisation group and critical phenomena I. Phys. Rev. B, 4(9):3174–3183, 1971.
- [50] K. G. Wilson. The renormalisation group and the $\varepsilon$ expansion. Phys. Rep., 12(2):75–199, 1974.
- [51] Y. Xu. The geometry of multi-task grokking: transverse instability, superposition, and weight decay phase structure. arXiv:2602.18523, 2026.
