The Hamilton-Jacobi Theory of Deep Learning

Summary

This paper establishes an exact correspondence between neural network training and Hamilton-Jacobi initial-value problems, unifying deep learning architectures through a deformation parameter.

arXiv:2605.28983v1 Announce Type: new


# The Hamilton–Jacobi Theory of Deep Learning
Source: [https://arxiv.org/html/2605.28983](https://arxiv.org/html/2605.28983)
Jose Marie Antonio Miñoza¹, Erika Fille T. Legara¹,², Christopher P. Monterola²

¹Center for AI Research PH, ²Asian Institute of Management

###### Abstract

In this paper, training a neural network is identified, exactly, as a search through Hamilton–Jacobi initial-value problems: each gradient step selects the initial data of a viscous Hamilton–Jacobi equation whose Hopf–Cole propagator best fits the observations; at inference, the input is the spatial point at which that solution is evaluated and the initial condition is already encoded in the weights. The correspondence is exact for log-sum-exp layers and structural for broader architectures: residual networks, transformers, and recurrent architectures (RNNs, LSTMs, SSMs) each discretize the same class of Hamilton–Jacobi equations, with architecture-dependent Hamiltonian and viscosity. A single deformation parameter $\varepsilon$ unifies all four perspectives (network, tropical algebra, viscous PDE, convex optimization) in a commutative diagram closed under Lipschitz conditions. Quantitative consequences include: the minimax optimal generalization rate $O(n^{-1/(d+2)})$ for fixed $t$; adversarial robustness controlled by $\varepsilon$; backpropagation as the co-state equation of the Hamiltonian system for residual networks (Pontryagin Maximum Principle); scaling exponents consistent with data intrinsic dimension via PDE quadrature; and a closed-form $O(N)$ influence function (softmax attribution weights $\pi_{j}$) whose entropy landscape undergoes fold bifurcations as $\varepsilon$ increases, each merging attribution basins.

## 1 Introduction

What equation does a neural network solve?

Conventionally the question runs the other direction: given a partial differential equation, design a network to approximate its solution (Han et al., [2018](https://arxiv.org/html/2605.28983#bib.bib20)). Here a trained network already *is* a Hamilton–Jacobi equation, and the question is which one. The key is a single deformation parameter $\varepsilon$ and ultradiscretization (Tokihiro et al., [1996](https://arxiv.org/html/2605.28983#bib.bib5)): the exact passage $\varepsilon\to 0$ between two algebraic worlds. At $\varepsilon=0$, addition is $\max$ and multiplication is $+$ (the tropical semiring); at $\varepsilon>0$, ordinary arithmetic is restored and the same objects reappear as the solution operator of a viscous PDE. The passage is not an approximation; it is an exact semiring homomorphism, the Maslov dequantization (Litvinov, [2007](https://arxiv.org/html/2605.28983#bib.bib2)).

The mathematical object realizing this is a layer with log\-sum\-exp activation:

$$f_{\varepsilon}(x)=\varepsilon\log\sum_{j=1}^{N}\exp\!\Bigl(\bigl(W_{j}\cdot x+b_{j}\bigr)/\varepsilon\Bigr). \tag{1}$$

The algebraic structure of ([1](https://arxiv.org/html/2605.28983#S1.E1)) is fixed by its tropical limit. At $\varepsilon=0$, ([1](https://arxiv.org/html/2605.28983#S1.E1)) collapses to $\max_{j}(W_{j}\cdot x+b_{j})$, the Hopf–Lax formula and a convex optimization problem. The passage $\varepsilon\to 0$ is the Maslov dequantization (Litvinov, [2007](https://arxiv.org/html/2605.28983#bib.bib2)): an exact semiring homomorphism from $(\mathbb{R},+,\times)$ to $(\mathbb{R},\max,+)$, not a numerical approximation. Lifting $\varepsilon$ above zero reverses this limit, converting the tropical machine into the smooth layer ([1](https://arxiv.org/html/2605.28983#S1.E1)); the Hopf–Cole linearization (Hopf, [1950](https://arxiv.org/html/2605.28983#bib.bib9)) then identifies ([1](https://arxiv.org/html/2605.28983#S1.E1)) exactly as the heat-equation propagator of a viscous Hamilton–Jacobi equation (Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)). The weights encode the initial data, the architecture encodes the Hamiltonian, and a forward pass evaluates that PDE solution at the query point $x$. The parameter $\varepsilon$ appears simultaneously as the softmax temperature, the PDE viscosity, and the convex regularization strength; Theorem [7.1](https://arxiv.org/html/2605.28983#S7.Thmtheorem1) shows these roles are identified, not coincidental.

#### Contributions\.

The contribution is the *unification*: individual ingredients (Maslov dequantization (Litvinov, [2007](https://arxiv.org/html/2605.28983#bib.bib2)), Hopf–Cole linearization (Hopf, [1950](https://arxiv.org/html/2605.28983#bib.bib9)), ResNet-as-ODE (E, [2017](https://arxiv.org/html/2605.28983#bib.bib17); Chen et al., [2018](https://arxiv.org/html/2605.28983#bib.bib18)), adjoint method (Chen et al., [2018](https://arxiv.org/html/2605.28983#bib.bib18)), scaling laws (Kaplan et al., [2020](https://arxiv.org/html/2605.28983#bib.bib27))) are known; connecting them under a single $\varepsilon$ yields a commutative diagram and quantitative consequences that follow from none alone. Concretely, the paper establishes:

1. (i) The LSE activation is a smooth deformation of the tropical max operation; the limit $\varepsilon\to 0$ is an exact semiring homomorphism from $(\mathbb{R},+,\times)$ to $(\mathbb{R},\max,+)$ (Theorem [3.1](https://arxiv.org/html/2605.28983#S3.Thmtheorem1)).
2. (ii) Every LSE-activated feedforward layer encodes the exact Hopf–Cole solution of a viscous Hamilton–Jacobi PDE under a discrete measure (Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)); transformer attention is the expected value vector under the Gibbs measure (Proposition [H.9](https://arxiv.org/html/2605.28983#A8.Thmtheorem9)); deep networks correspond structurally to the PDE semigroup under point-evaluation composition. The tropical limit $\varepsilon\to 0$ recovers the Hopf–Lax inf-convolution (Theorem [5.1](https://arxiv.org/html/2605.28983#S5.Thmtheorem1)), simultaneously a linear program and a MASO; ResNets and recurrent architectures (RNNs, LSTMs, SSMs) discretize the ODE characteristics with architecture-dependent viscosity (Propositions [5.2](https://arxiv.org/html/2605.28983#S5.Thmtheorem2), [5.4](https://arxiv.org/html/2605.28983#S5.Thmtheorem4)).
3. (iii) A single parameter $\varepsilon$ indexes four perspectives on the same object (neural network, tropical algebra, PDE, convex optimization) and the resulting commutative diagram (Theorem [7.1](https://arxiv.org/html/2605.28983#S7.Thmtheorem1)) closes under Lipschitz $g$ and convex $H$.
4. (iv) The framework yields actionable design principles: optimal temperature $\varepsilon^{*}\asymp N^{-1/d}$ prescribed from width and data dimension achieves minimax rate $O(n^{-1/(d+2)})$ (Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1)); adversarial robustness is certifiably controlled by $\varepsilon$ (Corollary [8.2](https://arxiv.org/html/2605.28983#S8.Thmtheorem2)); backpropagation is the co-state equation of the Hamiltonian system for residual networks (Theorem [8.4](https://arxiv.org/html/2605.28983#S8.Thmtheorem4)); and the framework extends to learnable quadratic $H_{\theta}(p)=p^{\top}A_{\theta}p$ (Theorem [4.4](https://arxiv.org/html/2605.28983#S4.Thmtheorem4)).
5. (v) The softmax weights $\pi_{j}$ are a closed-form $O(N)$ influence function with exact label sensitivity $\partial\hat{f}/\partial g_{j}=\pi_{j}(1+(\hat{f}-g_{j})/\varepsilon)$ requiring no Hessian inversion; the attribution-entropy landscape $H(\pi)$ undergoes fold bifurcations as $\varepsilon$ increases, each annihilating an attribution basin, with saddle points marking attribution transitions (Appendix [F](https://arxiv.org/html/2605.28983#A6)).

Together, the results constitute *a unifying mathematical theory of deep learning*: Maslov's dequantization, the same principle that connects classical to quantum mechanics, here connects tropical to smooth neural computation. Activation type, architectural class, generalization, robustness, training dynamics, and scaling laws, ordinarily studied in isolation, emerge as facets of one commutative diagram, closed under Lipschitz $g$ and convex $H$. The exact Hopf–Cole correspondence is complete for the quadratic Hamiltonian class (Theorems [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1), [4.3](https://arxiv.org/html/2605.28983#S4.Thmtheorem3), [4.4](https://arxiv.org/html/2605.28983#S4.Thmtheorem4)); structural correspondences extend to broader architectures.

#### Notation\.

For $\varepsilon>0$ and $z\in\mathbb{R}^{N}$, define $\mathrm{LSE}_{\varepsilon}(z)=\varepsilon\log\sum_{i}\exp(z_{i}/\varepsilon)$. Write $\otimes_{\mathrm{tr}}$ for tropical matrix multiplication: $(A\otimes_{\mathrm{tr}}x)_{i}=\max_{j}(A_{ij}+x_{j})$. For a convex Hamiltonian $H:\mathbb{R}^{d}\to\mathbb{R}$, let $L(v)=\sup_{p}(p\cdot v-H(p))$ denote its Legendre transform. Write $f\,\square\,g$ for inf-convolution: $(f\,\square\,g)(x)=\inf_{y}\{f(y)+g(x-y)\}$.

## 2 Background

#### Log\-Sum\-Exp and Convex Duality\.

The function $\mathrm{LSE}_{\varepsilon}:\mathbb{R}^{m}\to\mathbb{R}$ is convex, smooth for $\varepsilon>0$, and satisfies $\nabla\mathrm{LSE}_{\varepsilon}(x)=\mathrm{softmax}(x/\varepsilon)$. Its Legendre–Fenchel conjugate is the negative entropy: $\mathrm{LSE}_{\varepsilon}^{*}(p)=\varepsilon\sum_{i}p_{i}\log p_{i}$ on the simplex. It is simultaneously the log-partition function of the Gibbs distribution at temperature $\varepsilon$, the smooth convex relaxation of $\max$, and, via the Hopf–Cole substitution, the solution operator of the heat equation.
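The two facts above (gradient equals softmax, conjugate equals scaled negative entropy) can be checked numerically. The following is a minimal sketch assuming NumPy; the helper names `lse`, `softmax`, and `numerical_grad`, and the test vector, are illustrative choices and not taken from the paper.

```python
import numpy as np

def lse(z, eps):
    """LSE_eps(z) = eps * log(sum_i exp(z_i / eps)), shifted by max(z) for stability."""
    z = np.asarray(z, dtype=float)
    m = z.max()
    return m + eps * np.log(np.exp((z - m) / eps).sum())

def softmax(z, eps):
    """softmax(z / eps): the Gibbs weights at temperature eps."""
    z = np.asarray(z, dtype=float)
    w = np.exp((z - z.max()) / eps)
    return w / w.sum()

def numerical_grad(f, z, h=1e-6):
    """Central finite-difference gradient of a scalar function f at z."""
    z = np.asarray(z, dtype=float)
    grad = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z)
        e[i] = h
        grad[i] = (f(z + e) - f(z - e)) / (2 * h)
    return grad

z = np.array([0.3, -1.2, 2.0, 0.7])   # arbitrary test vector (assumption)
eps = 0.5

grad_fd = numerical_grad(lambda v: lse(v, eps), z)
grad_exact = softmax(z, eps)

# Fenchel equality at p = softmax(z / eps): LSE(z) + LSE*(p) = <p, z>,
# with LSE*(p) = eps * sum_i p_i log p_i on the simplex.
conj = eps * np.sum(grad_exact * np.log(grad_exact))
fenchel_gap = lse(z, eps) + conj - grad_exact @ z
```

Under these assumptions `grad_fd` should agree with `grad_exact` up to finite-difference error, and `fenchel_gap` should vanish up to rounding.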

#### The Tropical Semiring\.

The tropical semiring $(\mathbb{R}\cup\{-\infty\},\oplus,\otimes)$ has $a\oplus b=\max(a,b)$ and $a\otimes b=a+b$, with identities $-\infty$ and $0$ respectively. Tropical matrix multiplication acts as $(A\otimes_{\mathrm{tr}}x)_{i}=\max_{j}(A_{ij}+x_{j})$. Max-plus algebra and the tropical semiring are two names for the same structure (Litvinov, [2007](https://arxiv.org/html/2605.28983#bib.bib2)).
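As a small illustration of the tropical matrix product defined above, here is a sketch assuming NumPy; `trop_matvec` and the example matrices are hypothetical names and values chosen for this illustration.

```python
import numpy as np

NEG_INF = -np.inf  # identity element for tropical addition (max)

def trop_matvec(A, x):
    """Tropical matrix-vector product: (A (x)_tr x)_i = max_j (A_ij + x_j)."""
    A = np.asarray(A, dtype=float)
    x = np.asarray(x, dtype=float)
    return (A + x[None, :]).max(axis=1)

A = np.array([[0.0, 2.0],
              [1.0, NEG_INF]])
x = np.array([3.0, 1.0])

# Row 0: max(0 + 3, 2 + 1) = 3.  Row 1: max(1 + 3, -inf + 1) = 4.
y = trop_matvec(A, x)

# Tropical identity matrix: 0 on the diagonal, -inf elsewhere.
I_trop = np.full((2, 2), NEG_INF)
np.fill_diagonal(I_trop, 0.0)
x_again = trop_matvec(I_trop, x)
```

The identity matrix uses $0$ on the diagonal because $0$ is the unit for tropical multiplication ($+$), and $-\infty$ off the diagonal because it is the unit for tropical addition ($\max$).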

#### Hamilton–Jacobi PDEs\.

The *viscous* Hamilton–Jacobi (HJ) equation is

$$\partial_{t}u+H(\nabla_{x}u)=\varepsilon\Delta_{x}u,\qquad u(x,0)=g(x), \tag{2}$$

with Hamiltonian $H:\mathbb{R}^{d}\to\mathbb{R}$ and viscosity $\varepsilon>0$. The inviscid equation has $\varepsilon=0$. Under the Hopf–Cole substitution $u=-\varepsilon\log v$, equation ([2](https://arxiv.org/html/2605.28983#S2.E2)) linearizes to the heat equation $\partial_{t}v=\varepsilon\Delta v$ for the quadratic Hamiltonian $H(p)=|p|^{2}$ (Hopf, [1950](https://arxiv.org/html/2605.28983#bib.bib9); Evans, [2010](https://arxiv.org/html/2605.28983#bib.bib11)); for general convex $H$ the substitution yields a non-linear equation for $v$. Viscosity solutions of the inviscid equation are characterized by Crandall–Lions theory (Crandall and Lions, [1983](https://arxiv.org/html/2605.28983#bib.bib8)); for convex $H$ the unique viscosity solution is given by the Hopf–Lax formula (Lax, [1957](https://arxiv.org/html/2605.28983#bib.bib10); Evans, [2010](https://arxiv.org/html/2605.28983#bib.bib11)).
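The linearization for $H(p)=|p|^{2}$ can be verified directly. The short derivation below is a worked sketch, assuming $v>0$ is smooth; it uses only the substitution $u=-\varepsilon\log v$ and equation (2).

$$\begin{aligned}
\partial_{t}u&=-\varepsilon\,\frac{\partial_{t}v}{v},\qquad
\nabla u=-\varepsilon\,\frac{\nabla v}{v},\qquad
\Delta u=-\varepsilon\,\frac{\Delta v}{v}+\varepsilon\,\frac{|\nabla v|^{2}}{v^{2}},\\
\partial_{t}u+|\nabla u|^{2}-\varepsilon\Delta u
&=-\varepsilon\,\frac{\partial_{t}v}{v}+\varepsilon^{2}\frac{|\nabla v|^{2}}{v^{2}}+\varepsilon^{2}\frac{\Delta v}{v}-\varepsilon^{2}\frac{|\nabla v|^{2}}{v^{2}}
=-\frac{\varepsilon}{v}\bigl(\partial_{t}v-\varepsilon\Delta v\bigr).
\end{aligned}$$

The two gradient terms cancel exactly because the Hamiltonian is $|p|^{2}$ with unit coefficient, so $u$ solves (2) if and only if $\partial_{t}v=\varepsilon\Delta v$. For another convex $H$ the cancellation generally fails and a nonlinear term in $\nabla v$ remains.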

#### Ultradiscretization\.

Ultradiscretization (Tokihiro et al., [1996](https://arxiv.org/html/2605.28983#bib.bib5)) passes a smooth system to a max-plus system via $\varepsilon\log(e^{A/\varepsilon}+e^{B/\varepsilon})\to\max(A,B)$ as $\varepsilon\to 0$, and the reverse lift re-quantizes a tropical system by raising $\varepsilon$ above zero. Originally developed for soliton PDEs (Tokihiro et al., [1996](https://arxiv.org/html/2605.28983#bib.bib5)), the same principle applies here to neural networks.

## 3 From Neural Networks to Max-Plus Algebra

#### LSE as a Deformation\.

For all $x\in\mathbb{R}^{m}$: $\max_{i}x_{i}\leq\mathrm{LSE}_{\varepsilon}(x)\leq\max_{i}x_{i}+\varepsilon\log m$. As $\varepsilon\to 0$ this sandwiches to $\max$. The Maslov dequantization makes this more precise.

###### Theorem 3.1 (Maslov dequantization (Litvinov, [2007](https://arxiv.org/html/2605.28983#bib.bib2))).

For all $x\in\mathbb{R}^{m}$, $\displaystyle\lim_{\varepsilon\to 0}\mathrm{LSE}_{\varepsilon}(x)=\max_{i}x_{i}$. For each $\varepsilon>0$, $(\mathbb{R},\mathrm{LSE}_{\varepsilon},+)\cong(\mathbb{R}_{\geq 0},+,\times)$; the limit $\varepsilon\to 0$ is the semiring homomorphism onto the tropical semiring $(\mathbb{R},\max,+)$, under which addition becomes $\max$ and multiplication becomes $+$.

The passage $\varepsilon\to 0$ is formally analogous to the $\hbar\to 0$ limit in quantum mechanics: the same deformation of a real arithmetic semiring into a tropical one (Litvinov, [2007](https://arxiv.org/html/2605.28983#bib.bib2)).
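The sandwich bound and the fixed-$\varepsilon$ isomorphism can be illustrated numerically. This is a minimal sketch assuming NumPy; the map `phi(a) = exp(a / eps)` is the standard exponential change of variables and is written out here as an assumption about how the isomorphism in Theorem 3.1 is realized.

```python
import numpy as np

def lse(z, eps):
    """LSE_eps(z) = eps * log(sum_i exp(z_i / eps)), shifted by max(z) for stability."""
    z = np.asarray(z, dtype=float)
    m = z.max()
    return m + eps * np.log(np.exp((z - m) / eps).sum())

x = np.array([1.0, 0.4, -2.0, 0.9])   # arbitrary test vector (assumption)
eps_values = [1.0, 0.1, 0.01, 0.001]

# Gap LSE_eps(x) - max(x); the sandwich bound says 0 <= gap <= eps * log(m).
gaps = [lse(x, eps) - x.max() for eps in eps_values]
bounds = [eps * np.log(x.size) for eps in eps_values]

# At fixed eps, phi(a) = exp(a / eps) sends LSE_eps to ordinary addition
# and + to ordinary multiplication.
eps, a, b = 0.5, 0.3, -1.1
phi = lambda s: np.exp(s / eps)
sum_ok = np.isclose(phi(lse([a, b], eps)), phi(a) + phi(b))
prod_ok = np.isclose(phi(a + b), phi(a) * phi(b))
```

The gaps shrink with $\varepsilon$ while staying inside $[0,\varepsilon\log m]$, which is the quantitative content of the sandwich bound.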

#### Two Regimes of a Network Layer\.

At $\varepsilon>0$: the affine-plus-LSE layer $f_{\varepsilon}(x)=\mathrm{LSE}_{\varepsilon}(Wx+b)$ is smooth; all $N$ neurons contribute with softmax weights; the output is the solution of an entropy-regularized optimization. At $\varepsilon=0$: the layer becomes $f_{0}(x)=\max_{j}(W_{j}\cdot x+b_{j})$, a tropical linear map in which a single neuron dominates. This is a max-affine spline operator (MASO) (Balestriero and Baraniuk, [2018](https://arxiv.org/html/2605.28983#bib.bib21), [2021](https://arxiv.org/html/2605.28983#bib.bib22)) that partitions the input into polyhedral regions, exactly as a decision tree (Aytekin, [2022](https://arxiv.org/html/2605.28983#bib.bib23)). Softmax weights at $\varepsilon>0$ are the continuous relaxation of the one-hot region indicator.

## 4 Neural Networks as Hamilton–Jacobi Equations

The *Hopf–Cole solution* is the explicit formula for the unique solution $u_{\varepsilon}(x,t)$ of the viscous HJ equation ([2](https://arxiv.org/html/2605.28983#S2.E2)); it is a *solution to* that equation, not a method for solving it. The claim of this section is stronger than approximation: an LSE network layer *encodes* this solution algebraically and exactly under a discrete measure, via an exact identity between the layer output and the PDE solution.

#### Hopf–Cole and the LSE Representation\.

Under the Hopf–Cole substitution $v=\exp(-u/\varepsilon)$, the viscous HJ equation ([2](https://arxiv.org/html/2605.28983#S2.E2)) with Hamiltonian $H(p)=|p|^{2}$ becomes the heat equation $\partial_{t}v=\varepsilon\Delta v$, $v(x,0)=\exp(-g(x)/\varepsilon)$. The heat-kernel solution inverts back to

$$u_{\varepsilon}(x,t)=-\varepsilon\log\!\int_{\mathbb{R}^{d}}\exp\!\left(\frac{-g(y)-\tfrac{|x-y|^{2}}{4t}}{\varepsilon}\right)dy, \tag{3}$$

which is the unique classical solution of ([2](https://arxiv.org/html/2605.28983#S2.E2)) with initial data $g$, written as $\mathrm{LSE}_{\varepsilon}$ over the continuous variable $y$.

###### Theorem 4.1 (Neural network layer encodes PDE solution).

Let $\{y_{j}\}_{j=1}^{N}\subset\mathbb{R}^{d}$, and set $W_{j}=y_{j}/(2t)$, $b_{j}=-g(y_{j})-|y_{j}|^{2}/(4t)$. Then

$$f_{\varepsilon}^{N}(x)=\mathrm{LSE}_{\varepsilon}(Wx+b)=\varepsilon\log\sum_{j=1}^{N}\exp\!\left(\frac{W_{j}\cdot x+b_{j}}{\varepsilon}\right) \tag{4}$$

satisfies the exact identity

$$f_{\varepsilon}^{N}(x)=\frac{|x|^{2}}{4t}-u_{\varepsilon}^{N}(x,t), \tag{5}$$

where $u_{\varepsilon}^{N}$ is the Hopf–Cole solution ([3](https://arxiv.org/html/2605.28983#S4.E3)) of the viscous HJ equation ([2](https://arxiv.org/html/2605.28983#S2.E2)) under the unnormalized discrete measure $\mu_{N}=\sum_{j=1}^{N}\delta_{y_{j}}$. *No approximation is made.*

Proof. See Appendix [E](https://arxiv.org/html/2605.28983#A5). $\square$

Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1) is an algebraic identity: the Hopf–Cole solution is recovered from the layer output as $u_{\varepsilon}^{N}(x,t)=|x|^{2}/(4t)-f_{\varepsilon}^{N}(x)$, with the quadratic shift $|x|^{2}/(4t)$ carrying no learned parameters. The theorem establishes a complete dictionary: network weights $W$ encode support points of the initial-data measure; biases $b$ encode the transport cost; $\varepsilon$ is the viscosity coefficient; and width $N$ is the discretization of the measure.
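The identity (5) is easy to check numerically, since it follows from completing the square, $W_{j}\cdot x+b_{j}=|x|^{2}/(4t)-g(y_{j})-|x-y_{j}|^{2}/(4t)$. The following is a minimal sketch assuming NumPy; the random support points, the values standing in for $g(y_{j})$, and the constants are illustrative assumptions.

```python
import numpy as np

def lse(z, eps):
    """LSE_eps(z) = eps * log(sum_i exp(z_i / eps)), shifted by max(z) for stability."""
    z = np.asarray(z, dtype=float)
    m = z.max()
    return m + eps * np.log(np.exp((z - m) / eps).sum())

rng = np.random.default_rng(0)
d, N, t, eps = 3, 8, 0.7, 0.25     # illustrative sizes and constants
y = rng.normal(size=(N, d))        # support points y_j
g = rng.normal(size=N)             # arbitrary values standing in for g(y_j)
x = rng.normal(size=d)             # query point

# Weights and biases as prescribed in Theorem 4.1.
W = y / (2 * t)
b = -g - (y ** 2).sum(axis=1) / (4 * t)
f = lse(W @ x + b, eps)            # layer output, equation (4)

# Discrete Hopf-Cole solution under the unnormalized measure sum_j delta_{y_j}:
# u = -eps * log sum_j exp(-(g_j + |x - y_j|^2 / (4 t)) / eps).
cost = g + ((x - y) ** 2).sum(axis=1) / (4 * t)
u = -lse(-cost, eps)

residual = f - ((x @ x) / (4 * t) - u)   # should vanish up to rounding
```

Changing `t` changes `W`, `b`, and `u` together but leaves the residual at rounding level.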

#### The time parameter as gauge freedom\.

The parameterization $W_{j}=y_{j}/(2t)$, $b_{j}=-g(y_{j})-|y_{j}|^{2}/(4t)$ admits one free scalar $t>0$: every value yields a self-consistent HJ reading of the same network. This is *gauge freedom*, directly analogous to gauge freedom in physics: the network's input–output map is gauge-invariant, but the PDE interpretation requires a gauge choice. Three canonical fixings emerge: (i) *data-scale gauge* $t=\mathrm{tr}(\Sigma_{X})/d$, matching diffusion to data scale; (ii) *information gauge* $t=\arg\max_{t}\mathbb{E}_{x}[H(\pi(x;t))]$, maximizing attribution entropy; (iii) *generalization gauge* with $\varepsilon^{*}(t)=N^{-1/d}$, fixing the gauge to the minimax-optimal viscosity.

#### Contrast with PINNs\.

Physics-informed neural networks impose the PDE *externally* via a residual loss $\|\partial_{t}f_{\theta}+H(\nabla f_{\theta})-\varepsilon\Delta f_{\theta}\|^{2}$. Here the PDE is in the *architecture*: Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1) guarantees the residual is identically zero for the spatial equation at any fixed layer depth $t$, so no PDE loss is needed. (The continuous heat kernel carries a prefactor $(4\pi\varepsilon t)^{-d/2}$; for the discrete sum over $N$ atoms this contributes an additive constant $\frac{\varepsilon d}{2}\log(4\pi\varepsilon t)$, absorbed into a common bias shift and irrelevant to the spatial identity.) Gradient descent does not move the network toward a PDE; it searches over initial conditions $\{(y_{j},g(y_{j}))\}$, selecting the viscous HJ equation whose solution best fits the data.

#### Transformer Attention\.

Scaled dot-product attention computes, for query $Q$, key $K$, and value $V$:

$$\mathrm{Attn}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V. \tag{6}$$

Each row of this output is the expected value of the rows of $V$ under the Gibbs distribution at temperature $\varepsilon=\sqrt{d}$:

$$\mathrm{Attn}(Q,K,V)_{i}=\sum_{j}\mathrm{softmax}\!\left(\frac{q_{i}\cdot k_{j}}{\varepsilon}\right)v_{j}=\nabla_{s}\,\mathrm{LSE}_{\varepsilon}(s)\big|_{s_{j}=q_{i}\cdot k_{j}}\cdot V. \tag{7}$$

This is the expected value of $V$ under the Hopf–Cole (Gibbs) measure defined by $QK^{\top}/\sqrt{d}$; the $1/\sqrt{d}$ scaling fixes temperature $\varepsilon=\sqrt{d}$, preventing logit-variance growth. As $\varepsilon\to 0$, softmax collapses to argmax and ([6](https://arxiv.org/html/2605.28983#S4.E6)) becomes *hard attention*: $\mathrm{Attn}_{0}(Q,K,V)_{i}=v_{\,\mathrm{argmax}_{j}(q_{i}\cdot k_{j})}$, a tropical selection operator. The L2 attention variant replaces the dot-product logit $q_{i}\cdot k_{j}/\sqrt{d}$ with the negative squared distance $-\|q_{i}-k_{j}\|^{2}/(4t)$, making each query an exact evaluation of the Hopf–Cole solution:

###### Theorem 4.2 (L2 attention as exact Hopf–Cole).

Define L2 attention with logits $z_{j}=-\|q_{i}-k_{j}\|^{2}/(4t)$. Then the attention output at query $q_{i}$ equals the gradient of $\mathrm{LSE}_{\varepsilon}(z)$ contracted with $V$, and the partition function $Z_{i}=\sum_{j}\exp(-\|q_{i}-k_{j}\|^{2}/(4\varepsilon t))$ equals the unnormalized Hopf–Cole solution at $x=q_{i}$ under the empirical key measure with $g\equiv 0$. No approximation is made.

Proof. See Appendix [H](https://arxiv.org/html/2605.28983#A8). $\square$
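The attention formulas can be sketched in a few lines. The code below is an illustrative NumPy sketch, not the authors' implementation: `attention` takes the temperature `eps` explicitly, the hard-attention limit is approximated with a small `eps`, and the L2 variant is evaluated at a single query with assumed values of `t` and `eps`.

```python
import numpy as np

def softmax(z, eps):
    """Gibbs weights softmax(z / eps), shifted by max(z) for stability."""
    z = np.asarray(z, dtype=float)
    w = np.exp((z - z.max()) / eps)
    return w / w.sum()

def attention(Q, K, V, eps):
    """Row i is the Gibbs expectation of the rows of V with logits q_i . k_j."""
    scores = Q @ K.T
    P = np.stack([softmax(s, eps) for s in scores])
    return P @ V

rng = np.random.default_rng(1)
n_q, n_k, d = 2, 5, 4
Q, K, V = (rng.normal(size=(n, d)) for n in (n_q, n_k, n_k))

soft = attention(Q, K, V, eps=np.sqrt(d))   # scaled dot-product attention, eq. (6)
near_hard = attention(Q, K, V, eps=1e-6)    # small eps approaches argmax selection
hard = V[np.argmax(Q @ K.T, axis=1)]        # tropical (hard) attention

# L2 attention at query q_0: logits z_j = -|q_0 - k_j|^2 / (4 t).
t, eps = 0.5, 0.3                           # illustrative values (assumption)
z = -((Q[0] - K) ** 2).sum(axis=1) / (4 * t)
pi = softmax(z, eps)                        # attention weights
Z = np.exp(z / eps).sum()                   # partition function Z_0 of Theorem 4.2
u = -eps * np.log(Z)                        # discrete Hopf-Cole value with g = 0
l2_out = pi @ V
```

Each output row is a convex combination of value rows, so it stays inside their coordinate-wise range; as `eps` shrinks the combination concentrates on the single best-matching key.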

#### Deep Networks as PDE Semigroups\.

The semigroup property of the heat equation $(S_{t}\circ S_{s})f=S_{t+s}f$ corresponds under Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1) to layer composition. A depth-$L$ network with widths $N_{1},\ldots,N_{L}$ discretizes the HJ semigroup $S_{t_{1}}\circ\cdots\circ S_{t_{L}}$ applied to the initial data $g$, evaluated at times $t_{1},\ldots,t_{L}$ under discrete measures $\mu_{N_{1}},\ldots,\mu_{N_{L}}$, each encoded by one layer's parameters. The tower property of the Markov semigroup gives Claim (ii) of Theorem [7.1](https://arxiv.org/html/2605.28983#S7.Thmtheorem1); the finite-depth error and joint-limit exactness are quantified in Theorems [E.1](https://arxiv.org/html/2605.28983#A5.Thmtheorem1) and [E.2](https://arxiv.org/html/2605.28983#A5.Thmtheorem2) (Appendix [E](https://arxiv.org/html/2605.28983#A5)).

###### Theorem 4.3 (Maximal Hamiltonian class).

The exact Hopf–Cole identity $f_{\varepsilon}^{N}(x)=|x|^{2}/(4t)-u_{\varepsilon}^{N}(x,t)$ with the *isotropic* quadratic $|x|^{2}/(4t)$ holds if and only if $H(p)=|p|^{2}$ (i.e. $A=I$); this is Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1). For general $A\succ 0$, the analogous identity replaces $|x|^{2}/(4t)$ with $x^{\top}A^{-1}x/(4t)$, as stated in Theorem [4.4](https://arxiv.org/html/2605.28983#S4.Thmtheorem4). For any Hamiltonian outside the quadratic class, the Hopf–Cole substitution leaves a nonzero residual reaction term, and the identity fails.

Proof. See Appendix [H](https://arxiv.org/html/2605.28983#A8). $\square$

###### Theorem 4.4 (Anisotropic NN–PDE identity).

Let $A_{\theta}\succ 0$ be a learnable $d\times d$ positive-definite matrix. Set $W_{j}=A_{\theta}^{-1}y_{j}/(2t)$ and $b_{j}=-g(y_{j})-y_{j}^{\top}A_{\theta}^{-1}y_{j}/(4t)$. Then

$$\mathrm{LSE}_{\varepsilon}(Wx+b)=\frac{x^{\top}A_{\theta}^{-1}x}{4t}-u_{\varepsilon}^{A_{\theta},N}(x,t), \tag{8}$$

where $u_{\varepsilon}^{A_{\theta},N}$ solves $\partial_{t}u+(\nabla u)^{\top}A_{\theta}(\nabla u)=\varepsilon\,\nabla\!\cdot\!(A_{\theta}\nabla u)$ under $\mu_{N}=\frac{1}{N}\sum_{j}\delta_{y_{j}}$. No approximation is made; $A_{\theta}=I$ recovers Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1).

Proof. See Appendix [E](https://arxiv.org/html/2605.28983#A5). $\square$
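The algebra behind the anisotropic identity is a completion of the square. The sketch below assumes $A_{\theta}$ is symmetric and takes the sum over atoms without a $1/N$ factor, as in Theorem 4.1; a normalized measure would shift $u$ by the constant $\varepsilon\log N$.

$$\begin{aligned}
W_{j}\cdot x+b_{j}
&=\frac{y_{j}^{\top}A_{\theta}^{-1}x}{2t}-g(y_{j})-\frac{y_{j}^{\top}A_{\theta}^{-1}y_{j}}{4t}\\
&=\frac{x^{\top}A_{\theta}^{-1}x}{4t}-g(y_{j})-\frac{(x-y_{j})^{\top}A_{\theta}^{-1}(x-y_{j})}{4t},\\
\mathrm{LSE}_{\varepsilon}(Wx+b)
&=\frac{x^{\top}A_{\theta}^{-1}x}{4t}+\varepsilon\log\sum_{j=1}^{N}\exp\!\left(\frac{-g(y_{j})-\tfrac{(x-y_{j})^{\top}A_{\theta}^{-1}(x-y_{j})}{4t}}{\varepsilon}\right).
\end{aligned}$$

The first term factors out of the sum because it does not depend on $j$, and the remaining log-sum is the negative of the discrete Hopf–Cole expression with the anisotropic transport cost. Setting $A_{\theta}=I$ reduces each line to the isotropic case.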

## 5 The Tropical Limit and Convex Optimization

#### The Hopf–Lax Formula\.

###### Theorem 5.1 (Tropical collapse to Hopf–Lax).

Let $g$ be Lipschitz on $\mathbb{R}^{d}$. As $\varepsilon\to 0$, the Hopf–Cole solution ([3](https://arxiv.org/html/2605.28983#S4.E3)) converges pointwise to

$$u_{0}(x,t)=\inf_{y\in\mathbb{R}^{d}}\!\left\{g(y)+\frac{|x-y|^{2}}{4t}\right\}=\bigl(g\,\square\,c_{t}\bigr)(x), \tag{9}$$

where $c_{t}(z)=|z|^{2}/(4t)$. This is the *Hopf–Lax formula* (Lax, [1957](https://arxiv.org/html/2605.28983#bib.bib10); Evans, [2010](https://arxiv.org/html/2605.28983#bib.bib11)), the unique Lipschitz viscosity solution (Crandall and Lions, [1983](https://arxiv.org/html/2605.28983#bib.bib8)) of the inviscid HJ equation $\partial_{t}u+|\nabla u|^{2}=0$, $u(\cdot,0)=g$.

Proof. See Appendix [E](https://arxiv.org/html/2605.28983#A5). $\square$

Under the discrete measure $\mu_{N}$, the tropical limit of the Hopf–Cole predictor is

$$u_{0}^{N}(x)=\min_{j}\!\left\{g(y_{j})+\frac{|x-y_{j}|^{2}}{4t}\right\}, \tag{10}$$

so that $f_{0}^{N}(x)=|x|^{2}/(4t)-u_{0}^{N}(x)$. This object is simultaneously: (i) a tropical inner product in $(\mathbb{R},\min,+)$; (ii) a MASO (Balestriero and Baraniuk, [2018](https://arxiv.org/html/2605.28983#bib.bib21)); (iii) a linear program in $x$ (Balestriero and Baraniuk, [2021](https://arxiv.org/html/2605.28983#bib.bib22)); and (iv) a piecewise-affine function with breakpoints at the support points $\{y_{j}\}$.
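The discrete tropical limit can be checked against the smooth predictor. This is a minimal NumPy sketch under assumed random data: it evaluates equation (10), confirms that the max-affine layer equals $|x|^{2}/(4t)-u_{0}^{N}(x)$, and tracks how the viscous value approaches the minimum as $\varepsilon$ shrinks.

```python
import numpy as np

def lse(z, eps):
    """LSE_eps(z) = eps * log(sum_i exp(z_i / eps)), shifted by max(z) for stability."""
    z = np.asarray(z, dtype=float)
    m = z.max()
    return m + eps * np.log(np.exp((z - m) / eps).sum())

rng = np.random.default_rng(2)
d, N, t = 2, 6, 0.5                 # illustrative sizes (assumption)
y = rng.normal(size=(N, d))         # support points y_j
g = rng.normal(size=N)              # arbitrary values standing in for g(y_j)
x = rng.normal(size=d)              # query point

cost = g + ((x - y) ** 2).sum(axis=1) / (4 * t)
u0 = cost.min()                     # equation (10)

# Same weights and biases as in Theorem 4.1, with max in place of LSE.
W = y / (2 * t)
b = -g - (y ** 2).sum(axis=1) / (4 * t)
f0 = (W @ x + b).max()              # tropical (max-affine) layer

# Viscous discrete solution; it satisfies u0 - eps * log(N) <= u_eps <= u0.
eps_values = [1.0, 0.1, 0.01]
u_eps = [-lse(-cost, eps) for eps in eps_values]
```

The two-sided bound on `u_eps` is the sandwich inequality for $\mathrm{LSE}_{\varepsilon}$ applied to the vector of negative costs.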

#### ResNets and ODE Characteristics\.

A ResNet layer $x_{l+1}=x_{l}+h\,F(x_{l},W_{l})$ is the Euler discretization of $\dot{x}(t)=F(x(t),\theta(t))$ (He et al., [2016](https://arxiv.org/html/2605.28983#bib.bib19); Lu et al., [2018](https://arxiv.org/html/2605.28983#bib.bib3); Chen et al., [2018](https://arxiv.org/html/2605.28983#bib.bib18); E, [2017](https://arxiv.org/html/2605.28983#bib.bib17)).

###### Proposition 5.2 (ResNet as HJ characteristics).

The ResNet recurrence with step size $h$ converges, as $h\to 0$ and $L\to\infty$ with $Lh=T$ fixed, to the solution of $\dot{x}=F(x,\theta(t))$. This ODE is the characteristic equation of the HJ PDE with Hamiltonian $H(x,p)=p\cdot F(x,\theta)$: characteristics satisfy $\dot{x}=\nabla_{p}H=F(x,\theta)$ and $\dot{p}=-\nabla_{x}H$ (Chen et al., [2018](https://arxiv.org/html/2605.28983#bib.bib18); Tong et al., [2025](https://arxiv.org/html/2605.28983#bib.bib31)). The construction applies to any smooth recurrence of the form $x_{\ell+1}=x_{\ell}+hF(x_{\ell},W_{\ell})$, with the characteristic equations determined by $F$ and $\theta$.

In the tropical limit, the residual layer becomes $x_{l+1}=\max(x_{l},\,W_{l}\otimes_{\mathrm{tr}}x_{l})$, a tropical residual whose continuous limit is the inviscid characteristic equation.
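The Euler reading of the residual recurrence can be illustrated on a toy vector field with a known flow. In this sketch (NumPy, all names hypothetical) the residual branch is the fixed linear field $F(x)=Ax$ with $A$ a rotation generator, chosen only because the exact solution is then $(\cos T,\sin T)$; it is not a trained ResNet block.

```python
import numpy as np

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])   # generator of a planar rotation (toy choice)

def F(x):
    """Residual branch: a fixed linear field so the ODE has a closed form."""
    return A @ x

def resnet_forward(x0, depth, T=1.0):
    """Apply `depth` residual updates x <- x + h F(x) with step h = T / depth."""
    h = T / depth
    x = np.array(x0, dtype=float)
    for _ in range(depth):
        x = x + h * F(x)
    return x

x0 = np.array([1.0, 0.0])
T = 1.0
exact = np.array([np.cos(T), np.sin(T)])   # solution of x' = A x at time T

# Error of the depth-L network against the ODE flow, with L h = T held fixed.
errors = [np.linalg.norm(resnet_forward(x0, L, T) - exact) for L in (10, 100, 1000)]
```

The error shrinks roughly in proportion to the step size $h=T/L$, which is the first-order convergence expected of forward Euler.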

Flow matching (Lipman et al., [2023](https://arxiv.org/html/2605.28983#bib.bib36); Liu et al., [2023](https://arxiv.org/html/2605.28983#bib.bib37)) and score-based diffusion (Song et al., [2021](https://arxiv.org/html/2605.28983#bib.bib38)) fit the same structure (Appendix [B](https://arxiv.org/html/2605.28983#A2)).

#### Recurrent Architectures: RNNs, LSTMs, and SSMs\.

Recurrent and state-space architectures fit the same framework through the process-discretization route of Proposition [5.2](https://arxiv.org/html/2605.28983#S5.Thmtheorem2).

###### Proposition 5.4 (Recurrent architectures as HJ characteristics).

The following structural correspondences hold:

1. (a) RNNs. The recurrence $h_{t}=\sigma(W_{h}h_{t-1}+W_{x}x_{t})$ is the unit-step Euler discretization of $\dot{h}=-h+\sigma(W_{h}h+W_{x}x(t))$, the characteristic ODE of the HJ PDE with Hamiltonian $H(h,p)=p\cdot(-h+\sigma(W_{h}h+W_{x}x))$.
2. (b) State Space Models (SSMs). The linear SSM $\dot{h}(t)=A(t)h(t)+B(t)x(t)$, $y(t)=C(t)h(t)$ is the characteristic equation of a *linear* HJ PDE with quadratic Hamiltonian $H(h,p)=p^{\top}A(t)h+p^{\top}B(t)x$. Time-dependent matrices $A(t),B(t),C(t)$ (as in Mamba's selective scan) make the characteristics non-autonomous, the same structure as Tong et al. ([2025](https://arxiv.org/html/2605.28983#bib.bib31)). The tropical limit of the SSM is a max-plus linear system $h_{t+1}=A\otimes_{\mathrm{tr}}h_{t}\oplus B\otimes_{\mathrm{tr}}x_{t}$, the algebra used in scheduling and discrete-event systems.
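Both correspondences can be made concrete in a few lines. The sketch below assumes NumPy and takes $\sigma=\tanh$ (an assumption; the proposition leaves $\sigma$ generic). It checks that one Euler step of size 1 on $\dot{h}=-h+\sigma(W_{h}h+W_{x}x)$ reproduces the RNN update, and evaluates one step of the max-plus SSM on small hand-picked matrices.

```python
import numpy as np

def rnn_step(h, x, Wh, Wx):
    """Standard RNN update h_t = tanh(Wh h_{t-1} + Wx x_t)."""
    return np.tanh(Wh @ h + Wx @ x)

def euler_step(h, x, Wh, Wx, dt):
    """One forward-Euler step of h' = -h + tanh(Wh h + Wx x)."""
    return h + dt * (-h + np.tanh(Wh @ h + Wx @ x))

def trop_matvec(A, x):
    """Tropical product: (A (x)_tr x)_i = max_j (A_ij + x_j)."""
    return (np.asarray(A, dtype=float) + np.asarray(x, dtype=float)[None, :]).max(axis=1)

def maxplus_ssm_step(A, B, h, x):
    """Max-plus SSM update: h_next = (A (x)_tr h) (+) (B (x)_tr x), with (+) = max."""
    return np.maximum(trop_matvec(A, h), trop_matvec(B, x))

rng = np.random.default_rng(3)
Wh, Wx = rng.normal(size=(3, 3)), rng.normal(size=(3, 2))
h, x = rng.normal(size=3), rng.normal(size=2)

h_rnn = rnn_step(h, x, Wh, Wx)
h_euler = euler_step(h, x, Wh, Wx, dt=1.0)   # unit step recovers the RNN update

A = np.array([[0.0, 1.0], [2.0, 0.0]])
B = np.array([[0.0], [5.0]])
# Row 0: max(max(0+1, 1+0), 0+0.5) = 1.0.  Row 1: max(max(2+1, 0+0), 5+0.5) = 5.5.
h_next = maxplus_ssm_step(A, B, np.array([1.0, 0.0]), np.array([0.5]))
```

With `dt` below 1 the Euler step interpolates between the previous state and the RNN update, which is one way to read the leak term $-h$ in the ODE.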

The key distinction across architectures is how viscosity enters: in feedforward networks $\varepsilon$ is global (uniform softmax temperature); in SSMs it is input-dependent ($A(t)$ controlled by $x(t)$); in LSTMs it is gated per cell (Remark [5.5](https://arxiv.org/html/2605.28983#S5.Thmtheorem5)). The LSE class is further closed under composition, affine input transformations, and residual connections (Appendix [H](https://arxiv.org/html/2605.28983#A8)).

## 6 The $\varepsilon$ Parameter: A Unifying Deformation

The parameter $\varepsilon$ simultaneously indexes four perspectives on the same object: neural network ($\varepsilon>0$: LSE/softmax; $\varepsilon=0$: max/ReLU), tropical algebra ($(\mathbb{R},+,\times)$ vs. $(\mathbb{R},\max,+)$), PDE (viscous HJ vs. inviscid HJ), and convex optimization (entropic regularization vs. LP vertex). The full dictionary is in Table [5](https://arxiv.org/html/2605.28983#A8.T5) (Appendix [H](https://arxiv.org/html/2605.28983#A8)). Infinite-width and overparameterization consequences (Gaussian process connection, double-descent as near-shock formation) are developed in Appendix [E](https://arxiv.org/html/2605.28983#A5).
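The network-side reading of the deformation can be checked numerically. The sketch below assumes the standard temperature-scaled log-sum-exp, $\mathrm{LSE}_{\varepsilon}(z)=\varepsilon\log\sum_{j}e^{z_{j}/\varepsilon}$, and uses made-up logits; it records how far the soft maximum sits above the hard maximum as $\varepsilon$ shrinks.

```python
import math


def lse(z, eps):
    """Temperature-eps log-sum-exp, eps * log(sum_j exp(z_j / eps)), computed stably."""
    m = max(z)
    return m + eps * math.log(sum(math.exp((zj - m) / eps) for zj in z))


z = [0.3, -1.2, 2.0, 1.9]  # illustrative logits, not from the paper

# Gap between the soft maximum and the hard maximum for decreasing eps.
gaps = {eps: lse(z, eps) - max(z) for eps in (1.0, 0.1, 0.01)}
for eps, gap in gaps.items():
    print(f"eps={eps:<5} LSE - max = {gap:.3e}  (upper bound {eps * math.log(len(z)):.3e})")
```

The gap always lies between $0$ and $\varepsilon\log N$, so it vanishes linearly in $\varepsilon$ at worst; this is the elementary inequality behind the max/ReLU limit at $\varepsilon=0$.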

## 7 The Commutative Diagram

The correspondence assembles into a commutative diagram:

$$
\begin{array}{ccc}
\text{NN}\;(f_{\varepsilon}^{N},\;\varepsilon>0) & \xrightarrow{\;\varepsilon\to 0\;} & \text{Tropical NN}\;(f_{0}^{N})\\[6pt]
\updownarrow\;\text{(exact)} & & \updownarrow\;\text{(exact)}\\[6pt]
\text{Viscous HJ}\;(u_{\varepsilon}) & \xrightarrow{\;\varepsilon\to 0\;} & \text{Inviscid HJ / Hopf--Lax}\;(u_{0})
\end{array}
\tag{11}
$$

Moving right is Maslov dequantization / ultradiscretization. Moving vertically is the identification of the layer with the PDE solution (discrete vs. continuous measure, Theorems [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1) and [5.1](https://arxiv.org/html/2605.28983#S5.Thmtheorem1)). The diagram commutes.

###### Theorem 7.1 (Commutativity under Lipschitz conditions).

Suppose $g$ is Lipschitz. All five claims pertain to the quadratic Hamiltonian $H(p)=|p|^{2}$ (Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)) or its anisotropic extension (Theorem [4.4](https://arxiv.org/html/2605.28983#S4.Thmtheorem4)), which is the Hamiltonian for which the LSE layer with linear logits is the exact discrete-measure instantiation. The semigroup facts underlying Claims (ii)–(v), namely the tower property, quadrature convergence, the viscous-to-inviscid limit, and commutativity, hold abstractly for any Lipschitz $g$ and convex superlinear $H$ (with the appropriate network form); their connection to the specific LSE network is through Claim (i). The following claims hold:

1. (i) Layer $f_{\varepsilon}^{N}$ satisfies $f_{\varepsilon}^{N}=|x|^{2}/(4t)-u_{\varepsilon}^{N}$ (exact, Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)).
2. (ii) Composing $L$ layers corresponds to the HJ semigroup $S_{t_{1}}\circ\cdots\circ S_{t_{L}}$ (tower property); the discrete network passes a point evaluation between layers whereas the exact PDE semigroup integrates the full function (Appendix [E](https://arxiv.org/html/2605.28983#A5)).
3. (iii) As $N\to\infty$, $u_{\varepsilon}^{N}\to u_{\varepsilon}$ at rate $O(N^{-1/d})$ (quadrature approximation).
4. (iv) As $\varepsilon\to 0$, $u_{\varepsilon}\to u_{0}$ pointwise (Theorem [5.1](https://arxiv.org/html/2605.28983#S5.Thmtheorem1)).
5. (v) The two paths from $u_{\varepsilon}^{N}$ to $u_{0}$ (first $N\to\infty$ then $\varepsilon\to 0$, or first $\varepsilon\to 0$ then $N\to\infty$) yield the same limit.

Proof. See Appendix [E](https://arxiv.org/html/2605.28983#A5). $\square$

Claim (v) is the non-trivial commutativity: the limits $\varepsilon\to 0$ and $N\to\infty$ can be exchanged. Lipschitz conditions are required because the quadrature convergence rate depends on $\varepsilon$ through the concentration of the softmax weights.

## 8 Consequences of the Correspondence

The identification of neural network layers with Hopf–Cole solutions yields quantitative consequences for generalization, robustness, and the structure of the backward pass\.

#### Generalization via PDE Regularity\.

###### Theorem 8.1 (Generalization bound from Hopf–Cole quadrature).

Let $f^{*}:\mathbb{R}^{d}\to\mathbb{R}$ be $L$-Lipschitz on a compact set $\mathcal{X}\subset\mathbb{R}^{d}$. The Hopf–Cole predictor $u_{\varepsilon}^{N}(x)=|x|^{2}/(4t)-f_{\varepsilon}^{N}(x)$ with $N$ neurons satisfies

$$
\inf_{W,b}\,\mathbb{E}_{x}\bigl|u_{\varepsilon}^{N}(x)-f^{*}(x)\bigr|\;\leq\;C_{1}N^{-1/d}+C_{2}\varepsilon+L^{2}t,
\tag{12}
$$

where $C_{1}=C_{1}(L,d)$, $C_{2}=C_{2}(L,t)$, and the optimal weights satisfy $\|W_{j}\|_{2}\leq M:=\mathrm{diam}(\mathcal{X})/(2t)$. Setting $\varepsilon^{*}\asymp N^{-1/d}$ gives approximation error $O(N^{-1/d})$. A Rademacher bound yields excess risk $O(N^{-1/d}+M\sqrt{N/n})$; for fixed $t$ (hence fixed $M$), $N^{*}\asymp(n/M^{2})^{d/(d+2)}$ recovers rate $O(n^{-1/(d+2)}+t)$, where $O(t)$ is a Moreau-envelope approximation bias (see Appendix [E](https://arxiv.org/html/2605.28983#A5)).

Proof sketch. See Appendix [E](https://arxiv.org/html/2605.28983#A5). $\square$
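The choice of $N^{*}$ in Theorem 8.1 comes from equating the approximation term $N^{-1/d}$ with the estimation term $M\sqrt{N/n}$. The sketch below performs that balance with all hidden constants set to $1$ (an assumption made only for illustration) and with arbitrary example values of $n$, $d$, and $M$.

```python
def balanced_width(n, d, M):
    """Width N* that equates N**(-1/d) with M * sqrt(N / n) (constants set to 1)."""
    return (n / M**2) ** (d / (d + 2))


def excess_risk_terms(N, n, d, M):
    """Return the (approximation, estimation) terms of the bound, constants set to 1."""
    return N ** (-1.0 / d), M * (N / n) ** 0.5


n, d, M = 10_000, 2, 1.0  # illustrative sample size, dimension, and weight bound
N_star = balanced_width(n, d, M)
approx, estim = excess_risk_terms(N_star, n, d, M)
print(N_star, approx, estim, n ** (-1.0 / (d + 2)))
```

At the balanced width both terms coincide with $(n/M^{2})^{-1/(d+2)}$, which is the $n^{-1/(d+2)}$ rate quoted in the theorem when $M$ is held fixed.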

The optimal viscosity $\varepsilon^{*}\asymp N^{-1/d}$ equals the grid spacing: below this scale quadrature under-resolves; above it viscosity over-smooths. For depth-$L$ networks, Claim (ii) of Theorem [7.1](https://arxiv.org/html/2605.28983#S7.Thmtheorem1) adds a factor of $L$ to term (i) via the semigroup composition error; mutatis mutandis, the same balance $\varepsilon^{*}\asymp N^{-1/d}$ applies at each layer.

###### Corollary 8.2 (Adversarial robustness via viscosity).

Let $\|W\|_{2,\infty}=\max_{j}\|W_{j}\|_{2}$ denote the maximum row norm. The Hessian of $f_{\varepsilon}^{N}$ satisfies

$$
\bigl\|\nabla_{x}^{2}f_{\varepsilon}^{N}(x)\bigr\|_{2}\;\leq\;\frac{\|W\|_{2,\infty}^{2}}{\varepsilon}
\tag{13}
$$

for all $x$. The gradient satisfies $\|\nabla_{x}f_{\varepsilon}^{N}(x)\|_{2}\leq\|W\|_{2,\infty}$ (since $\nabla_{x}f_{\varepsilon}^{N}=\sum_{j}\pi_{j}(x;\varepsilon)\,W_{j}$ is a softmax-weighted average of the rows, hence a convex combination). A second-order Taylor expansion in $\delta$ then yields: an adversarial perturbation with $\|\delta\|_{2}\leq r$ changes the output by at most

$$
\bigl|f_{\varepsilon}^{N}(x+\delta)-f_{\varepsilon}^{N}(x)\bigr|\;\leq\;\|W\|_{2,\infty}\,r+\frac{\|W\|_{2,\infty}^{2}\,r^{2}}{2\varepsilon}.
\tag{14}
$$

For a prescribed output tolerance $\tau>0$, solving ([14](https://arxiv.org/html/2605.28983#S8.E14)) as a quadratic in $r$ yields the certified radius

$$
r^{*}(\tau,\varepsilon)\;=\;\frac{\varepsilon}{\|W\|_{2,\infty}}\Bigl(\sqrt{1+\tfrac{2\tau}{\varepsilon}}-1\Bigr)\;\xrightarrow{\;\varepsilon\to\infty\;}\;\frac{\tau}{\|W\|_{2,\infty}}.
\tag{15}
$$

Increasing $\varepsilon$ suppresses curvature and enlarges the certified radius; in the PDE interpretation, the viscous term $\varepsilon\Delta u$ prevents shock formation and smooths the solution against pointwise perturbations. The parameter $\varepsilon$ therefore governs a robustness–expressiveness tradeoff: large $\varepsilon$ certifies wide safe radii but over-smooths toward the diffusive wave limit; small $\varepsilon$ sharpens expressiveness but pushes the network toward the tropical (piecewise-linear) limit and narrows the certified radius.

Proof. See Appendix [E](https://arxiv.org/html/2605.28983#A5). $\square$
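The certified radius of Corollary 8.2 is the positive root of the quadratic obtained by setting the perturbation bound equal to $\tau$. The sketch below evaluates that closed form and plugs it back into the bound; the tolerance and the row-norm value are arbitrary illustrative numbers, and the function names are hypothetical.

```python
import math


def certified_radius(tau, eps, w_norm):
    """Positive root r of  w_norm * r + w_norm**2 * r**2 / (2 * eps) = tau."""
    return (eps / w_norm) * (math.sqrt(1.0 + 2.0 * tau / eps) - 1.0)


def output_change_bound(r, eps, w_norm):
    """First-order plus curvature bound on the output change for a radius-r perturbation."""
    return w_norm * r + (w_norm * r) ** 2 / (2.0 * eps)


tau, w_norm = 0.5, 2.0  # illustrative tolerance and maximum row norm
eps_values = (0.1, 1.0, 10.0, 1e6)
radii = [certified_radius(tau, eps, w_norm) for eps in eps_values]

for eps, r in zip(eps_values, radii):
    print(f"eps={eps:<9} r*={r:.6f}  bound(r*)={output_change_bound(r, eps, w_norm):.6f}")
```

The printed radii grow with $\varepsilon$ and approach $\tau/\|W\|_{2,\infty}$, the purely first-order radius, matching the limit stated in the corollary.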

#### Backpropagation as the Adjoint HJ Equation\.

###### Theorem 8.4 (Backpropagation = adjoint HJ equation).

Let $x_{l+1}=x_{l}+h\,F(x_{l},W_{l})$ be a ResNet with $L$ layers and scalar loss $\mathcal{L}(x_{L})$. Define co-state $p_{l}=\partial\mathcal{L}/\partial x_{l}$ with terminal condition $p_{L}=\nabla_{x_{L}}\mathcal{L}$. The backpropagation recurrence

$$
p_{l-1}=p_{l}+h\,\bigl(\nabla_{x}F(x_{l-1},W_{l-1})\bigr)^{\top}p_{l}
\tag{16}
$$

is the forward Euler scheme, integrated in reverse time, of the co-state equation of the Hamiltonian $H(x,p)=p^{\top}F(x,\theta)$:

$$
\dot{p}(t)=-\nabla_{x}H\bigl(x(t),p(t)\bigr)=-\bigl(\nabla_{x}F(x(t),\theta)\bigr)^{\top}p(t).
\tag{17}
$$

The forward pass $\dot{x}=\nabla_{p}H=F(x,\theta)$ and backward pass $\dot{p}=-\nabla_{x}H$ together constitute the complete Hamiltonian flow of the HJ PDE of Proposition [5.2](https://arxiv.org/html/2605.28983#S5.Thmtheorem2).

Proof. See Appendix [E](https://arxiv.org/html/2605.28983#A5). $\square$
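The co-state recurrence of Theorem 8.4 can be checked against a finite-difference gradient on a toy model. The sketch below assumes a scalar residual network with $F(x,w)=\tanh(wx)$ and loss $\tfrac{1}{2}x_{L}^{2}$; both choices, and all numerical values, are illustrative and not taken from the paper.

```python
import math


def forward(x0, weights, h):
    """Residual forward pass x_{l+1} = x_l + h * tanh(w_l * x_l); returns all states."""
    xs = [x0]
    for w in weights:
        xs.append(xs[-1] + h * math.tanh(w * xs[-1]))
    return xs


def costate_backward(xs, weights, h):
    """Co-state recurrence p_{l-1} = p_l + h * dF/dx(x_{l-1}, w_{l-1}) * p_l."""
    p = xs[-1]  # terminal condition: gradient of the loss 0.5 * x_L**2
    for x, w in zip(reversed(xs[:-1]), reversed(weights)):
        dF_dx = w * (1.0 - math.tanh(w * x) ** 2)
        p = p + h * dF_dx * p
    return p  # this is dLoss/dx_0


def loss(x0, weights, h):
    """Scalar loss 0.5 * x_L**2 evaluated through the forward pass."""
    return 0.5 * forward(x0, weights, h)[-1] ** 2


weights, h, x0 = [0.7, -0.4, 1.1], 0.1, 0.5  # illustrative values
p0 = costate_backward(forward(x0, weights, h), weights, h)

delta = 1e-6  # central finite difference for comparison
fd = (loss(x0 + delta, weights, h) - loss(x0 - delta, weights, h)) / (2 * delta)
print(p0, fd)
```

The two numbers agree up to finite-difference error, since the recurrence is the chain rule applied layer by layer to the Euler step.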

*Theorem [8.4](https://arxiv.org/html/2605.28983#S8.Thmtheorem4) shows that backpropagation is not a heuristic: it integrates the Hamiltonian system whose forward equation the network already solves.* The gradient update on $W_{l}$ is the Pontryagin Maximum Principle (PMP) optimality condition: training is execution of the PMP for the optimal initial-value problem. For the LSE case, the backward pass is exactly the adjoint heat semigroup: $\partial\mathcal{L}/\partial x=W^{\top}\pi\cdot\partial\mathcal{L}/\partial f_{\varepsilon}^{N}$, $\pi=\nabla_{z}\mathrm{LSE}_{\varepsilon}(z)$ (Proposition [E.6](https://arxiv.org/html/2605.28983#A5.Thmtheorem6), Appendix [E](https://arxiv.org/html/2605.28983#A5)). At initialization, the NTK $K_{ab}(0)=\varepsilon^{2}\langle\pi(x_{a};0),\pi(x_{b};0)\rangle\succ 0$ a.s. for $N\geq n$, guaranteeing linear convergence in the kernel regime (Proposition [D.3](https://arxiv.org/html/2605.28983#A4.Thmtheorem3)).
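For a single LSE layer the backward pass reduces to the softmax-weighted combination $W^{\top}\pi$. The sketch below assumes the standard definition $f(x)=\varepsilon\log\sum_{j}\exp((W_{j}\cdot x+b_{j})/\varepsilon)$ and uses made-up weights; it computes the input gradient in closed form so that it can be compared with finite differences and with the row-norm bound of Corollary 8.2.

```python
import math


def logits(x, W, b):
    """Affine logits z_j = W_j . x + b_j."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]


def lse_layer(x, W, b, eps):
    """f(x) = eps * log(sum_j exp(z_j / eps)), computed stably."""
    z = logits(x, W, b)
    m = max(z)
    return m + eps * math.log(sum(math.exp((zj - m) / eps) for zj in z))


def lse_input_gradient(x, W, b, eps):
    """Gradient in x: W^T pi, with pi the softmax of z / eps."""
    z = logits(x, W, b)
    m = max(z)
    e = [math.exp((zj - m) / eps) for zj in z]
    total = sum(e)
    pi = [ej / total for ej in e]
    return [sum(p * row[k] for p, row in zip(pi, W)) for k in range(len(x))]


# Illustrative weights, biases, and input (not taken from the paper).
W = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
b = [0.0, -0.5, 0.2]
x = [0.3, -0.7]
eps = 0.5

grad = lse_input_gradient(x, W, b, eps)
print(grad, math.sqrt(sum(g * g for g in grad)))
```

Because the softmax weights are nonnegative and sum to one, the gradient is a convex combination of the rows of $W$, so its norm cannot exceed the largest row norm (here $2$).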

#### Hallucination, Scaling, and Injection Attacks\.

###### Proposition 8.5 (Hallucination as deterministic OOD extrapolation).

Let $j^{*}=\arg\min_{j}\bigl\{g(y_{j})+|x-y_{j}|^{2}/(4t)\bigr\}$ be the dominant neuron (Hopf–Lax minimizer) and define the energy gap $\Delta(x)=\min_{j\neq j^{*}}\bigl\{g(y_{j})+|x-y_{j}|^{2}/(4t)-g(y_{j^{*}})-|x-y_{j^{*}}|^{2}/(4t)\bigr\}>0$. Then

$$
\bigl|f_{\varepsilon}^{N}(x)-(W_{j^{*}}\cdot x+b_{j^{*}})\bigr|\;\leq\;\varepsilon\log\!\Bigl(1+(N-1)e^{-\Delta(x)/\varepsilon}\Bigr),
\tag{18}
$$

which is $O\bigl((N-1)\varepsilon\,e^{-\Delta(x)/\varepsilon}\bigr)$ as $\varepsilon\to 0$ or $\Delta(x)\to\infty$. At any point $x$ lying outside the diffusion radius $\sqrt{2\varepsilon t}$ of every support point, the network output is exponentially close to the linear extrapolation from the dominant neuron, with no dependence on the training distribution at $x$.

Proof. See Appendix [E](https://arxiv.org/html/2605.28983#A5). $\square$
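The bound of Proposition 8.5 can be illustrated directly at the level of logits. The sketch below assumes that the energy gap $\Delta(x)$ coincides with the gap between the largest logit $W_{j^{*}}\cdot x+b_{j^{*}}$ and the runner-up (the exact dictionary between logits and Hopf–Lax energies is given by Theorem 4.1 and is not reproduced here), and the logit values are made up.

```python
import math


def lse(z, eps):
    """Temperature-eps log-sum-exp, computed stably."""
    m = max(z)
    return m + eps * math.log(sum(math.exp((zj - m) / eps) for zj in z))


def dominant_gap(z):
    """Return (index of the largest logit, gap to the runner-up)."""
    order = sorted(range(len(z)), key=lambda j: z[j], reverse=True)
    return order[0], z[order[0]] - z[order[1]]


z = [1.0, 0.2, -0.5, 0.1]  # illustrative logits W_j . x + b_j at one fixed x
eps = 0.1

j_star, gap = dominant_gap(z)
deviation = lse(z, eps) - z[j_star]  # distance from the dominant affine piece
bound = eps * math.log(1.0 + (len(z) - 1) * math.exp(-gap / eps))
print(j_star, gap, deviation, bound)
```

With a gap of $0.8$ and $\varepsilon=0.1$ the deviation is of order $10^{-5}$, which shows how quickly the output locks onto the dominant affine piece once the gap exceeds a few multiples of $\varepsilon$.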

###### Assumption 8.7 (Manifold hypothesis).

The data-generating measure $\mu$ on $\mathbb{R}^{d}$ is supported on (or concentrated near) a Riemannian submanifold of intrinsic dimension $d_{\mathrm{eff}}\leq d$. The effective dimension $d_{\mathrm{eff}}$ equals $d$ when $\mu$ is non-degenerate in all directions, and equals the submanifold dimension when $\mu$ concentrates on a lower-dimensional structure.

###### Proposition 8.8 (Scaling laws from intrinsic dimension).

Under the approximation bound of Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1), the training loss satisfies $\mathcal{L}(N)\asymp N^{-1/d_{\mathrm{eff}}}$, where $d_{\mathrm{eff}}$ is the intrinsic dimension of the support of the data-generating measure $\mu$. Equivalently, the empirical scaling law exponent $\alpha$ in $\mathcal{L}(N)\propto N^{-\alpha}$ satisfies

$$
\alpha=\frac{1}{d_{\mathrm{eff}}},\qquad d_{\mathrm{eff}}=\frac{1}{\alpha}.
\tag{19}
$$

Proof sketch. See Appendix [E](https://arxiv.org/html/2605.28983#A5). $\square$
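Reading $d_{\mathrm{eff}}$ off a loss curve amounts to a log-log regression followed by inverting the slope. The sketch below fits $\alpha$ by ordinary least squares on synthetic losses generated with a known exponent; the data are not measurements, and the helper name is hypothetical.

```python
import math


def fit_scaling_exponent(widths, losses):
    """Least-squares slope of log(loss) against log(N); returns alpha in loss ~ N**(-alpha)."""
    xs = [math.log(N) for N in widths]
    ys = [math.log(L) for L in losses]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var = sum((x - x_bar) ** 2 for x in xs)
    return -cov / var


# Synthetic losses generated with a known exponent (not measurements).
true_d_eff = 4
widths = [2**k for k in range(4, 12)]
losses = [3.0 * N ** (-1.0 / true_d_eff) for N in widths]

alpha_hat = fit_scaling_exponent(widths, losses)
d_eff_hat = 1.0 / alpha_hat
print(alpha_hat, d_eff_hat)
```

On noiseless power-law data the fit recovers the exponent exactly up to rounding; on real loss curves the estimate depends on the range of $N$ used and on any irreducible loss offset, which this sketch does not model.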

Table [7](https://arxiv.org/html/2605.28983#A9.T7) (Appendix [I](https://arxiv.org/html/2605.28983#A9)) estimates $d_{\mathrm{eff}}=1/\alpha$ from published scaling exponents (Kaplan et al., [2020](https://arxiv.org/html/2605.28983#bib.bib27); Henighan et al., [2020](https://arxiv.org/html/2605.28983#bib.bib34); Hoffmann et al., [2022](https://arxiv.org/html/2605.28983#bib.bib35)); values range from $d_{\mathrm{eff}}\approx 2.6$ (math) to $13$ (GPT-scale language).

#### Attribution Landscape and Bifurcation\.

###### Theorem 8.9 (Fold bifurcations of the attribution-entropy landscape).

Let $H(x;\varepsilon)=-\sum_{j}\pi_{j}\log\pi_{j}$ be the attribution entropy. Assume $\{y_{j}\}$ are distinct and each $y_{k}$ is the strict Hopf–Lax minimizer at itself: $g(y_{k})<g(y_{j})+|y_{k}-y_{j}|^{2}/(4t)$ for all $j\neq k$ (generic). As $\varepsilon\to 0$, $H$ develops $N$ local minima at $\{y_{j}\}$; as $\varepsilon$ increases, each decrease in critical-point count is a fold bifurcation where a saddle annihilates an adjacent minimum at $\varepsilon_{\mathrm{bif}}(j,k)$, merging the attribution basins of $j$ and $k$ (generic: simple zero eigenvalue of $\nabla_{x}^{2}H$ with nonzero crossing derivative).

Proof. See Appendix [F](https://arxiv.org/html/2605.28983#A6). $\square$
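The quantity whose landscape Theorem 8.9 describes is the Shannon entropy of the softmax attribution weights. The sketch below evaluates it at a single input for several temperatures, using made-up logits; it illustrates only the two extremes (near-zero entropy for small $\varepsilon$, near $\log N$ for large $\varepsilon$) and does not attempt to locate the fold bifurcations themselves.

```python
import math


def softmax(z, eps):
    """Softmax weights pi_j proportional to exp(z_j / eps), computed stably."""
    m = max(z)
    e = [math.exp((zj - m) / eps) for zj in z]
    total = sum(e)
    return [ej / total for ej in e]


def attribution_entropy(z, eps):
    """Shannon entropy -sum_j pi_j log pi_j of the softmax weights at temperature eps."""
    return -sum(p * math.log(p) for p in softmax(z, eps) if p > 0.0)


z = [0.0, -0.3, -1.0, -2.5]  # illustrative logits at one input x
eps_values = (0.05, 0.5, 5.0, 500.0)
entropies = [attribution_entropy(z, eps) for eps in eps_values]

for eps, H in zip(eps_values, entropies):
    print(f"eps={eps:<6} entropy={H:.4f}  (log N = {math.log(len(z)):.4f})")
```

At small $\varepsilon$ nearly all attribution mass sits on one neuron, while at large $\varepsilon$ the weights become almost uniform; the merging of attribution basins in the theorem takes place between these two regimes.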

## 9 Numerical Experiments

The core identity (Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)) and two quantitative consequences (Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1), Proposition [8.8](https://arxiv.org/html/2605.28983#S8.Thmtheorem8)) are verified numerically; full details and code are in Appendix [G](https://arxiv.org/html/2605.28983#A7).

![[Uncaptioned image]](https://arxiv.org/html/2605.28983v1/x1.png)

Figure 1: Scaling law $\mathcal{L}(N)\propto N^{-\alpha}$ at optimal $\varepsilon^{*}$; $g(y)=\tfrac{1}{2}|y|^{2}$. Fitted $\hat{\alpha}>1/d_{\mathrm{eff}}$ confirms smooth $g$ exceeds the Lipschitz rate.

#### Identity verification.

The identity $\mathrm{LSE}_{\varepsilon}(Wx+b)=|x|^{2}/(4t)-u_{\varepsilon}(x,t)$ holds to machine precision ($\sim 10^{-16}$) for five values of $\varepsilon$ across 500 evaluation points (Table [2](https://arxiv.org/html/2605.28983#A7.T2), Appendix [G](https://arxiv.org/html/2605.28983#A7)). The transformer attention identity ([7](https://arxiv.org/html/2605.28983#S4.E7)) holds to *exactly* $0$ error across 500 random trials for $d\in\{4,8,16,32,64\}$ (Table [3](https://arxiv.org/html/2605.28983#A7.T3), Appendix [G](https://arxiv.org/html/2605.28983#A7)).

#### Quadrature convergence\.

Figure [5](https://arxiv.org/html/2605.28983#A7.F5) (Appendix [G](https://arxiv.org/html/2605.28983#A7)) shows $\ell^{\infty}$ error for $g(y)=|y|$ (Lipschitz): all curves decay as $O(N^{-1/d})$, confirming Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1) across $d\in\{1,2,4\}$.

#### Scaling law\.

Figure [1](https://arxiv.org/html/2605.28983#S9.F1) verifies Proposition [8.8](https://arxiv.org/html/2605.28983#S8.Thmtheorem8): at optimal $\varepsilon^{*}=N^{-1/d_{\mathrm{eff}}}$ the RMSE scales as $N^{-\alpha}$ with $\alpha=1/d_{\mathrm{eff}}$; fitted $\hat{\alpha}>1/d_{\mathrm{eff}}$ confirms the smooth-$g$ rate exceeds the minimax Lipschitz bound. Figure [6](https://arxiv.org/html/2605.28983#A7.F6) (Appendix [G](https://arxiv.org/html/2605.28983#A7)) replicates these exponents on Adam-trained networks.

#### Adversarial robustness\.

Figure [7](https://arxiv.org/html/2605.28983#A7.F7) (Appendix [G](https://arxiv.org/html/2605.28983#A7)) verifies Corollary [8.2](https://arxiv.org/html/2605.28983#S8.Thmtheorem2) on closed-form and Adam-trained networks; Figure [8](https://arxiv.org/html/2605.28983#A7.F8) extends to real data (MNIST, CIFAR-10) at all $\varepsilon\in\{0.1,0.3,1.0,3.0,10.0\}$.

## 10 Conclusion

Every LSE layer is exactly a Hamilton–Jacobi PDE solver (Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)), and the single parameter $\varepsilon$ closes a commutative diagram (Theorem [7.1](https://arxiv.org/html/2605.28983#S7.Thmtheorem1)) that subsumes nearest-neighbor retrieval, convex regularization, and entropic optimal transport as limiting cases of one structure. Generalization, robustness, hallucination geometry, scaling, and Hamiltonian extensions (Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1), Corollary [8.2](https://arxiv.org/html/2605.28983#S8.Thmtheorem2), Proposition [8.5](https://arxiv.org/html/2605.28983#S8.Thmtheorem5), Proposition [8.8](https://arxiv.org/html/2605.28983#S8.Thmtheorem8), Theorem [4.4](https://arxiv.org/html/2605.28983#S4.Thmtheorem4)) all follow from this single correspondence. The question posed at the outset (what equation does a neural network solve?) has an exact answer; the results above are its first consequences.

## References

- Neural Networks are Decision Trees. arXiv:2210.05189v3 [cs.LG].
- R. Balestriero and R. G. Baraniuk (2018) A Spline Theory of Deep Learning. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 374–383.
- R. Balestriero and R. G. Baraniuk (2021) Mad Max: Affine Spline Insights into Deep Learning. Proceedings of the IEEE 109(5), pp. 704–727. [doi:10.1109/JPROC.2020.3042100](https://dx.doi.org/10.1109/JPROC.2020.3042100).
- H. H. Bauschke and P. L. Combettes (2017) Convex analysis and monotone operator theory in Hilbert spaces. 2nd edition, Springer.
- E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell (2021) On the dangers of stochastic parrots: can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623.
- M. Benning, M. Burger, G. Gilboa, and M. Moeller (2019) Modern regularization methods for inverse problems. Acta Numerica 27, pp. 1–111.
- S. Boyd and L. Vandenberghe (2004) Convex optimization. Cambridge University Press, Cambridge, UK.
- V. Charisopoulos and P. Maragos (2017) Morphological perceptrons: geometry and training algorithms. In International Symposium on Mathematical Morphology, pp. 3–15.
- P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina (2017) Entropy-SGD: biasing gradient descent into wide valleys. In International Conference on Learning Representations.
- P. Chaudhari, A. Oberman, S. Osher, S. Soatto, and G. Carlier (2018) Deep relaxation: partial differential equations for optimizing deep neural networks. Research in the Mathematical Sciences 5, pp. 30. [doi:10.1007/s40687-018-0148-y](https://dx.doi.org/10.1007/s40687-018-0148-y).
- R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud (2018) Neural Ordinary Differential Equations. In Advances in Neural Information Processing Systems, Vol. 31, pp. 6572–6583. [doi:10.5555/3327757.3327764](https://dx.doi.org/10.5555/3327757.3327764).
- M. G. Crandall and P. Lions (1983) Viscosity Solutions of Hamilton–Jacobi Equations. Transactions of the American Mathematical Society 277(1), pp. 1–42. [doi:10.1090/S0002-9947-1983-0690039-8](https://dx.doi.org/10.1090/S0002-9947-1983-0690039-8).
- M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport distances. In Advances in Neural Information Processing Systems, Vol. 26.
- E. Dent and J. Tanner (2026) How Controlling the Variance can Improve Training Stability of Sparsely Activated DNNs and CNNs. arXiv preprint arXiv:2602.05779v1 [cs.LG].
- D. L. Donoho and J. Tanner (2005) Sparse Nonnegative Solution of Underdetermined Linear Equations by Linear Programming. Proceedings of the National Academy of Sciences 102(27), pp. 9446–9451. [doi:10.1073/pnas.0502269102](https://dx.doi.org/10.1073/pnas.0502269102).
- E. Dyer and G. Gur-Ari (2020) Asymptotics of Wide Networks from Feynman Diagrams. In International Conference on Learning Representations.
- W. E (2017) A Proposal on Machine Learning via Dynamical Systems. Communications in Mathematics and Statistics 5(1), pp. 1–11. [doi:10.1007/s40304-017-0103-z](https://dx.doi.org/10.1007/s40304-017-0103-z).
- M. Emami, M. Sahraee-Ardakan, P. Pandit, S. Rangan, and A. K. Fletcher (2020) Generalization Error of Generalized Linear Models in High Dimensions. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 119, pp. 2892–2901.
- L. C. Evans (2010) Partial Differential Equations. 2nd edition, Graduate Studies in Mathematics, Vol. 19, American Mathematical Society.
- R. P. Feynman (1982) Simulating Physics with Computers. International Journal of Theoretical Physics 21(6–7), pp. 467–488. [doi:10.1007/BF02650179](https://dx.doi.org/10.1007/BF02650179).
- W. H. Fleming and H. M. Soner (2006) Controlled Markov Processes and Viscosity Solutions. 2nd edition, Stochastic Modelling and Applied Probability, Vol. 25, Springer.
- J. Han, A. Jentzen, and W. E (2018) Solving High-Dimensional Partial Differential Equations Using Deep Learning. Proceedings of the National Academy of Sciences 115(34), pp. 8505–8510. [doi:10.1073/pnas.1718942115](https://dx.doi.org/10.1073/pnas.1718942115).
- K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. [doi:10.1109/CVPR.2016.90](https://dx.doi.org/10.1109/CVPR.2016.90).
- T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray, C. Hallacy, B. Mann, A. Radford, A. Ramesh, N. Ryder, D. M. Ziegler, J. Schulman, D. Amodei, and S. McCandlish (2020) Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701.
- J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. Rae, and L. Sifre (2022) An empirical analysis of compute-optimal large language model training. In Advances in Neural Information Processing Systems, Vol. 35.
- E. Hopf (1950) The Partial Differential Equation $u_{t}+uu_{x}=\mu u_{xx}$. Communications on Pure and Applied Mathematics 3(3), pp. 201–230. [doi:10.1002/cpa.3160030302](https://dx.doi.org/10.1002/cpa.3160030302).
- A. Jacot, F. Gabriel, and C. Hongler (2018) Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Advances in Neural Information Processing Systems, Vol. 31, pp. 8580–8589.
- M. Kac (1949) On Distributions of Certain Wiener Functionals. Transactions of the American Mathematical Society 65(1), pp. 1–13. [doi:10.1090/S0002-9947-1949-0027960-X](https://dx.doi.org/10.1090/S0002-9947-1949-0027960-X).
- J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020) Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361 [cs.LG].
- J. Kim and T. Suzuki (2024) Transformers learn nonlinear features in context: nonconvex mean-field dynamics on the attention landscape. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research. arXiv:2402.01258.
- P. E. Kloeden and E. Platen (1992) Numerical Solution of Stochastic Differential Equations. Applications of Mathematics, Vol. 23, Springer. [doi:10.1007/978-3-662-12616-5](https://dx.doi.org/10.1007/978-3-662-12616-5).
- P. W. Koh and P. Liang (2017) Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, pp. 1885–1894.
- A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto. [Link](https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf).
- P. D. Lax (1957) Hyperbolic Systems of Conservation Laws II. Communications on Pure and Applied Mathematics 10(4), pp. 537–566. [doi:10.1002/cpa.3160100406](https://dx.doi.org/10.1002/cpa.3160100406).
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), pp. 2278–2324. [doi:10.1109/5.726791](https://dx.doi.org/10.1109/5.726791).
- Y. LeCun (2022) A Path Towards Autonomous Machine Intelligence. [Link](https://openreview.net/pdf?id=BZ5a1r-kVsf).
- J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein (2018) Deep Neural Networks as Gaussian Processes. In International Conference on Learning Representations. arXiv:1711.00165.
- Q. Li, L. Chen, C. Tai, and W. E (2018) Maximum principle based algorithms for deep learning. Journal of Machine Learning Research 18(165), pp. 1–29.
- P. Lions (1988) Viscosity Solutions of Fully Nonlinear Second-Order Equations and Optimal Stochastic Control in Infinite Dimensions. Part I: The Case of Bounded Stochastic Evolutions. Acta Mathematica 161, pp. 243–278. [doi:10.1007/BF02392299](https://dx.doi.org/10.1007/BF02392299).
- Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2023) Flow matching for generative modeling. In International Conference on Learning Representations.
- G. L. Litvinov (2007) The Maslov Dequantization, Idempotent and Tropical Mathematics: A Brief Introduction. Journal of Mathematical Sciences 140(3), pp. 426–444. arXiv:math/0507014v1 [math.GM]. [doi:10.1007/s10958-007-0450-5](https://dx.doi.org/10.1007/s10958-007-0450-5).
- X. Liu, C. Gong, and Q. Liu (2023) Flow straight and fast: learning to generate and transfer data with rectified flow. In International Conference on Learning Representations.
- Y\. Lu, A\. Zhong, Q\. Li, and B\. Dong \(2018\)Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations\.InProceedings of the 35th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.80,pp\. 3276–3285\.Cited by:[§5](https://arxiv.org/html/2605.28983#S5.SS0.SSS0.Px2.p1.2)\.
- P\. Maragos, V\. Charisopoulos, and E\. Theodosis \(2021\)Tropical geometry and machine learning\.Proceedings of the IEEE109\(5\),pp\. 728–755\.External Links:[Document](https://dx.doi.org/10.1109/JPROC.2021.3065238)Cited by:[Appendix B](https://arxiv.org/html/2605.28983#A2.SS0.SSS0.Px13.p1.1)\.
- C\. H\. Martin and C\. Hinrichs \(2025\)SETOL: a semi\-empirical theory of \(deep\) learning\.arXiv preprint arXiv:2507\.17912\.Cited by:[Appendix B](https://arxiv.org/html/2605.28983#A2.SS0.SSS0.Px7.p1.5)\.
- C\. H\. Martin and M\. W\. Mahoney \(2021\)Implicit Self\-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning\.Journal of Machine Learning Research22\(165\),pp\. 1–73\.External Links:[Link](https://jmlr.org/papers/v22/20-410.html)Cited by:[Appendix B](https://arxiv.org/html/2605.28983#A2.SS0.SSS0.Px7.p1.5)\.
- S\. Mei, A\. Montanari, and P\. Nguyen \(2018\)A Mean Field View of the Landscape of Two\-Layer Neural Networks\.Proceedings of the National Academy of Sciences115\(33\),pp\. E7665–E7671\.External Links:[Document](https://dx.doi.org/10.1073/pnas.1806579115)Cited by:[Appendix L](https://arxiv.org/html/2605.28983#A12.SSx1.SSS0.Px1.p1.4),[Appendix D](https://arxiv.org/html/2605.28983#A4.SSx2.1.p1.6),[Appendix D](https://arxiv.org/html/2605.28983#A4.SSx2.p1.1)\.
- G\. F\. Montufar, R\. Pascanu, K\. Cho, and Y\. Bengio \(2014\)On the Number of Linear Regions of Deep Neural Networks\.InAdvances in Neural Information Processing Systems,Vol\.27,pp\. 2924–2932\.External Links:[Document](https://dx.doi.org/10.5555/2969033.2969153)Cited by:[Appendix B](https://arxiv.org/html/2605.28983#A2.SS0.SSS0.Px2.p1.3)\.
- G\. Peyré and M\. Cuturi \(2019\)Computational optimal transport\.Foundations and Trends in Machine Learning, Vol\.11,Now Publishers\.Cited by:[Appendix B](https://arxiv.org/html/2605.28983#A2.SS0.SSS0.Px11.p1.1)\.
- C\. Plate, M\. Hahn, A\. Klimek, C\. Ganzer, K\. Sundmacher, and S\. Sager \(2026\)An Analysis of Optimization Problems Involving ReLU Neural Networks\.Optimization and Engineering\.External Links:[Document](https://dx.doi.org/10.1007/s11081-026-10075-8)Cited by:[Remark 3\.2](https://arxiv.org/html/2605.28983#S3.Thmtheorem2.p1.7.7)\.
- H\. Ramsauer, B\. Schäfl, J\. Lehner, P\. Seidl, M\. Widrich, T\. Adler, L\. Gruber, M\. Holzleitner, M\. Pavlović, G\. K\. Sandve,et al\.\(2021\)Hopfield networks is all you need\.InInternational Conference on Learning Representations,Cited by:[Appendix B](https://arxiv.org/html/2605.28983#A2.SS0.SSS0.Px12.p1.1)\.
- R\. T\. Rockafellar \(1970\)Convex analysis\.Princeton University Press\.Cited by:[Appendix B](https://arxiv.org/html/2605.28983#A2.SS0.SSS0.Px10.p1.5),[Remark 7\.2](https://arxiv.org/html/2605.28983#S7.Thmtheorem2.p1.5.5)\.
- X\. Shen, D\. Li, R\. Leng, Z\. Qin, W\. Sun, and Y\. Zhong \(2024\)Scaling laws for linear complexity language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 15988–16002\.Note:arXiv:2406\.16690Cited by:[Appendix B](https://arxiv.org/html/2605.28983#A2.SS0.SSS0.Px15.p1.1)\.
- Y\. Song, J\. Sohl\-Dickstein, D\. P\. Kingma, A\. Kumar, S\. Ermon, and B\. Poole \(2021\)Score\-based generative modeling through stochastic differential equations\.InInternational Conference on Learning Representations,Cited by:[Appendix B](https://arxiv.org/html/2605.28983#A2.SS0.SSS0.Px8.p1.9),[§5](https://arxiv.org/html/2605.28983#S5.SS0.SSS0.Px2.p3.1)\.
- T\. Tokihiro, D\. Takahashi, J\. Matsukidaira, and J\. Satsuma \(1996\)From Soliton Equations to Integrable Cellular Automata through a Limiting Procedure\.Physical Review Letters76\(18\),pp\. 3247–3250\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevLett.76.3247)Cited by:[Appendix M](https://arxiv.org/html/2605.28983#A13.3.p3.2),[Appendix B](https://arxiv.org/html/2605.28983#A2.SS0.SSS0.Px4.p1.1),[Appendix H](https://arxiv.org/html/2605.28983#A8.SS0.SSS0.Px4.p1.2),[§1](https://arxiv.org/html/2605.28983#S1.p2.6),[§2](https://arxiv.org/html/2605.28983#S2.SS0.SSS0.Px4.p1.3)\.
- A\. Tong, T\. Nguyen\-Tang, D\. Lee, D\. Nguyen, T\. Tran, D\. Hall, C\. Kang, and J\. Choi \(2025\)Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine\-Tuning\.InInternational Conference on Learning Representations,Note:arXiv:2503\.01329v2 \[cs\.LG\]Cited by:[Appendix C](https://arxiv.org/html/2605.28983#A3.SSx2.SSS0.Px5.p1.6),[Appendix E](https://arxiv.org/html/2605.28983#A5.SSx12.p2.6),[item \(b\)](https://arxiv.org/html/2605.28983#S5.I1.i2.p1.5),[Proposition 5\.2](https://arxiv.org/html/2605.28983#S5.Thmtheorem2.p1.11.11)\.
- S\. R\. S\. Varadhan \(1984\)Large Deviations and Applications\.CBMS\-NSF Regional Conference Series in Applied Mathematics, Vol\.46,Society for Industrial and Applied Mathematics\.External Links:[Document](https://dx.doi.org/10.1137/1.9781611970241)Cited by:[Appendix E](https://arxiv.org/html/2605.28983#A5.SSx2.p2.7),[Appendix E](https://arxiv.org/html/2605.28983#A5.SSx3.p6.35),[Remark H\.7](https://arxiv.org/html/2605.28983#A8.Thmtheorem7.p1.11.9)\.
- L\. Zhang, G\. Naitzat, and L\. Lim \(2018\)Tropical geometry of deep neural networks\.InProceedings of the 35th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.80,pp\. 5824–5832\.Cited by:[Appendix B](https://arxiv.org/html/2605.28983#A2.SS0.SSS0.Px13.p1.1)\.

## Appendix A The Physical Analogue: Neural Networks as Path Integrals

The Hamilton–Jacobi equation appearing throughout this paper is not merely a convenient analytical device: it is the Hamilton–Jacobi equation of classical mechanics, $\partial_t S + H(\nabla S) = 0$, where $S$ is Hamilton’s principal function (the action), $p = \nabla S$ is the momentum, and $H$ is the Hamiltonian. The characteristics of this PDE are Hamilton’s equations of motion, which reduce to Newton’s equations for a mechanical Hamiltonian. The correspondence between LSE networks and HJ PDEs therefore places neural networks in direct contact with the mathematical structure of physics. Table [1](https://arxiv.org/html/2605.28983#A1.T1) makes the dictionary precise.

Table 1: Dictionary between LSE neural networks and classical/quantum mechanics.

The Feynman–Kac formula makes this precise. The quantity $\sum_j \exp\bigl((W_j\cdot x+b_j)/\varepsilon\bigr)$ is a *discrete path integral*: each neuron $j$ is a path, weighted by the Boltzmann factor $e^{-\mathrm{cost}_j/\varepsilon}$ with $\mathrm{cost}_j = -(W_j\cdot x+b_j)$. The network forward pass computes the log-partition function of this ensemble, which is exactly the imaginary-time Schrödinger propagator $\langle x|e^{-\hat{H}t/\hbar}|\cdot\rangle$ evaluated on a discrete measure.

As $\varepsilon\to 0$, the sum concentrates on the neuron minimizing the cost, by Laplace’s method. In [Feynman](https://arxiv.org/html/2605.28983#bib.bib7)’s formulation of quantum mechanics \[[1982](https://arxiv.org/html/2605.28983#bib.bib7)\], the same approximation recovers classical trajectories from the path integral: the phase $e^{iS/\hbar}$ oscillates rapidly away from the classical path, and only the stationary point (the action-minimizing trajectory) survives. The tropical limit $\varepsilon\to 0$ is the neural-network analogue of $\hbar\to 0$: the quantum (soft, probabilistic) computation hardens into the classical (hard, deterministic) max-plus lookup of the Hopf–Lax formula.
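The concentration described above can be illustrated numerically. The following is a minimal sketch, assuming the layer is $\mathrm{LSE}_\varepsilon(z)=\varepsilon\log\sum_j e^{z_j/\varepsilon}$ and using made-up pre-activation values; it checks the elementary sandwich $\max_j z_j \le \mathrm{LSE}_\varepsilon(z) \le \max_j z_j + \varepsilon\log N$, which forces the soft value onto the hard max as $\varepsilon\to 0$.

```python
import math

def lse(scores, eps):
    """eps * log(sum_j exp(s_j / eps)), computed stably by factoring out the max."""
    m = max(scores)
    return m + eps * math.log(sum(math.exp((s - m) / eps) for s in scores))

# Hypothetical pre-activations W_j . x + b_j for N = 4 neurons (illustrative values only).
scores = [0.3, 1.7, -0.5, 1.2]

# Gap between the soft value and the hard (tropical) max at several temperatures.
gaps = {eps: lse(scores, eps) - max(scores) for eps in (1.0, 0.1, 0.01)}
for eps, gap in gaps.items():
    bound = eps * math.log(len(scores))
    print(f"eps={eps:<5} LSE - max = {gap:.6f}   (upper bound eps*log N = {bound:.6f})")
```

The gap shrinks at least linearly in $\varepsilon$, which is the discrete counterpart of the Laplace-method statement in the text.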

The parameter $\varepsilon$ is therefore simultaneously the softmax temperature, the PDE viscosity, the quadrature regularization scale, and the *quantization parameter* interpolating between quantum statistical mechanics ($\varepsilon>0$) and classical mechanics ($\varepsilon=0$). In the LSE/Hopf–Cole setting this is an identity rather than an analogy: the LSE network, the partition function of a Gibbs ensemble, and the imaginary-time Schrödinger propagator are the same mathematical object at different values of $\varepsilon$.

#### Classical simulatability and positive measure\.

Feynman \[[1982](https://arxiv.org/html/2605.28983#bib.bib7)\] argued that classical computers cannot efficiently simulate quantum mechanics. The obstruction is structural: the Wigner quasi-probability distribution, which plays the role of a phase-space density for quantum states, takes negative values. Since no classical stochastic process can assign negative probabilities, a faithful classical simulation of a general quantum system is expected to require exponential resources. The LSE framework operates in the classically tractable regime: the Gibbs measure $\pi_j(x;\varepsilon)\propto\exp\bigl((W_j\cdot x+b_j)/\varepsilon\bigr)$ is a genuine probability distribution for all $\varepsilon>0$. Positivity is what makes the Feynman–Kac representation (Proposition [E.4](https://arxiv.org/html/2605.28983#A5.Thmtheorem4)) classically computable and distinguishes the present framework from quantum computation. The tropical limit $\varepsilon\to 0$ concentrates this measure to a point mass at the argmax neuron: the zero-temperature, maximally classical limit.

#### Cellular automata and ResNets\.

Feynman \[[1982](https://arxiv.org/html/2605.28983#bib.bib7)\] proposed simulating classical physics by a locally connected automaton: a fixed local update rule, iterated over discrete time steps, with computational cost proportional to the space-time volume. The ResNet recurrence $x_{\ell+1}=x_\ell+hF(x_\ell,W_\ell)$ is precisely this structure: the local rule is one HJ characteristic step, iterated $L$ times with step size $h$. Proposition [5.2](https://arxiv.org/html/2605.28983#S5.Thmtheorem2) and Theorem [8.4](https://arxiv.org/html/2605.28983#S8.Thmtheorem4) confirm that this automaton computes the characteristics and co-state equations of a HJ PDE. The discreteness is not an approximation of a continuous limit; it is the structure, as Feynman intended. The layer depth $L$ and step size $h$ play the roles of discrete time and lattice spacing.

#### Universal quantum simulator vs\. universal classical HJ simulator\.

Feynman \[[1982](https://arxiv.org/html/2605.28983#bib.bib7)\] proposed a *universal quantum simulator*: a quantum lattice system that can simulate any local quantum field theory with cost proportional to system size, avoiding the exponential blowup of classical quantum simulation. The LSE network at $\varepsilon>0$ is the positive-measure counterpart: a universal classical simulator for HJ initial-value problems, in the sense of Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1), whose $O(N^{-1/d})$ approximation rate is optimal for Lipschitz data. The two are complementary: the quantum simulator requires quantum hardware because negative Wigner-function values are unavoidable in quantum systems; the classical HJ simulator requires only standard floating-point arithmetic because the Gibbs measure is positive.

## Appendix B Related Work

#### Neural ODEs and continuous\-depth networks\.

E \[[2017](https://arxiv.org/html/2605.28983#bib.bib17)\] proposed viewing ResNets as discretizations of ODEs; Chen et al. \[[2018](https://arxiv.org/html/2605.28983#bib.bib18)\] made this precise with the adjoint method. The ResNet step $x_{\ell+1}=x_\ell+hF(x_\ell,W_\ell)$ also coincides with the Euler–Maruyama scheme for a controlled SDE \[Kloeden and Platen, [1992](https://arxiv.org/html/2605.28983#bib.bib13)\] at zero noise, identifying the residual network simultaneously as an ODE discretization and as the zero-noise limit of a diffusion process. The present work sharpens the connection by identifying the forward ODE as a Hamilton–Jacobi characteristic and the backward pass as its co-state (Theorem [8.4](https://arxiv.org/html/2605.28983#S8.Thmtheorem4)), going beyond the ODE-as-architecture analogy to an exact PDE correspondence.
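The forward-Euler reading of the residual step can be checked on a toy problem. The sketch below is illustrative only: the drift $F(x)=-x$, the shared weights, and the unit time horizon $Lh=1$ are assumptions chosen so that the continuous-depth limit $\dot{x}=F(x)$ has the known solution $x(1)=e^{-1}$.

```python
import math

def resnet_forward(x0, drift, depth, h):
    """Iterate the residual update x_{l+1} = x_l + h * F(x_l) for `depth` layers."""
    x = x0
    for _ in range(depth):
        x = x + h * drift(x)
    return x

def drift(x):
    # Hypothetical layer map F(x) = -x, identical in every layer.
    return -x

depths = (4, 16, 64, 256)
errors = []
for depth in depths:
    h = 1.0 / depth                      # keep total "time" L * h = 1 fixed
    x_L = resnet_forward(1.0, drift, depth, h)
    errors.append(abs(x_L - math.exp(-1.0)))
    print(f"L={depth:<4} x_L={x_L:.6f}  |x_L - e^-1|={errors[-1]:.6f}")
```

The error decays roughly like $1/L$, the first-order behavior expected of an explicit Euler discretization.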

#### Spline and piecewise\-linear theories\.

Balestriero and Baraniuk \[[2018](https://arxiv.org/html/2605.28983#bib.bib21), [2021](https://arxiv.org/html/2605.28983#bib.bib22)\] showed that ReLU networks compute continuous piecewise-affine splines. Montufar et al. \[[2014](https://arxiv.org/html/2605.28983#bib.bib25)\] counted linear regions. Aytekin \[[2022](https://arxiv.org/html/2605.28983#bib.bib23)\] showed ReLU networks are decision trees. These are the $\varepsilon=0$ tropical limit of the present framework: max-plus algebra, piecewise linearity, and Hopf–Lax inf-convolution are the same object. The LP characterization of Balestriero and Baraniuk \[[2021](https://arxiv.org/html/2605.28983#bib.bib22)\] also connects to sparsity: the tropical argmax selects the single neuron maximizing $W_j\cdot x+b_j$, which is a vertex of the LP feasible set; the one-hot weight vector at $\varepsilon=0$ is the sparsest nonnegative point in the softmax simplex, in the sense studied by Donoho and Tanner \[[2005](https://arxiv.org/html/2605.28983#bib.bib26)\].

#### Viscosity solutions and Hamilton–Jacobi PDEs\.

Classical (smooth) solutions of HJ equations develop shocks in finite time when characteristics cross, so a generalized notion of solution is needed. A viscosity solution is the unique stable limit that persists through these crossings: $u$ is a viscosity solution if it satisfies the PDE in a one-sided sense at every point via smooth test functions, and the key theorem is that the $\varepsilon\to 0$ limit of viscous solutions $u_\varepsilon$ always converges to the viscosity solution $u_0$. For ML practitioners the direct analogy is soft-to-hard attention: soft attention ($\varepsilon>0$, softmax) is the viscous regularization, hard attention ($\varepsilon=0$, argmax) is the tropical/viscosity limit, and viscosity theory is what makes this limit unique and rigorous. Crandall and Lions \[[1983](https://arxiv.org/html/2605.28983#bib.bib8)\] established the viscosity solution theory for Hamilton–Jacobi equations; Evans \[[2010](https://arxiv.org/html/2605.28983#bib.bib11)\] provides the textbook treatment. The Hopf–Cole and Hopf–Lax formulas date to Hopf \[[1950](https://arxiv.org/html/2605.28983#bib.bib9)\] and Lax \[[1957](https://arxiv.org/html/2605.28983#bib.bib10)\]; their role as exact solutions (rather than approximations) is key to Claims (i) and (iv). The standard reference connecting stochastic optimal control to HJ viscosity solutions is Fleming and Soner \[[2006](https://arxiv.org/html/2605.28983#bib.bib12)\]; the extension to viscosity solutions on infinite-dimensional Hilbert spaces, relevant to the over-parameterized limit, is Lions \[[1988](https://arxiv.org/html/2605.28983#bib.bib14)\].

#### Tropical and max\-plus mathematics\.

Litvinov \[[2007](https://arxiv.org/html/2605.28983#bib.bib2)\] surveys Maslov dequantization and the tropical semiring. Tokihiro et al. \[[1996](https://arxiv.org/html/2605.28983#bib.bib5)\] introduced ultradiscretization in the context of integrable cellular automata. The connection to neural networks via the $\varepsilon\to 0$ limit is the main contribution of Section [5](https://arxiv.org/html/2605.28983#S5).

#### Scaling laws\.

Kaplan et al. \[[2020](https://arxiv.org/html/2605.28983#bib.bib27)\] empirically fit power-law scaling for large language models. Proposition [8.8](https://arxiv.org/html/2605.28983#S8.Thmtheorem8) derives the exponent from the intrinsic data dimension via quadrature approximation theory, providing a mechanistic explanation for the observed scaling. The parameter $\varepsilon$ governs activation sparsity (uniform as $\varepsilon\to\infty$; one-hot as $\varepsilon\to 0$); the effect of this variance on training stability in sparse networks is studied by Dent and Tanner \[[2026](https://arxiv.org/html/2605.28983#bib.bib30)\].

#### HJ PDEs for optimization landscapes\.

Chaudhari et al. \[[2017](https://arxiv.org/html/2605.28983#bib.bib55)\] introduced Entropy-SGD, which minimizes the Gibbs-smoothed loss $f_\gamma(\theta)=-\beta^{-1}\log\int\exp\bigl(-\beta f(y)-|\theta-y|^2/(2\gamma)\bigr)\,dy$ to bias gradient descent toward flat minima; Chaudhari et al. \[[2018](https://arxiv.org/html/2605.28983#bib.bib56)\] established that this smoothed loss is the viscous HJ solution in *weight space*, with the weight vector $\theta$ as the PDE’s spatial variable and the original loss $f(\theta)$ as initial data. The present framework is a precise *role reversal*: the spatial variable is the *input* $x$, not the weights, and the weights encode the *initial data* $g(y_j)$ of the HJ equation (Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)). [Chaudhari et al.](https://arxiv.org/html/2605.28983#bib.bib56) analyze how the loss landscape deforms under HJ smoothing as weights vary during training; the present paper identifies the network output, evaluated at a fixed weight setting across all inputs $x$, as an exact HJ solution, making inference itself the act of evaluating a PDE rather than approximating a smoother optimization surface. The Moreau–Yosida proximal map of their Lemma 2, $\mathrm{prox}_{tf}(x)=\arg\min_y\{f(y)+|x-y|^2/(2t)\}$, is precisely the mechanism behind the $L^2 t$ bias in Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1): the Hopf–Lax solution cannot approximate $f^*$ to better than $O(t)$ because the proximal smoothing introduces a systematic upward shift of that magnitude.

#### Statistical mechanics and spectral theories of learning\.

Martin and Mahoney \[[2021](https://arxiv.org/html/2605.28983#bib.bib39)\] established Heavy-Tailed Self-Regularization (HTSR): power-law exponents of layer weight matrix spectra predict generalization without access to training or test data, a finding formalized in Martin and Hinrichs \[[2025](https://arxiv.org/html/2605.28983#bib.bib40)\] (SETOL) using random matrix theory and statistical mechanics. The present framework provides an exact theoretical substrate for this spectral perspective: the viscosity $\varepsilon$ governs the spectral scale of $W$ via the Hessian bound $\|\nabla_x^2 f_\varepsilon\|_2\le\|W\|_{2,\infty}^2/\varepsilon$ (Corollary [8.2](https://arxiv.org/html/2605.28983#S8.Thmtheorem2)), and the optimal $\varepsilon^*\asymp N^{-1/d}$ at which the generalization rate $O(N^{-1/d})$ is attained (Proposition [8.8](https://arxiv.org/html/2605.28983#S8.Thmtheorem8)) provides a principled, derivable analogue of the empirical temperature parameter in SETOL. Where SETOL is semi-empirical, fitting spectral observations to RMT predictions, the HJ correspondence grounds the same weight-spectrum-to-generalization link in a rigorous PDE identity.

#### Flow matching and score\-based diffusion\.

Flow matching \[Lipman et al., [2023](https://arxiv.org/html/2605.28983#bib.bib36), Liu et al., [2023](https://arxiv.org/html/2605.28983#bib.bib37)\] learns a time-dependent velocity field $v(x,t)$ by minimizing $\mathbb{E}\bigl[\|v(X_t,t)-\dot{X}_t\|^2\bigr]$ along interpolant paths; the learned field is then integrated as $\dot{X}=v(X,t)$. This is the characteristic ODE of the HJ PDE with Hamiltonian $H(x,p)=p\cdot v(x,t)$, the same structure as Proposition [5.2](https://arxiv.org/html/2605.28983#S5.Thmtheorem2) with the drift $F(x,\theta)=v(x,t)$ identified as the velocity field. Score-based diffusion \[Song et al., [2021](https://arxiv.org/html/2605.28983#bib.bib38)\] trains a network to approximate $\nabla_x\log p_t(x)$. Under the Hopf–Cole substitution, $p_t(x)=e^{-u_\varepsilon(x,t)/\varepsilon}$ (up to normalization), so $\nabla_x\log p_t=-\nabla_x u_\varepsilon/\varepsilon$: the score is the negative normalized spatial gradient of the Hopf–Cole solution, and denoising, which follows the score, is gradient descent on $u_\varepsilon$ (equivalently, gradient ascent on $\log p_t$). Both architectures therefore fit within the HJ characteristic framework; the distinction from the feedforward LSE case is that the Hamiltonian is input-dependent and time-varying rather than fixed at $H(p)=|p|^2$.
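The score identity $\nabla_x\log p_t=-\nabla_x u_\varepsilon/\varepsilon$ can be sanity-checked in one dimension. This is a minimal sketch under stated assumptions: the quartic potential below is a hypothetical stand-in for $u_\varepsilon(\cdot,t)$ at a fixed time, not a solution taken from the paper, and the score is estimated by a central finite difference of the unnormalized log-density (the normalizing constant cancels).

```python
eps = 0.5                                  # hypothetical viscosity / temperature

def u(x):
    # Hypothetical double-well potential standing in for u_eps(x, t) at fixed t.
    return 0.25 * x**4 - x**2

def du(x):
    # Exact derivative of the potential above.
    return x**3 - 2.0 * x

def log_p_unnormalized(x):
    # Hopf-Cole substitution: p = exp(-u / eps) up to a constant factor.
    return -u(x) / eps

def score_fd(x, h=1e-5):
    """Central finite-difference estimate of d/dx log p(x)."""
    return (log_p_unnormalized(x + h) - log_p_unnormalized(x - h)) / (2.0 * h)

points = (-1.5, 0.2, 1.0)
for x in points:
    print(f"x={x:+.1f}  finite-difference score={score_fd(x):+.6f}  -u'(x)/eps={-du(x) / eps:+.6f}")
```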

#### Infinite\-width limits, kernel methods, and beyond\.

At infinite width, gradient descent is governed by the Neural Tangent Kernel \[Jacot et al., [2018](https://arxiv.org/html/2605.28983#bib.bib28)\] and the network converges to a Gaussian process \[Lee et al., [2018](https://arxiv.org/html/2605.28983#bib.bib16)\]. The Feynman–Kac representation (Proposition [E.4](https://arxiv.org/html/2605.28983#A5.Thmtheorem4)) provides a complementary PDE characterization of this limit: the Gibbs average over support points converges to the Hopf–Cole solution of the heat equation, with the matched-scale condition $q^*=2\varepsilon t$ identifying the unique Gaussian prior consistent with the kernel bandwidth. For LSE networks, the NTK admits the closed form $K_{ab}=\varepsilon^2\langle\pi(x_a),\pi(x_b)\rangle$ (Proposition [D.3](https://arxiv.org/html/2605.28983#A4.Thmtheorem3)): the kernel is the inner product of Gibbs heat-kernel weights, and is positive definite almost surely for $N\ge n$ generic support points.

The HJ framework extends beyond three acknowledged limitations of the NTK perspective. First, NTK theory, operating in the limit of fixed data with width $m\to\infty$, provides bounds in terms of sample size $n$ but does not address how loss scales with model size $N$; the Feynman–Kac quadrature (Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1)) yields a finite-$N$ bound with rate $O(N^{-1/d_{\mathrm{eff}}})$ under the manifold hypothesis (Assumption [8.7](https://arxiv.org/html/2605.28983#S8.Thmtheorem7)), and Proposition [8.8](https://arxiv.org/html/2605.28983#S8.Thmtheorem8) recovers the empirical exponent $\alpha=1/d_{\mathrm{eff}}$ from PDE structure without assuming the kernel regime. Second, in the general training regime where support points $y_j$ are also optimized, the Gibbs weights $\pi_j(x;\theta)$ are adaptive: the Hopf–Lax minimizer $j^*=\arg\min_j\{g(y_j)+|x-y_j|^2/(4t)\}$ performs energy-weighted nearest-neighbor selection at inference, a feature-learning structure that the kernel Gram matrix does not expose. Third, while the NTK depends on architecture, that dependence is implicit in the kernel entries; the HJ framework makes it explicit through the Hamiltonian $H$ and viscosity $\varepsilon$: feedforward networks have quadratic $H$, ResNets have general $H$ encoding the drift, and Transformers have attention as a vector-valued Hopf–Cole transform at viscosity $\varepsilon=1/\sqrt{d}$. For general wide networks, Dyer and Gur-Ari \[[2020](https://arxiv.org/html/2605.28983#bib.bib29)\] derive $O(n^{-1})$ finite-width corrections to the NTK during gradient flow via Feynman-diagram (Gaussian integral) methods; in the LSE case, the closed form $K_{ab}=\varepsilon^2\langle\pi,\pi\rangle$ provides an exact PDE interpretation of the kernel that those corrections refine.

#### Moreau–Yosida envelope and convex analysis\.

The Hopf–Lax formula $u_0(x,t)=\inf_y\{g(y)+|x-y|^2/(4t)\}$ is precisely the *Moreau envelope* of $g$, and its minimizer is the proximal map; in the standard convention $\inf_y\{g(y)+|x-y|^2/(2\lambda)\}$ the smoothing parameter is $\lambda=2t$ \[Rockafellar, [1970](https://arxiv.org/html/2605.28983#bib.bib46), Bauschke and Combettes, [2017](https://arxiv.org/html/2605.28983#bib.bib47)\]. The Hopf–Cole substitution is the unique linearization of this envelope. The present framework recovers and extends the Moreau–Yosida regularization perspective of convex analysis: where Moreau–Yosida views entropic-proximal smoothing as a tool for optimization, the HJ correspondence identifies it as the architectural primitive of LSE networks. The two perspectives agree at $\varepsilon=0$ on the Hopf–Lax formula; at $\varepsilon>0$ the HJ framework lifts this to a family of PDE solutions, recovering Moreau–Yosida as the $\varepsilon\to 0$ limit.
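A small numerical example may make the envelope concrete. The sketch below takes the illustrative choice $g(y)=|y|$ (Lipschitz constant $L=1$), for which the infimum with cost $|x-y|^2/(4t)$ has a Huber-type closed form, and compares it with a brute-force minimum over a grid. The grid, the value of $t$, and the test points are assumptions made for the demonstration.

```python
def hopf_lax(g, x, t, grid):
    """Brute-force inf over candidate points y of g(y) + |x - y|^2 / (4t)."""
    return min(g(y) + (x - y) ** 2 / (4.0 * t) for y in grid)

def huber(x, t):
    """Closed-form envelope of g(y) = |y|: quadratic near 0, |x| - t in the tails."""
    return x * x / (4.0 * t) if abs(x) <= 2.0 * t else abs(x) - t

t = 0.25
grid = [i / 1000.0 for i in range(-4000, 4001)]   # y in [-4, 4] with spacing 1e-3
xs = [-2.0, -0.3, 0.0, 0.4, 1.5]

rows = []
for x in xs:
    u0 = hopf_lax(abs, x, t, grid)
    rows.append((x, u0, huber(x, t), abs(x) - u0))  # last entry: smoothing bias g(x) - u0(x, t)
    print(f"x={x:+.1f}  grid inf={u0:.6f}  closed form={huber(x, t):.6f}  g - u0={abs(x) - u0:.6f}")
```

In this example the bias $g(x)-u_0(x,t)$ never exceeds $L^2 t = t$, and it saturates at exactly $t$ once $|x|\ge 2t$.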

#### Optimal transport and entropic regularization\.

Entropic-regularized optimal transport \[Peyré and Cuturi, [2019](https://arxiv.org/html/2605.28983#bib.bib48), Cuturi, [2013](https://arxiv.org/html/2605.28983#bib.bib49)\] shares the LSE/Gibbs-measure structure of the present framework: the Sinkhorn algorithm iterates log-sum-exp operations, and the entropic-OT cost is a discrete Hopf–Cole log-partition function. Entropic OT and the LSE network are thus the same object seen from two perspectives: one optimizes over couplings, the other over initial data of a HJ equation. The soft-min at viscosity $\varepsilon$ in the HJ framework is the same Gibbs average that defines the entropic-OT transport plan.
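The log-sum-exp structure of Sinkhorn is easiest to see in its log-domain form. The following is a minimal sketch of the standard log-domain Sinkhorn iteration (not code from the paper), with a hypothetical $3\times 3$ squared-distance cost and made-up marginals; each dual update is a soft-min, i.e. a negated $\mathrm{LSE}_\varepsilon$.

```python
import math

def lse(values, eps):
    """eps * log(sum exp(v / eps)), stabilized by factoring out the max."""
    m = max(values)
    return m + eps * math.log(sum(math.exp((v - m) / eps) for v in values))

def sinkhorn_log(cost, a, b, eps, iters=300):
    """Alternate soft-min updates of dual potentials f, g; return the transport plan."""
    n, m = len(a), len(b)
    f, g = [0.0] * n, [0.0] * m
    for _ in range(iters):
        # f_i = -LSE_eps over j of (g_j - C_ij + eps * log b_j): enforces row marginals.
        f = [-lse([g[j] - cost[i][j] + eps * math.log(b[j]) for j in range(m)], eps)
             for i in range(n)]
        # g_j = -LSE_eps over i of (f_i - C_ij + eps * log a_i): enforces column marginals.
        g = [-lse([f[i] - cost[i][j] + eps * math.log(a[i]) for i in range(n)], eps)
             for j in range(m)]
    # Gibbs-form plan P_ij = a_i * b_j * exp((f_i + g_j - C_ij) / eps).
    return [[a[i] * b[j] * math.exp((f[i] + g[j] - cost[i][j]) / eps) for j in range(m)]
            for i in range(n)]

xs = [0.0, 1.0, 2.0]                                  # hypothetical support points
cost = [[(x - y) ** 2 for y in xs] for x in xs]       # squared-distance cost
a = [0.5, 0.3, 0.2]                                   # source marginal (illustrative)
b = [0.2, 0.3, 0.5]                                   # target marginal (illustrative)

plan = sinkhorn_log(cost, a, b, eps=1.0)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(3)]
print("row sums:", row_sums)
print("col sums:", col_sums)
```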

#### Modern Hopfield networks\.

Ramsauer et al. \[[2021](https://arxiv.org/html/2605.28983#bib.bib50)\] connect attention to modern Hopfield networks via the LSE energy function, showing that the softmax attention update rule is the retrieval dynamics of a continuous Hopfield network with exponential interactions. Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1) provides the exact PDE interpretation of this observation: the Hopfield energy is the Hopf–Cole log-partition function; retrieval is evaluation of the HJ solution at the query point; stored patterns are the support points $\{y_j\}$ of the initial-data measure.
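The retrieval step can be sketched in a few lines. This is an illustrative toy, assuming three orthonormal stored patterns and a single softmax update with temperature $\varepsilon$ (inverse temperature $\beta=1/\varepsilon$); the patterns, query, and $\varepsilon$ are made-up values.

```python
import math

def softmax(scores, eps):
    """Gibbs weights proportional to exp(score / eps), computed stably."""
    m = max(scores)
    w = [math.exp((s - m) / eps) for s in scores]
    total = sum(w)
    return [v / total for v in w]

def hopfield_retrieve(query, patterns, eps):
    """One update: Gibbs-weighted average of stored patterns, weights softmax(<query, y_j> / eps)."""
    scores = [sum(q * p for q, p in zip(query, y)) for y in patterns]
    weights = softmax(scores, eps)
    retrieved = [sum(w * y[k] for w, y in zip(weights, patterns)) for k in range(len(query))]
    return retrieved, weights

patterns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical stored patterns y_j
query = [0.9, 0.2, 0.1]                                          # noisy version of the first pattern

retrieved, weights = hopfield_retrieve(query, patterns, eps=0.1)
distance = math.sqrt(sum((r - p) ** 2 for r, p in zip(retrieved, patterns[0])))
print("weights:", weights)
print("retrieved:", retrieved, "distance to first pattern:", distance)
```

At low temperature the Gibbs weights are nearly one-hot, so a single update already lands close to the stored pattern nearest the query.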

#### Tropical and morphological neural networks\.

Zhang et al. \[[2018](https://arxiv.org/html/2605.28983#bib.bib51)\] study the tropical geometry of deep ReLU networks, showing that the decision boundaries are tropical hypersurfaces. Charisopoulos and Maragos \[[2017](https://arxiv.org/html/2605.28983#bib.bib52)\] and Maragos et al. \[[2021](https://arxiv.org/html/2605.28983#bib.bib53)\] connect morphological networks to tropical algebra. The present framework positions these results within the HJ picture: the $\varepsilon=0$ tropical limit is where ReLU and max-pooling operations live (Remark [3.2](https://arxiv.org/html/2605.28983#S3.Thmtheorem2)), and the tropical geometry arises from the Hopf–Lax inf-convolution formula ([9](https://arxiv.org/html/2605.28983#S5.E9)).

#### Optimal control and the Pontryagin Maximum Principle\.

The connection between deep learning and optimal control via the PMP has substantial prior history. E \[[2017](https://arxiv.org/html/2605.28983#bib.bib17)\] proposed mean-field optimal control for deep learning; Li et al. \[[2018](https://arxiv.org/html/2605.28983#bib.bib57)\] made the PMP connection precise for ResNets; Benning et al. \[[2019](https://arxiv.org/html/2605.28983#bib.bib54)\] developed a general optimal control view of deep learning. Theorem [8.4](https://arxiv.org/html/2605.28983#S8.Thmtheorem4) sharpens these connections by identifying the *forward* equation as a HJ characteristic ODE, making the joint forward-backward system an exact Hamiltonian flow rather than an optimal-control analogy.

#### Cross\-architecture scaling consistency\.

Shen et al. \[[2024](https://arxiv.org/html/2605.28983#bib.bib58)\] observe that scaling law exponents are consistent across linear-complexity architectures and standard Transformers when controlling for data and compute. This is consistent with Proposition [8.8](https://arxiv.org/html/2605.28983#S8.Thmtheorem8): the exponent $\alpha=1/d_{\mathrm{eff}}$ is a property of the data’s intrinsic dimension, not the architecture. The HJ framework predicts that architectures sharing the same data distribution should exhibit the same scaling exponents, providing a theoretical grounding for the empirical cross-architecture consistency.

## Appendix C Machine Learning Tasks as Initial-Value Problems

### Architecture as PDE, Output as Solution

In both this framework and physics-informed neural networks (PINNs), the network output is a PDE solution: the value $u_\varepsilon(x,t)$ at the input point $x$. The distinction is not in what the output represents but in *where the PDE lives*.

A PINN uses the network as a universal function approximator: the architecture carries no information about the *target* PDE. (A ReLU MLP has its own intrinsic tropical HJ structure, but that equation is not the Navier–Stokes or wave equation being solved.) The target PDE is imposed *externally* through a residual loss $\|\partial_t f_\theta+H(\nabla_x f_\theta)-\varepsilon\Delta f_\theta\|^2$. The network learns to satisfy the PDE through training; the equation is a constraint on learning, not a property of the architecture.

Here, the PDE structure is in the *architecture*. The layer $f_\varepsilon^N=\mathrm{LSE}_\varepsilon(Wx+b)$ encodes the exact Hopf–Cole solution via identity ([5](https://arxiv.org/html/2605.28983#S4.E5)), with Hamiltonian determined by the activation function and initial data $g$ encoded in the weights $(W,b)$. No residual loss is needed: the PDE residual is zero by construction, and Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1) guarantees this before any training begins.
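One way to see how weights can encode initial data is to expand the square in a discrete Hopf–Cole sum. The derivation below is a sketch under an assumption: it uses the kernel $|x-y_j|^2/(4t)$ that matches the Hopf–Lax formula quoted in Appendix B, and the sign and normalization conventions of the paper's identity (5) may differ.

```latex
\begin{aligned}
u_\varepsilon(x,t)
  &= -\varepsilon \log \sum_{j=1}^{N}
     \exp\!\Bigl(-\tfrac{1}{\varepsilon}\Bigl[g(y_j) + \tfrac{|x-y_j|^2}{4t}\Bigr]\Bigr) \\
  &= \tfrac{|x|^2}{4t} - \varepsilon \log \sum_{j=1}^{N}
     \exp\!\Bigl(\tfrac{1}{\varepsilon}\Bigl[\tfrac{y_j}{2t}\cdot x - g(y_j) - \tfrac{|y_j|^2}{4t}\Bigr]\Bigr) \\
  &= \tfrac{|x|^2}{4t} - \mathrm{LSE}_\varepsilon(Wx+b),
  \qquad W_j = \tfrac{y_j}{2t}, \quad b_j = -g(y_j) - \tfrac{|y_j|^2}{4t}.
\end{aligned}
```

Under this assumed form, each row $W_j$ stores a support point $y_j$ and each bias $b_j$ stores the initial value $g(y_j)$, so changing $(W,b)$ changes the initial data while the affine-then-LSE structure stays fixed.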

The consequence for training is that gradient descent searches over the space of initial conditions $\{(y_{j},g(y_{j}))\}_{j=1}^{N}$, that is, over the family of all viscous HJ equations with discrete initial data, selecting the one whose propagated solution best explains the observations.

### Tasks as Initial\-Value Problems

The Hopf–Cole correspondence (Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)) is not specific to any one task. Every standard supervised learning task can be cast as an initial-value problem for an HJ PDE. The architecture determines the PDE class (the Hamiltonian); the task determines how the independent variables and initial data are interpreted. In all cases the network output is a PDE solution value; in all cases training selects the initial conditions.

#### Regression\.

The output $f_{\varepsilon}^{N}(x)=\mathrm{LSE}_{\varepsilon}(Wx+b)$ is the Hopf–Cole solution value $u_{\varepsilon}(x,t)$ at input $x$ (Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)). Training minimizes a loss over the training inputs $x$, selecting the initial data $g$ that makes $u_{\varepsilon}(\cdot,t)$ best approximate the target function.

#### Classification\.

A $K$-class classifier computes logits $f_{k}(x)=W_{k}\cdot x+b_{k}$, one linear HJ characteristic per class. Class probabilities are $\mathrm{softmax}(f(x)/\varepsilon)=\nabla\mathrm{LSE}_{\varepsilon}(f(x))$: the Gibbs measure over class scores at temperature $\varepsilon$. The predicted class $\arg\max_{k}f_{k}(x)$ is the tropical argmax. Decision boundaries are tropical hyperplanes: the piecewise-linear ridges (non-differentiable loci) of $\mathrm{LSE}_{\varepsilon}$ in the tropical limit $\varepsilon\to 0$. As $\varepsilon\to 0$, the softmax hardens to one-hot, boundaries sharpen, and the classifier recovers a decision tree \[Aytekin, [2022](https://arxiv.org/html/2605.28983#bib.bib23)\]. Classification confidence is governed by $\varepsilon$: low viscosity means sharp, committed decisions; high viscosity means soft, uncertain ones.
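As a minimal sketch of the temperature reading above, the following computes $\mathrm{softmax}(f(x)/\varepsilon)$ for decreasing $\varepsilon$. The logits and the grid of $\varepsilon$ values are illustrative assumptions, not taken from the paper, and `softmax_eps` is a hypothetical helper name.

```python
import math

def softmax_eps(logits, eps):
    """Gibbs weights softmax(logits / eps), computed stably."""
    m = max(logits)
    exps = [math.exp((z - m) / eps) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical class scores f_k(x) for a 3-class problem.
logits = [1.0, 2.0, 0.5]

# Large eps: soft, nearly uniform weights. Small eps: nearly one-hot.
for eps in (5.0, 1.0, 0.1, 0.01):
    probs = softmax_eps(logits, eps)
    print(eps, [round(p, 4) for p in probs])
```

At small $\varepsilon$ the weights concentrate on the class with the largest logit, which is the tropical argmax described in the text.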

#### Kernel Machines\.

A kernel estimator with Gaussian kernel $k_{\sigma}(x,y)=\exp(-|x-y|^{2}/(2\sigma^{2}))$ computes

$$f(x)=\sum_{i}\alpha_{i}\exp\!\left(-\frac{|x-x_{i}|^{2}}{2\sigma^{2}}\right).$$

Under the Hopf–Cole substitution $v=e^{-u/\varepsilon}$, this linear sum equals the transformed variable $v(x)=e^{-u_{\varepsilon}(x)/\varepsilon}$ of Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1) at $\varepsilon=\sigma^{2}/2$ and $t=1$, with Gibbs weights $\alpha_{i}=e^{-g(x_{i})/\varepsilon}$. The HJ solution itself is $u_{\varepsilon}(x)=-\varepsilon\log f(x)$: the kernel estimator is the pre-image of the LSE layer under the Hopf–Cole transform, and the two are related by $u_{\varepsilon}=-\varepsilon\log v$. As $\sigma\to 0$ (equivalently $\varepsilon\to 0$), the estimator reduces to nearest-neighbor classification, the tropical argmax limit \[Litvinov, [2007](https://arxiv.org/html/2605.28983#bib.bib2)\].
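The correspondence can be checked numerically. The sketch below, in one dimension with made-up centers, values $g(x_i)$, and bandwidth, compares $-\varepsilon\log f(x)$ from the kernel estimator against the discrete Hopf–Cole expression at $\varepsilon=\sigma^{2}/2$, $t=1$. The function names are hypothetical.

```python
import math

def kernel_estimator(x, centers, alphas, sigma):
    """f(x) = sum_i alpha_i * exp(-|x - x_i|^2 / (2 sigma^2))."""
    return sum(a * math.exp(-(x - c) ** 2 / (2 * sigma ** 2))
               for a, c in zip(alphas, centers))

def hopf_cole_discrete(x, centers, g_vals, eps, t):
    """u = -eps * log sum_i exp(-(g_i + |x - x_i|^2 / (4 t)) / eps)."""
    return -eps * math.log(sum(
        math.exp(-(g + (x - c) ** 2 / (4 * t)) / eps)
        for g, c in zip(g_vals, centers)))

# Illustrative data (assumed, not from the paper).
sigma = 0.8
eps, t = sigma ** 2 / 2, 1.0
centers = [-1.0, 0.0, 1.5]
g_vals = [0.3, 0.1, 0.7]
alphas = [math.exp(-g / eps) for g in g_vals]  # Gibbs weights

x = 0.4
u_from_kernel = -eps * math.log(kernel_estimator(x, centers, alphas, sigma))
u_direct = hopf_cole_discrete(x, centers, g_vals, eps, t)
print(u_from_kernel, u_direct)
```

The two values agree up to floating-point rounding because $4t\varepsilon=2\sigma^{2}$ under the stated choice of $\varepsilon$ and $t$.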

#### Time Series\.

The sequence index $t$ is the PDE time variable. An RNN, LSTM, or SSM integrating a time series solves an ODE $\dot{h}=F(h,x(t))$ along $t$ (Proposition [5.4](https://arxiv.org/html/2605.28983#S5.Thmtheorem4)). The hidden state $h_{t}$ is the PDE state; the prediction $\hat{x}_{t+1}$ is the solution value at time $t+1$. Learning the model is learning the Hamiltonian $H$ whose characteristics match the observed trajectory.

#### Sequences \(Transformers\)\.

Token embedding plays the role of the spatial variable $x$; transformer depth plays the role of PDE time $t$. Each attention layer is one step of the HJ propagator $S_{\Delta t}$ (Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)). The contextualized embedding at layer $L$ is the HJ solution at time $T=L\cdot\Delta t$, with token embeddings as initial data $g$ \[Tong et al., [2025](https://arxiv.org/html/2605.28983#bib.bib31)\].

#### Encoder–Decoder\.

The encoder applies the propagator $S_{t_{1}}$: latent $z=S_{t_{1}}[g](x)$ is the PDE solution at intermediate time $t_{1}$. The decoder applies a second propagator $S_{t_{2}}$: output $\hat{y}=S_{t_{2}}[z](x)$. Together they form the composed semigroup $S_{t_{1}+t_{2}}$ (Claim (ii) of Theorem [7.1](https://arxiv.org/html/2605.28983#S7.Thmtheorem1)). The bottleneck $z$ is the PDE state at the encoder–decoder boundary; its dimension is the effective number of degrees of freedom of the initial-data measure at that time.

## Appendix D Regularity Scope and the SGD–Initial\-Condition Connection

### Regularity scope of Theorem [7.1](https://arxiv.org/html/2605.28983#S7.Thmtheorem1)

The five claims of Theorem [7.1](https://arxiv.org/html/2605.28983#S7.Thmtheorem1) have distinct regularity requirements, reflecting a fundamental separation between the *identification* of neural networks with PDE solutions and the *approximation theory* governing convergence.

- Claims (i) and (ii): Unconditional. The identity $f_{\varepsilon}^{N}(x)=|x|^{2}/(4t)-u_{\varepsilon}^{N}(x,t)$ is a pure algebraic identity holding for *any* weights $W_{j}$ and biases $b_{j}$, with no assumption on the implied initial data $g$, the Hamiltonian $H$, or the input distribution. The proof is a completing-the-square manipulation (Appendix [E](https://arxiv.org/html/2605.28983#A5), Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)) that never invokes regularity. The semigroup tower property (Claim (ii)) follows from the Hopf–Cole substitution $v=e^{-u/\varepsilon}$ reducing the HJ equation to the heat equation $\partial_{t}v=\varepsilon\Delta v$; the linearity and semigroup property of the heat equation hold without any regularity on the initial data. The consequence is decisive: *for any trained neural network with arbitrary weights, there exists an exact viscous HJ PDE for which the network layer is the exact solution.* This is not an asymptotic statement and does not require data or initialization assumptions.
- Claim (iii): Lipschitz $g$ required. The quadrature bound $\|u_{\varepsilon}^{N}-u_{\varepsilon}\|_{\infty}\leq CLN^{-1/d}$ (where $L=\mathrm{Lip}(g)$) requires $g\in\mathrm{Lip}(L)$. The constant $C$ is uniform in $\varepsilon$, which is what allows Claim (v) to hold. For $g\in L^{\infty}$ or $g\in\mathrm{BV}$, viscosity solution theory \[Crandall and Lions, [1983](https://arxiv.org/html/2605.28983#bib.bib8), Evans, [2010](https://arxiv.org/html/2605.28983#bib.bib11)\] still guarantees existence and uniqueness of $u_{\varepsilon}$, but the $O(N^{-1/d})$ quadrature rate may not be achievable and the approximation theory must be revisited.
- Claim (iv): Convex, superlinear $H$ required. Pointwise convergence $u_{\varepsilon}\to u_{0}$ as $\varepsilon\to 0$ is established via the Hopf–Lax formula (Theorem [5.1](https://arxiv.org/html/2605.28983#S5.Thmtheorem1)), which is valid precisely when $H$ is convex and superlinear. Under these conditions, $u_{0}$ is the unique viscosity solution of the inviscid HJ equation and equals the Hopf–Lax inf-convolution. For non-convex $H$, viscosity solutions of the inviscid equation still exist by Crandall and Lions \[[1983](https://arxiv.org/html/2605.28983#bib.bib8)\], but the Hopf–Lax representation need not hold and Claim (iv) requires a separate argument.
- Claim (v): Both conditions required. Commutativity of the limits $N\to\infty$ and $\varepsilon\to 0$ inherits the conditions of both Claims (iii) and (iv): Lipschitz $g$ ensures the quadrature error is uniform in $\varepsilon$, and convex superlinear $H$ ensures the viscosity convergence is uniform in $N$.

### SGD as initial\-condition optimization in the mean\-field limit

The Limitations section of the main text notes that the framework characterizes *what* a trained network computes (the optimal initial-value problem) but not *how* stochastic gradient descent selects it. The mean-field theory of neural networks \[Mei et al., [2018](https://arxiv.org/html/2605.28983#bib.bib32)\] closes this gap in the infinite-width limit.

In the mean-field scaling (width $N\to\infty$ with $1/N$ output normalization), the network function converges to

$$f_{\mu}(x)=-\varepsilon\log\int_{\mathbb{R}^{d}}\exp\!\Bigl(-\frac{g(y)+|x-y|^{2}/(4t)}{\varepsilon}\Bigr)\,d\mu(y),\tag{20}$$

where $\mu\in\mathcal{P}(\mathbb{R}^{d})$ is a probability measure over support points, the infinite-width limit of the empirical measure $\mu^{N}=\frac{1}{N}\sum_{j}\delta_{y_{j}}$. This is exactly the Hopf–Cole solution ([3](https://arxiv.org/html/2605.28983#S4.E3)) under the continuous measure $\mu$.

Training by gradient descent in this limit corresponds to the $L^{2}$-Wasserstein gradient flow of the population risk $\mathcal{R}[\mu]=\mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(f_{\mu}(x),y)]$:

$$\partial_{t}\mu_{t}\;=\;-\nabla_{W_{2}}\,\mathcal{R}[\mu_{t}].\tag{21}$$

Equation ([21](https://arxiv.org/html/2605.28983#A4.E21)) is the gradient flow over the space of initial-data measures for the HJ equation. The measure $\mu$ specifies which initial-value problem the network solves; gradient descent moves $\mu$ in the direction that decreases the risk.

###### Proposition D.1 (SGD selects the initial-value problem).

In the mean-field limit $N\to\infty$, stochastic gradient descent on the network parameters $(W_{j},b_{j})$ converges in distribution to the Wasserstein gradient flow ([21](https://arxiv.org/html/2605.28983#A4.E21)) of the population risk. Each gradient step moves the support points $y_{j}$ and initial values $g(y_{j})$ in the direction that decreases $\mathcal{R}$, identifying the optimization trajectory as the search for the risk-minimizing initial-value problem of the viscous HJ equation.

###### Proof\.

The result is structural: it identifies the HJ interpretation of the mean-field limit established by Mei et al. \[[2018](https://arxiv.org/html/2605.28983#bib.bib32)\] for two-layer networks with $1/N$ normalization. Under that result, gradient flow on the empirical risk converges as $N\to\infty$ to the McKean–Vlasov equation $\partial_{t}\mu_{t}=-\nabla_{W_{2}}\mathcal{R}[\mu_{t}]$. Under the parametrization $W_{j}=y_{j}/(2t)$, $b_{j}=-g(y_{j})-|y_{j}|^{2}/(4t)$, each neuron encodes a support point $(y_{j},g(y_{j}))$ of the initial-data measure; the limiting flow is therefore the Wasserstein gradient flow on that measure, i.e., the search over initial-value problems ([21](https://arxiv.org/html/2605.28983#A4.E21)). The identification applies to two-layer networks under the conditions of Mei et al. \[[2018](https://arxiv.org/html/2605.28983#bib.bib32)\]; extensions to deeper networks or SGD on empirical risk require additional propagation-of-chaos arguments. ∎
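The parametrization used in the proof is invertible, so any neuron $(W_{j},b_{j})$ can be read back as a support point. The sketch below spells out both directions; `encode` and `decode` are hypothetical helper names, and the inverse $g(y_{j})=-b_{j}-t|W_{j}|^{2}$ is obtained by substituting $y_{j}=2tW_{j}$ into the bias formula.

```python
def encode(y, g_y, t):
    """Map a support point (y, g(y)) to neuron parameters (W, b)."""
    W = [yi / (2 * t) for yi in y]
    b = -g_y - sum(yi * yi for yi in y) / (4 * t)
    return W, b

def decode(W, b, t):
    """Invert the map: recover (y, g(y)) from (W, b)."""
    y = [2 * t * wi for wi in W]
    # |y|^2 / (4 t) = t * |W|^2 when y = 2 t W
    g_y = -b - t * sum(wi * wi for wi in W)
    return y, g_y

# Illustrative support point in R^2 (assumed values).
W, b = encode([1.0, -2.0], 0.7, 0.5)
print(W, b)
print(decode(W, b, 0.5))
```

Under this reading, a gradient step on $(W_{j},b_{j})$ is a step on $(y_{j},g(y_{j}))$ expressed in different coordinates.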

The connection to Theorem [8.4](https://arxiv.org/html/2605.28983#S8.Thmtheorem4) (backpropagation as the co-state equation) is direct: the co-state variable $p_{l}$ specifies the optimal direction to update $(y_{j},g(y_{j}))$ at each layer, which is the single-step Euler approximation of the Wasserstein gradient direction ([21](https://arxiv.org/html/2605.28983#A4.E21)). For finite $N$, the particle approximation introduces an error of order $O(N^{-1/d})$ by Claim (iii) of Theorem [7.1](https://arxiv.org/html/2605.28983#S7.Thmtheorem1), the same rate that controls the PDE approximation error. The gap between the mean-field and finite-$N$ regimes is thus governed by the same quantity as the correspondence's approximation theory, providing a unified error budget for both the PDE identification and the optimization interpretation.

Kim and Suzuki \[[2024](https://arxiv.org/html/2605.28983#bib.bib44)\] carry out a related mean-field analysis for transformer training via Wasserstein gradient flow on the attention landscape, proving that the loss landscape over parameters, while nonconvex, is algorithmically benign. The nonconvexity there is in the optimization landscape over weights; the open problem noted in the Limitations concerns non-convex Hamiltonians $H$ in the PDE, a structurally different obstruction at the level of the PDE geometry rather than the parameter space.

### Training dynamics for fixed\-support LSE networks

The results below characterize training when the support points $\{y_{j}\}$ are fixed (e.g., pre-specified or frozen after initialization) and only the bias parameters $\theta_{j}=-g(y_{j})/\varepsilon$ are optimized. This covers the *last-layer* training regime and connects to the convexity structure of the HJ initial datum.

###### Proposition D.2 (Fixed-support convexity).

Let $\{y_{j}\}_{j=1}^{N}\subset\mathbb{R}^{d}$ be fixed support points and let

$$f_{\varepsilon}^{N}(x;\theta)=\varepsilon\log\sum_{j=1}^{N}\exp\!\Bigl(\theta_{j}-\frac{|x-y_{j}|^{2}}{4t\varepsilon}\Bigr),$$

where $\theta=(\theta_{1},\ldots,\theta_{N})\in\mathbb{R}^{N}$. Let $\ell:\mathbb{R}\to\mathbb{R}$ be convex and non-decreasing (e.g. softplus loss or logistic loss), and set $R(\theta)=\frac{1}{n}\sum_{i=1}^{n}\ell(f_{\varepsilon}^{N}(x_{i};\theta))$. Then $R$ is convex in $\theta$, and any local minimum of $R$ is a global minimum.

*Remark.* The condition that $\ell$ be non-decreasing is necessary: squared loss $\ell(u)=(u-y_{i})^{2}$ is convex but not non-decreasing, and $R(\theta)$ is generally non-convex in $\theta$ under MSE. For MSE, the overparameterized regime is addressed separately in Proposition [D.3](https://arxiv.org/html/2605.28983#A4.Thmtheorem3).

###### Proof\.

The exponent $\theta_{j}-|x-y_{j}|^{2}/(4t\varepsilon)$ is affine (hence convex) in $\theta$. Log-sum-exp of convex functions is convex, so $f_{\varepsilon}^{N}(x_{i};\theta)$ is convex in $\theta$ for each fixed $x_{i}$. Explicitly, $\nabla^{2}_{\theta}f_{\varepsilon}^{N}=\varepsilon[\mathrm{diag}(\pi)-\pi\pi^{\top}]$, which is $\varepsilon$ times the covariance matrix of the Gibbs distribution $\pi(x_{i};\theta)$ (the $\varepsilon$ factor arises because $\partial f/\partial\theta_{j}=\varepsilon\pi_{j}$ in this parametrization), hence positive semidefinite. Since $\ell$ is convex and non-decreasing and $f_{\varepsilon}^{N}(x_{i};\cdot)$ is convex, the composition $\ell\circ f_{\varepsilon}^{N}(x_{i};\cdot)$ is convex in $\theta$ by the standard composition rule \[Boyd and Vandenberghe, [2004](https://arxiv.org/html/2605.28983#bib.bib1), Section 3.2.4\]. Averaging over $i$ preserves convexity. Convexity implies every local minimum is global. ∎
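The two convexity statements in the proof can be spot-checked numerically. The sketch below, in one dimension with made-up support points and parameters, evaluates the midpoint convexity gap of $f_{\varepsilon}^{N}(x;\cdot)$ and the Hessian quadratic form $v^{\top}\varepsilon[\mathrm{diag}(\pi)-\pi\pi^{\top}]v$, which is $\varepsilon$ times the variance of $v$ under $\pi$. All function names are hypothetical, and a spot check at a few points is not a proof.

```python
import math

def scores(x, theta, ys, eps, t):
    return [th - (x - y) ** 2 / (4 * t * eps) for th, y in zip(theta, ys)]

def f_lse(x, theta, ys, eps, t):
    """eps * log sum_j exp(theta_j - |x - y_j|^2 / (4 t eps)), computed stably."""
    s = scores(x, theta, ys, eps, t)
    m = max(s)
    return eps * (m + math.log(sum(math.exp(si - m) for si in s)))

def gibbs(x, theta, ys, eps, t):
    """Gibbs weights pi_j(x; theta)."""
    s = scores(x, theta, ys, eps, t)
    m = max(s)
    w = [math.exp(si - m) for si in s]
    z = sum(w)
    return [wi / z for wi in w]

def hessian_form(pi, v, eps):
    """v^T (eps * (diag(pi) - pi pi^T)) v = eps * Var_pi(v)."""
    mean = sum(p * vi for p, vi in zip(pi, v))
    second = sum(p * vi * vi for p, vi in zip(pi, v))
    return eps * (second - mean ** 2)

# Illustrative values (assumed).
ys = [-1.0, 0.5, 2.0]
eps, t, x = 0.3, 1.0, 0.2
theta_a = [0.0, 1.0, -2.0]
theta_b = [3.0, -1.0, 0.5]
theta_mid = [(a + b) / 2 for a, b in zip(theta_a, theta_b)]

# Midpoint convexity gap: non-negative for a convex function.
gap = 0.5 * (f_lse(x, theta_a, ys, eps, t) + f_lse(x, theta_b, ys, eps, t)) \
      - f_lse(x, theta_mid, ys, eps, t)
q = hessian_form(gibbs(x, theta_a, ys, eps, t), [1.0, -2.0, 0.5], eps)
print(gap, q)
```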

###### Proposition D.3 (Overparameterized interpolation via NTK).

Under the setting of Proposition [D.2](https://arxiv.org/html/2605.28983#A4.Thmtheorem2), let $(x_{1},y_{1}),\ldots,(x_{n},y_{n})$ be training data with distinct inputs. Define the neural tangent kernel matrix $K\in\mathbb{R}^{n\times n}$ by

$$K_{ab}(\theta)=\bigl\langle\nabla_{\theta}f_{\varepsilon}^{N}(x_{a};\theta),\,\nabla_{\theta}f_{\varepsilon}^{N}(x_{b};\theta)\bigr\rangle=\varepsilon^{2}\sum_{j=1}^{N}\pi_{j}(x_{a};\theta)\,\pi_{j}(x_{b};\theta).$$

If $N\geq n$ and the support points $\{y_{j}\}_{j=1}^{N}$ are drawn i.i.d. from any absolutely continuous distribution on $\mathbb{R}^{d}$, then $K(0)\succ 0$ almost surely. Consequently, in the kernel-linearization regime, gradient descent on the MSE loss $R(\theta)=\frac{1}{n}\sum_{i}(f_{\varepsilon}^{N}(x_{i};\theta)-y_{i})^{2}$ from initialization $\theta=0$ with step size $\eta<2/\lambda_{\max}(K(0))$ converges to zero training error at a linear rate $(1-2\eta\lambda_{\min}(K(0)))$ per step.

###### Proof\.

For any $c\in\mathbb{R}^{n}$, $c^{\top}K(0)c=\varepsilon^{2}\sum_{j=1}^{N}(\sum_{a}c_{a}\pi_{j}(x_{a};0))^{2}\geq 0$. Equality holds (since $\varepsilon^{2}>0$) iff $Mc=0$, where $M\in\mathbb{R}^{N\times n}$ has entries $M_{ja}=\pi_{j}(x_{a};0)$. Since $\pi_{j}(x_{a};0)=G_{ja}/Z_{a}$ with $G_{ja}=\exp(-|x_{a}-y_{j}|^{2}/(4t\varepsilon))$ and $Z_{a}=\sum_{k}G_{ka}>0$, we have $\mathrm{rank}(M)=\mathrm{rank}(G)$. The entries of $G$ are real-analytic in $(y_{1},\ldots,y_{N})\in(\mathbb{R}^{d})^{N}$, so each $n\times n$ minor of $G$ is real-analytic. At the configuration $y_{j}=x_{j}$ for $j=1,\ldots,n$ (with $N\geq n$), the $n\times n$ leading submatrix of $G$ is the Gaussian kernel matrix $[\exp(-|x_{i}-x_{j}|^{2}/(4t\varepsilon))]$, which is strictly positive definite for distinct training inputs $x_{i}$ (the Gaussian kernel is strictly positive definite on $\mathbb{R}^{d}$). Hence the leading $n\times n$ minor is not identically zero as a function of $y$. By the identity principle for real-analytic functions, the rank-deficiency locus, being contained in the zero set of this minor, has Lebesgue measure zero; since $\{y_{j}\}$ are i.i.d. from an absolutely continuous distribution, $G$ has full column rank $n$ almost surely. Hence $K(0)\succ 0$. The linear convergence then follows from standard NTK theory \[Jacot et al., [2018](https://arxiv.org/html/2605.28983#bib.bib28)\]: with $K(0)\succ 0$ and step size $\eta<2/\lambda_{\max}(K(0))$, the training error decreases at rate $(1-2\eta\lambda_{\min}(K(0)))$ per step in the kernel-linearization regime. ∎
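For a concrete instance, the kernel matrix $K(0)=\varepsilon^{2}M^{\top}M$ can be assembled and tested for positive definiteness with a plain Cholesky factorization. The sketch below is one-dimensional with hand-picked inputs and support points (assumptions, not the i.i.d. setting of the proposition), and `ntk_matrix` and `cholesky` are hypothetical helper names.

```python
import math

def gibbs_at_zero(x, ys, eps, t):
    """pi_j(x; 0) = G_j / Z with G_j = exp(-|x - y_j|^2 / (4 t eps))."""
    g = [math.exp(-(x - y) ** 2 / (4 * t * eps)) for y in ys]
    z = sum(g)
    return [gi / z for gi in g]

def ntk_matrix(xs, ys, eps, t):
    """K_ab(0) = eps^2 * sum_j pi_j(x_a; 0) * pi_j(x_b; 0)."""
    cols = [gibbs_at_zero(x, ys, eps, t) for x in xs]
    return [[eps ** 2 * sum(pa * pb for pa, pb in zip(ca, cb)) for cb in cols]
            for ca in cols]

def cholesky(K):
    """Lower-triangular factor of K; raises if K is not positive definite."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

# n = 3 distinct inputs, N = 5 support points (illustrative values).
xs = [-1.0, 0.0, 1.0]
ys = [-1.5, -0.5, 0.2, 0.9, 1.7]
K = ntk_matrix(xs, ys, eps=0.25, t=1.0)
L = cholesky(K)  # succeeds only if K(0) is positive definite
print([row[i] for i, row in enumerate(L)])
```

A successful factorization with strictly positive pivots is a numerical certificate of $K(0)\succ 0$ for this particular configuration only.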

## Appendix E Proofs

### Proof of Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1) (NN–PDE Identity)

Substituting $W_{j}=y_{j}/(2t)$ and $b_{j}=-g(y_{j})-|y_{j}|^{2}/(4t)$, the pre-activation of neuron $j$ at input $x$ is

$$W_{j}\cdot x+b_{j}=\frac{y_{j}\cdot x}{2t}-g(y_{j})-\frac{|y_{j}|^{2}}{4t}.\tag{22}$$

The first and third terms complete a square: adding and subtracting $|x|^{2}$ gives $2y_{j}\cdot x-|y_{j}|^{2}=|x|^{2}-(|x|^{2}-2x\cdot y_{j}+|y_{j}|^{2})=|x|^{2}-|x-y_{j}|^{2}$, so

$$\frac{y_{j}\cdot x}{2t}-\frac{|y_{j}|^{2}}{4t}=\frac{|x|^{2}-|x-y_{j}|^{2}}{4t}.\tag{23}$$

Substituting ([23](https://arxiv.org/html/2605.28983#A5.E23)) into ([22](https://arxiv.org/html/2605.28983#A5.E22)) isolates the $j$-independent term $|x|^{2}/(4t)$:

$$W_{j}\cdot x+b_{j}=\frac{|x|^{2}}{4t}-\Bigl(\frac{|x-y_{j}|^{2}}{4t}+g(y_{j})\Bigr),\tag{24}$$

which expresses the pre-activation as $|x|^{2}/(4t)$ minus the Hopf–Cole exponent at atom $y_{j}$. Dividing by $\varepsilon$, exponentiating, summing over $j$, and factoring:

$$\sum_{j=1}^{N}e^{(W_{j}\cdot x+b_{j})/\varepsilon}=e^{|x|^{2}/(4\varepsilon t)}\sum_{j=1}^{N}\exp\!\left(\frac{-g(y_{j})-|x-y_{j}|^{2}/(4t)}{\varepsilon}\right).\tag{25}$$

Taking the logarithm and multiplying by $\varepsilon$:

$$f_{\varepsilon}^{N}(x)=\frac{|x|^{2}}{4t}+\underbrace{\varepsilon\log\!\sum_{j=1}^{N}\exp\!\left(\frac{-g(y_{j})-|x-y_{j}|^{2}/(4t)}{\varepsilon}\right)}_{=-u_{\varepsilon}^{N}(x,t)\text{ by definition}}=\frac{|x|^{2}}{4t}-u_{\varepsilon}^{N}(x,t).\tag{26}$$

No approximation is made at any step; the identity is algebraic. ∎

The key is ([23](https://arxiv.org/html/2605.28983#A5.E23)): completing the square in $y_{j}$ reveals the quadratic transport cost $|x-y_{j}|^{2}/(4t)$ hidden inside the linear weight $W_{j}\cdot x$ and the bias term $|y_{j}|^{2}/(4t)$. The parameterization $W_{j}=y_{j}/(2t)$ is not a modelling choice but the unique map that makes ([23](https://arxiv.org/html/2605.28983#A5.E23)) exact.
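Because the identity is algebraic, it can be confirmed to floating-point precision. The following sketch builds $(W_{j},b_{j})$ from made-up atoms $(y_{j},g(y_{j}))$ in $\mathbb{R}^{2}$ and compares the LSE layer output against $|x|^{2}/(4t)-u_{\varepsilon}^{N}(x,t)$. The numerical values and the function names are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def lse_layer(x, W, b, eps):
    """f = eps * log sum_j exp((W_j . x + b_j) / eps), computed stably."""
    pre = [dot(Wj, x) + bj for Wj, bj in zip(W, b)]
    m = max(pre)
    return m + eps * math.log(sum(math.exp((p - m) / eps) for p in pre))

def u_discrete(x, ys, gs, eps, t):
    """u = -eps * log sum_j exp(-(g_j + |x - y_j|^2 / (4 t)) / eps)."""
    F = [gj + sum((xi - yi) ** 2 for xi, yi in zip(x, yj)) / (4 * t)
         for yj, gj in zip(ys, gs)]
    m = min(F)
    return m - eps * math.log(sum(math.exp(-(Fj - m) / eps) for Fj in F))

# Illustrative atoms and parameters (assumed).
t, eps = 0.7, 0.2
ys = [[0.0, 1.0], [1.5, -0.5], [-1.0, -1.0]]
gs = [0.2, -0.3, 0.5]

# Parametrization from the proof: W_j = y_j / (2t), b_j = -g_j - |y_j|^2 / (4t).
W = [[yi / (2 * t) for yi in yj] for yj in ys]
b = [-gj - dot(yj, yj) / (4 * t) for yj, gj in zip(ys, gs)]

x = [0.3, -0.8]
lhs = lse_layer(x, W, b, eps)
rhs = dot(x, x) / (4 * t) - u_discrete(x, ys, gs, eps, t)
print(lhs, rhs)
```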

### Proof of Theorem [5.1](https://arxiv.org/html/2605.28983#S5.Thmtheorem1) (Tropical Limit via Varadhan’s Lemma)

The Hopf–Lax formula $u_{0}(x,t)=\inf_{y}\{|x-y|^{2}/(4t)+g(y)\}$ is the exact solution of the inviscid HJ equation and has a direct ML reading: it is nearest-neighbor retrieval, where $|x-y|^{2}/(4t)$ is the travel cost from query $x$ to memory $y$, and $g(y)$ is the stored value at $y$. The inf-convolution (a minimum over memories) is the $\varepsilon\to 0$ tropical limit of the log-sum-exp average the LSE network computes: at finite $\varepsilon$ all memories contribute with soft weights; at $\varepsilon=0$ only the nearest (cheapest) memory wins. Varadhan’s lemma makes this limit precise by identifying it as the large-deviation rate function of the Gaussian measure.

Write the Hopf–Cole solution as $u_{\varepsilon}(x,t)=-\varepsilon\log\int e^{-F(y)/\varepsilon}\,dy$ where $F(y)=g(y)+|x-y|^{2}/(4t)$. Since $g$ is Lipschitz and $|x-y|^{2}/(4t)\to\infty$ as $|y|\to\infty$, $F$ is continuous and coercive. Varadhan’s lemma \[Varadhan, [1984](https://arxiv.org/html/2605.28983#bib.bib15)\] (the Laplace principle for large deviations) states that for any such $F$:

$$\lim_{\varepsilon\to 0}\Bigl(-\varepsilon\log\int e^{-F(y)/\varepsilon}\,dy\Bigr)=\inf_{y\in\mathbb{R}^{d}}F(y).\tag{27}$$

Applying ([27](https://arxiv.org/html/2605.28983#A5.E27)) directly: $\lim_{\varepsilon\to 0}u_{\varepsilon}(x,t)=\inf_{y}\{g(y)+|x-y|^{2}/(4t)\}=u_{0}(x,t)$, which is the Hopf–Lax formula ([9](https://arxiv.org/html/2605.28983#S5.E9)).

*Proof of ([27](https://arxiv.org/html/2605.28983#A5.E27)).* Let $m=\inf_{y}F(y)$. For the upper bound, fix $\eta>0$ and choose $y^{*}$ with $F(y^{*})\leq m+\eta$; by continuity of $F$ there is $\delta>0$ such that $F(y)\leq m+2\eta$ on the ball $|y-y^{*}|\leq\delta$. The integral is then at least $\int_{|y-y^{*}|\leq\delta}e^{-F(y)/\varepsilon}\,dy\geq e^{-(m+2\eta)/\varepsilon}\cdot\mathrm{vol}(B_{\delta})$. Hence $-\varepsilon\log[\cdots]\leq m+2\eta+\varepsilon\log\mathrm{vol}(B_{\delta})^{-1}\to m+2\eta$ as $\varepsilon\to 0$, and $\eta$ is arbitrary. For the lower bound: since $g$ is $L_{g}$-Lipschitz and $F(y)=g(y)+|x-y|^{2}/(4t)$, the quadratic dominates for large $|y|$: $F(y)-m\geq|y-x|^{2}/(8t)-C^{\prime}$ for a constant $C^{\prime}$ depending on $x$, $L_{g}$, $t$. Combined with $F(y)-m\geq 0$, this gives $F(y)-m\geq\lambda\bigl(|y-x|^{2}/(8t)-C^{\prime}\bigr)$ for every $\lambda\in(0,1]$. Hence $\int e^{-F/\varepsilon}\,dy=e^{-m/\varepsilon}\int e^{-(F(y)-m)/\varepsilon}\,dy\leq e^{(-m+\lambda C^{\prime})/\varepsilon}(8\pi t\varepsilon/\lambda)^{d/2}$, giving $-\varepsilon\log\int e^{-F/\varepsilon}\,dy\geq m-\lambda C^{\prime}-(d\varepsilon/2)\log(8\pi t\varepsilon/\lambda)\to m-\lambda C^{\prime}$ as $\varepsilon\to 0$; letting $\lambda\to 0$ yields the lower bound $m$. Together: $\lim_{\varepsilon\to 0}(-\varepsilon\log[\cdots])=m$. ∎

For the discrete measure $\mu_{N}$, the same argument gives $\min_{j}F(y_{j})=\min_{j}\{g(y_{j})+|x-y_{j}|^{2}/(4t)\}=f_{0}^{N}(x)$ (replacing the integral with a finite sum, where $\lim_{\varepsilon\to 0}-\varepsilon\log\sum_{j}e^{-F(y_{j})/\varepsilon}=\min_{j}F(y_{j})$ by the same Laplace argument).
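In the discrete case the limit comes with an elementary two-sided bound, $\min_{j}F_{j}-\varepsilon\log N\leq-\varepsilon\log\sum_{j}e^{-F_{j}/\varepsilon}\leq\min_{j}F_{j}$, since the sum lies between its largest term and $N$ times that term. The sketch below illustrates the convergence on made-up values of $F(y_{j})$; `soft_min` is a hypothetical helper name.

```python
import math

def soft_min(F, eps):
    """-eps * log sum_j exp(-F_j / eps), computed stably."""
    m = min(F)
    return m - eps * math.log(sum(math.exp(-(Fj - m) / eps) for Fj in F))

# Illustrative values of F(y_j) = g(y_j) + |x - y_j|^2 / (4t) (assumed).
F = [1.3, 0.4, 0.9, 2.1]

for eps in (1.0, 0.1, 0.01, 0.001):
    s = soft_min(F, eps)
    # Gap to the hard minimum is at most eps * log(N).
    print(eps, s, min(F) - s)
```

As $\varepsilon$ decreases, the printed gap shrinks toward zero, which is the tropical limit for a finite set of memories.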

### Proof of Theorem [7.1](https://arxiv.org/html/2605.28983#S7.Thmtheorem1) (Commutative Diagram)

The five claims are verified in order\.

Claim (i) ($f_{\varepsilon}^{N}=|x|^{2}/(4t)-u_{\varepsilon}^{N}$): Proved exactly by Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1).

Claim (ii) (Deep network $=$ HJ semigroup): Under the Hopf–Cole change of variables $v=e^{-u/\varepsilon}$, the viscous HJ equation becomes the heat equation $\partial_{t}v=\varepsilon\Delta v$. The heat equation generates a contraction semigroup $\{e^{\varepsilon t\Delta}\}_{t\geq 0}$ on $L^{2}$; in particular $e^{\varepsilon t_{1}\Delta}\circ e^{\varepsilon t_{2}\Delta}=e^{\varepsilon(t_{1}+t_{2})\Delta}$. Composing $L$ layers corresponds to applying this semigroup for total time $T=t_{1}+\cdots+t_{L}$: the $l$-th layer applies $S_{t_{l}}$ to the solution produced by the previous layer (which encodes the initial data $g$ for that layer). This is exactly the tower (Chapman–Kolmogorov) property of the Markov semigroup.
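For reference, the change of variables can be written out in a few lines. The computation below assumes the quadratic Hamiltonian $H(p)=|p|^{2}$, which is the choice consistent with the transport cost $|x-y|^{2}/(4t)$ and the kernel $\exp(-|x-y|^{2}/(4t\varepsilon))$ used in this appendix; the normalization in the main text is not reproduced here, so this should be read as a sketch under that assumption.

```latex
\begin{align*}
v &= e^{-u/\varepsilon}, \qquad
\partial_t v = -\tfrac{1}{\varepsilon}(\partial_t u)\,v, \qquad
\nabla v = -\tfrac{1}{\varepsilon}(\nabla u)\,v, \\
\Delta v &= \Bigl(\tfrac{1}{\varepsilon^{2}}|\nabla u|^{2}
            - \tfrac{1}{\varepsilon}\Delta u\Bigr)\,v, \\
\partial_t v - \varepsilon\Delta v
  &= -\tfrac{1}{\varepsilon}\bigl(\partial_t u + |\nabla u|^{2}
     - \varepsilon\Delta u\bigr)\,v .
\end{align*}
```

Since $v>0$, the heat equation $\partial_{t}v=\varepsilon\Delta v$ holds exactly when $\partial_{t}u+|\nabla u|^{2}=\varepsilon\Delta u$, so the semigroup property of the heat flow transfers to the viscous HJ propagator.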

Claim (iii) ($u_{\varepsilon}^{N}\to u_{\varepsilon}$ at rate $O(N^{-1/d})$): Proved in step (i) of the proof of Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1).

Claim (iv) ($u_{\varepsilon}\to u_{0}$ as $\varepsilon\to 0$): Proved by Theorem [5.1](https://arxiv.org/html/2605.28983#S5.Thmtheorem1).

Claim (v) (Limits commute under Lipschitz): By Claims (iii) and (iv), $|u_{\varepsilon}^{N}-u_{0}|\leq|u_{\varepsilon}^{N}-u_{\varepsilon}|+|u_{\varepsilon}-u_{0}|\leq C_{1}N^{-1/d}+C_{2}\varepsilon$. Uniformity of $C_{1}$ in $\varepsilon$ requires care: the integrand $\exp(-F_{x}(y)/\varepsilon)$ with $F_{x}(y)=g(y)+|x-y|^{2}/(4t)$ has Lipschitz constant $O(L_{g}/\varepsilon)$, so a naive quadrature bound would diverge. The key observation is that the softmax weights $\pi_{j}^{(\varepsilon)}(x)\propto\exp(-F_{x}(y_{j})/\varepsilon)$ concentrate on the $O(\varepsilon^{1/2})$ neighbourhood of $j^{*}(x)=\arg\min_{j}F_{x}(y_{j})$ as $\varepsilon\to 0$ (Laplace approximation; see Varadhan \[[1984](https://arxiv.org/html/2605.28983#bib.bib15)\] Theorem 3.1). The quadrature error for $-\varepsilon\log Z$ equals the log-ratio $-\varepsilon\log(Z_{\mathrm{cont}}/Z_{N})$ where $Z_{N}=N^{-1}\sum_{j}\exp(-F_{x}(y_{j})/\varepsilon)$ and $Z_{\mathrm{cont}}=\int\exp(-F_{x}(y)/\varepsilon)\,dy$. Since $F_{x}$ is $L_{g}$-Lipschitz and attains its minimum at $y^{*}(x)$, standard Laplace-method estimates (see, e.g., Varadhan \[[1984](https://arxiv.org/html/2605.28983#bib.bib15)\]) show $|-\varepsilon\log(Z_{\mathrm{cont}}/Z_{N})|\leq C_{1}N^{-1/d}$ with $C_{1}=C(d)L_{g}\operatorname{diam}(\mathcal{Y})$ independent of $\varepsilon$. The $\varepsilon$-independence of $C_{1}$ follows from cancellation: the quadrature error satisfies $|Z_{\mathrm{cont}}-Z_{N}|/Z_{\mathrm{cont}}=O(hL_{g}/\varepsilon)$ (the integrand has Lipschitz constant $O(L_{g}/\varepsilon)$); multiplying by the leading $\varepsilon$ in $-\varepsilon\log$ gives $|-\varepsilon\log(Z_{\mathrm{cont}}/Z_{N})|=O(hL_{g})$, with no residual $\varepsilon$ factor. The bound $C_{2}\varepsilon$ is uniform in $N$ (it is a property of the PDE, not the discretization). Therefore both iterated limits ($N\to\infty$ then $\varepsilon\to 0$, or $\varepsilon\to 0$ then $N\to\infty$) yield $u_{0}$, and the diagram commutes. ∎
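The behaviour of the quadrature error in $-\varepsilon\log Z$ can be probed in one dimension. The sketch below is an illustration under stated assumptions, not the paper's construction: it uses a midpoint rule with explicit Riemann weights $h$ on a truncated interval, the Lipschitz datum $g(y)=|y|$, and a fine-grid value as the reference.

```python
import math

def g(y):
    return abs(y)  # 1-Lipschitz initial data (illustrative choice)

def u_quadrature(x, n_points, eps, t, lo=-6.0, hi=6.0):
    """-eps * log of a midpoint-rule approximation of the Hopf-Cole integral."""
    h = (hi - lo) / n_points
    total = 0.0
    for j in range(n_points):
        y = lo + (j + 0.5) * h
        total += math.exp(-(g(y) + (x - y) ** 2 / (4 * t)) / eps) * h
    return -eps * math.log(total)

x, eps, t = 0.3, 0.5, 1.0
reference = u_quadrature(x, 200_000, eps, t)  # fine grid used as reference

errors = {}
for n in (20, 80, 320):
    errors[n] = abs(u_quadrature(x, n, eps, t) - reference)
    print(n, errors[n])
```

The error decreases as the grid is refined. Rerunning with a different `eps` gives a rough empirical view of how the error in $u$ depends on $\varepsilon$ at fixed grid spacing.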

### Multilayer Composition: Finite\-Depth Error and Joint\-Limit Exactness

The following two results quantify the correspondence for deep networks with $L$ layers, complementing Claim (ii) of Theorem [7.1](https://arxiv.org/html/2605.28983#S7.Thmtheorem1).

###### Theorem E.1 (Finite-depth composition error).

Let $f^{(L)}_{\varepsilon}$ be a depth-$L$ LSE network with layer widths $N_{1},\ldots,N_{L}$ and time-scales $t_{1},\ldots,t_{L}$, and let $u_{\varepsilon}$ be the continuous Hopf–Cole solution at time $T=\sum_{\ell=1}^{L}t_{\ell}$ under the limiting measure. Under Lipschitz initial data with uniform constant $L_{g}$,

$$\sup_{x\in K}\Bigl|f^{(L)}_{\varepsilon}(x)-\Bigl[\tfrac{|x|^{2}}{4T}-u_{\varepsilon}(x,T)\Bigr]\Bigr|\;\leq\;C(K,L_{g},T)\cdot L\cdot\max_{\ell}N_{\ell}^{-1/d}.\tag{28}$$

The cumulative error is linear in depth and inherits the single-layer quadrature rate $O(N^{-1/d})$.

###### Proof\.

At each layer $\ell$, the single-layer identity (Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)) gives $f_{\varepsilon}^{N_{\ell}}(x)=|x|^{2}/(4t_{\ell})-u_{\varepsilon}^{N_{\ell}}(x,t_{\ell})$ with error $|u_{\varepsilon}^{N_{\ell}}-u_{\varepsilon}^{(\ell)}|\leq C_{1}N_{\ell}^{-1/d}$ (Claim (iii)). The Hopf–Cole solution $u_{\varepsilon}^{(\ell)}$ is $L_{g}$-Lipschitz in $x$ since $g$ is Lipschitz and the heat semigroup is a contraction on Lip. Composition of $L$ Lipschitz layers propagates the error additively: each layer contributes at most $C_{1}N_{\ell}^{-1/d}$, and propagation through the remaining $L-\ell$ layers multiplies the error by at most a Lipschitz constant bounded by $1$ (non-expansion of the heat semigroup). Summing over $\ell=1,\ldots,L$ and taking $\max_{\ell}N_{\ell}^{-1/d}$ as the worst-case rate gives the stated bound. ∎

###### Theorem E.2 (Asymptotic exactness in the joint limit).

Under the same assumptions as Theorem [E.1](https://arxiv.org/html/2605.28983#A5.Thmtheorem1), if $L\to\infty$, $N\to\infty$, $h=T/L\to 0$ with $L\cdot N^{-1/d}\to 0$, then the depth-$L$ network converges to the continuous HJ semigroup:

$$f^{(L)}_{\varepsilon}(x)\to\tfrac{|x|^{2}}{4T}-u_{\varepsilon}(x,T)\quad\text{uniformly on compacts.}\tag{29}$$

The unification at finite depth is approximate with error quantified by Theorem [E.1](https://arxiv.org/html/2605.28983#A5.Thmtheorem1); in the joint limit the network exactly implements the HJ semigroup.

###### Proof\.

By Theorem [E.1](https://arxiv.org/html/2605.28983#A5.Thmtheorem1), the total error is $O(L\cdot N^{-1/d})$. Under the condition $L\cdot N^{-1/d}\to 0$, this bound vanishes, and the supremum on compacts $K$ tends to zero by uniform Lipschitz control. The limiting object is $|x|^{2}/(4T)-u_{\varepsilon}(x,T)$, the continuous HJ semigroup applied to the initial data $g$ and evaluated at total time $T$. ∎

### Proof of Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1) (Generalization)

Decompose the predictor error directly: $|u_{\varepsilon}^{N}(x)-f^{\ast}(x)|\leq|u_{\varepsilon}^{N}-u_{\varepsilon}|+|u_{\varepsilon}-u_{0}|+|u_{0}-f^{\ast}|$.

(i) *Quadrature* ($|u_{\varepsilon}^{N}-u_{\varepsilon}|$): Place $y_j$ on a regular $d$-dimensional grid of spacing $h=N^{-1/d}$. The Hopf–Cole integrand is $\Phi(y)=\exp\bigl((-g(y)-|x-y|^{2}/(4t))/\varepsilon\bigr)$; for Lipschitz $g$ this is Lipschitz in $y$ with constant $O(1/\varepsilon)$. Deterministic quadrature on the $h$-grid gives $|u_{\varepsilon}^{N}-u_{\varepsilon}|=O(h/\varepsilon\cdot\varepsilon)=O(h)=O(N^{-1/d})$ (the $\varepsilon$ factors cancel in $-\varepsilon\log$).

(ii) *Viscosity bias* ($|u_{\varepsilon}-u_{0}|$): The comparison principle gives $\lVert u_{\varepsilon}-u_{0}\rVert_{\infty}\leq CL\varepsilon$ for Lipschitz initial data [Evans, [2010](https://arxiv.org/html/2605.28983#bib.bib11); Crandall and Lions, [1983](https://arxiv.org/html/2605.28983#bib.bib8)].

(iii) *Approximation* ($|u_{0}-f^{\ast}|$): Set $g(y_j)=f^{\ast}(y_j)+L^{2}t$ on the grid, where $L$ in this step denotes the Lipschitz constant of $f^{\ast}$. For any $x$ let $y_{j^{\ast}}$ be the nearest grid point, so that $|x-y_{j^{\ast}}|\leq h\sqrt{d}/2$. *Lower bound*: for any $j$, $f^{\ast}(y_j)+L^{2}t+|x-y_j|^{2}/(4t)\geq f^{\ast}(x)-L|x-y_j|+L^{2}t+|x-y_j|^{2}/(4t)\geq f^{\ast}(x)$ (completing the square, $-Lz+z^{2}/(4t)\geq-L^{2}t$, using only the $L$-Lipschitz constant of $f^{\ast}$), so $u_{0}(x)\geq f^{\ast}(x)$. *Upper bound*: evaluating at $j^{\ast}$, $u_{0}(x)\leq f^{\ast}(y_{j^{\ast}})+L^{2}t+|x-y_{j^{\ast}}|^{2}/(4t)\leq f^{\ast}(x)+Lh\sqrt{d}/2+L^{2}t+dh^{2}/(16t)$. Hence $u_{0}(x)\in[f^{\ast}(x),\ f^{\ast}(x)+Lh\sqrt{d}/2+L^{2}t+O(h^{2}/t)]$. Under the generalization gauge $t\asymp h=N^{-1/d}$, both $Lh\sqrt{d}/2$ and $L^{2}t$ are $O(h)$, and the remainder $O(h^{2}/t)$ is $O(h)$ as well, giving $|u_{0}-f^{\ast}|=O(h)=O(N^{-1/d})$.

Balancing (i)–(ii) yields $\varepsilon^{\ast}\asymp N^{-1/d}$. The classes $\lbrace u_{\varepsilon}^{N}:\lVert W_j\rVert_{2}\leq M\rbrace$ and $\lbrace f_{\varepsilon}^{N}:\lVert W_j\rVert_{2}\leq M\rbrace$ differ only by the shift $|x|^{2}/(4t)$, which is independent of $(W,b)$, so they have the same Rademacher complexity. This complexity is $O(M\sqrt{N/n})$ by a standard Rademacher bound, giving excess risk $O(N^{-1/d}+M\sqrt{N/n})$. Note that $M=\mathrm{diam}(\mathcal{X})/(2t)$; under the optimal approximation gauge $t\asymp N^{-1/d}$, $M$ grows as $O(N^{1/d})$, and the Rademacher term becomes $O(N^{1/d}\sqrt{N/n})=O(n^{-1/(d+4)})$ after balancing, a slower rate. The minimax rate $O(n^{-1/(d+2)})$ is recovered when $t$ is treated as a fixed hyperparameter (so $M$ is fixed), and $N^{\ast}\asymp(n/M^{2})^{d/(d+2)}$ balances the terms. ∎
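The balancing step in the fixed-$t$ regime can be checked with a few lines of arithmetic. The sketch below sets every hidden constant to 1, which is an assumption made only for illustration; the function names are hypothetical.

```python
import math

def balanced_width(n: int, d: int, M: float = 1.0) -> float:
    """Width N* equating N^(-1/d) with M*sqrt(N/n), with t (hence M) held fixed."""
    return (n / M**2) ** (d / (d + 2))

def excess_risk_bound(N: float, n: int, d: int, M: float = 1.0) -> float:
    """Approximation term plus estimation term, all constants set to 1 (assumption)."""
    return N ** (-1.0 / d) + M * math.sqrt(N / n)

n, d, M = 10**6, 4, 2.0
N_star = balanced_width(n, d, M)
approx_term = N_star ** (-1.0 / d)        # N^(-1/d)
estim_term = M * math.sqrt(N_star / n)    # M * sqrt(N/n)
rate = (n / M**2) ** (-1.0 / (d + 2))     # predicted n^(-1/(d+2)) scaling
```

At `N_star` the two terms coincide and both equal `(n / M**2) ** (-1 / (d + 2))`, which is the $n^{-1/(d+2)}$ rate up to the factor depending on $M$.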

### Proof of Corollary [8.2](https://arxiv.org/html/2605.28983#S8.Thmtheorem2) (Robustness)

The Hessian of $f_{\varepsilon}^{N}=\mathrm{LSE}_{\varepsilon}(Wx+b)$ is $\nabla_{x}^{2}f_{\varepsilon}^{N}=\varepsilon^{-1}W^{\top}\bigl(\mathrm{diag}(\pi)-\pi\pi^{\top}\bigr)W$, where $\pi=\mathrm{softmax}((Wx+b)/\varepsilon)$. For unit $u$, $u^{\top}(\nabla_{x}^{2}f_{\varepsilon}^{N})u=\varepsilon^{-1}\mathrm{Var}_{\pi}\bigl[(Wu)_j\bigr]$. Since $(Wu)_j=W_j\cdot u$ with $\lVert u\rVert_{2}=1$, it holds that $|(Wu)_j|\leq\lVert W_j\rVert_{2}\leq\lVert W\rVert_{2,\infty}$, so the range of $(Wu)_j$ over atoms $j$ is at most $2\lVert W\rVert_{2,\infty}$. Popoviciu's inequality $\mathrm{Var}(X)\leq(\max X-\min X)^{2}/4$ then gives $\mathrm{Var}_{\pi}[(Wu)_j]\leq\lVert W\rVert_{2,\infty}^{2}$, whence ([13](https://arxiv.org/html/2605.28983#S8.E13)). For the gradient: $\nabla_{x}f_{\varepsilon}^{N}=W^{\top}\pi$ is a convex combination of the rows $W_j$, so $\lVert\nabla_{x}f_{\varepsilon}^{N}\rVert_{2}\leq\sum_{j}\pi_j\lVert W_j\rVert_{2}\leq\lVert W\rVert_{2,\infty}$. The adversarial bound ([14](https://arxiv.org/html/2605.28983#S8.E14)) then follows by Taylor expansion for $\lVert\delta\rVert_{2}\leq r$: $|f_{\varepsilon}^{N}(x+\delta)-f_{\varepsilon}^{N}(x)|\leq\lVert\nabla_{x}f_{\varepsilon}^{N}\rVert_{2}\lVert\delta\rVert_{2}+\tfrac{1}{2}\lVert\nabla_{x}^{2}f_{\varepsilon}^{N}\rVert_{2}\lVert\delta\rVert_{2}^{2}\leq\lVert W\rVert_{2,\infty}r+\lVert W\rVert_{2,\infty}^{2}r^{2}/(2\varepsilon)$.
For the Hopf–Cole predictor $u_{\varepsilon}^{N}=|x|^{2}/(4t)-f_{\varepsilon}^{N}$, the Hessian is $\nabla_{x}^{2}u_{\varepsilon}^{N}=\frac{1}{2t}I-\nabla_{x}^{2}f_{\varepsilon}^{N}$, giving $\lVert\nabla_{x}^{2}u_{\varepsilon}^{N}\rVert_{2}\leq\frac{1}{2t}+\frac{\lVert W\rVert_{2,\infty}^{2}}{\varepsilon}$; the $\frac{1}{2t}$ term dominates at small $t$. ∎
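The gradient and Hessian bounds above are easy to confirm numerically. The following NumPy sketch uses random weights and an arbitrary $\varepsilon$ (both assumptions for illustration); `lse_grad_hess` is a hypothetical helper name.

```python
import numpy as np

def lse_grad_hess(W, b, x, eps):
    """Softmax weights, gradient W^T pi and Hessian of LSE_eps(Wx + b)."""
    z = (W @ x + b) / eps
    z = z - z.max()                      # stabilise the exponentials
    pi = np.exp(z)
    pi /= pi.sum()
    grad = W.T @ pi
    hess = W.T @ (np.diag(pi) - np.outer(pi, pi)) @ W / eps
    return pi, grad, hess

rng = np.random.default_rng(0)
N, d, eps = 8, 3, 0.5
W = rng.normal(size=(N, d))
b = rng.normal(size=N)
x = rng.normal(size=d)

pi, grad, hess = lse_grad_hess(W, b, x, eps)
row_norm_max = np.linalg.norm(W, axis=1).max()   # ||W||_{2,inf}: largest row norm
grad_norm = np.linalg.norm(grad)
hess_norm = np.linalg.norm(hess, 2)              # spectral norm
```

Here `grad_norm` should not exceed `row_norm_max`, and `hess_norm` should not exceed `row_norm_max**2 / eps`, matching the two bounds used in the Taylor estimate.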

### Proof of Theorem [8.4](https://arxiv.org/html/2605.28983#S8.Thmtheorem4) (Backpropagation)

The Pontryagin Maximum Principle (PMP) is the optimal-control analogue of the Euler–Lagrange equations: given a dynamical system $\dot{x}=F(x,u)$ and a cost $\int_{0}^{T}\ell(x,u)\,dt+g(x_{T})$, the PMP says the optimal control $u^{\ast}$ satisfies a Hamiltonian system (a forward ODE for the state $x$ and a backward ODE for the co-state $p=\partial\mathcal{L}/\partial x$), where $p$ is the adjoint variable that propagates the gradient of the cost backward in time. Backpropagation is exactly this co-state equation: the forward ResNet pass $x_{l+1}=x_{l}+hF(x_{l},W_{l})$ is the discretized state ODE, and the backward pass $p_{l-1}=p_{l}+h(\nabla_{x}F)^{\top}p_{l}$ is the discretized co-state ODE, with $p_{L}=\nabla_{x_{L}}\mathcal{L}$ as terminal condition. RL practitioners recognize this as the “adjoint method” for trajectory optimization; the present result makes the connection exact rather than analogical.

Differentiating the ResNet layer: $\partial x_{l}/\partial x_{l-1}=I+h\nabla_{x}F(x_{l-1},W_{l})$. The chain rule gives $p_{l-1}=(\partial x_{l}/\partial x_{l-1})^{\top}p_{l}=p_{l}+h(\nabla_{x}F)^{\top}p_{l}$, confirming ([16](https://arxiv.org/html/2605.28983#S8.E16)). The Hamiltonian $H(x,p)=p^{\top}F(x,\theta)$ has Hamilton's equations $\dot{x}=F(x,\theta)$ and $\dot{p}=-(\nabla_{x}F)^{\top}p$. Applying forward Euler to ([17](https://arxiv.org/html/2605.28983#S8.E17)) at $t_{l}$ and stepping backward by $h$ to $t_{l-1}=t_{l}-h$ gives $p(t_{l-1})\approx p(t_{l})-h\dot{p}(t_{l})=p(t_{l})+h(\nabla_{x}F)^{\top}p(t_{l})$, which agrees with ([16](https://arxiv.org/html/2605.28983#S8.E16)) up to $O(h^{2})$: the continuous adjoint evaluates $\nabla_{x}F$ at the trajectory point $x(t_{l})$, while the discrete scheme evaluates at $x_{l-1}$; these differ by one forward Euler step of size $O(h)$, producing an $O(h^{2})$ discrepancy in $p$. ∎
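The discrete co-state recursion can be compared against a finite-difference gradient. In the sketch below the residual branch $F(x,W)=\tanh(Wx)$ and the quadratic terminal loss are assumptions chosen for illustration, not the paper's setup; only the recursion $p_{l-1}=p_{l}+h(\nabla_{x}F)^{\top}p_{l}$ is taken from the text.

```python
import numpy as np

def F(x, W):
    return np.tanh(W @ x)                 # assumed residual branch

def jac_F(x, W):
    # d/dx tanh(Wx) = diag(1 - tanh(Wx)^2) W
    return (1.0 - np.tanh(W @ x) ** 2)[:, None] * W

def forward(x0, Ws, h):
    """Forward Euler / ResNet pass; returns all states x_0, ..., x_L."""
    xs = [x0]
    for W in Ws:
        xs.append(xs[-1] + h * F(xs[-1], W))
    return xs

def costate_backward(xs, Ws, h, p_L):
    """p_{l-1} = p_l + h (grad_x F(x_{l-1}, W_l))^T p_l, from l = L down to 1."""
    p = p_L
    for l in range(len(Ws), 0, -1):
        p = p + h * jac_F(xs[l - 1], Ws[l - 1]).T @ p
    return p

def loss(x):
    return 0.5 * np.sum(x ** 2)           # assumed terminal loss, gradient is x

rng = np.random.default_rng(0)
d, L, h = 3, 10, 0.1
Ws = [rng.normal(size=(d, d)) for _ in range(L)]
x0 = rng.normal(size=d)

xs = forward(x0, Ws, h)
p0 = costate_backward(xs, Ws, h, p_L=xs[-1])

# Central finite differences of the loss with respect to the input x_0.
delta = 1e-6
fd = np.array([
    (loss(forward(x0 + delta * e, Ws, h)[-1]) - loss(forward(x0 - delta * e, Ws, h)[-1])) / (2 * delta)
    for e in np.eye(d)
])
```

The vector `p0` produced by the co-state recursion should agree with `fd` to finite-difference accuracy, since the recursion is the chain rule applied layer by layer.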

### Proof of Proposition [8.5](https://arxiv.org/html/2605.28983#S8.Thmtheorem5) (Hallucination)

Write $z_j=W_j\cdot x+b_j$ and let $z^{\ast}=z_{j^{\ast}}$. Then $f_{\varepsilon}^{N}(x)=\varepsilon\log\bigl(e^{z^{\ast}/\varepsilon}+\sum_{j\neq j^{\ast}}e^{z_j/\varepsilon}\bigr)=z^{\ast}+\varepsilon\log\bigl(1+\sum_{j\neq j^{\ast}}e^{(z_j-z^{\ast})/\varepsilon}\bigr)$. Since $z_j=|x|^{2}/(4t)-(g(y_j)+|x-y_j|^{2}/(4t))$, it follows that $z_j-z^{\ast}=(g(y_{j^{\ast}})+|x-y_{j^{\ast}}|^{2}/(4t))-(g(y_j)+|x-y_j|^{2}/(4t))\leq-\Delta(x)$ by definition of $j^{\ast}$ and $\Delta$. Each term in the sum is at most $e^{-\Delta(x)/\varepsilon}$, giving ([18](https://arxiv.org/html/2605.28983#S8.E18)). ∎
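The argument can be illustrated numerically. Equation (18) is not reproduced in this section, so the sketch below checks only what the proof directly implies: $0\leq f_{\varepsilon}^{N}(x)-z^{\ast}\leq\varepsilon\log\bigl(1+(N-1)e^{-\Delta/\varepsilon}\bigr)$, with $\Delta$ taken as the gap between the largest and second-largest pre-activation. The random pre-activations are an assumption for illustration.

```python
import numpy as np

def lse_eps(z, eps):
    """Numerically stable eps * log(sum_j exp(z_j / eps))."""
    m = z.max()
    return m + eps * np.log(np.exp((z - m) / eps).sum())

rng = np.random.default_rng(1)
eps = 0.2
z = rng.normal(size=16)                    # assumed pre-activations z_j

z_sorted = np.sort(z)
z_star = z_sorted[-1]                      # winning pre-activation z*
gap = z_sorted[-1] - z_sorted[-2]          # Delta: margin to the runner-up

excess = lse_eps(z, eps) - z_star
upper = eps * np.log1p((z.size - 1) * np.exp(-gap / eps))
```

The quantity `excess` lies between 0 and `upper`; as the margin `gap` grows relative to `eps`, the bound shrinks exponentially.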

### Proof of Proposition [8.8](https://arxiv.org/html/2605.28983#S8.Thmtheorem8) (Scaling Laws)

Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1) is stated in ambient dimension $d$, giving rate $O(N^{-1/d})$. When $\mathrm{supp}(\mu)\subset\mathcal{M}$ for a compact $d_{\mathrm{eff}}$-dimensional Riemannian submanifold $\mathcal{M}\subset\mathbb{R}^{d}$, and $g$ is $L_g$-Lipschitz on $\mathcal{M}$ in the geodesic metric, the bound refines to $O(N^{-1/d_{\mathrm{eff}}})$ as follows. A compact $d_{\mathrm{eff}}$-dimensional manifold has covering number $\mathcal{N}(\mathcal{M},\delta)\leq C_{\mathcal{M}}\delta^{-d_{\mathrm{eff}}}$ (standard Riemannian metric-entropy bound). Placing $N$ support points as a $\delta$-net on $\mathcal{M}$ with $\delta=C_{\mathcal{M}}^{1/d_{\mathrm{eff}}}N^{-1/d_{\mathrm{eff}}}$ ensures every $y\in\mathcal{M}$ has a support point within distance $\delta$. The quadrature argument of Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1) step (i) then gives $|u_{\varepsilon}^{N}-u_{\varepsilon}|=O(\delta)=O(N^{-1/d_{\mathrm{eff}}})$, since the exponent of the Hopf–Cole integrand, $F_{x}(y)=g(y)+|x-y|^{2}/(4t)$, is Lipschitz on $\mathcal{M}$ and the $\varepsilon$-factors cancel as before.

Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1) (with $d$ replaced by $d_{\mathrm{eff}}$) gives $\mathcal{L}(N)\leq C_{1}N^{-1/d_{\mathrm{eff}}}+C_{2}\varepsilon^{\ast}+C_{3}\sqrt{N/n}$ with $\varepsilon^{\ast}=N^{-1/d_{\mathrm{eff}}}$. At the training optimum the approximation and estimation terms balance: $N^{-1/d_{\mathrm{eff}}}\asymp\sqrt{N/n}$, giving $N\asymp n^{d_{\mathrm{eff}}/(d_{\mathrm{eff}}+2)}$ and $\mathcal{L}\asymp n^{-1/(d_{\mathrm{eff}}+2)}$. With compute $C\propto Nn$ and the optimal $N$: $C\propto n^{d_{\mathrm{eff}}/(d_{\mathrm{eff}}+2)}\cdot n=n^{(2d_{\mathrm{eff}}+2)/(d_{\mathrm{eff}}+2)}$, whence $n\propto C^{(d_{\mathrm{eff}}+2)/(2d_{\mathrm{eff}}+2)}$ and $\mathcal{L}(C)\asymp C^{-1/(2d_{\mathrm{eff}}+2)}$. The width-only scaling $\mathcal{L}(N)\asymp N^{-1/d_{\mathrm{eff}}}$ follows directly from the quadrature rate, identifying $\alpha=1/d_{\mathrm{eff}}$. ∎
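The exponent bookkeeping above can be reproduced exactly with rational arithmetic. The helper below is a hypothetical convenience, and $d_{\mathrm{eff}}=8$ is an arbitrary illustrative value rather than an estimate for any dataset.

```python
from fractions import Fraction

def scaling_exponents(d_eff: int) -> dict:
    """Exponents implied by balancing N^(-1/d_eff) against sqrt(N/n)."""
    width = Fraction(1, d_eff)                      # L(N) ~ N^(-1/d_eff)
    n_to_N = Fraction(d_eff, d_eff + 2)             # N ~ n^(d_eff/(d_eff+2))
    data = Fraction(1, d_eff + 2)                   # L(n) ~ n^(-1/(d_eff+2))
    n_to_C = 1 + n_to_N                             # C ~ N*n ~ n^((2 d_eff+2)/(d_eff+2))
    compute = data / n_to_C                         # L(C) ~ C^(-1/(2 d_eff+2))
    return {"width": width, "data": data, "compute": compute}

exps = scaling_exponents(8)
```

For `d_eff = 8` this yields width, data and compute exponents of 1/8, 1/10 and 1/18 respectively, consistent with $\mathcal{L}(C)\asymp C^{-1/(2d_{\mathrm{eff}}+2)}$.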

###### Proposition E.3 (Measure poisoning and viscosity defense).

Suppose an adversary inserts a poisoned neuron with pre-activation $z_{0}=W_{0}\cdot x+b_{0}$. The shift in output is

$$f_{\varepsilon}^{N+1}(x)-f_{\varepsilon}^{N}(x)=\mathrm{softplus}_{\varepsilon}\bigl(z_{0}-f_{\varepsilon}^{N}(x)\bigr).\tag{30}$$

The attack satisfies: (i) *Bounded damage*: the shift is at most $\max(z_{0}-f_{\varepsilon}^{N}(x),0)+\varepsilon\log 2$. (ii) *Effective attack*: the shift exceeds $\varepsilon\log 2$ iff $z_{0}>f_{\varepsilon}^{N}(x)$. (iii) *Defense via viscosity*: writing $\gamma=z_{0}-f_{\varepsilon}^{N}(x)$, the shift tends to $\gamma$ as $\varepsilon\to 0$ when $\gamma>0$, but is bounded by $\varepsilon\log 2$ when $\gamma\leq 0$.

### Proof of Proposition [E.3](https://arxiv.org/html/2605.28983#A5.Thmtheorem3) (Measure Poisoning)

The log-sum-exp identity

$$\mathrm{LSE}_{\varepsilon}\bigl(\lbrace z_j\rbrace_{j=0}^{N}\bigr)=\mathrm{LSE}_{\varepsilon}\bigl(\lbrace z_j\rbrace_{j=1}^{N}\bigr)+\varepsilon\log\Bigl(1+e^{\bigl(z_{0}-\mathrm{LSE}_{\varepsilon}(\lbrace z_j\rbrace_{j=1}^{N})\bigr)/\varepsilon}\Bigr)$$

gives ([30](https://arxiv.org/html/2605.28983#A5.E30)) directly, with $\mathrm{softplus}_{\varepsilon}(u)=\varepsilon\log(1+e^{u/\varepsilon})$. Claims (i)–(iii) follow from the properties of softplus: $\mathrm{softplus}_{\varepsilon}(u)\leq\max(u,0)+\varepsilon\log 2$ and $\mathrm{softplus}_{\varepsilon}(u)\to\max(u,0)$ as $\varepsilon\to 0$. ∎
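The identity and the damage bound can be verified directly. The sketch below appends one extra pre-activation to a random clean network; the specific values of `z`, `z0` and `eps` are assumptions chosen for illustration.

```python
import numpy as np

def lse_eps(z, eps):
    """Numerically stable eps * log(sum_j exp(z_j / eps))."""
    m = z.max()
    return m + eps * np.log(np.exp((z - m) / eps).sum())

def softplus_eps(u, eps):
    """eps * log(1 + exp(u / eps)), computed via logaddexp for stability."""
    return eps * np.logaddexp(0.0, u / eps)

rng = np.random.default_rng(2)
eps = 0.3
z = rng.normal(size=10)          # pre-activations of the clean network
z0 = 1.5                         # pre-activation of the inserted neuron

f_clean = lse_eps(z, eps)
f_poisoned = lse_eps(np.append(z, z0), eps)
shift = f_poisoned - f_clean
gamma = z0 - f_clean
damage_bound = max(gamma, 0.0) + eps * np.log(2.0)
```

The computed `shift` matches `softplus_eps(gamma, eps)` and stays below `damage_bound`, in line with claims (i) and (ii).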

### Proof of Proposition [5.2](https://arxiv.org/html/2605.28983#S5.Thmtheorem2) (ResNet = ODE Characteristics)

The recurrence $x_{l+1}=x_{l}+hF(x_{l},\theta_{l})$ is the forward Euler scheme for $\dot{x}=F(x,\theta(t))$, $x(0)=x_{0}$, with stepsize $h$. When $F$ is $L_{F}$-Lipschitz in $x$, the local truncation error per step satisfies $\lVert x_{l+1}-x(t_{l}+h)\rVert\leq\tfrac{h^{2}}{2}\sup\lVert\ddot{x}\rVert$, and the discrete Gronwall inequality accumulates this to a global error of $O(h)$, giving uniform convergence as $h\to 0$, $L\to\infty$, $Lh=T$.

Observe that the limiting ODE is the $x$-characteristic of the HJ PDE with Hamiltonian $H(x,p)=p\cdot F(x,\theta(t))$. Hamilton's equations for $H$ read

$$\dot{x}=\nabla_{p}H=F(x,\theta(t)),\qquad\dot{p}=-\nabla_{x}H=-(\nabla_{x}F(x,\theta))^{\top}p;$$

the first is the ResNet ODE and the second is the co-state equation of Theorem [8.4](https://arxiv.org/html/2605.28983#S8.Thmtheorem4). ∎
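The $O(h)$ global error of the forward Euler (ResNet) recurrence can be observed on a toy problem. The scalar field $F(x)=-x$ with exact solution $x(T)=x_{0}e^{-T}$ is an assumption chosen only because it has a closed form; it is not an architecture from the paper.

```python
import math

def resnet_forward(x0, T, L, F):
    """L residual steps x <- x + h F(x) with h = T / L (forward Euler)."""
    h = T / L
    x = x0
    for _ in range(L):
        x = x + h * F(x)
    return x

def F(x):
    return -x                      # assumed toy vector field

exact = math.exp(-1.0)             # x(1) for x(0) = 1
errors = {L: abs(resnet_forward(1.0, 1.0, L, F) - exact) for L in (10, 20, 40, 80)}
ratios = [errors[L] / errors[2 * L] for L in (10, 20, 40)]
```

Each doubling of the depth `L` (halving `h`) roughly halves the error, so the entries of `ratios` sit close to 2, the signature of first-order convergence.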

### Proof of Proposition [5.4](https://arxiv.org/html/2605.28983#S5.Thmtheorem4) (Recurrent Architectures)

The RNN recurrence satisfies $h_{t}-h_{t-1}=\sigma(W_{h}h_{t-1}+W_{x}x_{t})-h_{t-1}$, which is the unit-step Euler discretization of $\dot{h}(t)=-h(t)+\sigma(W_{h}h(t)+W_{x}x(t))$. The Hamiltonian $H(h,p)=p\cdot(-h+\sigma(W_{h}h+W_{x}x))$ has characteristic equations $\dot{h}=\nabla_{p}H$ and $\dot{p}=-\nabla_{h}H=(I-W_{h}^{\top}\mathrm{diag}(\sigma'))p$, identifying the recurrence as the unit-step discretization of the $h$-characteristic.

The linear SSM $\dot{h}(t)=A(t)h(t)+B(t)x(t)$ is already a continuous-time ODE. Setting $H(h,p,t)=p^{\top}(A(t)h+B(t)x(t))$, Hamilton's equations give $\dot{h}=\nabla_{p}H=A(t)h+B(t)x$ and $\dot{p}=-\nabla_{h}H=-A(t)^{\top}p$, recovering the state equation and its adjoint. Input-dependent $A(t),B(t)$ make the characteristics non-autonomous, the same structure as in Tong et al. [[2025](https://arxiv.org/html/2605.28983#bib.bib31)]. The tropical limit replaces matrix–vector multiplication by max-plus: $h_{t+1}=A\otimes_{\mathrm{tr}}h_{t}\oplus B\otimes_{\mathrm{tr}}x_{t}$. ∎
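A minimal sketch of the max-plus recurrence follows. It reads $\otimes_{\mathrm{tr}}$ as the max-plus matrix–vector product and $\oplus$ as an entrywise maximum, and compares the result with a log-sum-exp smoothing at small $\varepsilon$. That smoothing is one natural $\varepsilon$-deformation and is an assumption here, not a formula quoted from the paper.

```python
import numpy as np

def maxplus_matvec(A, v):
    """Max-plus product: (A (x) v)_i = max_j (A_ij + v_j)."""
    return (A + v[None, :]).max(axis=1)

def tropical_step(A, B, h, x):
    """h_next = (A (x) h) (+) (B (x) x), with (+) the entrywise maximum."""
    return np.maximum(maxplus_matvec(A, h), maxplus_matvec(B, x))

def smoothed_step(A, B, h, x, eps):
    """eps-LSE over the same terms; approaches tropical_step as eps -> 0."""
    terms = np.concatenate([A + h[None, :], B + x[None, :]], axis=1) / eps
    m = terms.max(axis=1, keepdims=True)
    return eps * (m[:, 0] + np.log(np.exp(terms - m).sum(axis=1)))

rng = np.random.default_rng(4)
dh, dx, eps = 4, 3, 0.05
A = rng.normal(size=(dh, dh))
B = rng.normal(size=(dh, dx))
h0 = rng.normal(size=dh)
x0 = rng.normal(size=dx)

trop = tropical_step(A, B, h0, x0)
soft = smoothed_step(A, B, h0, x0, eps)
gap = soft - trop
```

Each entry of `gap` lies between 0 and `eps * log(dh + dx)`, so the smoothed update collapses onto the max-plus update as `eps` shrinks.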

#### LSTM Hamiltonian (full expression).

Since $\sigma(x)=\nabla_{x}\mathrm{LSE}_{\varepsilon}(x,0)$ and $\tanh(x)=\nabla_{x}\mathrm{LSE}_{\varepsilon}(x,-x)$, each LSTM gate is the gradient of a two-neuron LSE layer. Let the joint state be $z=(c,h)\in\mathbb{R}^{2d}$ with conjugate momentum $p=(p_{c},p_{h})$. The continuous-time cell and hidden equations

$$\begin{aligned}\dot{c}(t)&=\sigma(W_{f}h+b_{f})\odot c+\sigma(W_{i}h+b_{i})\odot\tanh(W_{c}h+b_{c}),\\ \dot{h}(t)&=\sigma(W_{o}h+b_{o})\odot\tanh(c),\end{aligned}$$

are the characteristic equations $\dot{z}=\nabla_{p}H_{\mathrm{LSTM}}$ of the gated Hamiltonian

$$H_{\mathrm{LSTM}}(z,p)=p_{c}^{\top}\bigl[f(h)\odot c+i(h)\odot\tilde{c}(h)\bigr]+p_{h}^{\top}\bigl[o(h)\odot\tanh(c)\bigr],\tag{31}$$

where $f(h)=\sigma(W_{f}h+b_{f})$, $i(h)=\sigma(W_{i}h+b_{i})$, $o(h)=\sigma(W_{o}h+b_{o})$, $\tilde{c}(h)=\tanh(W_{c}h+b_{c})$ are the forget, input, output, and cell gates. Each gate is a gradient of a two-neuron LSE layer, making ([31](https://arxiv.org/html/2605.28983#A5.E31)) a composition of LSE-layer gradients. In the tropical limit $\varepsilon\to 0$, $\sigma\to\mathrm{Heaviside}$ and $\tanh\to\mathrm{sign}$, so the gates become $\lbrace 0,1\rbrace^{d}$ selectors and ([31](https://arxiv.org/html/2605.28983#A5.E31)) becomes a max-plus switching automaton on the cell state.
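The gate identities can be checked by differentiating a two-term LSE numerically. The sketch assumes the convention $\mathrm{LSE}_{\varepsilon}(a,b)=\varepsilon\log(e^{a/\varepsilon}+e^{b/\varepsilon})$. Under it the identities hold verbatim at $\varepsilon=1$, while for general $\varepsilon$ the gradients are $\sigma(x/\varepsilon)$ and $\tanh(x/\varepsilon)$, which is what produces the Heaviside and sign limits.

```python
import numpy as np

def lse2(a, b, eps):
    """Two-term LSE: eps * log(exp(a/eps) + exp(b/eps)) (assumed convention)."""
    return eps * np.logaddexp(a / eps, b / eps)

def num_grad(fn, x, delta=1e-5):
    """Central finite difference, applied elementwise."""
    return (fn(x + delta) - fn(x - delta)) / (2 * delta)

x = np.linspace(-3.0, 3.0, 13)

# eps = 1: gradients of LSE(x, 0) and LSE(x, -x) are sigmoid and tanh.
sigmoid_from_lse = num_grad(lambda s: lse2(s, 0.0, 1.0), x)
tanh_from_lse = num_grad(lambda s: lse2(s, -s, 1.0), x)
sigmoid = 1.0 / (1.0 + np.exp(-x))

# Small eps: the sigmoid gate approaches a Heaviside step.
pts = np.array([-1.0, 1.0])
gate_small_eps = num_grad(lambda s: lse2(s, 0.0, 0.01), pts)
```

With `eps = 0.01` the gate evaluates to approximately 0 at $x=-1$ and 1 at $x=1$, illustrating the selector behaviour in the tropical limit.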

### Feynman–Kac Representation and the Gaussian Fixed Point

The Feynman–Kac formula connects PDE solutions to probabilistic expectations: the viscous HJ solution $u_{\varepsilon}(x,t)=-\varepsilon\log\mathbb{E}[e^{-g(X_{t})/\varepsilon}\mid X_{0}=x]$ replaces solving a PDE with sampling a Brownian motion $X_{t}$ started at $x$ [Kac, [1949](https://arxiv.org/html/2605.28983#bib.bib6)]. In ML terms, the log-partition function a feedforward network computes is a finite-sample Monte Carlo approximation of this expectation: each neuron $j$ contributes one iid sample $y_j$ from the terminal distribution, weighted by the Boltzmann factor $e^{-g(y_j)/\varepsilon}$. ResNets take the complementary approach: instead of sampling the endpoint, they simulate the Brownian path itself layer by layer via Euler–Maruyama, so depth $L$ plays the role of time steps and the drift $F(x,W)$ plays the role of the score. The same Brownian motion underlies score-based diffusion models: the forward noising process is precisely $X_{t}$, and the denoising score $\nabla_{x}\log p_{t}(x)$ is the spatial gradient of this Feynman–Kac solution.

###### Proposition E.4 (Feynman–Kac representation and matched-scale condition [Kac, [1949](https://arxiv.org/html/2605.28983#bib.bib6)]).

Let $u_{\varepsilon}(x,t)$ be the viscosity solution of $u_{t}+|Du|^{2}=\varepsilon\Delta u$, $u(\cdot,0)=g$.

1. (i) *(Feynman–Kac)* $$u_{\varepsilon}(x,t)=-\varepsilon\log\mathbb{E}_{Y\sim\mathcal{N}(x,2\varepsilon tI)}\bigl[e^{-g(Y)/\varepsilon}\bigr].\tag{32}$$ In the $N\to\infty$ limit, the discrete Hopf–Cole sum (Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)) converges to ([32](https://arxiv.org/html/2605.28983#A5.E32)) with the support-point distribution $\mu$ replacing the Gaussian.
2. (ii) *(Matched-scale condition; $q^{\ast}=2\varepsilon t$)* The Hopf–Cole kernel $K_{\varepsilon}(x,y)=e^{-|x-y|^{2}/(4\varepsilon t)}$ interpreted as a Gaussian likelihood $p(x\mid y)\propto e^{-|x-y|^{2}/(2\sigma_{L}^{2})}$ has scale $\sigma_{L}^{2}=2\varepsilon t$. For the Gaussian prior $\mu_{\ast}=\mathcal{N}(0,q^{\ast}I)$, the Gibbs posterior $\pi(\cdot\mid x)\propto K_{\varepsilon}(x,\cdot)\mu_{\ast}$ has posterior mean $\bar{y}(x)=\tfrac{q^{\ast}}{q^{\ast}+2\varepsilon t}x$ and posterior variance $\sigma_{\pi}^{2}=\tfrac{2\varepsilon tq^{\ast}}{q^{\ast}+2\varepsilon t}$. The prior scale matches the likelihood scale, i.e. $q^{\ast}=\sigma_{L}^{2}$, *if and only if* $q^{\ast}=2\varepsilon t$. At this value: $\bar{y}(x)=x/2$ and $\sigma_{\pi}^{2}=\varepsilon t=q^{\ast}/2$, independently of $x$ or $g$.

###### Proof\.

The Hopf–Cole substitution $v=e^{-u/\varepsilon}$ linearizes the HJ equation: differentiating gives $v_{t}=-\varepsilon^{-1}vu_{t}$ and $\Delta v=-\varepsilon^{-1}v\Delta u+\varepsilon^{-2}v|Du|^{2}$, so $v_{t}=\varepsilon\Delta v$ if and only if $u_{t}+|Du|^{2}=\varepsilon\Delta u$. The heat equation $v_{t}=\varepsilon\Delta v$ with initial datum $v_{0}=e^{-g/\varepsilon}$ has the unique solution $v(x,t)=\int G(x-y,t)e^{-g(y)/\varepsilon}\,dy$, where $G(z,t)=(4\pi\varepsilon t)^{-d/2}e^{-|z|^{2}/(4\varepsilon t)}$ is the density of $\mathcal{N}(0,2\varepsilon tI)$. Hence $v(x,t)=\mathbb{E}_{Y\sim\mathcal{N}(x,2\varepsilon tI)}[e^{-g(Y)/\varepsilon}]$, and inverting via $u=-\varepsilon\log v$ gives ([32](https://arxiv.org/html/2605.28983#A5.E32)).

The matched-scale condition follows from standard Gaussian conjugacy. With prior $\mathcal{N}(0,q^{\ast}I)$ and likelihood precision $1/(2\varepsilon t)$, the posterior mean and variance are $\bar{y}(x)=\tfrac{q^{\ast}}{q^{\ast}+2\varepsilon t}x$ and $\sigma_{\pi}^{2}=\tfrac{q^{\ast}\cdot 2\varepsilon t}{q^{\ast}+2\varepsilon t}$. The prior scale matches the likelihood scale, $q^{\ast}=2\varepsilon t$, if and only if $q^{\ast}/(q^{\ast}+2\varepsilon t)=1/2$, which gives $\bar{y}(x)=x/2$ and $\sigma_{\pi}^{2}=\varepsilon t=q^{\ast}/2$, both independent of $x$ and $g$. ∎
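Representation (32) can be tested on a case with a closed form. For the one-dimensional quadratic $g(y)=cy^{2}$, a standard Gaussian integral (worked out here, not quoted from the paper) gives $u_{\varepsilon}(x,t)=\tfrac{cx^{2}}{1+4ct}+\tfrac{\varepsilon}{2}\log(1+4ct)$. The sketch evaluates the expectation by Gauss–Hermite quadrature rather than Monte Carlo so that the comparison is deterministic; the parameter values are arbitrary.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def hopf_cole_u(g, x, t, eps, deg=80):
    """-eps * log E[exp(-g(Y)/eps)] for Y ~ N(x, 2*eps*t), in one dimension."""
    nodes, weights = hermgauss(deg)            # rule for integrals against exp(-s^2)
    y = x + np.sqrt(4.0 * eps * t) * nodes     # Y = x + sqrt(2) * sqrt(2*eps*t) * s
    expectation = np.sum(weights * np.exp(-g(y) / eps)) / np.sqrt(np.pi)
    return -eps * np.log(expectation)

c, x, t, eps = 0.5, 0.7, 0.3, 0.2              # illustrative values (assumption)
u_quad = hopf_cole_u(lambda y: c * y**2, x, t, eps)
u_exact = c * x**2 / (1 + 4 * c * t) + 0.5 * eps * np.log(1 + 4 * c * t)
```

The quadrature value `u_quad` agrees with `u_exact` to high precision. Substituting the closed form into $u_{t}+|Du|^{2}=\varepsilon\Delta u$ confirms that it solves the viscous equation.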

The parameter $q^{\ast}=2\varepsilon t$ is thus simultaneously: (a) the variance of the Brownian motion whose expectation computes the PDE solution via Feynman–Kac; (b) the unique Gaussian prior scale that is matched to the Hopf–Cole likelihood; and (c) the diffusion variance accumulated over one time-unit of the heat semigroup. In NNGP terminology, $q^{\ast}=2\varepsilon t$ is the natural initialization variance at which the prior and the kernel's smoothing bandwidth coincide.

### Double Descent as Near\-Shock Formation: Formal Analysis

###### Proposition E.5 (Near-shock = double descent).

Let $f_{\varepsilon}^{N}=\mathrm{LSE}_{\varepsilon}(Wx+b)$.

1. (i) *(Hessian = Gibbs variance)* $\nabla_{x}^{2}f_{\varepsilon}^{N}=\varepsilon^{-1}W^{\top}\bigl(\mathrm{diag}(\pi)-\pi\pi^{\top}\bigr)W=\varepsilon^{-1}\mathrm{Var}_{\pi}[W_j\cdot]$.
2. (ii) *(Curvature is maximized at two-atom near-shocks)* For any $x$, the spectral norm satisfies $\lVert\nabla_{x}^{2}f_{\varepsilon}^{N}(x)\rVert_{2}\leq(4\varepsilon)^{-1}\max_{j\neq j'}\lVert W_j-W_{j'}\rVert^{2}$ by Popoviciu's inequality applied to $\mathrm{Var}_{\pi}[W_j\cdot v]$ for any unit $v$; equality holds when $\pi_{j^{\ast}}=\pi_{j'}=\tfrac{1}{2}$ for the two rows $(j^{\ast},j')$ achieving the extremes (near-shock).
3. (iii) *(Interpolation threshold $N=n$ forces near-shocks)* At $N=n$ with support point $y_j=x_j$ (one neuron per training point), the network must assign $\pi_j(x_i)\approx 1$ for $j=i$. For every adjacent pair $(i,i+1)$, define the *weighted Voronoi boundary* $\mathcal{B}_{i,i+1}=\lbrace x:F_{x}(y_i)=F_{x}(y_{i+1})\rbrace$ where $F_{x}(y)=g(y)+|x-y|^{2}/(4t)$. Expanding gives the equation $(y_i-y_{i+1})\cdot x/(2t)=g(y_i)-g(y_{i+1})+(|y_i|^{2}-|y_{i+1}|^{2})/(4t)$, which is linear in $x$, so $\mathcal{B}_{i,i+1}$ is a hyperplane for *any* target values $g(y_i),g(y_{i+1})$. At any $x\in\mathcal{B}_{i,i+1}$: $z_i(x)=z_{i+1}(x)$, hence $\pi_i=\pi_{i+1}=\tfrac{1}{2}$ (modulo exponentially small contributions from other neurons). The near-shock set $\bigcup_{i}\mathcal{B}_{i,i+1}$ consists of $O(n)$ hyperplanes, forming an arrangement that partitions $\mathcal{X}$ into $O(n^{d})$ polyhedral cells; as $n\to\infty$ with training points dense in $\mathcal{X}$, the shock boundaries become dense in $\mathcal{X}$ as well, regardless of the target function values (the difference $g(y_i)-g(y_{i+1})$ shifts each hyperplane, but neither its sign nor its magnitude can keep the boundaries out of any region of $\mathcal{X}$). The near-shock boundary reduces to the geometric midpoint only when $g(y_i)=g(y_{i+1})$.
4. (iv) *(Overparameterization smooths the shocks)* For $N=kn$, $k>1$, with $k$ neurons per Voronoi cell, assume neurons within each cell are placed on a uniform $k^{1/d}$-sub-grid of the cell (spacing $\delta/k^{1/d}$ where $\delta$ is the cell diameter). Then the finest inter-neuron boundaries have $\lVert W_j-W_{j'}\rVert\leq k^{-1/d}\lVert W_i-W_{i+1}\rVert$ (weights are denser within each cell). Consequently $\max_{x}\lVert\nabla_{x}^{2}f_{\varepsilon}^{N}(x)\rVert_{2}\leq(4\varepsilon)^{-1}k^{-2/d}\lVert W_i-W_{i+1}\rVert^{2}$, which is $O(k^{-2/d})$ and decreases to zero as $N/n\to\infty$.

###### Proof\.

The Hessian formula is established in the proof of Corollary [8.2](https://arxiv.org/html/2605.28983#S8.Thmtheorem2).

The spectral norm is maximized when $\pi$ concentrates on two atoms $j^{\ast}$ and $j'$ with masses $p$ and $1-p$: the variance then equals $p(1-p)(W_{j^{\ast}}-W_{j'})(W_{j^{\ast}}-W_{j'})^{\top}$, with spectral norm $p(1-p)\lVert W_{j^{\ast}}-W_{j'}\rVert^{2}\leq\tfrac{1}{4}\lVert W_{j^{\ast}}-W_{j'}\rVert^{2}$, with equality at $p=1/2$. For a general $\pi$, Popoviciu's inequality gives $\mathrm{Var}_{\pi}[W_j\cdot v]\leq(\max_{j}W_j\cdot v-\min_{j}W_j\cdot v)^{2}/4\leq\max_{j\neq j'}\lVert W_j-W_{j'}\rVert^{2}/4$ for any unit $v$, so the spectral norm is bounded by $(4\varepsilon)^{-1}\max_{j\neq j'}\lVert W_j-W_{j'}\rVert^{2}$.

For any adjacent pair $(i,i+1)$, the boundary $\mathcal{B}_{i,i+1}$ satisfies $F_{x}(y_i)=F_{x}(y_{i+1})$, i.e., $g(y_i)+|x-y_i|^{2}/(4t)=g(y_{i+1})+|x-y_{i+1}|^{2}/(4t)$. Rearranging: $(y_i-y_{i+1})\cdot x/(2t)=g(y_i)-g(y_{i+1})+(|y_i|^{2}-|y_{i+1}|^{2})/(4t)$, which is linear in $x$, confirming $\mathcal{B}_{i,i+1}$ is a hyperplane for any target values. At any $x\in\mathcal{B}_{i,i+1}$: $z_i(x)=z_{i+1}(x)$, so $\pi_i(x)=\pi_{i+1}(x)=\tfrac{1}{2}$ (equal pre-activations). For neurons $j\notin\lbrace i,i+1\rbrace$ with $y_j$ bounded away from $\mathcal{B}_{i,i+1}$, the cost gap $\Delta=\min_{j\notin\lbrace i,i+1\rbrace}(F_{x}(y_j)-F_{x}(y_i))>0$, giving $\pi_j\to 0$ exponentially in $\Delta/\varepsilon$, so $\pi_i=\pi_{i+1}\approx 1/2$. When $g(y_i)=g(y_{i+1})$, the hyperplane passes through the geometric midpoint; otherwise it is shifted by the target-value difference.

With $k$ neurons per Voronoi cell of diameter $\delta=|x_{i}-x_{i+1}|$, the intra-cell neuron spacing is $\delta/k^{1/d}$, giving $\|W_{j}-W_{j'}\|=|y_{j}-y_{j'}|/(2t)\leq\delta/(2t\,k^{1/d})$. The spectral norm bound then gives $\|\nabla^{2}_{x}f\|_{2}\leq\delta^{2}/(16\varepsilon t^{2}k^{2/d})$. ∎
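The two-atom bound can be checked numerically. The sketch below is illustrative only: it assumes the standard identity that the Hessian in $x$ of $\varepsilon\log\sum_j\exp((W_j\cdot x+b_j)/\varepsilon)$ equals $\mathrm{Cov}_\pi(W)/\varepsilon$, and the sizes, seed, and helper names (`softmax`, `lse_hessian`) are arbitrary choices, not taken from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def lse_hessian(W, b, x, eps):
    """Hessian in x of eps*log(sum_j exp((W_j.x + b_j)/eps)), i.e. Cov_pi(W)/eps."""
    pi = softmax((W @ x + b) / eps)
    centered = W - pi @ W                       # subtract the pi-weighted mean row
    return (centered * pi[:, None]).T @ centered / eps

rng = np.random.default_rng(0)
N, d, eps = 12, 3, 0.2                          # illustrative sizes (assumption)
W = rng.normal(size=(N, d))
b = rng.normal(size=N)

# Bound: max_{j != j'} ||W_j - W_j'||^2 / (4 eps)
pairwise = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=-1)
bound = pairwise.max() ** 2 / (4 * eps)

# Largest spectral norm observed over random query points
worst = max(
    np.linalg.norm(lse_hessian(W, b, 3 * rng.normal(size=d), eps), 2)
    for _ in range(200)
)
print(f"largest measured norm {worst:.4f}, bound {bound:.4f}")
```

Random sampling only probes the bound; it does not show tightness, which requires the mass to concentrate on the two farthest weight rows.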

Proposition [E.5](https://arxiv.org/html/2605.28983#A5.Thmtheorem5) establishes that the curvature (Hessian spectral norm) peaks at the interpolation threshold $N=n$ and decreases as $(N/n)^{-2/d}$ above it. The peak is the near-shock: at $N=n$ every inter-data-point boundary is a shock locus. In the inviscid ($\varepsilon\to 0$) limit these become genuine gradient discontinuities of the Hopf–Lax solution, and the prediction degrades at out-of-distribution test points between training data. Overparameterization ($N\gg n$) distributes the shocks more finely and at lower intensity, smoothing the solution and reducing the test error; this is the PDE interpretation of why overparameterized networks generalize.

###### Proposition E.6 (Backpropagation as adjoint heat semigroup, feedforward case).

Let $f_{\varepsilon}^{N}(x)=\mathrm{LSE}_{\varepsilon}(Wx+b)$ with weights $W_{j}=y_{j}/(2t)$ and biases $b_{j}=-g(y_{j})-|y_{j}|^{2}/(4t)$ as in Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1), and let $\mathcal{L}=\ell(f_{\varepsilon}^{N}(x))$ be a scalar loss. Then

$$\frac{\partial\mathcal{L}}{\partial g(y_{j})}=-\pi_{j}(x)\cdot\ell'(f_{\varepsilon}^{N}(x)),\quad\pi_{j}(x)=\frac{\exp\bigl((-g(y_{j})-|x-y_{j}|^{2}/(4t))/\varepsilon\bigr)}{\displaystyle\sum_{k}\exp\bigl((-g(y_{k})-|x-y_{k}|^{2}/(4t))/\varepsilon\bigr)}.\tag{33}$$

Under the Hopf–Cole correspondence, $\pi_{j}(x)\propto K_{t}(x-y_{j})\,e^{-g(y_{j})/\varepsilon}$, where $K_{t}(z)=\exp(-|z|^{2}/(4\varepsilon t))$ is the heat kernel. Thus the backward pass distributes the loss signal through the discrete heat kernel: it is the evaluation of the adjoint semigroup $S_{t}^{*}$ applied to the loss gradient at the support points $\{y_{j}\}$.

Proof. By the chain rule, $\partial\mathcal{L}/\partial b_{j}=\ell'(f_{\varepsilon}^{N})\cdot\partial f_{\varepsilon}^{N}/\partial b_{j}=\ell'(f_{\varepsilon}^{N})\cdot\pi_{j}(x)$. Since $b_{j}=-g(y_{j})-|y_{j}|^{2}/(4t)$, it follows that $\partial g(y_{j})/\partial b_{j}=-1$, giving the sign in ([33](https://arxiv.org/html/2605.28983#A5.E33)). Substituting the weight identities from Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1):

$$W_{j}\cdot x+b_{j}=\frac{y_{j}\cdot x}{2t}-g(y_{j})-\frac{|y_{j}|^{2}}{4t}=\frac{|x|^{2}}{4t}-g(y_{j})-\frac{|x-y_{j}|^{2}}{4t},$$

so $\exp((W_{j}\cdot x+b_{j})/\varepsilon)=e^{|x|^{2}/(4\varepsilon t)}\cdot K_{t}(x-y_{j})\cdot e^{-g(y_{j})/\varepsilon}$. The factor $e^{|x|^{2}/(4\varepsilon t)}$ cancels in the softmax ratio, leaving $\pi_{j}(x)\propto K_{t}(x-y_{j})\,e^{-g(y_{j})/\varepsilon}$. This is the evaluation of the adjoint heat kernel at $y_{j}$ given query $x$: the continuous version is $(S_{t}^{*}\phi)(y)=\int K_{t}(x-y)\,\phi(x)\,dx$, and the discrete sum $\sum_{j}\pi_{j}\phi_{j}$ is its quadrature approximation at the support points. $\square$
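The identity $\partial f_{\varepsilon}^{N}/\partial g(y_{j})=-\pi_{j}(x)$, with $\pi_j$ given by the normalized heat-kernel weights, can be sanity-checked by finite differences. This is a minimal sketch under stated assumptions: the loss is taken to be the identity ($\ell'=1$), and the sizes, seed, and helper names (`lse`, `f`) are hypothetical choices for illustration.

```python
import numpy as np

def lse(z, eps):
    """Stable eps * log(sum(exp(z / eps)))."""
    m = z.max()
    return m + eps * np.log(np.exp((z - m) / eps).sum())

rng = np.random.default_rng(1)
N, d, t, eps = 8, 2, 0.5, 0.3                  # illustrative values (assumption)
y = rng.normal(size=(N, d))                    # support points y_j
g = rng.normal(size=N)                         # initial data g(y_j)
x = rng.normal(size=d)                         # query point

def f(g_vals):
    """LSE layer with the HJ parameterization W_j = y_j/(2t), b_j = -g_j - |y_j|^2/(4t)."""
    W = y / (2 * t)
    b = -g_vals - (y ** 2).sum(axis=1) / (4 * t)
    return lse(W @ x + b, eps)

# Heat-kernel form of the attribution weights: pi_j ~ K_t(x - y_j) * exp(-g_j / eps)
K = np.exp(-((x - y) ** 2).sum(axis=1) / (4 * eps * t))
pi = K * np.exp(-g / eps)
pi /= pi.sum()

# Central finite differences of f with respect to each g_j
h = 1e-6
fd = np.array([(f(g + h * e) - f(g - h * e)) / (2 * h) for e in np.eye(N)])
print(np.max(np.abs(fd + pi)))                 # should be at finite-difference noise level
```

For a general differentiable loss $\ell$, the same check applies after multiplying both sides by $\ell'(f_{\varepsilon}^{N}(x))$.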

### Variable\-Hamiltonian Extension

The results above use $H(p)=|p|^{2}$, fixing the transport cost to the Euclidean quadratic $|x-y|^{2}/(4t)$. The following extends the core identity and its quantitative consequences to any quadratic Hamiltonian $H_{\theta}(p)=p^{\top}A_{\theta}\,p$ with $A_{\theta}\succ 0$, whose Legendre transform is $L_{\theta}(v)=v^{\top}A_{\theta}^{-1}v/4$, yielding transport cost $(x-y)^{\top}A_{\theta}^{-1}(x-y)/(4t)$.

###### Proof of Theorem [4.4](https://arxiv.org/html/2605.28983#S4.Thmtheorem4).

The pre-activation of neuron $j$ at input $x$ is $W_{j}\cdot x+b_{j}=y_{j}^{\top}A_{\theta}^{-1}x/(2t)-g(y_{j})-y_{j}^{\top}A_{\theta}^{-1}y_{j}/(4t)$. Since $A_{\theta}^{-1}$ is symmetric positive-definite, the same completing-the-square as in Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1) holds in the $A_{\theta}^{-1}$-inner product: expanding $(x-y_{j})^{\top}A_{\theta}^{-1}(x-y_{j})$ and rearranging gives

$$\frac{y_{j}^{\top}A_{\theta}^{-1}x}{2t}-\frac{y_{j}^{\top}A_{\theta}^{-1}y_{j}}{4t}=\frac{x^{\top}A_{\theta}^{-1}x}{4t}-\frac{(x-y_{j})^{\top}A_{\theta}^{-1}(x-y_{j})}{4t}.\tag{34}$$

Hence $W_{j}\cdot x+b_{j}=x^{\top}A_{\theta}^{-1}x/(4t)-\bigl(g(y_{j})+(x-y_{j})^{\top}A_{\theta}^{-1}(x-y_{j})/(4t)\bigr)$. The $j$-independent term factors out:

$$\sum_{j}e^{(W_{j}\cdot x+b_{j})/\varepsilon}=e^{x^{\top}A_{\theta}^{-1}x/(4\varepsilon t)}\sum_{j}\exp\!\left(\frac{-g(y_{j})-(x-y_{j})^{\top}A_{\theta}^{-1}(x-y_{j})/(4t)}{\varepsilon}\right).$$

Taking $\varepsilon\log$ of both sides and identifying $\varepsilon\log$ of the inner sum with $-u_{\varepsilon}^{A_{\theta},N}$ gives ([8](https://arxiv.org/html/2605.28983#S4.E8)). For the PDE claim: under $v=e^{-u/\varepsilon}$ the equation $u_{t}+(\nabla u)^{\top}A_{\theta}(\nabla u)=\varepsilon\,\nabla\!\cdot\!(A_{\theta}\nabla u)$ transforms to $v_{t}=\varepsilon\,\nabla\!\cdot\!(A_{\theta}\nabla v)$ (the anisotropic heat equation), whose fundamental solution is $(4\pi\varepsilon t)^{-d/2}(\det A_{\theta})^{-1/2}\exp\bigl(-(x-y)^{\top}A_{\theta}^{-1}(x-y)/(4\varepsilon t)\bigr)$; the discrete sum above is its quadrature under $\mu_{N}$. ∎
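The anisotropic completing-the-square step lends itself to a direct numerical check. The sketch below assumes weights $W_j=A_\theta^{-1}y_j/(2t)$ and biases $b_j=-g(y_j)-y_j^{\top}A_\theta^{-1}y_j/(4t)$, as read off from the pre-activation in the proof; the random positive-definite matrix, sizes, and the helper `lse` are illustrative assumptions.

```python
import numpy as np

def lse(z, eps):
    """Stable eps * log(sum(exp(z / eps)))."""
    m = z.max()
    return m + eps * np.log(np.exp((z - m) / eps).sum())

rng = np.random.default_rng(2)
N, d, t, eps = 10, 3, 0.7, 0.25                # illustrative values (assumption)
M = rng.normal(size=(d, d))
A = M @ M.T + d * np.eye(d)                    # a symmetric positive-definite A_theta
A_inv = np.linalg.inv(A)

y = rng.normal(size=(N, d))
g = rng.normal(size=N)
x = rng.normal(size=d)

# Network side: rows of W are A^{-1} y_j / (2t) (A_inv is symmetric)
W = y @ A_inv / (2 * t)
b = -g - np.einsum("jd,de,je->j", y, A_inv, y) / (4 * t)
f = lse(W @ x + b, eps)

# PDE side: u = -eps * log sum_j exp(-(g_j + (x-y_j)^T A^{-1} (x-y_j)/(4t)) / eps)
diff = x - y
cost = g + np.einsum("jd,de,je->j", diff, A_inv, diff) / (4 * t)
u = -lse(-cost, eps)

residual = abs(f + u - x @ A_inv @ x / (4 * t))
print(residual)                                # expected at floating-point roundoff
```

Setting `A = np.eye(d)` reduces the check to the Euclidean identity of Theorem 4.1.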

###### Corollary E.7 (Variable-$H$ robustness bound).

Under the parameterization of Theorem [4.4](https://arxiv.org/html/2605.28983#S4.Thmtheorem4),

$$\|\nabla_{x}^{2}f_{\varepsilon}^{A_{\theta},N}(x)\|_{2}\;\leq\;\frac{\|W\|_{2,\infty}^{2}}{\varepsilon}\;\leq\;\frac{\|W_{0}\|_{2,\infty}^{2}}{\varepsilon\,\lambda_{\min}(A_{\theta})^{2}}\quad\text{for all }x,\tag{35}$$

where $W_{0,j}=y_{j}/(2t)$ are the base (Euclidean) weights and $W_{j}=A_{\theta}^{-1}W_{0,j}$.

###### Proof\.

The first inequality follows immediately from Corollary [8.2](https://arxiv.org/html/2605.28983#S8.Thmtheorem2) applied to the network with weights $W_{j}=A_{\theta}^{-1}y_{j}/(2t)$: the Hessian bound $\|W\|_{2,\infty}^{2}/\varepsilon$ holds for any LSE network, and here the weights happen to be $A_{\theta}^{-1}W_{0,j}$. The second inequality uses $\|A_{\theta}^{-1}W_{0,j}\|_{2}\leq\|A_{\theta}^{-1}\|_{2}\,\|W_{0,j}\|_{2}=\|W_{0,j}\|_{2}/\lambda_{\min}(A_{\theta})$, so $\|W\|_{2,\infty}\leq\|W_{0}\|_{2,\infty}/\lambda_{\min}(A_{\theta})$. ∎

The bound decreases as $\lambda_{\min}(A_{\theta})$ increases: a stiffer metric ($A_{\theta}$ with large eigenvalues) shrinks the effective weight norms and smooths the Hopf–Cole solution. The $A_{\theta}=I$ special case recovers Corollary [8.2](https://arxiv.org/html/2605.28983#S8.Thmtheorem2).
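The norm inequality behind the second bound can be illustrated numerically. This is a small sketch, assuming $\|W\|_{2,\infty}$ denotes the largest row 2-norm of $W$; the random matrix, sizes, and seed are illustrative and not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, t = 20, 4, 0.5                           # illustrative values (assumption)
y = rng.normal(size=(N, d))
W0 = y / (2 * t)                               # base (Euclidean) weights W_{0,j}

M = rng.normal(size=(d, d))
A = M @ M.T + 0.5 * np.eye(d)                  # a symmetric positive-definite A_theta
lam_min = np.linalg.eigvalsh(A).min()

W = W0 @ np.linalg.inv(A)                      # rows A^{-1} W_{0,j} (A^{-1} is symmetric)

norm_W = np.linalg.norm(W, axis=1).max()       # ||W||_{2,inf}: largest row norm
norm_W0 = np.linalg.norm(W0, axis=1).max()     # ||W_0||_{2,inf}
print(norm_W, norm_W0 / lam_min)
```

Scaling `A` by a constant larger than one tightens both sides, matching the remark that a stiffer metric shrinks the effective weight norms.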

## Appendix F Attribution, Influence Functions, and Bifurcation

### Propositions

###### Proposition F.1 (Closed-form attribution and label sensitivity).

Let $\hat{f}_{\varepsilon}^{N}(x)=\sum_{j}\pi_{j}(x;\varepsilon)\,g_{j}$ denote the Gibbs-weighted label average (the Hopf–Cole soft prediction), with attribution weights $\pi_{j}(x;\varepsilon)=\exp\bigl(-(|x-y_{j}|^{2}/(4t)+g_{j})/\varepsilon\bigr)/Z$. This is distinct from the LSE layer output $f_{\varepsilon}^{N}(x)=\varepsilon\log\sum_{j}\exp((W_{j}\cdot x+b_{j})/\varepsilon)$; the two coincide when $g_{j}=W_{j}\cdot x+b_{j}$.

1. (i) Attribution decomposition. $\hat{f}_{\varepsilon}^{N}(x)$ is an exact convex combination of training labels: $\hat{f}_{\varepsilon}^{N}=\sum_{j}\pi_{j}g_{j}$ with $\pi_{j}\geq 0$ and $\sum_{j}\pi_{j}=1$. The weight $\pi_{j}$ is the fractional contribution of training example $j$ to the prediction at $x$.
2. (ii) Label sensitivity. The partial derivative of the prediction with respect to training label $g_{j}$ is

   $$\frac{\partial\hat{f}_{\varepsilon}^{N}}{\partial g_{j}}\;=\;\pi_{j}(x;\varepsilon)\,\Bigl(1+\tfrac{\hat{f}_{\varepsilon}^{N}(x)-g_{j}}{\varepsilon}\Bigr).\tag{36}$$

   This is $O(N)$ to evaluate, requires no second-order information, and reduces to $\pi_{j}$ as $\varepsilon\to\infty$ or when $\hat{f}_{\varepsilon}^{N}(x)\approx g_{j}$.
3. (iii) Gradient formulas. The spatial gradients of the prediction and the attribution entropy $H=-\sum_{j}\pi_{j}\log\pi_{j}$ admit the closed forms

   $$\nabla_{x}\hat{f}_{\varepsilon}^{N}=\frac{1}{2t\varepsilon}\,\mathrm{Cov}_{\pi}(g,y),\tag{37}$$

   $$\nabla_{x}H=\frac{1}{2t\varepsilon}\,\mathrm{Cov}_{\pi}(y,-\log\pi),\tag{38}$$

   where $\mathrm{Cov}_{\pi}(a,b)=\mathbb{E}_{\pi}[ab]-\mathbb{E}_{\pi}[a]\,\mathbb{E}_{\pi}[b]$ denotes covariance under the Gibbs measure $\pi$.

###### Proof\.

That $\hat{f}$ is a convex combination of the labels is immediate from $\hat{f}=\sum_{j}\pi_{j}g_{j}$ and $\sum_{j}\pi_{j}=1$. Differentiating $\hat{f}=\sum_{k}\pi_{k}g_{k}$ with respect to $g_{j}$, using the softmax Jacobian $\partial\pi_{k}/\partial g_{j}=\pi_{k}(\pi_{j}-\delta_{kj})/\varepsilon$ (the sign arises because the logit of neuron $j$ is $-g_{j}/\varepsilon$):

$$\frac{\partial\hat{f}}{\partial g_{j}}=\sum_{k}\frac{\partial\pi_{k}}{\partial g_{j}}g_{k}+\pi_{j}=\frac{\pi_{j}}{\varepsilon}\sum_{k}\pi_{k}g_{k}-\frac{\pi_{j}g_{j}}{\varepsilon}+\pi_{j}=\pi_{j}\left(1+\frac{\hat{f}-g_{j}}{\varepsilon}\right).$$

Equation ([37](https://arxiv.org/html/2605.28983#A6.E37)) follows by differentiating $\hat{f}=\sum_{j}\pi_{j}g_{j}$ with respect to $x$ and substituting $\partial\pi_{j}/\partial x=\pi_{j}(y_{j}-\mathbb{E}_{\pi}[y])/(2t\varepsilon)$; equation ([38](https://arxiv.org/html/2605.28983#A6.E38)) follows analogously from $H=-\sum_{j}\pi_{j}\log\pi_{j}$. ∎
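The closed forms (36) and (37) can be compared against central finite differences of the Gibbs-weighted prediction. The following is an illustrative sketch only; the function names `gibbs` and `predict`, the sizes, and the seed are assumptions introduced here.

```python
import numpy as np

def gibbs(x, y, g, t, eps):
    """Attribution weights pi_j(x; eps) ~ exp(-(|x - y_j|^2/(4t) + g_j)/eps)."""
    logits = -(((x - y) ** 2).sum(axis=1) / (4 * t) + g) / eps
    w = np.exp(logits - logits.max())
    return w / w.sum()

def predict(x, y, g, t, eps):
    """Gibbs-weighted label average f_hat(x) = sum_j pi_j g_j."""
    return gibbs(x, y, g, t, eps) @ g

rng = np.random.default_rng(4)
N, d, t, eps = 6, 2, 0.5, 0.4                  # illustrative values (assumption)
y = rng.normal(size=(N, d))
g = rng.normal(size=N)
x = rng.normal(size=d)

pi = gibbs(x, y, g, t, eps)
f_hat = pi @ g

# Closed forms: label sensitivity (36) and spatial gradient (37)
sens = pi * (1 + (f_hat - g) / eps)
grad = ((pi * g) @ y - f_hat * (pi @ y)) / (2 * t * eps)   # Cov_pi(g, y) / (2 t eps)

# Central finite differences in g and in x
h = 1e-6
sens_fd = np.array([(predict(x, y, g + h * e, t, eps)
                     - predict(x, y, g - h * e, t, eps)) / (2 * h) for e in np.eye(N)])
grad_fd = np.array([(predict(x + h * e, y, g, t, eps)
                     - predict(x - h * e, y, g, t, eps)) / (2 * h) for e in np.eye(d)])
print(np.abs(sens - sens_fd).max(), np.abs(grad - grad_fd).max())
```

Both closed forms cost one pass over the $N$ support points, in line with the $O(N)$ claim in item (ii).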

###### Proof of Theorem [8.9](https://arxiv.org/html/2605.28983#S8.Thmtheorem9).

Fix any support point $y_{k}$. By the non-degeneracy assumption, $g(y_{k})<g(y_{j})+|y_{k}-y_{j}|^{2}/(4t)$ for all $j\neq k$, which under the HJ parameterization ($W_{j}=y_{j}/(2t)$) translates directly to the pre-activation gap $\Delta_{k}(y_{k}):=\max_{j\neq k}[(W_{j}-W_{k})\cdot y_{k}+(b_{j}-b_{k})]$ being strictly negative, so $\pi_{j}(y_{k};\varepsilon)\leq e^{\Delta_{k}(y_{k})/\varepsilon}\to 0$ exponentially for all $j\neq k$ and $\pi_{k}(y_{k};\varepsilon)\to 1$. Hence $H(y_{k};\varepsilon)\to 0$ exponentially; the gradient $\nabla_{x}H=\mathrm{Cov}_{\pi}(y,-\log\pi)/(2t\varepsilon)$ vanishes at $y_{k}$ because $\pi_{j}(y_{k};\varepsilon)\to 0$ faster than $(-\log\pi_{j})\to\infty$ for $j\neq k$, so each cross term $\pi_{j}(-\log\pi_{j})\to 0$; and moving $x$ away from $y_{k}$ within its Voronoi cell redistributes mass toward other atoms, increasing entropy. Hence $y_{k}$ is a local minimum of $H(\cdot;\varepsilon)$ for sufficiently small $\varepsilon$.

The map $(\varepsilon,x)\mapsto H(x;\varepsilon)$ is smooth, being a composition of the smooth softmax and the smooth negative-entropy function. By the implicit function theorem, any critical point $x^{*}(\varepsilon)$ satisfying $\nabla_{x}H(x^{*};\varepsilon)=0$ evolves continuously in $\varepsilon$ as long as $\nabla^{2}_{x}H(x^{*};\varepsilon)$ is nonsingular. The count changes only when $\det(\nabla^{2}_{x}H(x^{*};\varepsilon_{\mathrm{bif}}))=0$: one eigenvalue crosses zero, signalling the coalescence of a local minimum with an adjacent saddle, which is the standard fold-bifurcation condition for one-parameter families of smooth functions. Panel (c) of Figure [2](https://arxiv.org/html/2605.28983#A6.F2) confirms that the minimum eigenvalue of $\nabla^{2}_{x}H$ at each saddle approaches zero at each such event.

Below $\varepsilon_{\mathrm{bif}}(j,k)$ the saddle $x^{*}_{jk}(\varepsilon)$ separates the attribution basins $\mathcal{B}_{j}$ and $\mathcal{B}_{k}$ as the stable manifold of the gradient flow $\dot{x}=-\nabla_{x}H$. At $\varepsilon=\varepsilon_{\mathrm{bif}}(j,k)$ this saddle coalesces with a local minimum and disappears; above $\varepsilon_{\mathrm{bif}}(j,k)$ no separating saddle exists, the basins merge, and attribution mass flows freely between $j$ and $k$. The ordered sequence of events defines a hierarchical clustering of support points by attribution proximity.

Since the Hessian at any saddle $x^{*}$ is indefinite by definition, its positive eigenvector is the crossing direction (along which $H$ increases from the saddle toward the annihilating minimum and the dominant attribution $\arg\max_{j}\pi_{j}$ switches), while the negative eigenvector spans the ridgeline along which $H$ changes least. The entropy value $H(x^{*};\varepsilon)\in(0,\log N)$ quantifies attribution uncertainty at the transition locus. ∎
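The two limiting regimes of the attribution entropy (near zero at a support point for small $\varepsilon$, near $\log N$ for large $\varepsilon$) can be illustrated directly. The sketch below is not the paper's experiment: it assumes a hypothetical $4\times 4$ unit grid of support points with $g\equiv 0$, so that the non-degeneracy condition holds trivially, and the function name `attribution_entropy` is introduced here.

```python
import numpy as np

def attribution_entropy(x, y, g, t, eps):
    """H = -sum_j pi_j log pi_j for pi_j ~ exp(-(|x - y_j|^2/(4t) + g_j)/eps)."""
    logits = -(((x - y) ** 2).sum(axis=1) / (4 * t) + g) / eps
    shifted = logits - logits.max()
    log_pi = shifted - np.log(np.exp(shifted).sum())   # stable log-softmax
    pi = np.exp(log_pi)
    return -(pi * log_pi).sum()

# Hypothetical support set: 16 points on a unit-spaced 4x4 grid, all labels zero
side = np.arange(4.0)
y = np.array([(a, b) for a in side for b in side])
g = np.zeros(len(y))
t = 0.5

H_cold = attribution_entropy(y[0], y, g, t, eps=0.01)  # particle regime at a support point
H_hot = attribution_entropy(y[0], y, g, t, eps=1e4)    # wave regime
print(H_cold, H_hot, np.log(len(y)))
```

Sweeping `eps` between these extremes and locating critical points of `attribution_entropy` in `x` would be the natural next step toward a bifurcation diagram, which this sketch does not attempt.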

### Empirical Evidence

Figure [2](https://arxiv.org/html/2605.28983#A6.F2) confirms Theorem [8.9](https://arxiv.org/html/2605.28983#S8.Thmtheorem9) for the synthetic two-cluster experiment ($N=16$, $d=2$, $\varepsilon^{*}\approx 0.25$). At small $\varepsilon$ the landscape has 9 critical points; the count decreases to 1 through discrete fold events visible in panels (b)–(d). Panel (c) shows the minimum Hessian eigenvalue at saddle points approaching zero at each bifurcation, the hallmark of a fold; panel (d) is the bifurcation diagram showing branch annihilations in $H$-value space.

![Refer to caption](https://arxiv.org/html/2605.28983v1/x2.png)

Figure 2: Bifurcation analysis of the attribution-entropy landscape $H(\pi(x;\varepsilon))$ ($N=16$, $d=2$, two-cluster synthetic data). (a) Landscape at three viscosities with critical points: $\triangledown$ minima (low entropy, near support points), $\diamond$ saddles (attribution transitions), $\triangle$ maxima (high entropy, class boundaries). (b) Critical-point count vs. $\varepsilon$ (log scale); discrete drops are fold bifurcations. (c) Minimum Hessian eigenvalue at saddle points, approaching zero at each fold event. (d) Bifurcation diagram: $H$-value of each critical point vs. $\varepsilon$; branch annihilations are the fold bifurcations of Theorem [8.9](https://arxiv.org/html/2605.28983#S8.Thmtheorem9).

![Refer to caption](https://arxiv.org/html/2605.28983v1/x3.png)

Figure 3: Full phase diagram of the LSE network as Hopf–Cole solution ($N=16$, $d=2$, two-cluster synthetic data, $\varepsilon^{*}\approx 0.25$). (a) Attribution entropy $H(\pi(x;\varepsilon))$ vs. viscosity $\varepsilon$ (log scale) for three representative query points, showing the particle (Hopf–Lax, $\varepsilon\to 0$) to wave (heat equation, $\varepsilon\to\infty$) phase transition; $\varepsilon^{*}=N^{-1/d}$ is marked. (b) Spatial entropy landscape $H(\pi(x;\varepsilon))$ at cold ($\varepsilon=0.05$), optimal ($\varepsilon^{*}=0.25$), and hot ($\varepsilon=4.0$) viscosities; sharp attribution basins in the particle regime flatten to uniform entropy in the wave regime. (c) Attribution weights $\pi_{j}(x;\varepsilon)$ vs. neuron index at three viscosities, confirming concentration at low $\varepsilon$ and spreading at high $\varepsilon$. (d, left) Exact HJ prediction landscape $\hat{f}_{\varepsilon}^{N}=\sum_{j}\pi_{j}g_{j}$ at $\varepsilon^{*}$, with flow arrows $\nabla_{x}u=(x-\sum_{j}\pi_{j}y_{j})/(2t)$. (d, right) Feature sensitivity $|\nabla_{x}u|$; bright regions are discriminative boundaries where attribution switches allegiance between clusters (cf. Theorem [8.9](https://arxiv.org/html/2605.28983#S8.Thmtheorem9)).

![Refer to caption](https://arxiv.org/html/2605.28983v1/x4.png)

Figure 4: Phase diagram of the LSE network as Hopf–Cole solution on real data: MNIST digits 3 vs. 7 ($N=120$, $d_{\mathrm{PCA}}=2$, $\varepsilon^{*}\approx 0.091$). Layout identical to Figure [3](https://arxiv.org/html/2605.28983#A6.F3). (a) Entropy vs. viscosity for the digit-3 centroid, digit-7 centroid, and midpoint; the phase transition occurs at $\varepsilon^{*}=N^{-1/d}\approx 0.091$. (b) Spatial entropy landscape at cold, critical, and hot viscosities in PCA$_{2}$ space. (c) MNIST PCA$_{2}$ support points (digit 3 cyan, digit 7 orange). (d, left) Exact HJ prediction landscape: the decision boundary follows the inter-cluster geometry in PCA space. (d, right) Feature sensitivity $|\nabla_{x}u|$; discriminative regions concentrate along the digit-separation boundary, identifying the PCA directions most informative for the 3-vs-7 classification.

## Appendix G Numerical Details

All experiments run on a standard CPU; no GPU is required\.

#### Shared defaults\.

All trained networks use Adam ($\beta_{1}=0.9$, $\beta_{2}=0.999$, $\hat{\varepsilon}=10^{-8}$); weights initialized as $W_{j}\sim\mathcal{N}(0,0.1^{2}/d)$, biases $b_{j}=0$; viscosity $\varepsilon=N^{-1/d}$ per Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1).
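As a small worked example of these defaults, the sketch below computes the viscosity $\varepsilon=N^{-1/d}$ for the two phase-diagram settings quoted in the figures and draws one initialization. The function names `default_viscosity` and `init_params` and the seed are assumptions for illustration; this is not the authors' code.

```python
import numpy as np

def default_viscosity(N, d):
    """Viscosity eps = N^(-1/d) used as the shared default."""
    return N ** (-1.0 / d)

def init_params(N, d, rng):
    """Weights W_j ~ N(0, 0.1^2 / d), biases b_j = 0."""
    W = rng.normal(scale=0.1 / np.sqrt(d), size=(N, d))
    b = np.zeros(N)
    return W, b

eps_synthetic = default_viscosity(16, 2)       # two-cluster synthetic setting
eps_mnist = default_viscosity(120, 2)          # MNIST 3-vs-7 in 2-D PCA space
W, b = init_params(N=128, d=50, rng=np.random.default_rng(0))
print(eps_synthetic, round(eps_mnist, 3), W.shape)
```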

#### Experiment A

(initial-data recovery): Adam warm-start (5,000 steps, $\mathrm{lr}=0.008$) followed by L-BFGS-B; $N=10$ neurons, 8 seeds (best reported), 1-D synthetic $g^{*}(y)=|y|$ on $[-2.4,2.4]$, $\varepsilon\in\{0.5,0.1,0.04\}$.

#### Experiment B

(scaling law): Adam, 3,000 steps, $\mathrm{lr}=0.006$, 5 seeds (best reported), train/test 4,000/800, synthetic $g(y)=\|y\|_{2}$ on $[-2,2]^{d}$; $d=1$: $N\in\{10,25,50,100,200,500\}$; $d=2$: $N\in\{20,50,100,200,500\}$; $d=4$: $N\in\{40,100,200,500,1000\}$.

#### Experiment C

(robustness): Adam, 4,000 steps, $\mathrm{lr}=0.01$, $N=30$, 3 seeds (best reported), 1-D synthetic $g(y)=|y|$, $\varepsilon\in\{0.08,0.12,0.20,0.35,0.60,1.0\}$.

#### Experiment D

\(attention identity\): analytical verification only; no optimizer, seed 42 for data generation\.

#### Experiment E

(MNIST Hessian): regression on digit label $\in[0,9]$ via linear output layer; MSE loss. $N=128$, $d=50$ PCA features, Adam 5,000 steps, $\mathrm{lr}=5\times 10^{-3}$, batch 256, 1 seed, train/test 60,000/10,000, 300 random test images, $\varepsilon\in\{0.1,0.3,1.0,3.0,10.0\}$. Runtime $\approx$ 15 min.

#### Experiment F

(CIFAR-10 Hessian): regression on class label $\in[0,9]$ via linear output layer; MSE loss. Same architecture as Experiment E with $d=64$ PCA features, train/test 50,000/10,000. Runtime $\approx$ 2 hr CPU ($<10$ min GPU).

#### Experiment G

(transformer block): analytical identity checks only; no training. 500 trials per $d\in\{4,8,16,32,64\}$, sequence length 8, $d_{v}=16$; seed 42. Verifies that attention + LayerNorm equals $\nabla\mathrm{LSE}$ to floating-point precision and that each LSE-FFN layer satisfies Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1) to machine precision.
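The attention part of this check can be reproduced in a few lines. The sketch below is not the authors' script: it omits LayerNorm, uses hypothetical shapes and seed, and relies on the standard fact that $\nabla_{z}\mathrm{LSE}_{\varepsilon}(z)=\mathrm{softmax}(z/\varepsilon)$, which it also confirms by finite differences for one query row.

```python
import numpy as np

def softmax(z):
    """Row-wise stable softmax along the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def lse(z, eps):
    """Stable eps * log(sum(exp(z / eps))) for a 1-D array."""
    m = z.max()
    return m + eps * np.log(np.exp((z - m) / eps).sum())

rng = np.random.default_rng(42)
d, n_q, n_k, d_v = 16, 8, 12, 16               # illustrative shapes (assumption)
Q = rng.normal(size=(n_q, d))
K = rng.normal(size=(n_k, d))
V = rng.normal(size=(n_k, d_v))

eps = np.sqrt(d)
Z = Q @ K.T                                    # scores z_j = q_i . k_j
A = softmax(Z / np.sqrt(d)) @ V                # standard scaled dot-product attention
B = softmax(Z / eps) @ V                       # grad_z LSE_eps(z) applied to V

# Finite-difference confirmation that grad LSE_eps(z) = softmax(z / eps), first query row
h = 1e-6
fd = np.array([(lse(Z[0] + h * e, eps) - lse(Z[0] - h * e, eps)) / (2 * h)
               for e in np.eye(n_k)])
print(np.abs(A - B).max(), np.abs(fd - softmax(Z[0] / eps)).max())
```

Because `A` and `B` are computed with the same floating-point operations once $\varepsilon=\sqrt{d}$ is substituted, their difference is zero rather than merely small, which is consistent with the "exact zero error" reported for the attention identity.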

#### Experiments H–I

(phase diagram, bifurcation): analytical sweeps; no training. Experiment H uses $N=16$ synthetic support points in $d=2$ with two-cluster labels ($\varepsilon^{*}=N^{-1/2}=0.25$), 60 animation frames over $\varepsilon\in[10^{-2},10]$. Experiment I sweeps $\varepsilon$ over 200 values for bifurcation-of-entropy analysis. MNIST [LeCun et al., [1998](https://arxiv.org/html/2605.28983#bib.bib42)] and CIFAR-10 [Krizhevsky, [2009](https://arxiv.org/html/2605.28983#bib.bib43)] are downloaded automatically from their canonical URLs; no manual data preparation is needed.

![Refer to caption](https://arxiv.org/html/2605.28983v1/x5.png)

Figure 5: Quadrature convergence. $\ell^{\infty}$ error vs. width $N$ for initial data $g(y)=|y|$ (Lipschitz) across $d\in\{1,2,4\}$; dashed: $O(N^{-1/d})$ slopes confirming Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1).

![Refer to caption](https://arxiv.org/html/2605.28983v1/x6.png)

Figure 6: Scaling law for Adam-trained LSE networks. Test RMSE vs. width $N$ for target $g(y)=\|y\|$ across $d\in\{1,2,4\}$; dashed: predicted $N^{-1/d}$ slopes. Fitted exponents $\hat{\alpha}\approx 1/d$ confirm Proposition [8.8](https://arxiv.org/html/2605.28983#S8.Thmtheorem8) for trained networks.

![Refer to caption](https://arxiv.org/html/2605.28983v1/x7.png)

![Refer to caption](https://arxiv.org/html/2605.28983v1/x8.png)

Figure 7: Hessian bound (Corollary [8.2](https://arxiv.org/html/2605.28983#S8.Thmtheorem2)). *Left*: closed-form network; measured norm (solid) vs. bound $\|W\|_{2,\infty}^{2}/\varepsilon$ (dashed), $\varepsilon\in[0.05,10]$. *Right*: SGD-trained network on $g(y)=|y|$; bound never violated across $\varepsilon\in[0.08,1.0]$.

Table 2: Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1) identity verified to floating-point precision for a 4-neuron, 1-D network.

Table 3: Transformer attention identity ([7](https://arxiv.org/html/2605.28983#S4.E7)) verified to exact zero error. Let $A:=\mathrm{Attn}(Q,K,V)$ (standard scaled dot-product) and $B:=(\nabla_{z}\mathrm{LSE}_{\varepsilon}(z))\big|_{z_{j}=q_{i}\cdot k_{j}}\cdot V$ (Hopf–Cole form, ([7](https://arxiv.org/html/2605.28983#S4.E7))) with $\varepsilon=\sqrt{d}$. Max absolute error over 500 random $(Q,K,V)$ trials, $n_{q}=8$, $n_{k}=12$, $d_{v}=16$.

Table 4: LSE-transformer block characterization (Proposition [H.9](https://arxiv.org/html/2605.28983#A8.Thmtheorem9)) verified numerically. *Left*: attention identity ([7](https://arxiv.org/html/2605.28983#S4.E7)) after LayerNorm, verified over 500 random trials per $d$; $n_{q}=8$, $n_{k}=8$, $d_{v}=16$. *Right*: LSE-FFN HJ identity $|f_{\varepsilon}+u_{\varepsilon}-|x|^{2}/(4t)|$ (Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)) with $x=\mathrm{LN}(z)$, over 500 trials per $\varepsilon$; $N=32$, $d=8$. All errors are floating-point roundoff. Columns 1–3: Attention + LayerNorm. Columns 4–6: LSE-FFN (Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)).

| $d$ | $\varepsilon=\sqrt{d}$ | $\max\lvert A-B\rvert$ | $\varepsilon$ | $N$ | $\max\lvert f+u-\lvert x\rvert^{2}/(4t)\rvert$ |
|---|---|---|---|---|---|
| 4 | 2.00 | 0 | 0.05 | 32 | $8.9\times10^{-16}$ |
| 8 | 2.83 | 0 | 0.10 | 32 | $8.9\times10^{-16}$ |
| 16 | 4.00 | 0 | 0.20 | 32 | $8.9\times10^{-16}$ |
| 32 | 5.66 | 0 | 0.50 | 32 | $8.9\times10^{-16}$ |
| 64 | 8.00 | 0 | 1.00 | 32 | $8.9\times10^{-16}$ |

![Refer to caption](https://arxiv.org/html/2605.28983v1/x9.png)

![Refer to caption](https://arxiv.org/html/2605.28983v1/x10.png)

Figure 8: Hessian bound (Corollary [8.2](https://arxiv.org/html/2605.28983#S8.Thmtheorem2)) on MNIST (*left*) and CIFAR-10 (*right*). Each network has $N=128$ neurons; inputs are PCA-projected to $d=50$ (MNIST) and $d=64$ (CIFAR-10). Measured spectral norm $\|\nabla^{2}_{x}f\|_{2}$ (diamonds) and theoretical bound $\|W\|_{2,\infty}^{2}/\varepsilon$ (dashed squares) on 300 random test images. The bound is never violated across all $\varepsilon\in\{0.1,0.3,1.0,3.0,10.0\}$ on both datasets. Note that PCA projection constitutes a significant dimensionality reduction from raw pixels; verification on raw or learned representations remains open.

## Appendix H Scope of the Correspondence: Universality, Closure, and Extensions

#### The $\varepsilon$ deformation dictionary.

Table [5](https://arxiv.org/html/2605.28983#A8.T5) records the four-perspective interpretation of the deformation parameter $\varepsilon$ across the neural network, algebraic, PDE, and convex optimization viewpoints (Section [6](https://arxiv.org/html/2605.28983#S6)).

Table 5: Dictionary for the deformation parameter $\varepsilon$: four perspectives on the same LSE layer.

The rows correspond to the four corners and the optimal operating point of diagram ([11](https://arxiv.org/html/2605.28983#S7.E11)). The deformation $\varepsilon$ is formally identical to $\hbar$ in Maslov's dequantization: the passage $\varepsilon\to 0$ is the neural-network analogue of the $\hbar\to 0$ classical limit [Feynman, [1982](https://arxiv.org/html/2605.28983#bib.bib7)]. The LSE network at $\varepsilon>0$ is a universal classical HJ simulator (Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1)), the positive-measure counterpart of the universal quantum simulator of Feynman [[1982](https://arxiv.org/html/2605.28983#bib.bib7)]; classical realizability holds precisely because the Gibbs measure is positive for all $\varepsilon>0$, unlike the Wigner quasi-distribution of quantum mechanics.

#### Universality\.

The exact correspondence (Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)) applies to LSE-activated networks. The following proposition shows the class is large enough to be universal.

###### Proposition H.1 (LSE universality).

Let $K\subset\mathbb{R}^{d}$ be compact and $f:K\to\mathbb{R}$ continuous. For every $\delta>0$ there exists a deep LSE network $f_{\varepsilon}$ such that $\sup_{x\in K}|f(x)-f_{\varepsilon}(x)|<\delta$.

###### Proof\.

A single LSE layer $\mathrm{LSE}_{\varepsilon}(Wx+b)$ is convex in $x$, so depth $\geq 2$ is required for non-convex targets. Any continuous $f$ on compact $K$ is uniformly approximable by a continuous piecewise-linear function (Stone–Weierstrass on $K$). Every continuous piecewise-linear function can be written in max-of-min-of-affine form, which a depth-two tropical (max-plus/min-plus) network represents exactly; this follows from the lattice form of the Stone–Weierstrass theorem [Litvinov, [2007](https://arxiv.org/html/2605.28983#bib.bib2)], since max-plus/min-plus functions separate points and form a function lattice. For fixed $\varepsilon>0$, each max is approximated by $\mathrm{LSE}_{\varepsilon}$ to within $O(\varepsilon\log N)$ (Theorem [3.1](https://arxiv.org/html/2605.28983#S3.Thmtheorem1)), and each min by $-\varepsilon\log\sum_{j}e^{-z_{j}/\varepsilon}$ to the same order. Choosing $\varepsilon$ and $N$ appropriately gives the result. ∎
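The $O(\varepsilon\log N)$ approximation of max and min used in this proof follows from the elementary sandwich $\max_j z_j\leq\mathrm{LSE}_{\varepsilon}(z)\leq\max_j z_j+\varepsilon\log N$, and its mirror image for the soft minimum. The sketch below checks both on random inputs; the helper names `lse` and `soft_min`, the sizes, and the seed are illustrative assumptions.

```python
import numpy as np

def lse(z, eps):
    """Soft maximum: stable eps * log(sum(exp(z / eps)))."""
    m = z.max()
    return m + eps * np.log(np.exp((z - m) / eps).sum())

def soft_min(z, eps):
    """Soft minimum: -eps * log(sum(exp(-z / eps)))."""
    return -lse(-z, eps)

rng = np.random.default_rng(5)
N = 64                                          # illustrative width (assumption)
z = rng.normal(size=N)

gaps = []
for eps in (1.0, 0.1, 0.01):
    over = lse(z, eps) - z.max()                # lies in [0, eps * log N]
    under = z.min() - soft_min(z, eps)          # lies in [0, eps * log N]
    gaps.append((eps, over, under, eps * np.log(N)))
    print(f"eps={eps}: max gap {over:.5f}, min gap {under:.5f}, bound {eps * np.log(N):.5f}")
```

The upper bound is attained only when all $N$ entries are equal, so the gaps printed for generic inputs sit well inside it.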

Together, Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1) and Proposition [H.1](https://arxiv.org/html/2605.28983#A8.Thmtheorem1) imply: *for any continuous function there exists a deep viscous HJ solution under discrete measures that approximates it to within $\delta$*. The HJ class is therefore as general as the class of continuous functions (with depth $\geq 2$).

#### Closure\.

###### Proposition H.2 (Closure of the HJ class).

The class of functions representable as Hopf–Cole solutions of viscous HJ equations under discrete measures is closed under:

1. (i) *Layer composition*: composing $L$ LSE layers corresponds to applying the heat semigroup for total time $T=\sum_{\ell}t_{\ell}$ (Theorem [7.1](https://arxiv.org/html/2605.28983#S7.Thmtheorem1)); the result is another HJ solution.
2. (ii) *Affine input transformations*: $f_{\varepsilon}(Ax+c)$ is an LSE network with weights $WA$ and biases $Wc+b$, hence still an exact HJ solution.
3. (iii) *Residual connections*: adding a skip path corresponds to the Euler discretization of HJ characteristics (Proposition [5.2](https://arxiv.org/html/2605.28983#S5.Thmtheorem2)); the composition remains within the HJ class up to discretization error $O(h)$.

###### Proof\.

By Theorem[4\.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1), each LSE layer satisfiesfε\(l\)​\(x\)=\|x\|2/\(4​tl\)−uε\(l\)​\(x,tl\)f\_\{\\varepsilon\}^\{\(l\)\}\(x\)=\|x\|^\{2\}/\(4t\_\{l\}\)\-u\_\{\\varepsilon\}^\{\(l\)\}\(x,t\_\{l\}\)exactly, withuε\(l\)u\_\{\\varepsilon\}^\{\(l\)\}the Hopf–Cole solution for that layer’s own initial datumg\(l\)g^\{\(l\)\}and timetlt\_\{l\}\. Each layer individually is an exact HJ solution; the composition is another HJ\-class function, as established by Theorem[7\.1](https://arxiv.org/html/2605.28983#S7.Thmtheorem1)\(i\)\.

The substitutionz=A​x\+cz=Ax\+cgivesfε​\(A​x\+c\)=LSEε​\(\(W​A\)​x\+\(W​c\+b\)\)f\_\{\\varepsilon\}\(Ax\+c\)=\\mathrm\{LSE\}\_\{\\varepsilon\}\(\(WA\)x\+\(Wc\+b\)\), an LSE layer with weightsW′=W​AW^\{\\prime\}=WAand biasesb′=W​c\+bb^\{\\prime\}=Wc\+b\. Applying Theorem[4\.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)to\(W′,b′\)\(W^\{\\prime\},b^\{\\prime\}\)yields the exact HJ identity with reparameterized support pointsyj′=2​t​\(Wj′\)⊤y^\{\\prime\}\_\{j\}=2t\(W^\{\\prime\}\_\{j\}\)^\{\\top\}and initial datagj′=−bj′−\|yj′\|2/\(4​t\)g^\{\\prime\}\_\{j\}=\-b^\{\\prime\}\_\{j\}\-\|y^\{\\prime\}\_\{j\}\|^\{2\}/\(4t\)\.

The residual layer $x_{l+1}=x_{l}+h\,f_{\varepsilon}^{(l)}(x_{l})$ is the forward Euler step for $\dot{x}=f_{\varepsilon}(x)$; by Proposition [5.2](https://arxiv.org/html/2605.28983#S5.Thmtheorem2) this ODE is the $x$-characteristic of the HJ PDE with Hamiltonian $H(x,p)=p\cdot f_{\varepsilon}(x)$, so the composition stays within the HJ class up to $O(h)$ truncation error. ∎
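The two exact identities used in this proof lend themselves to a quick numerical check. The sketch below is illustrative only: it is plain Python with arbitrary small matrices, and every function name is an assumption rather than the authors' code. It evaluates the Hopf–Cole identity $f_{\varepsilon}(x)=|x|^{2}/(4t)-u_{\varepsilon}(x,t)$ with support points $y_{j}=2tW_{j}^{\top}$ and initial data $g_{j}=-b_{j}-|y_{j}|^{2}/(4t)$, and the affine reparameterization $W'=WA$, $b'=Wc+b$.

```python
import math

def lse(z, eps):
    """eps * log(sum_j exp(z_j / eps)), computed stably."""
    m = max(z)
    return m + eps * math.log(sum(math.exp((zj - m) / eps) for zj in z))

def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def matmul(P, Q):
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def lse_layer(W, b, x, eps):
    """f_eps(x) = LSE_eps(W x + b)."""
    return lse([zi + bi for zi, bi in zip(matvec(W, x), b)], eps)

def hopf_cole_u(W, b, x, eps, t):
    """u(x,t) = -eps log sum_j exp(-(|x - y_j|^2/(4t) + g_j)/eps),
    with y_j = 2 t W_j^T and g_j = -b_j - |y_j|^2/(4t)."""
    terms = []
    for Wj, bj in zip(W, b):
        y = [2.0 * t * w for w in Wj]
        g = -bj - sum(yi * yi for yi in y) / (4.0 * t)
        d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
        terms.append(-(d2 / (4.0 * t) + g))
    return -lse(terms, eps)

# Illustrative values (assumptions, chosen arbitrarily).
W = [[1.0, -0.5], [0.2, 0.8], [-1.1, 0.3]]
b = [0.1, -0.4, 0.25]
A = [[0.5, 1.0], [-2.0, 0.3]]
c = [0.7, -0.2]
x = [0.4, -1.3]
eps, t = 0.5, 0.7

# Hopf-Cole identity: f(x) - (|x|^2/(4t) - u(x,t)) should vanish.
f = lse_layer(W, b, x, eps)
identity_gap = f - (sum(xi * xi for xi in x) / (4.0 * t) - hopf_cole_u(W, b, x, eps, t))

# Affine reparameterization: f(Ax + c) equals the layer with W' = WA, b' = Wc + b.
Ax_c = [v + ci for v, ci in zip(matvec(A, x), c)]
W2 = matmul(W, A)
b2 = [v + bi for v, bi in zip(matvec(W, c), b)]
affine_gap = lse_layer(W, b, Ax_c, eps) - lse_layer(W2, b2, x, eps)
print(identity_gap, affine_gap)
```

Both gaps are zero up to floating-point roundoff, for any choice of $\varepsilon>0$ and $t>0$.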

#### Extension to broader PDE classes\.

When the Hamiltonian admits a Hopf–Lax or inf\-convolution representation \(convexHH\), the correspondence extends to scalar conservation laws, eikonal equations, and linear transport; in each case the network encodes the Hamiltonian in its weights and the viscosity inε\\varepsilon\. The KdV and Toda lattice hierarchies ultradiscretize to box\-ball cellular automata\[Tokihiroet al\.,[1996](https://arxiv.org/html/2605.28983#bib.bib5)\], suggesting a connection for non\-convex Hamiltonians, but the neural network analogue in that setting remains open\.

###### Proof of Theorem[4\.3](https://arxiv.org/html/2605.28983#S4.Thmtheorem3)\.

The Hopf–Cole substitution $v=e^{-u/\varepsilon}$ applied to $\partial_{t}u+H(\nabla u)=\varepsilon\,\nabla\!\cdot\!(A\nabla u)$ yields

$$\partial_{t}v \;=\; \varepsilon\,\nabla\!\cdot\!(A\nabla v) \;+\; v\cdot\bigl[H(-\varepsilon\nabla\log v)/\varepsilon-\varepsilon(\nabla\log v)^{\top}A(\nabla\log v)\bigr].$$

The reaction term (the bracketed expression) vanishes identically if and only if $H(p)=p^{\top}Ap$: the condition $\varepsilon(\nabla\log v)^{\top}A(\nabla\log v)=H(-\varepsilon\nabla\log v)/\varepsilon$ must hold for all gradients, which forces $H(q)=q^{\top}Aq$ for all $q\in\mathbb{R}^{d}$. This is the anisotropic-quadratic class of Theorem [4.4](https://arxiv.org/html/2605.28983#S4.Thmtheorem4), recovering Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1) at $A=I$.

For the BSDE direction: for non-quadratic $H$, the solution has an exact stochastic representation via the nonlinear Feynman–Kac theorem: $u(x,t)=Y_{0}$, where $(Y_{s},Z_{s})$ satisfies $-dY_{s}=H(Z_{s})\,ds-Z_{s}\cdot dW_{s}$, $Y_{T}=g(X_{T})$, with $Z_{s}=\nabla u(X_{s},s)$. The BSDE reduces to an explicit Gaussian convolution precisely when the driver $H(Z_{s})$ is quadratic in $Z_{s}$, providing an independent stochastic-representation confirmation that the quadratic class is necessary. ∎
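For readers who want the intermediate steps of the Hopf–Cole substitution in this proof, they can be written out as follows. This is a worked sketch in the same symbols; it uses only the chain and product rules and places no extra assumption on $A$.

```latex
\begin{aligned}
u &= -\varepsilon\log v,
\qquad \partial_t u = -\varepsilon\,\frac{\partial_t v}{v},
\qquad \nabla u = -\varepsilon\,\nabla\log v, \\
\nabla\!\cdot\!(A\nabla u)
  &= -\varepsilon\left[\frac{\nabla\!\cdot\!(A\nabla v)}{v}
     - (\nabla\log v)^{\top} A\,(\nabla\log v)\right], \\
-\varepsilon\,\frac{\partial_t v}{v} + H(-\varepsilon\nabla\log v)
  &= -\varepsilon^{2}\left[\frac{\nabla\!\cdot\!(A\nabla v)}{v}
     - (\nabla\log v)^{\top} A\,(\nabla\log v)\right].
\end{aligned}
```

Multiplying the last line by $-v/\varepsilon$ and solving for $\partial_t v$ gives the reaction-diffusion form stated in the proof. Writing $q=-\varepsilon\nabla\log v$, the bracketed reaction term equals $\bigl(H(q)-q^{\top}Aq\bigr)/\varepsilon$, so it vanishes for every $q$ exactly when $H(q)=q^{\top}Aq$.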

###### Theorem H\.5\(Ultradiscretization extension to non\-convexHH\)\.

For any $H:\mathbb{R}^{d}\to\mathbb{R}$ (possibly non-convex), let $L=H^{*}$ be its Legendre–Fenchel conjugate (always convex, as a pointwise supremum of affine functions) and $H^{**}=(H^{*})^{*}$ the convex envelope of $H$. Define the kernel network

$$K_{\varepsilon}^{N}(x) \;=\; -\varepsilon\log\sum_{j=1}^{N}\exp\!\left(-\frac{tL\!\bigl(\frac{x-y_{j}}{t}\bigr)+g(y_{j})}{\varepsilon}\right). \tag{40}$$

Then:

1. \(i\)*Tropical limit, exact forH∗∗H^\{\*\*\}:*Asε→0\\varepsilon\\to 0, by Theorem[3\.1](https://arxiv.org/html/2605.28983#S3.Thmtheorem1),KεN​\(x\)→minj⁡\{t​L​\(\(x−yj\)/t\)\+g​\(yj\)\}K\_\{\\varepsilon\}^\{N\}\(x\)\\to\\min\_\{j\}\\\{tL\(\(x\-y\_\{j\}\)/t\)\+g\(y\_\{j\}\)\\\}, which is theNN\-point Hopf–Lax approximation forH∗∗H^\{\*\*\}, exact asN→∞N\\to\\inftyby Theorem[5\.1](https://arxiv.org/html/2605.28983#S5.Thmtheorem1)\.
2. \(ii\)*Approximation error for the originalHH:*By stability of viscosity solutions under Hamiltonian perturbation\[Crandall and Lions,[1983](https://arxiv.org/html/2605.28983#bib.bib8)\],‖uH∗∗​\(⋅,t\)−uH​\(⋅,t\)‖∞≤t⋅supp\|H​\(p\)−H∗∗​\(p\)\|\\\|u\_\{H^\{\*\*\}\}\(\\cdot,t\)\-u\_\{H\}\(\\cdot,t\)\\\|\_\{\\infty\}\\leq t\\cdot\\sup\_\{p\}\|H\(p\)\-H^\{\*\*\}\(p\)\|; the bound vanishes whenHHis convex \(H=H∗∗H=H^\{\*\*\}\)\.
3. (iii) *Reduction to the LSE identity:* When $H(p)=|p|^{2}$, so $L(v)=|v|^{2}/4$, expanding the square gives $tL((x-y_{j})/t)=|x|^{2}/(4t)-x\cdot y_{j}/(2t)+|y_{j}|^{2}/(4t)$, which separates $x$ from $y_{j}$. In this case $K_{\varepsilon}^{N}$ coincides with the Hopf–Cole solution $u_{\varepsilon}^{N}(\cdot,t)$, and the separation reduces it to the standard LSE network $f_{\varepsilon}^{N}$ of Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1) through the exact identity $f_{\varepsilon}^{N}(x)+u_{\varepsilon}^{N}(x,t)=|x|^{2}/(4t)$ for all $\varepsilon>0$ and $N\geq 1$.

###### Proof\.

By Theorem [3.1](https://arxiv.org/html/2605.28983#S3.Thmtheorem1) (Maslov dequantization), as $\varepsilon\to 0$ the expression $-\varepsilon\log\sum_{j}\exp(-f_{j}/\varepsilon)$ converges to $\min_{j}f_{j}$; the limit equals the $N$-point Hopf–Lax formula for $H^{**}$ because $\inf_{y}\{tL((x-y)/t)+g(y)\}$ is the viscosity solution of $\partial_{t}u+H^{**}(\nabla u)=0$ with $u(\cdot,0)=g$ (Theorem [5.1](https://arxiv.org/html/2605.28983#S5.Thmtheorem1), since $L=(H^{**})^{*}$ and $(H^{**})^{**}=H^{**}$). If $w_{H}$ and $w_{H^{**}}$ are viscosity solutions of the same initial-value problem with Hamiltonians $H$ and $H^{**}$ respectively, stability of viscosity solutions under Hamiltonian perturbation gives $\|w_{H}-w_{H^{**}}\|_{\infty}\leq t\sup_{p}|H(p)-H^{**}(p)|$ [Crandall and Lions, [1983](https://arxiv.org/html/2605.28983#bib.bib8)]. When $H(p)=|p|^{2}$, the Lagrangian $L=H^{*}=|\cdot|^{2}/4$ gives $tL((x-y_{j})/t)=|x-y_{j}|^{2}/(4t)$, so $K_{\varepsilon}^{N}(x)=-\varepsilon\log\sum_{j}\exp\bigl(-(|x-y_{j}|^{2}/(4t)+g(y_{j}))/\varepsilon\bigr)=u_{\varepsilon}^{N}(x,t)$. Expanding the square yields $|x-y_{j}|^{2}/(4t)=|x|^{2}/(4t)-x\cdot y_{j}/(2t)+|y_{j}|^{2}/(4t)$; the $x$-dependent prefactor $e^{-|x|^{2}/(4\varepsilon t)}$ factors out of the sum, giving $u_{\varepsilon}^{N}(x,t)=|x|^{2}/(4t)-\varepsilon\log\sum_{j}\exp\bigl((x\cdot y_{j}/(2t)-|y_{j}|^{2}/(4t)-g(y_{j}))/\varepsilon\bigr)=|x|^{2}/(4t)-f_{\varepsilon}^{N}(x)$, which is Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1). ∎

The construction applies the re-quantization step of ultradiscretization in reverse (Theorem [3.1](https://arxiv.org/html/2605.28983#S3.Thmtheorem1)): the min-plus tropical Hopf–Lax formula for $H^{**}$ is lifted to the smooth differentiable network $K_{\varepsilon}^{N}$ by replacing $\min_{j}$ with $-\varepsilon\log\sum_{j}\exp(-\,\cdot\,/\varepsilon)$. For non-convex $H$, the kernel $K(x,y,t)=\exp(-tL((x-y)/t)/\varepsilon)$ is the $H^{**}$-adjusted heat kernel; the exact algebraic identity holds for $H^{**}$ and reduces to $f_{\varepsilon}^{N}+u_{\varepsilon}^{N}=|x|^{2}/(4t)$ only when $H^{**}$ is the Euclidean quadratic. The approximation error for the original non-convex $H$ is $O(t\cdot\|H-H^{**}\|_{\infty})$: zero for convex $H$, and controlled by the convexity gap otherwise.

#### Convolutional architectures\.

#### General activations: correspondence map\.

Networks with GELU or SiLU activations are not covered by Theorem[4\.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1), but a partial correspondence can be identified at each level\.

*SiLU\.*The SiLU activation isSiLU​\(x\)=x⋅σ​\(x\)\\mathrm\{SiLU\}\(x\)=x\\cdot\\sigma\(x\), whereσ​\(x\)=ex/\(ex\+1\)\\sigma\(x\)=e^\{x\}/\(e^\{x\}\+1\)is the softmax of two logits\(x,0\)\(x,0\), a 2\-neuron special case of the LSE gradient\. Since∇xLSEε​\(x,0\)=σ​\(x/ε\)\\nabla\_\{x\}\\mathrm\{LSE\}\_\{\\varepsilon\}\(x,0\)=\\sigma\(x/\\varepsilon\), SiLU sits one derivative away from the exact HJ class\. In the tropical limitσ​\(x/ε\)→Heaviside​\(x\)\\sigma\(x/\\varepsilon\)\\to\\mathrm\{Heaviside\}\(x\)asε→0\\varepsilon\\to 0, recovering ReLU\.

*GELU\.*The GELU activation isGELU​\(x\)=x⋅Φ​\(x\)\\mathrm\{GELU\}\(x\)=x\\cdot\\Phi\(x\), whereΦ\\Phiis the standard Gaussian CDF\. SinceΦ​\(x\)=∫−∞xϕ​\(t\)​𝑑t\\Phi\(x\)=\\int\_\{\-\\infty\}^\{x\}\\phi\(t\)\\,dtandϕ\\phiis the heat kernel at timet=12t=\\frac\{1\}\{2\}, GELU involves a Gaussian convolution, placing it in the same family as the imaginary\-time propagator of Appendix[A](https://arxiv.org/html/2605.28983#A1)\. In the limitΦ​\(x/ε\)→Heaviside​\(x\)\\Phi\(x/\\varepsilon\)\\to\\mathrm\{Heaviside\}\(x\)asε→0\\varepsilon\\to 0, again recovering ReLU\.

Both activations therefore share the same tropical limit and the same Gibbs/heat\-kernel structure at finite temperature as the LSE class\. The missing ingredient is an exact algebraic identity analogous to Theorem[4\.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)\. Identifying the precise PDE whose solution operator has SiLU or GELU as its finite\-temperature layer\-wise action is an open problem\.
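The relations stated for SiLU and GELU can be illustrated with a short numerical sketch. This is an assumption-laden toy in plain Python, not the authors' code: the function names are hypothetical, and the temperature-scaled forms $x\,\sigma(x/\varepsilon)$ and $x\,\Phi(x/\varepsilon)$ follow the limits quoted above. It compares a finite-difference derivative of $\mathrm{LSE}_{\varepsilon}(x,0)$ with $\sigma(x/\varepsilon)$ and evaluates both activations at small $\varepsilon$.

```python
import math

def lse2(x, eps):
    """LSE_eps over the two logits (x, 0), computed stably."""
    m = max(x, 0.0)
    return m + eps * math.log(math.exp((x - m) / eps) + math.exp((0.0 - m) / eps))

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def silu_eps(x, eps):
    """Temperature-scaled SiLU: x * sigma(x / eps)."""
    return x * sigmoid(x / eps)

def gelu_eps(x, eps):
    """Temperature-scaled GELU: x * Phi(x / eps), Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / (eps * math.sqrt(2.0))))

def relu(x):
    return max(x, 0.0)

# Central finite difference of LSE_eps(x, 0) in x versus sigma(x / eps).
x0, eps0, h = 0.7, 0.3, 1e-6
fd = (lse2(x0 + h, eps0) - lse2(x0 - h, eps0)) / (2.0 * h)
print(fd, sigmoid(x0 / eps0))

# Small eps: both activations approach ReLU at these sample points.
for xv in (-2.0, -0.5, 0.5, 2.0):
    print(xv, silu_eps(xv, 1e-3), gelu_eps(xv, 1e-3), relu(xv))
```

The sketch only exercises the two facts quoted in the text (the gradient relation and the shared tropical limit); it says nothing about the open question of an exact PDE for either activation.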

Table[6](https://arxiv.org/html/2605.28983#A8.T6)summarises the correspondence status for common activations\.

Table 6: HJ correspondence status for common activations. Yes (solution-class): activation IS the Hopf–Cole solution (Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)). Yes ($\nabla$-class): activation is an exact component of the Hopf–Cole measure (the signed Gibbs weight $\nabla\,\mathrm{LSE}$), an exact algebraic identity, not an approximation. Open: no exact HJ identity known.

#### LSE\-transformer blocks\.

A standard pre\-norm transformer block with LSE\-activated FFN \(replacing GELU or SiLU\) consists of four components: \(1\) LayerNorm, \(2\) scaled dot\-product attention with residual, \(3\) LayerNorm, \(4\) two\-layer LSE\-FFN with residual\.

###### Proposition H\.9\(LSE\-transformer block characterization\)\.

Let $X\in\mathbb{R}^{L\times d}$ be a sequence of $L$ token embeddings. Define the pre-norm LSE-transformer block:

$$\begin{aligned}
Z_{1} &= \mathrm{LN}(X), \quad Z_{2} = X+\mathrm{Attn}(Z_{1}W_{Q},\,Z_{1}W_{K},\,Z_{1}W_{V}),\\
Z_{\mathrm{out}} &= Z_{2}+\mathrm{LSE}_{\varepsilon_{2}}\bigl(W_{2}\cdot\mathrm{LSE}_{\varepsilon_{1}}(W_{1}\,\mathrm{LN}(Z_{2})+b_{1})+b_{2}\bigr),
\end{aligned}$$

where $\mathrm{LN}$ is layer normalization. Then:

1. \(i\)*Attention \(exact\):*identity \([7](https://arxiv.org/html/2605.28983#S4.E7)\) holds for any inputsQ,K,VQ,K,V, includingLN\\mathrm\{LN\}\-normalized ones; the LayerNorm defines the inputs but does not alter the algebraic identity\. Max error is floating\-point roundoff \(Table[4](https://arxiv.org/html/2605.28983#A7.T4)\)\.
2. \(ii\)*LSE\-FFN \(exact\):*each LSE layer algebraically satisfies the identity of Theorem[4\.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)with its respective input asxx\(whether raw or computed from previous layers\); both layers are individually exact HC solutions for their own initial data\. Max error is machine precision \(Table[4](https://arxiv.org/html/2605.28983#A7.T4)\)\.
3. \(iii\)*Residual connections \(structural\):*each skip connection contributesO​\(h\)O\(h\)Euler discretization error \(Proposition[5\.2](https://arxiv.org/html/2605.28983#S5.Thmtheorem2)\)\.

The distinction from a standard transformer is only the FFN activation: replacing GELU with LSE activates the exact HJ correspondence for that sub\-layer, with no other architectural change required\.

#### Role dictionary for attention\.

In the Hopf–Cole reading of attention, each component has a precise PDE interpretation: queriesqiq\_\{i\}are evaluation pointsxxat which the HJ solution is queried; keyskjk\_\{j\}are support pointsyjy\_\{j\}of the initial\-data measure; valuesVVare the observable field averaged under the Gibbs measure \(the role ofg​\(yj\)g\(y\_\{j\}\)in the Hopf–Cole representation, but placed outside the exponent since attention computes the gradient∇zLSEε\\nabla\_\{z\}\\mathrm\{LSE\}\_\{\\varepsilon\}contracted withVV, not the log\-partition function itself\)\. The1/d1/\\sqrt\{d\}scaling in standard dot\-product attention fixes the effective temperature atε=d\\varepsilon=\\sqrt\{d\}: without it, logit variance grows linearly indd, driving attention toward the tropical \(hard\) limit asddincreases\. Asε→0\\varepsilon\\to 0, softmax collapses to argmax and attention becomes hard attentionAttn0​\(Q,K,V\)i=vargmaxj​\(qi⋅kj\)\\mathrm\{Attn\}\_\{0\}\(Q,K,V\)\_\{i\}=v\_\{\\mathrm\{argmax\}\_\{j\}\(q\_\{i\}\\cdot k\_\{j\}\)\}, the tropical selection operator\.

###### Proof of Theorem[4\.2](https://arxiv.org/html/2605.28983#S4.Thmtheorem2)\.

The identity \([7](https://arxiv.org/html/2605.28983#S4.E7)\) holds because the softmax weightsπj=exp⁡\(zj/ε\)/∑lexp⁡\(zl/ε\)\\pi\_\{j\}=\\exp\(z\_\{j\}/\\varepsilon\)/\\sum\_\{l\}\\exp\(z\_\{l\}/\\varepsilon\)equal∇zjLSEε​\(z\)\\nabla\_\{z\_\{j\}\}\\mathrm\{LSE\}\_\{\\varepsilon\}\(z\)for any logitszz, includingzj=−‖qi−kj‖2/\(4​t\)z\_\{j\}=\-\\\|q\_\{i\}\-k\_\{j\}\\\|^\{2\}/\(4t\), which giveszj/ε=−‖qi−kj‖2/\(4​ε​t\)z\_\{j\}/\\varepsilon=\-\\\|q\_\{i\}\-k\_\{j\}\\\|^\{2\}/\(4\\varepsilon t\)and thus recovers the L2\-Attn weights exactly\. By \([3](https://arxiv.org/html/2605.28983#S4.E3)\) withg≡0g\\equiv 0,uεN​\(qi,t\)=−ε​log​∑jexp⁡\(−‖qi−kj‖2/\(4​ε​t\)\)u\_\{\\varepsilon\}^\{N\}\(q\_\{i\},t\)=\-\\varepsilon\\log\\sum\_\{j\}\\exp\(\-\\\|q\_\{i\}\-k\_\{j\}\\\|^\{2\}/\(4\\varepsilon t\)\), soZi=e−uεN​\(qi,t\)/εZ\_\{i\}=e^\{\-u\_\{\\varepsilon\}^\{N\}\(q\_\{i\},t\)/\\varepsilon\}\. ∎

L2 attention evaluates the exact Hopf–Cole solution at each query rather than a linear approximation to the logits, making it the most directly heat\-kernel\-native of the standard attention variants\.
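The role dictionary and the proof above can be made concrete for a single query. The following Python sketch is illustrative only: the function names, the toy keys and values, and the chosen $\varepsilon$ and $t$ are assumptions, not taken from the paper. It computes the L2-attention Gibbs weights, recovers them from the Hopf–Cole solution through $\pi_{j}=\exp((z_{j}+u)/\varepsilon)$ (equivalent to the normalizer $Z_{i}=e^{-u/\varepsilon}$), and shows the hard-attention limit at small $\varepsilon$.

```python
import math

def logits(q, keys, t):
    """z_j = -|q - k_j|^2 / (4 t), the L2-attention logits for one query."""
    return [-sum((qi - ki) ** 2 for qi, ki in zip(q, k)) / (4.0 * t) for k in keys]

def softmax(z, eps):
    """Gibbs weights pi_j = exp(z_j / eps) / sum_l exp(z_l / eps), computed stably."""
    m = max(z)
    w = [math.exp((zj - m) / eps) for zj in z]
    s = sum(w)
    return [wj / s for wj in w]

def hopf_cole_u(q, keys, eps, t):
    """u(q, t) = -eps * log sum_j exp(z_j / eps), the g = 0 case."""
    z = logits(q, keys, t)
    m = max(z)
    return -(m + eps * math.log(sum(math.exp((zj - m) / eps) for zj in z)))

def attend(q, keys, values, eps, t):
    """Gibbs-weighted average of the value vectors."""
    pi = softmax(logits(q, keys, t), eps)
    return [sum(p * v[i] for p, v in zip(pi, values)) for i in range(len(values[0]))]

# Toy data (assumed for illustration).
q = [0.2, -0.1]
keys = [[0.0, 0.0], [1.0, 1.0], [-0.5, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
eps, t = 0.5, 1.0

z = logits(q, keys, t)
pi = softmax(z, eps)
u = hopf_cole_u(q, keys, eps, t)
pi_from_u = [math.exp((zj + u) / eps) for zj in z]   # uses Z = exp(-u / eps)
hard = attend(q, keys, values, 1e-4, t)              # small eps: nearest key wins
print(pi, pi_from_u, hard)
```

In this toy configuration the first key is nearest to the query, so the small-$\varepsilon$ output coincides with the first value vector.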

## Appendix I Intrinsic Dimension from Published Scaling Curves

Proposition [8.8](https://arxiv.org/html/2605.28983#S8.Thmtheorem8) predicts $\mathcal{L}(N)\asymp N^{-1/d_{\mathrm{eff}}}$, so the empirical scaling exponent $\alpha$ directly estimates the intrinsic dimension of the data-generating measure via $d_{\mathrm{eff}}=1/\alpha$. Table [7](https://arxiv.org/html/2605.28983#A9.T7) applies this to published scaling curves across four domains. The Kaplan and Chinchilla language exponents differ because Kaplan et al. [[2020](https://arxiv.org/html/2605.28983#bib.bib27)] under-train large models (insufficient data), inflating the apparent dimension; Hoffmann et al. [[2022](https://arxiv.org/html/2605.28983#bib.bib35)] correct this with compute-optimal allocation and recover a lower, more interpretable $d_{\mathrm{eff}}\approx 2.9$. The video and math exponents from Henighan et al. [[2020](https://arxiv.org/html/2605.28983#bib.bib34)] use model-size scaling at convergence (Figure 3 of that paper).

Table 7: Empirical scaling exponent $\alpha$ and implied intrinsic dimension $d_{\mathrm{eff}}=1/\alpha$ from published scaling curves. All exponents are for loss vs. model parameter count $N$. The conversion $d_{\mathrm{eff}}=1/\alpha$ is directional: empirical $\alpha$ conflates approximation, optimization, and data-coverage effects, so the values should be interpreted as order-of-magnitude estimates rather than exact intrinsic dimensions. In particular, the Kaplan exponent reflects under-training rather than the data manifold alone; the Chinchilla exponent is more interpretable as a geometric quantity.

The variation in $d_{\mathrm{eff}}$ across domains is consistent with the manifold hypothesis: mathematical problem solving has lower intrinsic dimension than language (structured, fewer degrees of freedom), while video has intermediate dimension.

## Appendix J Actionable Design Principles

The theoretical correspondences established in the main body translate into concrete prescriptions for practitioners working with LSE networks or related architectures\. The implications below are derived directly from the theorems and are not empirical claims about performance at scale\.

#### Optimal temperature selection\.

Theorem [8.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1) identifies the minimax-optimal viscosity $\varepsilon^{*}\asymp N^{-1/d}$ as the value minimizing the excess risk $O(N^{-1/d}+M\sqrt{N/n})$ jointly over approximation and estimation error. This is a closed-form prescription: given a network of width $N$ and data of intrinsic dimension $d$ (estimable from scaling curves via Proposition [8.8](https://arxiv.org/html/2605.28983#S8.Thmtheorem8)), the optimal temperature follows from $N$ and $d$ alone, with no grid search. Setting $\varepsilon$ below $\varepsilon^{*}$ under-regularizes (estimation error dominates); setting $\varepsilon$ above $\varepsilon^{*}$ over-smooths (approximation error dominates).

#### Robustness\-accuracy trade\-off\.

Corollary [8.2](https://arxiv.org/html/2605.28983#S8.Thmtheorem2) shows $\|\nabla^{2}_{x}f_{\varepsilon}^{N}\|_{2}\leq\|W\|_{2,\infty}^{2}/\varepsilon$ uniformly over inputs; the weights enter the bound only through the norm $\|W\|_{2,\infty}$. The worst-case sensitivity to input perturbations is therefore certifiably controlled by $\varepsilon$: for fixed weights, increasing $\varepsilon$ monotonically decreases the bound on perturbation sensitivity, while decreasing $\varepsilon$ toward $\varepsilon^{*}$ sharpens discrimination. Practitioners facing adversarial robustness requirements can therefore raise $\varepsilon$ to obtain a certified guarantee from Corollary [8.2](https://arxiv.org/html/2605.28983#S8.Thmtheorem2) without retraining.
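The curvature bound can be checked directly in the scalar-input case. The sketch below is a minimal illustration with made-up weights, not the paper's experiment. It assumes that $\|W\|_{2,\infty}$ denotes the largest row norm, which for scalar inputs is $\max_{j}|w_{j}|$, and it uses the closed form $f''(x)=\mathrm{Var}_{\pi}(w)/\varepsilon$, where $\pi$ is the Gibbs weight vector.

```python
import math

def gibbs(w, b, x, eps):
    """Softmax weights pi_j proportional to exp((w_j x + b_j) / eps)."""
    z = [(wj * x + bj) / eps for wj, bj in zip(w, b)]
    m = max(z)
    e = [math.exp(zj - m) for zj in z]
    s = sum(e)
    return [ej / s for ej in e]

def second_derivative(w, b, x, eps):
    """f''(x) for f(x) = eps * log sum_j exp((w_j x + b_j) / eps):
    the variance of w under the Gibbs weights, divided by eps."""
    pi = gibbs(w, b, x, eps)
    mean = sum(p * wj for p, wj in zip(pi, w))
    second = sum(p * wj * wj for p, wj in zip(pi, w))
    return (second - mean * mean) / eps

def hessian_bound(w, eps):
    """max_j w_j^2 / eps, the scalar-input form of ||W||_{2,inf}^2 / eps."""
    return max(wj * wj for wj in w) / eps

# Hypothetical weights and biases for illustration.
w = [1.5, -0.7, 0.2, -2.0]
b = [0.0, 0.3, -0.1, 0.5]

for eps in (0.1, 1.0, 10.0):
    worst = max(second_derivative(w, b, i / 10.0, eps) for i in range(-50, 51))
    print(eps, worst, hessian_bound(w, eps))
```

For fixed weights the observed curvature stays below the bound at every grid point, and the bound itself scales as $1/\varepsilon$.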

#### Temperature annealing\.

Sinceε\\varepsiloncontrols approximation quality \(viaN−1/dN^\{\-1/d\}\) and robustness \(via‖W‖2,∞2/ε\\\|W\\\|\_\{2,\\infty\}^\{2\}/\\varepsilon\) simultaneously, it admits a principled annealing schedule\. Starting training at highε\\varepsilonprovides a smooth, well\-conditioned loss landscape; gradually annealing towardε∗\\varepsilon^\{\*\}sharpens the approximation as training stabilizes\. This is the neural\-network analogue of PDE continuation methods, and the framework supplies the theoretical justification: the schedule follows the path along which the HJ equation transitions from the strongly viscous regime \(diffusion\-dominated\) to the mildly viscous regime near optimal viscosity\.

#### Architecture as discretization strategy\.

All four architecture classes are different discretizations of the same HJ equation, making the choice of architecture a choice of discretization strategy rather than an empirical design decision\.

- •*Feedforward networks*discretize the measure: neurons are iid samples from the initial\-data measure with driftb=0b=0\.
- •*ResNets*discretize the process via Euler–Maruyama with driftb=F​\(x,W\)b=F\(x,W\); the layer\-to\-layer dynamics are the characteristics of the HJ equation\.
- •*Transformers*use attention as a vector\-valued Hopf–Cole average \(Proposition[H\.9](https://arxiv.org/html/2605.28983#A8.Thmtheorem9)\); the attention weights are the Gibbs measure at temperatureε=dk\\varepsilon=\\sqrt\{d\_\{k\}\}\.
- •*Recurrent architectures*discretize linear or nonlinear characteristics with architecture\-dependent viscosity \(Proposition[5\.4](https://arxiv.org/html/2605.28983#S5.Thmtheorem4)\)\.

Tasks with natural temporal ordering \(sequence modeling, time series\) favor process\-discretization strategies \(ResNet, recurrent\); tasks requiring an invariant of a static distribution favor measure\-discretization \(feedforward\); tasks requiring query\-dependent weighting of stored data favor attention\.

#### Backpropagation as adjoint optimization\.

Theorem [8.4](https://arxiv.org/html/2605.28983#S8.Thmtheorem4) identifies backpropagation as the co-state equation of the Hamiltonian system associated with the HJ PDE. The adjoint method is the standard technique for gradient computation in PDE-constrained optimization; the present result establishes that standard reverse-mode autodifferentiation is already an instance of it. This opens the design space for training algorithms: symplectic integrators (which preserve the symplectic structure of the Hamiltonian flow exactly and keep the energy error bounded over long horizons) and higher-order ODE solvers can replace the Euler gradient step with theoretical guarantees on near-conservation of the Hamiltonian, potentially improving long-horizon training stability.
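The contrast between an Euler step and a symplectic step can be seen on a textbook Hamiltonian system. The sketch below is a generic illustration on the harmonic oscillator $H(q,p)=(q^{2}+p^{2})/2$; it is not a training algorithm from the paper, and the step size and step count are arbitrary assumptions.

```python
def energy(q, p):
    """Hamiltonian of the unit harmonic oscillator."""
    return 0.5 * (q * q + p * p)

def explicit_euler(q, p, h, steps):
    """Forward Euler: both updates use the old state."""
    for _ in range(steps):
        q, p = q + h * p, p - h * q
    return q, p

def symplectic_euler(q, p, h, steps):
    """Symplectic Euler: momentum first, then position with the new momentum."""
    for _ in range(steps):
        p = p - h * q
        q = q + h * p
    return q, p

e0 = energy(1.0, 0.0)
e_euler = energy(*explicit_euler(1.0, 0.0, 0.1, 1000))
e_symp = energy(*symplectic_euler(1.0, 0.0, 0.1, 1000))
print(e0, e_euler, e_symp)
```

Forward Euler multiplies the energy by $1+h^{2}$ at every step, so it grows without bound, whereas the symplectic variant conserves the modified quantity $q^{2}+p^{2}-hqp$ and keeps the energy within a few percent of its initial value.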

#### Compute\-optimal scaling forecasts\.

Proposition[8\.8](https://arxiv.org/html/2605.28983#S8.Thmtheorem8)identifies the scaling exponentα=1/deff\\alpha=1/d\_\{\\mathrm\{eff\}\}fromℒ​\(N\)∝N−α\\mathcal\{L\}\(N\)\\propto N^\{\-\\alpha\}, so the data intrinsic dimension can be read directly from published scaling curves\. Conversely, given an estimate ofdeffd\_\{\\mathrm\{eff\}\}from a small pilot run, the framework predicts the full scaling curve and the compute\-optimal allocationN∝nd/\(d\+2\)N\\propto n^\{d/\(d\+2\)\}before a large training run, allowing resource allocation to be grounded in the approximation\-rate theory rather than extrapolation of empirical trend lines\.
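The allocation rule follows from balancing the two terms of the excess risk, $N^{-1/d}=\sqrt{N/n}$, which gives $N=n^{d/(d+2)}$ and a common value $n^{-1/(d+2)}$. A minimal Python sketch of this arithmetic is given below; the function names are hypothetical, all constants (including $M$) are set to one as a simplifying assumption, and the sample values of $n$ and $d$ are arbitrary.

```python
import math

def d_eff_from_alpha(alpha):
    """Implied intrinsic dimension from a scaling exponent, d_eff = 1 / alpha."""
    return 1.0 / alpha

def balanced_width(n, d):
    """Width N at which N**(-1/d) equals sqrt(N / n), constants dropped."""
    return n ** (d / (d + 2.0))

def risk_terms(N, n, d, M=1.0):
    """(approximation term, estimation term) of the excess-risk bound."""
    return N ** (-1.0 / d), M * math.sqrt(N / n)

n, d = 1.0e6, 3.0            # illustrative sample size and intrinsic dimension
N_star = balanced_width(n, d)
approx, estim = risk_terms(N_star, n, d)
print(N_star, approx, estim, n ** (-1.0 / (d + 2.0)))
```

With these illustrative numbers the two terms coincide at the balanced width, and both equal $n^{-1/(d+2)}$, the rate quoted in the abstract.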

## Appendix K Implications for Large-Scale Training, Alignment, and Continual Learning

The framework yields theorem\-grounded observations on three questions of practical importance; the implications are derived from the approximation bounds and structural results established above, and are not claims about empirical performance at scale\.

#### Internet\-scale data exhaustion\.

Theorem[8\.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1)decomposes the excess risk asO​\(N−1/deff\+M​N/n\)O\(N^\{\-1/d\_\{\\mathrm\{eff\}\}\}\+M\\sqrt\{N/n\}\): the first term is the approximation error \(model too small to resolve the data measure\) and the second is the estimation error \(data too scarce to pin down the measure\)\. At fixed dataset sizen=nmaxn=n\_\{\\max\}\(the internet fully crawled\), the estimation floorM​N/nmaxM\\sqrt\{N/n\_\{\\max\}\}cannot be reduced by adding more data\. The bottleneck shifts from measure coverage to two model\-side quantities: widthNN\(more neurons approximate the measure more finely, at rateN−1/deffN^\{\-1/d\_\{\\mathrm\{eff\}\}\}\) and intrinsic dimensiondeffd\_\{\\mathrm\{eff\}\}\(lower dimension means fewer neurons are needed for a given accuracy\)\. Data scaling is therefore not*over*but*transformed*: the gain from doublingnnbecomes negligible oncen≫N\(deff\+2\)/deffn\\gg N^\{\(d\_\{\\mathrm\{eff\}\}\+2\)/d\_\{\\mathrm\{eff\}\}\}, at which point architectural inductive bias \(reducingdeffd\_\{\\mathrm\{eff\}\}\) dominates\.

#### Alignment\.

Proposition[D\.1](https://arxiv.org/html/2605.28983#A4.Thmtheorem1)identifies training as the selection of the risk\-minimizing initial\-value problem: the lossℛ​\[μ\]\\mathcal\{R\}\[\\mu\]determines which initial dataggthe network encodes\. Alignment is thus the problem of designingℛ\\mathcal\{R\}so the risk\-minimizing IVP encodes human\-valued initial data\. The hallucination result \(Proposition[8\.5](https://arxiv.org/html/2605.28983#S8.Thmtheorem5)\) identifies the structural failure mode: in out\-of\-distribution regions whereΔ​\(x\)/ε≫log⁡N\\Delta\(x\)/\\varepsilon\\gg\\log N, the output is exponentially close to the dominant neuron’s linear extrapolation and is structurally ungoverned byℛ\\mathcal\{R\}\. No loss function can align behavior in regions outside the support of the training measure, regardless of its form\. Three interventions follow directly from the framework\.*Data coverage*: expanding the support ofμ\\mu\(diverse training data\) is the primary lever, since every in\-distribution point is governed byℛ\\mathcal\{R\}\.*Viscosity control*: increasingε\\varepsilonsmooths the OOD extrapolation \(the Hessian bound of Corollary[8\.2](https://arxiv.org/html/2605.28983#S8.Thmtheorem2)ensures the output changes slowly withε\\varepsilon\), reducing confidence on unfamiliar inputs\.*OOD penalties*: augmentingℛ\\mathcal\{R\}with penalties at synthetic OOD points moves those points into the training support\. The measure\-poisoning analysis \(Proposition[E\.3](https://arxiv.org/html/2605.28983#A5.Thmtheorem3)\) shows that adversarial neuron insertion shifts the output bysoftplusε​\(z0−fεN\)\\mathrm\{softplus\}\_\{\\varepsilon\}\(z\_\{0\}\-f\_\{\\varepsilon\}^\{N\}\), with the viscosity parameterε\\varepsiloncontrolling the damage ceiling\.
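The measure-poisoning shift quoted above follows from a one-line identity: appending a logit $z_{0}$ to an LSE output $f$ gives $\varepsilon\log(e^{f/\varepsilon}+e^{z_{0}/\varepsilon})=f+\mathrm{softplus}_{\varepsilon}(z_{0}-f)$. The sketch below checks this numerically; the logits, the inserted value $z_{0}$, and the helper names are illustrative assumptions, and $\mathrm{softplus}_{\varepsilon}(s)$ is taken to mean $\varepsilon\log(1+e^{s/\varepsilon})$.

```python
import math

def lse(z, eps):
    """eps * log(sum_j exp(z_j / eps)), computed stably."""
    m = max(z)
    return m + eps * math.log(sum(math.exp((zj - m) / eps) for zj in z))

def softplus_eps(s, eps):
    """eps * log(1 + exp(s / eps)), computed stably."""
    return max(s, 0.0) + eps * math.log1p(math.exp(-abs(s) / eps))

# Hypothetical clean logits and one adversarially inserted neuron.
z = [0.4, -0.3, 1.1]
z0 = 2.5
eps = 0.5

f_clean = lse(z, eps)
f_poisoned = lse(z + [z0], eps)
shift = f_poisoned - f_clean
print(shift, softplus_eps(z0 - f_clean, eps))
```

The two printed values agree to floating-point precision, so the size of the shift depends on the inserted neuron only through $z_{0}-f_{\varepsilon}^{N}$ and $\varepsilon$.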

#### Continual learning\.

A trained network encodes the discrete measure $\mu_{N}=\frac{1}{N}\sum_{j}\delta_{y_{j}}$. Learning a new task corresponds to adding new support points to $\mu_{N}$; catastrophic forgetting corresponds to the displacement of existing support points. When gradient descent on new data overwrites existing neurons, the coverage of $\mu_{N}$ for the old task shrinks. The quadrature bound of Theorem [7.1](https://arxiv.org/html/2605.28983#S7.Thmtheorem1)(iii) quantifies the damage: if $k$ of the $N$ neurons previously covering task 1 are reassigned, the approximation error for task 1 grows from $O(N^{-1/d_{\mathrm{eff}}})$ to $O((N-k)^{-1/d_{\mathrm{eff}}})$. The framework prescribes a principled remedy: allocate additional neurons for new tasks rather than displacing old ones, growing $N$ proportionally to the cumulative number of tasks. Parameter-isolation methods (freezing neurons after learning) correspond to fixing the support points $\{y_{j}\}$ of the old task's measure; generative replay corresponds to re-introducing old support points into the training measure so that gradient descent does not displace them. Both strategies have the same PDE interpretation: preserving the coverage of $\mu_{N}$ for previously learned tasks.

## Appendix L Discussion and Limitations

### Limitations

#### Training dynamics and the selection problem\.

The abstract identifies training as “a search through Hamilton–Jacobi initial\-value problems”; this is precise for any fixed parameterization \(each trained network*is*an HJ initial datum, Theorem[4\.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)\), but the gradient\-step interpretation requires qualification\. The framework characterizes*what*a converged or optimally parameterized LSE network computes \(the Hopf–Cole solution of a viscous HJ equation under a discrete initial measure\) but not*how*stochastic gradient descent selects this initial\-value problem from data\. Proposition[D\.1](https://arxiv.org/html/2605.28983#A4.Thmtheorem1)\(Appendix[D](https://arxiv.org/html/2605.28983#A4)\) establishes a mean\-field correspondence: gradient flow on the two\-layer Gibbs energy functional converges to the optimal support measure under the Mei–Montanari conditions\[Meiet al\.,[2018](https://arxiv.org/html/2605.28983#bib.bib32)\]\. For finiteNN, Proposition[D\.3](https://arxiv.org/html/2605.28983#A4.Thmtheorem3)\(Appendix[D](https://arxiv.org/html/2605.28983#A4)\) shows that the neural tangent kernel of an LSE network takes the closed formKa​b=⟨π​\(xa\),π​\(xb\)⟩K\_\{ab\}=\\langle\\pi\(x\_\{a\}\),\\pi\(x\_\{b\}\)\\rangle— the inner product of Gibbs weight vectors — and is positive definite almost surely forN≥nN\\geq ngeneric support points, guaranteeing convergence to zero training error at a linear rate under MSE\. This connection is precise but narrow: it applies to two\-layer networks in the infinite\-width mean\-field limit, with i\.i\.d\. Gaussian initialization and a specific loss functional\. For deep networks \(L≥2L\\geq 2\), practical mini\-batch SGD, and non\-Gaussian data, the training trajectory is not characterized by the current framework\. 
The Pontryagin Maximum Principle interpretation of backpropagation \(Theorem[8\.4](https://arxiv.org/html/2605.28983#S8.Thmtheorem4)\) describes the gradient of the loss with respect to the initial datum, but which initial datum gradient descent converges to, among the many local minima of the HJ landscape, remains open\.

#### Non\-quadratic Hamiltonians\.

The exact Hopf–Cole identity \(Theorem[4\.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)\) holds if and only if the Hamiltonian is quadratic,H​\(p\)=\|p\|2H\(p\)=\|p\|^\{2\}or its anisotropic extensionp⊤​Aθ​pp^\{\\top\}A\_\{\\theta\}p\(Theorem[4\.4](https://arxiv.org/html/2605.28983#S4.Thmtheorem4)\); the quadratic class is maximal for exact identities \(Theorem[4\.3](https://arxiv.org/html/2605.28983#S4.Thmtheorem3)\)\. For non\-quadraticHH, the framework provides two partial results: \(i\) a convex\-envelope approximationuε,H≈uε,H∗∗u\_\{\\varepsilon,H\}\\approx u\_\{\\varepsilon,H^\{\*\*\}\}with explicitL∞L^\{\\infty\}error bounds controlled by the convexity gapH−H∗∗H\-H^\{\*\*\}\(Appendix[H](https://arxiv.org/html/2605.28983#A8)\); and \(ii\) structural correspondences for specific activations \(ReLU, GELU, SiLU, sigmoid; Table[6](https://arxiv.org/html/2605.28983#A8.T6)\) that identify the Hamiltonian but do not give an exact closed\-form identity\. Which activation functions admit an exact Hopf–Cole identity for someHH, and whether any do beyond the quadratic class, is open\.

#### Curse of dimensionality\.

The rateO​\(N−1/d\)O\(N^\{\-1/d\}\)is minimax\-optimal for Lipschitz functions indddimensions \(Theorem[8\.1](https://arxiv.org/html/2605.28983#S8.Thmtheorem1)\), so the framework provides no escape from the curse: achieving errorδ\\deltarequiresN≍δ−dN\\asymp\\delta^\{\-d\}neurons\. The intrinsic\-dimension hypothesis \(Assumption[8\.7](https://arxiv.org/html/2605.28983#S8.Thmtheorem7)\) replacesddwithdeff≪dd\_\{\\mathrm\{eff\}\}\\ll dwhen data concentrates near a submanifold, butdeff=1/αd\_\{\\mathrm\{eff\}\}=1/\\alphainferred from scaling exponents is an empirical estimate, not a certified bound on the support geometry\.

#### Integrable hierarchy: partial connection\.

The Kadomtsev\-Petviashvili \(KP\) hierarchy is a classical family of integrable nonlinear wave equations encompassing KdV, the Toda lattice, and related systems; exact solutions are organized by tau\-functions, scalar generating functions whose logarithm recovers the physical wave field\. The elementary solutions are solitons: stable, localized wave packets characterized by a wavenumberkjk\_\{j\}, which travel without dispersing and interact by elastic scattering, each two\-soliton collision producing only a phase shift encoded by an interaction factorAi​jA\_\{ij\}\. Proposition[M\.1](https://arxiv.org/html/2605.28983#A13.Thmtheorem1)places theNN\-neuron LSE partition function in the free\-soliton sector of this hierarchy: each neuron contributes an independent mode with wavenumberkjk\_\{j\}, but the nontrivial Hirota phase shifts that characterize genuineNN\-soliton interactions are absent\. The fullNN\-soliton tau\-function requires pairwise interaction factorsAi​j=\(\(ki−kj\)/\(ki\+kj\)\)2A\_\{ij\}=\(\(k\_\{i\}\-k\_\{j\}\)/\(k\_\{i\}\+k\_\{j\}\)\)^\{2\}; whether trained weights reproduce these \(i\.e\., whether gradient descent learns soliton scattering or only the free\-particle sector\) remains open, and a positive answer would connect weight optimization to the inverse scattering transform\.

#### Architectural scope\.

The exact Hopf–Cole correspondence covers LSE-activated feedforward networks and the specific $L^2$-normalized attention mechanism of Proposition [4.2](https://arxiv.org/html/2605.28983#S4.Thmtheorem2). Several practically important architectures lie outside the exact framework: (i) *Multi-head attention* splits the query–key–value space into $h$ heads; the HJ interpretation applies to each head independently, but the concatenation and projection step does not have a clean HJ formulation. (ii) *Causal (autoregressive) masking* restricts the softmax to a triangular attention pattern; this breaks the spatial isotropy of the Gaussian heat kernel, and the Hopf–Cole solution no longer applies without modification. (iii) *ReLU attention and linear attention* replace the softmax with a different normalization; the resulting “Hamiltonian” is non-quadratic and falls under the non-quadratic limitation above. (iv) *GELU and SiLU* activations yield structural correspondences (the Hamiltonian is identified but not quadratic) rather than exact identities; the error relative to the nearest exact LSE network is bounded but not zero.

#### Extended discussion of related work.

Joint-embedding predictive architectures (JEPA) \[LeCun, [2022](https://arxiv.org/html/2605.28983#bib.bib41)\] fit the same structure. Each encoder is a HJ semigroup evaluation (Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1)); the predictor is a transport map between two HJ solutions. At $\varepsilon = 0$ the optimal predictor is the Brenier map characterized by the Hopf–Lax formula (Theorem [5.1](https://arxiv.org/html/2605.28983#S5.Thmtheorem1)); at $\varepsilon > 0$ entropic regularization replaces Hopf–Lax with the Hopf–Cole dual potential. The EMA target encoder and stop-gradient carry the structure of a mean-field game over the embedding space.

What the HJ correspondence provides is the language in which these perspectives are seen to be aspects of one mathematical object, parameterized by the Maslov–Litvinov deformation $\varepsilon$ \[Litvinov, [2007](https://arxiv.org/html/2605.28983#bib.bib2)\]: the passage from the smooth arithmetic semiring to the tropical one. The framework does not replace existing perspectives; it locates them. The parameter $\varepsilon$ plays three simultaneous roles (softmax temperature, PDE viscosity, and proximal regularization strength), made exact by Theorem [7.1](https://arxiv.org/html/2605.28983#S7.Thmtheorem1); the robustness bound of Corollary [8.2](https://arxiv.org/html/2605.28983#S8.Thmtheorem2), the optimal viscosity $\varepsilon^{*} \asymp N^{-1/d_{\mathrm{eff}}}$, and the bifurcation landscape of Theorem [8.9](https://arxiv.org/html/2605.28983#S8.Thmtheorem9) are all consequences.
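The softmax-temperature role of $\varepsilon$ can be illustrated with a short sketch. It assumes, for illustration only, that the attribution weights take the form $\pi_j \propto \exp\bigl(-(g(y_j) + |x - y_j|^2/4t)/\varepsilon\bigr)$ in one dimension; the function names, centers, and values of $g$ are hypothetical and not taken from the paper.

```python
import math


def attribution_weights(x, t, eps, centers, g_values):
    """Softmax weights pi_j with temperature eps (assumed form, d = 1).

    Energies are g(y_j) + (x - y_j)**2 / (4 t); the minimum energy is
    subtracted before exponentiating for numerical stability.
    """
    energies = [g + (x - y) ** 2 / (4.0 * t) for y, g in zip(centers, g_values)]
    e_min = min(energies)
    unnorm = [math.exp(-(e - e_min) / eps) for e in energies]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


if __name__ == "__main__":
    centers = [-1.0, 0.0, 2.0]   # hypothetical y_j
    g_values = [0.0, 0.3, 0.1]   # hypothetical g(y_j)
    for eps in (0.01, 0.1, 1.0, 10.0):
        pi = attribution_weights(0.5, 1.0, eps, centers, g_values)
        print(f"eps = {eps:5.2f}  entropy = {entropy(pi):.4f}")
```

Small $\varepsilon$ concentrates the weights on the minimizing center (the tropical regime), while large $\varepsilon$ spreads them toward the uniform distribution with entropy $\log N$.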

## Appendix M Neural Networks as KP Tau-Functions

###### Proposition M.1 (LSE partition function as KP tau-function).

Let $k_1, \ldots, k_N \in \mathbb{R}$, $a_j > 0$, and define

$$\tau(x_1, x_2, x_3) = \sum_{j=1}^{N} a_j \exp\bigl(k_j x_1 + k_j^2 x_2 + k_j^3 x_3\bigr). \tag{41}$$

1. (i) \[KP hierarchy.\] $\tau$ satisfies the Hirota bilinear equation

   $$(D_1^4 + 3D_2^2 - 4D_1 D_3)\,\tau \cdot \tau = 0, \tag{42}$$

   where $D_i^m f \cdot g = (\partial_a - \partial_{a'})^m f(a)\,g(a')\big|_{a'=a}$. It is therefore a tau-function of the KP hierarchy (free soliton sector, $N$ components).
2. (ii) \[Neural network identification.\] For $d = 1$, set $x_1 = x/\varepsilon$, $x_2 = -1/(4t\varepsilon)$, $x_3 = 0$, $k_j = y_j$, $a_j = e^{-g(y_j)/\varepsilon}$. Then

   $$u_\varepsilon^N(x, t) \;=\; \frac{x^2}{4t} - \varepsilon \log \tau\bigl(x/\varepsilon,\; -1/(4t\varepsilon),\; 0\bigr), \tag{43}$$

   so the LSE layer output is the logarithm of a KP tau-function evaluated at specific KP flow times.
3. (iii) \[Tropical limit = BBS.\] As $\varepsilon \to 0$, $\varepsilon \log \tau \to \max_j (k_j x_1)$, the tropical tau-function corresponding to the Box-Ball System (BBS) soliton automaton, the ultradiscretization of the Toda lattice.

###### Proof\.

(i) By bilinearity of the Hirota operators,

$$(D_1^4 + 3D_2^2 - 4D_1 D_3)\,\tau \cdot \tau = \sum_{i,j} a_i a_j\, P(k_i, k_j)\, e^{\xi_i + \xi_j},$$

where $\xi_j = k_j x_1 + k_j^2 x_2 + k_j^3 x_3$ and

$$P(k_i, k_j) \;=\; (k_i - k_j)^4 + 3(k_i^2 - k_j^2)^2 - 4(k_i - k_j)(k_i^3 - k_j^3).$$

Diagonal terms ($i = j$): $P(k_i, k_i) = 0$. For $i \neq j$, factor $(k_i^2 - k_j^2) = (k_i - k_j)(k_i + k_j)$ and $(k_i^3 - k_j^3) = (k_i - k_j)(k_i^2 + k_i k_j + k_j^2)$:

$$P(k_i, k_j) = (k_i - k_j)^2 \bigl[(k_i - k_j)^2 + 3(k_i + k_j)^2 - 4(k_i^2 + k_i k_j + k_j^2)\bigr].$$

Expanding the bracket: $(1 + 3 - 4)k_i^2 + (-2 + 6 - 4)k_i k_j + (1 + 3 - 4)k_j^2 = 0$. Hence $P = 0$ for all $i, j$.
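The cancellation can also be checked mechanically. The sketch below is illustrative and not part of the paper (the name `hirota_symbol` is hypothetical): it evaluates $P(k_i, k_j)$ in exact rational arithmetic on a grid. Because $P$ has degree at most 4 in each variable, vanishing on a $21 \times 21$ product grid implies that it vanishes identically.

```python
from fractions import Fraction
from itertools import product


def hirota_symbol(ki, kj):
    """P(k_i, k_j) = (ki-kj)^4 + 3(ki^2-kj^2)^2 - 4(ki-kj)(ki^3-kj^3)."""
    return (
        (ki - kj) ** 4
        + 3 * (ki**2 - kj**2) ** 2
        - 4 * (ki - kj) * (ki**3 - kj**3)
    )


# Exact rationals avoid floating-point round-off in the cancellation.
samples = [Fraction(n, 7) for n in range(-10, 11)]  # 21 distinct points

all_zero = all(hirota_symbol(a, b) == 0 for a, b in product(samples, repeat=2))
print("P vanishes on the whole grid:", all_zero)
```

Using `Fraction` rather than floats means the check is an exact equality test, not a tolerance comparison.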

(ii) Substituting into $\tau$:

$$\sum_j e^{-g(y_j)/\varepsilon}\, e^{y_j x/\varepsilon - y_j^2/(4t\varepsilon)} = e^{x^2/(4t\varepsilon)} \sum_j e^{-(g(y_j) + |x - y_j|^2/4t)/\varepsilon},$$

so $-\varepsilon \log \tau = -x^2/(4t) + u_\varepsilon^N(x, t)$ by Theorem [4.1](https://arxiv.org/html/2605.28983#S4.Thmtheorem1).

(iii) By Laplace’s method (Varadhan’s lemma): $\varepsilon \log \tau \to \max_j (k_j x_1)$. This is the tropical tau-function; the Toda $\to$ BBS ultradiscretization \[Tokihiro et al., [1996](https://arxiv.org/html/2605.28983#bib.bib5)\] gives the BBS soliton automaton as its combinatorial realization. ∎
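The tropical limit in (iii) can be observed numerically. The following sketch makes several assumptions that are not spelled out in the statement: the first flow time is scaled as $x/\varepsilon$ (as in part (ii)), $x_2 = x_3 = 0$, and the amplitudes $a_j$ are held fixed as $\varepsilon \to 0$. Function names and sample values are hypothetical.

```python
import math


def eps_log_tau(x, eps, k, a):
    """eps * log( sum_j a_j * exp(k_j * x / eps) ), computed stably.

    Assumes the scaling x_1 = x / eps with x_2 = x_3 = 0 and fixed a_j > 0.
    The largest exponent is factored out to avoid overflow for small eps.
    """
    exponents = [math.log(aj) + kj * x / eps for kj, aj in zip(k, a)]
    m = max(exponents)
    return eps * (m + math.log(sum(math.exp(e - m) for e in exponents)))


def tropical_tau(x, k):
    """Tropical (max-plus) limit max_j k_j * x."""
    return max(kj * x for kj in k)


if __name__ == "__main__":
    k = [-1.0, 0.5, 2.0]   # illustrative wavenumbers
    a = [0.5, 2.0, 1.0]    # illustrative fixed amplitudes
    x = -1.5
    for eps in (1.0, 0.1, 0.01, 0.001):
        gap = eps_log_tau(x, eps, k, a) - tropical_tau(x, k)
        print(f"eps = {eps:6.3f}  gap to max_j k_j x = {gap:+.6f}")
```

The gap shrinks in proportion to $\varepsilon$, which is the log-sum-exp to max passage behind the tropical tau-function.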
