A Link between Shock-wave Theory and Symmetry-reduced Stochastic Gradient Descent for Artificial Neural Networks

arXiv cs.LG Papers

Summary

This paper establishes a mathematically rigorous connection between shock-wave theory and symmetry-quotiented learning dynamics of stochastic gradient descent, showing that after symmetry reduction and coarse-graining, the dynamics satisfy viscous Hamilton-Jacobi and Burgers-type equations with shock formation times controlled by loss curvature.

arXiv:2606.18303v1 Announce Type: new Abstract: We develop a mathematically explicit link between shock-wave theory and the symmetry-quotiented learning dynamics of stochastic gradient descent, drawing on differential geometry, Lie group theory, and fluid mechanics. Specifically, after quotienting parameter symmetries and applying local-entropy coarse-graining, the effective dynamics satisfy a viscous Hamilton--Jacobi equation on the quotient manifold. Moreover, under the assumption that the raw parameter dynamics can be summarized by a gradient field on the quotiented space, the gradient of the coarse-grained loss function obeys a Burgers-type equation, and shock formation can be established rigorously. We apply our theory to multilayer perceptrons, convolutional neural networks, Transformers, and mean-field networks, and show that they obey the Hamilton--Jacobi or Burgers-type equations. We conjecture that this framework also yields practical diagnostics for deep learning. In architectures such as Transformers, raw parameter norms are often distorted by symmetry redundancy and may therefore be misleading, whereas symmetry-corrected quotient observables provide a principled basis for monitoring, forecasting, and controlling training-phase transitions.

Cached at: 06/18/26, 05:40 AM

# A Link between Shock-wave Theory and Symmetry-reduced Stochastic Gradient Descent for Artificial Neural Networks
Source: [https://arxiv.org/html/2606.18303](https://arxiv.org/html/2606.18303)
NEC Corporation. Email: miyagawataik@nec.com

###### Abstract

We develop a mathematically explicit link between shock-wave theory and the symmetry-quotiented learning dynamics of stochastic gradient descent, drawing on differential geometry, Lie group theory, and fluid mechanics. Specifically, after quotienting parameter symmetries and applying local-entropy coarse-graining, the effective dynamics satisfy a viscous Hamilton–Jacobi equation on the quotient manifold. Moreover, under the assumption that the raw parameter dynamics can be summarized by a gradient field on the quotiented space, the gradient of the coarse-grained loss function obeys a Burgers-type equation, and shock formation can be established rigorously. We apply our theory to multilayer perceptrons, convolutional neural networks, Transformers, and mean-field networks, and show that they obey the Hamilton–Jacobi or Burgers-type equations. We conjecture that this framework also yields practical diagnostics for deep learning. In architectures such as Transformers, raw parameter norms are often distorted by symmetry redundancy and may therefore be misleading, whereas symmetry-corrected quotient observables provide a principled basis for monitoring, forecasting, and controlling training-phase transitions.

## 1 Introduction

Combining the following insights, we propose a correspondence between shock-wave theory and symmetry-quotiented learning dynamics of stochastic gradient descent (SGD). Shock waves in fluid mechanics are governed by nonlinear transport, loss of classical regularity, and weak-solution selection by entropy conditions. Deep learning, by contrast, is usually formulated as a high-dimensional stochastic optimization problem. Several established mathematical facts suggest a principled bridge. First, positively homogeneous neural networks such as ReLU networks possess positive-rescaling and permutation symmetries, so physically meaningful observables often live on quotient spaces rather than in raw parameter coordinates \[[1](https://arxiv.org/html/2606.18303#bib.bib1),[2](https://arxiv.org/html/2606.18303#bib.bib2)\]. Second, discrete-time SGD admits continuous-time approximations in the form of stochastic modified equations and stochastic modified flows \[[4](https://arxiv.org/html/2606.18303#bib.bib4),[5](https://arxiv.org/html/2606.18303#bib.bib5)\]. Third, local-entropy relaxations of nonconvex losses are governed by viscous Hamilton–Jacobi equations \[[3](https://arxiv.org/html/2606.18303#bib.bib3)\]. Fourth, in wide-network limits, SGD induces diffusion equations rather than simple finite-dimensional ordinary differential equations \[[7](https://arxiv.org/html/2606.18303#bib.bib7),[8](https://arxiv.org/html/2606.18303#bib.bib8)\].

The purpose of this paper is to connect these ingredients in a single rigorous framework.[^1] We claim that symmetry quotient plus local-entropy coarse-graining naturally yields a viscous Hamilton–Jacobi equation on a quotient space. Furthermore, we prove that if a gradient field on the quotient space summarizes the dynamics on the raw parameter space (referred to in this paper as the closedness assumption for a one-dimensional collective coordinate), then the quotiented gradient field obeys a Burgers-type equation, with shock formation time controlled by the negative curvature of the coarse-grained loss function. This yields a precise mathematical reinterpretation of abrupt training-regime changes: in the quotient description, they appear as shock-type singularities or viscous shock layers in the coarse-grained average gradient.

[^1]: Our position is deliberately conservative: we do not claim that generic neural-network parameters satisfy Burgers' equation.

Beyond its mathematical correspondence with Hamilton–Jacobi and Burgers\-type equations, we conjecture that the present framework has a practical meaning for modern deep learning systems\. The main point is that the theory identifies which variables should be monitored, which quantities should be interpreted as early\-warning signals of regime change, and which hyperparameters act as control knobs for smoothing or sharpening such transitions\.

## 2 Preliminaries

### 2.1 Definition

![Refer to caption](https://arxiv.org/html/2606.18303v1/x1.png)

Figure 1: Notation.

Let $\Theta \subset \mathbb{R}^{d_{\Theta}}$ be a smooth parameter manifold, and let a Lie group or finite group $G$ act smoothly on $\Theta$. We assume that there is an open regular stratum $\Theta_{\mathrm{reg}} \subset \Theta$ on which the action is free and proper. Then the quotient $M := \Theta_{\mathrm{reg}}/G$ forms a smooth manifold, and the quotient map $\pi: \Theta_{\mathrm{reg}} \to M$ is a smooth submersion. In the finite-group case, the global quotient may be an orbifold, but on each principal stratum the local manifold picture is valid. Let $L: \Theta_{\mathrm{reg}} \to \mathbb{R}$ be a smooth empirical loss satisfying $L(g \cdot \theta) = L(\theta)$ for all $g \in G$ and $\theta \in \Theta_{\mathrm{reg}}$. Then $L$ descends to a smooth function, the effective potential $U: M \to \mathbb{R}$, such that $L = U \circ \pi$. We consider the stochastic iteration

$$\theta_{n+1} = \theta_{n} - \eta \bigl( \nabla L(\theta_{n}) + M_{n+1} \bigr),$$

where $\eta > 0$ is the learning rate and $(M_{n+1})_{n \geq 0}$ is a martingale-difference sequence adapted to a filtration $(\mathcal{F}_{n})_{n \geq 0}$, that is, $\mathbb{E}[M_{n+1} \mid \mathcal{F}_{n}] = 0$. We assume throughout that on compact subsets of $\Theta_{\mathrm{reg}}$, $\mathbb{E}\bigl[\lVert M_{n+1} \rVert^{3} \mid \mathcal{F}_{n}\bigr] \leq C$ for some local constant $C$.
We will repeatedly use the standard Hopf–Cole transform[^2] for viscous Burgers and viscous Hamilton–Jacobi equations, together with classical characteristic theory for inviscid Burgers; standard references include Evans and LeVeque \[[10](https://arxiv.org/html/2606.18303#bib.bib10),[11](https://arxiv.org/html/2606.18303#bib.bib11)\].

[^2]: The Hopf–Cole transform is a change of variables that transforms a special type of parabolic partial differential equation (PDE) with a quadratic nonlinearity into a linear heat equation.

### 2.2 Quotient Reduction

We first show that, under a local projectability assumption, the discrete-time SGD recursion descends in a quotient chart to a closed stochastic recursion whose drift and conditional covariance depend only on the quotient state. This is the basic reduction that allows us to replace the raw parameter dynamics by an effective dynamics on the symmetry quotient space.

###### Assumption 2.1 (Local projectability of drift and covariance)

Let $\chi: U \subset M \to \mathbb{R}^{m}$ be a smooth chart and set $\Phi := \chi \circ \pi$. Assume the trajectory remains in $\pi^{-1}(U)$ almost surely, and that $\Phi$ is $C^{3}$ on the relevant compact subset. Assume further that there exist locally bounded functions $b: \chi(U) \to \mathbb{R}^{m}$ and $A: \chi(U) \to \mathbb{R}^{m \times m}$ such that, almost surely,

$$D\Phi(\theta_{n})\,\nabla L(\theta_{n}) = b(Y_{n}) \quad \text{(referred to as the drift in this paper)}, \qquad Y_{n} := \Phi(\theta_{n}),$$

and $\mathrm{Cov}\bigl(D\Phi(\theta_{n}) M_{n+1} \mid \mathcal{F}_{n}\bigr) = A(Y_{n})$.[^3]

[^3]: Here, $D\Phi(\theta)$ denotes the derivative (Jacobian) of the quotient-chart map $\Phi$ at $\theta$, which linearly maps infinitesimal parameter-space displacements to quotient-coordinate displacements. Accordingly, $D\Phi(\theta_{n}) \nabla L(\theta_{n})$ is the gradient projected into the quotient coordinates, and $D\Phi(\theta_{n}) M_{n+1}$ is the noise projected into the quotient coordinates.

Assumption [2.1](https://arxiv.org/html/2606.18303#S2.Thmtheorem1) requires that, in a local quotient chart, the projected gradient drift[^4] $b(Y_{n})$ and the conditional covariance of the projected martingale noise depend only on the quotient state $Y_{n}$. Thus, the stochastic evolution closes at the level of the quotient variables: different representatives of the same symmetry orbit induce the same effective first- and second-order dynamics after projection.[^5]

[^4]: Drift is the deterministic mean component of a stochastic update, that is, the average direction of motion after averaging out the random fluctuations.

[^5]: ReLU networks are a natural class of models in which to consider Assumption 2.1, because they possess the relevant symmetries. However, they do not satisfy Assumption 2.1 automatically in full generality. On a regular stratum, they satisfy it locally if the projected drift and covariance close as functions of the quotient state only.
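The closure requirement can be probed numerically on a toy model. The following is a minimal sketch, not taken from the paper: it assumes a hypothetical two-parameter loss $L(a,u) = \tfrac{1}{2}(au - 1)^2$ with rescaling symmetry $(a,u) \mapsto (a/r, ru)$ and quotient coordinate $\Phi(a,u) = au$. All function names are illustrative.

```python
import numpy as np

def grad_loss(a, u):
    """Gradient of the toy loss L(a, u) = 0.5 * (a*u - 1)**2 (hypothetical model)."""
    r = a * u - 1.0
    return np.array([r * u, r * a])

def projected_drift(a, u):
    """D Phi(theta) grad L(theta) for the quotient coordinate Phi(a, u) = a*u."""
    d_phi = np.array([u, a])  # Jacobian of Phi at (a, u)
    return float(d_phi @ grad_loss(a, u))

def rescale(a, u, r):
    """Move along the symmetry orbit: (a, u) -> (a/r, r*u), r > 0."""
    return a / r, u * r

def balance(a, u):
    """Representative of the same orbit with |a| = |u| (balanced gauge)."""
    r = np.sqrt(abs(a) / abs(u))
    return rescale(a, u, r)

a, u = 2.0, 0.25
drift_raw = projected_drift(a, u)
drift_rescaled = projected_drift(*rescale(a, u, 2.0))
drift_balanced = projected_drift(*balance(a, u))
```

In this toy model the projected drift equals $(a^2 + u^2)(au - 1)$, so `drift_raw` and `drift_rescaled` differ even though both points share the same quotient state $Y = au$. On the balanced representative, $a^2 + u^2 = 2\lvert Y \rvert$ and the drift becomes a function of $Y$ alone, which is the kind of closure the assumption asks for.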

###### Theorem 2.2 (Local quotient reduction for discrete-time SGD)

Under Assumption [2.1](https://arxiv.org/html/2606.18303#S2.Thmtheorem1), there exist random variables $\Xi_{n+1}$ and $R_{n}$ such that

$$Y_{n+1} = Y_{n} - \eta\, b(Y_{n}) + \eta\, \Xi_{n+1} + \eta^{2} R_{n},$$

with $\mathbb{E}[\Xi_{n+1} \mid \mathcal{F}_{n}] = 0$, $\mathrm{Cov}(\Xi_{n+1} \mid \mathcal{F}_{n}) = A(Y_{n})$, and $R_{n}$ locally bounded in conditional expectation.

###### Proof

Write $\Delta_{n} := \theta_{n+1} - \theta_{n} = -\eta \bigl( \nabla L(\theta_{n}) + M_{n+1} \bigr)$. Taylor's theorem with integral remainder gives

$$\Phi(\theta_{n} + \Delta_{n}) = \Phi(\theta_{n}) + D\Phi(\theta_{n})[\Delta_{n}] + \frac{1}{2} D^{2}\Phi(\theta_{n})[\Delta_{n}, \Delta_{n}] + \mathcal{R}_{n+1},$$

where, for some random point on the segment joining $\theta_{n}$ and $\theta_{n} + \Delta_{n}$, $\lVert \mathcal{R}_{n+1} \rVert \leq C \lVert \Delta_{n} \rVert^{3}$. Since $\mathbb{E}[\lVert M_{n+1} \rVert^{3} \mid \mathcal{F}_{n}] \leq C$ locally, we have $\mathbb{E}[\lVert \Delta_{n} \rVert^{3} \mid \mathcal{F}_{n}] = O(\eta^{3})$ and hence $\mathbb{E}[\lVert \mathcal{R}_{n+1} \rVert \mid \mathcal{F}_{n}] = O(\eta^{3})$. Moreover, $D^{2}\Phi(\theta_{n})[\Delta_{n}, \Delta_{n}] = O_{L^{1}}(\eta^{2})$ locally, because $\Delta_{n} = O(\eta)$ in conditional $L^{2}$.

Therefore,

$$\begin{aligned}
Y_{n+1} - Y_{n} &= D\Phi(\theta_{n})[\Delta_{n}] + O_{L^{1}}(\eta^{2}) \\
&= -\eta\, D\Phi(\theta_{n})[\nabla L(\theta_{n})] - \eta\, D\Phi(\theta_{n})[M_{n+1}] + O_{L^{1}}(\eta^{2}).
\end{aligned}$$

Define $\Xi_{n+1} := -D\Phi(\theta_{n})[M_{n+1}]$. Because $D\Phi(\theta_{n})$ is $\mathcal{F}_{n}$-measurable and $\mathbb{E}[M_{n+1} \mid \mathcal{F}_{n}] = 0$, we obtain $\mathbb{E}[\Xi_{n+1} \mid \mathcal{F}_{n}] = 0$. By Assumption [2.1](https://arxiv.org/html/2606.18303#S2.Thmtheorem1), $D\Phi(\theta_{n})[\nabla L(\theta_{n})] = b(Y_{n})$ a.s., and $\mathrm{Cov}(\Xi_{n+1} \mid \mathcal{F}_{n}) = A(Y_{n})$. Collecting the $O_{L^{1}}(\eta^{2})$ terms into $\eta^{2} R_{n}$ yields the claimed recursion. ∎

If the chart is fixed and the coefficients are sufficiently regular, then the recursion above admits, with $\eta n \fallingdotseq t$, the standard weak continuous-time approximation

$$dY_{t} = -b(Y_{t})\,dt + \sqrt{\eta}\,\sigma(Y_{t})\,dB_{t}, \qquad \sigma\sigma^{\top} = A,$$

which is the symmetry-reduced analogue of stochastic modified equations and stochastic modified flows for SGD \[[4](https://arxiv.org/html/2606.18303#bib.bib4),[5](https://arxiv.org/html/2606.18303#bib.bib5)\]. This reduction is the foundation for the rest of the paper.
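As a sanity check on the scaling $\eta n \fallingdotseq t$, the discrete recursion can be compared with its continuous-time approximation in the simplest possible case. The sketch below assumes a hypothetical one-dimensional quotient state with linear drift $b(y) = y$ and constant covariance $A = \sigma^2$ (neither is taken from the paper). For this choice the SDE is an Ornstein–Uhlenbeck process with stationary variance $\eta\sigma^2/2$.

```python
import numpy as np

def simulate_recursion(eta, sigma, n_steps, n_chains, seed=0):
    """Iterate Y_{n+1} = Y_n - eta*b(Y_n) + eta*Xi_{n+1} with b(y) = y and
    Xi ~ N(0, sigma^2), for many independent chains started at Y_0 = 0.
    The eta^2 * R_n remainder of the reduced recursion is dropped here."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n_chains)
    for _ in range(n_steps):
        xi = sigma * rng.standard_normal(n_chains)
        y = y - eta * y + eta * xi
    return y

def sde_stationary_variance(eta, sigma):
    """Stationary variance of dY = -Y dt + sqrt(eta)*sigma dB (Ornstein-Uhlenbeck)."""
    return eta * sigma**2 / 2.0

eta, sigma = 0.05, 1.0
y_final = simulate_recursion(eta, sigma, n_steps=400, n_chains=20000)
ratio = y_final.var() / sde_stationary_variance(eta, sigma)
```

The exact stationary variance of the linear recursion is $\eta\sigma^2/(2-\eta)$, so `ratio` is expected to sit slightly above one and to approach one as $\eta \to 0$, in line with a weak approximation.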

### 2.3 Coarse-graining, Quotient Local Entropy

Let $(M, g)$ be a complete Riemannian manifold with the Laplace–Beltrami operator $\Delta_{M}$.[^6] Let $U: M \to \mathbb{R}$ be a smooth effective potential on the quotient, or equivalently a regularized effective loss function. Define the heat semigroup $P_{t} := e^{\frac{t}{2}\Delta_{M}}$. For viscosity parameter $\nu > 0$ and coarse-graining scale $\tau \geq 0$, define the local-entropy regularization

$$u^{\nu}(\tau, q) := -\nu \log \bigl( P_{\nu\tau} e^{-U/\nu}(q) \bigr), \qquad q \in M.$$

Coarse-graining refers to replacing the raw parameter dynamics by an effective dynamics obtained after symmetry quotienting and local-entropy smoothing on the symmetry quotient. The discrete SGD time $n$ and the continuous heat time $t$ are identified at the scaling level through $\eta n \fallingdotseq t = \nu\tau$, where $\nu$ denotes the viscosity, equivalently the coarse-graining scale, and $\tau$ is a continuous normalized coarse-graining parameter.

[^6]: $\Delta_{M}$ denotes the Laplace–Beltrami operator on the quotient Riemannian manifold $(M, g)$, i.e. the geometric Laplacian $\Delta_{M} f = \operatorname{div}_{g}(\operatorname{grad} f)$ governing diffusion on $M$.

## 3 Hamilton–Jacobi Equation

We propose that symmetry quotient plus local\-entropy coarse\-graining yields a viscous Hamilton–Jacobi equation on a quotient space:

###### Theorem 3.1 (Quotient Hamilton–Jacobi equation)

Assume $U \in C^{2}(M)$, assume $P_{t} e^{-U/\nu}$ is strictly positive for $t \geq 0$, and assume the function $w(\tau, q) := P_{\nu\tau} e^{-U/\nu}(q)$ belongs to $C^{1,2}((0, \infty) \times M)$ and solves the heat equation pointwise[^7]:

$$\partial_{\tau} w = \frac{\nu}{2} \Delta_{M} w, \qquad w(0, q) = e^{-U(q)/\nu}.$$

Then the function $u^{\nu}$ defined by $u^{\nu}(\tau, q) = -\nu \log w(\tau, q)$ solves

$$\partial_{\tau} u^{\nu} + \frac{1}{2} \lVert \mathrm{grad}\, u^{\nu} \rVert_{g}^{2} = \frac{\nu}{2} \Delta_{M} u^{\nu}, \qquad u^{\nu}(0, q) = U(q).$$

[^7]: If $P_{t}$ is the heat semigroup generated by $\frac{1}{2}\Delta_{M}$ and the required regularity holds, then $w(\tau, q) = P_{\nu\tau} e^{-U/\nu}(q)$ solves $\partial_{\tau} w = \frac{\nu}{2} \Delta_{M} w$ automatically.

###### Proof

Because $w > 0$, the logarithm is well defined. By differentiation,

$$\partial_{\tau} u^{\nu} = -\nu \frac{\partial_{\tau} w}{w} = -\frac{\nu^{2}}{2} \frac{\Delta_{M} w}{w}.$$

Also,

$$\mathrm{grad}\, u^{\nu} = -\nu \frac{\mathrm{grad}\, w}{w}, \qquad \lVert \mathrm{grad}\, u^{\nu} \rVert_{g}^{2} = \nu^{2} \frac{\lVert \mathrm{grad}\, w \rVert_{g}^{2}}{w^{2}}.$$

Using the identity

$$\Delta_{M}(\log w) = \frac{\Delta_{M} w}{w} - \frac{\lVert \mathrm{grad}\, w \rVert_{g}^{2}}{w^{2}},$$

we obtain

$$\Delta_{M} u^{\nu} = -\nu \frac{\Delta_{M} w}{w} + \nu \frac{\lVert \mathrm{grad}\, w \rVert_{g}^{2}}{w^{2}}.$$

Therefore,

$$\partial_{\tau} u^{\nu} + \frac{1}{2} \lVert \mathrm{grad}\, u^{\nu} \rVert_{g}^{2} - \frac{\nu}{2} \Delta_{M} u^{\nu} = 0.$$

The initial condition follows from $P_{0} = \mathrm{Id}$. ∎

Thm. [3.1](https://arxiv.org/html/2606.18303#S3.Thmtheorem1) shows that, after symmetry quotienting, quotient local entropy is not merely a heuristic smoothing of the loss, but exactly the viscous Hamilton–Jacobi evolution of the effective potential $U$ on the quotient manifold $M$. Crucially, the nonlinear term $\frac{1}{2} \lVert \mathrm{grad}\, u^{\nu} \rVert_{g}^{2}$ is the mechanism that drives characteristic steepening: in the small-viscosity regime, it tends to sharpen gradients and compress information into increasingly narrow transition layers. Thus, Thm. [3.1](https://arxiv.org/html/2606.18303#S3.Thmtheorem1) already contains the mathematical precursor of shock formation. The viscosity term $\frac{\nu}{2} \Delta_{M} u^{\nu}$ does not remove this mechanism; rather, it regularizes it, replacing discontinuous shocks by thin viscous shock layers. This is precisely why the quotient Hamilton–Jacobi equation is the correct entry point to the shock-wave interpretation of symmetry-reduced SGD.

The benefit of the theorem is threefold. First, it gives a mathematically controlled effective landscape on which sharp geometric features are regularized by a viscosity parameter $\nu$. Second, it makes available the standard Hamilton–Jacobi toolbox, including Hopf–Cole linearization, semigroup methods, comparison arguments, and small-viscosity asymptotics, for the analysis of symmetry-quotiented learning dynamics. Third, it provides the precise entry point to the Burgers-type description developed later: once a one-dimensional collective coordinate closes the reduced dynamics, the gradient of $u^{\nu}$ inherits a viscous transport structure whose steepening and possible shock-layer formation can be analyzed quantitatively.
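The Hopf–Cole identity of Thm. 3.1 can be checked numerically in the simplest geometry. The sketch below assumes the flat one-dimensional torus $[0, 2\pi)$ as a stand-in for the quotient manifold, so that $\Delta_M = \partial_x^2$ and the heat semigroup $P_t = e^{\frac{t}{2}\Delta_M}$ is a Fourier multiplier. The potential $U(x) = \cos x$ and all function names are illustrative choices, not taken from the paper.

```python
import numpy as np

N = 256
x = 2 * np.pi * np.arange(N) / N      # grid on the flat torus [0, 2*pi)
k = np.fft.fftfreq(N, d=1.0 / N)      # integer wavenumbers

def heat_semigroup(phi, t):
    """P_t phi = exp((t/2) * d^2/dx^2) phi, applied as a Fourier multiplier."""
    return np.real(np.fft.ifft(np.exp(-0.5 * t * k**2) * np.fft.fft(phi)))

def local_entropy(U, nu, tau):
    """u^nu(tau, .) = -nu * log(P_{nu*tau} exp(-U/nu))."""
    return -nu * np.log(heat_semigroup(np.exp(-U / nu), nu * tau))

def ddx(f, order=1):
    """Spectral derivative of a smooth periodic function."""
    return np.real(np.fft.ifft((1j * k) ** order * np.fft.fft(f)))

def hj_residual(U, nu, tau, h=1e-4):
    """Residual of d_tau u + 0.5*(d_x u)^2 - (nu/2)*d_xx u, using a central
    finite difference in tau with step h."""
    u = local_entropy(U, nu, tau)
    u_tau = (local_entropy(U, nu, tau + h) - local_entropy(U, nu, tau - h)) / (2 * h)
    return u_tau + 0.5 * ddx(u) ** 2 - 0.5 * nu * ddx(u, 2)

U = np.cos(x)
residual = hj_residual(U, nu=0.5, tau=0.3)
```

If the identity holds, `np.max(np.abs(residual))` should be limited only by the finite-difference error in $\tau$, and `local_entropy(U, nu, 0.0)` should reproduce `U`, matching the initial condition $u^{\nu}(0, q) = U(q)$.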

### 3.1 Numerical illustration: Hopf–Cole shock layer in a quotient ReLU model

We include a small numerical experiment to verify that the quotient Hopf–Cole quantity introduced above produces a shock\-like transition layer in an explicitly symmetry\-reduced ReLU model\. The purpose of the experiment is not to optimize predictive performance on a benchmark dataset, but to test the geometric mechanism in a setting where the quotient coordinates and the quotient Laplace–Beltrami operator can be computed directly\.

##### Model and quotient coordinates\.

We use a one-hidden-layer ReLU network

$$f(x) = \sum_{j=1}^{m} a_{j}\, \sigma(u_{j}^{\top} \widetilde{x}) + c, \qquad \widetilde{x} = (x, 1), \qquad \sigma(z) = \max\{z, 0\},$$

with input dimension one and hidden width $m = 2$. The ReLU positive rescaling symmetry

$$(a_{j}, u_{j}) \sim (a_{j}/r_{j},\ r_{j} u_{j}), \qquad r_{j} > 0,$$

is fixed by the balanced representative $|a_{j}| = \|u_{j}\|$. Equivalently, the quotient coordinates are

$$\gamma_{j} = a_{j} \|u_{j}\|, \qquad s_{j} = \frac{u_{j}}{\|u_{j}\|} \in S^{1}, \qquad c \in \mathbb{R}.$$

In these coordinates the quotient network is written as

$$f_{Q}(x; \gamma, s, c) = \sum_{j=1}^{m} \gamma_{j}\, \sigma(s_{j}^{\top} \widetilde{x}) + c.$$

For the binary classification loss $U$, we use the empirical binary cross-entropy

$$U(\gamma, s, c) = \frac{1}{N} \sum_{i=1}^{N} \ell_{\mathrm{BCE}} \bigl( f_{Q}(x_{i}; \gamma, s, c), y_{i} \bigr). \tag{1}$$
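The quotient coordinates above can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the network output is assumed to be a logit fed to a sigmoid inside the binary cross-entropy, which the text does not specify.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def raw_network(x, a, u, c):
    """f(x) = sum_j a_j * relu(u_j . (x, 1)) + c for a batch of scalar inputs x.
    Shapes: x (N,), a (m,), u (m, 2)."""
    x_tilde = np.stack([x, np.ones_like(x)], axis=1)  # (N, 2)
    return relu(x_tilde @ u.T) @ a + c

def to_quotient(a, u):
    """Map (a_j, u_j) to gamma_j = a_j * ||u_j|| and s_j = u_j / ||u_j||."""
    norms = np.linalg.norm(u, axis=1)
    return a * norms, u / norms[:, None]

def quotient_network(x, gamma, s, c):
    """f_Q(x) = sum_j gamma_j * relu(s_j . (x, 1)) + c (same functional form)."""
    return raw_network(x, gamma, s, c)

def bce_loss(logits, y):
    """Mean binary cross-entropy, assuming the network outputs logits."""
    return float(np.mean(np.logaddexp(0.0, logits) - y * logits))

rng = np.random.default_rng(0)
a, u, c = rng.normal(size=2), rng.normal(size=(2, 2)), 0.3
x = np.linspace(-2.0, 2.0, 9)
gamma, s = to_quotient(a, u)
```

By positive homogeneity of the ReLU, `quotient_network(x, gamma, s, c)` agrees with `raw_network(x, a, u, c)`, and `to_quotient` returns the same `(gamma, s)` for every rescaled copy `(a / r, u * r[:, None])` with `r > 0`, which is what makes $(\gamma, s, c)$ coordinates on the quotient.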

##### Experimental setting\.

The dataset is a deterministic one-dimensional step classification problem. We sample $N = 96$ points in $[-2, 2]$, add small Gaussian jitter with standard deviation $0.035$, and assign labels $y_{i} = \mathbf{1}_{\{x_{i} > 0\}}$. The balanced gauge-fixed ReLU model is trained for $2000$ epochs using SGD with learning rate $0.035$. After training, the final loss is $2.8620 \times 10^{-2}$, the training accuracy is $1.0000$, and the maximum balanced-gauge error is $2.384 \times 10^{-7}$ (the balanced gauge remains fixed).

We then evaluate the quotient Hopf–Cole profile on a one-dimensional slice of the learned quotient coordinates. Specifically, the first spherical direction $s_{1} \in S^{1}$ is rotated by an angle $\theta \in [-1.35, 1.35]$, while all other quotient coordinates are held fixed. On this slice we compute

$$w(\tau, \theta) = \exp\!\bigl( \nu\tau\Delta_{Q} \bigr) \exp\!\left( -\frac{U(\theta)}{\nu} \right), \qquad u^{\nu}(\tau, \theta) = -\nu \log w(\tau, \theta), \tag{2}$$

where $\Delta_{Q}$ is the quotient Laplace–Beltrami operator associated with the balanced quotient metric. Numerically, the heat semigroup is approximated to first order,

$$w(\tau, \theta) \approx \phi^{\nu}(\theta) + \nu\tau\Delta_{Q}\phi^{\nu}(\theta), \qquad \phi^{\nu}(\theta) = \exp\!\left( -\frac{U(\theta)}{\nu} \right). \tag{3}$$

We use viscosities $\nu \in \{0.05, 0.1, 0.5\}$ and heat-flow times $\tau \in \{0, 0.006, 0.012, 0.05, 0.1\}$. For each pair $(\nu, \tau)$, we plot $u^{\nu}$, its first derivative $\partial_{\theta} u^{\nu}$, and its second derivative $\partial_{\theta}^{2} u^{\nu}$. Derivatives are computed by finite differences on a uniform grid of $91$ points in $\theta$.
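A rough sketch of the first-order evaluation in Eq. (3) follows. Two things here are assumptions rather than details from the paper: the quotient Laplace–Beltrami operator $\Delta_Q$ is replaced by the flat second derivative $\partial_\theta^2$ along the slice, and the loss profile $U(\theta)$ is a hypothetical smooth step rather than the trained network's loss. The function names are illustrative.

```python
import numpy as np

def first_order_profile(U, theta, nu, tau):
    """u = -nu * log(phi + nu*tau*Lap(phi)) with phi = exp(-U/nu), as in Eq. (3).
    Lap is approximated by the flat second derivative in theta (an assumption;
    the balanced quotient metric is not reproduced here)."""
    phi = np.exp(-U / nu)
    lap_phi = np.gradient(np.gradient(phi, theta), theta)
    w = phi + nu * tau * lap_phi
    if np.any(w <= 0):
        raise ValueError("first-order approximation gave w <= 0; reduce tau")
    return -nu * np.log(w)

def slice_derivatives(u, theta):
    """First and second finite-difference derivatives along the slice."""
    du = np.gradient(u, theta)
    return du, np.gradient(du, theta)

theta = np.linspace(-1.35, 1.35, 91)           # 91-point uniform grid, as in the text
U = 0.5 * (1.0 + np.tanh(theta / 0.3))         # hypothetical loss profile
u = first_order_profile(U, theta, nu=0.1, tau=0.012)
du, d2u = slice_derivatives(u, theta)
sharpness = float(np.max(np.abs(d2u)))         # analogue of max |d^2 u / d theta^2|
```

The positivity check matters because a first-order truncation of the heat semigroup is not guaranteed to keep $w > 0$ for large $\nu\tau$, in which case the logarithm in Eq. (2) is undefined.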

![Refer to caption](https://arxiv.org/html/2606.18303v1/x2.png)

Figure 2: Hopf–Cole shock profile in a quotient ReLU model. The columns correspond to $\nu = 0.01, 0.1, 1.0$, and the curves in each panel correspond to increasing heat-flow times $\tau \in \{0, 0.006, 0.012, 0.030, 0.054\}$. The top row shows the effective action $A^{\nu} = -\nu \log w^{\nu}$, the middle row shows the quotient-slice velocity $\partial_{\theta} A^{\nu}$, and the bottom row shows $\partial_{\theta}^{2} A^{\nu}$. Here $\theta$ is the quotient-slice coordinate obtained by rotating the learned first hidden-unit direction $s_{1}^{\ast} \in S^{1}$ as $s_{1}(\theta) = \cos\theta\, s_{1}^{\ast} + \sin\theta\, J s_{1}^{\ast}$, with all other quotient coordinates fixed. Smaller viscosity produces a stable, sharper transition layer, whereas larger viscosity diffuses the profile. The underlying network is a width-$2$, one-hidden-layer balanced gauge-fixed ReLU classifier trained on the one-dimensional task $y = \mathbf{1}_{\{x > 0\}}$ with $96$ samples, plain SGD, learning rate $0.035$, and $2000$ epochs.

Table 1: Maximum absolute second derivative at the final heat-flow time $\tau = 0.1$. The quantity $\max_{\theta} |\partial_{\theta}^{2} u^{\nu}|$ measures the sharpness of the transition layer in the quotient slice.

Figure [2](https://arxiv.org/html/2606.18303#S3.F2) shows the qualitative behavior predicted by the quotient Hamilton–Jacobi and Burgers picture. The local entropy $u^{\nu} = -\nu \log w$ develops a localized steep transition along the quotient slice, and the corresponding velocity $\partial_{\theta} u^{\nu}$ changes rapidly over a narrow interval of $\theta$. The second derivative $\partial_{\theta}^{2} u^{\nu}$ makes this transition layer explicit: it concentrates near the location where the velocity steepens. As the viscosity increases from $\nu = 0.05$ to $\nu = 0.5$, the peak curvature decreases from $4.663319$ to $3.328405$, which is consistent with viscous smoothing of a shock layer.

This experiment therefore provides a concrete finite\-dimensional illustration of the theoretical mechanism\. After quotienting the ReLU rescaling symmetry, the Hopf–Cole transform on the quotient space produces a scalar local entropy whose gradient exhibits Burgers\-type steepening\. The observed layer is not a discontinuity in raw parameter space; it is a sharp transition in a symmetry\-corrected quotient coordinate, exactly the level at which the theory predicts shock\-like behavior\.

## 4 Burgers-type Equation

We now ask when the quotient Hamilton–Jacobi equation reduces to a scalar transport equation\.

###### Assumption 4.1 (One-dimensional closure with isoparametric condition)

Let $\psi: M \to I \subset \mathbb{R}$ be a $C^{3}$ collective coordinate and let $N \subset M$ be a tubular neighborhood.[^8] Assume:

1. (a) there exists a reduced effective potential $\bar{U}: I \to \mathbb{R}$ such that $U(q) = \bar{U}(\psi(q))$ for all $q \in N$;
2. (b) the coarse-grained solution preserves this dependence on $N$, namely, $u^{\nu}(\tau, q) = \bar{u}^{\nu}(\tau, \psi(q))$ for all $(\tau, q) \in [0, T] \times N$;
3. (c) $\psi$ is unit speed on $N$: $\lVert \mathrm{grad}\, \psi \rVert_{g} = 1$;
4. (d) $\Delta_{M} \psi$ depends only on $\psi$ on $N$, that is, there exists $\kappa \in C^{1}(I)$ such that $\Delta_{M} \psi = \kappa \circ \psi$ on $N$;
5. (e) $\bar{u}^{\nu} \in C^{1,3}([0, T] \times I)$.

[^8]: A tubular neighborhood of a submanifold is a neighborhood that is diffeomorphic to a neighborhood of the zero section in its normal bundle, so that nearby points are represented by normal displacements from the submanifold.

###### Theorem 4.2 (Source-corrected one-dimensional reduction)

Under Assumption [4.1](https://arxiv.org/html/2606.18303#S4.Thmtheorem1), $\bar{u}^{\nu}$ satisfies

$$\partial_{\tau} \bar{u}^{\nu} + \frac{1}{2} (\partial_{s} \bar{u}^{\nu})^{2} = \frac{\nu}{2} \bigl( \partial_{ss} \bar{u}^{\nu} + \kappa(s)\, \partial_{s} \bar{u}^{\nu} \bigr),$$

where $s = \psi(q)$. Consequently, the gradient field $v^{\nu}(\tau, s) := \partial_{s} \bar{u}^{\nu}(\tau, s)$ satisfies

$$\partial_{\tau} v^{\nu} + v^{\nu} \partial_{s} v^{\nu} = \frac{\nu}{2} \bigl( \partial_{ss} v^{\nu} + \kappa(s)\, \partial_{s} v^{\nu} + \kappa'(s)\, v^{\nu} \bigr).$$

If, in addition, $\kappa \equiv 0$, then the equation reduces to the classical viscous Burgers equation

$$\partial_{\tau} v^{\nu} + v^{\nu} \partial_{s} v^{\nu} = \frac{\nu}{2} \partial_{ss} v^{\nu}.$$

###### Proof

Substitute $u^{\nu}(\tau, q) = \bar{u}^{\nu}(\tau, \psi(q))$ into the Hamilton–Jacobi equation. By the chain rule, $\mathrm{grad}\, u^{\nu} = (\partial_{s} \bar{u}^{\nu})\, \mathrm{grad}\, \psi$, and

$$\lVert \mathrm{grad}\, u^{\nu} \rVert_{g}^{2} = (\partial_{s} \bar{u}^{\nu})^{2} \lVert \mathrm{grad}\, \psi \rVert_{g}^{2} = (\partial_{s} \bar{u}^{\nu})^{2}.$$

For the Laplacian,

$$\Delta_{M} u^{\nu} = \partial_{ss} \bar{u}^{\nu}\, \lVert \mathrm{grad}\, \psi \rVert_{g}^{2} + \partial_{s} \bar{u}^{\nu}\, \Delta_{M} \psi = \partial_{ss} \bar{u}^{\nu} + \kappa(\psi(q))\, \partial_{s} \bar{u}^{\nu}.$$

Since the right-hand side depends on $q$ only through $s = \psi(q)$, the scalar equation follows. Differentiating it with respect to $s$ is justified by $\bar{u}^{\nu} \in C^{1,3}$ and gives the evolution for $v^{\nu}$. If $\kappa \equiv 0$, the source terms vanish. ∎
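The final differentiation step of the proof can be written out term by term. Applying $\partial_{s}$ to each term of the scalar equation for $\bar{u}^{\nu}$ and writing $v^{\nu} = \partial_{s} \bar{u}^{\nu}$ gives the following, where the only term not already present in the flat case comes from the product rule acting on $\kappa(s)\, \partial_{s} \bar{u}^{\nu}$:

```latex
\begin{aligned}
\partial_{s}\bigl(\partial_{\tau}\bar{u}^{\nu}\bigr)
  &= \partial_{\tau} v^{\nu}, \\
\partial_{s}\Bigl(\tfrac{1}{2}(\partial_{s}\bar{u}^{\nu})^{2}\Bigr)
  &= v^{\nu}\,\partial_{s} v^{\nu}, \\
\partial_{s}\Bigl(\tfrac{\nu}{2}\bigl(\partial_{ss}\bar{u}^{\nu}
  + \kappa(s)\,\partial_{s}\bar{u}^{\nu}\bigr)\Bigr)
  &= \tfrac{\nu}{2}\bigl(\partial_{ss} v^{\nu}
  + \kappa(s)\,\partial_{s} v^{\nu}
  + \kappa'(s)\,v^{\nu}\bigr).
\end{aligned}
```

Summing the first two lines and equating with the third reproduces the source-corrected Burgers-type equation of Thm. 4.2; setting $\kappa \equiv 0$ removes both $\kappa$ terms at once.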

![Refer to caption](https://arxiv.org/html/2606.18303v1/x3.png)

Figure 3: Shock formation in viscous Burgers equation via Hopf–Cole transform. Each panel shows the velocity field $v^{\nu}(\tau, s)$, obtained from the Hopf–Cole solution $v^{\nu} = \partial_{s}(-\nu \log w)$ with potential $U_{0}(s) = e^{-s^{2}/2}$, for three viscosities $\nu \in \{0.02, 0.10, 0.40\}$ (left to right). The initial drift is $v_{0}(s) = U'_{0}(s) = -s e^{-s^{2}/2}$, and the second derivative of the potential satisfies $\inf_{s} U''_{0}(s) = -1$, attained at $s = 0$. Time samples are non-uniformly spaced with higher density near $\tau^{*}$ to resolve the rapid transition. As $\nu \rightarrow 0$, the shock layer sharpens into a discontinuity, while large $\nu$ smooths the front and suppresses shock formation entirely. Note that this is a numerical integration (simulation) of the viscous Burgers equation.

Thm. [4.2](https://arxiv.org/html/2606.18303#S4.Thmtheorem2) shows that, once the quotient coarse-grained local-entropy dynamics close through a single collective coordinate $s = \psi(q)$, the symmetry-quotiented Hamilton–Jacobi equation collapses to a scalar nonlinear evolution, and its gradient field obeys a Burgers-type equation. Thus, the effective training dynamics admit a genuine transport normal form: the gradient field $v^{\nu} = \partial_{s} \bar{u}^{\nu}$ evolves by nonlinear self-advection, while diffusion and geometric forcing appear as lower-order corrections.

The benefit of the theorem is that it converts a high-dimensional quotient dynamics on $M$ into a one-dimensional local-entropy dynamics whose singular behavior is mathematically explicit. In particular, the term $v^{\nu}\partial_{s}v^{\nu}$ identifies the steepening mechanism responsible for shock-type regime changes, whereas the terms involving $\nu$ describe viscous smoothing and the coefficient $\kappa$ records the geometric effect of the embedding of the collective coordinate inside the quotient manifold. Therefore, abrupt transitions in the reduced learning dynamics can be analyzed by standard tools from Burgers theory rather than only by abstract arguments on the original parameter space.
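The competition between steepening and viscous smoothing can be illustrated numerically. The following is a minimal sketch, not the authors' code. It assumes the flat case $\kappa\equiv 0$ and the normalization $\partial_{\tau}u+\tfrac12(\partial_{s}u)^{2}=\tfrac{\nu}{2}\partial_{ss}u$, which is the one for which $w=e^{-u/\nu}$ solves a linear heat equation and $v=\partial_{s}(-\nu\log w)$; the constant in front of the viscous term in Thm. 4.2 may differ. The function name `hopf_cole_velocity`, the grids, and the time `tau = 1.5` are illustrative choices.

```python
import numpy as np

def hopf_cole_velocity(s, tau, nu, U0, y):
    """Hopf-Cole velocity v(tau, s) for u_t + u_s^2/2 = (nu/2) u_ss, u(0, .) = U0.

    v is the average of (s - y)/tau under weights
    exp(-[(s - y)^2 / (2 tau) + U0(y)] / nu), evaluated on a uniform y-grid.
    The normalization is an assumption of this sketch.
    """
    S, Y = s[:, None], y[None, :]
    log_weight = -((S - Y) ** 2 / (2.0 * tau) + U0(Y)) / nu
    # Subtract the row-wise maximum so the exponentials stay finite at small nu.
    log_weight -= log_weight.max(axis=1, keepdims=True)
    weight = np.exp(log_weight)
    return np.sum(weight * (S - Y), axis=1) / (tau * np.sum(weight, axis=1))

def U0(y):
    # Potential used in Fig. 3.
    return np.exp(-0.5 * y**2)

s = np.linspace(-3.0, 3.0, 301)      # evaluation points
y = np.linspace(-10.0, 10.0, 4001)   # quadrature grid
tau = 1.5                            # later than the inviscid shock time

profiles, steepness = {}, {}
for nu in (0.02, 0.10, 0.40):
    v = hopf_cole_velocity(s, tau, nu, U0, y)
    profiles[nu] = v
    # Largest negative slope of the front: a proxy for shock-layer sharpness.
    steepness[nu] = float(-np.min(np.gradient(v, s)))
    print(f"nu={nu:.2f}  max front steepness={steepness[nu]:.3f}")
```

Under these assumptions the front steepness grows as $\nu$ decreases, which is the qualitative behavior described for Fig. 3.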

### 4\.1Shock Formation in Inviscid Limit

In the flat case $\kappa\equiv 0$, the theorem yields the classical viscous Burgers equation exactly, so the entire classical theory of shock layers, inviscid limits, characteristic crossing, and viscosity regularization becomes available. When $\kappa\equiv 0$ and $\nu\to 0$, the viscous Burgers equation formally becomes inviscid Burgers,

$$\partial_{\tau}v+v\,\partial_{s}v=0,\qquad v(0,s)=\bar{U}'(s).$$

###### Theorem 4\.3\(Shock time for the reduced drift\)

Let $I$ be either $\mathbb{R}$ or the one-dimensional torus $\mathbb{T}$. Assume $\bar{U}\in C^{2}(I)$ and define $v_{0}(s):=\bar{U}'(s)$. Then, the classical solution of the inviscid Burgers equation, as long as it exists classically, is constant along characteristics (footnote 9):

$$s(\tau;\xi)=\xi+\tau v_{0}(\xi),\qquad v\bigl(\tau,s(\tau;\xi)\bigr)=v_{0}(\xi).$$

Its maximal classical existence time is

$$\tau_{*}=\begin{cases}-\dfrac{1}{\inf_{\xi\in I}v_{0}'(\xi)}, & \inf_{\xi\in I}v_{0}'(\xi)<0,\\[2ex] +\infty, & \inf_{\xi\in I}v_{0}'(\xi)\geq 0,\end{cases}$$

and equivalently,

$$\tau_{*}=\begin{cases}-\dfrac{1}{\inf_{\xi\in I}\bar{U}''(\xi)}, & \inf_{\xi\in I}\bar{U}''(\xi)<0,\\[2ex] +\infty, & \inf_{\xi\in I}\bar{U}''(\xi)\geq 0.\end{cases}$$

Footnote 9: The characteristics here refer to the solution curves $s(\tau;\xi)$ of the characteristic equation $\frac{d}{d\tau}s(\tau;\xi)=v(\tau,s(\tau;\xi))$ with $s(0;\xi)=\xi$; along each such curve, $v$ remains constant, so that for the inviscid Burgers equation, one has $s(\tau;\xi)=\xi+\tau v_{0}(\xi)$.

###### Proof

The characteristic system is

$$\frac{d}{d\tau}s(\tau)=v(\tau,s(\tau)),\qquad\frac{d}{d\tau}v(\tau,s(\tau))=0.$$

Hence $v(\tau,s(\tau;\xi))=v_{0}(\xi)$ and therefore $s(\tau;\xi)=\xi+\tau v_{0}(\xi)$. A classical solution persists exactly as long as the characteristic map $\xi\mapsto s(\tau;\xi)$ remains a $C^{1}$ diffeomorphism of $I$ onto itself. Differentiating with respect to $\xi$ gives

$$\partial_{\xi}s(\tau;\xi)=1+\tau v_{0}'(\xi),$$

which is bounded below by $1+\tau\inf_{\xi\in I}v_{0}'(\xi)$ and hence stays uniformly positive as long as that quantity is positive. The first loss of invertibility therefore occurs at the time stated above. Since $v_{0}'=\bar{U}''$, the second formula follows. ∎

Thm. [4.3](https://arxiv.org/html/2606.18303#S4.Thmtheorem3) states that the inviscid gradient field $v$ remains classical only up to the shock time $\tau_{*}=-1/\inf_{\xi\in I}\bar{U}''(\xi)$ when $\inf_{\xi\in I}\bar{U}''(\xi)<0$, so negative curvature of the reduced effective potential causes finite-time characteristic crossing; Fig. [3](https://arxiv.org/html/2606.18303#S4.F3) illustrates the corresponding viscous precursor, where the shock layer sharpens as $\nu\to 0$.
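As a concrete check of the shock-time formula, the sketch below evaluates $\tau_{*}=-1/\inf\bar{U}''$ on a grid for the Gaussian potential of Fig. 3 and tests whether the characteristic map $\xi\mapsto\xi+\tau v_{0}(\xi)$ is still increasing just before and just after that time. The helper names (`shock_time`, `characteristic_map`, `is_monotone`) and the grid are hypothetical, and the infimum is only approximated by a grid minimum.

```python
import numpy as np

def shock_time(v0_prime, grid):
    """Grid approximation of tau_* = -1 / inf v0', or +inf if inf v0' >= 0."""
    m = float(np.min(v0_prime(grid)))
    return -1.0 / m if m < 0.0 else float("inf")

def characteristic_map(v0, xi, tau):
    """Inviscid Burgers characteristics: s(tau; xi) = xi + tau * v0(xi)."""
    return xi + tau * v0(xi)

def v0(s):
    # Initial drift v0 = U0' for U0(s) = exp(-s^2 / 2).
    return -s * np.exp(-0.5 * s**2)

def v0_prime(s):
    # v0' = U0'' = (s^2 - 1) exp(-s^2 / 2), with minimum -1 at s = 0.
    return (s**2 - 1.0) * np.exp(-0.5 * s**2)

xi = np.linspace(-6.0, 6.0, 4001)   # symmetric grid whose midpoint is s = 0
tau_star = shock_time(v0_prime, xi)

def is_monotone(tau):
    """True if the characteristic map is strictly increasing on the grid."""
    return bool(np.all(np.diff(characteristic_map(v0, xi, tau)) > 0.0))

print("tau_* =", tau_star)
print("invertible before tau_*:", is_monotone(0.9 * tau_star))
print("invertible after  tau_*:", is_monotone(1.1 * tau_star))
```

For this potential the grid estimate agrees with the analytic value $\tau_{*}=1$, and the map loses monotonicity once $\tau$ exceeds it.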

## 5Architecture\-specific Instantiations

### 5\.1ReLU Multilayer Perceptrons

Consider a bias\-free ReLU multilayer perceptron \(MLP\)

$$f_{W}(x)=W_{L}\,\phi\bigl(W_{L-1}\,\phi(\cdots\phi(W_{1}x)\cdots)\bigr),\tag{4}$$

where the ReLU activation $\phi(z)=\max\{z,0\}$ acts coordinatewise.

###### Proposition 1\(Positive rescaling symmetry for ReLU MLPs\)

Fix an internal hidden unit $j$ in a ReLU MLP. If the incoming weights attached to that unit are multiplied by a constant $c>0$ and the outgoing weights from that unit are multiplied by $c^{-1}$, then the realized function $f_{W}$ is unchanged. Hidden-unit permutations are also exact symmetries provided the inverse permutation is applied in the subsequent layer \[[1](https://arxiv.org/html/2606.18303#bib.bib1),[2](https://arxiv.org/html/2606.18303#bib.bib2)\].
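The proposition can be verified numerically on a small random network. The sketch below is illustrative only: the layer sizes, the seed, and the helper names `mlp`, `rescale_unit`, and `permute_units` are assumptions, not part of the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp(weights, x):
    """Bias-free ReLU MLP: W_L relu(W_{L-1} ... relu(W_1 x))."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

def rescale_unit(weights, layer, j, c):
    """Scale incoming weights of hidden unit j by c > 0 and outgoing weights by 1/c."""
    new = [W.copy() for W in weights]
    new[layer][j, :] *= c        # row j of W_layer holds the incoming weights
    new[layer + 1][:, j] /= c    # column j of W_{layer+1} holds the outgoing weights
    return new

def permute_units(weights, layer, perm):
    """Permute hidden units of `layer` and apply the matching permutation downstream."""
    new = [W.copy() for W in weights]
    new[layer] = new[layer][perm, :]
    new[layer + 1] = new[layer + 1][:, perm]
    return new

rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(4, 5)), rng.normal(size=(2, 4))]
x = rng.normal(size=3)

y_ref = mlp(weights, x)
y_rescaled = mlp(rescale_unit(weights, layer=0, j=2, c=7.5), x)
y_permuted = mlp(permute_units(weights, layer=1, perm=rng.permutation(4)), x)

print(np.allclose(y_ref, y_rescaled), np.allclose(y_ref, y_permuted))
```

Both transformed networks reproduce the reference output up to floating-point error, even though the raw weight matrices differ.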

###### Corollary 1\(Conditional ReLU MLP quotient dynamics\)

On any regular stratum where the activation pattern is fixed (footnote 10) and the stabilizer is trivial (footnote 11), the symmetry action generated by positive rescalings and permutations is exact. Consequently, if the projected SGD drift and projected conditional covariance are projectable (footnote 12) through the quotient chart in the sense of Assumption [2.1](https://arxiv.org/html/2606.18303#S2.Thmtheorem1), then Thm. [2.2](https://arxiv.org/html/2606.18303#S2.Thmtheorem2) applies. If, in addition, the assumptions of Thm. [3.1](https://arxiv.org/html/2606.18303#S3.Thmtheorem1) hold on the quotient, and the local-entropy dynamics satisfy Assumption [4.1](https://arxiv.org/html/2606.18303#S4.Thmtheorem1), then the gradient field obeys the source-corrected Burgers equation, and in the flat-coordinate case it obeys classical viscous Burgers.

Footnote 10: "The activation pattern is fixed" means that on the local stratum under consideration, the active or inactive status of every ReLU unit does not change, so the network map $f_{W}$ is smooth in the weights on that stratum.

Footnote 11: The stabilizer of a parameter point is the subgroup of symmetries that leaves that point unchanged; saying that the stabilizer is trivial means that only the identity symmetry fixes the point, so the action is free on that stratum. If the stabilizer were nontrivial, the point would retain residual symmetry, which could lower the orbit dimension and induce local singularities in the quotient.

Footnote 12: Here, "projected" means pushed forward to the quotient coordinates via the quotient-chart Jacobian $D\Phi$, whereas "projectable" means that the resulting quantity depends only on the quotient state and not on the particular representative in the symmetry orbit. In particular, mere projection is not enough: the projected quantity must agree for all representatives of the same orbit in order to define a closed reduced dynamics on the quotient. In this paper, the projected drift is $D\Phi(\theta)\nabla L(\theta)$, while projectability means that this projected quantity is constant along symmetry orbits and therefore can be written as a well-defined function $b(Y)$ of the quotient coordinate $Y=\Phi(\theta)$.

###### Proof

On a fixed activation-pattern stratum, $f_{W}$ is smooth in the weights, and the proposition gives an exact symmetry action. The quotient reduction is therefore valid once projectability of the drift and projected covariance is verified. The Hamilton–Jacobi and Burgers conclusions then follow from Theorems [3.1](https://arxiv.org/html/2606.18303#S3.Thmtheorem1) and [4.2](https://arxiv.org/html/2606.18303#S4.Thmtheorem2).

### 5\.2Convolutional Neural Networks

The same mechanism extends to convolutional neural networks \(CNNs\)\.

###### Proposition 2\(Channel rescaling symmetry for CNNs\)

Consider two consecutive convolutional layers separated only by positively homogeneous nonlinearities. If one rescales an intermediate channel by $c>0$ and rescales the corresponding incoming kernel of the next layer by $c^{-1}$, the network function is unchanged. Channel permutations are also exact symmetries if propagated consistently across adjacent layers \[[1](https://arxiv.org/html/2606.18303#bib.bib1),[2](https://arxiv.org/html/2606.18303#bib.bib2)\].
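A small one-dimensional example makes the channel rescaling concrete. This is a sketch under simplifying assumptions: two bias-free "valid" cross-correlation layers with a ReLU in between, random kernels, and hypothetical helper names `conv1d` and `cnn`.

```python
import numpy as np

def conv1d(x, K):
    """Valid cross-correlation. x: (C_in, T), K: (C_out, C_in, k) -> (C_out, T - k + 1)."""
    C_out, C_in, k = K.shape
    out = np.zeros((C_out, x.shape[1] - k + 1))
    for o in range(C_out):
        for i in range(C_in):
            out[o] += np.correlate(x[i], K[o, i], mode="valid")
    return out

def cnn(x, K1, K2):
    """Two convolutional layers separated by a ReLU (positively homogeneous)."""
    return conv1d(np.maximum(conv1d(x, K1), 0.0), K2)

rng = np.random.default_rng(4)
x = rng.normal(size=(2, 16))
K1 = rng.normal(size=(3, 2, 3))   # produces 3 intermediate channels
K2 = rng.normal(size=(2, 3, 3))   # consumes those 3 channels

j, c = 1, 4.0
K1_scaled, K2_scaled = K1.copy(), K2.copy()
K1_scaled[j] *= c        # kernels producing intermediate channel j
K2_scaled[:, j] /= c     # next-layer kernels reading channel j

y_ref = cnn(x, K1, K2)
y_scaled = cnn(x, K1_scaled, K2_scaled)
print(np.allclose(y_ref, y_scaled))
```

The same cancellation would fail if the intermediate nonlinearity were not positively homogeneous, which is why the proposition states that condition explicitly.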

###### Corollary 2\(Conditional CNN quotient dynamics\)

On a regular stratum of a positively homogeneous CNN, the same conditional conclusion holds as in Corollary[1](https://arxiv.org/html/2606.18303#Thmcorollary1): if the projected SGD drift and projected conditional covariance are projectable through the quotient chart in the sense of Assumption[2\.1](https://arxiv.org/html/2606.18303#S2.Thmtheorem1), then SGD descends locally to quotient coordinates\. Under the hypotheses of Theorems[3\.1](https://arxiv.org/html/2606.18303#S3.Thmtheorem1)and[4\.2](https://arxiv.org/html/2606.18303#S4.Thmtheorem2), the quotient local\-entropy dynamics admit the same source\-corrected Burgers reduction as in the MLP case\.

###### Proof

Use the previous proposition in place of the MLP symmetry proposition; the rest is identical\.

### 5\.3Networks with Batch Normalization and Layer Normalization

Batch Normalization creates scale\-invariant directions because normalizing a pre\-activation removes uniform positive scalings of the incoming weight vector\[[6](https://arxiv.org/html/2606.18303#bib.bib6)\]\. The exact symmetry group depends on the architecture\.

###### Proposition 3\(Positive scale invariance before idealized Batch Normalization\)

Let $z=w^{\top}x$ be a pre-activation immediately followed by the idealized Batch Normalization map

$$\mathrm{BN}(z)=\gamma\,\frac{z-\mu(z)}{\sigma(z)}+\beta,$$

with no $\varepsilon$-stabilizer inside the denominator. If $w$ is replaced by $cw$ with $c>0$, then the normalized output is unchanged \[[6](https://arxiv.org/html/2606.18303#bib.bib6)\].
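The invariance, and the role of the missing $\varepsilon$, can be checked on a random batch. In this sketch the batch size, the values of $\gamma$, $\beta$, $c$, and the deliberately large `eps` are illustrative assumptions; `idealized_bn` and `bn_with_eps` are hypothetical helper names.

```python
import numpy as np

def idealized_bn(z, gamma=1.0, beta=0.0):
    """Idealized batch normalization of a batch of scalar pre-activations (no epsilon)."""
    return gamma * (z - z.mean()) / z.std() + beta

def bn_with_eps(z, eps=1e-1):
    """Batch normalization with an epsilon stabilizer; eps is exaggerated for visibility."""
    return (z - z.mean()) / np.sqrt(z.var() + eps)

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 10))   # batch of 64 inputs
w = rng.normal(size=10)
c = 3.7

out = idealized_bn(X @ w, gamma=1.3, beta=-0.2)
out_scaled = idealized_bn(X @ (c * w), gamma=1.3, beta=-0.2)

# With a stabilizer the scale invariance holds only approximately.
eps_gap = float(np.max(np.abs(bn_with_eps(X @ w) - bn_with_eps(X @ (c * w)))))

print(np.allclose(out, out_scaled), eps_gap)
```

The idealized map is exactly invariant up to floating-point error, whereas the stabilized variant shows a small residual dependence on the scale of $w$, which is why the proposition excludes the $\varepsilon$ term.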

###### Theorem 5\.1\(Conditional quotient reduction for normalized networks\)

Suppose a BN or LN architecture admits an exact regular symmetry subgroupGnormG\_\{\\mathrm\{norm\}\}preserving the realized function and the empirical loss on a regular stratum\. Assume also that the projected SGD drift and projected conditional covariance are projectable through the quotient chart in the sense of Assumption[2\.1](https://arxiv.org/html/2606.18303#S2.Thmtheorem1)\. Then, the quotient reduction theorem applies on that stratum\. If, in addition, the hypotheses of Thm\.[3\.1](https://arxiv.org/html/2606.18303#S3.Thmtheorem1)hold on the quotient and the local\-entropy dynamics satisfy Assumption[4\.1](https://arxiv.org/html/2606.18303#S4.Thmtheorem1), then, the source\-corrected Burgers reduction also holds\.

###### Proof

Once an exact symmetry subgroup acts freely and properly on the relevant regular stratum and preserves the loss, the quotient manifold exists locally\. Thm\.[2\.2](https://arxiv.org/html/2606.18303#S2.Thmtheorem2)then applies by the assumed projectability of the reduced drift and covariance\. The subsequent conclusions follow from Theorems[3\.1](https://arxiv.org/html/2606.18303#S3.Thmtheorem1)and[4\.2](https://arxiv.org/html/2606.18303#S4.Thmtheorem2)\.

### 5\.4Transformers

Recent work argues that Transformer architectures possess richer parameter symmetries than classical rescaling alone, and that quotient\-manifold constructions are needed to define meaningful sharpness measures\[[9](https://arxiv.org/html/2606.18303#bib.bib9)\]\. This motivates a deliberately abstract statement\.

###### Theorem 5\.2\(Conditional local quotient reduction for Transformer symmetry strata\)

Let $\Theta_{\mathrm{tr},\mathrm{reg}}$ be a regular symmetry stratum of a Transformer parameter space, and let $G_{\mathrm{tr}}$ be a smooth symmetry group acting freely and properly on that stratum while preserving the realized function and empirical loss. Assume that the projected SGD drift and projected conditional covariance are projectable through the quotient chart in the sense of Assumption [2.1](https://arxiv.org/html/2606.18303#S2.Thmtheorem1). Then, the quotient reduction theorem holds on $\Theta_{\mathrm{tr},\mathrm{reg}}/G_{\mathrm{tr}}$. If the hypotheses of Thm. [3.1](https://arxiv.org/html/2606.18303#S3.Thmtheorem1) hold on the quotient and there exists a one-dimensional collective coordinate satisfying Assumption [4.1](https://arxiv.org/html/2606.18303#S4.Thmtheorem1), then the corresponding reduced gradient field satisfies the source-corrected Burgers equation.

###### Proof

This is the same argument as in Thm\.[5\.1](https://arxiv.org/html/2606.18303#S5.Thmtheorem1)\.

For Transformers, the most robust exact statement is typically the quotient Hamilton–Jacobi equation rather than classical Burgers\. The reason is structural: high\-dimensional symmetry groups make one\-dimensional closure non\-generic\. Hence the proper workflow is to first identify a quotient chart, then verify projectability, and only then ask whether a scalar normal form is justified\.

### 5\.5Mean\-field Limits

The correct large-width object is often a probability measure on parameter space rather than a finite-dimensional parameter vector \[[7](https://arxiv.org/html/2606.18303#bib.bib7),[8](https://arxiv.org/html/2606.18303#bib.bib8),[5](https://arxiv.org/html/2606.18303#bib.bib5)\]. We now formulate the corrected quotient counterpart. Let $\mathcal{P}(\Theta)$ denote the space of Borel probability measures on $\Theta$. Suppose that $\mu_{t}\in\mathcal{P}(\Theta)$ solves the continuity equation

$$\partial_{t}\mu_{t}+\mathrm{div}_{\theta}\bigl(\mu_{t}V[\mu_{t}]\bigr)=0$$

in weak form. Let $\pi:\Theta_{\mathrm{reg}}\to M$ be a quotient map as before.

###### Assumption 5\.3\(Pushforward projectability of the mean\-field velocity\)

Assume there exists a Borel map $\bar{V}:\mathcal{P}(M)\times M\to TM$ such that for every probability measure $\mu$ supported in $\Theta_{\mathrm{reg}}$ and every $\theta\in\Theta_{\mathrm{reg}}$,

$$D\pi_{\theta}V[\mu](\theta)=\bar{V}[\pi_{\#}\mu]\bigl(\pi(\theta)\bigr)\quad\text{(footnote 13)}.$$

Footnote 13: $\pi_{\#}\mu$ denotes the pushforward of the probability measure $\mu$ under the quotient map $\pi$, i.e. the induced measure on the quotient space defined by $(\pi_{\#}\mu)(A)=\mu(\pi^{-1}(A))$.

###### Theorem 5\.4\(Pushforward quotient transport\)

Assume Assumption [5.3](https://arxiv.org/html/2606.18303#S5.Thmtheorem3). Let $\nu_{t}:=\pi_{\#}\mu_{t}$. Then, $\nu_{t}$ solves the weak quotient transport equation

$$\partial_{t}\nu_{t}+\mathrm{div}_{M}\bigl(\nu_{t}\bar{V}[\nu_{t}]\bigr)=0.$$

###### Proof

Take any test function $\varphi\in C_{c}^{\infty}(M)$. By definition of pushforward,

$$\int_{M}\varphi(q)\,\nu_{t}(dq)=\int_{\Theta}\varphi\bigl(\pi(\theta)\bigr)\,\mu_{t}(d\theta).$$

Differentiate in time and use the weak form of the continuity equation:

$$\begin{aligned}\frac{d}{dt}\int_{M}\varphi\,d\nu_{t}&=\int_{\Theta}\nabla_{\theta}(\varphi\circ\pi)(\theta)\cdot V[\mu_{t}](\theta)\,\mu_{t}(d\theta)\\&=\int_{\Theta}\bigl(D\pi_{\theta}\bigr)^{\top}\nabla_{M}\varphi\bigl(\pi(\theta)\bigr)\cdot V[\mu_{t}](\theta)\,\mu_{t}(d\theta)\\&=\int_{\Theta}\nabla_{M}\varphi\bigl(\pi(\theta)\bigr)\cdot D\pi_{\theta}V[\mu_{t}](\theta)\,\mu_{t}(d\theta).\end{aligned}$$

By Assumption [5.3](https://arxiv.org/html/2606.18303#S5.Thmtheorem3), the last integrand equals $\nabla_{M}\varphi\bigl(\pi(\theta)\bigr)\cdot\bar{V}[\nu_{t}]\bigl(\pi(\theta)\bigr)$. Pushing forward the measure therefore gives

$$\frac{d}{dt}\int_{M}\varphi\,d\nu_{t}=\int_{M}\nabla_{M}\varphi(q)\cdot\bar{V}[\nu_{t}](q)\,\nu_{t}(dq),$$

which is the weak formulation of the quotient transport equation.

Thm. [5.4](https://arxiv.org/html/2606.18303#S5.Thmtheorem4) states that, in the mean-field regime, the pushforward distribution $\nu_{t}=\pi_{\#}\mu_{t}$ satisfies a closed transport equation on the quotient space $M$. The large-width dynamics can therefore be formulated directly on the quotient in terms of probability measures, without returning to redundant parameter coordinates.
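A toy particle example shows what pushforward projectability buys. The setup below is purely illustrative and is not a neural network: $\Theta_{\mathrm{reg}}=\mathbb{R}^{2}\setminus\{0\}$ with rotation symmetry, quotient map $\pi(\theta)=\lVert\theta\rVert$, and an assumed mean-field velocity $V[\mu](\theta)=-(\mathbb{E}_{\mu}\lVert\theta\rVert^{2}-1)\,\theta$. Since $D\pi_{\theta}V[\mu](\theta)=-(\mathbb{E}_{\nu}r^{2}-1)\,r$ with $\nu=\pi_{\#}\mu$ and $r=\pi(\theta)$, the condition of Assumption 5.3 holds for this choice. The sketch evolves particles in $\Theta$ and, separately, their radii on the quotient with explicit Euler steps, and then compares the two.

```python
import numpy as np

def V(theta):
    """Toy mean-field velocity on Theta: V[mu](theta) = -(E_mu |theta|^2 - 1) theta."""
    m = np.mean(np.sum(theta**2, axis=1))
    return -(m - 1.0) * theta

def V_bar(r):
    """Induced quotient velocity: V_bar[nu](r) = -(E_nu r^2 - 1) r."""
    m = np.mean(r**2)
    return -(m - 1.0) * r

rng = np.random.default_rng(2)
theta = rng.normal(size=(500, 2)) + np.array([2.0, 0.0])  # particles sampling mu_0
r = np.linalg.norm(theta, axis=1)                          # samples of nu_0 = pi_# mu_0

h = 1e-3                      # Euler step size
for _ in range(2000):         # integrate up to t = 2
    theta = theta + h * V(theta)   # dynamics in redundant coordinates
    r = r + h * V_bar(r)           # closed dynamics on the quotient

gap = float(np.max(np.abs(np.linalg.norm(theta, axis=1) - r)))
second_moment = float(np.mean(r**2))
print(gap, second_moment)
```

The radii obtained by pushing the particles forward agree with the radii evolved directly on the quotient up to round-off, so in this example the angular coordinates never need to be tracked.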

## 6Discussion

Our theorems suggest three main lessons. First, symmetry-corrected observables should replace raw weights. In positively homogeneous ReLU networks, path-based quantities and quotient-flatness measures are more informative than Euclidean norms or raw-coordinate Hessian spectra \[[1](https://arxiv.org/html/2606.18303#bib.bib1),[2](https://arxiv.org/html/2606.18303#bib.bib2)\]. In networks with Batch Normalization (BN) and Layer Normalization (LN), raw parameter magnitudes are even less informative because normalization creates scale-invariant directions \[[6](https://arxiv.org/html/2606.18303#bib.bib6)\]. Quotient-based sharpness measures for Transformers support the same conclusion \[[9](https://arxiv.org/html/2606.18303#bib.bib9)\]. More broadly, overparameterized models contain substantial symmetry-induced redundancy, so raw norms, uncorrected Hessian surrogates, and layerwise scale statistics can vary even when the realized function changes little. The quotient viewpoint therefore favors symmetry-corrected observables, including invariant statistics of attention logits, head-output energies, attention entropies, and attention–MLP branch-balance measures.
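The contrast between a raw norm and a symmetry-corrected observable can be seen on a two-layer example. The sketch below uses the $\ell_{1}$ path norm as one possible path-based quantity; this choice, the layer sizes, and the helper names `path_norm` and `frobenius_norm` are assumptions made for illustration, not the specific observables used in the paper.

```python
import numpy as np

def path_norm(weights):
    """l1 path norm: sum over input-output paths of the product of |weights|."""
    v = np.ones(weights[0].shape[1])
    for W in weights:
        v = np.abs(W) @ v
    return float(v.sum())

def frobenius_norm(weights):
    """Raw Euclidean norm of all weights stacked together."""
    return float(np.sqrt(sum(np.sum(W**2) for W in weights)))

rng = np.random.default_rng(3)
weights = [rng.normal(size=(6, 4)), rng.normal(size=(3, 6))]

# Positive rescaling of hidden unit 1: incoming weights times c, outgoing divided by c.
c = 10.0
rescaled = [W.copy() for W in weights]
rescaled[0][1, :] *= c
rescaled[1][:, 1] /= c

print("path norm:     ", path_norm(weights), path_norm(rescaled))
print("Frobenius norm:", frobenius_norm(weights), frobenius_norm(rescaled))
```

The path norm is unchanged by the rescaling while the Frobenius norm changes substantially, although both parameter settings realize the same ReLU network function.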

Second, negative reduced curvature could predict abrupt regime changes, although it may be computationally demanding to evaluate and may require a surrogate. In the one-dimensional closed model, the shock time is determined by $\inf_{\xi}\bar{U}''(\xi)$. Regime changes should therefore be detected through projected or quotient-corrected curvature rather than the ambient Hessian alone. Even without exact scalar closure, strongly negative reduced curvature should still indicate rapid qualitative changes in learning, such as sharp changes in loss slope, attention-head concentration, branch-balance collapse, or optimization-regime transitions. This yields a falsifiable prediction: quotient-corrected curvature should anticipate such events more reliably than raw parameter-space curvature.

Third, the appropriate reduced model depends on architecture and width. For ReLU MLPs and CNNs, Burgers-type normal forms are natural only after verifying projectability. For BN, LN, and Transformers, quotient Hamilton–Jacobi is usually the better first target. In mean-field limits, quotient transport on probability measures is primary, and scalar Burgers should be introduced only after further observable reduction. This hierarchy matters in realistic Transformer settings, where exact one-dimensional reduction is typically non-generic. The default procedure is therefore to use quotient observables, test for approximate low-dimensional closure, and invoke scalar shock diagnostics only when the data support them.

#### 6\.0\.1Acknowledgements

I thank Toshinori Araki, my manager, for his dedicated support for this work\.

#### 6\.0\.2Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article\.

## References

- \[1\]Neyshabur, Behnam, Russ R\. Salakhutdinov, and Nati Srebro\. "Path\-sgd: Path\-normalized optimization in deep neural networks\."Advances in neural information processing systems28 \(2015\)\.
- \[2\]Rangamani, Akshay, et al\. "A scale invariant flatness measure for deep network minima\."arXiv preprintarXiv:1902\.02434 \(2019\)\.
- \[3\]Chaudhari, Pratik, et al\. "Deep relaxation: partial differential equations for optimizing deep neural networks\."Research in the Mathematical Sciences5\.3 \(2018\): 30\.
- \[4\]Li, Qianxiao, and Cheng Tai\. "Stochastic modified equations and dynamics of stochastic gradient algorithms i: Mathematical foundations\."Journal of Machine Learning Research20\.40 \(2019\): 1\-47\.
- \[5\]Gess, Benjamin, Sebastian Kassing, and Vitalii Konarovskyi\. "Stochastic modified flows, mean\-field limits and dynamics of stochastic gradient descent\."Journal of Machine Learning Research25\.30 \(2024\): 1\-27\.
- \[6\]Arora, Sanjeev, Zhiyuan Li, and Kaifeng Lyu\. "Theoretical analysis of auto rate\-tuning by batch normalization\."arXiv preprintarXiv:1812\.03981 \(2018\)\.
- \[7\]Mei, Song, Andrea Montanari, and Phan\-Minh Nguyen\. "A mean field view of the landscape of two\-layer neural networks\."Proceedings of the National Academy of Sciences115\.33 \(2018\): E7665\-E7671\.
- \[8\]Sirignano, Justin, and Konstantinos Spiliopoulos\. "Mean field analysis of deep neural networks\."Mathematics of Operations Research47\.1 \(2022\): 120\-152\.
- \[9\]Da Silva, Marvin F\., Felix Dangel, and Sageev Oore\. "Hide & seek: Transformer symmetries obscure sharpness & Riemannian geometry finds it\."arXiv preprintarXiv:2505\.05409 \(2025\)\.
- \[10\]Evans, Lawrence C\.Partial differential equations\.Vol\. 19\. American mathematical society, 2022\.
- \[11\]LeVeque, Randall J\. Numerical methods for conservation laws\. Vol\. 132\. Basel: Birkhäuser, 1992\.
