The Hamilton-Jacobi Theory of Deep Learning
Summary
This paper identifies neural network training as a search through Hamilton-Jacobi initial-value problems, showing that residual networks, transformers, and RNNs discretize the same class of viscous Hamilton-Jacobi equations. It derives quantitative consequences including minimax optimal generalization rates, adversarial robustness bounds, and a closed-form influence function.
Cached at: 06/02/26, 03:36 PM
Source: https://huggingface.co/papers/2605.28983
Abstract
Neural network training is formulated as a Hamilton-Jacobi initial-value problem where gradient steps correspond to solving viscous Hamilton-Jacobi equations, with connections to residual networks, transformers, and RNNs through shared mathematical structures.
In this paper, training a neural network is identified, exactly, as a search through Hamilton-Jacobi initial-value problems: each gradient step selects the initial data of a viscous Hamilton-Jacobi equation whose Hopf-Cole propagator best fits the observations; at inference, the input is the spatial point at which that solution is evaluated and the initial condition is already encoded in the weights. The correspondence is exact for log-sum-exp layers and structural for broader architectures: residual networks, transformers, and recurrent architectures (RNNs, LSTMs, SSMs) each discretize the same class of Hamilton-Jacobi equations, with architecture-dependent Hamiltonian and viscosity. A single deformation parameter ε unifies all four perspectives (network, tropical algebra, viscous PDE, convex optimization) in a commutative diagram closed under Lipschitz conditions. Quantitative consequences include: the minimax optimal generalization rate O(n^{-1/(d+2)}) for fixed t; adversarial robustness controlled by ε; backpropagation as the co-state equation of the Hamiltonian system for residual networks (Pontryagin Maximum Principle); scaling exponents consistent with data intrinsic dimension via PDE quadrature; and a closed-form O(N) influence function (softmax attribution weights π_j) whose entropy landscape undergoes fold bifurcations as ε increases, each merging attribution basins.
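To make the log-sum-exp correspondence concrete, here is a minimal sketch. It is not taken from the paper, and it rests on stated assumptions: a quadratic Hamiltonian H(p) = |p|^2/2, the viscous equation u_t + |∇u|^2/2 = ε Δu, and initial data concentrated on N points y_j with values g_j. Under the Hopf-Cole substitution u = -2ε log φ, the equation becomes the heat equation for φ, so up to an additive constant the solution is a log-sum-exp over the costs |x - y_j|^2/(2t) + g_j. The function names (`hopf_cole_lse`, `entropy`) and the normalization are hypothetical, and the paper's own conventions may differ.

```python
import math


def hopf_cole_lse(x, centers, values, t, eps):
    """Soft-min (log-sum-exp) value at point x and time t.

    Assumes a quadratic Hamiltonian and point-supported initial data
    (illustrative choices, not the paper's stated setup).
    Returns (u, pi), where pi_j are softmax weights over the centers.
    """
    # Cost of reaching x from each center y_j in time t, plus initial value g_j.
    costs = [
        sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / (2.0 * t) + g
        for y, g in zip(centers, values)
    ]
    # Shift by the minimum cost so the exponentials cannot overflow.
    c_min = min(costs)
    w = [math.exp(-(c - c_min) / (2.0 * eps)) for c in costs]
    z = sum(w)
    u = c_min - 2.0 * eps * math.log(z)  # equals -2*eps*log(sum_j exp(-c_j/(2*eps)))
    pi = [wi / z for wi in w]            # softmax weights, computed in O(N)
    return u, pi


def entropy(pi):
    """Shannon entropy of the weights; zero-weight terms contribute nothing."""
    return -sum(p * math.log(p) for p in pi if p > 0.0)
```

In this sketch, as ε tends to 0 the value u approaches the hard minimum over j (the min-plus, or tropical, limit), and the weights π_j concentrate on the minimizing center. Larger ε spreads the weights and raises their entropy. The sketch only illustrates that smoothing effect; it does not reproduce the fold bifurcations the abstract describes.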
Get this paper in your agent:
hf papers read 2605.28983
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
The Hamilton-Jacobi Theory of Deep Learning
This paper establishes an exact correspondence between neural network training and Hamilton-Jacobi initial-value problems, unifying deep learning architectures through a deformation parameter.
@techwith_ram: What if I told you a neural network understands local change before it understands the full picture? That idea is deepl…
This thread explains the intuition behind the Jacobian Matrix and its widespread applications in AI and machine learning, including backpropagation, normalizing flows, computer vision, and robotics.
Equilibrium Propagation and Hamiltonian Inference in the Diffusive Fitzhugh-Nagumo Model
This paper extends Equilibrium Propagation to skew-gradient systems and demonstrates an equivalence between deep Energy-Based Models and Hamiltonian neural networks, focusing on diffusively coupled Fitzhugh-Nagumo neurons. It derives a layer-wise Hamiltonian recurrence relation for inference in such networks.
@simplifyinAI: DeepSeek has dropped a fundamental rewrite of the Transformer architecture. And it solves the "identity crisis" that br…
DeepSeek has published a paper introducing mHC (Manifold-Constrained Hyper-Connections), a fundamental rewrite of the Transformer architecture that stabilizes large models by replacing standard residual connections with mathematically constrained multi-stream pathways.
Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology
This paper performs full Jacobian eigendecomposition across production-scale LLMs, revealing a learned spectral gradient from rotation-dominated early layers to symmetric late layers, along with a low-rank bottleneck that compresses perturbations. The results link perturbation propagation and compression to network functional topology.