State-Space NTK Collapse Near Bifurcations


Summary

This paper develops a local theory of gradient descent near bifurcations in dynamical models, showing that the state-space neural tangent kernel collapses to a rank-one operator that dominates learning dynamics, making optimization effectively low-dimensional and predictable from normal forms.

arXiv:2605.12763v1 Announce Type: new Abstract: Rich feature learning in tasks that unfold over time often requires the model to pass through bifurcations, constituting qualitative changes in the underlying model dynamics. We develop a local theory of gradient descent near these transitions through the empirical state-space neural tangent kernel (sNTK). Our central finding is that bifurcations both dominate and simplify learning dynamics: near bifurcations, we can reduce sNTK to a rank-one operator corresponding to learning in a classical normal form system, providing an analytically tractable description of the local learning geometry, even for high-dimensional recurrent systems. Concretely, we give a procedure for decomposing sNTK into bifurcation-relevant and residual channels, showing that near commonly codimension-1 bifurcations the relevant channel is a rank-one operator that is highly amplified. This amplification causes the bifurcation channel to dominate the full sNTK. Thus, bifurcations locally warp the learning landscape, funneling gradient descent into a few critical dynamical directions and making the nearby kernel and loss geometry predictable from classical normal forms. We illustrate this in a student-teacher recurrent neural network: the first learned bifurcation coincides with a sharp collapse in sNTK effective rank and the emergence of a dominant parameter direction whose restricted sNTK closely matches the landscape predicted by the scalar pitchfork normal form. Finally, we show that low-rank natural gradient methods resolve the resulting learning instability near bifurcations with very little overhead over SGD.

Cached at: 05/14/26, 06:18 AM

# State-Space NTK Collapse Near Bifurcations
Source: [https://arxiv.org/html/2605.12763](https://arxiv.org/html/2605.12763)
James Hazelden (jhazelde@uw.edu), University of Washington, and Eric Shea-Brown (etsb@uw.edu), University of Washington

###### Abstract

Rich feature learning in tasks that unfold over time often requires the model to pass through bifurcations, constituting qualitative changes in the underlying model dynamics. We develop a local theory of gradient descent near these transitions through the empirical state-space neural tangent kernel ($\mathrm{sNTK}$). Our central finding is that bifurcations both dominate and simplify learning dynamics: near bifurcations, we can reduce $\mathrm{sNTK}$ to a rank-one operator corresponding to learning in a classical normal form system, providing an analytically tractable description of the local learning geometry, even for high-dimensional recurrent systems. Concretely, we give a procedure for decomposing $\mathrm{sNTK}$ into bifurcation-relevant and residual channels, showing that near common codimension-1 bifurcations the relevant channel is a rank-one operator that is highly amplified. This amplification causes the bifurcation channel to dominate the full NTK. Thus, bifurcations locally warp the learning landscape, funneling gradient descent into a few critical dynamical directions and making the nearby kernel and loss geometry predictable from classical normal forms. We illustrate this in a student-teacher recurrent neural network: the first learned bifurcation coincides with a sharp collapse in NTK effective rank and the emergence of a dominant parameter direction whose restricted NTK closely matches the landscape predicted by the scalar pitchfork normal form. Finally, we show that low-rank natural gradient methods resolve the resulting learning instability near bifurcations with very little overhead over SGD.

## 1 Introduction

Gradient descent (GD) trains a dynamical model by reshaping its latent dynamics until the resulting trajectories solve the task. In the rich, feature-learning regime, this often requires more than adjusting outputs: learning must create, destroy, or reorganize fixed points and related dynamical motifs. Classical dynamical systems theory tells us that such qualitative changes occur through local bifurcations (guckenheimer1983nonlinear).

Prior work has shown that some bifurcations coincide with large drops in loss in recurrent networks with ReLU activations, suggesting that bifurcations are important events in optimization as well as in dynamics (eisenmann2023bifurcations). However, that machinery is tied to specific architectures and bifurcation types, and does not explain more generally how GD behaves near a dynamical transition. More broadly, for a given parameterization and loss function, which bifurcations does GD traverse, which does it avoid, and how do parameter updates behave near bifurcation sets? We study these questions through the empirical state-space neural tangent kernel ($\mathrm{sNTK}$), the Gram operator of the global parameter-to-state Jacobian. Recent tools make $\mathrm{sNTK}$ interpretable and computable for finite recurrent models (hazelden2026globalempiricalntkselfreferential), making it a natural lens for studying learning near bifurcations.

We show that near a bifurcation, learning becomes effectively low-dimensional. For codimension-one bifurcations, $\mathrm{sNTK}$ reduces to an approximately rank-one channel, so GD is funneled into a narrow set of dynamical corrections. This channel is amplified by the underlying dynamics, making optimization stiff and strongly anisotropic near the transition. Thus, bifurcations are not just dynamical events, but optimization bottlenecks. Moreover, this reduction gives a simple analytic model for learning near bifurcations: the local dynamics are governed by the low-rank $\mathrm{sNTK}$ of the corresponding normal form, allowing analysis in a simple low-dimensional setting that still closely matches the behavior of the full model near the transition. To summarize, our contributions are:

- Near a local bifurcation, the $\mathrm{sNTK}$ of a generic model admits an additive decomposition into bifurcation-relevant and residual terms, with the former low rank.
- For codimension-one bifurcations, the corresponding normal forms predict $\mathrm{sNTK}$ amplification near the transition, inducing an effectively rank-one local learning geometry that closely matches the empirically computed $\mathrm{sNTK}$ in the full model.
- In a student-teacher RNN, a pitchfork bifurcation coincides with a sharp drop in $\mathrm{sNTK}$ effective rank, suggesting that low-rank natural-gradient corrections can stabilize training near such transitions.

## 2 NTK Collapse Due to Bifurcations in a Student-Teacher RNN

![Refer to caption](https://arxiv.org/html/2605.12763v1/x1.png)

Figure 1: NTK collapse in a student-teacher RNN. Over the course of SGD training, we measure (A) loss, (B) stable rank of $\mathrm{sNTK}$, and (C) the spectral radius of the student’s weights. (D) compares final readout dynamics. The dashed line in A–C corresponds to a pitchfork bifurcation, shown in (G), corresponding to a sudden drop in loss and collapse of $\mathrm{sNTK}$ effective rank to 1. (E) illustrates local $\mathrm{sNTK}$ norm amplification near this bifurcation, matching the pitchfork normal-form prediction.

We begin with a student-teacher RNN trained on a dynamical task with two-dimensional readout (details in Appendix [B](https://arxiv.org/html/2605.12763#A2)). The teacher exhibits a fixed-point (FP) structure that shapes the sampled trajectories, and the student must learn to reproduce these trajectories. Figure [1](https://arxiv.org/html/2605.12763#S2.F1) summarizes the results. Panel D shows the teacher dynamics, consisting of four stable FPs and five unstable FPs. Initially, the student exhibits dynamics that collapse to a single FP, so multiple bifurcations are required to reproduce the teacher dynamics. The dashed line in panels A–C marks the first learned bifurcation, a pitchfork (as in panel G), which coincides with a sudden drop in loss (panel A), consistent with eisenmann2023bifurcations. In panel B, we compute the effective rank of the state-space NTK ($\mathrm{sNTK}$), described in detail below. At the first bifurcation, this rank collapses sharply to one, before expanding again later in training during further bifurcations. The rest of this work analyzes this collapse phenomenon and its consequences.

In particular, the local GD amplification landscape near the bifurcation (panel E) is well predicted by the rank-one $\mathrm{sNTK}$ landscape associated with a pitchfork normal form (Figure [2](https://arxiv.org/html/2605.12763#S3.F2)B), which can be characterized exactly. In contrast to the simpler stability-flip bifurcation, whose landscape exhibits strictly monotonic gain near the bifurcation, the pitchfork appears to self-regulate, with the landscape peaking and then decaying around the bifurcation. In Appendix [C](https://arxiv.org/html/2605.12763#A3), we show that a rank-one natural-gradient corrector yields smoother loss curves for the same task, effectively neutralizing the low-rank unstable contribution to the NTK (and hence the Fisher information), with very little overhead compared to SGD.

Overall, this experiment shows that, in finite\-size nonlinear recurrent networks, \(1\) learning can become highly low\-dimensional near bifurcations, \(2\) this behavior is well captured by analytically tractable normal forms from dynamical systems theory, and \(3\) low\-rank natural gradient can train such models more stably with little additional overhead\.
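The rank collapse reported in panel B can be illustrated on a toy Jacobian. The sketch below assumes the stable-rank definition $\|K\|_F^{2}/\|K\|_2^{2}$ for $K = JJ^{T}$; the text here does not spell out which effective-rank definition is used, so this choice, the helper names, and the toy matrices are all illustrative assumptions rather than the authors' code.

```python
import math

def gram(J):
    """Return K = J J^T for a matrix J given as a list of rows."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in J] for ri in J]

def top_eigenvalue(K, iters=500):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration."""
    n = len(K)
    v = [1.0 / math.sqrt(n)] * n          # deterministic start vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(x * x for x in w))
        if lam == 0.0:
            return 0.0
        v = [x / lam for x in w]
    return lam

def stable_rank(K):
    """Stable rank ||K||_F^2 / ||K||_2^2 (assumed definition)."""
    fro2 = sum(x * x for row in K for x in row)
    lam = top_eigenvalue(K)
    return fro2 / (lam * lam)

# Toy Jacobians (hypothetical values): isotropic vs. one strongly amplified direction.
J_iso = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
J_amp = [[100.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

rank_iso = stable_rank(gram(J_iso))   # close to 3
rank_amp = stable_rank(gram(J_amp))   # close to 1
```

When one direction of the Jacobian is amplified by a factor of 100, the stable rank of the Gram matrix falls from about 3 to about 1, which is the qualitative signature described for the first bifurcation.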

## 3 Normal Forms Make the Mechanism Explicit

The decomposition derived in Appendix A.1 reduces the local learning geometry to a bifurcation-relevant channel $\mathrm{sNTK}_g$ and a residual term. The key question is whether $\mathrm{sNTK}_g$ is sufficiently amplified near criticality to dominate the full NTK. To study this, we use normal forms from dynamical systems theory: low-dimensional polynomial models that describe dynamics near bifurcations. For generic systems, a smooth change of coordinates brings the local dynamics near a bifurcation into agreement with the corresponding normal form. Thus, studying the NTK of normal forms can reveal local learning behavior that also appears in full network models.

Specifically, we study one-dimensional normal forms $h_{t+1}=f(h_t,g)$ with $g\to g^{*}$ inducing a codimension-one bifurcation. In this setting, the relevant NTK channel is rank one, so its norm directly measures learning strength along the critical dynamical direction. Figure [2](https://arxiv.org/html/2605.12763#S3.F2) shows that this channel is strongly amplified near criticality across representative codimension-one bifurcations. We focus on two cases here for simplicity: a stability flip and a pitchfork.

![Refer to caption](https://arxiv.org/html/2605.12763v1/x2.png)

Figure 2: Rank-one sNTK amplification for codimension-one bifurcations. We plot the norm of $\mathrm{sNTK}=(D_g h)(D_g h)^{T}$ for initial conditions $h_0$ sampled uniformly from $[-0.05,0.05]$ (blue) and $[-0.1,0.1]$ (orange), using $T=30$ timesteps. In all cases, the norm is strongly amplified near the bifurcation point $g^{*}=1$ (dashed line). The stability flip exhibits monotone growth past criticality, while the nonlinear bifurcations self-regulate and show peaked amplification.

#### A Linear Example: Stability Flip\.

We begin with the simple linear system $h_{t+1}=gh_t$. This is not a normal form in the usual sense, but it is the simplest system exhibiting a qualitative change in its dynamics as $g$ varies.

Indeed, we can derive the corresponding NTK explicitly. Writing $T$ for the time horizon of the underlying model, $\|\mathrm{sNTK}_g\|_2\propto\sum_{t=0}^{T-1}(t+1)^{2}g^{2t}$. For $|g|<1$, this sum is bounded above by its value at criticality, $\sum_{t=0}^{T-1}(t+1)^{2}\approx T^{3}/3$, which it approaches as $|g|\to 1$. For $|g|>1$, it grows like $T^{2}g^{2T}/(g^{2}-1)$ for large $T$, that is, exponentially in $T$ and without bound. Thus, local to the bifurcation, the NTK collapses to effectively rank one with massive norm, dominated by $\mathrm{sNTK}_g$, and this continues to worsen as GD pushes further into the unstable regime.
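The sum above can be checked numerically. The following is a minimal sketch, assuming the norm is taken as the squared Euclidean norm of $D_g h$ over timesteps $t=1,\dots,T$ with a single initial condition $h_0$; the function names are hypothetical and not from the paper.

```python
def sntk_norm_linear(g, T, h0=1.0):
    """Sum over t = 1..T of (dh_t/dg)^2 for the linear map h_{t+1} = g * h_t."""
    h, s, total = h0, 0.0, 0.0    # s tracks the sensitivity dh_t/dg
    for _ in range(T):
        s = h + g * s             # d/dg (g * h_t) = h_t + g * dh_t/dg
        h = g * h
        total += s * s
    return total

def closed_form(g, T, h0=1.0):
    """h0^2 * sum_{t=0}^{T-1} (t+1)^2 g^(2t), the expression in the text."""
    return h0 ** 2 * sum((t + 1) ** 2 * g ** (2 * t) for t in range(T))

T = 30
stable = sntk_norm_linear(0.9, T)      # bounded regime
critical = sntk_norm_linear(1.0, T)    # T(T+1)(2T+1)/6, roughly T^3/3
unstable = sntk_norm_linear(1.1, T)    # exponential blowup regime
```

The forward recursion for the sensitivity reproduces the closed-form sum, and the three values are ordered as the text describes: bounded below criticality, polynomial in $T$ at $g=1$, and much larger beyond it.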

#### Pitchfork Bifurcation\.

For the pitchfork normal form $h_{t+1}=gh_t-h_t^{3}$, the same instability appears, but nonlinear escape to the stable branches cuts off indefinite growth (Figure [2](https://arxiv.org/html/2605.12763#S3.F2)B). For $0<g<1$, the behavior matches the linear stability-flip case, with $\|\mathrm{sNTK}_g\|_2\propto\sum_{t=0}^{T-1}(t+1)^{2}g^{2t}$, while for $g>1$, amplification peaks and then decays as trajectories are damped onto the additional stable FPs at $\pm\sqrt{g-1}$. Thus, unlike the stability flip, the pitchfork exhibits regulated rather than strictly exponential amplification beyond bifurcation. This is exactly the behavior seen in the student-teacher RNN (compare Figures [1](https://arxiv.org/html/2605.12763#S2.F1)E and [2](https://arxiv.org/html/2605.12763#S3.F2)B).
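The contrast between regulated and unbounded amplification can be reproduced with a few lines of simulation. This is a sketch under the same assumptions as the linear case (squared norm of $D_g h$ over $t=1,\dots,T$, one initial condition); the function names and the chosen values of $g$ and $h_0$ are illustrative only.

```python
def sntk_norm_pitchfork(g, T, h0):
    """Sum over t = 1..T of (dh_t/dg)^2 for h_{t+1} = g*h_t - h_t**3."""
    h, s, total = h0, 0.0, 0.0           # s tracks the sensitivity dh_t/dg
    for _ in range(T):
        s = h + (g - 3.0 * h * h) * s    # chain rule through one step
        h = g * h - h ** 3
        total += s * s
    return total

def linear_reference(g, T, h0):
    """Same quantity for the linear map h_{t+1} = g*h_t (closed form)."""
    return h0 ** 2 * sum((t + 1) ** 2 * g ** (2 * t) for t in range(T))

T = 30
below = sntk_norm_pitchfork(0.8, T, 0.01)   # nearly identical to the linear case
above = sntk_norm_pitchfork(1.5, T, 0.05)   # trajectory settles at sqrt(g - 1)
```

Below criticality and for small $h_0$ the cubic term is negligible, so the pitchfork value tracks the linear sum. Above criticality the trajectory settles on the stable branch at $\sqrt{g-1}$, where the local multiplier $g-3h^{2}$ has magnitude below one, so the accumulated sensitivity stays many orders of magnitude below the linear reference.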

#### Takeaways\.

Bifurcations strongly affect learning because they induce rank\-one NTK terms that can dominate the local learning geometry\. Once this happens, GD becomes highly anisotropic: error signals aligned with the bifurcation\-relevant direction in state\-space produce large changes in the underlying model states, while orthogonal signals produce much smaller updates\. Thus, even scalar normal\-form models can accurately predict the local GD landscape of the corresponding full high\-dimensional models\.

## 4Discussion

We have shown that bifurcations correspond not only to major changes in model dynamics, but also to strong and predictable features of learning. Near codimension-1 transitions, the empirical state-space NTK collapses onto a single critical channel, making gradient descent effectively low-dimensional even in a large parameter space. This opens the door to an analytic theory of learning near bifurcations through normal forms. It also suggests $\mathrm{sNTK}$ collapse as a more general signature of critical feature-learning events in models beyond classical dynamical systems (e.g., transformers or input-driven MLPs). The main limitation of the present work is that it is local and explanatory rather than fully predictive, but this also points to a natural next step: developing practical methods for detecting bifurcations during GD learning and using that information to more stably optimize the model.

## Acknowledgments

We acknowledge Alexander Hsu for suggesting low-rank natural gradients (yang2020sketchy).


## Appendix A Derivations

### A.1 Local decomposition in bifurcation coordinates

As in (hazelden2026globalempiricalntkselfreferential), the $\mathrm{sNTK}$ operator can be written as

$$\mathrm{sNTK}=\mathcal{P}\mathcal{K}\mathcal{P}^{T},$$

following from an implicit reparameterization of the dynamics $h_t=f(h_{t-1},\theta)$ in the form $\mathcal{F}(h,\theta)=h-f(T_{\downarrow}h,\theta)=0$, where $T_{\downarrow}$ is a linear operator decrementing time by one and $h\in\mathbb{R}^{B\times T\times N}$ corresponds to all sample trajectories of the hidden state over a batch of size $B$, simulated for $T-1$ timesteps (with specific values of $B,T,N$ in the RNN task below, Appendix [B](https://arxiv.org/html/2605.12763#A2)). Here, $\mathcal{P}=(D_h\mathcal{F})^{-1}$ and $\mathcal{K}=(D_\theta\mathcal{F})(D_\theta\mathcal{F})^{*}$. Crucially, parameter change only changes $\mathcal{K}$. If $\phi:\mathbb{R}^{m}\rightarrow\mathbb{R}\times\mathbb{R}^{m-1}$ is a coordinate diffeomorphism $\phi:\theta\mapsto(g,R)$ local to a codimension-one bifurcation, $\theta^{*}=(g^{*},R^{*})$, then since $\theta=\phi^{-1}(g,R)$, by the chain rule,

$$D_\theta\mathcal{F}(\theta^{*})=\bigl(D_g\mathcal{F}(g^{*}),\,D_R\mathcal{F}(R^{*})\bigr)\cdot D_\theta\phi(\theta^{*}).$$

Finally, locally, we can choose $\phi$ to be a local isometry at $\theta^{*}$, so that $D_\theta\phi$ is the identity at $\theta^{*}$, yielding at $\theta^{*}$

$$\mathcal{K}=\mathcal{K}_g+\mathcal{K}_R=D_g\mathcal{F}(g^{*})\,D_g\mathcal{F}(g^{*})^{T}+D_R\mathcal{F}(R^{*})\,D_R\mathcal{F}(R^{*})^{T}.$$

Hence,

$$\mathrm{sNTK}=\mathcal{P}(\mathcal{K}_g+\mathcal{K}_R)\mathcal{P}^{T}=\mathrm{sNTK}_g+\mathrm{sNTK}_R$$

under this local isometry change of coordinates $\theta\mapsto(g,R)$, yielding the clean separation of the NTK into a bifurcation-relevant rank-one operator and a rank $m-1$ residual operator. Of course, this same procedure can be applied for higher-rank bifurcations, with $g\in\mathbb{R}^{k}$.
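The additive split can be verified on a small numerical example. The sketch below uses a hypothetical 3-by-3 toy problem: the columns of `J` stand in for $D_g\mathcal{F}$ (first column) and $D_R\mathcal{F}$ (remaining columns), and `P` is a lower-triangular stand-in for $(D_h\mathcal{F})^{-1}$ with a scalar state multiplier of 0.9. None of these values come from the paper.

```python
def matmul(A, B):
    """Plain matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def gram(J):
    """K = J J^T."""
    return matmul(J, transpose(J))

def sandwich(P, K):
    """P K P^T."""
    return matmul(matmul(P, K), transpose(P))

# Hypothetical toy values: 3 state coordinates, parameters (g, R1, R2).
J = [[2.0, 0.1, 0.0],
     [1.0, 0.0, 0.2],
     [0.5, 0.3, 0.1]]
P = [[1.0, 0.0, 0.0],
     [0.9, 1.0, 0.0],
     [0.81, 0.9, 1.0]]    # propagator of an unrolled scalar recursion

J_g = [[row[0]] for row in J]     # bifurcation-relevant column
J_R = [row[1:] for row in J]      # residual columns

sntk = sandwich(P, gram(J))
sntk_g = sandwich(P, gram(J_g))   # rank one: outer product of P J_g with itself
sntk_R = sandwich(P, gram(J_R))   # rank at most m - 1 = 2
```

Because $JJ^{T}$ is a sum over parameter columns, the split holds exactly once the coordinates are separated, and `sntk_g` is the outer product of the single vector $\mathcal{P}\,D_g\mathcal{F}$ with itself.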

## Appendix B Student-Teacher Task Details

Both the student and teacher are vanilla RNNs (hochreiter1997long) with tanh activation. We trained the model with SGD, $\eta=5\cdot 10^{-3}$, for 35,000 GD iterations, without momentum or gradient clipping. The batch size was $256$, with each batch entry corresponding to a distinct initial condition for evaluating the student and teacher.

Each model had 64 hidden neurons in this case. The hidden neurons $h_0$ and $h_1$ were chosen as the read-in and readout of the model, i.e., the input and output weights are fixed and the same for both the student and teacher. The state of the model is thus a 3-tensor, $h\in\mathbb{R}^{B\times T\times 65}$, with $B=256$ the batch size and $T=25$ the number of unrolled timesteps. The goal of the task is to minimize the readout trajectory difference, i.e., on the same initial condition, minimize the MSE loss quantifying the average squared error $\|h(t)-h^{*}(t)\|^{2}$ between the student and teacher. The student was initialized with Xavier weights and biases, while the teacher was constructed the same way, then with weights adjusted so that its readout had the exact fixed-point structure in Figure [1](https://arxiv.org/html/2605.12763#S2.F1). This was done by using the fact that $W\tanh(x)=x$ for $x\in\mathbb{R}^{64}$ can be ensured by replacing $W$ by $\frac{x\tanh(x)^{T}}{\|\tanh(x)\|^{2}}+W_{\perp}$, where $W_{\perp}\tanh(x)=0$. Note that this also creates a fixed point at $-x$ since $\tanh$ is an odd function; hence, we can create symmetric fixed points centered at the origin.
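The fixed-point construction can be sketched in a few lines. The example below takes $W_{\perp}$ to annihilate $\tanh(x)$, which is the condition needed for $W\tanh(x)=x$ to hold, and builds it by projecting an arbitrary matrix `A` away from $\tanh(x)$. The function names, the 4-dimensional state, and the specific numbers are hypothetical.

```python
import math

def fixed_point_weights(x, A):
    """Return W = x tanh(x)^T / ||tanh(x)||^2 + A (I - u u^T), u = tanh(x)/||tanh(x)||."""
    n = len(x)
    t = [math.tanh(v) for v in x]
    t2 = sum(v * v for v in t)               # requires x != 0
    At = [sum(A[i][k] * t[k] for k in range(n)) for i in range(n)]
    # Entry (i, j): rank-one term plus A with its action on tanh(x) projected out.
    return [[x[i] * t[j] / t2 + A[i][j] - At[i] * t[j] / t2 for j in range(n)]
            for i in range(n)]

def readout_map(W, x):
    """Evaluate W tanh(x)."""
    t = [math.tanh(v) for v in x]
    return [sum(w * tv for w, tv in zip(row, t)) for row in W]

x = [0.5, -1.0, 0.25, 2.0]                   # desired fixed point (toy values)
A = [[0.3, -0.2, 0.1, 0.0],
     [0.0, 0.4, -0.1, 0.2],
     [-0.5, 0.1, 0.2, 0.3],
     [0.1, 0.0, -0.3, 0.6]]                  # arbitrary starting weights

W = fixed_point_weights(x, A)
image_pos = readout_map(W, x)                # equals x
image_neg = readout_map(W, [-v for v in x])  # equals -x by oddness of tanh
```

Both $x$ and $-x$ are mapped to themselves, which is the symmetric fixed-point pair described above; the remaining freedom in `A` leaves the dynamics away from these points unconstrained.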

Note here that no input is provided to the model: the task input can be seen as the choice of initial sample condition, so this resembles tasks in generative modeling (e.g., transforming one distribution into another). Adding input or a non-autonomous term into the dynamics requires a more robust realization of a fixed point and its bifurcation; hence, it is less naturally amenable to analysis by classical dynamical systems, so lenses such as $\mathrm{sNTK}$ low-rank collapse and amplification of $\mathrm{sNTK}$ norm could have particular use cases in these situations (see Discussion).

For the scalar-valued normal forms, there is no actual task: any loss can be chosen without changing the NTK. However, a natural choice of task in this setting (for future experiments) would be a similar student-teacher setup. For example, if a normal form has the dynamics $h_{t+1}=f(h_t,g)$, one could aim to minimize the MSE loss between sampled trajectories and a teacher network with $g=g^{*}$ chosen selectively (e.g., on the same or different side of the bifurcation). This could provide a useful testbed for analyzing what exactly causes GD to fail in these simple normal form examples.

## Appendix C Additional Examples and Low-Rank Natural Gradient Sketching

![Refer to caption](https://arxiv.org/html/2605.12763v1/figures/nd_two_ev.png)

Figure 3: Learning two unstable modes. After the first student mode becomes unstable, the local NTK geometry collapses toward one dominant direction, so the second unstable mode is learned only much more slowly. Solid lines denote student eigenvalues and dashed lines denote the unstable teacher eigenvalues. The right panel shows a zoomed view, where the large NTK norm near the transition leads to visible fluctuations from oversized GD steps.

Prior work studied learning an integrator in recurrent models through the infinite-width NTK (bordelon2025dynamicallylearningintegraterecurrent). Here, we instead consider learning multiple unstable modes in a finite-width setting. The point is to illustrate a direct consequence of the rank-one collapse described in the main text: once one mode becomes near-critical, the local learning geometry is dominated by that direction, making additional unstable modes much harder to learn.

To test this, we train an initially stable student network to follow a teacher network with two unstable eigenvalues. Figure [3](https://arxiv.org/html/2605.12763#A3.F3) shows that the first eigenvalue in the student network quickly crosses the stability boundary and moves toward the dominant unstable teacher mode. However, the second rises much more slowly and remains far from the teacher on the same timescale. In our experiments, learning the second unstable mode requires roughly two orders of magnitude more iterations.

This is consistent with the mechanism described in the main text\. When the first mode becomes near\-critical, this concentrates the NTK into an effectively one\-dimensional subspace\. Gradient descent therefore gains access to the unstable dynamics needed to fit the teacher network’s dynamics, but allocates most of its updates to that first mode\. In this sense, the same transition that enables unstable behavior also produces a strongly anisotropic, nearly rank\-one learning geometry that suppresses progress on learning additional unstable directions\.

### C.1 Low-Rank Natural Gradient Resolves Bifurcations

![Refer to caption](https://arxiv.org/html/2605.12763v1/figures/training_curves.png)

Figure 4: Rank-one natural-gradient correction near bifurcation. Reconditioning only the top Fisher/NTK mode produces much smoother and cleaner training than plain GD, with very little overhead, while still preserving the large loss drops at the bifurcations. This suggests that the jagged behavior in Figure [1](https://arxiv.org/html/2605.12763#S2.F1) is partly an optimization effect, but that the transitions themselves reflect genuine structural changes in the learned dynamics rather than mere numerical artifacts.

Because the state-space NTK geometry near each bifurcation is already dominated by a single amplified mode, a rank-one natural-gradient approximation provides a well-matched correction to the unstable rank-one mode of the NTK, and hence of the state-space Fisher information (FIM), since

$$\mathrm{sNTK}=J_\theta J_\theta^{T},\qquad \mathrm{FIM}=J_\theta^{T}J_\theta,$$

which have identical nonzero eigenvalues. This requires approximating the top eigenpair of the positive-semidefinite state-space FIM operator, a well-studied problem in numerical linear algebra. The simplest approach is power iteration, which repeatedly applies the operator to a random parameter-space direction (saad2011numerical; mishkin2018slang; yang2020sketchy). This keeps the overhead small relative to vanilla SGD. Empirically, we find that learning with a rank-one corrected natural gradient, rather than vanilla GD, makes the loss curves much smoother and cleaner than in Figure [1](https://arxiv.org/html/2605.12763#S2.F1), while still preserving the large loss drops at the bifurcations. That is the key point: the correction stabilizes the optimization around the transition, but it does not remove the transition itself. This shows, consistent with eisenmann2023bifurcations, that the loss drops are not merely numerical artifacts caused by oversized GD steps, but instead reflect genuine structural changes in the learned dynamics, manifesting as low-rank valleys in the loss landscape. Finally, for higher-dimensional bifurcations, or for learning multiple bifurcations in sequence, replacing the rank-one natural-gradient approximation with a low-rank estimator is more natural. Efficiently selecting the rank of the natural-gradient estimator during training is a direction for future work.
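A rank-one correction of this kind can be sketched with power iteration on $J_\theta^{T}J_\theta$ followed by a damped rank-one preconditioner. The corrector below, $(\mu I+\lambda vv^{T})^{-1}$ applied to the gradient via the Sherman-Morrison identity, is an illustrative assumption; the exact update used in the experiments is not specified here, and the function names, damping value, and toy Jacobian are hypothetical.

```python
import math

def matvec(J, v):
    """J v for a matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in J]

def rmatvec(J, u):
    """J^T u."""
    return [sum(J[i][j] * u[i] for i in range(len(J))) for j in range(len(J[0]))]

def top_fisher_eigenpair(J, iters=200):
    """Power iteration on F = J^T J using only products with J and J^T."""
    m = len(J[0])
    v = [1.0 / math.sqrt(m)] * m    # deterministic start; a random direction also works
    lam = 0.0
    for _ in range(iters):
        w = rmatvec(J, matvec(J, v))
        lam = math.sqrt(sum(x * x for x in w))
        if lam == 0.0:
            break
        v = [x / lam for x in w]
    return lam, v

def rank_one_natural_gradient(grad, lam, v, damping=1.0):
    """Apply (damping * I + lam * v v^T)^{-1} to grad, for unit-norm v."""
    proj = sum(g * vi for g, vi in zip(grad, v))
    shrink = lam / (damping + lam)
    return [(g - shrink * proj * vi) / damping for g, vi in zip(grad, v)]

# Toy Jacobian with one amplified direction (hypothetical values).
J = [[10.0, 0.0], [0.0, 1.0]]
lam, v = top_fisher_eigenpair(J)
step = rank_one_natural_gradient([1.0, 1.0], lam, v)
```

The gradient component along the amplified direction is scaled down by roughly $1/(\mu+\lambda)$ while the orthogonal component is left at the plain gradient scale, so only the stiff direction is reconditioned. Each power-iteration step costs one product with $J_\theta$ and one with $J_\theta^{T}$, which is why the overhead over SGD stays small.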
