Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos
Summary
This paper develops a mean-field theory of dropout as a perturbation at the edge of chaos in neural networks, deriving scaling laws for correlation decay and establishing distinct universality classes for smooth and ReLU-like activations. It also yields optimal dropout scheduling that reduces test loss with no extra computational cost.
Cached at: 05/22/26, 08:50 AM
# Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos
Source: [https://arxiv.org/html/2605.21648](https://arxiv.org/html/2605.21648)
###### Abstract
We develop a mean-field theory of dropout as a perturbation of critical signal propagation at the edge of chaos. Dropout shifts the perfect-alignment fixed point, making the depth scale for information propagation finite even at critical initialization. We derive critical and crossover scaling laws for correlation decay and establish that smooth activations and kinked, ReLU-like activations constitute distinct universality classes, with different critical exponents and a universal two-parameter scaling collapse in detuning and dropout strength. The distinction traces to the analytic structure of the correlation map: smooth activations admit a Taylor expansion near perfect alignment, while kinked activations develop a branch point with universal non-analyticity. As a corollary, the framework yields saturated dropout profiles under fixed budget; a rank-flow tie-breaker then selects front-loaded schedules, substantially reducing held-out test loss at no extra computational cost, with accuracy gains as a consistent secondary effect. We test the predictions in MLPs and Vision Transformers and discuss CNN/ResNet extensions.
dropout, mean\-field theory, edge of chaos, critical scaling, neural network initialization
## 1 Introduction
Mean-field analyses of randomly initialized deep networks reveal an order-to-chaos phase diagram that controls both signal propagation and gradient penetration with depth (Poole et al., [2016](https://arxiv.org/html/2605.21648#bib.bib35); Schoenholz et al., [2017](https://arxiv.org/html/2605.21648#bib.bib38); Bahri et al., [2020](https://arxiv.org/html/2605.21648#bib.bib48), [2024b](https://arxiv.org/html/2605.21648#bib.bib3)). In this work, we study how this picture is modified by dropout. Using the representation-group (coarse-graining) language of (Roberts et al., [2022](https://arxiv.org/html/2605.21648#bib.bib36); Bahri et al., [2024b](https://arxiv.org/html/2605.21648#bib.bib3)), we show that dropout behaves as a relevant perturbation: it displaces the critical fixed point and grows under depth coarse-graining, pushing the dynamics away from criticality and determining the macroscopic phase. Concretely, dropout deforms the correlation map (Schoenholz et al., [2017](https://arxiv.org/html/2605.21648#bib.bib38)) by adding a correlation-independent shift at perfect alignment, so that perfect correlation between inputs is no longer a fixed point for any nonzero dropout. Thus, even at the edge-of-chaos (Sompolinsky et al., [1988](https://arxiv.org/html/2605.21648#bib.bib40); Packard, [1988](https://arxiv.org/html/2605.21648#bib.bib29); Bertschinger and Natschläger, [2004](https://arxiv.org/html/2605.21648#bib.bib12)), the depth correlation length is finite. We interpret this shift as an equation of state for the decorrelation order parameter $m \equiv 1 - c^{\ast}$ (with $c^{\ast}$ the asymptotic inter-signal correlation), derive the resulting scaling laws, and show that smooth and kinked, ReLU-like activations exhibit the same qualitative phase diagram but different critical scaling in the presence of dropout.
Additionally, we show that the two control parameters (dropout strength and distance to criticality) admit a scaling collapse into a single universal form, and we illustrate how these predictions compare with mean-field recursions.

[Footnote 1: This is analogous to the homogeneous scaling of connected correlation functions near a critical point; related scaling forms hold for one-particle-irreducible vertex functions after the usual Legendre transform (Zinn-Justin, [2002](https://arxiv.org/html/2605.21648#bib.bib37)).]
This analysis yields three main conceptual contributions. First, while dropout was previously shown to destroy the order-to-chaos critical point (Schoenholz et al., [2017](https://arxiv.org/html/2605.21648#bib.bib38)), we show that the deformed mean-field map still retains a nontrivial $c^{\ast} < 1$ fixed point. This keeps the correlation length recursion and yields a Landau equation of state for $m = 1 - c^{\ast}$ together with a two-parameter scaling collapse. Second, we show that universality classes and scaling laws are set by activation smoothness: smooth and kinked activations have distinct critical exponents (Table [2](https://arxiv.org/html/2605.21648#S3.T2)), a distinction reflected in their Hermite spectral structure (App. [C](https://arxiv.org/html/2605.21648#A3)). Since the same Gaussian activation kernels arise beyond MLPs, including CNNs and residual branches (App. [A.4](https://arxiv.org/html/2605.21648#A1.SS4)), we expect the qualitative smooth/kinked split to be a well-motivated heuristic. Third, we treat dropout as a depth-dependent dynamical field: the mean-field variational problem fixes a saturated, step-like allocation at fixed budget, while a rank-flow tie-breaker outside the permutation-invariant closure selects front-loading. Treating classically fixed hyperparameters as fields that can change through the network opens a new dimension for hyperparameter optimization across depth, and potentially over training time. If the reader takes away only one thing from this paper, we hope it is the three ideas outlined above.
This work complements prior studies of depth-dependent regularization: stochastic depth drops whole residual blocks with increasing probability (Huang et al., [2016](https://arxiv.org/html/2605.21648#bib.bib23)), curriculum dropout anneals dropout over training time (Morerio et al., [2017](https://arxiv.org/html/2605.21648#bib.bib27)), and LayerDrop removes transformer layers for efficiency (Fan et al., [2020](https://arxiv.org/html/2605.21648#bib.bib19)). Our schedules are spatial depth profiles, so they are complementary to temporal curricula and layer-dropping mechanisms. Qualitatively different edge-of-chaos behavior for smooth and ReLU-like activations was previously observed in (Hayou et al., [2019](https://arxiv.org/html/2605.21648#bib.bib20)); here we extract the scaling exponents and identify the corresponding universality classes.
## 2 Mean-field theory background
We first state the mean-field assumptions for MLPs, which provide the baseline for the dropout analysis (Poole et al., [2016](https://arxiv.org/html/2605.21648#bib.bib35); Schoenholz et al., [2017](https://arxiv.org/html/2605.21648#bib.bib38); Bahri et al., [2020](https://arxiv.org/html/2605.21648#bib.bib48), [2024b](https://arxiv.org/html/2605.21648#bib.bib3)). We consider fully-connected MLPs at random initialization, where preactivations are well-approximated as Gaussian random variables with self-consistently determined statistics (means, variances, covariances). In the strict infinite width limit this Gaussian description becomes exact (Lee et al., [2018](https://arxiv.org/html/2605.21648#bib.bib47)), while at large but finite width deviations from Gaussianity can be organized perturbatively in the depth-to-width aspect ratio (schematically $L/N$), as treated comprehensively in (Roberts et al., [2022](https://arxiv.org/html/2605.21648#bib.bib36); Bahri et al., [2024b](https://arxiv.org/html/2605.21648#bib.bib3)). The intuition is that each preactivation is a sum over the $N$ activations of the previous layer, so by the central limit theorem its statistics approach a Gaussian as $N \to \infty$.
We consider an MLP of depth $L$, $N_{l}$ neurons in layer $l$, and an activation $\phi:\mathbb{R}\to\mathbb{R}$ (e.g. $\tanh$, ReLU, *etc.*). We use the standard i.i.d. Gaussian initialization

$$W^{l}_{ij} \sim \mathcal{N}\left(0, \frac{\sigma_{w}^{2}}{N_{l}}\right), \qquad b^{l}_{i} \sim \mathcal{N}\left(0, \sigma_{b}^{2}\right), \tag{1}$$

take the input as $y^{0}_{i;a} = x_{i;a}$, and propagate forward via

$$z^{l}_{i;a} = W^{l}_{ij} y^{l}_{j;a} + b^{l}_{i}, \qquad y^{l+1}_{i;a} = \phi\left(z^{l}_{i;a}\right), \tag{2}$$

where $a$ labels the input. Throughout, expectations $\mathbb{E}[\cdot]$ are over the random weights and biases at initialization, with inputs held fixed. We also use the standard Gaussian measure

$$\int Dz\,(\cdots) \;\equiv\; \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} dz\, e^{-z^{2}/2} (\cdots), \tag{3}$$

and similarly for $\int Dz_{1} Dz_{2}$.
In the infinite\-width limit this becomes exact layer\-by\-layer, while at finite width the first layer is exactly Gaussian \(being a linear map of fixed inputs\) and subsequent layers are only approximately Gaussian, with controlled non\-Gaussian corrections parametrized by the depth\-to\-width ratio\(Robertset al\.,[2022](https://arxiv.org/html/2605.21648#bib.bib36)\)\. A fuller treatment of the MFT background is given in App\.[A\.2](https://arxiv.org/html/2605.21648#A1.SS2)\.
We track the single-input preactivation variance $q^{l}_{aa}$, the two-input covariance $q^{l}_{ab}$, and the induced correlation $c^{l}_{ab}$:

$$q^{l}_{aa} = \sigma_{w}^{2} \int Dz\, \phi^{2}\left(\sqrt{q^{l-1}_{aa}}\, z\right) + \sigma_{b}^{2} \tag{4}$$

$$q^{l}_{ab} = \sigma_{w}^{2} \int Dz_{1} Dz_{2}\, \phi(u_{1}) \phi(u_{2}) + \sigma_{b}^{2} \tag{5}$$

$$u_{1} = \sqrt{q^{l-1}_{aa}}\, z_{1}, \qquad c^{l-1}_{ab} = \frac{q^{l-1}_{ab}}{\sqrt{q^{l-1}_{aa} q^{l-1}_{bb}}}, \tag{6}$$

$$u_{2} = \sqrt{q^{l-1}_{bb}} \left( c^{l-1}_{ab} z_{1} + \sqrt{1 - \left(c^{l-1}_{ab}\right)^{2}}\, z_{2} \right). \tag{7}$$

In the absence of dropout, for a wide array of activations $q^{l}_{ab}, c^{l}_{ab}$ settle to fixed points $q^{*}, c^{*}$; in particular, there exists a $c^{*} = 1$ fixed point, as a straightforward calculation shows. The variance fixed point sets the typical activation scale, while the correlation fixed point describes whether distinct inputs become indistinguishable under depth iteration. We denote the recurrence relation for $c^{l}_{ab}$ by $c^{l}_{ab} = F(c^{l-1}_{ab})$.
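The recursions in Eqs. (4)-(7) can be evaluated numerically. The following is a minimal sketch, assuming Gauss-Hermite quadrature for the Gaussian measure, equal variances for the two inputs, and illustrative hyperparameters (`SW2`, `SB2`) that are not taken from the paper; all function names are hypothetical. It checks that $c = 1$ is mapped to itself when there is no dropout.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Probabilists' Gauss-Hermite rule, normalized so that
# sum(W * f(Z)) approximates the Gaussian measure  ∫ Dz f(z)  of Eq. (3).
Z, W = hermegauss(80)
W = W / np.sqrt(2.0 * np.pi)


def variance_map(q, phi, sw2, sb2):
    """Eq. (4): next-layer variance from the previous-layer variance q."""
    return sw2 * np.sum(W * phi(np.sqrt(q) * Z) ** 2) + sb2


def covariance_map(q, c, phi, sw2, sb2):
    """Eqs. (5)-(7), specialized to q_aa = q_bb = q and correlation c."""
    z1, z2 = np.meshgrid(Z, Z, indexing="ij")
    u1 = np.sqrt(q) * z1
    u2 = np.sqrt(q) * (c * z1 + np.sqrt(max(1.0 - c * c, 0.0)) * z2)
    return sw2 * np.sum(np.outer(W, W) * phi(u1) * phi(u2)) + sb2


def correlation_map(c, q, phi, sw2, sb2):
    """F(c): next-layer correlation, evaluated at variance q."""
    return covariance_map(q, c, phi, sw2, sb2) / variance_map(q, phi, sw2, sb2)


def variance_fixed_point(phi, sw2, sb2, q0=1.0, iters=500):
    """Iterate Eq. (4) a fixed number of times to approximate q*."""
    q = q0
    for _ in range(iters):
        q = variance_map(q, phi, sw2, sb2)
    return q


SW2, SB2 = 1.5, 0.05  # illustrative values, not from the paper
q_star = variance_fixed_point(np.tanh, SW2, SB2)
f_at_one = correlation_map(1.0, q_star, np.tanh, SW2, SB2)
f_at_half = correlation_map(0.5, q_star, np.tanh, SW2, SB2)
print(f"q* = {q_star:.4f}, F(1) = {f_at_one:.6f}, F(0.5) = {f_at_half:.4f}")
```

At $c = 1$ the two arguments $u_1$ and $u_2$ coincide, so the covariance integral reduces to the variance integral and $F(1) = 1$ up to floating-point error.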
To probe the stability of the $c^{*} = 1$ fixed point, one linearizes the map at $c = 1$, defining the (angular) susceptibility

$$\chi_{\perp} \equiv \left. \frac{\partial c^{l}_{ab}}{\partial c^{l-1}_{ab}} \right|_{c_{ab}=1} = \sigma_{w}^{2} \int Dz\, \big[ \phi'(\sqrt{q^{*}}\, z) \big]^{2}. \tag{8}$$

This creates a phase diagram with three regimes: $\chi_{\perp} < 1$ is the ordered regime, $\chi_{\perp} > 1$ is chaotic, and $\chi_{\perp} = 1$ defines the critical regime (the so-called "edge-of-chaos"). At criticality, the depth scale controlling cross-input correlations can become arbitrarily large, allowing information to penetrate deep into the network. For ReLU, the criticality condition gives $\sigma_{w}^{2} = 2$, coinciding with the variance used in He initialization (He et al., [2015](https://arxiv.org/html/2605.21648#bib.bib21)), a more fundamental viewpoint than variance preservation.

[Footnote 2: This is not true for general activations, but rather an idiosyncrasy of ReLU (see App. [A.2](https://arxiv.org/html/2605.21648#A1.SS2)).]
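As a quick numerical illustration of Eq. (8), the sketch below evaluates $\chi_{\perp}$ for ReLU by Gauss-Hermite quadrature and labels the resulting regime. The helper names (`chi_perp`, `phase`) and the tolerance are hypothetical choices. Because the derivative of ReLU is a step function, the value of $q^{*}$ does not affect the result, so `q_star=1.0` is only a placeholder.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# An even node count keeps every quadrature node away from the ReLU kink at z = 0.
Z, W = hermegauss(64)
W = W / np.sqrt(2.0 * np.pi)  # normalize to the Gaussian measure Dz


def chi_perp(dphi, q_star, sw2):
    """Eq. (8): sigma_w^2 * ∫ Dz [phi'(sqrt(q*) z)]^2."""
    return sw2 * np.sum(W * dphi(np.sqrt(q_star) * Z) ** 2)


def relu_prime(u):
    """Derivative of ReLU away from the origin."""
    return (u > 0).astype(float)


def phase(chi, tol=1e-9):
    """Classify the regime from the susceptibility."""
    if chi < 1.0 - tol:
        return "ordered"
    if chi > 1.0 + tol:
        return "chaotic"
    return "critical"


chi_relu = chi_perp(relu_prime, q_star=1.0, sw2=2.0)
print(chi_relu, phase(chi_relu))
for sw2 in (1.5, 2.5):
    print(sw2, phase(chi_perp(relu_prime, 1.0, sw2)))
```

Half of the Gaussian mass lies at $z > 0$, so $\chi_{\perp} = \sigma_{w}^{2}/2$ for ReLU, which is why $\sigma_{w}^{2} = 2$ sits at the edge.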
This leads to an emergent characteristic depth scale $\xi_{c}$, which parametrizes both signal propagation and gradient flow. Iterating random affine maps induces an effective coarse-graining in depth akin to renormalization-group flow, called representation group flow in the ML context (Roberts et al., [2022](https://arxiv.org/html/2605.21648#bib.bib36)); information about norms and angles relaxes exponentially fast,

$$\left| c^{*} - c^{l}_{ab} \right| \propto e^{-l/\xi_{c}}, \qquad \xi_{c}^{-1} \equiv -\log |r_{c}|. \tag{9}$$

An analogous correlation length $\xi_{q}$ controls how quickly single-input norms settle to $q^{*}$. Heuristically, $\xi_{c}$ controls how far distinctions between different inputs survive with depth; in the ordered phase $c^{*} = 1$, so $u_{1}^{*} = u_{2}^{*}$ and $\xi_{c}^{-1} = -\log \chi_{\perp}$, hence $\chi_{\perp} \to 1$ implies $\xi_{c} \to \infty$. Consequently, the edge-of-chaos is primarily about $\xi_{c}$ rather than $\xi_{q}$, except in special cases. For ReLU, the curvature vanishes everywhere away from the origin, leading to a coincidence between $\xi_{c}$ and $\xi_{q}$, both diverging as $\chi_{\perp} \to 1$, provided a finite $q^{*}$ exists. This is one concrete reason why ReLU-style near-critical initializations are comparatively forgiving in practice.
As previously discussed, mean-field theory is an expansion in non-Gaussian corrections suppressed by powers of $L/N$. Therefore, in our experiments we work in the regime $L/N \ll 1$ (i.e., $N \gg L$), where the large-width expansion is controlled (Roberts et al., [2022](https://arxiv.org/html/2605.21648#bib.bib36)). The correlation length diverges as we approach criticality, amplifying finite-width and other non-Gaussian corrections. Thus, when probing critical power laws with dropout, we keep the dropout rate small but non-negligible: infinitesimal dropout would require prohibitively deep networks to estimate the correlation length and related quantities.
Although we derive these results for MLPs, analogous large-width limits exist elsewhere. CNNs are the canonical convolutional setting (LeCun et al., [1998](https://arxiv.org/html/2605.21648#bib.bib14)), and at infinite channel count their covariance recursions again close through Gaussian activation kernels (Xiao et al., [2018](https://arxiv.org/html/2605.21648#bib.bib44)). ResNets are especially suggestive: Yang and Schoenholz (Yang and Schoenholz, [2017](https://arxiv.org/html/2605.21648#bib.bib45)) find qualitatively different tanh/ReLU large-depth behavior. Skip connections alter global depth dynamics, easing the information propagation by injecting the original signal after every layer, changing exponential convergence to subexponential or polynomial laws and allowing norm drift, but each residual branch still inherits the local nonlinear Gaussian kernel. Thus our smooth/kinked universality split offers a possible theoretical explanation for their observed dichotomy; a dropout-deformed ResNet theory is left for future work. For transformers (Vaswani et al., [2017](https://arxiv.org/html/2605.21648#bib.bib42)), related infinite-width attention analyses use large hidden dimension and/or many heads (Hron et al., [2020](https://arxiv.org/html/2605.21648#bib.bib46)). Because the exponents are controlled by the local analytic structure of this channel, we expect the mechanism to persist when it controls the wide-limit recursion. In Sec. [3.3](https://arxiv.org/html/2605.21648#S3.SS3), we test this extrapolation in transformers.
In App\.[D](https://arxiv.org/html/2605.21648#A4)we probe realistic scenarios across architectures, datasets, and schedules\. We focus on overparameterized regimes where dropout improves generalization and accuracy, and thus where it is used in practice, and we explicitly study the trade\-off between dropout\-driven regularization and stable signal propagation\. The mean\-field objective predicts sparse, step\-like allocation at fixed budget, but it is permutation invariant in depth and therefore cannot by itself decide where dropout should be concentrated\. To break this degeneracy we appeal to effects outside the mean\-field closure, such as early coadaptation and rank collapse: these make the first layers the most valuable place to inject noise, explaining why front\-loaded schedules work best in the experiments\.
## 3 Universality Classes and Critical Scaling under Dropout
Dropout (Srivastava et al., [2014](https://arxiv.org/html/2605.21648#bib.bib41); Wager et al., [2013](https://arxiv.org/html/2605.21648#bib.bib43)) is often used as a regularizer that reduces overfitting by randomly masking activations during training, thereby reducing the network's reliance on any single subset of pathways and discouraging co-adaptation. In the previous literature, dropout was explored in MFT (Huang et al., [2020](https://arxiv.org/html/2605.21648#bib.bib25); Schoenholz et al., [2017](https://arxiv.org/html/2605.21648#bib.bib38)), but not in the universality class sense. We formalize dropout as multiplicative Bernoulli noise applied post-activation,

$$p^{(\ell)}_{j} \sim \mathrm{Bernoulli}(\rho), \qquad y^{(\ell)}_{j} \;\mapsto\; \tilde{y}^{(\ell)}_{j} \equiv \frac{p^{(\ell)}_{j}}{\rho}\, y^{(\ell)}_{j}, \tag{10}$$

i.e. the standard inverted-dropout convention in which $\mathbb{E}_{p}[\tilde{y}^{(\ell)}_{j}] = y^{(\ell)}_{j}$, and hence $\mathbb{E}_{p}[\tilde{z}^{\ell}_{i} \mid W, b] = z^{\ell}_{i}$, so the mean preactivation is invariant to the dropout rate (Schoenholz et al., [2017](https://arxiv.org/html/2605.21648#bib.bib38)). Throughout, $\rho$ is the keep probability. When comparing two inputs, we draw the dropout masks independently for each input; this independence is the source of the decorrelation effect analyzed below.

[Footnote 3: Correlated dropout masks can alleviate this detuning: if the same mask is shared across the batch, the $c = 1$ fixed point is preserved and near-critical scaling can be recovered by retuning, but the noise becomes batch coupled and the regularization is correspondingly weaker and more structured. We leave detailed analysis of correlated masks to future work.]
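The effect of independent masks in Eq. (10) can be seen in a small Monte Carlo experiment. This is an illustrative sketch only: the Gaussian stand-in for post-activations, the sample size, and the value of `rho` are assumptions, not settings from the paper. Two perfectly aligned inputs share the same `y` but receive independent masks.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                                   # keep probability
y = rng.normal(size=200_000)                # stand-in post-activations (illustrative)

# Two perfectly aligned inputs share y but receive independent masks.
mask_a = rng.random(y.shape) < rho
mask_b = rng.random(y.shape) < rho
y_a = mask_a / rho * y                      # Eq. (10) applied to input a
y_b = mask_b / rho * y                      # Eq. (10) applied to input b

mean_shift = np.mean(y_a - y)                       # close to 0: mean is preserved
self_ratio = np.mean(y_a * y_a) / np.mean(y * y)    # close to 1/rho
cross_ratio = np.mean(y_a * y_b) / np.mean(y * y)   # close to 1
print(mean_shift, self_ratio, cross_ratio, cross_ratio / self_ratio)
```

The single-input second moment is inflated by roughly $1/\rho$ while the cross moment is not, so the normalized overlap of two initially identical signals falls below one after a single masked layer.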
The smooth/kinked distinction originally emerged from the following variational question: among activations of fixed scale, which functions make the correlation length of the near-critical Gaussian channel as large as possible? Extremizing the correlation length functional shows that the natural eigenfunctions of this problem are Hermite polynomials. When standard nonlinearities are decomposed in that basis, two qualitatively different spectra emerge. Smooth, analytic activations such as $\tanh$ have exponentially decaying Hermite coefficients, so their effect is dominated by the first few modes. Kinked activations such as ReLU have power-law tails instead: the kink keeps exciting higher Hermite modes, leaving a broader spectrum in the correlation map. This spectral split was the original diagnostic for the smooth/kinked universality classes, which later reappears in the dropout scaling laws. The details of these computations can be found in App. [C](https://arxiv.org/html/2605.21648#A3).
Our classification is orthogonal to the scale\-invariant vs\. non\-scale\-invariant distinction of\(Robertset al\.,[2022](https://arxiv.org/html/2605.21648#bib.bib36)\): the multi\-scale Hermite spectrum highlighted above is controlled primarily by analytic regularity \(kinks and other non\-analytic structure\), not by scale invariance per se\. The criterion is based on two facts: \(a\) near\-critical scaling laws differ between smooth and kinked activations, and \(b\) the Hermite spectral weight of non\-analytic activations decays as a power law, in contrast to the exponential decay characteristic of smooth activations \(Apps\.[B](https://arxiv.org/html/2605.21648#A2)and[C](https://arxiv.org/html/2605.21648#A3)\)\.
It is in this context that we seek a sharper analytical understanding of how dropout deforms the edge-of-chaos. Extending mean-field propagation to include dropout, one finds that for a single input dropout largely acts as a renormalization of the effective weight variance, $\sigma_{w}^{2} \mapsto \sigma_{w}^{2}/\rho$, but for two inputs the perfect-alignment fixed point at $c^{*} = 1$ is lost for any $\rho < 1$ when the two dropout masks are chosen independently (Schoenholz et al., [2017](https://arxiv.org/html/2605.21648#bib.bib38)). In other words, dropout acts as an explicit symmetry-breaking "field" for alignment: even if two inputs are perfectly aligned at some depth, the next layer produces $\bar{c}_{ab} < 1$, cutting off the divergence of the depth correlation length at the edge.
Introducing multiplicative Bernoulli noise with keep probability $\rho$, the perturbed preactivations read

$$z_{i}^{\ell} = \frac{1}{\rho} \sum_{j} W_{ij}^{\ell}\, p_{j}^{\ell}\, y_{j}^{\ell} + b_{i}^{\ell}. \tag{11}$$

The single-input variance recursion becomes

$$\bar{q}_{aa}^{\ell} = \frac{\sigma_{w}^{2}}{\rho} \int Dz\, \phi^{2}\left( \sqrt{\bar{q}_{aa}^{\ell-1}}\, z \right) + \sigma_{b}^{2}, \tag{12}$$

where the bar reminds us that the variance fixed point is shifted by dropout. We denote the (putative) dropout variance fixed point by

$$\bar{q}^{*} \;=\; \frac{\sigma_{w}^{2}}{\rho} \int Dz\, \phi^{2}\big( \sqrt{\bar{q}^{*}}\, z \big) + \sigma_{b}^{2}. \tag{13}$$

For two distinct inputs $a \neq b$, we find a dropout shift at perfect alignment. If at some depth $c^{\ell}_{ab} = 1$, then at the next layer one finds (Schoenholz et al., [2017](https://arxiv.org/html/2605.21648#bib.bib38))

$$\bar{F}_{\rho}(1) = \bar{c}^{\ell+1}_{ab} = 1 - \frac{1-\rho}{\rho\, \bar{q}^{*}}\, \sigma_{w}^{2} \int Dz\, \phi^{2}\left( \sqrt{\bar{q}^{*}}\, z \right). \tag{14}$$

It is therefore natural to define the dropout field
$$h \;\equiv\; 1 - \bar{F}_{\rho}(1) \;=\; \frac{1-\rho}{\rho\, \bar{q}^{*}}\; \sigma_{w}^{2} \int Dz\, \phi^{2}\big( \sqrt{\bar{q}^{*}}\, z \big). \tag{15}$$

By construction, $\rho < 1 \Rightarrow h > 0$, so the fixed point at $c^{*} = 1$ is immediately spoiled: even if two inputs are perfectly aligned at some depth, the next layer produces $\bar{c}_{ab} = 1 - h < 1$. Since we focus on weak dropout, it is useful to perform the small-$\delta\rho$ expansion. Writing $\delta\rho \equiv 1 - \rho$ with $\delta\rho \ll 1$, we have $\bar{q}^{*} = q^{*} + \mathcal{O}(\delta\rho)$, so to leading order we have

$$h = \delta\rho\; \frac{\sigma_{w}^{2}}{\bar{q}^{*}} \int Dz\, \phi^{2}\big( \sqrt{\bar{q}^{*}}\, z \big) + \mathcal{O}(\delta\rho^{2}). \tag{16}$$

The backward pass has the same source of asymmetry. Let $\delta^{\ell}_{j,a}$ be the backpropagated preactivation gradient for input $a$. With inverted dropout,
$$\delta^{\ell}_{j,a} = \frac{p^{\ell}_{j,a}}{\rho}\, \phi'(u^{\ell}_{j,a}) \sum_{i} W^{\ell+1}_{ij}\, \delta^{\ell+1}_{i,a}, \tag{17}$$

up to convention-dependent layer indexing. If $G^{\ell}_{ab} \equiv \mathbb{E}[\delta^{\ell}_{a} \delta^{\ell}_{b}]$ denotes the two-input gradient covariance, the wide-limit recursion contains

$$G^{\ell}_{ab} = \sigma_{w}^{2}\, \mathbb{E}\left[ \frac{p^{\ell}_{a} p^{\ell}_{b}}{\rho^{2}}\, \phi'(u_{a})\, \phi'(u_{b}) \right] G^{\ell+1}_{ab}. \tag{18}$$

Thus independent masks give a diagonal factor $\mathbb{E}[p_{a}^{2}]/\rho^{2} = 1/\rho$ but an off-diagonal factor $\mathbb{E}[p_{a} p_{b}]/\rho^{2} = 1$. Dropout inflates single-sample gradient variance without similarly inflating cross-sample covariance, mirroring the forward asymmetry that produced $h$. After normalization this suggests a backward field $h_{\nabla}(\rho)$ that moves perfect gradient alignment below one. We do not identify it with the forward field: the backward kernel is built from $\phi'$, not $\phi$, and for ReLU $\phi'$ is discontinuous. A full dropout-deformed backward critical theory, including finite-width gradient susceptibilities and training-time mask correlations, is left for future work.
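The forward dropout field of Eqs. (13)-(15) can be computed directly. The sketch below is one possible implementation, assuming Gauss-Hermite quadrature, plain fixed-point iteration for $\bar{q}^{*}$, and illustrative hyperparameters that are not from the paper; the function names are hypothetical. It also checks the identity $h = (1-\rho)\,(1 - \sigma_{b}^{2}/\bar{q}^{*})$, which follows from substituting Eq. (13) into Eq. (15).

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

Z, W = hermegauss(80)
W = W / np.sqrt(2.0 * np.pi)  # normalize to the Gaussian measure Dz


def second_moment(phi, q):
    """∫ Dz phi^2(sqrt(q) z)."""
    return np.sum(W * phi(np.sqrt(q) * Z) ** 2)


def dropout_variance_fixed_point(phi, sw2, sb2, rho, q0=1.0, iters=2000):
    """Iterate Eq. (12) a fixed number of times to approximate Eq. (13)."""
    q = q0
    for _ in range(iters):
        q = sw2 / rho * second_moment(phi, q) + sb2
    return q


def dropout_field(phi, sw2, sb2, rho):
    """Eq. (15): h = 1 - F_rho(1), evaluated at the dropout fixed point."""
    q_bar = dropout_variance_fixed_point(phi, sw2, sb2, rho)
    h = (1.0 - rho) / (rho * q_bar) * sw2 * second_moment(phi, q_bar)
    return h, q_bar


SW2, SB2, RHO = 1.5, 0.05, 0.9  # illustrative values, not from the paper
h, q_bar = dropout_field(np.tanh, SW2, SB2, RHO)
h_closed = (1.0 - RHO) * (1.0 - SB2 / q_bar)
print(f"q_bar* = {q_bar:.4f}, h = {h:.6f}, closed form = {h_closed:.6f}")
```

In the bias-free case $\sigma_{b}^{2} = 0$ the same identity gives $h = 1 - \rho$, independent of the activation.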
### 3.1 Landau theory and critical exponents
For smooth activations, the dropout-deformed correlation edge admits a simple Landau-style description in terms of three mean-field variables:

$$m \equiv 1 - c^{*}, \qquad t \equiv \chi_{\rho} - 1, \qquad h \equiv 1 - \bar{F}_{\rho}(1). \tag{19}$$

The fields $(m, t, h)$ are chosen to mirror statistical-mechanics notation: $t$ plays the role of a reduced temperature, $m$ is the order parameter (magnetization), and $h$ is an external field conjugate to $m$ that explicitly biases the system away from perfect alignment.
We assume inverted dropout with keep probability $\rho \in (0, 1]$, with masks drawn i.i.d. across units and independently across inputs. Let $\bar{q}^{*}$ denote the dropout variance fixed point ([13](https://arxiv.org/html/2605.21648#S3.E13)). For an activation for which Price's theorem applies and $\phi', \phi'' \in L^{2}(\mathcal{N}(0, \bar{q}^{\ast}))$ (for instance, $\phi \in C^{2}$ with sufficient Gaussian integrability), repeated application of Price's theorem gives the first two derivatives of the correlation map at perfect alignment,

$$\chi_{\rho} \equiv \bar{F}_{\rho}'(1) = \sigma_{w}^{2} \int Dz\, \big[ \phi'(\sqrt{\bar{q}^{*}}\, z) \big]^{2}, \tag{20}$$

$$g_{\rho} \equiv \bar{F}_{\rho}''(1) = \sigma_{w}^{2}\, \bar{q}^{*} \int Dz\, \big[ \phi''(\sqrt{\bar{q}^{*}}\, z) \big]^{2}. \tag{21}$$

[Footnote 4: The $\rho$-dependence is implicit through $\bar{q}^{*}$. The full computation is given in App. [A.3](https://arxiv.org/html/2605.21648#A1.SS3).]

In the smooth-class analysis below we assume $g_{\rho} > 0$. If $g_{\rho} = 0$, as for a linear activation, the quadratic Landau term is absent and the leading nonzero term determines a degenerate case rather than the generic smooth universality class. For kinked, ReLU-like activations, $\bar{F}_{\rho}''(1)$ is not finite and the analytic expansion below is not the correct organizing principle; we treat that universality class separately.
The Landau equation of state follows by expanding $\bar{F}_{\rho}(c)$ around $c = 1$. Writing $c^{*} = 1 - m$ with $0 < m \ll 1$ acting as the order parameter,

$$\bar{F}_{\rho}(1-m) = \bar{F}_{\rho}(1) - \chi_{\rho}\, m + \frac{g_{\rho}}{2} m^{2} + \mathcal{O}(m^{3}) \tag{22}$$

$$\phantom{\bar{F}_{\rho}(1-m)} = (1 - h) - \chi_{\rho}\, m + \frac{g_{\rho}}{2} m^{2} + \mathcal{O}(m^{3}), \tag{23}$$

imposing the fixed-point condition $1 - m = \bar{F}_{\rho}(1-m)$ yields

$$h + t m - \frac{g_{\rho}}{2} m^{2} = 0, \quad \Longleftrightarrow \quad h = \frac{g_{\rho}}{2} m^{2} - t m, \tag{24}$$

with $t \equiv \chi_{\rho} - 1$. The physical (nonnegative) solution is

$$m(t, h) \;=\; \frac{t + \sqrt{t^{2} + 2 g_{\rho} h}}{g_{\rho}}. \tag{25}$$

At the correlation edge $t = 0$, dropout displaces the fixed point by

$$m(0, h) \;=\; \sqrt{\frac{2h}{g_{\rho}}}. \tag{26}$$
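The closed-form branch in Eq. (25) is easy to check against the equation of state in Eq. (24). The following sketch uses hypothetical function names and arbitrary illustrative values of $t$, $h$, and $g_{\rho}$; it verifies that the residual vanishes, that the edge value reduces to Eq. (26), and that on the ordered side ($t < 0$) a weak field gives the linear response $m \approx h/|t|$.

```python
import math


def order_parameter(t, h, g):
    """Physical (nonnegative) branch of Eq. (25)."""
    return (t + math.sqrt(t * t + 2.0 * g * h)) / g


def equation_of_state_residual(m, t, h, g):
    """Left-hand side of Eq. (24): h + t*m - (g/2)*m^2, zero at the fixed point."""
    return h + t * m - 0.5 * g * m * m


g = 1.3  # illustrative curvature g_rho, not a value from the paper
for t, h in [(-0.2, 1e-3), (0.0, 1e-3), (0.2, 1e-3)]:
    m = order_parameter(t, h, g)
    print(t, h, m, equation_of_state_residual(m, t, h, g))

m_edge = order_parameter(0.0, 1e-3, g)             # compare with Eq. (26)
m_ordered = order_parameter(-0.1, 1e-6, g)         # weak field, ordered side
print(m_edge, math.sqrt(2.0 * 1e-3 / g), m_ordered, 1e-6 / 0.1)
```

The three regimes differ in how $m$ responds to $h$: linearly on the ordered side, as a square root at the edge, and with a finite offset $2t/g_{\rho}$ on the chaotic side.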
Table 1: Definitions of critical exponents in the thermodynamic limit ($t \to 0$, $h \to 0$).

For the $\alpha$ column we define the singular Landau potential $f_{\rm sing}$ through the equation of state. In the smooth case,

$$\frac{\partial f_{\rm sing}}{\partial m} = -h - t m + \frac{g_{\rho}}{2} m^{2}, \tag{27}$$

and in the kinked case,

$$\frac{\partial f_{\rm sing}}{\partial m} = -h - t m + \kappa\, m^{3/2}. \tag{28}$$

Evaluating $f_{\rm sing}$ on the stable branch at $h = 0$ gives $f_{\rm sing} \sim |t|^{3}$ for the smooth class and $f_{\rm sing} \sim |t|^{5}$ for the kinked class, hence $\alpha = -1$ and $\alpha = -3$, respectively.
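The quoted powers of $|t|$ can be reproduced by integrating Eqs. (27) and (28) in $m$ at $h = 0$ and substituting the nonzero stationary point. The worked steps below are a sketch under two assumptions that are not stated in the text: the integration constant is set to zero, and $\alpha$ is read off from the conventional form $f_{\rm sing} \sim |t|^{2-\alpha}$.

```latex
% Assumptions: integration constant set to zero; t > 0 selects the nonzero branch at h = 0;
% alpha is defined through f_sing ~ |t|^{2 - alpha}.
\begin{align*}
\text{smooth:}\quad
f_{\rm sing}(m) &= -\frac{t}{2}\,m^{2} + \frac{g_\rho}{6}\,m^{3},
& m &= \frac{2t}{g_\rho}
& \Rightarrow\quad f_{\rm sing} &= -\frac{2}{3}\,\frac{t^{3}}{g_\rho^{2}}, \\
\text{kinked:}\quad
f_{\rm sing}(m) &= -\frac{t}{2}\,m^{2} + \frac{2\kappa}{5}\,m^{5/2},
& m &= \left(\frac{t}{\kappa}\right)^{2}
& \Rightarrow\quad f_{\rm sing} &= -\frac{1}{10}\,\frac{t^{5}}{\kappa^{4}}.
\end{align*}
```

Matching $|t|^{3}$ and $|t|^{5}$ to $|t|^{2-\alpha}$ then gives $\alpha = -1$ and $\alpha = -3$, in agreement with the values quoted above.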
The square-root power law in Eq. (26) is our first encounter with a larger family of critical exponents familiar in the physics literature, defined in Table [1](https://arxiv.org/html/2605.21648#S3.T1). The key exponents for our analysis are those governing the depth correlation length $\xi_{c,\rho}$, as it sets the characteristic depth scale over which distinct inputs remain distinguishable and therefore provides an effective upper bound on the depth over which the network can reliably propagate information (and learn). Their derivations are straightforward but cumbersome, so we collect them in App. [B](https://arxiv.org/html/2605.21648#A2) and simply quote the resulting exponent values in Table [2](https://arxiv.org/html/2605.21648#S3.T2). For reference, we also list the standard mean-field (Landau) exponents of the Ising universality class (Ising, [1925](https://arxiv.org/html/2605.21648#bib.bib10); Landau and Lifshitz, [1980](https://arxiv.org/html/2605.21648#bib.bib11)) and the corresponding mean-field exponents of spin-glass theory (Edwards and Anderson, [1975](https://arxiv.org/html/2605.21648#bib.bib7); Parisi, [1979](https://arxiv.org/html/2605.21648#bib.bib31), [1980](https://arxiv.org/html/2605.21648#bib.bib32), [1983b](https://arxiv.org/html/2605.21648#bib.bib33)). The Sherrington-Kirkpatrick (SK) entries are quoted in the standard mean-field convention and are meant as structural motivation (an order parameter without a simple $\mathbb{Z}_{2}$ symmetry and a cubic effective interaction), not as a claim of shared universality: the smooth class shares the static entries $(\beta, \gamma, \delta, \alpha)$ in this convention, but not the correlation-length or relaxation exponents, while the kinked class is separated more sharply.
The spin\-glass comparison can be found in\(Mézardet al\.,[1987](https://arxiv.org/html/2605.21648#bib.bib26); Parisi,[1983a](https://arxiv.org/html/2605.21648#bib.bib34); Sherrington and Kirkpatrick,[1975](https://arxiv.org/html/2605.21648#bib.bib39)\)\.
Table 2: Critical exponents for neural network initialization across different universality classes. The kinked $\nu_{\rho}$ is a normal-form field exponent; ordinary bias-free ReLU dropout follows the constrained path $t = -h$ but has the same leading $\xi \sim h^{-1/3}$ law (App. [B](https://arxiv.org/html/2605.21648#A2)).

Figure [1](https://arxiv.org/html/2605.21648#S3.F1) summarizes these power laws from direct mean-field recursion and shows where higher-order corrections become visible away from the asymptotic regime.
Figure 1:Critical scaling for smooth \(tanh\) and kinked \(ReLU\\operatorname\{ReLU\}\) activation functions, comparing tuning at zero dropout to tuning at the edge\-of\-chaos using a dropout field\. The top row compares the different critical exponents at zero dropout and probes critical detuning decay, while the bottom row explores on critical networks with non\-zero dropout\. As the variables grow, higher\-order effects become comparable and the linearized theory departs from the full recursion\.
### 3.2 Scaling collapse and crossover region
At leading normal-form order, dropout and detuning enter the same local equation of state. The collapse below is exact for the truncated Landau normal form, while the full mean-field recursion receives higher-order corrections such as $\mathcal{O}(m^{3})$ in the smooth case, $\mathcal{O}(m^{2})$ in the kinked case, and corrections from the nonlinear microscopic map $h=\Delta(\rho)$. These corrections become visible away from the asymptotic near-critical regime. From the field-theory point of view, the resulting collapse is the standard signature of criticality: in the massless limit, connected $n$-point functions become homogeneous scaling functions of their separations, so the data collapse after rescaling by the appropriate critical dimensions (Zinn-Justin, [2002](https://arxiv.org/html/2605.21648#bib.bib37)). In practice, experiments are rarely tuned exactly to the correlation edge $t=0$ while simultaneously taking the dropout field to zero, $h\to 0$. The useful near-critical description is therefore a two-parameter scaling function.
For smooth activations, the order parameter $m\equiv 1-c^{*}$ obeys ([24](https://arxiv.org/html/2605.21648#S3.E24)),
$$h=\frac{g_{\rho}}{2}m^{2}-tm,\qquad t\equiv\chi_{\rho}-1.\tag{29}$$
The explicit crossover solution is the physical branch in ([25](https://arxiv.org/html/2605.21648#S3.E25)). Define rescaled variables
$$\tilde{m}\equiv m\sqrt{\frac{g_{\rho}}{2h}},\qquad\tilde{t}\equiv-\frac{t}{\sqrt{2g_{\rho}h}}.\tag{30}$$
Then all curves collapse onto the universal scaling function
$$\tilde{m}=\sqrt{1+\tilde{t}^{2}}-\tilde{t}.\tag{31}$$
The crossover scale is
$$|t|\sim\sqrt{g_{\rho}h}.\tag{32}$$
For $|t|\ll\sqrt{g_{\rho}h}$, one is in the field-dominated regime
$$m\sim\sqrt{\frac{2h}{g_{\rho}}}.\tag{33}$$
For $|t|\gg\sqrt{g_{\rho}h}$, the asymptotics depend on the sign of $t$:
$$t<0:\qquad m\simeq\frac{h}{|t|},\tag{34}$$
$$t>0:\qquad m\simeq\frac{2t}{g_{\rho}}+\frac{h}{t}+\cdots.\tag{35}$$
Thus, outside the crossover window the system is either in the linear-response regime on the subcritical side ($t<0$) or on the chaotic fixed-point branch on the supercritical side ($t>0$).
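As a quick numerical check of the smooth-class collapse, the sketch below solves the truncated normal form (29) for its non-negative root and verifies that the rescaled pairs from (30) fall on the curve (31). The function names and the values chosen for $g_\rho$, $h$, and $t$ are illustrative assumptions, not values from the paper; the full mean-field recursion would additionally show the higher-order deviations discussed above.

```python
import numpy as np

def m_smooth(t, h, g):
    """Non-negative root of h = (g/2) m^2 - t m (truncated normal form, Eq. 29)."""
    return (t + np.sqrt(t**2 + 2.0 * g * h)) / g

def rescale(t, h, g):
    """Rescaled variables (t_tilde, m_tilde) as defined in Eq. (30)."""
    m = m_smooth(t, h, g)
    return -t / np.sqrt(2.0 * g * h), m * np.sqrt(g / (2.0 * h))

def universal(t_tilde):
    """Universal scaling function of Eq. (31)."""
    return np.sqrt(1.0 + t_tilde**2) - t_tilde

# Hypothetical (g, h) pairs: every pair should land on the same curve.
t = np.linspace(-0.05, 0.05, 11)
for g, h in [(0.5, 1e-4), (2.0, 1e-3), (1.3, 5e-5)]:
    t_tilde, m_tilde = rescale(t, h, g)
    print(g, h, np.max(np.abs(m_tilde - universal(t_tilde))))
```

Because the normal form is quadratic, the agreement here is exact up to rounding; any visible spread in real data would come from terms beyond the truncation.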
For kinked activations, the leading nonlinearity is non-analytic and the equation of state takes the form (see ([102](https://arxiv.org/html/2605.21648#A2.E102))),
$$h=\kappa m^{3/2}-tm+\mathcal{O}(m^{2}).\tag{36}$$
This implies the scaling form
$$m(t,h)=\left(\frac{h}{\kappa}\right)^{2/3}\mathcal{F}\left(\frac{t}{\kappa^{2/3}h^{1/3}}\right).\tag{37}$$
Let
$$u\equiv\frac{t}{\kappa^{2/3}h^{1/3}},\tag{38}$$
and define $y(u)$ as the positive real root of
$$y^{3}-uy^{2}-1=0\qquad\implies\qquad\mathcal{F}(u)=y(u)^{2}.\tag{39}$$
The associated crossover scale is
$$|t|\sim\kappa^{2/3}h^{1/3},\tag{40}$$
showing that kinked activations lie in a distinct universality class from smooth activations. For literal bias-free $\operatorname{ReLU}$ with standard inverted dropout, the microscopic path locks $t=\rho-1=-h$, so the $t=0$, $h>0$ direction is a normal-form field direction rather than a directly independent $\operatorname{ReLU}$ knob. This path nevertheless satisfies $t/(\kappa^{2/3}h^{1/3})\to 0$, hence it enters the same field-dominated regime and retains the leading constrained-path law $\xi\sim h^{-1/3}$; App. [B](https://arxiv.org/html/2605.21648#A2) gives the short derivation and the nonzero-bias caveat.
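The kinked scaling function has no closed form as simple as the smooth one, but it is easy to evaluate numerically. The following sketch, under the assumption that only the truncated form of (36) is kept (the $\mathcal{O}(m^{2})$ term is dropped), finds the positive root of the cubic in (39) and reconstructs $m(t,h)$ from (37). The helper names and the values of $\kappa$, $h$, and $t$ are hypothetical.

```python
import numpy as np

def y_root(u):
    """Positive real root of y^3 - u*y^2 - 1 = 0 (Eq. 39); it is unique for real u."""
    roots = np.roots([1.0, -u, 0.0, -1.0])
    real = roots[np.abs(roots.imag) < 1e-8].real
    return float(real[real > 0.0].max())

def scaling_function(u):
    """F(u) = y(u)^2."""
    return y_root(u) ** 2

def m_kinked(t, h, kappa):
    """Order parameter from the scaling form of Eq. (37)."""
    u = t / (kappa ** (2.0 / 3.0) * h ** (1.0 / 3.0))
    return (h / kappa) ** (2.0 / 3.0) * scaling_function(u)

# Hypothetical parameters; the residual checks the truncated equation of state.
kappa, h = 0.7, 1e-4
for t in (-0.1, 0.0, 0.1):
    m = m_kinked(t, h, kappa)
    residual = kappa * m**1.5 - t * m - h
    print(f"t={t:+.2f}  m={m:.3e}  residual={residual:.1e}")
```

The two limits of the cubic recover the expected regimes: for $u\to-\infty$ one finds $\mathcal{F}(u)\to 1/|u|$, i.e. $m\simeq h/|t|$, while for $u\to+\infty$ one finds $\mathcal{F}(u)\to u^{2}$, i.e. $m\simeq t^{2}/\kappa^{2}$.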
Figure 2: Two-parameter crossover and scaling collapse of the dropout-deformed equation of state for the smooth universality class (tanh). Plots obtained using MFT recursion relations. The curves collapse onto a universal function after rescaling to $\tilde{t}$ and $\tilde{m}$. The kinked counterpart is shown in Fig. [4](https://arxiv.org/html/2605.21648#A4.F4).
Figure [2](https://arxiv.org/html/2605.21648#S3.F2) shows this collapse for smooth activations. This also clarifies the tanh/$\operatorname{ReLU}$ dichotomy observed in ResNet mean-field theory (Yang and Schoenholz, [2017](https://arxiv.org/html/2605.21648#bib.bib45)): tanh belongs to the smooth class, where the analytic $m^{2}$ term makes weak fields especially important and long correlation lengths are preserved when dropout is kept small; $\operatorname{ReLU}$ belongs to the kinked class, where the $m^{3/2}$ branch point broadens the crossover window and makes the dynamics more forgiving to imperfect tuning. In short, smooth channels reward keeping the dropout field very weak, while kinked channels tolerate detuning over a wider range; neither class is uniformly preferable.
### 3.3 Optimal dropout scheduling and rank-flow tie-breaking
So far we have taken the keep probability to be constant with depth. Since dropout enters as a relevant field that cuts off the depth correlation length even at the correlation edge, it is natural to ask how to allocate a fixed dropout budget across layers so as to preserve near-critical propagation as much as possible. Let us introduce a depth-dependent keep probability $\rho_{\ell}$ and define the associated dropout field
$$h_{\ell}\equiv\Delta(\rho_{\ell}),\tag{41}$$
i.e. the shift of the correlation map at perfect alignment induced by keep probability $\rho_{\ell}$. In the smooth universality class, linearizing the correlation map about the dropout-shifted fixed point gives (see ([87](https://arxiv.org/html/2605.21648#A2.E87)))
$$\lambda(t,h_{\ell})\simeq 1-\sqrt{t^{2}+2g_{\rho}h_{\ell}}.\tag{42}$$
For $0<\lambda_{\ell}\lesssim 1$, perturbations propagate multiplicatively,
$$\delta c_{L}\sim\left(\prod_{\ell=1}^{L}\lambda_{\ell}\right)\delta c_{0},\qquad\lambda_{\ell}\equiv\lambda(t,h_{\ell}),\tag{43}$$
which motivates defining an effective inverse correlation length as the mean decay rate per layer,
$$\xi_{\rm eff}^{-1}\equiv-\frac{1}{L}\sum_{\ell=1}^{L}\log\lambda_{\ell}\simeq\frac{1}{L}\sum_{\ell=1}^{L}\sqrt{t^{2}+2g_{\rho}h_{\ell}}.\tag{44}$$
At the correlation edge $t=0$, this reduces to
$$\xi_{\rm eff}^{-1}(t=0)\simeq\sqrt{2g_{\rho}}\,\frac{1}{L}\sum_{\ell=1}^{L}h_{\ell}^{1/2}\propto\frac{1}{L}\sum_{\ell=1}^{L}h_{\ell}^{1/2}.\tag{45}$$
We now fix a total dropout budget and local bounds,
$$\bar{h}\equiv\frac{1}{L}\sum_{\ell=1}^{L}h_{\ell},\qquad 0\leq h_{\ell}\leq h_{\max}.\tag{46}$$
The budget above is a budget in the dropout field $h_{\ell}$, not directly in the raw dropout probability $p_{\mathrm{drop},\ell}=1-\rho_{\ell}$. For weak dropout,
$$h_{\ell}=a\,p_{\mathrm{drop},\ell}+\mathcal{O}(p_{\mathrm{drop},\ell}^{2}),\qquad a=\frac{\sigma_{w}^{2}}{q^{\ast}}\int Dz\,\phi^{2}(\sqrt{q^{\ast}}z),\tag{47}$$
so a fixed $h$-budget and a fixed dropout-probability budget agree to leading order. For larger dropout, the nonlinear map $h_{\ell}=\Delta(\rho_{\ell})$ should be used explicitly. Maximizing $\xi_{\rm eff}$ is equivalent to minimizing $\xi_{\rm eff}^{-1}$, so in the continuum limit ($x=\ell/L$) we obtain the constrained variational problem
$$\min_{h}\int_{0}^{1}h(x)^{1/2}dx\qquad{\rm s.t.}\qquad\int_{0}^{1}h(x)dx=\bar{h}.\tag{48}$$
Since $h^{1/2}$ is concave, Jensen's inequality implies that the constant schedule maximizes $\int h^{1/2}$ at fixed mean; hence any minimizer must saturate the box constraint $0\leq h(x)\leq h_{\max}$ of (46) and is step-like. Equivalently, any schedule that takes values in $\{0,h_{\max}\}$ with active fraction $f=\bar{h}/h_{\max}$ is optimal for the mean-field functional. A front-loaded representative, selected below by the rank-flow tie-breaker, is
$$h(x)=\begin{cases}h_{\max},&0\leq x\leq f,\\ 0,&f<x\leq 1,\end{cases}\qquad f=\frac{\bar{h}}{h_{\max}},\tag{49}$$
where $x=0$ denotes the input side. All permutations of this saturated set give the same $\xi_{\rm eff}$ in the mean-field objective; the ordering is fixed only after the beyond-mean-field tie-breaker is introduced. At $t=0$, uniform dropout yields $\xi_{\rm eff}\sim\bar{h}^{-1/2}$, while the step schedule has $\xi_{\rm eff}^{-1}\propto f\,h_{\max}^{1/2}$, which gives
$$\frac{\xi_{\rm eff,step}}{\xi_{\rm eff,uniform}}=\sqrt{\frac{h_{\max}}{\bar{h}}}\;\geq\;1.\tag{50}$$
For kinked activations at $t=0$, one has $\xi^{-1}\propto h^{1/3}$, so the objective becomes $\int_{0}^{1}h(x)^{1/3}dx$, whose integrand is also concave, and the same conclusion follows.
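The gain in (50) and the permutation invariance of the mean-field objective can be checked numerically in a few lines. The sketch below evaluates the leading-order expression (44) for a uniform profile, a front-loaded step profile, and a random permutation of the step profile; the depth, budget, cap, and $g_\rho$ values are hypothetical, and the budget is chosen so that $L\bar{h}/h_{\max}$ is an integer.

```python
import numpy as np

def xi_eff_inv(h, t=0.0, g=1.0):
    """Leading-order inverse effective correlation length (Eq. 44), smooth class."""
    h = np.asarray(h, dtype=float)
    return float(np.mean(np.sqrt(t**2 + 2.0 * g * h)))

# Hypothetical depth, field budget, and local cap.
L, h_bar, h_max = 20, 0.01, 0.04
n_active = round(L * h_bar / h_max)                     # layers held at the cap

uniform = np.full(L, h_bar)
step = np.where(np.arange(L) < n_active, h_max, 0.0)    # front-loaded representative
shuffled = np.random.default_rng(0).permutation(step)   # same saturated set, other order

ratio = xi_eff_inv(uniform) / xi_eff_inv(step)          # equals xi_step / xi_uniform
print("xi_step / xi_uniform:", ratio, "predicted:", np.sqrt(h_max / h_bar))
print("permutation invariant:", np.isclose(xi_eff_inv(step), xi_eff_inv(shuffled)))
```

The ordering of the saturated layers does not enter this objective at all, which is why a separate tie-breaker is needed to pick the front-loaded profile.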
At the level of mean-field correlation dynamics, $\xi_{\rm eff}$ depends on $\{h_{\ell}\}$ only through symmetric sums (e.g. ([45](https://arxiv.org/html/2605.21648#S3.E45))), and is therefore invariant under permutations of the depth profile. Mean field fixes the optimal allocation in magnitude (step-like), but cannot by itself select which layers should receive the concentrated dropout.
In practice, this suggests a simple construction. Choose a mean-field budget $\bar{h}$ and a local cap $h_{\max}$, set the active fraction $f=\bar{h}/h_{\max}$, and put dropout at the cap on roughly a fraction $f$ of the layers. The correlation calculation fixes the amount of dropout but not the ordering in depth. When early co-adaptation or rank collapse is the main concern, the natural tie-break is to put more of the budget near the input. In residual or transformer-like models, a linearly decreasing schedule is a smoother version of the same idea and can be preferable empirically.
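One way to turn this recipe into per-layer dropout rates is sketched below. It is a minimal illustration, not the authors' implementation: the function names are hypothetical, the linear ramp is one of many possible parametrizations with mean $\bar{h}$, and the conversion to a drop probability uses only the leading-order relation $h\approx a\,p_{\mathrm{drop}}$ from (47) with an assumed coefficient $a$.

```python
import numpy as np

def front_loaded(L, h_bar, h_max):
    """Saturated schedule: h_max on the first layers, the remainder on one layer, then 0."""
    if not 0.0 <= h_bar <= h_max:
        raise ValueError("require 0 <= h_bar <= h_max")
    remaining = L * h_bar - h_max * np.arange(L)   # budget left when reaching each layer
    return np.clip(remaining, 0.0, h_max)

def linear_decreasing(L, h_bar):
    """Linear ramp with mean h_bar; its peak is close to 2*h_bar, which may exceed a cap."""
    x = (np.arange(L) + 0.5) / L
    return 2.0 * h_bar * (1.0 - x)

def to_drop_prob(h, a):
    """Leading-order inversion h ~ a * p_drop; an assumption valid only for weak dropout."""
    return np.clip(np.asarray(h, dtype=float) / a, 0.0, 1.0)

# Hypothetical depth, budget, cap, and coefficient a.
L, h_bar, h_max, a = 12, 0.05, 0.2, 1.0
print(to_drop_prob(front_loaded(L, h_bar, h_max), a))
print(to_drop_prob(linear_decreasing(L, h_bar), a))
```

Both profiles spend the same mean field budget; for larger dropout the nonlinear map $h_\ell=\Delta(\rho_\ell)$ should replace the linear conversion.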
The remaining tie-breaker comes from representation dynamics beyond the mean-field closure. Rank collapse gives one concrete mechanism (Dong et al., [2021](https://arxiv.org/html/2605.21648#bib.bib16); Noci et al., [2022](https://arxiv.org/html/2605.21648#bib.bib17); Daneshmand et al., [2020](https://arxiv.org/html/2605.21648#bib.bib15)): token diversity in pure attention decays with depth, and unnormalized MLPs can suffer analogous rank loss. Dropout pushes in the opposite direction at the level of second moments by inflating diagonal variance and shrinking normalized correlations, thereby increasing a participation-ratio effective rank (App. [A.5](https://arxiv.org/html/2605.21648#A1.SS5)). App. [A.5](https://arxiv.org/html/2605.21648#A1.SS5) makes the tie-breaker depth-resolved: in a preventative rank-flow proxy, the future collapse that a dropout layer can avert is monotone decreasing with depth, so exchange arguments move the saturated dropout set toward the input. Thus rank flow breaks the permutation symmetry of $\xi_{\rm eff}$ by favoring concentrated dropout where collapse is still avoidable; in regularization-limited and rank-collapse-limited regimes, this means front-loading the dropout budget.
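The second-moment mechanism can be illustrated on a toy Gram matrix. The sketch below assumes one common definition of the participation ratio, $(\operatorname{tr}K)^{2}/\operatorname{tr}(K^{2})$, and models inverted dropout with independent masks per input as dividing the diagonal of $K$ by the keep probability while leaving off-diagonal entries unchanged. App. A.5 may use a different normalization, and the equicorrelated matrix and parameter values are hypothetical.

```python
import numpy as np

def participation_ratio(K):
    """Effective rank (tr K)^2 / tr(K^2) of a symmetric PSD matrix K."""
    return float(np.trace(K) ** 2 / np.sum(K * K))

def dropout_gram(K, keep_prob):
    """Second moments after inverted dropout: diagonal / keep_prob, off-diagonal unchanged."""
    out = np.array(K, dtype=float, copy=True)
    out[np.diag_indices_from(out)] /= keep_prob
    return out

# Toy Gram matrix of n nearly collapsed inputs with common correlation c (hypothetical).
n, c = 16, 0.9
K = (1.0 - c) * np.eye(n) + c * np.ones((n, n))
for rho in (1.0, 0.9, 0.7):
    print(rho, participation_ratio(dropout_gram(K, rho)))
```

Lower keep probability shrinks the normalized correlations from $c$ to $\rho c$ and raises the effective rank, which is the direction opposite to rank collapse.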
Table 3: Final-epoch improvements over the corresponding constant-dropout baseline. Loss reductions are relative reductions in cross-entropy. Accuracy gains are shown as percentage-point differences and relative percent improvements over the baseline final accuracy. Detailed appendix tables list best and final test accuracy where applicable.
To test these predictions, we deliberately work in controlled regimes where mean-field assumptions remain valid (width $\gg$ depth, moderate dropout) rather than pursuing benchmark performance. We evaluate MLPs and Vision Transformers (Dosovitskiy et al., [2021](https://arxiv.org/html/2605.21648#bib.bib18)) using CIFAR-10/100 (Krizhevsky, [2009](https://arxiv.org/html/2605.21648#bib.bib24)) (full details in App. [D](https://arxiv.org/html/2605.21648#A4)), where finite-width corrections and extensive data augmentation do not obscure the signal-propagation effects central to our analysis. The clearest experimental signal is loss: Table [3](https://arxiv.org/html/2605.21648#S3.T3) shows final cross-entropy reductions of 18–35% in the MLP and activation-sweep settings, with smaller but consistent reductions in the transformer runs. Accuracy improves as well (e.g., ViT on CIFAR-100: the linearly decreasing schedule achieves 49.38% vs 48.69% for constant dropout, $p<0.05$), but we treat it mainly as a confirmatory metric. The appendices check that this is not a single-run effect: the gains persist across $\bar{h}$-sweeps and large-width sweeps for kinked $\operatorname{ReLU}$ MLPs (App. [D.4](https://arxiv.org/html/2605.21648#A4.SS4), Fig. [7](https://arxiv.org/html/2605.21648#A4.F7)), across the smooth GELU activation class (App. [D.4](https://arxiv.org/html/2605.21648#A4.SS4), Fig. [8](https://arxiv.org/html/2605.21648#A4.F8)), and across transformer schedules and component ablations (Figs. [9](https://arxiv.org/html/2605.21648#A4.F9) and [11](https://arxiv.org/html/2605.21648#A4.F11)). The advantage weakens only at high dropout or in overly narrow networks, precisely where the theory exits its regime of validity.
## 4 Conclusions and further scope for research
By extending mean\-field signal propagation to include multiplicative dropout, we established that activation non\-analyticity fundamentally dictates universality\. This formalism yields a Landau equation of state for the decorrelation order parameter, a universal two\-parameter scaling collapse, and distinct critical exponents for smooth versus kinked channels\. Mean\-field theory is often best at clarifying the mechanisms inside wide networks, and less often at producing directly testable design rules; here it gives one such rule\. Treating dropout as a depth\-dependent field leads to front\-loaded schedules, which reduce held\-out loss in the controlled regimes we test, with accuracy gains as a useful secondary check\. Extending this mean\-field analysis to attention mechanisms and incorporating finite\-width corrections are natural directions for future work\.
## 5 Limitations
Mean-field theory gives the leading description of MLPs at large width, with finite-width corrections controlled perturbatively by the depth-to-width ratio $L/N$. Our experiments are deliberately carried out in this regime, but the present analysis does not compute these corrections explicitly. Although such corrections are not expected to change the smooth/kinked universality classification, they can accumulate with depth and may induce subleading changes to the optimal dropout profile.
A second limitation is architectural scope. We analyze dropout most explicitly in MLP mean-field theory, leaving full treatments of CNNs, RNNs, ResNets, Transformers, and normalization layers to future work. Still, the recursion relation is the qualitative engine behind the universality result, and analogous local Gaussian kernels appear in several wider architectures. App. [A.4](https://arxiv.org/html/2605.21648#A1.SS4) explains why infinite-channel CNNs and residual branches should inherit much of the same smooth/kinked split, although we do not derive the full spatial-mode CNN Landau theory, run CNN experiments, or give a dropout-deformed ResNet treatment. Residual architectures are subtler because skip connections modify large-depth propagation while preserving the local activation channel. Extending the analysis to include residual-branch strength as an additional control parameter would put the Transformer experiments on firmer theoretical footing.
Third, the theory still focuses primarily on forward correlation dynamics. The main text gives the leading wide-limit backward covariance recursion and shows that dropout creates the same diagonal/off-diagonal asymmetry for gradients, but we do not develop the full dropout-deformed backward theory, including finite-width gradient susceptibilities, training-time mask correlations, or feature-learning effects. Finally, the framework is an initialization theory rather than a full theory of training dynamics: it does not model how the relevant observables evolve after feature learning substantially reshapes the representation, for instance in large-learning-rate regimes where catapult dynamics can appear (Zhu et al., [2024](https://arxiv.org/html/2605.21648#bib.bib49)). Extending the same approach to training-time regularizers, dropout warm-up, adaptive schedules, other regularization fields such as $L^{2}$ penalties, or broader scaling-law questions (Bahri et al., [2024a](https://arxiv.org/html/2605.21648#bib.bib50)) is a natural direction for future work.
*Acknowledgments.* The author thanks Riccardo Penco for support and discussions; this work is partially supported by the U.S. Department of Energy under grant DE-SC0010118.
## Impact statement
The main practical benefit is improved performance at fixed training cost, or potentially reduced training cost when targeting a fixed performance level\. We do not anticipate additional ethical risks\.
## References
- A. Arratia, A. Cabaña, and J. R. León (2020). Deep and wide neural networks covariance estimation. In Artificial Neural Networks and Machine Learning – ICANN 2020, Lecture Notes in Computer Science, Vol. 12396, pp. 195–206. External Links: [Document](https://dx.doi.org/10.1007/978-3-030-61609-0%5F16). Cited by: [§C.3](https://arxiv.org/html/2605.21648#A3.SS3.p1.5).
- Y. Bahri, E. Dyer, J. Kaplan, J. Lee, and U. Sharma (2024a). Explaining neural scaling laws. Proceedings of the National Academy of Sciences 121(27), pp. e2311878121. External Links: [Document](https://dx.doi.org/10.1073/pnas.2311878121). Cited by: [§5](https://arxiv.org/html/2605.21648#S5.p3.1).
- Y. Bahri, B. Hanin, A. Brossollet, V. Erba, C. Keup, R. Pacelli, and J. B. Simon (2024b). Les Houches lectures on deep learning at large and infinite width. Journal of Statistical Mechanics: Theory and Experiment 2024(10), pp. 104012. External Links: [Document](https://dx.doi.org/10.1088/1742-5468/ad2dd3). Cited by: [§A.2](https://arxiv.org/html/2605.21648#A1.SS2.p2.1), [§1](https://arxiv.org/html/2605.21648#S1.p1.3), [§2](https://arxiv.org/html/2605.21648#S2.p1.1).
- Y. Bahri, J. Kadmon, J. Pennington, S. S. Schoenholz, J. Sohl-Dickstein, and S. Ganguli (2020). Statistical mechanics of deep learning. Annual Review of Condensed Matter Physics 11(1), pp. 501–528. External Links: [Document](https://dx.doi.org/10.1146/annurev-conmatphys-031119-050745). Cited by: [§A.2](https://arxiv.org/html/2605.21648#A1.SS2.p2.1), [§1](https://arxiv.org/html/2605.21648#S1.p1.3), [§2](https://arxiv.org/html/2605.21648#S2.p1.1).
- N. Bertschinger and T. Natschläger (2004). Real-time computation at the edge of chaos in recurrent neural networks. Neural Computation 16(7), pp. 1413–1436. External Links: [Document](https://dx.doi.org/10.1162/089976604323057443). Cited by: [§1](https://arxiv.org/html/2605.21648#S1.p1.3).
- Y. Cho and L. K. Saul (2009). Kernel methods for deep learning. In Advances in Neural Information Processing Systems, Vol. 22, pp. 342–350. Cited by: [§B.2](https://arxiv.org/html/2605.21648#A2.SS2.p3.2), [§B.3](https://arxiv.org/html/2605.21648#A2.SS3.p1.1).
- H. Daneshmand, J. Köhler, F. R. Bach, T. Hofmann, and A. Lucchi (2020). Batch normalization provably avoids ranks collapse for randomly initialised deep networks. In Advances in Neural Information Processing Systems, Vol. 33, pp. 4318–4329. Cited by: [§A.5](https://arxiv.org/html/2605.21648#A1.SS5.p1.3), [§A.5](https://arxiv.org/html/2605.21648#A1.SS5.p1.8), [§3.3](https://arxiv.org/html/2605.21648#S3.SS3.p4.1).
- J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009). ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. External Links: [Document](https://dx.doi.org/10.1109/CVPR.2009.5206848). Cited by: [§D.5](https://arxiv.org/html/2605.21648#A4.SS5.p3.1).
- Y. Dong, J. Cordonnier, and A. Loukas (2021). Attention is not all you need: pure attention loses rank doubly exponentially with depth. In Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 139, pp. 2793–2803. Cited by: [§A.5](https://arxiv.org/html/2605.21648#A1.SS5.p1.3), [§3.3](https://arxiv.org/html/2605.21648#S3.SS3.p4.1).
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021). An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=YicbFdNTTy). Cited by: [§D.5](https://arxiv.org/html/2605.21648#A4.SS5.p1.1), [§3.3](https://arxiv.org/html/2605.21648#S3.SS3.p5.8).
- F. J. Dyson (2004). A meeting with Enrico Fermi. Nature 427, pp. 297. External Links: [Document](https://dx.doi.org/10.1038/427297a). Cited by: [§A.2](https://arxiv.org/html/2605.21648#A1.SS2.p1.1).
- S. F. Edwards and P. W. Anderson (1975). Theory of spin glasses. Journal of Physics F: Metal Physics 5, pp. 965–974. Cited by: [§3.1](https://arxiv.org/html/2605.21648#S3.SS1.p5.3).
- A. Fan, E. Grave, and A. Joulin (2020). Reducing transformer depth on demand with structured dropout. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=SylO2yStDr). Cited by: [§1](https://arxiv.org/html/2605.21648#S1.p4.1).
- S. Hayou, A. Doucet, and J. Rousseau (2019). On the impact of the activation function on deep neural networks training. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 2672–2680. Cited by: [§B.2](https://arxiv.org/html/2605.21648#A2.SS2.p9.6), [§1](https://arxiv.org/html/2605.21648#S1.p4.1).
- K. He, X. Zhang, S. Ren, and J. Sun (2015). Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034. Cited by: [§2](https://arxiv.org/html/2605.21648#S2.p5.6).
- K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: [§D.5](https://arxiv.org/html/2605.21648#A4.SS5.p2.1).
- R. A. Horn and C. R. Johnson (1985). Matrix analysis. Cambridge University Press. Cited by: [§C.1](https://arxiv.org/html/2605.21648#A3.SS1.p1.7).
- J. Hron, Y. Bahri, J. Sohl-Dickstein, and R. Novak (2020). Infinite attention: NNGP and NTK for deep attention networks. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 119, pp. 4376–4386. External Links: [Link](https://proceedings.mlr.press/v119/hron20a.html). Cited by: [§2](https://arxiv.org/html/2605.21648#S2.p8.1).
- G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger (2016). Deep networks with stochastic depth. In European Conference on Computer Vision, Lecture Notes in Computer Science, Vol. 9908, pp. 646–661. External Links: [Document](https://dx.doi.org/10.1007/978-3-319-46493-0%5F39). Cited by: [§1](https://arxiv.org/html/2605.21648#S1.p4.1).
- W. Huang, R. Y. D. Xu, W. Du, Y. Zeng, and Y. Zhao (2020). Mean field theory for deep dropout networks: digging up gradient backpropagation deeply. In ECAI 2020 - 24th European Conference on Artificial Intelligence, Frontiers in Artificial Intelligence and Applications, Vol. 325, pp. 1215–1222. External Links: [Document](https://dx.doi.org/10.3233/FAIA200221). Cited by: [§3](https://arxiv.org/html/2605.21648#S3.p1.4).
- E. Ising (1925). Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik 31, pp. 253–258. Cited by: [§3.1](https://arxiv.org/html/2605.21648#S3.SS1.p5.3).
- A. Krizhevsky (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto. Cited by: [§D.5](https://arxiv.org/html/2605.21648#A4.SS5.p1.1), [§3.3](https://arxiv.org/html/2605.21648#S3.SS3.p5.8).
- L. D. Landau and E. M. Lifshitz (1980). Statistical physics. Vol. 5, Pergamon Press. Cited by: [§3.1](https://arxiv.org/html/2605.21648#S3.SS1.p5.3).
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), pp. 2278–2324. External Links: [Document](https://dx.doi.org/10.1109/5.726791). Cited by: [§A.4](https://arxiv.org/html/2605.21648#A1.SS4.p2.9), [§2](https://arxiv.org/html/2605.21648#S2.p8.1).
- J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein (2018). Deep neural networks as Gaussian processes. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=B1EA-M-0Z). Cited by: [§A.2](https://arxiv.org/html/2605.21648#A1.SS2.p2.1), [§D.3](https://arxiv.org/html/2605.21648#A4.SS3.p2.3), [§2](https://arxiv.org/html/2605.21648#S2.p1.1).
- M. Mézard, G. Parisi, and M. A. Virasoro (1987). Spin glass theory and beyond. World Scientific, Singapore. Cited by: [§3.1](https://arxiv.org/html/2605.21648#S3.SS1.p5.3).
- P. Morerio, J. Cavazza, R. Volpi, R. Vidal, and V. Murino (2017). Curriculum dropout. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3544–3552. Cited by: [§1](https://arxiv.org/html/2605.21648#S1.p4.1).
- R. M. Neal (1996). Bayesian learning for neural networks. Lecture Notes in Statistics, Vol. 118, Springer, New York, NY. Cited by: [§D.3](https://arxiv.org/html/2605.21648#A4.SS3.p2.3).
- L. Noci, S. Anagnostidis, L. Biggio, A. Orvieto, S. P. Singh, and A. Lucchi (2022). Signal propagation in transformers: theoretical perspectives and the role of rank collapse. In Advances in Neural Information Processing Systems, Vol. 35, pp. 27198–27211. Cited by: [§A.5](https://arxiv.org/html/2605.21648#A1.SS5.p1.3), [§3.3](https://arxiv.org/html/2605.21648#S3.SS3.p4.1).
- N. H. Packard (1988). Adaptation toward the edge of chaos. In Dynamic Patterns in Complex Systems, J. A. S. Kelso, A. J. Mandell, and M. F. Shlesinger (Eds.), pp. 293–301. Cited by: [§1](https://arxiv.org/html/2605.21648#S1.p1.3).
- G. Parisi (1979). Infinite number of order parameters for spin glasses. Physical Review Letters 43, pp. 1754–1756. Cited by: [§3.1](https://arxiv.org/html/2605.21648#S3.SS1.p5.3).
- G. Parisi (1980). A sequence of approximated solutions to the S-K model for spin glasses. Journal of Physics A: Mathematical and General 13, pp. L115–L121. Cited by: [§3.1](https://arxiv.org/html/2605.21648#S3.SS1.p5.3).
- G. Parisi (1983a). Mean-field theory of spin glasses. Reviews of Modern Physics 55, pp. 477–531. Cited by: [§3.1](https://arxiv.org/html/2605.21648#S3.SS1.p5.3).
- G. Parisi (1983b). Order parameter for spin glasses. Physical Review Letters 50, pp. 1946–1948. Cited by: [§3.1](https://arxiv.org/html/2605.21648#S3.SS1.p5.3).
- B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli (2016). Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, Vol. 29, pp. 3360–3368. Cited by: [§A.2](https://arxiv.org/html/2605.21648#A1.SS2.p2.1), [§1](https://arxiv.org/html/2605.21648#S1.p1.3), [§2](https://arxiv.org/html/2605.21648#S2.p1.1).
- R. Price (1958). A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory 4(2), pp. 69–72. External Links: [Document](https://dx.doi.org/10.1109/TIT.1958.1057444). Cited by: [§A.3](https://arxiv.org/html/2605.21648#A1.SS3.p3.12), [§C.1](https://arxiv.org/html/2605.21648#A3.SS1.p1.2).
- D. A. Roberts, S. Yaida, and B. Hanin (2022). The principles of deep learning theory: an effective theory approach to understanding neural networks. Cambridge University Press. External Links: [Document](https://dx.doi.org/10.1017/9781009023405), ISBN 9781316519332. Cited by: [§A.2](https://arxiv.org/html/2605.21648#A1.SS2.p4.1), [§A.2](https://arxiv.org/html/2605.21648#A1.SS2.p4.8), [§D.3](https://arxiv.org/html/2605.21648#A4.SS3.p2.3), [§1](https://arxiv.org/html/2605.21648#S1.p1.3), [§2](https://arxiv.org/html/2605.21648#S2.p1.1), [§2](https://arxiv.org/html/2605.21648#S2.p3.1), [§2](https://arxiv.org/html/2605.21648#S2.p6.1), [§2](https://arxiv.org/html/2605.21648#S2.p7.3), [§3](https://arxiv.org/html/2605.21648#S3.p3.1).
- S. S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein (2017). Deep information propagation. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=H1W1UN9gg). Cited by: [§A.2](https://arxiv.org/html/2605.21648#A1.SS2.p2.1), [§C.1](https://arxiv.org/html/2605.21648#A3.SS1.p1.2), [§D.3](https://arxiv.org/html/2605.21648#A4.SS3.p2.3), [§1](https://arxiv.org/html/2605.21648#S1.p1.3), [§1](https://arxiv.org/html/2605.21648#S1.p3.2), [§2](https://arxiv.org/html/2605.21648#S2.p1.1), [§3](https://arxiv.org/html/2605.21648#S3.p1.3), [§3](https://arxiv.org/html/2605.21648#S3.p1.4), [§3](https://arxiv.org/html/2605.21648#S3.p4.4), [§3](https://arxiv.org/html/2605.21648#S3.p5.3).
- D. Sherrington and S. Kirkpatrick (1975). Solvable model of a spin-glass. Physical Review Letters 35, pp. 1792–1796. Cited by: [§3.1](https://arxiv.org/html/2605.21648#S3.SS1.p5.3).
- H. Sompolinsky, A. Crisanti, and H. Sommers (1988). Chaos in random neural networks. Physical Review Letters 61(3), pp. 259–262. Cited by: [§1](https://arxiv.org/html/2605.21648#S1.p1.3).
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), pp. 1929–1958. Cited by: [§3](https://arxiv.org/html/2605.21648#S3.p1.4).
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017). Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30, pp. 5998–6008. Cited by: [§2](https://arxiv.org/html/2605.21648#S2.p8.1).
- S. Wager, S. Wang, and P. Liang (2013). Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, Vol. 26, pp. 351–359. Cited by: [§3](https://arxiv.org/html/2605.21648#S3.p1.4).
- L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, and J. Pennington (2018). Dynamical isometry and a mean field theory of CNNs: how to train 10,000-layer vanilla convolutional neural networks. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 5393–5402. Cited by: [§A.4](https://arxiv.org/html/2605.21648#A1.SS4.p2.9), [§2](https://arxiv.org/html/2605.21648#S2.p8.1).
- G. Yang and S. S. Schoenholz (2017). Mean field residual networks: on the edge of chaos. In Advances in Neural Information Processing Systems, Vol. 30. Cited by: [§A.4](https://arxiv.org/html/2605.21648#A1.SS4.p3.2), [§D.5](https://arxiv.org/html/2605.21648#A4.SS5.p2.1), [§2](https://arxiv.org/html/2605.21648#S2.p8.1), [§3.2](https://arxiv.org/html/2605.21648#S3.SS2.p4.4).
- L. Zhu, C. Liu, A. Radhakrishnan, and M. Belkin (2024). Quadratic models for understanding catapult dynamics of neural networks. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=PvJnX3dwsD). Cited by: [§5](https://arxiv.org/html/2605.21648#S5.p3.1).
- J. Zinn-Justin (2002). Quantum field theory and critical phenomena. 4th edition, Oxford University Press, Oxford. Cited by: [§3.2](https://arxiv.org/html/2605.21648#S3.SS2.p1.6), [footnote 1](https://arxiv.org/html/2605.21648#footnote1).
## Appendix AMean\-Field Background and Dropout Recursions
### A\.1Notation guide
The same few letters carry several nearby meanings, so we collect the main conventions here\.
Table 4:Notation used across the main text and appendices\.
### A\.2Mean\-Field Theory Primer
Even the Standard Model of physics, with only on the order of a few dozen empirical parameters, invites the feeling that some deeper organizing principle is yet to be found\. Fermi made the same point more sharply to Dyson by quoting von Neumann: “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk”\(Dyson,[2004](https://arxiv.org/html/2605.21648#bib.bib5)\)\. Modern machine learning pushes this worry to an extreme: parameters are abundant, so understanding cannot mean tracking each weight individually\. A pressing question in modern ML literature is what set of parameters or effective fields can give a useful description of the network’s behavior, and how to derive such a description from the microscopic model\.
Mean-field theory gives one such effective description: instead of tracking every neuron and every random weight, one tracks a small number of self-consistent fields. In the signal-propagation version used here, those fields are variances, covariances, and correlations. This is the deep information propagation line of mean-field theory (Poole et al., [2016](https://arxiv.org/html/2605.21648#bib.bib35); Schoenholz et al., [2017](https://arxiv.org/html/2605.21648#bib.bib38)), with broader background in statistical-mechanics treatments of deep learning (Bahri et al., [2020](https://arxiv.org/html/2605.21648#bib.bib48), [2024b](https://arxiv.org/html/2605.21648#bib.bib3)). The same infinite-width covariance recursion can also be read as the NNGP kernel recursion (Lee et al., [2018](https://arxiv.org/html/2605.21648#bib.bib47)); we use the MFT language because our questions concern fixed points, susceptibilities, correlation lengths, and critical exponents rather than the induced function-space prior or training dynamics.
Section [2](https://arxiv.org/html/2605.21648#S2) gives the main-body version of the signal-propagation story. Here we only fix notation and record the elementary steps used later in the appendices. For a depth-$L$ MLP with activation $\phi:\mathbb{R}\to\mathbb{R}$, weights and biases are initialized as

$$W^{l}_{ij}\sim\mathcal{N}\left(0,\frac{\sigma_{w}^{2}}{N_{l}}\right),\qquad b^{l}_{i}\sim\mathcal{N}\left(0,\sigma_{b}^{2}\right),\tag{51}$$

and propagated by

$$z^{l}_{i;a}=W^{l}_{ij}y^{l}_{j;a}+b^{l}_{i},\qquad y^{l+1}_{i;a}=\phi\left(z^{l}_{i;a}\right),\tag{52}$$

where $a$ labels the input. We write

$$\int Dz\,(\cdots)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}dz\,e^{-z^{2}/2}(\cdots),\tag{53}$$

and use the same convention for multiple Gaussian variables.
The Gaussian closure comes from the fact that each preactivation is a sum of many independently initialized weighted inputs. In the infinite-width limit this central-limit effect becomes exact layer-by-layer, while finite-width corrections are organized perturbatively in the depth-to-width aspect ratio, schematically $L/N$ (Roberts et al., [2022](https://arxiv.org/html/2605.21648#bib.bib36)). The single-input variance and two-input covariance recursions are therefore the closed equations already quoted in Sec. [2](https://arxiv.org/html/2605.21648#S2):

$$\begin{aligned}
q^{l}_{aa}&=\sigma_{w}^{2}\int Dz\,\phi^{2}\left(\sqrt{q^{l-1}_{aa}}\,z\right)+\sigma_{b}^{2},\\
q^{l}_{ab}&=\sigma_{w}^{2}\int Dz_{1}Dz_{2}\,\phi(u_{1})\phi(u_{2})+\sigma_{b}^{2},\\
u_{1}&=\sqrt{q^{l-1}_{aa}}\,z_{1},\qquad c^{l-1}_{ab}=\frac{q^{l-1}_{ab}}{\sqrt{q^{l-1}_{aa}q^{l-1}_{bb}}},\\
u_{2}&=\sqrt{q^{l-1}_{bb}}\left(c^{l-1}_{ab}z_{1}+\sqrt{1-\left(c^{l-1}_{ab}\right)^{2}}\,z_{2}\right).
\end{aligned}\tag{54}$$

Once the variance relaxes to $q^{*}$, these equations induce a one-dimensional correlation map $c^{l}_{ab}=F(c^{l-1}_{ab})$. Without dropout, perfect alignment is a fixed point, $F(1)=1$. Its linear stability is controlled by the angular susceptibility

$$\chi_{1}=\left.\frac{\partial c^{l}_{ab}}{\partial c^{l-1}_{ab}}\right|_{c_{ab}=1}=\sigma_{w}^{2}\int Dz\,\big[\phi'(\sqrt{q^{*}}z)\big]^{2},\tag{55}$$

called $\chi_{\perp}$ in the main text and in (Roberts et al., [2022](https://arxiv.org/html/2605.21648#bib.bib36)). The ordered, chaotic, and critical regimes correspond to $\chi_{1}<1$, $\chi_{1}>1$, and $\chi_{1}=1$ respectively; the last case is the edge-of-chaos, where the cross-input correlation length can diverge.
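As a rough numerical illustration of Eqs. (54) and (55), the sketch below iterates the single-input variance recursion for $\tanh$ and evaluates $\chi_1$ at the resulting fixed point. The helper names (`gauss_expect`, `variance_fixed_point`, `chi1`, `regime`), the midpoint quadrature, and the parameter values are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

def gauss_expect(f, half_width=8.0, n=1000):
    """Approximate E[f(z)] for z ~ N(0, 1) with a midpoint rule on a truncated grid."""
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        z = -half_width + (i + 0.5) * h
        total += f(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def variance_map(phi, q, sigma_w2, sigma_b2):
    """One step of the single-input variance recursion (first line of Eq. (54))."""
    root_q = math.sqrt(q)
    return sigma_w2 * gauss_expect(lambda z: phi(root_q * z) ** 2) + sigma_b2

def variance_fixed_point(phi, sigma_w2, sigma_b2, q0=1.0, iters=100):
    """Iterate the variance map until it has (numerically) relaxed to q*."""
    q = q0
    for _ in range(iters):
        q = variance_map(phi, q, sigma_w2, sigma_b2)
    return q

def chi1(dphi, sigma_w2, q_star):
    """Angular susceptibility of Eq. (55)."""
    root_q = math.sqrt(q_star)
    return sigma_w2 * gauss_expect(lambda z: dphi(root_q * z) ** 2)

def dtanh(x):
    return 1.0 - math.tanh(x) ** 2

def regime(sigma_w2, sigma_b2):
    """Return (q*, chi_1, label) for a tanh network; values are illustrative only."""
    q_star = variance_fixed_point(math.tanh, sigma_w2, sigma_b2)
    c1 = chi1(dtanh, sigma_w2, q_star)
    return q_star, c1, ("ordered" if c1 < 1.0 else "chaotic")

for sw2 in (0.8, 3.0):
    q_star, c1, label = regime(sw2, sigma_b2=0.05)
    print(f"sigma_w^2={sw2}: q*={q_star:.4f}, chi_1={c1:.4f} ({label})")
```

Scanning `sigma_w2` at fixed `sigma_b2` and locating where `chi1` crosses one would give a numerical estimate of the edge-of-chaos line under these assumptions.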
The usual He initialization follows from this criterion in the special case of $\operatorname{ReLU}$. Since $\phi'(z)=\theta(z)$ almost everywhere,

$$\int Dz\,\big(\phi'(\sqrt{q^{*}}z)\big)^{2}=\frac{1}{2},\qquad\chi_{1}=1\Longrightarrow\sigma_{w}^{2}=2,\tag{56}$$

which is the variance-$2/N$ initialization. The same example also shows why variance preservation and correlation criticality are not identical: for $\operatorname{ReLU}$,

$$q^{l}_{aa}=\frac{\sigma_{w}^{2}}{2}q^{l-1}_{aa}+\sigma_{b}^{2},\tag{57}$$

so with nonzero bias the finite-$q^{*}$ condition requires $\sigma_{w}^{2}<2$, while exact angular criticality sets $\sigma_{w}^{2}=2$. In practice one either sets biases to zero or works slightly subcritical to keep the variance scale finite.
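The tension between Eqs. (56) and (57) can be seen in a few lines. The sketch below checks the Gaussian integral behind He initialization and then iterates the ReLU variance recursion in three settings. The function names and the specific values of $\sigma_w^2$, $\sigma_b^2$, and depth are illustrative assumptions.

```python
import math

def step_second_moment(n=2000, half_width=8.0):
    """Midpoint-rule estimate of E[theta(z)^2] for z ~ N(0, 1); the exact value is 1/2."""
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        z = -half_width + (i + 0.5) * h
        if z > 0.0:
            total += math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def relu_chi1(sigma_w2):
    """Eq. (56): chi_1 = sigma_w^2 * E[theta(z)^2], independent of q*."""
    return sigma_w2 * step_second_moment()

def relu_variance(sigma_w2, sigma_b2, q0=1.0, depth=50):
    """Iterate Eq. (57): q_l = (sigma_w^2 / 2) q_{l-1} + sigma_b^2."""
    q = q0
    for _ in range(depth):
        q = 0.5 * sigma_w2 * q + sigma_b2
    return q

print("chi_1 at sigma_w^2 = 2:", relu_chi1(2.0))
# Critical weights, zero bias: the variance is preserved.
print("critical, no bias:", relu_variance(2.0, 0.0))
# Critical weights, nonzero bias: the variance grows linearly with depth.
print("critical, with bias:", relu_variance(2.0, 0.1))
# Slightly subcritical weights: the variance relaxes to sigma_b^2 / (1 - sigma_w^2 / 2).
print("subcritical, with bias:", relu_variance(1.9, 0.1, depth=500))
```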
Finally, the correlation length used throughout the paper is obtained by linearizing the correlation map around its fixed point. If $r_{c}=F'(c^{*})$, then

$$|c^{*}-c^{l}_{ab}|\propto e^{-l/\xi_{c}},\qquad\xi_{c}^{-1}=-\log|r_{c}|.\tag{58}$$

The analogous variance length $\xi_{q}$ tracks relaxation of $q^{l}_{aa}$ to $q^{*}$, but the edge-of-chaos phenomenon is primarily controlled by $\xi_{c}$: it measures how many layers preserve distinctions between different inputs. This is the scale deformed by dropout in the main analysis.
### A\.3Derivation of Correlation Map and Derivatives with Dropout
This subsection supplies the steps leading to Eqs. ([20](https://arxiv.org/html/2605.21648#S3.E20)) and ([21](https://arxiv.org/html/2605.21648#S3.E21)), starting from the two-input correlation recursion that underlies the perfect-alignment shift $\bar{F}_{\rho}(1)=1-\Delta$ in ([15](https://arxiv.org/html/2605.21648#S3.E15)).

Fix a layer where the single-input variance has converged to the dropout-shifted fixed point $\bar{q}^{*}$ defined by ([13](https://arxiv.org/html/2605.21648#S3.E13)). For two inputs, let $(u_{1},u_{2})$ be jointly Gaussian with

$$\mathbb{E}[u_{1}^{2}]=\mathbb{E}[u_{2}^{2}]=\bar{q}^{*},\qquad\mathbb{E}[u_{1}u_{2}]=c\,\bar{q}^{*},\qquad c\in[-1,1].\tag{59}$$

The normalized correlation map is

$$\bar{F}_{\rho}(c)\;\equiv\;\frac{\bar{q}^{l+1}_{ab}}{\bar{q}^{*}}\;=\;\frac{\sigma_{w}^{2}\,\mathbb{E}\big[\phi(u_{1})\phi(u_{2})\big]+\sigma_{b}^{2}}{\bar{q}^{*}}.\tag{60}$$

At $c=1$, $u_{1}=u_{2}$ and $\bar{F}_{\rho}(1)=1-\Delta$ by definition, giving ([15](https://arxiv.org/html/2605.21648#S3.E15)).

Since $\bar{q}^{*}$ is independent of $c$, differentiating ([60](https://arxiv.org/html/2605.21648#A1.E60)) yields

$$\bar{F}_{\rho}'(c)\;=\;\frac{\sigma_{w}^{2}}{\bar{q}^{*}}\frac{\partial}{\partial c}\mathbb{E}\big[\phi(u_{1})\phi(u_{2})\big].\tag{61}$$

Write $u_{i}=\sqrt{\bar{q}^{*}}\,v_{i}$, where $(v_{1},v_{2})$ are standard Gaussians with $\mathbb{E}[v_{1}v_{2}]=c$, and define

$$G(c)\;\equiv\;\mathbb{E}\Big[\phi\big(\sqrt{\bar{q}^{*}}v_{1}\big)\phi\big(\sqrt{\bar{q}^{*}}v_{2}\big)\Big].\tag{62}$$

Price's theorem (Price, [1958](https://arxiv.org/html/2605.21648#bib.bib30)) gives

$$\frac{\mathrm{d}}{\mathrm{d}c}G(c)\;=\;\mathbb{E}\left[\frac{\partial^{2}}{\partial v_{1}\partial v_{2}}\Big(\phi\big(\sqrt{\bar{q}^{*}}v_{1}\big)\phi\big(\sqrt{\bar{q}^{*}}v_{2}\big)\Big)\right].\tag{63}$$

A direct differentiation yields

$$\frac{\partial^{2}}{\partial v_{1}\partial v_{2}}\Big(\phi\big(\sqrt{\bar{q}^{*}}v_{1}\big)\phi\big(\sqrt{\bar{q}^{*}}v_{2}\big)\Big)=\bar{q}^{*}\,\phi'\big(\sqrt{\bar{q}^{*}}v_{1}\big)\phi'\big(\sqrt{\bar{q}^{*}}v_{2}\big).\tag{64}$$

Therefore,

$$G'(c)=\bar{q}^{*}\,\mathbb{E}\Big[\phi'\big(\sqrt{\bar{q}^{*}}v_{1}\big)\phi'\big(\sqrt{\bar{q}^{*}}v_{2}\big)\Big].\tag{65}$$

Taking $c\to 1$ collapses $v_{1}=v_{2}=z$ with $z\sim\mathcal{N}(0,1)$, hence

$$G'(1)=\bar{q}^{*}\int Dz\,\big[\phi'(\sqrt{\bar{q}^{*}}z)\big]^{2}.\tag{66}$$

Substituting into ([61](https://arxiv.org/html/2605.21648#A1.E61)) gives

$$\bar{F}_{\rho}'(1)=\sigma_{w}^{2}\int Dz\,\big[\phi'(\sqrt{\bar{q}^{*}}z)\big]^{2},\tag{67}$$

which is Eq. ([20](https://arxiv.org/html/2605.21648#S3.E20)). Differentiating ([61](https://arxiv.org/html/2605.21648#A1.E61)) once more:

$$\bar{F}_{\rho}''(c)\;=\;\frac{\sigma_{w}^{2}}{\bar{q}^{*}}\frac{\partial^{2}}{\partial c^{2}}\mathbb{E}\big[\phi(u_{1})\phi(u_{2})\big].\tag{68}$$

Applying Price's theorem again to $f(v_{1},v_{2})=\phi'(\sqrt{\bar{q}^{*}}v_{1})\phi'(\sqrt{\bar{q}^{*}}v_{2})$ yields

$$\frac{\mathrm{d}}{\mathrm{d}c}\mathbb{E}\Big[\phi'(\sqrt{\bar{q}^{*}}v_{1})\phi'(\sqrt{\bar{q}^{*}}v_{2})\Big]=\bar{q}^{*}\,\mathbb{E}\Big[\phi''(\sqrt{\bar{q}^{*}}v_{1})\phi''(\sqrt{\bar{q}^{*}}v_{2})\Big].\tag{69}$$

Combining with the previous step gives

$$\frac{\partial^{2}}{\partial c^{2}}\mathbb{E}\big[\phi(u_{1})\phi(u_{2})\big]=(\bar{q}^{*})^{2}\,\mathbb{E}\Big[\phi''(\sqrt{\bar{q}^{*}}v_{1})\phi''(\sqrt{\bar{q}^{*}}v_{2})\Big].\tag{70}$$

Evaluating at $c=1$ again collapses $v_{1}=v_{2}=z$, so

$$\left.\frac{\partial^{2}}{\partial c^{2}}\mathbb{E}\big[\phi(u_{1})\phi(u_{2})\big]\right|_{c=1}=(\bar{q}^{*})^{2}\int Dz\,\big[\phi''(\sqrt{\bar{q}^{*}}z)\big]^{2}.\tag{71}$$

Substituting into ([68](https://arxiv.org/html/2605.21648#A1.E68)) yields

$$\bar{F}_{\rho}''(1)=\sigma_{w}^{2}\,\bar{q}^{*}\int Dz\,\big[\phi''(\sqrt{\bar{q}^{*}}z)\big]^{2},\tag{72}$$

which is Eq. ([21](https://arxiv.org/html/2605.21648#S3.E21)).
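The Price's-theorem step in Eqs. (63) to (65) can be sanity-checked numerically. The sketch below compares a central finite difference of $G(c)$ with the closed form $\bar{q}^{*}\,\mathbb{E}[\phi'\phi']$ for $\phi=\tanh$. The quadrature routine, the value `Q_BAR = 0.7`, and the test point `c = 0.6` are assumptions made only for this illustration.

```python
import math

def gauss2_expect(f, c, half_width=7.0, n=160):
    """E[f(v1, v2)] for standard Gaussians with correlation c, via a 2D midpoint rule.

    Uses v1 = z1 and v2 = c*z1 + sqrt(1 - c^2)*z2 with independent z1, z2.
    """
    h = 2.0 * half_width / n
    s = math.sqrt(1.0 - c * c)
    nodes = [-half_width + (i + 0.5) * h for i in range(n)]
    weights = [math.exp(-0.5 * z * z) for z in nodes]
    total = 0.0
    for z1, w1 in zip(nodes, weights):
        for z2, w2 in zip(nodes, weights):
            total += w1 * w2 * f(z1, c * z1 + s * z2)
    return total * h * h / (2.0 * math.pi)

Q_BAR = 0.7                 # illustrative stand-in for the fixed-point variance
ROOT_Q = math.sqrt(Q_BAR)

def dtanh(x):
    return 1.0 - math.tanh(x) ** 2

def G(c):
    """Eq. (62) with phi = tanh."""
    return gauss2_expect(lambda v1, v2: math.tanh(ROOT_Q * v1) * math.tanh(ROOT_Q * v2), c)

def G_prime_price(c):
    """Eq. (65): derivative predicted by Price's theorem."""
    return Q_BAR * gauss2_expect(lambda v1, v2: dtanh(ROOT_Q * v1) * dtanh(ROOT_Q * v2), c)

def G_prime_fd(c, eps=1e-4):
    """Central finite-difference derivative of G, used only as an independent check."""
    return (G(c + eps) - G(c - eps)) / (2.0 * eps)

c = 0.6
print("finite difference:", G_prime_fd(c))
print("Price's theorem:  ", G_prime_price(c))
```

Multiplying either value by $\sigma_w^2/\bar{q}^{*}$ gives $\bar{F}_{\rho}'(c)$ as in Eq. (61).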
### A\.4Extensions to Infinite\-Channel CNNs and ResNets
We briefly discuss extensions to architectures not treated in the main analysis\. Although our derivation is for MLPs, the universality\-class mechanism is tied to the local Gaussian kernel rather than to the MLP architecture itself\. Other architectures introduce additional structure that can change the exact exponents, but the qualitative smooth\-versus\-kinked distinction should remain when the same local channel controls the wide\-limit recursion\.
For example, the same local channel which appears in MLPs re-emerges in infinite-width CNNs when the number of channels is taken to infinity: preactivations at different spatial positions become jointly Gaussian across channels, and the covariance recursion composes the same bivariate Gaussian activation expectation with a linear operator that averages over convolutional filter offsets (LeCun et al., [1998](https://arxiv.org/html/2605.21648#bib.bib14); Xiao et al., [2018](https://arxiv.org/html/2605.21648#bib.bib44)). In schematic form, a spatial covariance $K^{\ell}_{\alpha\beta}(x,x')$ evolves as $K^{\ell+1}=\sigma_{b}^{2}+\sigma_{w}^{2}\mathcal{A}[V_{\phi}(K^{\ell})]$, where $V_{\phi}$ is the scalar Gaussian channel analyzed above and $\mathcal{A}$ is the convolutional averaging operator. Independent dropout shifts the single-input variance entering $V_{\phi}$ as in the fully connected case, while the cross-input covariance is still controlled by the same bivariate expectation. Therefore, for an isolated leading spatial mode, convolution changes the relevant eigenvalue and mode shape but not the local smooth-versus-kinked non-analyticity: smooth activations retain the analytic Landau equation, while $\operatorname{ReLU}$-like kinks retain the $m^{3/2}$ branch point. A complete CNN theory would promote $m$ to a spatial covariance-mode amplitude and track boundary conditions, pooling, and degeneracies of $\mathcal{A}$.
ResNets modify depth dynamics in a complementary way: skip connections can change exponential convergence to subexponential or polynomial behavior, and norms may drift rather than relax to a fixed point (Yang and Schoenholz, [2017](https://arxiv.org/html/2605.21648#bib.bib45)). Still, the residual branch evaluates the same nonlinear Gaussian channel $V_{\phi}$, so the local analytic distinction is unchanged. The $\tanh$/$\operatorname{ReLU}$ asymptotic split observed in (Yang and Schoenholz, [2017](https://arxiv.org/html/2605.21648#bib.bib45)) is consistent with our smooth/kinked classes; a dropout-deformed ResNet theory would additionally track residual-branch strength and norm drift.
### A\.5Rank Collapse and Rank\-Flow Tie\-Breaking
The mean-field propagation objective is invariant under permutations of the depth profile $h_{\ell}$, so it fixes the amount of dropout but not its ordering. Rank collapse supplies one beyond-mean-field tie-breaker (Dong et al., [2021](https://arxiv.org/html/2605.21648#bib.bib16); Noci et al., [2022](https://arxiv.org/html/2605.21648#bib.bib17); Daneshmand et al., [2020](https://arxiv.org/html/2605.21648#bib.bib15)). In pure self-attention, for token representations $X_{\ell}\in\mathbb{R}^{n\times d}$, and ignoring residual connections, $X_{\ell+1}=A_{\ell}X_{\ell}$. Hence

$$\operatorname{rank}(X_{\ell+1})\;\leq\;\min\{\operatorname{rank}(A_{\ell}),\operatorname{rank}(X_{\ell})\}.\tag{73}$$

As $A_{\ell}$ contracts toward the rank-one projector $\mathbf{1}\mathbf{1}^{\top}/n$, a rank-diversity proxy normalized so that $R=1$ at complete token collapse obeys

$$R(\ell)-1\;\lesssim\;\bigl(R(0)-1\bigr)\exp\left(-c\,2^{\ell}\right),\tag{74}$$

for an architecture-dependent constant $c>0$. Unnormalized MLPs show a related rank-collapse mechanism: products of random linear maps converge toward a rank-one direction at an exponential rate, while $\operatorname{ReLU}$ MLPs exhibit fast rank loss empirically (Daneshmand et al., [2020](https://arxiv.org/html/2605.21648#bib.bib15)). A simple proxy for the linear case is

$$R(\ell)-1\;\sim\;\bigl(R(0)-1\bigr)\exp\left(-2(\lambda_{1}-\lambda_{2})\ell\right),\tag{75}$$

where $\lambda_{1}>\lambda_{2}$ are the top Lyapunov exponents governing singular-value growth. In both cases, the dominant loss of diversity occurs early in depth, especially in attention-dominated models.

This turns the ordering question into a depth-resolved tie-breaker. Let $D_{\ell}\equiv R(\ell)-1$ denote the diversity left above the collapsed rank-one limit, and suppose dropout can only preserve diversity that has not yet been lost. The future collapse available to prevent at depth $\ell$ is $V_{\ell}\propto\sum_{k=\ell}^{L-1}(D_{k}-D_{k+1})=D_{\ell}-D_{L}$. For both $D_{\ell}\sim D_{0}\exp(-c\,2^{\ell})$ and $D_{\ell}\sim D_{0}\exp(-a\ell)$, $V_{\ell}$ decreases with $\ell$. Thus, among schedules equivalent for the mean-field objective, swapping a saturated dropout layer from a later depth to an earlier one weakly improves this rank-flow proxy: dropout is most useful before diversity has already collapsed.
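The telescoping argument above is easy to tabulate. The sketch below evaluates $V_{\ell}=D_{\ell}-D_{L}$ (with the proportionality constant set to one) for the two decay laws mentioned in the text. The depth `L = 8` and the rate constants are hypothetical values chosen only to make the trend visible.

```python
import math

def preventable_collapse(diversity):
    """V_l = D_l - D_L for a profile [D_0, ..., D_L] (telescoped sum from the text)."""
    d_final = diversity[-1]
    return [d - d_final for d in diversity]

L = 8
# Doubly exponential decay, D_l ~ exp(-c 2^l), as in the attention-like bound (74).
attention_like = [math.exp(-0.05 * 2 ** l) for l in range(L + 1)]
# Simple exponential decay, D_l ~ exp(-a l), as in the linear proxy (75).
linear_like = [math.exp(-0.4 * l) for l in range(L + 1)]

for name, profile in (("attention-like", attention_like), ("linear-like", linear_like)):
    v = preventable_collapse(profile)
    print(name, [round(x, 4) for x in v])
```

In both profiles the printed values shrink with depth, which is the sense in which an earlier dropout layer has more collapse left to prevent under this proxy.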
Dropout counteracts this collapse by shrinking within-layer normalized correlations. With keep probability $\rho_{\ell}\in(0,1]$, inverted dropout acts as

$$\tilde{y}_{\ell,i}\;=\;\frac{p_{\ell,i}}{\rho_{\ell}}\,y_{\ell,i},\qquad p_{\ell,i}\sim\mathrm{Bernoulli}(\rho_{\ell}),\tag{76}$$

with the $p_{\ell,i}$ independent across $i$. Let the uncentered second moment at depth $\ell$ be

$$\Sigma_{\ell}\;\equiv\;\mathbb{E}\left[y_{\ell}y_{\ell}^{\top}\right],\tag{77}$$

and define $\tilde{\Sigma}_{\ell}$ analogously for $\tilde{y}_{\ell}$. Averaging over the dropout mask gives

$$\tilde{\Sigma}_{\ell}\;=\;\Sigma_{\ell}+\left(\frac{1}{\rho_{\ell}}-1\right)\mathrm{diag}(\Sigma_{\ell}),\tag{78}$$

so off-diagonal second moments are unchanged while the diagonal is inflated. Writing (with $D_{\ell}$ now denoting this diagonal matrix rather than the rank-diversity variable used above)

$$D_{\ell}\;\equiv\;\mathrm{diag}(\Sigma_{\ell}),\qquad C_{\ell}\;\equiv\;D_{\ell}^{-1/2}\Sigma_{\ell}D_{\ell}^{-1/2},\tag{79}$$

one obtains

$$\tilde{C}_{\ell}\;=\;\rho_{\ell}C_{\ell}+(1-\rho_{\ell})I.\tag{80}$$

This identity suppresses off-diagonal correlations and moves the eigenvalues of the normalized correlation matrix $C_{\ell}$ toward one. The statement is about the normalized correlation matrix; the raw covariance $\tilde{\Sigma}_{\ell}$ can still have heterogeneous diagonal variances.

A convenient quantitative proxy is the participation-ratio effective rank

$$R_{\mathrm{eff}}(\Sigma)\;\equiv\;\frac{\bigl(\mathrm{tr}\,\Sigma\bigr)^{2}}{\mathrm{tr}\left(\Sigma^{2}\right)}.\tag{81}$$

Combining ([78](https://arxiv.org/html/2605.21648#A1.E78)) with trace identities yields

$$R_{\mathrm{eff}}\left(\tilde{\Sigma}_{\ell}\right)\;=\;\frac{\bigl(\mathrm{tr}\,\Sigma_{\ell}\bigr)^{2}}{\rho_{\ell}^{2}\,\mathrm{tr}\left(\Sigma_{\ell}^{2}\right)+\left(1-\rho_{\ell}^{2}\right)\mathrm{tr}\left(\mathrm{diag}(\Sigma_{\ell})^{2}\right)}.\tag{82}$$

Since $\mathrm{tr}(\Sigma_{\ell}^{2})\geq\mathrm{tr}(\mathrm{diag}(\Sigma_{\ell})^{2})$ for any PSD $\Sigma_{\ell}$,

$$R_{\mathrm{eff}}\left(\tilde{\Sigma}_{\ell}\right)\;\geq\;R_{\mathrm{eff}}(\Sigma_{\ell}),\tag{83}$$

with equality iff $\rho_{\ell}=1$ or $\Sigma_{\ell}$ is diagonal (equivalently, for any nontrivial dropout $0<\rho_{\ell}<1$, iff all off-diagonal entries vanish). Independent additive noise would also inflate the diagonal and break perfect alignment; dropout is the multiplicative Bernoulli realization for which the effective field is directly controlled by $\rho_{\ell}$.
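The identities (78), (80), and (82) can be verified directly on a small matrix. The sketch below applies the mask-averaged dropout formula to a hand-picked $3\times 3$ second-moment matrix and compares the effective rank with the closed form. The matrix entries, the keep probability `rho = 0.8`, and all helper names are illustrative assumptions.

```python
def trace(S):
    return sum(S[i][i] for i in range(len(S)))

def trace_of_square(S):
    """tr(S^2) for a symmetric matrix stored as nested lists."""
    n = len(S)
    return sum(S[i][j] * S[j][i] for i in range(n) for j in range(n))

def effective_rank(S):
    """Participation-ratio effective rank, Eq. (81)."""
    return trace(S) ** 2 / trace_of_square(S)

def dropout_second_moment(S, rho):
    """Eq. (78): off-diagonal entries unchanged, diagonal entries scaled by 1/rho."""
    n = len(S)
    return [[S[i][j] / rho if i == j else S[i][j] for j in range(n)] for i in range(n)]

def normalized(S):
    """Normalized correlation matrix, Eq. (79)."""
    n = len(S)
    return [[S[i][j] / (S[i][i] * S[j][j]) ** 0.5 for j in range(n)] for i in range(n)]

def effective_rank_closed_form(S, rho):
    """Right-hand side of Eq. (82)."""
    diag_sq = sum(S[i][i] ** 2 for i in range(len(S)))
    return trace(S) ** 2 / (rho ** 2 * trace_of_square(S) + (1.0 - rho ** 2) * diag_sq)

# Hypothetical positive-definite second-moment matrix and keep probability.
Sigma = [[2.0, 1.2, 0.8],
         [1.2, 1.5, 0.9],
         [0.8, 0.9, 1.0]]
rho = 0.8

Sigma_tilde = dropout_second_moment(Sigma, rho)
print("R_eff before dropout:", effective_rank(Sigma))
print("R_eff after dropout: ", effective_rank(Sigma_tilde))
print("closed form (82):    ", effective_rank_closed_form(Sigma, rho))
```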
## Appendix BCritical Exponents and Non\-Analyticity
The Hermite decomposition in App. [C.3](https://arxiv.org/html/2605.21648#A3.SS3) diagnoses which expansion is valid here. In the Gaussian channel,

$$F(c)=\frac{\sigma_{w}^{2}\sum_{n\geq 0}a_{n}^{2}c^{n}+\sigma_{b}^{2}}{q^{*}}.\tag{84}$$

Rapid Hermite decay makes this series analytic at $c=1$, giving the ordinary Taylor/Landau equation below. A kink produces a power-law Hermite tail, so high-degree modes contribute collectively as $c\to 1$; the Taylor expansion is replaced by the branch-point term $m^{3/2}$, which is the source of the kinked exponents.
### B\.1Smooth Networks
The standard static exponents follow directly from ([24](https://arxiv.org/html/2605.21648#S3.E24)). Along the critical isotherm $t=0$,

$$h=\frac{g_{\rho}}{2}m^{2}\qquad\Rightarrow\qquad m\sim h^{1/2},\tag{85}$$

so $\delta=2$. The analogue of the magnetic susceptibility is $\chi_{M}(t)\equiv\left.\partial m/\partial h\right|_{h\to 0}$. On the subcritical side $t<0$, the analytic branch satisfies $m\simeq-h/t$, hence $\chi_{M}(t)\sim|t|^{-1}$ and $\gamma=1$. Along $h=0$, the physical branch is $m=0$ for $t\leq 0$ and $m=2t/g_{\rho}$ for $t\geq 0$, so $m\sim t$ as $t\to 0^{+}$ and $\beta=1$.

The depth correlation length follows from linearizing the correlation dynamics about the displaced fixed point. To leading order,

$$\lambda\equiv\bar{F}_{\rho}'(c^{*})=\bar{F}_{\rho}'(1-m)\simeq\chi_{\rho}-g_{\rho}m=1+t-g_{\rho}m,\tag{86}$$

and substituting ([25](https://arxiv.org/html/2605.21648#S3.E25)) gives

$$\lambda(t,h)=1-\sqrt{t^{2}+2g_{\rho}h}.\tag{87}$$

For $\lambda\simeq 1$ with $\lambda>0$, the depth correlation length satisfies

$$\xi_{c,\rho}^{-1}\equiv-\log\lambda\simeq 1-\lambda=\sqrt{t^{2}+2g_{\rho}h}.\tag{88}$$

Thus $\xi_{c,\rho}(t,0)\sim|t|^{-1}$ and $\nu_{t}=1$, while $\xi_{c,\rho}(0,h)\sim h^{-1/2}$ and $\nu_{\rho}=1/2$. Equivalently, $y_{t}=1/\nu_{t}=1$ and $y_{\rho}=1/\nu_{\rho}=2$, so dropout is the more relevant perturbation of the correlation edge.
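A small numerical sketch may help connect Eqs. (85) to (88). It assumes the second-order truncated recursion $m_{\ell+1}=h+(1+t)\,m_{\ell}-\tfrac{g_{\rho}}{2}m_{\ell}^{2}$, whose fixed point satisfies $h=\tfrac{g_{\rho}}{2}m^{2}-tm$ and whose slope reproduces Eq. (86). That truncation, the value `g = 1.0`, and the function names are assumptions of this illustration rather than the paper's code.

```python
import math

def smooth_fixed_point(t, h, g):
    """Positive root of h = (g/2) m^2 - t m."""
    return (t + math.sqrt(t * t + 2.0 * g * h)) / g

def iterate_smooth_map(t, h, g, m0=0.05, steps=5000):
    """Iterate the truncated recursion for m = 1 - c (an assumed normal form)."""
    m = m0
    for _ in range(steps):
        m = h + (1.0 + t) * m - 0.5 * g * m * m
    return m

def correlation_length(t, h, g):
    """xi = -1 / log(lambda) with lambda from Eq. (87)."""
    lam = 1.0 - math.sqrt(t * t + 2.0 * g * h)
    return -1.0 / math.log(lam)

def log_slope(f, x1, x2):
    """Effective exponent d log f / d log x estimated between two points."""
    return math.log(f(x1) / f(x2)) / math.log(x1 / x2)

g = 1.0
print("iterated m*:", iterate_smooth_map(0.0, 1e-4, g))
print("closed form:", smooth_fixed_point(0.0, 1e-4, g))
# Exponent of xi along the critical isotherm (expected near -nu_rho = -1/2).
print("xi vs h:", log_slope(lambda h: correlation_length(0.0, h, g), 1e-6, 1e-8))
# Exponent of xi at zero field on the subcritical side (expected near -nu_t = -1).
print("xi vs |t|:", log_slope(lambda s: correlation_length(-s, 0.0, g), 1e-3, 1e-5))
```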
At the marginal point $t=h=0$, the same smooth normal form gives the depth relaxation law:

$$m_{\ell+1}-m_{\ell}\simeq-\frac{g_{\rho}}{2}m_{\ell}^{2}.\tag{89}$$

In a continuum-depth approximation this gives $m(\ell)\sim\ell^{-1}$, hence $\theta_{\rm rel}=1$ for the smooth class.
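The $\ell^{-1}$ law can be checked by iterating Eq. (89) directly. In the continuum approximation $dm/d\ell=-\tfrac{g_{\rho}}{2}m^{2}$ integrates to $m(\ell)\simeq 2/(g_{\rho}\ell)$ at large depth, so $\tfrac{g_{\rho}}{2}\,\ell\,m_{\ell}$ should approach one. The starting value, depth, and coupling below are arbitrary illustrative choices.

```python
def relax(g, m0=0.1, steps=100000):
    """Iterate Eq. (89) as an equality: m <- m - (g/2) m^2."""
    m = m0
    for _ in range(steps):
        m -= 0.5 * g * m * m
    return m

g = 1.0
for depth in (1000, 10000, 100000):
    m = relax(g, steps=depth)
    # The rescaled quantity tends to 1 if m decays like 2 / (g * depth).
    print(depth, 0.5 * g * depth * m)
```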
A pseudo-free-energy functional consistent with ([24](https://arxiv.org/html/2605.21648#S3.E24)) is

$$f(t,m;h)=-\frac{t}{2}m^{2}+\frac{g_{\rho}}{6}m^{3}-hm,\tag{90}$$

since $\partial f/\partial m=0$ is equivalent to ([24](https://arxiv.org/html/2605.21648#S3.E24)). This object should be read in the same spirit as a low-order one-particle-irreducible (1PI) effective action: it is not a literal training loss or microscopic energy, but an effective potential for the slow variable $m$. Its stationarity condition gives the equation of state, while its on-shell singular part organizes the thermodynamic-style exponents; analytic $m$-independent backgrounds can be added without changing the physics of the recursion.

At $h=0$, substituting the physical branch $m=0$ for $t\leq 0$ and $m=2t/g_{\rho}$ for $t\geq 0$ yields

$$f_{\rm on}(t,0)=0\quad(t\leq 0),\qquad f_{\rm on}(t,0)=-\frac{2}{3g_{\rho}^{2}}t^{3}\quad(t\geq 0).\tag{91}$$

Therefore the analogue of the specific heat, $C\equiv-\partial_{t}^{2}f_{\rm on}$, stays finite and vanishes linearly:

$$C(t)=\frac{4}{g_{\rho}^{2}}\,t\,\Theta(t)\sim|t|,\tag{92}$$

where $\Theta(t)$ is the Heaviside step function. In standard exponent language $C_{\rm sing}(t)\sim|t|^{-\alpha}$, this corresponds to

$$\alpha=-1,\tag{93}$$

up to an additive analytic background $f_{0}(t,h)$ that does not affect the equation of state.

Collecting exponents for the smooth universality class:

$$\beta=1,\qquad\gamma=1,\qquad\delta=2,\qquad\nu_{t}=1,\qquad\nu_{\rho}=\frac{1}{2},\qquad\theta_{\rm rel}=1,\qquad\alpha=-1.\tag{94}$$

For comparison, standard Ising-like Landau mean-field theory (with $m\to-m$ symmetry and a quartic interaction) has

$$\beta=\frac{1}{2},\qquad\gamma=1,\qquad\delta=3,\qquad\nu=\frac{1}{2},\qquad\alpha=0.\tag{95}$$

Our dropout-deformed correlation-edge theory shares $\gamma=1$, but the lack of a $\mathbb{Z}_{2}$ symmetry (and the resulting cubic interaction) shifts $\beta$ and $\delta$, produces $\nu_{t}=1$ rather than $\nu=1/2$, and yields a specific-heat analogue that vanishes at the transition ($\alpha=-1$) rather than exhibiting the Ising mean-field behavior ($\alpha=0$).
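The on-shell free energy and the specific-heat analogue of the smooth class, Eqs. (90) to (92), can be reproduced with a short finite-difference check. The coupling `g = 1.5`, the step size, and the function names are assumptions of this sketch.

```python
def free_energy(t, m, h, g):
    """Pseudo-free-energy of Eq. (90)."""
    return -0.5 * t * m * m + g * m ** 3 / 6.0 - h * m

def on_shell(t, g):
    """Evaluate Eq. (90) at h = 0 on the physical branch m = 0 (t <= 0) or m = 2t/g (t > 0)."""
    m = 2.0 * t / g if t > 0.0 else 0.0
    return free_energy(t, m, 0.0, g)

def specific_heat(t, g, eps=1e-3):
    """C = -d^2 f_on / dt^2 by a central second difference."""
    return -(on_shell(t + eps, g) - 2.0 * on_shell(t, g) + on_shell(t - eps, g)) / eps ** 2

g = 1.5
for t in (-0.1, 0.05, 0.1, 0.2):
    # Compare with the prediction 4 t / g^2 for t > 0 and 0 for t < 0, Eq. (92).
    predicted = 4.0 * t / g ** 2 if t > 0.0 else 0.0
    print(t, specific_heat(t, g), predicted)
```

The linear vanishing of `specific_heat` as $t\to 0^{+}$ is what the exponent $\alpha=-1$ in the collected smooth-class table encodes.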
### B\.2Kinked Networks
Up to now, the Landau-style expansion of $\bar{F}_{\rho}(c)$ around $c=1$ relied on the assumption that $\bar{F}_{\rho}(c)$ is analytic in the decorrelation variable $m\equiv 1-c$ at $m=0$.
For smooth activations this is the natural situation, and the first nonlinear correction is of order $m^{2}$, controlled by
$$g_{\rho}\propto\int Dz\,\bigl[\phi''(\sqrt{\bar{q}^{*}}z)\bigr]^{2}.\tag{97}$$

For kinked activations, this logic fails: the map remains smooth for $c<1$, but as $c\to 1$ it is governed by a branch point, and the leading nonlinear correction is non-analytic in $m$. This changes the scaling of the dropout deformation.
We illustrate this explicitly for the $\operatorname{ReLU}$ activation $\phi(x)=\max(0,x)$. To keep formulas compact we set $\sigma_{b}=0$ throughout this subsection and use the same inverted-dropout convention as before, with masks drawn independently across inputs.

The mean-field correlation map for $\operatorname{ReLU}$ admits the standard arc-cosine closed form (Cho and Saul, [2009](https://arxiv.org/html/2605.21648#bib.bib6)). Writing $c\in[-1,1]$ for the preactivation correlation at some depth,

$$F_{\operatorname{ReLU}}(c)=\frac{1}{\pi}\Bigl[\sqrt{1-c^{2}}+\bigl(\pi-\cos^{-1}c\bigr)c\Bigr].\tag{98}$$

Near perfect alignment, setting $c=1-m$ with $0<m\ll 1$, one obtains the non-analytic expansion

$$F_{\operatorname{ReLU}}(1-m)=1-m+\kappa m^{3/2}+\mathcal{O}(m^{2}),\qquad\kappa\equiv\frac{2\sqrt{2}}{3\pi}.\tag{99}$$

Thus, the Taylor expansion valid for smooth activations would have missed the $m^{3/2}$ contribution.
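The branch-point expansion (99) can be checked against the closed form (98) numerically. The sketch below estimates the exponent and coefficient of the leading nonlinear remainder $F_{\operatorname{ReLU}}(1-m)-(1-m)$. The helper names and sample values of $m$ are illustrative assumptions.

```python
import math

KAPPA = 2.0 * math.sqrt(2.0) / (3.0 * math.pi)

def f_relu(c):
    """Arc-cosine correlation map of Eq. (98)."""
    return (math.sqrt(1.0 - c * c) + (math.pi - math.acos(c)) * c) / math.pi

def remainder(m):
    """Nonlinear part F(1 - m) - (1 - m), predicted to behave like kappa * m^(3/2)."""
    c = 1.0 - m
    return f_relu(c) - c

def local_exponent(m1, m2):
    """Log-slope of the remainder between two small values of m."""
    return math.log(remainder(m1) / remainder(m2)) / math.log(m1 / m2)

print("estimated exponent:", local_exponent(1e-3, 1e-4))
print("estimated kappa:   ", remainder(1e-4) / 1e-4 ** 1.5)
print("kappa from (99):   ", KAPPA)
```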
We now turn on dropout and again view dropout as a field deformation at $c=1$:

$$\bar{F}_{\rho}(1)=1-h,\qquad h>0.\tag{100}$$

In the present $\sigma_{b}=0$ setting, this field is exactly

$$h\;=\;1-\bar{F}_{\rho}(1)\;=\;1-\rho,\tag{101}$$

independent of the activation, since at $c=1$ the only source of mismatch is the independence of the two dropout masks. We write the fixed point as $c^{*}=1-m^{*}$ with $m^{*}>0$ and impose $c^{*}=\bar{F}_{\rho}(c^{*})$. For the literal bias-free $\operatorname{ReLU}$ map, $\bar{F}_{\rho}(c)=\rho F_{\operatorname{ReLU}}(c)$, so $\chi_{\rho}=\rho$ and $t=\rho-1=-h$. Thus the $t=0$, $h>0$ direction below is the kinked normal-form field direction, while standard $\operatorname{ReLU}$ dropout follows a constrained path through the same scaling function.

Near $m^{*}\ll 1$, the kinked analogue of the Landau equation of state is obtained by combining the shift $\bar{F}_{\rho}(1)=1-h$, the linear slope $\chi_{\rho}=1+t$, and the non-analytic term ([99](https://arxiv.org/html/2605.21648#A2.E99)):

$$h\;=\;\kappa m^{3/2}-tm+\mathcal{O}(m^{2}),\qquad t\equiv\chi_{\rho}-1.\tag{102}$$

Along the normal-form critical isotherm $t=0$, this gives

$$m^{*}\;\sim\;\left(\frac{h}{\kappa}\right)^{2/3},\qquad(t=0),\tag{103}$$

so the smooth exponent $\delta=2$ is replaced by
m∝h1/δkink,δkink=32\.m\\propto h^\{1/\\delta\_\{\\mathrm\{kink\}\}\},\\qquad\\delta\_\{\\mathrm\{kink\}\}=\\frac\{3\}\{2\}\.\(104\)And with this we immediately find the smoking gun telling us that kinked and smooth activations fall into distinct mean\-field universality classes under the dropout deformation\. As before, the depth scale follows from linearizing the correlation dynamics aroundc∗c^\{\*\}\. Differentiating \([99](https://arxiv.org/html/2605.21648#A2.E99)\) gives
$$
\lambda\equiv\bar{F}_{\rho}^{\prime}(c^{*})\simeq 1+t-\frac{3\kappa}{2}\sqrt{m^{*}}. \tag{105}
$$
At $t=0$ we therefore have
$$
1-\lambda\propto\sqrt{m^{*}}\propto h^{1/3}. \tag{106}
$$
Using $\xi_{c,\rho}^{-1}\simeq 1-\lambda$ for $\lambda\simeq 1$ yields
$$
\xi_{c,\rho}(0,h)\sim h^{-1/3},\qquad \nu_{\rho,\mathrm{kink}}=\frac{1}{3}, \tag{107}
$$
to be compared with the smooth result $\xi_{c,\rho}(0,h)\sim h^{-1/2}$. For ordinary bias-free $\operatorname{ReLU}$ dropout, substituting the constrained path $t=-h$ into ([102](https://arxiv.org/html/2605.21648#A2.E102)) gives
$$
h=\kappa m^{3/2}+hm+\mathcal{O}(m^{2}). \tag{108}
$$
Since the leading solution has $m\sim h^{2/3}$, the $hm$ term is subleading. Moreover,
$$
1-\lambda\simeq h+\frac{3\kappa}{2}\sqrt{m}\sim h^{1/3}, \tag{109}
$$
so the same $\nu_{\rho,\mathrm{kink}}=1/3$ survives as a constrained-path exponent.
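The constrained-path exponents can be checked numerically. The following is a minimal sketch, assuming the bias-free map $\bar{F}_{\rho}(c)=\rho F_{\operatorname{ReLU}}(c)$ with $h=1-\rho$ as stated above and the closed form of Eq. (121); the function names and the bisection solver are illustrative choices, not the authors' code.

```python
import math

KAPPA = 2.0 * math.sqrt(2.0) / (3.0 * math.pi)  # ReLU value of kappa, Eq. (120)


def f_relu(c):
    """Normalized ReLU correlation map, Eq. (121)."""
    return (math.sqrt(1.0 - c * c) + (math.pi - math.acos(c)) * c) / math.pi


def f_relu_prime(c):
    """Derivative of Eq. (121): F'(c) = (pi - arccos c) / pi."""
    return (math.pi - math.acos(c)) / math.pi


def fixed_point_m(h, iters=200):
    """Solve c* = rho F(c*) with rho = 1 - h by bisection; return m* = 1 - c*."""
    rho = 1.0 - h
    lo, hi = 0.0, 1.0  # g(0) = h > 0, g(1) = -rho/pi < 0, and g decreases in m
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        g = (1.0 - mid) - rho * f_relu(1.0 - mid)
        if g > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


hs = [10.0 ** (-k) for k in range(4, 8)]   # small dropout fields h = 1 - rho
ms = [fixed_point_m(h) for h in hs]         # decorrelation order parameter m*
# 1 - lambda, the inverse depth scale, with lambda = rho F'(c*)
gaps = [1.0 - (1.0 - h) * f_relu_prime(1.0 - m) for h, m in zip(hs, ms)]

delta_inv_fit = loglog_slope(hs, ms)   # should approach 1/delta_kink = 2/3
nu_fit = loglog_slope(hs, gaps)        # should approach nu_kink = 1/3
print(delta_inv_fit, nu_fit)
```

Bisection is used rather than direct iteration of the map because the iteration converges on the depth scale $\xi\sim h^{-1/3}$, which becomes slow exactly in the small-$h$ regime of interest.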
The remaining kinked exponents follow from the same equation of state. At $h=0$ and $t>0$,
$$
\kappa m^{3/2}=tm\qquad\Rightarrow\qquad m=\left(\frac{t}{\kappa}\right)^{2}, \tag{110}
$$
so $\beta_{\rm kink}=2$. On the subcritical side $t<0$, the small-field balance is $h\simeq|t|m$, giving $\chi_{M}\sim|t|^{-1}$ and therefore $\gamma_{\rm kink}=1$. Substituting the zero-field branch $\sqrt{m}=t/\kappa$ into ([105](https://arxiv.org/html/2605.21648#A2.E105)) gives $\lambda\simeq 1+t-\tfrac{3}{2}t$, i.e. $1-\lambda\simeq t/2\propto t$, so $\xi_{c,\rho}(t,0)\sim t^{-1}$ and $\nu_{t,\rm kink}=1$.
Adding nonzero $\operatorname{ReLU}$ bias variance does not restore a finite-$q^{*}$ critical isotherm. With inverted dropout,
$$
\bar{q}_{\ell+1}=\frac{\sigma_{w}^{2}}{2\rho}\bar{q}_{\ell}+\sigma_{b}^{2}, \tag{111}
$$
so finite variance requires $a\equiv\sigma_{w}^{2}/(2\rho)<1$, while angular criticality requires $\chi_{\rho}=\sigma_{w}^{2}/2=1$, incompatible with $\rho<1$. Equivalently, $h=a(1-\rho)$ and $t=\rho a-1=-h-(1-a)$, so finite variance adds a subcritical detuning that cuts off the asymptotic dropout law unless one takes the singular double scaling $1-a\ll h^{1/3}$.
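The bookkeeping around Eq. (111) can be made concrete with a small sketch. The numerical values of $\sigma_w^2$, $\sigma_b^2$ and $\rho$ below are hypothetical and chosen only for illustration; the helper names are not from the paper.

```python
def dropout_relu_fields(sigma_w2, rho):
    """Return (a, h, t) for biased ReLU with inverted dropout.

    a = sigma_w^2 / (2 rho) is the variance multiplier in Eq. (111);
    h = a (1 - rho) and t = rho a - 1 are the relations quoted in the text.
    """
    a = sigma_w2 / (2.0 * rho)
    h = a * (1.0 - rho)
    t = rho * a - 1.0
    return a, h, t


def variance_fixed_point(sigma_w2, sigma_b2, rho):
    """Fixed point q* = sigma_b^2 / (1 - a) of Eq. (111) for sigma_b^2 > 0.

    Returns None when a >= 1, where the recursion has no finite fixed point.
    """
    a = sigma_w2 / (2.0 * rho)
    if a >= 1.0:
        return None
    return sigma_b2 / (1.0 - a)


# Hypothetical parameters: keep rate 0.9, weight variance below 2 * rho.
a, h, t = dropout_relu_fields(sigma_w2=1.7, rho=0.9)
q_star = variance_fixed_point(sigma_w2=1.7, sigma_b2=0.02, rho=0.9)

# Iterating Eq. (111) from an arbitrary start converges to q* when a < 1.
q = 1.0
for _ in range(500):
    q = a * q + 0.02

# Angular criticality (sigma_w^2 = 2) forces a = 1 / rho > 1: no finite q*.
critical_case = variance_fixed_point(sigma_w2=2.0, sigma_b2=0.02, rho=0.9)
print(a, h, t, q_star, q, critical_case)
```

The detuning identity $t=-h-(1-a)$ holds for any input, which is why a finite variance fixed point ($a<1$) always places the network on the subcritical side.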
It is also useful, in direct analogy with the smooth case, to package ([102](https://arxiv.org/html/2605.21648#A2.E102)) into a pseudo-free-energy. A functional whose stationarity condition reproduces the kinked equation of state is
$$
f_{\mathrm{kink}}(t,m;h) \;=\; -\frac{t}{2}m^{2}+\frac{2\kappa}{5}m^{5/2}-hm, \tag{112}
$$
since $\partial f_{\mathrm{kink}}/\partial m=0$ gives $h=\kappa m^{3/2}-tm$ at leading order. In the effective-action language, the $m^{5/2}$ term is the pseudo-free-energy avatar of the branch point in the $\operatorname{ReLU}$ correlation map; this is precisely where the kinked universality class departs from the analytic Landau polynomial above. (Footnote 5: As before, $f_{\mathrm{kink}}$ is defined only up to addition of an $m$-independent analytic background $f_{0}(t,h)$, which does not affect the equation of state.) At zero field the physical branch is $m=0$ for $t\leq 0$ and $m=(t/\kappa)^{2}$ for $t\geq 0$, so the on-shell free energy behaves as
$$
f_{\mathrm{kink,on}}(t,0)=0\quad(t\leq 0),\qquad f_{\mathrm{kink,on}}(t,0)=-\frac{1}{10\kappa^{4}}t^{5}\quad(t\geq 0). \tag{113}
$$
Thus the analogue of the specific heat, $C\equiv-\partial_{t}^{2}f_{\mathrm{kink,on}}$, stays finite and in fact vanishes as
$$
C(t) \;=\; \frac{2}{\kappa^{4}}t^{3}\Theta(t) \;\sim\; |t|^{3}. \tag{114}
$$
In the standard exponent convention $C_{\mathrm{sing}}(t)\sim|t|^{-\alpha}$, this corresponds to
$$
\alpha_{\mathrm{kink}}=-3. \tag{115}
$$
Although we used $\operatorname{ReLU}$ to compute the near-alignment expansion in closed form, the resulting fractional power is not a peculiarity of $\operatorname{ReLU}$. Rather, it is a generic feature of the mean-field Gaussian channel whenever the activation has an ordinary kink: $\phi$ is continuous and piecewise $C^{1}$ with a finite jump in slope at some point. In that case the correlated-Gaussian expectation defining the normalized correlation map remains smooth for $c<1$, but its behavior as $c\to 1$ is controlled by a square-root branch associated with the thin Gaussian tube around the diagonal $u_{1}=u_{2}$. Writing $c=1-m$ with $0<m\ll 1$, the transverse fluctuations scale as $\sqrt{m}$, so the probability that two nearly aligned preactivations fall on opposite sides of the kink scales as $\mathcal{O}(\sqrt{m})$. In that straddling region a finite-slope kink produces activation mismatches of size $\mathcal{O}(\sqrt{m})$, which enter the kernel quadratically as an $\mathcal{O}(m)$ local correction, so the leading non-linear correction to the kernel scales as $\mathcal{O}(\sqrt{m})\times\mathcal{O}(m)=\mathcal{O}(m^{3/2})$. Thus, for any standard kinked activation one expects
$$
F(1-m)=1-m+\kappa_{\phi}m^{3/2}+\mathcal{O}(m^{2}), \tag{116}
$$
with an activation-dependent coefficient $\kappa_{\phi}$, while the exponent $3/2$ is fixed by the tube geometry.
Finally, it is useful to recall what happens when dropout is switched off exactly at the marginal point. Setting
$$
\rho=1,\qquad t=0, \tag{117}
$$
the non-analytic correction controls the approach to $c=1$. Writing $m_{\ell}\equiv 1-c_{\ell}$ and using ([99](https://arxiv.org/html/2605.21648#A2.E99)),
$$
m_{\ell+1}-m_{\ell}\simeq-\kappa m_{\ell}^{3/2}. \tag{118}
$$
In a continuum-depth approximation $m_{\ell+1}-m_{\ell}\to\partial_{\ell}m$, we obtain
$$
\partial_{\ell}m\simeq-\kappa m^{3/2}\qquad\Rightarrow\qquad m(\ell)\sim\ell^{-2}, \tag{119}
$$
which reproduces the characteristic $1-c_{\ell}\sim\ell^{-2}$ relaxation of $\operatorname{ReLU}$ activations at the edge-of-chaos (Hayou et al., [2019](https://arxiv.org/html/2605.21648#bib.bib20)) and gives $\theta_{\rm rel}=2$.
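The power-law relaxation can be reproduced by iterating the dropout-free critical map directly. This is a minimal sketch assuming the closed form of Eq. (121); the starting value `m0` and the depths used for the fit are arbitrary illustrative choices. Integrating Eq. (119) gives $m^{-1/2}\simeq\kappa\ell/2+\text{const}$, so the sketch also compares the amplitude against $4/(\kappa\ell)^{2}$.

```python
import math

KAPPA = 2.0 * math.sqrt(2.0) / (3.0 * math.pi)  # ReLU value of kappa, Eq. (120)


def next_m(m):
    """One layer of the critical ReLU map written in m = 1 - c.

    With theta = arccos(1 - m), Eq. (121) is equivalent to
    m_{l+1} = m_l - (sin(theta) - theta cos(theta)) / pi.
    """
    theta = math.acos(1.0 - m)
    return m - (math.sin(theta) - theta * math.cos(theta)) / math.pi


def relax(m0, depth):
    """Iterate the map and return the trajectory [m_0, ..., m_depth]."""
    traj = [m0]
    for _ in range(depth):
        traj.append(next_m(traj[-1]))
    return traj


traj = relax(m0=0.5, depth=10_000)
l1, l2 = 1_000, 10_000

# Effective relaxation exponent between two depths; should be close to 2.
theta_rel = -math.log(traj[l2] / traj[l1]) / math.log(l2 / l1)

# Ratio of m_l to the continuum prediction 4 / (kappa l)^2; should be close to 1.
amplitude = traj[l2] * (KAPPA * l2 / 2.0) ** 2
print(theta_rel, amplitude)
```

Both quantities approach their asymptotic values only slowly, since the additive constant in $m^{-1/2}$ and logarithmic corrections decay like $1/\ell$ relative to the leading term.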
### B.3 Non-Analytic Expansion of the $\operatorname{ReLU}$ Correlation Map near $c=1$
Here we derive the non-analytic expansion
$$
F_{\operatorname{ReLU}}(1-m)=1-m+\kappa m^{3/2}+\mathcal{O}(m^{2}),\qquad\kappa=\frac{2\sqrt{2}}{3\pi}, \tag{120}
$$
starting from the closed-form $\operatorname{ReLU}$ correlation map (Cho and Saul, [2009](https://arxiv.org/html/2605.21648#bib.bib6))
$$
F_{\operatorname{ReLU}}(c)=\frac{1}{\pi}\Bigl[\sqrt{1-c^{2}}+\bigl(\pi-\cos^{-1}c\bigr)c\Bigr],\qquad c\in[-1,1]. \tag{121}
$$
Set $c=1-m$ with $0<m\ll 1$.
First expand the square root term,
$$
\begin{aligned}
\sqrt{1-c^{2}} &= \sqrt{1-(1-m)^{2}}=\sqrt{2m-m^{2}}=\sqrt{2m}\sqrt{1-\frac{m}{2}}\\
&= \sqrt{2m}\left(1-\frac{m}{4}+\mathcal{O}(m^{2})\right)=\sqrt{2}\,m^{1/2}-\frac{\sqrt{2}}{4}m^{3/2}+\mathcal{O}(m^{5/2}).
\end{aligned} \tag{122}
$$
Next expand $\cos^{-1}(1-m)$. Write $\theta=\cos^{-1}(1-m)$ and use
$$
\cos\theta=1-\frac{\theta^{2}}{2}+\frac{\theta^{4}}{24}+\mathcal{O}(\theta^{6}), \tag{123}
$$
together with an ansatz $\theta=am^{1/2}+bm^{3/2}+\mathcal{O}(m^{5/2})$. Matching $\cos\theta=1-m$ order by order gives $a=\sqrt{2}$ and $b=\sqrt{2}/12$, hence
$$
\cos^{-1}(1-m)=\sqrt{2}\,m^{1/2}+\frac{\sqrt{2}}{12}m^{3/2}+\mathcal{O}(m^{5/2}). \tag{124}
$$
Now expand the product term in ([121](https://arxiv.org/html/2605.21648#A2.E121)),
$$
\begin{aligned}
\bigl(\pi-\cos^{-1}c\bigr)c &= \Bigl(\pi-\cos^{-1}(1-m)\Bigr)(1-m)\\
&= \Bigl(\pi-\sqrt{2}\,m^{1/2}-\frac{\sqrt{2}}{12}m^{3/2}+\mathcal{O}(m^{5/2})\Bigr)(1-m)\\
&= \pi(1-m)-\sqrt{2}\,m^{1/2}+\frac{11\sqrt{2}}{12}m^{3/2}+\mathcal{O}(m^{2}).
\end{aligned} \tag{125}
$$
Adding ([122](https://arxiv.org/html/2605.21648#A2.E122)) and ([125](https://arxiv.org/html/2605.21648#A2.E125)) cancels the $\mathcal{O}(m^{1/2})$ terms and yields
$$
\begin{aligned}
\sqrt{1-c^{2}}+\bigl(\pi-\cos^{-1}c\bigr)c &= \pi(1-m)+\left(-\frac{\sqrt{2}}{4}+\frac{11\sqrt{2}}{12}\right)m^{3/2}+\mathcal{O}(m^{2})\\
&= \pi(1-m)+\frac{2\sqrt{2}}{3}m^{3/2}+\mathcal{O}(m^{2}).
\end{aligned} \tag{126}
$$
Dividing by $\pi$ gives ([120](https://arxiv.org/html/2605.21648#A2.E120)) with $\kappa=2\sqrt{2}/(3\pi)$.
The key structural point is the cancellation of the $\mathcal{O}(m^{1/2})$ pieces, leaving a leading non-analytic correction of order $m^{3/2}$, which encodes the kink.
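The coefficient can be confirmed numerically from the closed form. The sketch below, with illustrative function names, evaluates $(F_{\operatorname{ReLU}}(1-m)-(1-m))/m^{3/2}$ at a few small $m$ and compares it with $\kappa=2\sqrt{2}/(3\pi)$; it assumes only Eq. (121).

```python
import math

KAPPA = 2.0 * math.sqrt(2.0) / (3.0 * math.pi)  # predicted coefficient, Eq. (120)


def f_relu(c):
    """Normalized ReLU correlation map, Eq. (121)."""
    return (math.sqrt(1.0 - c * c) + (math.pi - math.acos(c)) * c) / math.pi


def kappa_estimate(m):
    """(F(1 - m) - (1 - m)) / m^{3/2}, which tends to kappa as m -> 0."""
    c = 1.0 - m
    return (f_relu(c) - c) / m ** 1.5


samples = [1e-2, 1e-3, 1e-4]
estimates = [kappa_estimate(m) for m in samples]
errors = [abs(est - KAPPA) for est in estimates]

for m, est in zip(samples, estimates):
    print(m, est, KAPPA)
```

If a residual $m^{1/2}$ term survived, the estimate would diverge like $1/m$ as $m\to 0$ instead of converging, so the convergence itself is a check of the cancellation.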
### B.4 Origin of the $m^{3/2}$ term for kinked activations
The $m^{3/2}$ term is a near-alignment effect. Write $c=1-m$ with $m\ll 1$. Then the two Gaussian preactivations differ only by a small transverse fluctuation. Away from a kink, the activation is locally smooth and the usual Taylor expansion produces only integer powers of $m$. The non-analytic term comes from the small set of samples where the two nearly identical preactivations fall on opposite sides of the kink. For $\operatorname{ReLU}$, the main text gives the closed-form expansion
$$
F(1-m)=1-m+\kappa m^{3/2}+O(m^{2}), \tag{127}
$$
with $\kappa=\frac{2\sqrt{2}}{3\pi}$ for $\operatorname{ReLU}$. The coefficient is specific to $\operatorname{ReLU}$ and to the normalization convention, but the exponent is not: the same $3/2$ power appears for any activation with an ordinary finite-slope kink.
Let $(u_{1},u_{2})$ be jointly Gaussian with
$$
\mathbb{E}[u_{1}]=\mathbb{E}[u_{2}]=0,\qquad\mathbb{E}[u_{1}^{2}]=\mathbb{E}[u_{2}^{2}]=1,\qquad\mathbb{E}[u_{1}u_{2}]=c. \tag{128}
$$
The normalized correlation map takes the form
$$
F(c)\;\propto\;\mathbb{E}\big[\phi(u_{1})\phi(u_{2})\big], \tag{129}
$$
with proportionality fixed by the variance normalization used in the main text.
We study the approach to perfect alignment by writing
$$
c=1-m,\qquad 0<m\ll 1. \tag{130}
$$
Introduce the sum and difference coordinates
$$
s=\frac{u_{1}+u_{2}}{\sqrt{2}},\qquad d=\frac{u_{1}-u_{2}}{\sqrt{2}}. \tag{131}
$$
A direct covariance computation gives
$$
\mathrm{Var}(s)=1+c=2-m,\qquad\mathrm{Var}(d)=1-c=m,\qquad\mathbb{E}[sd]=0. \tag{132}
$$
Hence as $m\to 0$ one has the scaling
$$
s\sim O(1),\qquad d\sim O(\sqrt{m}). \tag{133}
$$
Thus the average coordinate $s$ remains order one, while the separation $d$ has standard deviation $\sqrt{m}$. Geometrically, the joint Gaussian measure concentrates in a thin tube of transverse thickness $\sqrt{m}$ around the diagonal $u_{1}=u_{2}$. Only this tube can see the kink: the probability of straddling the kink is of order $\sqrt{m}$, and inside that region the activation mismatch produced by the slope jump is also of order $\sqrt{m}$. In the kernel this gives an order-$m$ local correction carried by an order-$\sqrt{m}$ region, hence the leading non-analytic contribution scales as $m^{3/2}$.
This argument is not special to $\operatorname{ReLU}$. Near the kink, any activation with finite one-sided slopes can be separated into a smooth part plus a multiple of $\operatorname{ReLU}$. If those slopes are $a_{\pm}$ at $u=0$, then locally
$$
\phi(u)=a_{-}u+\big(a_{+}-a_{-}\big)\operatorname{ReLU}(u)+r(u),\qquad r(u)=o(u)\ \ \text{as }u\to 0, \tag{134}
$$
where $r$ is continuous and has no slope jump at $u=0$. This remainder behaves like an ordinary smooth contribution in the near-alignment expansion and therefore gives only integer powers of $m$. The first fractional-power term is inherited from the $\operatorname{ReLU}$ component, so every ordinary kink gives the same $m^{3/2}$ exponent. Only the amplitude depends on nonuniversal details such as the slope jump $a_{+}-a_{-}$ and the precise variance normalization.
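As a worked instance of this decomposition, consider the leaky-ReLU-type activation $\phi(u)=a_{-}u+(a_{+}-a_{-})\operatorname{ReLU}(u)$, for which the remainder vanishes. The computation below is a sketch under stated assumptions: zero bias, unit-variance preactivations as in Eq. (128), and the normalization $F(c)=K(c)/K(1)$ with $K(c)\equiv\mathbb{E}[\phi(u_{1})\phi(u_{2})]$, which may differ from the main-text convention. It uses $\mathbb{E}[u_{1}\operatorname{ReLU}(u_{2})]=c/2$ and $\mathbb{E}[\operatorname{ReLU}(u_{1})\operatorname{ReLU}(u_{2})]=\tfrac{1}{2}F_{\operatorname{ReLU}}(c)$.

```latex
\begin{aligned}
K(c) &= a_{-}^{2}\,\mathbb{E}[u_{1}u_{2}]
      + a_{-}\Delta\,\bigl(\mathbb{E}[u_{1}\operatorname{ReLU}(u_{2})]+\mathbb{E}[\operatorname{ReLU}(u_{1})u_{2}]\bigr)
      + \Delta^{2}\,\mathbb{E}[\operatorname{ReLU}(u_{1})\operatorname{ReLU}(u_{2})],
      \qquad \Delta\equiv a_{+}-a_{-},\\
     &= a_{-}^{2}c + a_{-}\Delta\,c + \tfrac{1}{2}\Delta^{2}F_{\operatorname{ReLU}}(c)
      = a_{+}a_{-}\,c + \tfrac{1}{2}\Delta^{2}F_{\operatorname{ReLU}}(c),\\
K(1) &= a_{+}a_{-} + \tfrac{1}{2}\Delta^{2} = \tfrac{1}{2}\bigl(a_{+}^{2}+a_{-}^{2}\bigr),\\
F(1-m) &= \frac{K(1-m)}{K(1)} = 1-m+\kappa_{\phi}\,m^{3/2}+\mathcal{O}(m^{2}),
      \qquad \kappa_{\phi}=\frac{(a_{+}-a_{-})^{2}}{a_{+}^{2}+a_{-}^{2}}\cdot\frac{2\sqrt{2}}{3\pi}.
\end{aligned}
```

Under these assumptions the amplitude is set by the squared slope jump, reduces to the $\operatorname{ReLU}$ value for $(a_{+},a_{-})=(1,0)$, and vanishes for a linear activation $a_{+}=a_{-}$, while the exponent $3/2$ is unchanged.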
## Appendix C Hermite and Variational Perspective
We start from the variational problem that first led us to the smooth/kinked split: choose the shape of the activation so as to maximize the mean\-field correlation length\. Since wide\-network preactivations are Gaussian, this is naturally a problem in a function space with Gaussian measure\. Hermite polynomials are the standard orthonormal basis for such spaces, so expanding the rescaled activation in Hermite modes turns the correlation\-length calculation into a spectral problem\. This appendix makes that reduction explicit\.
Before doing this, we remove a trivial scale freedom: $\phi\mapsto a\phi$ can be absorbed by $\sigma_{w}^{2}\mapsto\sigma_{w}^{2}/a^{2}$ without changing the mean-field fixed point. To focus on shape rather than scale, we fix a variance fixed point $q^{*}$ and the bias variance $\sigma_{b}^{2}$, and determine $\sigma_{w}^{2}$ from the fixed-point condition. With this convention, the remaining freedom is the shape of a function $f\in L^{2}(Dz)$ with Gaussian inner product
$$
Dz\equiv\frac{e^{-z^{2}/2}}{\sqrt{2\pi}}\,dz,\qquad\langle f,g\rangle\equiv\int Dz\,f(z)g(z). \tag{135}
$$
### C.1 A scale-fixed variational problem for $\chi$
We define the fixed-point rescaled activation
$$
f(z)\equiv\phi(\sqrt{q^{*}}z). \tag{136}
$$
Assuming $\phi$ is weakly differentiable with $f\in H^{1}(Dz)$, Price's theorem (Price, [1958](https://arxiv.org/html/2605.21648#bib.bib30)) yields the correlation susceptibility used in deep information propagation (Schoenholz et al., [2017](https://arxiv.org/html/2605.21648#bib.bib38))
$$
\chi\equiv F^{\prime}(1)=\frac{\sigma_{w}^{2}}{q^{*}}\int Dz\,\bigl(f^{\prime}(z)\bigr)^{2}. \tag{137}
$$
The variance fixed-point equation
$$
q^{*}=\sigma_{w}^{2}\int Dz\,f(z)^{2}+\sigma_{b}^{2} \tag{138}
$$
determines
$$
\sigma_{w}^{2}=\frac{q^{*}-\sigma_{b}^{2}}{\int Dz\,f^{2}}. \tag{139}
$$
Substituting gives the scale-fixed factorization
$$
\chi=\left(1-\frac{\sigma_{b}^{2}}{q^{*}}\right)\frac{\int Dz\,(f^{\prime})^{2}}{\int Dz\,f^{2}}\equiv\left(1-\frac{\sigma_{b}^{2}}{q^{*}}\right)\mathcal{Q}[f],\qquad\mathcal{Q}[f]\equiv\frac{\|f^{\prime}\|_{2}^{2}}{\|f\|_{2}^{2}}. \tag{140}
$$
Thus, at fixed $(q^{*},\sigma_{b}^{2})$, extremizing $\chi$ over activations reduces to extremizing the Rayleigh quotient $\mathcal{Q}[f]$ (see for example Horn and Johnson, [1985](https://arxiv.org/html/2605.21648#bib.bib9)) over the shape of $f$. Criticality $\chi=1$ is equivalent to the constraint
$$
\mathcal{Q}[f]=\left(1-\frac{\sigma_{b}^{2}}{q^{*}}\right)^{-1}. \tag{141}
$$
In this convention, $\mathcal{Q}[f]$ quantifies the activation's effective sensitivity to typical Gaussian fluctuations: shifting spectral weight to higher modes increases $\|f^{\prime}\|_{2}$ relative to $\|f\|_{2}$ and therefore increases $\chi$ unless $\sigma_{w}^{2}$ is retuned.
### C.2 The Ornstein–Uhlenbeck eigenvalue problem
Because $\mathcal{Q}[f]$ is scale-invariant, we may impose the normalization $\int Dz\,f^{2}=1$ and extremize $\int Dz\,(f^{\prime})^{2}$. Introducing a Lagrange multiplier $\lambda$, consider
$$
\mathcal{S}[f]\equiv\int Dz\,\Bigl[(f^{\prime}(z))^{2}-\lambda f(z)^{2}\Bigr]. \tag{142}
$$
Gaussian-weighted integration by parts (with boundary terms vanishing under mild decay assumptions) gives the Euler–Lagrange equation
$$
\left(-\partial_{z}^{2}+z\partial_{z}\right)f(z)=\lambda f(z), \tag{143}
$$
i.e. the Ornstein–Uhlenbeck eigenproblem on $L^{2}(Dz)$. Its spectrum is
$$
\lambda_{n}=n,\qquad n=0,1,2,\dots, \tag{144}
$$
and the eigenfunctions form a complete orthonormal basis. Under the unit-norm constraint, these eigenfunctions are stationary points of $\mathcal{Q}[f]$, with $\mathcal{Q}[f]=\lambda$.
### C.3 Hermite basis and diagonal correlation propagation
A natural orthonormal eigenbasis of ([143](https://arxiv.org/html/2605.21648#A3.E143)) is given by the normalized Hermite polynomials; this is the usual Hermite/Ornstein–Uhlenbeck diagonalization of Gaussian channels. Hermite expansions have also been used directly to express neural-network covariance recursions in the infinite-width setting (Arratia et al., [2020](https://arxiv.org/html/2605.21648#bib.bib2)). We define
$$
h_{n}(z)\equiv\frac{1}{\sqrt{2^{n}n!}}H_{n}\left(\frac{z}{\sqrt{2}}\right),\qquad\langle h_{n},h_{m}\rangle=\delta_{nm}, \tag{145}
$$
where $H_{n}$ are the physicists' Hermite polynomials. (Footnote 6: Equivalently, $h_{n}(z)=\mathrm{He}_{n}(z)/\sqrt{n!}$ in terms of the probabilists' Hermite polynomials $\mathrm{He}_{n}$.) They satisfy
$$
h_{n}^{\prime}(z)=\sqrt{n}\,h_{n-1}(z),\qquad\left(-\partial_{z}^{2}+z\partial_{z}\right)h_{n}(z)=n\,h_{n}(z),\qquad\mathcal{Q}[h_{n}]=n. \tag{146}
$$
Moreover, this basis diagonalizes correlated-Gaussian expectations: if $(z_{1},z_{2})$ are jointly Gaussian with unit variances and correlation $c\in[-1,1]$, then
$$
\mathbb{E}\bigl[h_{n}(z_{1})h_{m}(z_{2})\bigr]=c^{n}\delta_{nm}. \tag{147}
$$
Thus each Hermite degree propagates independently through the Gaussian channel with attenuation $c^{n}$.
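The diagonal propagation rule of Eq. (147) can be checked by direct two-dimensional quadrature. The sketch below is illustrative: it builds $h_{n}$ from the orthonormal three-term recurrence and uses a plain trapezoid rule on a truncated grid (an assumption made for simplicity, not the paper's numerics), with $z_{2}=c\,z_{1}+\sqrt{1-c^{2}}\,x$ for independent standard normal $z_{1}$ and $x$.

```python
import math


def hermite_orthonormal(n, z):
    """Orthonormal Hermite polynomial h_n(z) = He_n(z) / sqrt(n!).

    Uses the recurrence sqrt(k+1) h_{k+1} = z h_k - sqrt(k) h_{k-1}.
    """
    h_prev, h_curr = 0.0, 1.0  # h_{-1} = 0, h_0 = 1
    for k in range(n):
        h_prev, h_curr = h_curr, (z * h_curr - math.sqrt(k) * h_prev) / math.sqrt(k + 1)
    return h_curr


def gauss_grid(half_width=10.0, step=0.2):
    """Trapezoid nodes and weights for integrals against the Gaussian measure Dz.

    Endpoint half-weights are ignored because the density is negligible there.
    """
    n = int(round(2 * half_width / step))
    nodes = [-half_width + i * step for i in range(n + 1)]
    weights = [step * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in nodes]
    return nodes, weights


def hermite_correlation(n, m, c):
    """Approximate E[h_n(z1) h_m(z2)] for unit-variance Gaussians with correlation c."""
    nodes, weights = gauss_grid()
    s = math.sqrt(1.0 - c * c)
    total = 0.0
    for z1, w1 in zip(nodes, weights):
        inner = sum(w2 * hermite_orthonormal(m, c * z1 + s * x)
                    for x, w2 in zip(nodes, weights))
        total += w1 * hermite_orthonormal(n, z1) * inner
    return total


same_degree = hermite_correlation(3, 3, 0.6)    # Eq. (147) predicts 0.6 ** 3
mixed_degree = hermite_correlation(2, 4, 0.6)   # Eq. (147) predicts 0
negative_c = hermite_correlation(2, 2, -0.5)    # Eq. (147) predicts (-0.5) ** 2
print(same_degree, mixed_degree, negative_c)
```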
Expanding an arbitrary $f\in L^{2}(Dz)$ in this basis,
$$
f(z)=\sum_{n\geq 0}a_{n}h_{n}(z),\qquad a_{n}=\langle f,h_{n}\rangle,\qquad\sum_{n\geq 0}a_{n}^{2}=\int Dz\,f^{2}, \tag{148}
$$
we obtain
$$
\mathbb{E}\bigl[f(z_{1})f(z_{2})\bigr]=\sum_{n\geq 0}a_{n}^{2}c^{n}. \tag{149}
$$
Consequently, the correlation map takes the form
$$
F(c)=\frac{\sigma_{w}^{2}\sum_{n\geq 0}a_{n}^{2}c^{n}+\sigma_{b}^{2}}{q^{*}},\qquad q^{*}=\sigma_{w}^{2}\sum_{n\geq 0}a_{n}^{2}+\sigma_{b}^{2}. \tag{150}
$$
Analyticity of $F(c)$ near $c=1$ is controlled by the decay of $|a_{n}|$: rapid decay (typical for smooth activations) yields an ordinary Taylor expansion around $c=1$, while slow decay (typical for kinked or piecewise smooth activations) allows infinitely many high-degree modes to contribute collectively and can produce non-Taylor behavior (e.g. fractional powers or logarithms) as $c\to 1$, corresponding to a branch point at $c=1$.
Using $h_{n}^{\prime}=\sqrt{n}\,h_{n-1}$, we can compute the susceptibility
$$
\chi=F^{\prime}(1)=\frac{\sigma_{w}^{2}}{q^{*}}\sum_{n\geq 1}n\,a_{n}^{2}. \tag{151}
$$
In the zero-bias case $\sigma_{b}^{2}=0$ this becomes
$$
\chi=\frac{\sum_{n\geq 1}n\,a_{n}^{2}}{\sum_{n\geq 0}a_{n}^{2}}=\mathcal{Q}[f], \tag{152}
$$
so $\chi$ is the mean Hermite degree under weights proportional to $a_{n}^{2}$, and criticality $\chi=1$ corresponds to unit mean degree.
For completeness, the variance-channel susceptibility $\chi_{q}\equiv\partial q^{(\ell+1)}/\partial q^{(\ell)}|_{q^{*}}$ can be written in the same basis as
$$
\chi_{q}=\frac{\sigma_{w}^{2}}{q^{*}}\left(\sum_{n\geq 1}n\,a_{n}^{2}+\sum_{n\geq 0}\sqrt{(n+1)(n+2)}\,a_{n}a_{n+2}\right). \tag{153}
$$
Thus $\chi$ depends only on the distribution $\{a_{n}^{2}\}$ across Hermite degrees, whereas $\chi_{q}$ is additionally sensitive to relative signs/phases within each parity subsector through the $a_{n}a_{n+2}$ couplings.
Pure modes $f\propto h_{n}$ are stationary points with $\mathcal{Q}[h_{n}]=n$. Excluding the degenerate constant mode, or imposing $\langle f\rangle=0$, the $n=1$ mode is the lowest eigenmode, but by itself it gives a linear network. Nonlinear activations necessarily mix in higher Hermite degrees. Those modes raise $\mathcal{Q}[f]$, so at fixed $q^{*}$ and $\sigma_{b}^{2}$ they shorten the correlation depth unless $\sigma_{w}^{2}$ is reduced accordingly.
### C.4 Smooth and kinked activations
For smooth activations, the Hermite coefficients typically decay rapidly and $F(c)$ is analytic near $c=1$, with $\chi$ dominated by the lowest few degrees. For kinked activations such as $\operatorname{ReLU}$, $\phi^{\prime}(u)=\theta(u)$ almost everywhere, so with $u=\sqrt{q^{*}}z$ and $\theta(u)^{2}=\theta(u)$,
$$
\chi=\sigma_{w}^{2}\int Dz\,\theta(u)=\frac{\sigma_{w}^{2}}{2},\qquad\chi=1\Rightarrow\sigma_{w}^{2}=2. \tag{154}
$$
Table 5: Orthonormal Hermite coefficients $a_{n}=\int Dz\,f(z)h_{n}(z)$ for two standard activations, shown for the unscaled choices $f(z)=\operatorname{ReLU}(z)$ and $f(z)=\tanh(z)$. For $f(z)=\tanh(sz)$, use $a_{n}(s)=\int Dz\,\tanh(sz)h_{n}(z)$, with $a_{2k}(s)=0$ by odd parity.
Figure 3: Magnitude of Hermite coefficients $|a_{n}|$ for $\operatorname{ReLU}$ versus $\tanh$. $\operatorname{ReLU}$ exhibits a slow (power-law) decay, reflecting multi-scale support across Hermite degrees, while $\tanh$ decays rapidly and concentrates most spectral mass in the lowest modes.
### C.5 Hermite decompositions for $\operatorname{ReLU}$ and $\tanh$
Throughout, $Z\sim\mathcal{N}(0,1)$ and
$$
Dz\equiv\frac{dz}{\sqrt{2\pi}}e^{-z^{2}/2}. \tag{155}
$$
We expand the fixed-point rescaled activation $f(z)$ in an orthonormal Hermite basis of $L^{2}(Dz)$,
$$
f(z)=\sum_{n\geq 0}a_{n}h_{n}(z),\qquad a_{n}=\int Dz\,f(z)h_{n}(z),\qquad\int Dz\,h_{n}(z)h_{m}(z)=\delta_{nm}. \tag{156}
$$
With the physicists' Hermite polynomials $H_{n}$,
$$
H_{n}(x)\equiv(-1)^{n}e^{x^{2}}\frac{d^{n}}{dx^{n}}e^{-x^{2}},\qquad h_{n}(z)\equiv\frac{1}{\sqrt{2^{n}n!}}H_{n}\left(\frac{z}{\sqrt{2}}\right). \tag{157}
$$
#### C.5.1 $\operatorname{ReLU}$
Let $f(z)=\operatorname{ReLU}(z)=\max(0,z)$. The coefficients have a closed form:
$$
a_{0}=\frac{1}{\sqrt{2\pi}},\qquad a_{1}=\frac{1}{2},\qquad a_{2k+1}=0\ \ \text{for }k\geq 1, \tag{158}
$$
and for $k\geq 1$,
$$
a_{2k}=\frac{(-1)^{k-1}(2k-3)!!}{\sqrt{2\pi(2k)!}}. \tag{159}
$$
We use the convention $(-1)!!=1$. If instead $f(z)=\operatorname{ReLU}(\sqrt{q^{*}}z)$, then all coefficients scale as $a_{n}\mapsto\sqrt{q^{*}}a_{n}$.
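The closed-form coefficients can be sanity-checked against two sums that are known independently: Parseval gives $\sum_{n}a_{n}^{2}=\mathbb{E}[\operatorname{ReLU}(Z)^{2}]=1/2$, and $\sum_{n}n\,a_{n}^{2}=\mathbb{E}[\theta(Z)^{2}]=1/2$, so the Rayleigh quotient of Eq. (152) equals $1$. The sketch below is illustrative (the truncation `KMAX` is an arbitrary choice) and evaluates Eq. (159) through the ratio of successive squared coefficients rather than through factorials.

```python
import math


def relu_even_coeff_sq(kmax):
    """Return [a_2^2, a_4^2, ..., a_{2 kmax}^2] from Eq. (159).

    Uses a_{2k+2}^2 / a_{2k}^2 = (2k - 1)^2 / ((2k + 1)(2k + 2)),
    which avoids evaluating large factorials.
    """
    vals = [1.0 / (4.0 * math.pi)]  # a_2^2 = ((-1)!!)^2 / (2 pi 2!) = 1 / (4 pi)
    for k in range(1, kmax):
        vals.append(vals[-1] * (2 * k - 1) ** 2 / ((2 * k + 1) * (2 * k + 2)))
    return vals


KMAX = 20_000                       # truncation of the even-degree sum
even_sq = relu_even_coeff_sq(KMAX)
a0_sq = 1.0 / (2.0 * math.pi)       # a_0^2 from Eq. (158)
a1_sq = 0.25                        # a_1^2 from Eq. (158)

# Parseval sum: converges quickly because a_{2k}^2 decays like k^{-5/2}.
norm_sq = a0_sq + a1_sq + sum(even_sq)

# Degree-weighted sum: terms decay only like k^{-3/2}, so convergence is slow.
grad_sq = a1_sq + sum(2 * k * v for k, v in enumerate(even_sq, start=1))

rayleigh = grad_sq / norm_sq        # truncated estimate of Q[ReLU], Eq. (152)
print(norm_sq, grad_sq, rayleigh)
```

The slow convergence of the degree-weighted sum is the spectral counterpart of the kink: the truncated value approaches $1/2$ from below only like $K^{-1/2}$ in the cutoff degree.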
#### C.5.2 $\tanh$
Let $f(z)=\tanh(sz)$ with $s>0$. The coefficients are
$$
a_{n}(s)=\int Dz\,\tanh(sz)h_{n}(z). \tag{160}
$$
Parity gives $a_{2k}(s)=0$. There is no simple closed form for general $s$, but the coefficients are easy to compute numerically by Gaussian quadrature using Mathematica, or other specialized packages.
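A dependency-free way to evaluate Eq. (160) is sketched below. Note the substitution: instead of Gaussian quadrature it uses a trapezoid rule on a truncated grid, which is also highly accurate for smooth integrands with Gaussian decay; the grid parameters and function names are illustrative assumptions.

```python
import math


def hermite_orthonormal_all(nmax, z):
    """Return [h_0(z), ..., h_nmax(z)] via sqrt(k+1) h_{k+1} = z h_k - sqrt(k) h_{k-1}."""
    vals = [1.0]
    if nmax >= 1:
        vals.append(z)
    for k in range(1, nmax):
        vals.append((z * vals[k] - math.sqrt(k) * vals[k - 1]) / math.sqrt(k + 1))
    return vals


def gaussian_expectation(g, half_width=12.0, step=0.01):
    """Trapezoid approximation of the integral of g(z) against Dz."""
    n_steps = int(round(2 * half_width / step))
    total = 0.0
    for i in range(n_steps + 1):
        z = -half_width + i * step
        total += g(z) * math.exp(-0.5 * z * z)
    return total * step / math.sqrt(2.0 * math.pi)


def tanh_hermite_coeffs(s, nmax, half_width=12.0, step=0.01):
    """Approximate a_n(s) = int Dz tanh(s z) h_n(z) for n = 0, ..., nmax."""
    coeffs = [0.0] * (nmax + 1)
    n_steps = int(round(2 * half_width / step))
    for i in range(n_steps + 1):
        z = -half_width + i * step
        w = step * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        fz = math.tanh(s * z)
        for n, hn in enumerate(hermite_orthonormal_all(nmax, z)):
            coeffs[n] += w * fz * hn
    return coeffs


coeffs = tanh_hermite_coeffs(s=1.0, nmax=20)

# Independent checks: a_1 = E[f'(Z)] by Gaussian integration by parts,
# and Parseval bounds the truncated sum of squares by E[tanh(Z)^2].
mean_slope = gaussian_expectation(lambda z: 1.0 / math.cosh(z) ** 2)
second_moment = gaussian_expectation(lambda z: math.tanh(z) ** 2)
partial_norm = sum(a * a for a in coeffs)
print(coeffs[1], mean_slope, partial_norm, second_moment)
```

For other values of $s$ only the argument `s` changes; the even coefficients should remain zero up to rounding, which gives a quick check on the grid symmetry.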
## Appendix D Experimental Details and Additional Figures
### D.1 Mean-Field Fits and Scaling Collapse
Table [6](https://arxiv.org/html/2605.21648#A4.T6) reports exponents extracted from iterated MFT recursions.
Table 6: Critical exponents from log–log linear fits (dashed lines) in Fig. [1](https://arxiv.org/html/2605.21648#S3.F1) compared with mean-field theory predictions.
The same scaling-collapse diagnostic can be carried out for the kinked universality class, using $\operatorname{ReLU}$ as the representative activation.
Figure 4: Kinked counterpart to Fig. [1](https://arxiv.org/html/2605.21648#S3.F1), showing the corresponding universal scaling collapse.
### D.2 Dropout Scheduling Experiments
The scheduling experiments below test Sec. [3.3](https://arxiv.org/html/2605.21648#S3.SS3): MLP schedules in Fig. [5](https://arxiv.org/html/2605.21648#A4.F5) and Tab. [7](https://arxiv.org/html/2605.21648#A4.T7), matched-budget controls in Fig. [6](https://arxiv.org/html/2605.21648#A4.F6) and Tab. [8](https://arxiv.org/html/2605.21648#A4.T8), robustness sweeps in Fig. [7](https://arxiv.org/html/2605.21648#A4.F7), ViT CIFAR-100 schedules in Fig. [9](https://arxiv.org/html/2605.21648#A4.F9) and Tab. [11](https://arxiv.org/html/2605.21648#A4.T11), and transformer ablations in Fig. [11](https://arxiv.org/html/2605.21648#A4.F11) and Tab. [13](https://arxiv.org/html/2605.21648#A4.T13).
### D.3 MLPs
Finite\-width experiments support these spatial scheduling predictions\. The main finding is that concentrating dropout in early layers, where coadaptation and rank\-diversity loss are expected to matter most, yields the best generalization\. The theoretically motivated schedules improve over constant dropout at no extra computational cost\.
To test this prediction in a controlled finite-width setting, we design an experiment in the overfitting regime, where information propagation and regularization are both important factors in determining performance. In the effective-theory treatment of (Roberts et al., [2022](https://arxiv.org/html/2605.21648#bib.bib36)), deviations from Gaussian mean-field behavior (Neal, [1996](https://arxiv.org/html/2605.21648#bib.bib28); Lee et al., [2018](https://arxiv.org/html/2605.21648#bib.bib47)) generate corrections to the correlation recursions that are suppressed by $1/N$ (with $N$ the layer width) and accumulate across depth as $L/N$. At the same time, deep information propagation suggests that successful learning is controlled by a modest multiple of the correlation length (Schoenholz et al., [2017](https://arxiv.org/html/2605.21648#bib.bib38)).
We train on 5000 CIFAR-10 images with width 256 and depth $L=6$, roughly $2\xi$ for the constant-dropout baseline, using a batch size of 75 to ensure the regularizing effect of dropout through injected stochastic noise. We initialize close to the edge-of-chaos at $\sigma_{w}^{2}=1.98$ and $\sigma_{b}^{2}=0.02$, near standard He initialization but avoiding the instabilities due to the divergence of the variance fixed point $q^{\ast}$ as $\sigma_{w}^{2}\to 2$.
We compare six depth profiles at fixed mean dropout budget: (i) no dropout, (ii) uniform dropout, (iii) a linearly increasing schedule, (iv) a linearly decreasing schedule, (v) a step schedule concentrated in the final half (step), and (vi) a step schedule concentrated in the first half (early step), as summarized in Fig. [5](https://arxiv.org/html/2605.21648#A4.F5). Front-loaded dropout consistently outperforms back-loaded dropout: the need for regularization in early layers, where coadaptation builds up most rapidly, dominates the signal propagation cost. The effective correlation length heuristic remains predictive; constant dropout is most damaging, but rank collapse breaks the permutation symmetry of Eq. ([45](https://arxiv.org/html/2605.21648#S3.E45)) in favor of early regularization.
Figure 5: Finite-width MLP training and test curves at fixed mean dropout budget $\bar{h}$ (with $0\leq h_{\ell}\leq h_{\max}$). We compare uniform dropout, linear ramps (increasing/decreasing with depth), and step schedules that concentrate dropout in either the first or last half of the network, together with the no-dropout baseline.
Table 7: Performance of different dropout schedules in the overfitting regime for MLPs and CIFAR-10, corresponding to Fig. [5](https://arxiv.org/html/2605.21648#A4.F5).
The budget-control experiment checks a simpler explanation: early-concentrated dropout may only be winning because it raises the local dropout rate, not because the allocation over depth matters. After all, the step schedule applies dropout at rate $2\bar{h}$ to the first half of the network, so perhaps the same improvement could be achieved by applying this higher rate uniformly throughout.
To disentangle allocation from total dropout strength, we compare constant schedules with fields $\bar{h},2\bar{h},3\bar{h}$ against step schedules using $2\bar{h}$ in the first half or $3\bar{h}$ in the first third of the network (Fig. [6](https://arxiv.org/html/2605.21648#A4.F6), Tab. [8](https://arxiv.org/html/2605.21648#A4.T8)).
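The depth profiles compared here can be written down explicitly. The sketch below is an illustration, not the authors' configuration: the step levels ($2\bar{h}$ on the first or last half, $3\bar{h}$ on the first third) follow the text, while the ramp endpoints ($0$ and $2\bar{h}$) and all function names are assumptions chosen so that every profile has mean field $\bar{h}$.

```python
def uniform(depth, h_bar):
    """Constant field h_bar on every layer."""
    return [h_bar] * depth


def front_step(depth, h_bar, fraction):
    """Field concentrated on the first `fraction` of layers, zero elsewhere.

    The level is chosen so that the mean over all layers equals h_bar.
    """
    n_on = round(depth * fraction)
    level = h_bar * depth / n_on
    return [level] * n_on + [0.0] * (depth - n_on)


def back_step(depth, h_bar, fraction):
    """Mirror image of front_step: field concentrated on the last layers."""
    return front_step(depth, h_bar, fraction)[::-1]


def linear_ramp(depth, h_bar, increasing=True):
    """Linear profile between 0 and 2 h_bar (assumed endpoints), with mean h_bar."""
    ramp = [2.0 * h_bar * layer / (depth - 1) for layer in range(depth)]
    return ramp if increasing else ramp[::-1]


DEPTH, H_BAR = 6, 0.1
schedules = {
    "uniform": uniform(DEPTH, H_BAR),
    "linear_increasing": linear_ramp(DEPTH, H_BAR, increasing=True),
    "linear_decreasing": linear_ramp(DEPTH, H_BAR, increasing=False),
    "step": back_step(DEPTH, H_BAR, 0.5),
    "early_step": front_step(DEPTH, H_BAR, 0.5),
    "big_step": front_step(DEPTH, H_BAR, 1.0 / 3.0),
}
for name, profile in schedules.items():
    print(name, [round(h, 3) for h in profile], sum(profile) / DEPTH)
```

The matched-budget controls then correspond to `uniform(DEPTH, 2 * H_BAR)` and `uniform(DEPTH, 3 * H_BAR)`, which match the local rate of the step schedules but carry two or three times the total budget.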
Figure 6: Matched-budget controls comparing front-loaded step schedules against constant dropout fields at $\bar{h}$, $2\bar{h}$, and $3\bar{h}$.
If the step schedules succeed merely because they apply locally higher dropout rates, then the uniform schedules with matching dropout should perform at least as well. In contrast, if spatial allocation genuinely matters, the step schedules should outperform their uniform counterparts despite having lower total dropout. The results in Fig. [6](https://arxiv.org/html/2605.21648#A4.F6) and Table [8](https://arxiv.org/html/2605.21648#A4.T8) confirm that spatial allocation dominates: Big step ($\xi_{\text{eff}}=6.7$) achieves the highest accuracy despite having only one-third the total dropout of the uniform schedule with $3\bar{h}$, and front-loaded step ($\xi_{\text{eff}}=5.1$) outperforms Double (uniform $2\bar{h}$) despite half the total dropout budget. Moreover, increasing uniform dropout beyond $2\bar{h}$ becomes counterproductive; Triple ($3\bar{h}$) performs worst among all dropout schedules, as excessive regularization throughout the network impairs learning capacity. These findings validate the effective correlation length $\xi_{\text{eff}}$ as a predictive metric: schedules with longer $\xi_{\text{eff}}$ consistently achieve better generalization, regardless of their total dropout budget.
We note that our theoretical analysis is perturbative in $h$, with higher-order corrections of $O(h^{2})$ becoming significant as dropout increases. This motivates our choice to test up to $3\bar{h}$ with $\bar{h}=0.1$; beyond this point, the perturbative framework becomes less reliable and additional nonlinear effects may dominate. We also tested these effects at lower dropout rates and obtained similar results, though they become more pronounced as dropout increases.
Table 8: Summary of dropout schedules, correlation lengths, and test performance for MLPs on CIFAR-10, corresponding to Fig. [6](https://arxiv.org/html/2605.21648#A4.F6).
Table 9: Experimental and hyperparameter configuration for high overfitting regime. Corresponding performance metrics in Fig. [5](https://arxiv.org/html/2605.21648#A4.F5).
### D.4 Robustness Sweeps
To test whether scheduling is tied to one budget, width, or activation, we ran CIFAR-10 MLP sweeps (Figs. [7](https://arxiv.org/html/2605.21648#A4.F7) and [8](https://arxiv.org/html/2605.21648#A4.F8)). The $\operatorname{ReLU}$ sweeps probe mean-field boundaries: at $\bar{h}=0.15$, Step/Big step use nonperturbative local fields $0.30/0.45$; at $N=64$, finite-width non-Gaussianity weakens the correlation-length prediction. The GELU MLP sweep gives the same best-accuracy ordering at $\bar{h}=0.1$ (constant $41.96\pm 0.16$, Step $42.24\pm 0.15$, Big step $42.46\pm 0.12$), and Big step falls to $38.97\pm 0.23$ at $\bar{h}=0.15$, matching this perturbative-boundary picture. The final-epoch gains summarized in Table [3](https://arxiv.org/html/2605.21648#S3.T3) use the last recorded epoch, giving $+2.04$ pp for the $\operatorname{ReLU}$ $\bar{h}=0.1$ sweep and $+0.62$ pp for the GELU $\bar{h}=0.1$ sweep.
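As a quick check on the GELU best-accuracy numbers, the gaps to the constant baseline and their uncertainties can be worked out directly. The sketch below assumes the quoted $\pm$ values are standard errors of independent estimates, so that the error of a difference adds in quadrature; the function name `gap_with_sem` is hypothetical. These best-accuracy gaps are distinct from the final-epoch gains, which use the last recorded epoch.

```python
import math

def gap_with_sem(mean_a, sem_a, mean_b, sem_b):
    """Difference a - b and its standard error, assuming independent estimates."""
    return mean_a - mean_b, math.hypot(sem_a, sem_b)

# Best-accuracy values quoted for the GELU sweep at mean dropout 0.1 (percent).
constant = (41.96, 0.16)
step = (42.24, 0.15)
big_step = (42.46, 0.12)

for name, (mean, sem) in {"Step": step, "Big step": big_step}.items():
    gap, err = gap_with_sem(mean, sem, *constant)
    print(f"{name} - constant: {gap:+.2f} +/- {err:.2f} pp")
```

Under the independence assumption this gives gaps of roughly $0.28\pm 0.22$ pp for Step and $0.50\pm 0.20$ pp for Big step.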
(a) Sweep over mean dropout field at width $N=256$.
(b) Schedule advantage over constant dropout at $\bar{h}=0.1$.

Figure 7: Robustness sweeps for depth-6 near-critical $\operatorname{ReLU}$ MLPs on CIFAR-10. Early schedules improve over constant dropout throughout the large-width regime, while $N=64$ illustrates the expected finite-width boundary.

Figure 8: Smooth-activation $h$-sweep for depth-6 near-critical GELU MLPs on CIFAR-10 at width $N=256$. Step-like schedules improve over constant dropout around $\bar{h}=0.1$, while the largest field leaves the small-dropout regime where the mean-field perturbation is expected to be predictive.
### D.5 Transformers on CIFAR-100
Table [10](https://arxiv.org/html/2605.21648#A4.T10) and Fig. [9](https://arxiv.org/html/2605.21648#A4.F9) detail the architectural parameters and corresponding test dynamics for the CIFAR-100 Vision Transformer evaluation (Dosovitskiy et al., [2021](https://arxiv.org/html/2605.21648#bib.bib18); Krizhevsky, [2009](https://arxiv.org/html/2605.21648#bib.bib24)); Table [11](https://arxiv.org/html/2605.21648#A4.T11) reports the schedule comparison.
Figure 9: Finite-width Vision Transformer training and test curves at fixed mean dropout budget $\bar{h}$. We compare the no-dropout baseline, uniform dropout, decreasing linear ramps, and early step schedules. The cropped view shows epochs 20–75 for readability; the full curve is also in App. [D](https://arxiv.org/html/2605.21648#A4).

Figure 10: Extended training curves for Figure [9](https://arxiv.org/html/2605.21648#A4.F9), showing the full 75 epochs.

Table 10: Experimental configuration for Vision Transformer (ViT) simulations.

Accuracy curves are shown in Fig. [9](https://arxiv.org/html/2605.21648#A4.F9), with explicit results in Table [11](https://arxiv.org/html/2605.21648#A4.T11). All reported errors are SEM. On CIFAR-100 (and similarly on CIFAR-10), concentrating dropout earlier consistently improved accuracy and reduced overfitting. Interestingly, linearly decreasing schedules, which put more dropout in early layers, often performed especially well in transformers. Residual connections, introduced in (He et al., [2016](https://arxiv.org/html/2605.21648#bib.bib22)), are shown by mean-field analyses to ease information propagation (Yang and Schoenholz, [2017](https://arxiv.org/html/2605.21648#bib.bib45)). We suspect this makes distributing regularization early preferable empirically, though both early-weighted and step-like schedules outperformed the constant baseline.
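For concreteness, a decreasing linear ramp at fixed mean budget can be written as below. This is a sketch under assumptions: it takes the ramp to fall linearly from $2\bar{h}$ at the first block to zero at the last, which preserves the mean $\bar{h}$, whereas the exact endpoints used in the experiments are not stated in this section. The name `linear_ramp` and the depth and budget in the usage lines are hypothetical.

```python
def linear_ramp(depth, mean_rate):
    """Per-layer rates decreasing linearly from 2*mean_rate to 0 (assumed endpoints)."""
    if depth < 2:
        return [mean_rate] * depth
    return [2 * mean_rate * (1 - layer / (depth - 1)) for layer in range(depth)]

# Illustrative values only; not the ViT configuration from the paper.
rates = linear_ramp(depth=6, mean_rate=0.1)
print([round(r, 3) for r in rates])
print("mean rate:", round(sum(rates) / len(rates), 3))
```

Because the rates are symmetric about their midpoint, the layer average equals `mean_rate` for any depth of at least two, so the ramp can be compared against constant and step schedules at the same budget.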
We focus on controlled regimes rather than ImageNet-scale benchmarking because our central claims are mechanistic and derived in a mean-field initialization framework, which is most cleanly tested when finite-width corrections remain subdominant. The experimental validation is designed to test the theory in settings where MFT predictions apply cleanly, rather than to establish state-of-the-art performance on large-scale datasets such as ImageNet (Deng et al., [2009](https://arxiv.org/html/2605.21648#bib.bib51)). ImageNet-scale training introduces substantial confounders from augmentation and extensive hyperparameter tuning that can obscure MFT effects; we leave that evaluation as future work. In these controlled runs, the predicted schedules outperform constant dropout, illustrating the practical utility of the mean-field framework.
Table 11: Test accuracy across different dropout schedules for the ViT architecture on CIFAR-100, corresponding to Fig. [9](https://arxiv.org/html/2605.21648#A4.F9).
### D.6 Transformer Ablations
To isolate the effect of dropout scheduling on distinct transformer components, we performed component-level ablations. We tested transformers with dropout applied exclusively to the attention block, the MLP block, or both, comparing step-like and constant schedules against a no-dropout baseline. Experimental hyperparameters are listed in Table [12](https://arxiv.org/html/2605.21648#A4.T12); they largely mirror the main transformer experiment, though we conducted five simulations per configuration. We repeated the ablation on CIFAR-10.
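The ablation grid can be organized with a small configuration helper. The sketch below is illustrative only: `component_rates` and the target labels are hypothetical names, and the schedule passed in is a placeholder rather than the values used in the experiments.

```python
def component_rates(schedule, target):
    """Map a per-layer schedule to (attention_rate, mlp_rate) pairs.

    target selects which sub-block receives dropout: "attn", "mlp", or "both".
    """
    if target not in {"attn", "mlp", "both"}:
        raise ValueError(f"unknown target: {target}")
    attn_on = target in {"attn", "both"}
    mlp_on = target in {"mlp", "both"}
    return [(p if attn_on else 0.0, p if mlp_on else 0.0) for p in schedule]

step_like = [0.2, 0.2, 0.0, 0.0]  # placeholder values, not from the paper
for target in ("attn", "mlp", "both"):
    print(target, component_rates(step_like, target))
```

Keeping the schedule and the target component as separate arguments lets the same constant or step-like schedule be reused across all three ablation arms.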
Figure 11: Component ablations applying dropout schedules to the attention block, MLP block, or both. The figure compares no dropout, constant dropout, and step-like dropout.

Table 12: Experimental configuration for Vision Transformer (ViT) ablation study.

Table 13: Component-level ablations on CIFAR-10, comparing Constant and Step-like dropout schedules applied exclusively to the MLP block, Attention block, or both. The Step (early) schedule consistently outperforms the Constant schedule across all configurations.