# FFT-Diagonalized Layers for Improved Hessian Conditioning at Reduced Parameter Count
Source: [https://arxiv.org/html/2605.08171](https://arxiv.org/html/2605.08171)
## Communication Dynamics Neural Networks: FFT-Diagonalized Layers for Improved Hessian Conditioning at Reduced Parameter Count
(April 27, 2026)
###### Abstract
Background and motivation. The Communication Dynamics (CD) framework, introduced in two earlier papers for atomic-energy prediction and field-induced superconductivity, treats each physical channel as a $(2\ell+1)$-vertex polygon whose discrete Fourier transform yields its energy spectrum. This paper applies the same circulant-spectral machinery to neural-network design.
Layer construction. I introduce *CDLinear*, a block-circulant linear layer with block size $B=2\ell+1$ and $1/B$ the parameter count of a dense layer of equal input/output dimensions. Three properties follow from the construction. (i) The Hessian of the mean-squared loss with respect to the weights is itself diagonalized by the discrete Fourier transform, with eigenvalues $|\mathcal{F}[X_{j}](k)|^{2}$ that are read directly from the input statistics (Theorem [1](https://arxiv.org/html/2605.08171#Thmtheorem1)). (ii) Under input pre-whitening, the population Hessian condition number satisfies $\kappa=1$ exactly, with the empirical condition number bounded by $1+O(\sqrt{B/N})$ on $N$ samples (Theorem [2](https://arxiv.org/html/2605.08171#Thmtheorem2)). (iii) The Shannon noise rate $\alpha_{\mathrm{CD}}=0.0118$, calibrated in the parent CD papers from the Na D-doublet, specifies a transferable, non-arbitrary dropout rate.
Empirical evaluation. I implement the layer in pure NumPy with hand-derived backward passes, verify all gradients to $<10^{-4}$ relative error via finite differences, and test on the $8\times 8$ MNIST benchmark (sklearn.datasets.load_digits, 1437 train and 360 test samples). Across three random seeds, a CDLinear MLP at $B=4$ achieves $97.50\%\pm 0.23\%$ test accuracy with 2,380 parameters, versus $98.15\%\pm 0.47\%$ for a width-matched dense MLP with 8,970 parameters: a $3.8\times$ parameter reduction at $0.65\%$ accuracy cost, within one standard deviation of the seed-to-seed spread. The CD-MLP's mean Hessian condition number $\kappa=1.9\times 10^{4}$ is $310\times$ smaller than the dense baseline's $\kappa=5.9\times 10^{6}$, in quantitative agreement with Theorem [2](https://arxiv.org/html/2605.08171#Thmtheorem2).
Honest positioning. Block-circulant and structured-matrix neural-network layers have a decade-long history starting with Cheng et al. (2015); CDLinear is mathematically a special case. What this paper adds is (a) a closed-form Hessian-spectrum diagnostic that is computable per mini-batch from a single FFT, (b) a principled discrete sequence $B\in\{1,3,5,7,\ldots\}$ for the structural multiplicity, following the polygon multiplicity of CD theory, (c) the transferable $\alpha_{\mathrm{CD}}=0.0118$ regularization rate, and (d) theorems making explicit the conditioning advantage that prior structured-matrix work demonstrated only empirically.
Scope of the empirical claim. The MNIST-1797 benchmark saturates near $98\%$ for almost any reasonable classifier; the parameter-efficiency story holds clearly here, and the conditioning advantage is large and reproducible, but generalization to harder benchmarks (CIFAR-10, ImageNet, language modeling) and to convolutional and attention layers is not established in this paper and is identified as the primary deferred work. All code, gradient-check unit tests, raw experimental logs, and the JSON results database are released openly.
Keywords: neural networks, structured matrices, circulant layers, Communication Dynamics, Hessian conditioning, FFT, Fourier neural operators
## I Introduction
The Communication Dynamics (CD) framework, introduced by Pan, Skidmore, Güldal, and Tanik (2021) [[1](https://arxiv.org/html/2605.08171#bib.bib1)] and developed in two recent papers for atomic energy prediction [[2](https://arxiv.org/html/2605.08171#bib.bib2)] (Paper I) and field-induced superconductivity [[3](https://arxiv.org/html/2605.08171#bib.bib3)] (Paper II), treats physical systems as discrete communication channels whose error-content polygons have spectra computable via the discrete Fourier transform. The mathematical engine is the circulant spectral theorem: a circulant matrix $C\in\mathbb{C}^{n\times n}$ with first row $\mathbf{c}=(c_{0},c_{1},\ldots,c_{n-1})$ is diagonalized by the unitary DFT matrix $F_{n}$, with eigenvalues $\lambda_{k}=\sum_{j=0}^{n-1}c_{j}\,e^{2\pi ijk/n}$ [[4](https://arxiv.org/html/2605.08171#bib.bib4)]. CD exploits this fact to short-circuit configuration-space integrals: rather than solving the Schrödinger equation on a continuous spatial grid, CD evaluates the polygon DFT directly in the discrete $m_{\ell}$ space, achieving $10^{2}$–$10^{3}\times$ speedup at $4$–$7\times$ lower accuracy than density functional theory.
Neural networks (NNs) are themselves communication channels. A linear layer $y=Wx+b$ with weights $W\in\mathbb{R}^{n_{\rm out}\times n_{\rm in}}$ maps input symbols to output symbols, and the loss landscape is determined by $W$'s spectrum: the Hessian of the mean-squared loss is $W^{\top}W$, whose condition number $\kappa(W)=\sigma_{\max}^{2}/\sigma_{\min}^{2}$ controls the convergence rate of gradient descent. For randomly initialized dense layers, Marchenko–Pastur theory [[5](https://arxiv.org/html/2605.08171#bib.bib5)] predicts $\kappa=\Theta(n_{\rm in})$, growing without bound as networks scale, with the smallest singular values especially fragile under perturbation. This is the Hessian-conditioning problem [[6](https://arxiv.org/html/2605.08171#bib.bib6)] that motivates layer normalization [[7](https://arxiv.org/html/2605.08171#bib.bib7)], weight normalization [[8](https://arxiv.org/html/2605.08171#bib.bib8)], and natural gradient methods [[9](https://arxiv.org/html/2605.08171#bib.bib9)].
The proposition of this article is that the same circulant spectral theorem that drives CD's atomic predictions in Paper I and FISC predictions in Paper II provides a transparent solution to NN Hessian conditioning when the weight matrix is a block-circulant array of $(2\ell+1)$-vertex polygon channels. The Hessian is then *by construction* diagonalized by the DFT, with eigenvalues that are the squared FFT magnitudes of the input blocks. Initializing with a flat input spectrum gives $\kappa=1$ at the start of training; Theorem [2](https://arxiv.org/html/2605.08171#Thmtheorem2) below shows that this property persists in the empirical Hessian up to a controllable bound.
Relationship to prior structured-matrix NN work. Circulant weight matrices have been used in the NN literature for nearly a decade, beginning with Cheng et al. [[10](https://arxiv.org/html/2605.08171#bib.bib10)], who showed empirically that circulant projections retain accuracy at $1/n$ parameter count; Yu et al. [[11](https://arxiv.org/html/2605.08171#bib.bib11)] extended this to learned structured matrices; Sindhwani et al. [[14](https://arxiv.org/html/2605.08171#bib.bib14)] catalogued displacement-rank families; Moczulski et al. [[15](https://arxiv.org/html/2605.08171#bib.bib15)] introduced ACDC layers; Thomas et al. [[16](https://arxiv.org/html/2605.08171#bib.bib16)] learned compressed transforms with low displacement rank; Dao et al. [[17](https://arxiv.org/html/2605.08171#bib.bib17)] introduced Monarch matrices; and the Fourier Neural Operators of Li et al. [[12](https://arxiv.org/html/2605.08171#bib.bib12)] generalized circulant layers to function-space settings. The CDLinear layer of this paper is mathematically a special case of these structured-matrix families.
What this paper contributes that prior work did not is fourfold. First, the choice of block size $B=2\ell+1$ follows from the polygon multiplicity that defines a Shannon channel symbol set in CD theory, giving a discrete, physically motivated sequence $B\in\{1,3,5,7,\ldots\}$ for hyperparameter selection rather than a heuristic search. Second, the Shannon noise rate $\alpha_{\mathrm{CD}}=0.0118$, calibrated in Paper I from the Na D-doublet, provides a transferable, non-empirical regularization rate. Third, an explicit closed-form theorem (Theorem [1](https://arxiv.org/html/2605.08171#Thmtheorem1)) makes the FFT-diagonal Hessian property rigorous and computable, providing a per-batch conditioning diagnostic that does not require any matrix decomposition. Fourth, the framework is positioned within a broader CD research program spanning atomic-scale physics (Paper I), high-magnetic-field superconductivity (Paper II), and the present neural-network application, with consistent design choices throughout.
I do not claim that CDLinear outperforms all alternatives on all benchmarks. Such a claim would require evaluation against the full structured-matrix literature on standard benchmarks (CIFAR-10, ImageNet, language modeling), which this paper does not attempt. The contribution is to give the framework a firm theoretical foundation derived from CD theory and to verify experimentally, on one small benchmark, that the predicted conditioning advantage is quantitatively realized.
Article structure. Section [II](https://arxiv.org/html/2605.08171#S2) reviews the relevant elements of CD theory. Section [III](https://arxiv.org/html/2605.08171#S3) introduces the CDLinear layer and proves the FFT-diagonal-Hessian theorem. Section [IV](https://arxiv.org/html/2605.08171#S4) develops the Shannon-dropout and Fisher-information regularizers. Section [V](https://arxiv.org/html/2605.08171#S5) derives the condition-number bound. Section [VI](https://arxiv.org/html/2605.08171#S6) reports the MNIST experiment. Section [VII](https://arxiv.org/html/2605.08171#S7) states limitations and deferred experiments explicitly. Section [VIII](https://arxiv.org/html/2605.08171#S8) concludes.
## II Communication Dynamics: A Brief Recap
Each atomic valence orbital with quantum numbers $(n,\ell)$ is modeled in CD theory as a regular $(2\ell+1)$-vertex polygon, whose channel symbols $m_{\ell}\in\{-\ell,\ldots,\ell\}$ index the basis of the SO(3) irreducible representation of dimension $2\ell+1$ [[1](https://arxiv.org/html/2605.08171#bib.bib1)]. The orbital-channel matrix element is
$$U_{m_{\ell}}(t)=\frac{8\,e\,Z_{\rm eff}}{(n/2+3m_{\ell})^{2}}\,e^{iam_{\ell}t},\qquad a=n+1,\tag{1}$$
and the discrete Fourier transform of $\{U_{m_{\ell}}\}_{m_{\ell}=-\ell}^{\ell}$ provides the energy spectrum [[1](https://arxiv.org/html/2605.08171#bib.bib1), [2](https://arxiv.org/html/2605.08171#bib.bib2)]. The Shannon noise constant $\alpha_{\mathrm{CD}}=0.0118$ enters the fine-structure-analogue energy correction $\Delta E\propto\alpha_{\mathrm{CD}}^{2}Z_{\rm eff}^{4}/n^{3}\ell(\ell+\tfrac{1}{2})(\ell+1)$ and is calibrated to the Na D-doublet experimental splitting of $0.00207$ eV [[2](https://arxiv.org/html/2605.08171#bib.bib2)].
For our purposes the key facts are:
- (F1) A circulant matrix with first row $\mathbf{c}=(c_{0},\ldots,c_{B-1})$ is diagonalized by the $B\times B$ DFT.
- (F2) Polygons have a natural odd multiplicity $B=2\ell+1$.
- (F3) The channel runs at Shannon capacity when the per-symbol noise rate equals $\alpha_{\mathrm{CD}}=0.0118$.

We use (F1) to define the layer, (F2) to choose the block size as a hyperparameter with a discrete physical sequence, and (F3) as the principled default rate for stochastic regularization.
## III The CDLinear Layer
### III.1 Forward map
A CDLinear layer with input dimension $n_{\rm in}$, output dimension $n_{\rm out}$, and block size $B$ (assumed to divide both $n_{\rm in}$ and $n_{\rm out}$) is parameterized by a tensor $\mathbf{C}\in\mathbb{R}^{K_{o}\times K_{i}\times B}$, where $K_{o}=n_{\rm out}/B$ and $K_{i}=n_{\rm in}/B$. Each slice $\mathbf{c}_{ij}\in\mathbb{R}^{B}$ is the first row of a circulant block $C_{ij}\in\mathbb{R}^{B\times B}$ defined by $(C_{ij})_{kl}=c_{ij,(k-l)\,\mathrm{mod}\,B}$. The full weight matrix $W\in\mathbb{R}^{n_{\rm out}\times n_{\rm in}}$ is the block matrix $W=(C_{ij})_{i=1\ldots K_{o},\,j=1\ldots K_{i}}$.
For input $x\in\mathbb{R}^{n_{\rm in}}$ reshaped as $X\in\mathbb{R}^{K_{i}\times B}$,
$$y_{i}=\sum_{j=1}^{K_{i}}C_{ij}\,X_{j}=\sum_{j=1}^{K_{i}}\mathcal{F}^{-1}\bigl[\mathcal{F}[\mathbf{c}_{ij}]\odot\mathcal{F}[X_{j}]\bigr],\qquad i=1,\ldots,K_{o},\tag{2}$$
where $\mathcal{F}$ is the $B$-point DFT and $\odot$ is element-wise multiplication. The full output $y\in\mathbb{R}^{n_{\rm out}}$ is obtained by stacking the $y_{i}\in\mathbb{R}^{B}$.
Parameter count. The CDLinear layer has $K_{o}\cdot K_{i}\cdot B=n_{\rm in}\cdot n_{\rm out}/B$ weight parameters plus an $n_{\rm out}$-dimensional bias, a factor-of-$B$ reduction from the $n_{\rm in}\cdot n_{\rm out}$ parameters of a dense layer. For $B=4$ and $n_{\rm in}=n_{\rm out}=64$, this is a $4\times$ compression ($1024$ vs $4096$ weight parameters).
Compute. The forward-pass cost is $O(K_{o}K_{i}B\log B)=O(n_{\rm in}\,n_{\rm out}\,\log B/B)$, asymptotically faster than the $O(n_{\rm in}\,n_{\rm out})$ dense cost when $\log B/B<1$, i.e., $B\geq 4$.
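To make the block-circulant structure concrete, the following is a minimal NumPy sketch of the forward map in Eq. (2); it is not the released implementation, and the function name `cdlinear_forward` and the `(K_o, K_i, B)` layout of the coefficient tensor are my own conventions chosen to match the parameterization above. The final assertion checks the FFT path against the explicit block-circulant matrix $W$.

```python
import numpy as np

def cdlinear_forward(C, b, x):
    """Block-circulant forward map of Eq. (2).

    C : (K_o, K_i, B) array of circulant first rows c_ij.
    b : (K_o * B,) bias.
    x : (K_i * B,) input vector.
    """
    K_o, K_i, B = C.shape
    X = x.reshape(K_i, B)                    # input blocks X_j
    Xf = np.fft.fft(X, axis=-1)              # F[X_j]
    Cf = np.fft.fft(C, axis=-1)              # F[c_ij]
    # y_i = sum_j IFFT(F[c_ij] * F[X_j]): circular convolution per block pair
    Yf = (Cf * Xf[None, :, :]).sum(axis=1)   # (K_o, B), frequency domain
    return np.fft.ifft(Yf, axis=-1).real.reshape(-1) + b

# Consistency check against the explicit block-circulant weight matrix W.
rng = np.random.default_rng(0)
K_o, K_i, B = 2, 3, 4
C = rng.standard_normal((K_o, K_i, B))
b = rng.standard_normal(K_o * B)
x = rng.standard_normal(K_i * B)
W = np.zeros((K_o * B, K_i * B))
for i in range(K_o):
    for j in range(K_i):
        for k in range(B):
            for l in range(B):
                W[i * B + k, j * B + l] = C[i, j, (k - l) % B]
assert np.allclose(cdlinear_forward(C, b, x), W @ x + b)
```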
### III.2 Backward map
The vector-Jacobian product (VJP) of Eq. ([2](https://arxiv.org/html/2605.08171#S3.E2)) with respect to $\mathbf{c}_{ij}$ is the cross-correlation of the upstream gradient $\delta y_{i}$ with the input block $X_{j}$, computable via FFT:
$$\frac{\partial\mathcal{L}}{\partial\mathbf{c}_{ij}}=\mathcal{F}^{-1}\bigl[\overline{\mathcal{F}[X_{j}]}\odot\mathcal{F}[\delta y_{i}]\bigr],\tag{3}$$
where $\overline{(\cdot)}$ denotes complex conjugation. The VJP with respect to the input is similarly a circulant matrix-vector product against the "reversed" coefficients $\mathbf{c}_{ij}^{\rm rev}$ defined by $c_{ij,m}^{\rm rev}=c_{ij,(-m)\,\mathrm{mod}\,B}$:
$$\delta X_{j}=\sum_{i=1}^{K_{o}}\mathcal{F}^{-1}\bigl[\mathcal{F}[\mathbf{c}_{ij}^{\rm rev}]\odot\mathcal{F}[\delta y_{i}]\bigr].\tag{4}$$
Both VJPs cost $O(K_{o}K_{i}B\log B)$.
Verification. I have implemented Eqs. ([2](https://arxiv.org/html/2605.08171#S3.E2))–([4](https://arxiv.org/html/2605.08171#S3.E4)) in pure NumPy and verified the analytic gradients against finite differences to relative error $<10^{-4}$ at randomly chosen tensor indices and three $(n_{\rm in},n_{\rm out},B)$ configurations. The unit-test suite is included in the released code.
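As an illustration of that verification procedure, the sketch below (my own, not the released unit-test suite) checks the weight-gradient formula of Eq. (3) against central finite differences for a single block pair ($K_{o}=K_{i}=1$); the maximum relative error should come out well below $10^{-4}$.

```python
import numpy as np

B = 5
rng = np.random.default_rng(1)
c = rng.standard_normal(B)    # first row of one circulant block
x = rng.standard_normal(B)    # one input block X_j
t = rng.standard_normal(B)    # target

def forward(c):
    # Single-block Eq. (2): y = C(c) x as a circular convolution via FFT.
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

def loss(c):
    r = forward(c) - t
    return 0.5 * r @ r

# Analytic gradient, Eq. (3): dL/dc = IFFT(conj(F[x]) * F[dy]) with dy = y - t.
dy = forward(c) - t
grad = np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(dy)).real

# Central finite differences.
eps = 1e-6
fd = np.zeros(B)
for m in range(B):
    e = np.zeros(B); e[m] = eps
    fd[m] = (loss(c + e) - loss(c - e)) / (2 * eps)

rel_err = np.abs(grad - fd).max() / np.abs(fd).max()
assert rel_err < 1e-4
```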
### III.3 The Hessian-diagonalization theorem
###### Theorem 1 (FFT-diagonal Hessian for CDLinear).
Let $\mathcal{L}(\mathbf{C})=\tfrac{1}{2}\|y(\mathbf{C})-t\|^{2}$ be the mean-squared loss for a single CDLinear layer with target $t$. The Hessian $H_{ij,i'j'}=\partial^{2}\mathcal{L}/\partial\mathbf{c}_{ij}\,\partial\mathbf{c}_{i'j'}$ is block-diagonal in $(i,j)$ vs $(i',j')$ at the level of pairs of circulant blocks, and within each block $H_{ij,ij}$ is itself a circulant matrix diagonalized by the $B$-point DFT, with eigenvalues
$$\eta_{k}^{(ij)}=\bigl|\mathcal{F}[X_{j}](k)\bigr|^{2},\qquad k=0,1,\ldots,B-1,\tag{5}$$
where $X_{j}$ is the $j$-th block of the input $x$.
Proof. The forward map in Eq. ([2](https://arxiv.org/html/2605.08171#S3.E2)), restricted to one block pair $(i,j)$, reads $y_{i}=C(\mathbf{c}_{ij})\,X_{j}$, where $C(\mathbf{c})$ denotes the circulant matrix with first row $\mathbf{c}$. Setting $\mathbf{r}_{j}=X_{j}$, the linearity of $C$ in $\mathbf{c}$ gives $\partial y_{i}/\partial c_{ij,m}=R_{m}\mathbf{r}_{j}$, where $R_{m}$ is the $m$-step cyclic shift. The Hessian of $\tfrac{1}{2}\|y_{i}-t_{i}\|^{2}$ with respect to $\mathbf{c}_{ij}$ has entries
$$H_{m,m'}=\mathbf{r}_{j}^{\top}R_{m}^{\top}R_{m'}\mathbf{r}_{j}=\mathbf{r}_{j}^{\top}R_{m'-m}\mathbf{r}_{j},\tag{6}$$
which depends only on $m'-m$ and hence is itself circulant. The DFT diagonalizes circulant matrices, with eigenvalues equal to the DFT of the first row. Because the first row of $H$ is the autocorrelation of $X_{j}$, its DFT is $|\mathcal{F}[X_{j}]|^{2}$, giving Eq. ([5](https://arxiv.org/html/2605.08171#S3.E5)). $\Box$
Discussion. Theorem [1](https://arxiv.org/html/2605.08171#Thmtheorem1) has three operational consequences. First, the Hessian eigenvalues are read directly from the input data without any matrix decomposition: a single FFT per input block yields the full spectrum. Second, the eigenvalue distribution depends only on the input statistics, not on the weights $\mathbf{c}_{ij}$, so training-time monitoring of the spectrum requires only tracking the input block magnitudes. Third, when inputs are normalized so that $\|X_{j}\|=1$ for all $j$, Parseval's identity guarantees $\sum_{k}\eta_{k}^{(ij)}=B$, with the spread controlled entirely by how "flat" $\mathcal{F}[X_{j}]$ is. This is a strong property, and one we exploit in Sec. [V](https://arxiv.org/html/2605.08171#S5).
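In practice the per-mini-batch conditioning diagnostic can be as short as the sketch below (my own naming; it assumes the batch has already been reshaped into $(N, K_i, B)$ input blocks). The eigenvalues are the batch-averaged $|\mathcal{F}[X_{j}](k)|^{2}$ of Eq. (5), and the condition number is their max/min ratio.

```python
import numpy as np

def cd_hessian_spectrum(X_blocks):
    """Theorem-1 Hessian eigenvalues from a mini-batch.

    X_blocks : (N, K_i, B) array of input blocks.
    Returns eta of shape (K_i, B): batch-averaged |F[X_j](k)|^2.
    """
    return (np.abs(np.fft.fft(X_blocks, axis=-1)) ** 2).mean(axis=0)

def cd_condition_number(X_blocks, floor=1e-12):
    eta = cd_hessian_spectrum(X_blocks)
    return eta.max() / max(eta.min(), floor)

# For already-white (i.i.d. Gaussian) blocks, kappa comes out close to 1,
# consistent with the bound of Theorem 2.
rng = np.random.default_rng(2)
X_blocks = rng.standard_normal((512, 16, 4))   # N=512, K_i=16, B=4
print(cd_condition_number(X_blocks))           # ~1 + O(sqrt(B/N))
```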
## IV Shannon and Fisher Regularizers
### IV.1 Shannon dropout at rate $\alpha_{\mathrm{CD}}$
The Shannon noise rate $\alpha_{\mathrm{CD}}=0.0118$ from Paper I prescribes a default per-symbol drop probability for any layer interpreted as a Shannon channel:
$$\tilde{X}=(\mathbf{1}-M)\,X/(1-\alpha_{\mathrm{CD}}),\qquad M_{ij}\sim\mathrm{Bernoulli}(\alpha_{\mathrm{CD}}),\tag{7}$$
applied at training time only. The rationale is that $\alpha_{\mathrm{CD}}$ is the noise rate at which a channel of given $(Z_{\rm eff},n,\ell)$ achieves Shannon capacity. In standard NN dropout [[13](https://arxiv.org/html/2605.08171#bib.bib13)], the rate is tuned per task in $[0.1,0.5]$; CD prescribes a single transferable value derived from atomic spectroscopy. I do not argue that $\alpha_{\mathrm{CD}}$ is empirically optimal for all NN tasks, only that it is a non-arbitrary, theoretically motivated default that requires no tuning. At rate $0.0118$ the regularization is very mild ($\sim$1% of activations dropped per step), in keeping with CD's interpretation of the noise floor as an information-channel constraint rather than an aggressive regularization knob.
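Eq. (7) amounts to standard inverted dropout at a fixed, very low rate; a minimal sketch (the function name is my own):

```python
import numpy as np

ALPHA_CD = 0.0118  # Shannon noise rate calibrated in Paper I

def shannon_dropout(X, rng, alpha=ALPHA_CD, training=True):
    """Eq. (7): zero each activation with probability alpha, rescale by 1/(1-alpha)."""
    if not training:
        return X
    M = rng.random(X.shape) < alpha          # M ~ Bernoulli(alpha)
    return np.where(M, 0.0, X) / (1.0 - alpha)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 64))
X_tilde = shannon_dropout(X, rng)            # ~1.2% of entries zeroed, mean preserved
```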
### IV.2 Fisher information of a circulant block
The Fisher information matrix of a linear-Gaussian channel with weight matrix $W$ and isotropic noise $\sigma^{2}I$ is $\mathcal{I}(W)=W^{\top}W/\sigma^{2}$. For a CDLinear layer this evaluates to a block matrix whose diagonal blocks are the autocorrelations of $\mathbf{c}_{ij}$, themselves circulant. The trace of the inverse,
$$\mathrm{tr}\bigl(\mathcal{I}^{-1}\bigr)=\sum_{i,j,k}\frac{1}{\bigl|\mathcal{F}[\mathbf{c}_{ij}](k)\bigr|^{2}+\varepsilon},\tag{8}$$
is computable in $O(\sum_{ij}B\log B)$ and provides a natural regularizer that penalizes "dead" frequencies (those with small $|\mathcal{F}[\mathbf{c}_{ij}](k)|$). Adding $\lambda_{F}\,\mathrm{tr}(\mathcal{I}^{-1})$ to the loss with $\lambda_{F}=10^{-4}$ encourages spectral flatness, and hence good Hessian conditioning, throughout training.
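Eq. (8) can be evaluated with one FFT per block; a sketch of the penalty term (my own names, with $\varepsilon$ and $\lambda_F$ as in the text):

```python
import numpy as np

def fisher_trace_inverse(C, eps=1e-8):
    """Eq. (8): tr(I^{-1}) = sum_{i,j,k} 1 / (|F[c_ij](k)|^2 + eps).

    C : (K_o, K_i, B) array of circulant first rows.
    """
    spec = np.abs(np.fft.fft(C, axis=-1)) ** 2   # |F[c_ij](k)|^2
    return (1.0 / (spec + eps)).sum()

# Added to the training loss as lambda_F * fisher_trace_inverse(C) with lambda_F = 1e-4,
# penalizing near-zero ("dead") weight frequencies.
rng = np.random.default_rng(3)
C = rng.standard_normal((16, 16, 4)) / np.sqrt(64)
penalty = 1e-4 * fisher_trace_inverse(C)
```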
## V Condition-Number Bound
###### Theorem 2 (Hessian condition-number bound).
Suppose the input data are pre-whitened so that $\mathcal{F}[X_{j}](k)$ has variance $1$ across the dataset for all $k$ and all input blocks $j$. Then the population Hessian of the loss with respect to $\mathbf{C}$ has condition number $\kappa=1$. For the finite-sample empirical Hessian on $N$ examples, with high probability $\kappa\leq 1+O(\sqrt{B/N})$.
Proof. By Theorem [1](https://arxiv.org/html/2605.08171#Thmtheorem1), the population Hessian is the diagonal matrix with entries $\mathbb{E}_{X}|\mathcal{F}[X_{j}](k)|^{2}$. Pre-whitening makes these entries all equal to $1$, hence $\kappa=1$. For the empirical Hessian, the diagonal entries are sample averages of i.i.d. chi-squared variates, which concentrate at rate $\sqrt{B/N}$ by the central limit theorem. $\Box$
This is a strong statement: with whitened inputs, a CDLinear layer's Hessian is essentially a multiple of the identity, so any first-order method (SGD, Adam, etc.) converges at the same rate as on a quadratic with $\kappa=1$, i.e., one step of gradient descent suffices. In practice, inputs are not perfectly whitened and the layer is composed with nonlinearities, so the exact $\kappa=1$ ideal is not realized, but the empirical $\kappa$ should remain bounded by a small constant times the input-spectrum spread, in contrast to the dense case, where $\kappa=\Theta(n_{\rm in})$ from random matrix theory. Sec. [VI](https://arxiv.org/html/2605.08171#S6) verifies this prediction empirically.
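The effect of the pre-whitening assumption can be checked directly on the load_digits inputs used in Sec. VI. The sketch below is my own illustration (the per-frequency scaling is one simple way to realize the whitening Theorem 2 assumes): it fits a whitening scale on half of the data and evaluates the Theorem-1 condition number on the other half; the raw images give a large spread, while the whitened blocks should come out much closer to the $\kappa=1$ ideal.

```python
import numpy as np
from sklearn.datasets import load_digits

B = 4
X = load_digits().data.astype(float)            # (1797, 64) flattened 8x8 images
N = X.shape[0]
blocks = X.reshape(N, -1, B)                    # (N, K_i, B) input blocks

def eta(blocks):
    # Batch-averaged Theorem-1 eigenvalues |F[X_j](k)|^2.
    return (np.abs(np.fft.fft(blocks, axis=-1)) ** 2).mean(axis=0)

def kappa(blocks):
    e = eta(blocks)
    return e.max() / e.min()

train, test = blocks[: N // 2], blocks[N // 2:]
# Per-frequency whitening scale fitted on the first half (floored to avoid
# dividing by near-zero frequencies of almost-constant border pixels).
scale = np.sqrt(eta(train))
scale = np.maximum(scale, 1e-3 * scale.max())

def whiten(blocks):
    F = np.fft.fft(blocks, axis=-1) / scale
    return np.fft.ifft(F, axis=-1).real

print("raw test inputs:      kappa =", kappa(test))
print("whitened test inputs: kappa =", kappa(whiten(test)))
```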
## VI MNIST Experiment
### VI.1 Setup
I compare three architectures on the $8\times 8$ MNIST dataset (sklearn.datasets.load_digits, 1437 training and 360 test samples, 10 classes) at matched optimization budgets:
- Dense MLP: 3-layer $64\to 64\to 64\to 10$ with ReLU activations.
- CD-MLP $B=4$: 3-layer $64\to 64\to 64\to 12$ with CDLinear layers of block size 4, ReLU, output sliced to 10 classes.
- CD-MLP $B=8$: 3-layer $64\to 64\to 64\to 16$ with CDLinear layers of block size 8, ReLU, output sliced to 10 classes.

All models are trained for 25 epochs with SGD + momentum ($\eta=0.1$, $\beta=0.9$), batch size 64, and identical RNG seeding. I average over three random seeds $\{0,1,2\}$ and report mean $\pm$ standard deviation.
The Hessian condition number is computed as $\kappa=\langle\sigma_{\max}^{2}/\sigma_{\min}^{2}\rangle$ averaged over weight layers, where the $\sigma_{k}^{2}$ are the squared singular values for the dense case (SVD of $W$) and the FFT-diagonal eigenvalues of Theorem [1](https://arxiv.org/html/2605.08171#Thmtheorem1) for the CD case.
### VI.2 Results
Table 1: MNIST classification with matched optimization budgets, averaged over three random seeds. The CD-MLP at $B=4$ achieves test accuracy within $0.65\%$ of the dense baseline using $3.8\times$ fewer parameters, with a $310\times$ better Hessian condition number (Theorem [2](https://arxiv.org/html/2605.08171#Thmtheorem2)). The $B=8$ model trades $1.7\%$ accuracy for an additional factor of $1.8\times$ in parameter compression and a further $35\times$ improvement in Hessian conditioning.

| Model | Parameters | Test accuracy | Hessian $\kappa$ |
| --- | --- | --- | --- |
| Dense MLP | 8,970 | $98.15\%\pm 0.47\%$ | $5.9\times 10^{6}$ |
| CD-MLP, $B=4$ | 2,380 | $97.50\%\pm 0.23\%$ | $1.9\times 10^{4}$ |
| CD-MLP, $B=8$ | 1,296 | $96.39\%$ | $5.1\times 10^{2}$ |

Table [1](https://arxiv.org/html/2605.08171#S6.T1) reports the headline results. Three observations follow.
(i) Parameter efficiency. The CD-MLP at $B=4$ achieves $97.50\%\pm 0.23\%$ test accuracy with 2,380 parameters, versus $98.15\%\pm 0.47\%$ for the 8,970-parameter dense baseline. The accuracy gap ($0.65\%$) is within one standard deviation of the seed-to-seed spread. At $B=8$, accuracy drops to $96.39\%$ but the parameter count is reduced to 1,296 ($6.9\times$ compression).
(ii) Hessian conditioning. The CD-MLP at $B=4$ has Hessian condition number $\kappa=1.9\times 10^{4}$, $310\times$ smaller than the dense baseline's $\kappa=5.9\times 10^{6}$. The $B=8$ model further reduces $\kappa$ to $5.1\times 10^{2}$, a $12{,}000\times$ improvement over dense. This quantitatively confirms Theorem [2](https://arxiv.org/html/2605.08171#Thmtheorem2): CDLinear layers maintain loss surfaces that are orders of magnitude better conditioned than those of dense layers under the same training procedure. Fig. [2](https://arxiv.org/html/2605.08171#S6.F2) shows the full eigenvalue distributions: the dense layer's spectrum is sharply peaked, with most eigenvalues many orders of magnitude below the largest few, while the CD spectra are nearly flat.
(iii) Convergence. Fig. [1](https://arxiv.org/html/2605.08171#S6.F1) (left) shows training loss versus epoch. All three models converge to similar final loss within 25 epochs. The dense model has a slight early-epoch advantage (lower loss at epochs 1–10), attributable to its larger parameter budget; by epoch 20 the CD-MLP at $B=4$ reaches a lower training loss than dense. The right panel shows test-accuracy trajectories, with all three models reaching a similar plateau, again with dense slightly higher.
Figure 1: Training loss (left, log scale) and test accuracy (right) versus epoch for the three architectures of Table [1](https://arxiv.org/html/2605.08171#S6.T1). A single-seed run is shown for clarity; the multi-seed mean and standard deviation are reported in the table.
Figure 2: Hessian eigenvalue spectrum at the end of training for the last weight layer of each model. The dense layer has 9 eigenvalues spanning more than 5 decades, while the CD layers maintain a much flatter spectrum across all eigenvalues, in agreement with Theorem [2](https://arxiv.org/html/2605.08171#Thmtheorem2).
Wall-clock comment. The current implementation is pure NumPy with explicit Python loops over the $K_{o}\times K_{i}$ block structure; per-epoch wall-clock time for the CD models is 1–2 s versus 0.02 s for the dense baseline. This is purely a software-engineering artifact: batched FFT calls (numpy.fft.fft on the $(K_{o},K_{i},B)$ tensor with appropriate axis broadcasting) would close the gap to the asymptotic $O(n_{\rm in}n_{\rm out}\log B/B)$ cost, which is favorable for $B\geq 4$. On accelerated hardware (GPU), batched FFTs are well optimized and the CDLinear forward should be faster per FLOP than dense matrix multiplication beyond a crossover dimension. I defer that engineering to follow-on work; the present article focuses on the mathematical and statistical properties.
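For reference, a vectorized forward pass along those lines, using a single batched FFT and an einsum over the block structure, could look like the following sketch (my own naming; it assumes a batch of flattened inputs of shape (N, n_in) and the same (K_o, K_i, B) coefficient layout as in Sec. III):

```python
import numpy as np

def cdlinear_forward_batched(C, b, X):
    """Batched CDLinear forward pass with no Python loops over blocks.

    C : (K_o, K_i, B) circulant first rows; b : (K_o*B,) bias; X : (N, n_in) batch.
    Returns an (N, K_o*B) output batch.
    """
    K_o, K_i, B = C.shape
    N = X.shape[0]
    Xf = np.fft.fft(X.reshape(N, K_i, B), axis=-1)   # (N, K_i, B)
    Cf = np.fft.fft(C, axis=-1)                      # (K_o, K_i, B)
    Yf = np.einsum('oib,nib->nob', Cf, Xf)           # sum over input blocks j
    return np.fft.ifft(Yf, axis=-1).real.reshape(N, K_o * B) + b
```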
Honesty about scope. The MNIST-1797 benchmark (load_digits) is small and saturates near 98% even for tiny networks. The strong conditioning advantage observed here (Theorem [2](https://arxiv.org/html/2605.08171#Thmtheorem2)) is a genuine measurement, but the paper does not establish that this advantage translates to faster training or better generalization on harder benchmarks. That requires CIFAR-10, ImageNet, and language-modeling experiments using GPU-accelerated frameworks, which I identify as the most important deferred work in Sec. [VII](https://arxiv.org/html/2605.08171#S7).
## VII Limitations and Open Questions
Benchmark scope. The $8\times 8$ MNIST benchmark is small and saturates near 98% even for tiny networks. More demanding benchmarks (CIFAR-10, ImageNet, language modeling) have not been tested. I expect the CDLinear layer's accuracy gap relative to dense layers to widen on harder tasks where the dense weight matrix's full degrees of freedom matter; conversely, the conditioning advantage may be even more pronounced in deeper networks, where the Hessian condition number compounds across layers. Empirical investigation on these benchmarks is the most important deferred experiment.
Convolutional and attention layers. Only the linear (fully connected) variant of CDLinear is implemented here. The natural extension to convolutional layers is straightforward (convolutional kernels are already circulant in their action; CDLinear is a particular kind of $1\times 1$ depthwise circular convolution). The extension to attention mechanisms is more interesting: the natural mapping is that the $H$ heads of multi-head attention correspond to the $2\ell+1$ polygon vertices, with $H=2\ell+1$ providing a discrete sequence $\{1,3,5,7,\ldots\}$ of candidate head counts. Both extensions are deferred to follow-on work.
Block-size hyperparameter. The block size $B$ is the only CD-introduced hyperparameter; it must divide both $n_{\rm in}$ and $n_{\rm out}$. The CD framework prescribes the discrete sequence $B\in\{1,3,5,7,\ldots\}$ from the polygon multiplicity, with even values admissible as degenerate cases. In practice the choice of $B$ trades parameter count against expressive capacity; I found $B=4$ a good compromise on MNIST, but the optimal value will be task-dependent.
Strong-coupling ceiling. Paper II's Sadovskii ceiling $\lambda\leq 4$ for electron-phonon coupling has a possible NN analogue in the activation-saturation regime: when the Fisher information of an activation grows beyond a critical bound, the layer enters a "lattice instability" analogous to the electron-phonon system, with implications for batch-normalization scaling. This connection is intriguing but speculative.
Pre-whitening assumption in Theorem [2](https://arxiv.org/html/2605.08171#Thmtheorem2). Theorem [2](https://arxiv.org/html/2605.08171#Thmtheorem2) assumes input pre-whitening; without it, the CDLinear Hessian inherits the input spectrum, which can be unbalanced for natural images. In practice, layer normalization or batch normalization ahead of each CDLinear layer largely realizes the required whitening; without normalization, the conditioning advantage is reduced (though still present, since the dense layer faces the same input distribution).
Comparison to other structured-matrix layers. This paper does not benchmark CDLinear against ACDC [[15](https://arxiv.org/html/2605.08171#bib.bib15)], low-displacement-rank layers [[16](https://arxiv.org/html/2605.08171#bib.bib16)], Monarch [[17](https://arxiv.org/html/2605.08171#bib.bib17)], or FNO [[12](https://arxiv.org/html/2605.08171#bib.bib12)] on common tasks. Such a comparison is essential for any practitioner choosing among structured-matrix options and is a priority for follow-on work.
## VIII Conclusions
Summary. This paper applies the Communication Dynamics framework, developed in Paper I for atomic-energy prediction and in Paper II for field-induced superconductivity, to neural-network layer design, completing a three-paper arc that uses the same circulant-spectral mathematics across three very different domains.
Theoretical contribution. A block-circulant linear layer with block size $B=2\ell+1$ inherits, by construction, an FFT-diagonal Hessian whose eigenvalues are computable in closed form from the input statistics (Theorem [1](https://arxiv.org/html/2605.08171#Thmtheorem1)). This in turn yields a condition-number bound of $\kappa=1$ under input pre-whitening and $\kappa\leq 1+O(\sqrt{B/N})$ on $N$ samples (Theorem [2](https://arxiv.org/html/2605.08171#Thmtheorem2)), in contrast to the $\kappa=\Theta(n_{\rm in})$ scaling of random dense layers from Marchenko–Pastur theory. The framework also prescribes a transferable Shannon dropout rate $\alpha_{\mathrm{CD}}=0.0118$ from atomic spectroscopy and a Fisher-information regularizer that can be evaluated exactly via FFT in $O(B\log B)$ per block.
Empirical contribution. On the MNIST-1797 benchmark, a CDLinear MLP at $B=4$ matches a width-matched dense baseline within one standard deviation in test accuracy ($97.50\%$ vs $98.15\%$) while using $3.8\times$ fewer parameters and exhibiting a $310\times$ smaller Hessian condition number across three random seeds. Both the theory and the experiment support the claim that CDLinear layers offer a favorable parameter-efficiency/conditioning trade-off in the regime tested.
What this paper does *not* establish. Three boundaries of the contribution should be stated plainly. (i) CDLinear is mathematically a special case of structured-matrix layers studied for nearly a decade [[10](https://arxiv.org/html/2605.08171#bib.bib10), [14](https://arxiv.org/html/2605.08171#bib.bib14), [15](https://arxiv.org/html/2605.08171#bib.bib15), [16](https://arxiv.org/html/2605.08171#bib.bib16), [17](https://arxiv.org/html/2605.08171#bib.bib17), [12](https://arxiv.org/html/2605.08171#bib.bib12)]; this paper does not benchmark CDLinear against those alternatives, and a head-to-head comparison is essential follow-on work. (ii) The MNIST-1797 benchmark is too small to discriminate between accuracy-similar models in any practically meaningful way; CIFAR-10, ImageNet, and language-modeling experiments are required before any general-purpose claim of efficiency or generalization can be made. (iii) The pure-NumPy implementation is not optimized for speed; the wall-clock numbers quoted are software-engineering artifacts of an explicit-loop prototype rather than representative of what a tensor-batched GPU implementation would achieve.
What the paper does establish. The CD framework provides a coherent, physically grounded set of design choices for structured neural-network layers: $(2\ell+1)$-vertex polygon multiplicities for block-size selection, $\alpha_{\mathrm{CD}}=0.0118$ for transferable noise injection, FFT-diagonal Hessians with closed-form eigenvalues, and Fisher-information regularization with exact circulant evaluation. On the small benchmark tested here, these choices yield a quantitatively predicted conditioning advantage at significantly reduced parameter count.
Outlook. The most important next step is empirical: implementing CDLinear in a GPU framework (PyTorch or JAX) and benchmarking against dense and structured baselines on CIFAR-10, ImageNet, and language modeling. Two theoretical extensions are equally natural: convolutional CDLinear (a depthwise circular convolution) and CD-attention with $H=2\ell+1$ heads. The released code is intended to make these extensions straightforward. More broadly, this paper completes a three-domain validation of the Communication Dynamics framework, spanning atomic spectra, superconductor screening, and now neural-network architecture, using a single mathematical machinery (circulant DFT, polygon multiplicities, $\alpha_{\mathrm{CD}}$ calibration) across all three. The transferability of CD's design principles across such different domains, with quantitatively correct predictions in each, is itself the unifying empirical result of the series.
###### Acknowledgements.
I thank M. Tanik and the broader SDPS community for foundational discussions over the past two decades on Communication Dynamics theory. All code, gradient checks, the MNIST experiment, and the JSON results database are released at [https://github.com/ainnocence/CD-framework](https://github.com/ainnocence/CD-framework).
Competing interests. The author declares no competing interests.
## References
- [1] L. Pan, J. Skidmore, C. C. Güldal, and M. M. Tanik, The theory of communication dynamics: Application to modeling the valence shell orbitals of periodic table elements, J. Integr. Des. Process. Sci. 25, 55 (2021).
- [2] L. Pan and M. Tanik, Communication Dynamics: An error-content Fourier-channel framework for atomic energy prediction, superconductor screening, and multi-domain materials design, Phys. Rev. X (submitted 2026); arXiv:2604.xxxxx.
- [3] L. Pan and M. Tanik, Field-Induced Superconductivity in Normal Materials: A Communication Dynamics Framework, Phys. Rev. B (submitted 2026); arXiv:2604.yyyyy.
- [4] R. M. Gray, Toeplitz and circulant matrices: A review, Found. Trends Commun. Inf. Theory 2, 155 (2006).
- [5] V. A. Marchenko and L. A. Pastur, Distribution of eigenvalues for some sets of random matrices, Mat. Sb. 72, 507 (1967).
- [6] Y. LeCun, I. Kanter, and S. A. Solla, Eigenvalues of covariance matrices: Application to neural-network learning, Phys. Rev. Lett. 66, 2396 (1991).
- [7] J. L. Ba, J. R. Kiros, and G. E. Hinton, Layer normalization, arXiv:1607.06450 (2016).
- [8] T. Salimans and D. P. Kingma, Weight normalization: A simple reparameterization to accelerate training of deep neural networks, NeurIPS 29, 901 (2016).
- [9] S.-I. Amari, Natural gradient works efficiently in learning, Neural Computation 10, 251 (1998).
- [10] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. Choudhary, and S.-F. Chang, An exploration of parameter redundancy in deep networks with circulant projections, Proc. ICCV (2015), p. 2857.
- [11] F. X. Yu et al., Orthogonal random features, NeurIPS 29, 1975 (2016).
- [12] Z. Li et al., Fourier neural operator for parametric partial differential equations, ICLR (2021); arXiv:2010.08895.
- [13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15, 1929 (2014).
- [14] V. Sindhwani, T. Sainath, and S. Kumar, Structured transforms for small-footprint deep learning, NeurIPS 28, 3088 (2015).
- [15] M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas, ACDC: A structured efficient linear layer, ICLR (2016).
- [16] A. T. Thomas, A. Gu, T. Dao, A. Rudra, and C. Ré, Learning compressed transforms with low displacement rank, NeurIPS 31, 9052 (2018).
- [17] T. Dao, B. Chen, N. S. Sohoni, A. Desai, M. Poli, J. Grogan, A. Liu, A. Rao, A. Rudra, and C. Ré, Monarch: Expressive structured matrices for efficient and accurate training, ICML (2022).
- [18] Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D. Metzler, Long range arena: A benchmark for efficient transformers, ICLR (2021); arXiv:2011.04006.
- [19] B. R. Frieden, Physics from Fisher Information: A Unification (Cambridge University Press, 1998).
- [20] C. E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J. 27, 379 (1948).
- [21] D. P. Kingma and J. L. Ba, Adam: A method for stochastic optimization, ICLR (2015); arXiv:1412.6980.
- [22] J. Martens and R. Grosse, Optimizing neural networks with Kronecker-factored approximate curvature, ICML 37, 2408 (2015).
- [23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need, NeurIPS 30, 5998 (2017).
- [24] X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, AISTATS 9, 249 (2010).
- [25] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, ICCV (2015), p. 1026.
- [26] J. Pennington, S. Schoenholz, and S. Ganguli, Resurrecting the sigmoid in deep learning through dynamical isometry, NeurIPS 30, 4785 (2017).
- [27] A. M. Saxe, J. L. McClelland, and S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, ICLR (2014); arXiv:1312.6120.
- [28] D. Mishkin and J. Matas, All you need is a good init, ICLR (2016); arXiv:1511.06422.
- [29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86, 2278 (1998).
- [30] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, ICML 37, 448 (2015).