Score Broadcast and Decorrelation: A General Framework for Broadcast-Based Credit Assignment

arXiv cs.LG Papers

Summary

Introduces Score Broadcast and Decorrelation (SBD), a principled framework for broadcast-based credit assignment that generalizes to differentiable loss families including cross-entropy, Bregman divergences, and proper scoring rules. The work provides theoretical grounding for the three-factor learning rule and demonstrates improved performance over existing broadcast approaches on CIFAR-10 and Tiny ImageNet.

arXiv:2605.30638v1 Announce Type: new Abstract: We introduce Score Broadcast and Decorrelation (SBD), a principled framework for broadcast-based credit assignment for general families of differentiable losses. Error broadcast is a biologically plausible alternative to backpropagation that sends output information to hidden layers without weight transport. The Error Broadcast and Decorrelation (EBD) framework, recently introduced for the mean-squared-error (MSE) setting, grounded this mechanism in the stochastic orthogonality of optimal estimators, under which the optimal residual is orthogonal to functions of the input. We generalize that foundation by introducing an orthogonality principle between the output score (the gradient of loss with respect to the final-layer output) and hidden-layer activations, which holds whenever the optimal score has conditional mean zero. This single principle unifies broadcast-based credit assignment across the standard differentiable-loss families, including cross-entropy, Bregman divergences, proper scoring rules, and exponential-family negative log-likelihoods. The framework supplies a theoretical grounding for the three-factor learning rule under general losses, with the neuromodulatory factor derived as the broadcast loss score. We derive the cross-entropy case explicitly, characterize the admissible loss class, and introduce a score vector expansion technique that enriches the broadcast signal while preserving the orthogonality framework. Experiments on CIFAR-10 and Tiny ImageNet show that SBD substantially improves over existing broadcast approaches, with score vector expansion delivering further gains. Overall, this work identifies the loss score as the signal to broadcast, supplies the orthogonality theory and theoretical grounding for the three-factor learning rule from neuroscience, and shows how score vector expansion enriches the decorrelation directions of the resulting objective.

Cached at: 06/01/26, 09:29 AM

# Score Broadcast and Decorrelation: A General Framework for Broadcast-Based Credit Assignment
Source: [https://arxiv.org/html/2605.30638](https://arxiv.org/html/2605.30638)
Mustafa Uzun (1,2), Mete Erdogan (3), Cengiz Pehlevan (4,5,6), Alper T. Erdogan (1,2)

(1) KUIS AI Center, Koc University, Turkey; (2) Electrical and Electronics Engineering, Koc University, Turkey; (3) Department of Electrical Engineering, Stanford University, USA; (4) John A. Paulson School of Engineering & Applied Sciences, Harvard University, USA; (5) Kempner Institute, Harvard University, USA; (6) Center for Brain Science, Harvard University, USA

{muzun22, alperdogan}@ku.edu.tr, merdogan@stanford.edu, cpehlevan@seas.harvard.edu

###### Abstract

We introduce Score Broadcast and Decorrelation (SBD), a principled framework for broadcast-based credit assignment for general families of differentiable losses. Error broadcast is a biologically plausible alternative to backpropagation that sends output information to hidden layers without weight transport. The Error Broadcast and Decorrelation (EBD) framework, recently introduced for the mean-squared-error (MSE) setting, grounded this mechanism in the stochastic orthogonality of optimal estimators, under which the optimal residual is orthogonal to functions of the input. We generalize that foundation by introducing an orthogonality principle between the output score (the gradient of loss with respect to the final-layer output) and hidden-layer activations, which holds whenever the optimal score has conditional mean zero. This single principle unifies broadcast-based credit assignment across the standard differentiable-loss families, including cross-entropy, Bregman divergences (with MSE as a special case), proper scoring rules through unconstrained links, and exponential-family negative log-likelihoods. The framework supplies a theoretical grounding for the three-factor learning rule under general losses, with the neuromodulatory factor derived as the broadcast loss score rather than postulated. We derive the cross-entropy case explicitly, characterize the admissible loss class, and introduce a score vector expansion technique that enriches the broadcast signal while preserving the orthogonality framework. Experiments on CIFAR-10 and Tiny ImageNet show that SBD substantially improves over existing broadcast approaches, with score vector expansion delivering further gains. Overall, this work identifies the loss score as the signal to broadcast, supplies the orthogonality theory and theoretical grounding for the three-factor learning rule from neuroscience, and shows how score vector expansion enriches the decorrelation directions of the resulting objective.

## 1 Introduction

Neural networks serve as the fundamental computational models of natural intelligence and form the core engine of modern AI. Across both domains, a key challenge is the credit assignment problem: how to update local synaptic weights to optimize a global performance metric. The dominant solution in machine learning is backpropagation (BP), which derives exact gradients by transmitting output errors backward through the network (Rumelhart et al., [1986](https://arxiv.org/html/2605.30638#bib.bib1)). Although highly effective, BP requires symmetric backward pathways and exact weight transport, architectural constraints long viewed as biologically implausible (Crick, [1989](https://arxiv.org/html/2605.30638#bib.bib2); Lillicrap et al., [2020](https://arxiv.org/html/2605.30638#bib.bib3)) and unsuitable for efficient hardware implementations.

These constraints have driven the search for alternative credit assignment frameworks that relax the strict routing requirements of BP (Whittington and Bogacz, [2019](https://arxiv.org/html/2605.30638#bib.bib5)). One notable family of such approaches relies on error broadcast: distributing global output information directly to hidden layers rather than propagating it sequentially. Although this approach is attractive and biologically plausible, it raises a fundamental question: *for a given loss, what specific quantity should be broadcast, and what theoretical principle justifies that such a decentralized mechanism can drive learning?*

The Error Broadcast and Decorrelation (EBD) framework (Erdogan et al., [2025](https://arxiv.org/html/2605.30638#bib.bib7)) provided a principled answer in the mean-squared-error (MSE) setting. Its starting point is the stochastic orthogonality property of MMSE estimation: at optimality, the residual error is orthogonal to suitable functions of the input. EBD turns this into layerwise decorrelation objectives between hidden activations and the broadcast output error, yielding local three-factor learning rules, a long-standing class of biologically motivated synaptic update rules, in which a presynaptic activity term and a postsynaptic sensitivity term are gated by a third, neuromodulatory factor (Frémaux and Gerstner, [2016](https://arxiv.org/html/2605.30638#bib.bib8); Gerstner et al., [2018](https://arxiv.org/html/2605.30638#bib.bib9); Kuśmierz et al., [2017](https://arxiv.org/html/2605.30638#bib.bib10); Schultz, [1998](https://arxiv.org/html/2605.30638#bib.bib11)). However, this foundation is specific to squared error. In classification, the standard objective is cross-entropy, and more generally one wishes to optimize differentiable losses for which the Euclidean residual is not the natural error signal. A general theory of broadcast learning therefore requires a loss-dependent notion of what should be broadcast.

In this article, we identify that quantity as the output score, defined as the gradient of the loss with respect to the final layer output. For cross entropy, this score is the probability residual $\boldsymbol{\delta} = \mathbf{p} - \mathbf{y}$. We show that, at the population cross entropy optimum, this score is conditionally mean zero and therefore orthogonal to any deterministic function of the input, including hidden layer activations. More generally, the same principle applies to differentiable losses whose conditional risks are characterized by a zero score condition. Thus, the MSE residual used by EBD and the cross entropy residual used in classification are instances of the more general score based orthogonality principle.

[Figure 1(a) diagram: a synapse between the pre-synaptic unit $h_j^{(k-1)}$ and the post-synaptic unit $h_i^{(k)}$. The output score $\boldsymbol{\delta}$ is stacked with $\phi_2(\mathbf{x}) \odot \boldsymbol{\delta}, \ldots, \phi_M(\mathbf{x}) \odot \boldsymbol{\delta}$ into the expanded score $\tilde{\boldsymbol{\delta}} \in \mathbb{R}^{M D_{\mathrm{out}}}$, which is projected to the modulation $q_i^{(k)} = \hat{\tilde{\mathbf{R}}}^{(k)}_i \, \tilde{\boldsymbol{\delta}}$. The resulting update is shown below.]

$$\Delta W_{ij}^{(k)} \;\propto\; h_j^{(k-1)} \cdot q_i^{(k)} \cdot g_i^{\prime(k)}(h_i^{(k)})\, f^{\prime(k)}(u_i^{(k)})$$

(a) Three-factor SBD update with score-vector expansion. At layer $k$, the synaptic update uses presynaptic activity $h_j^{(k-1)}$, postsynaptic sensitivity $g_i^{\prime(k)}(h_i^{(k)})\, f^{\prime(k)}(u_i^{(k)})$, and a broadcast modulation $q_i^{(k)}$ obtained by projecting the expanded score vector $\tilde{\boldsymbol{\delta}} = [\boldsymbol{\delta};\, \phi_2(\mathbf{x}) \odot \boldsymbol{\delta};\, \ldots;\, \phi_M(\mathbf{x}) \odot \boldsymbol{\delta}]$, where $\boldsymbol{\delta} = \nabla_{\mathbf{a}} \mathcal{L}$ is the output score.

![Refer to caption](https://arxiv.org/html/2605.30638v1/figures/cnn_cifar10_correlations.png)

(b) Empirical correlations between the cross entropy output score and hidden-layer activations in a backpropagation-trained CIFAR-10 CNN. These correlations decrease during optimization, consistent with Theorem [2](https://arxiv.org/html/2605.30638#Thmtheorem2).

Figure 1: Illustrations of the SBD framework and the empirical score–activation orthogonality.

Building on this principle, we formulate Score Broadcast and Decorrelation (SBD), a broadcast-and-decorrelate framework for general differentiable losses that supplies a unified theoretical grounding for the three-factor learning rule introduced above, with the neuromodulatory factor *derived* as the broadcast loss score rather than postulated; Figure [1(a)](https://arxiv.org/html/2605.30638#S1.F1.sf1) depicts the resulting update, including the score-vector expansion of Section [7](https://arxiv.org/html/2605.30638#S7). The MSE-based EBD update is recovered as one specific instance.

We also introduce score vector expansion. The output score has dimension equal to the number of output coordinates, which can limit the number of independent decorrelation directions available to wide hidden layers. SBD can expand the broadcast vector by multiplying the score with deterministic modulators, such as functions of the predictive distribution. These modulated scores preserve the conditional mean zero property at the population optimum while providing a richer set of decorrelation directions.
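The reason modulation preserves the mean-zero property can be sketched in one line. The notation is partly assumed: $\phi_m$ is the modulator symbol used in Figure 1(a), and $\boldsymbol{\delta}^\star$ is shorthand introduced here for the score at the population optimum. The sketch treats each modulator as a deterministic function of the input, so it is a constant under conditioning on $X$.

```latex
\mathbb{E}\bigl[\phi_m(X) \odot \boldsymbol{\delta}^\star \mid X\bigr]
  = \phi_m(X) \odot \mathbb{E}\bigl[\boldsymbol{\delta}^\star \mid X\bigr]
  = \phi_m(X) \odot \mathbf{0}
  = \mathbf{0},
  \qquad m = 2, \ldots, M.
```

Each block of the expanded score therefore inherits the conditional mean-zero property, provided the products involved are integrable.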

We evaluate SBD on CIFAR-10 and Tiny ImageNet using CNN architectures matched to the original EBD setting. The experiments are intended as controlled proof of concept tests within broadcast learning rather than as claims of superiority over exact BP. BP is included as a reference optimizer, while the main comparisons are to MSE based EBD and other broadcast baselines. In this setting, SBD improves over other broadcast formulations, and score vector expansion provides additional gains.

The remainder of the paper is organized as follows. Section [2](https://arxiv.org/html/2605.30638#S2) states the supervised-learning problem and Section [3](https://arxiv.org/html/2605.30638#S3) reviews EBD. Section [4](https://arxiv.org/html/2605.30638#S4) develops the cross entropy case, Section [5](https://arxiv.org/html/2605.30638#S5) presents the general SBD framework, Section [6](https://arxiv.org/html/2605.30638#S6) characterizes when a loss yields a conditionally mean-zero score, and Section [7](https://arxiv.org/html/2605.30638#S7) introduces score-vector expansion. Section [8](https://arxiv.org/html/2605.30638#S8) reports experiments, and Section [9](https://arxiv.org/html/2605.30638#S9) concludes.

### 1.1 Related work

Alternatives to BP span several major families, including contrastive methods such as Equilibrium Propagation (Scellier and Bengio, [2017](https://arxiv.org/html/2605.30638#bib.bib15), [2019](https://arxiv.org/html/2605.30638#bib.bib16)), target propagation (Bengio, [2014](https://arxiv.org/html/2605.30638#bib.bib17); Lee et al., [2015](https://arxiv.org/html/2605.30638#bib.bib18)), forward-only methods (Hinton, [2022](https://arxiv.org/html/2605.30638#bib.bib19)), predictive methods (Rao and Ballard, [1999](https://arxiv.org/html/2605.30638#bib.bib20); Whittington and Bogacz, [2017](https://arxiv.org/html/2605.30638#bib.bib21); Golkar et al., [2022](https://arxiv.org/html/2605.30638#bib.bib6)), similarity matching (Qin et al., [2021](https://arxiv.org/html/2605.30638#bib.bib22)), and feedback alignment (Lillicrap et al., [2016](https://arxiv.org/html/2605.30638#bib.bib23); Akrout et al., [2019](https://arxiv.org/html/2605.30638#bib.bib4)).

Most closely related to the present work are error broadcast methods, which send global output information directly to hidden layers. Direct Feedback Alignment routes random feedback projections from the output to each hidden layer (Nøkland, [2016](https://arxiv.org/html/2605.30638#bib.bib24); Bartunov et al., [2018](https://arxiv.org/html/2605.30638#bib.bib25); Han and Yoo, [2019](https://arxiv.org/html/2605.30638#bib.bib26); Launay et al., [2019](https://arxiv.org/html/2605.30638#bib.bib27), [2020](https://arxiv.org/html/2605.30638#bib.bib28); Bordelon and Pehlevan, [2023](https://arxiv.org/html/2605.30638#bib.bib29)), while Clark et al. ([2021](https://arxiv.org/html/2605.30638#bib.bib30)) broadcast a non-negative global error vector to modulate local plasticity. Most directly, the Error Broadcast and Decorrelation (EBD) framework (Erdogan et al., [2025](https://arxiv.org/html/2605.30638#bib.bib7)) provided a theoretical basis for broadcast learning by deriving layerwise decorrelation objectives from the MMSE orthogonality principle. Our work builds on this line by extending that foundation from MSE to general losses.

### 1.2 Contributions

- A loss-score principle for broadcast learning. We identify the output score $\boldsymbol{\delta} = \nabla_{\mathbf{a}} \mathcal{L}(\mathbf{y}, \mathbf{a})$ as the natural broadcast quantity for a general differentiable loss, recovering the EBD MSE residual as a special case.
- Score orthogonality beyond MSE. We prove that, for cross entropy, the population-optimal score satisfies $\mathbb{E}[\mathbf{p}^{\star}(X) - \mathbf{Y} \mid X] = \mathbf{0}$ and is therefore orthogonal to any input-measurable feature. We further characterize a broad family of losses admitting the same conditional-mean-zero property, including Bregman divergences (Bregman, [1967](https://arxiv.org/html/2605.30638#bib.bib33); Gneiting and Raftery, [2007](https://arxiv.org/html/2605.30638#bib.bib34)), exponential-family negative log-likelihoods, and proper scoring rules through unconstrained links.
- The SBD rule and a grounding of the three-factor rule under general losses. We propose Score Broadcast and Decorrelation, which uses layerwise decorrelation objectives between hidden activations and the output score. This provides a theoretical grounding for the three-factor learning rule of neuroscience under general losses: the neuromodulatory factor is *derived* as the broadcast loss score rather than postulated.
- Score vector expansion and experiments. We introduce score vector expansion, enriching the broadcast signal with deterministic modulators that preserve population orthogonality and increase the decorrelation directions of the layerwise objective. Experiments show that SBD improves over broadcast baselines, with consistent gains from expansion; update-alignment diagnostics show positive cosine similarity with BP gradients throughout training.

## 2 Problem statement

We study a supervised learning problem using a multilayer perceptron with $L$ layers, including the output layer. Let $\mathbf{h}^{(k)} \in \mathbb{R}^{N^{(k)}}$ denote the activation vector at layer $k$, where $\mathbf{h}^{(0)} = \mathbf{x}$ is the input and $\mathbf{h}^{(L)}$ is the output. For each layer $k = 1, \ldots, L$, the pre-activation and activation are written as

$$\mathbf{u}^{(k)} = \mathbf{W}^{(k)} \mathbf{h}^{(k-1)} + \mathbf{b}^{(k)}, \qquad \mathbf{h}^{(k)} = f^{(k)}\!\left(\mathbf{u}^{(k)}\right),$$

where $\mathbf{W}^{(k)}$ and $\mathbf{b}^{(k)}$ are the weights and biases of layer $k$, and $f^{(k)}(\cdot)$ is its activation function.
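As a minimal sketch of this notation, the forward pass can be written in NumPy as follows. The layer sizes, the ReLU and identity activations, and the helper name `forward` are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Return pre-activations u^(k) (k = 1..L) and activations h^(k) (k = 0..L)."""
    hs, us = [x], []
    for W, b, f in zip(weights, biases, activations):
        u = W @ hs[-1] + b          # u^(k) = W^(k) h^(k-1) + b^(k)
        us.append(u)
        hs.append(f(u))             # h^(k) = f^(k)(u^(k))
    return us, hs

rng = np.random.default_rng(0)
sizes = [8, 16, 4]                  # hypothetical N^(0), N^(1), N^(2)
weights = [0.1 * rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
relu = lambda u: np.maximum(u, 0.0)     # assumed hidden activation
identity = lambda u: u                  # assumed output activation
us, hs = forward(rng.standard_normal(sizes[0]), weights, biases, [relu, identity])
```

Keeping both `us` and `hs` is convenient because the local update rules reviewed later use the pre-activation $u_i^{(k)}$ as well as the activations of adjacent layers.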

The goal is to learn the network parameters so that the output $\mathbf{h}^{(L)}$ matches a target $\mathbf{y}$ under a task-specific objective. We therefore keep the learning formulation general and write the criterion as $\mathcal{L}\bigl(\mathbf{h}^{(L)}, \mathbf{y}\bigr)$. This objective may take different forms depending on the application: for example, the mean-squared-error loss in regression and the cross entropy loss in classification.

## 3 A review of the error broadcast and decorrelation method

EBD (Erdogan et al., [2025](https://arxiv.org/html/2605.30638#bib.bib7)) is a broadcast-based alternative to backpropagation grounded in the MMSE orthogonality principle: the optimal estimator's error should be orthogonal to suitable nonlinear functions of the input. EBD applies this idea to hidden-layer representations and posits that, at layer $k$,

$$\mathbf{R}^{(k)}_{ge} = \mathbb{E}\bigl[g^{(k)}(\mathbf{h}^{(k)})\,\mathbf{e}^{T}\bigr] = \mathbf{0},$$

where $g^{(k)}(\mathbf{h}^{(k)})$ is any nonlinear function of hidden-layer activations (typically, $g(\mathbf{h}^{(k)}) = \mathbf{h}^{(k)}$). In practice, the correlation is estimated online from minibatches of size $B$ using

$$\hat{\mathbf{R}}^{(k)}[m] = \lambda \hat{\mathbf{R}}^{(k)}[m-1] + (1-\lambda) B^{-1} \sum_{l=1}^{B} g^{(k)}(\mathbf{h}^{(k)}[mB+l])\,\mathbf{e}[mB+l]^{T},$$

where $m$ is the batch index. EBD minimizes the layerwise-defined decorrelation objective

$$\mathcal{J}^{(k)}_{\mathrm{EBD}}[m] = \frac{1}{2}\left\|\hat{\mathbf{R}}^{(k)}[m]\right\|_{F}^{2}.$$

Thus the MMSE orthogonality condition is turned into a layerwise training criterion that pushes hidden features to decorrelate from the broadcast output error (Erdogan et al., [2025](https://arxiv.org/html/2605.30638#bib.bib7)).
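The running correlation estimate and the decorrelation objective can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the function names, array shapes, zero initialization, and the value of the forgetting factor `lam` are all hypothetical.

```python
import numpy as np

def update_correlation(R_prev, G, E, lam):
    """EMA update R[m] = lam * R[m-1] + (1 - lam) * (1/B) * sum_l g(h_l) e_l^T.

    G: (B, N) array holding g(h) for each sample in the minibatch.
    E: (B, D) array holding the output error e for each sample.
    """
    B = G.shape[0]
    batch_corr = G.T @ E / B            # (N, D) average of outer products g e^T
    return lam * R_prev + (1.0 - lam) * batch_corr

def ebd_objective(R_hat):
    """Decorrelation objective J = 0.5 * ||R_hat||_F^2."""
    return 0.5 * np.sum(R_hat ** 2)

rng = np.random.default_rng(0)
B, N, D = 32, 16, 4                     # hypothetical batch, hidden, and output sizes
G = rng.standard_normal((B, N))         # stand-in for g(h) on one minibatch
E = rng.standard_normal((B, D))         # stand-in for the output errors
R_hat = update_correlation(np.zeros((N, D)), G, E, lam=0.9)
J = ebd_objective(R_hat)
```

With a zero initial estimate, the first update is simply `(1 - lam)` times the minibatch cross-correlation; later updates blend in older batches with geometrically decaying weight.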

Differentiating this loss yields a local term and additional terms that would require propagating the output error through deeper layers\. EBD uses the local term to define a projected error signal

$$\mathbf{q}^{(k)}[n] = \hat{\mathbf{R}}^{(k)}[m]\,\mathbf{e}[n], \qquad n = mB+1, \ldots, (m+1)B,$$

where $n$ is the sample index. The resulting single-sample hidden-layer update, for $B = 1$, is

$$\Delta W_{ij}^{(k)}[n] = \zeta\, g_i^{\prime(k)}\!\left(h_i^{(k)}[n]\right) f^{\prime(k)}\!\left(u_i^{(k)}[n]\right) q_i^{(k)}[n]\, h_j^{(k-1)}[n].$$

This update has the three-factor structure in neuroscience: pre- and post-synaptic activity terms, and a modulatory broadcast term. EBD's derivation, however, is tied to the MSE loss only. The proposed SBD framework generalizes the derivation of the three-factor rule to a wider family of losses, where MSE is a special instance.
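The three factors can be made concrete with a small NumPy sketch of the single-sample update. It assumes $g$ is the identity (so $g' = 1$) and $f$ is ReLU; the function name and all numerical values are hypothetical, and the sign and scale of `zeta` are left to the caller rather than fixed here.

```python
import numpy as np

def three_factor_update(R_hat, e, h_prev, u, zeta):
    """Single-sample update Delta W^(k), assuming g = identity and f = ReLU."""
    q = R_hat @ e                              # modulatory factor: projected error q^(k), shape (N,)
    post = (u > 0).astype(float)               # postsynaptic factor: g'(h) * f'(u) = 1 * ReLU'(u)
    return zeta * np.outer(post * q, h_prev)   # entry (i, j): zeta * post_i * q_i * h_prev_j

rng = np.random.default_rng(0)
R_hat = rng.standard_normal((3, 2))            # hypothetical correlation estimate (N = 3, D = 2)
e = np.array([0.5, -0.5])                      # output error for one sample
h_prev = rng.standard_normal(4)                # presynaptic activations h^(k-1)
u = np.array([1.0, -1.0, 2.0])                 # pre-activations u^(k); the second unit is inactive
dW = three_factor_update(R_hat, e, h_prev, u, zeta=0.01)
```

Every quantity entering row `i` of `dW` is available at unit `i` except the broadcast error `e`, which is the only global signal in the rule.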

## 4 Extension to cross entropy loss

We now extend the EBD framework from MSE to multiclass classification under cross entropy loss. Let $(X, \mathbf{Y}) \sim \mathbb{P}_{X,\mathbf{Y}}$ be jointly distributed input–label pairs taking values in $\mathcal{X} \times \{\mathbf{e}_1, \ldots, \mathbf{e}_D\}$, where $\mathbf{e}_d \in \mathbb{R}^{D}$ is the $d$-th canonical basis vector. As before, the network produces hidden activations $\mathbf{h}^{(k)}$ for $k = 1, \ldots, L-1$, and the output layer produces logits

$$\mathbf{a} = \mathbf{W}^{(L)} \mathbf{h}^{(L-1)} + \mathbf{b}^{(L)}, \tag{1}$$

with corresponding predicted probabilities $\mathbf{p} = \operatorname{softmax}(\mathbf{a})$, and cross entropy loss

$$\mathcal{L}_{\mathrm{CE}}(\mathbf{y}, \mathbf{a}) = -\sum_{d=1}^{D} y_d \log p_d.$$

We refer to the gradient with respect to the logits $\mathbf{a}$ as the output score, which here takes the form

$$\boldsymbol{\delta}_{\mathrm{CE}} = \nabla_{\mathbf{a}} \mathcal{L}_{\mathrm{CE}}(\mathbf{y}, \mathbf{a}) = \mathbf{p} - \mathbf{y}.$$

We will show that this probability residual plays the role of the Euclidean error $\mathbf{e}$ in the MSE version of EBD, satisfying the analogous orthogonality property at population optimality.
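The identity $\nabla_{\mathbf{a}} \mathcal{L}_{\mathrm{CE}} = \mathbf{p} - \mathbf{y}$ is easy to confirm numerically. The sketch below compares the closed form against a central finite-difference gradient; the logit values, the one-hot target, and the helper names are arbitrary choices made for illustration.

```python
import numpy as np

def softmax(a):
    z = np.exp(a - np.max(a))           # subtract the max for numerical stability
    return z / z.sum()

def cross_entropy(y, a):
    return -np.sum(y * np.log(softmax(a)))

def numerical_grad(y, a, eps=1e-6):
    """Central finite-difference gradient of the loss with respect to the logits."""
    grad = np.zeros_like(a)
    for d in range(a.size):
        step = np.zeros_like(a)
        step[d] = eps
        grad[d] = (cross_entropy(y, a + step) - cross_entropy(y, a - step)) / (2 * eps)
    return grad

a = np.array([0.5, -1.2, 2.0, 0.1])     # hypothetical logits
y = np.array([0.0, 0.0, 1.0, 0.0])      # one-hot target
delta = softmax(a) - y                  # closed-form output score p - y
grad_fd = numerical_grad(y, a)          # should agree with delta up to finite-difference error
```

Because both $\mathbf{p}$ and $\mathbf{y}$ sum to one, the entries of `delta` sum to zero, so the cross entropy score always lies in a $(D-1)$-dimensional subspace.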

To obtain the cross entropy analog of MMSE orthogonality, define

$$\mathbf{q}(\mathbf{x}) = \bigl[\mathbb{P}(\mathbf{Y} = \mathbf{e}_1 \mid X = \mathbf{x}), \ldots, \mathbb{P}(\mathbf{Y} = \mathbf{e}_D \mid X = \mathbf{x})\bigr]^{T},$$

as the true conditional class-probability vector. For any predictor $\mathbf{p}(\mathbf{x})$, the cross entropy risk is

$$\mathcal{R}_{\mathrm{CE}}(\mathbf{p}) = \mathbb{E}\Bigl[-\sum_{d=1}^{D} Y_d \log p_d(X)\Bigr] = \mathbb{E}_{X}\bigl[H(\mathbf{q}(X), \mathbf{p}(X))\bigr],$$

where $H(\mathbf{q}, \mathbf{p})$ is the cross entropy between the true and predicted conditional distributions. Using the standard decomposition $H(\mathbf{q}, \mathbf{p}) = H(\mathbf{q}) + \mathrm{KL}(\mathbf{q} \,\|\, \mathbf{p})$, the unique population minimizer is $\mathbf{p}^{\star}(\mathbf{x}) = \mathbf{q}(\mathbf{x})$ almost surely, which yields the following identity replacing the MMSE residual property.

###### Theorem 1 (Conditional mean-zero property for cross entropy)

Let $\mathbf{p}^{\star}(X)$ denote the population minimizer of the cross entropy risk. Then

$$\mathbb{E}\bigl[\mathbf{p}^{\star}(X)-\mathbf{Y}\mid X\bigr]=\mathbf{0}.$$

*The proof is deferred to Appendix [A.1](https://arxiv.org/html/2605.30638#A1.SS1).*

The conditional mean-zero identity becomes unconditional orthogonality with respect to arbitrary measurable functions of $X$ via the tower property of conditional expectation.

###### Lemma 1 (Tower-property orthogonality)

Let $\mathbf{U}$ be a random vector and let $g(X)$ be any measurable function of $X$ such that $\mathbf{U}\,g(X)^{T}$ is integrable, i.e., $\mathbb{E}\bigl[\|\mathbf{U}\,g(X)^{T}\|\bigr]<\infty$. If $\mathbb{E}[\mathbf{U}\mid X]=\mathbf{0}$, then $\mathbb{E}\bigl[\mathbf{U}\,g(X)^{T}\bigr]=\mathbf{0}$.

Applying the lemma with $\mathbf{U}=\mathbf{p}^{\star}(X)-\mathbf{Y}$ yields the main cross entropy orthogonality theorem.

###### Theorem 2 (Cross entropy orthogonality theorem)

Assume that the predictor class contains the true conditional distribution, so that $\mathbf{p}^{\star}(X)=\mathbf{q}(X)$. Then for any measurable function $g(X)$ such that $(\mathbf{p}^{\star}(X)-\mathbf{Y})\,g(X)^{T}$ is integrable (in particular, for any $g$ with $\mathbb{E}\|g(X)\|<\infty$, since $\mathbf{p}^{\star}(X)-\mathbf{Y}$ is bounded),

$$\mathbb{E}\bigl[(\mathbf{p}^{\star}(X)-\mathbf{Y})\,g(X)^{T}\bigr]=\mathbf{0}.$$

In particular, choosing $g(X)=g^{(k)}(\mathbf{h}^{(k)}(X))$ gives

$$\mathbb{E}\bigl[(\mathbf{p}^{\star}(X)-\mathbf{Y})\,g^{(k)}(\mathbf{h}^{(k)}(X))^{T}\bigr]=\mathbf{0}.$$

*The proof is deferred to Appendix [A.3](https://arxiv.org/html/2605.30638#A1.SS3).*
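Theorem 2 can be illustrated exactly on a small discrete population. The sketch below is a toy construction, not taken from the paper: the distribution `p_x`, the posteriors `q`, and the feature map `g` are hypothetical. It evaluates $\mathbb{E}[(\mathbf{p}(X)-\mathbf{Y})\,g(X)^{T}]$ by direct summation for the optimal predictor $\mathbf{p}^{\star}=\mathbf{q}$ and for a suboptimal uniform predictor.

```python
# Toy population: X takes three values with known class posteriors q(x).
p_x = {0: 0.2, 1: 0.5, 2: 0.3}          # P(X = x), hypothetical
q = {0: [0.7, 0.2, 0.1],                # q(x) = P(Y = e_d | X = x), hypothetical
     1: [0.1, 0.6, 0.3],
     2: [0.3, 0.3, 0.4]}
D = 3

def g(x):
    """Arbitrary function of X, standing in for a hidden-layer feature."""
    return [1.0, float(x), float(x) ** 2 - 1.0]

def cross_moment(predictor):
    """Exact E[(p(X) - Y) g(X)^T], summing over x and over one-hot labels."""
    R = [[0.0] * len(g(0)) for _ in range(D)]
    for x, px in p_x.items():
        pred = predictor(x)
        for label in range(D):
            w = px * q[x][label]        # P(X = x, Y = e_label)
            for d in range(D):
                resid = pred[d] - (1.0 if d == label else 0.0)
                for j, gj in enumerate(g(x)):
                    R[d][j] += w * resid * gj
    return R

def max_abs(R):
    """Largest absolute entry of a matrix stored as nested lists."""
    return max(abs(v) for row in R for v in row)

optimal = max_abs(cross_moment(lambda x: q[x]))          # p* = q
uniform = max_abs(cross_moment(lambda x: [1 / 3] * D))   # suboptimal predictor
```

With the optimal predictor the cross-moment vanishes up to rounding, whereas the uniform predictor leaves a visibly nonzero correlation with the features.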

Figure [1(b)](https://arxiv.org/html/2605.30638#S1.F1.sf2) illustrates this empirically: although the theorem is a population statement, the measured score–hidden correlations in a CNN trained with cross entropy steadily decline during backpropagation, indicating that optimization drives the network toward the predicted orthogonality regime.

Using this cross entropy counterpart of orthogonality, we define the layerwise decorrelation condition

$$\mathbf{R}_{\mathrm{CE}}^{(k)}=\mathbb{E}\bigl[g^{(k)}(\mathbf{h}^{(k)})(\mathbf{p}-\mathbf{y})^{T}\bigr]=\mathbf{0},$$

with objective $\mathcal{J}_{\mathrm{CE}}^{(k)}=\tfrac{1}{2}\|\mathbf{R}_{\mathrm{CE}}^{(k)}\|_{F}^{2}$. For batch optimization, the batch probability and label matrices

$$\mathbf{P}[m]=\bigl[\mathbf{p}[mB+1],\ldots,\mathbf{p}[mB+B]\bigr],\qquad\mathbf{Y}[m]=\bigl[\mathbf{y}[mB+1],\ldots,\mathbf{y}[mB+B]\bigr]$$

yield the score matrix $\mathbf{\Delta}_{\mathrm{CE}}[m]=\mathbf{P}[m]-\mathbf{Y}[m]$ and the autoregressive correlation estimate

$$\hat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m]=\lambda\hat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m-1]+(1-\lambda)B^{-1}\,\mathbf{G}^{(k)}[m]\mathbf{\Delta}_{\mathrm{CE}}[m]^{T},$$

where $\mathbf{G}^{(k)}[m]=\bigl[g^{(k)}(\mathbf{h}^{(k)}[mB+1]),\ldots,g^{(k)}(\mathbf{h}^{(k)}[mB+B])\bigr]$, with online decorrelation loss $\mathcal{J}_{\mathrm{CE}}^{(k)}[m]=\tfrac{1}{2}\|\hat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m]\|_{F}^{2}$. This is the exact EBD analog with the MSE error replaced by the cross entropy score.
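The running correlation estimate and its loss can be sketched in a few lines. This is an illustrative plain-Python version with matrices stored as nested lists; the helper names `outer_sum`, `update_correlation`, and `decorrelation_loss` and all numerical values are assumptions, not the authors' implementation.

```python
def outer_sum(G, Delta):
    """Return G @ Delta^T for column-stacked batches G (d_k x B), Delta (D x B)."""
    B = len(G[0])
    return [[sum(G[i][n] * Delta[d][n] for n in range(B))
             for d in range(len(Delta))]
            for i in range(len(G))]

def update_correlation(R_prev, G, Delta, lam):
    """R[m] = lam * R[m-1] + (1 - lam) * B^{-1} * G[m] Delta[m]^T."""
    B = len(G[0])
    GD = outer_sum(G, Delta)
    return [[lam * R_prev[i][d] + (1 - lam) * GD[i][d] / B
             for d in range(len(GD[0]))]
            for i in range(len(GD))]

def decorrelation_loss(R):
    """J[m] = 0.5 * ||R[m]||_F^2."""
    return 0.5 * sum(v * v for row in R for v in row)

# Tiny example: d_k = 2 hidden features, D = 2 classes, batch size B = 2.
G = [[1.0, 0.0],
     [0.0, 2.0]]           # columns are g(h[n])
Delta = [[0.5, -0.5],
         [-0.5, 0.5]]      # columns are p[n] - y[n]
R0 = [[0.0, 0.0], [0.0, 0.0]]
R1 = update_correlation(R0, G, Delta, lam=0.9)
J1 = decorrelation_loss(R1)
```

The forgetting factor `lam` trades off the variance of the estimate against how quickly it tracks the changing network; the value 0.9 here is only a placeholder.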

As shown in Appendix [C.1](https://arxiv.org/html/2605.30638#A3.SS1), following the EBD derivation (Erdogan et al., [2025](https://arxiv.org/html/2605.30638#bib.bib7)), differentiating $\mathcal{J}_{\mathrm{CE}}^{(k)}[m]$ yields a local Jacobian term plus terms that would require propagating the score through deeper layers; we keep the local no-propagation term as the practical broadcast rule. The layer-specific projected score is

$$\mathbf{q}_{\mathrm{CE}}^{(k)}[n]=\hat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m]\bigl(\mathbf{p}[n]-\mathbf{y}[n]\bigr),\qquad n=mB+1,\ldots,(m+1)B.$$

Hence, for $B=1$, the update rule can be written as

$$\Delta W_{ij}^{(k)}[n]=\zeta\,g_{i}^{\prime(k)}\!\left(h_{i}^{(k)}[n]\right)f^{\prime(k)}\!\left(u_{i}^{(k)}[n]\right)q_{\mathrm{CE},i}^{(k)}[n]\,h_{j}^{(k-1)}[n],$$

which preserves the three-factor structure of EBD with the modulatory broadcast term now derived from the cross entropy score. More details, including the $B>1$ case, are provided in Appendix [C](https://arxiv.org/html/2605.30638#A3).
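The per-sample rule can be read as an outer product between a postsynaptic factor, modulated by the projected score, and the presynaptic activity. The sketch below makes illustrative choices that are assumptions rather than the paper's configuration: $f=\tanh$ with $h_i=f(u_i)$, $g$ equal to the identity, and a generic step size `zeta` whose sign convention is left to Appendix C.

```python
import math

def projected_score(R, delta):
    """q = R @ delta for R of shape (d_k, D) and a score delta of length D."""
    return [sum(r * d for r, d in zip(row, delta)) for row in R]

def three_factor_update(R, delta, u, h_prev, zeta, f, f_prime, g_prime):
    """Per-sample update dW[i][j] = zeta * g'(h_i) * f'(u_i) * q_i * h_prev[j]."""
    q = projected_score(R, delta)
    local = [g_prime(f(ui)) * f_prime(ui) * qi for ui, qi in zip(u, q)]
    return [[zeta * li * hj for hj in h_prev] for li in local]

# Illustrative choices (assumptions): f = tanh, g = identity.
f = math.tanh
f_prime = lambda v: 1.0 - math.tanh(v) ** 2
g_prime = lambda h: 1.0

R = [[0.1, -0.1], [0.2, 0.0]]   # running correlation estimate, shape (2, 2)
delta = [0.3, -0.3]             # cross entropy score p - y for one sample
u = [0.0, 0.5]                  # pre-activations of layer k
h_prev = [1.0, -1.0, 2.0]       # activations of layer k-1
dW = three_factor_update(R, delta, u, h_prev, 0.1, f, f_prime, g_prime)
```

Only quantities local to layer $k$ and the broadcast score enter the update; no downstream weights are used.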

##### Sufficiency under a dense\-feature assumption\.

Theorem [2](https://arxiv.org/html/2605.30638#Thmtheorem2) shows that score–feature orthogonality is *necessary* at the population cross entropy optimum. The converse, that enforcing this orthogonality drives the predictor *to* the optimum, requires the feature family used in the decorrelation to be sufficiently rich. Let $\mathbf{p}(X)$ denote a *candidate* predictor (not assumed to be the population minimizer), and define $m_{\mathbf{p}}(X):=\mathbb{E}[\mathbf{p}(X)-\mathbf{Y}\mid X]=\mathbf{p}(X)-\mathbf{q}(X)$, the gap between the candidate predictor and the true conditional class probabilities. The decorrelation condition $\mathbb{E}[(\mathbf{p}(X)-\mathbf{Y})\mathbf{z}(X)^{T}]=0$ is, by the tower property, exactly the orthogonality of $m_{\mathbf{p}}$ against the test feature $\mathbf{z}$ in $L^{2}(P_{X})$. If the hidden-layer features (e.g., first-layer activations in the wide-network limit) have linear span dense in $L^{2}(P_{X})$, then enforcing the decorrelation condition against *all* such features forces $m_{\mathbf{p}}$ to be orthogonal to a dense subspace. Since $m_{\mathbf{p}}\in L^{2}(P_{X})$, this forces $m_{\mathbf{p}}(X)=0$ almost surely, which gives $\mathbf{p}(X)=\mathbf{q}(X)=\mathbf{p}^{\star}(X)$, i.e., the candidate predictor coincides with the population minimizer of Theorem [1](https://arxiv.org/html/2605.30638#Thmtheorem1). The same dense-feature reasoning was used in EBD (Erdogan et al., [2025](https://arxiv.org/html/2605.30638#bib.bib7), Appendix B.2) for MSE; under SBD it operates identically with the cross entropy score. The finite-width algorithm should be viewed as an approximation to this idealized condition.
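A finite-support special case makes the dense-feature argument concrete. As an illustration only, assume that $X$ takes finitely many values and that the feature family contains the indicators $\mathbf{1}\{X=x_0\}$, whose span is then all of $L^{2}(P_{X})$:

```latex
% Finite-support illustration (assumption: X takes finitely many values and
% the feature family contains every indicator 1{X = x_0}).
\begin{align*}
\mathbb{E}\bigl[(\mathbf{p}(X)-\mathbf{Y})\,\mathbf{1}\{X=x_0\}\bigr]
  &= \mathbb{E}\bigl[\mathbb{E}[\mathbf{p}(X)-\mathbf{Y}\mid X]\,\mathbf{1}\{X=x_0\}\bigr] \\
  &= \mathbb{E}\bigl[m_{\mathbf{p}}(X)\,\mathbf{1}\{X=x_0\}\bigr] \\
  &= \mathbb{P}(X=x_0)\,m_{\mathbf{p}}(x_0).
\end{align*}
% Setting the left-hand side to zero for every x_0 with P(X = x_0) > 0 gives
% m_p(x_0) = 0, that is, p(x_0) = q(x_0) on the support of X.
```

The general statement replaces the indicators by any family whose linear span is dense, at the cost of an approximation argument in $L^{2}(P_{X})$.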

## 5 Extension to more general losses: score broadcast and decorrelation

The cross entropy derivation reveals the central idea behind the generalized framework: what matters is not the MSE residual or the specific probability residual of cross entropy, but the score of the loss with respect to the network output. For a general differentiable loss $\mathcal{L}(\mathbf{y},\mathbf{a})$, we define the score by

$$\boldsymbol{\delta}(\mathbf{y},\mathbf{a}):=\nabla_{\mathbf{a}}\mathcal{L}(\mathbf{y},\mathbf{a}).$$

This provides a universal notion of broadcast error. In particular, $\boldsymbol{\delta}_{\mathrm{MSE}}=\mathbf{a}-\mathbf{y}$ and $\boldsymbol{\delta}_{\mathrm{CE}}=\mathbf{p}-\mathbf{y}$, so both MSE and cross entropy fit naturally into the same score-based formulation. For arbitrary differentiable losses, the first universal orthogonality statement arises from stationarity of the last-layer weights. Let $\mathbf{a}$ denote the network output as in Eq. ([1](https://arxiv.org/html/2605.30638#S4.E1)), and define the population risk $\mathcal{R}=\mathbb{E}\bigl[\mathcal{L}(\mathbf{Y},\mathbf{a}(\mathbf{X}))\bigr]$. Then differentiation with respect to the last-layer weight matrix gives

$$\nabla_{\mathbf{W}^{(L)}}\mathcal{R}=\mathbb{E}\bigl[\boldsymbol{\delta}\,\mathbf{h}^{(L-1)T}\bigr].$$

Consequently, every stationary point of the population risk satisfies the following proposition.

###### Proposition 1 (Stationary score-feature orthogonality)

Assume that $\mathcal{L}(\mathbf{y},\mathbf{a})$ is differentiable in $\mathbf{a}$ and that gradient and expectation may be interchanged in the expression for $\mathcal{R}$. If $\mathbf{W}^{(L)}$ is stationary while all earlier parameters are fixed, then $\mathbb{E}\bigl[\boldsymbol{\delta}\,\mathbf{h}^{(L-1)T}\bigr]=\mathbf{0}$.

*The proof is deferred to Appendix [A.4](https://arxiv.org/html/2605.30638#A1.SS4).*
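Proposition 1 can be seen in the simplest possible setting. The sketch below assumes a scalar last layer $a=w\,h$ with the bias fixed at zero, the MSE loss $\tfrac{1}{2}(a-y)^{2}$ with score $\delta=a-y$, and a four-point empirical distribution playing the role of the population; all numbers are made up for illustration.

```python
# Empirical "population": four equally likely (h, y) pairs, scalar feature/target.
h = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 0.0, 2.0, 5.0]

def risk_grad(w):
    """dR/dw = E[delta * h] with the MSE score delta = a - y and a = w * h."""
    return sum((w * hi - yi) * hi for hi, yi in zip(h, y)) / len(h)

# Stationary last-layer weight of this quadratic risk, in closed form.
w_star = sum(hi * yi for hi, yi in zip(h, y)) / sum(hi * hi for hi in h)

grad_at_star = risk_grad(w_star)   # score-feature correlation at stationarity
grad_elsewhere = risk_grad(0.0)    # nonzero away from the stationary point
```

At `w_star` the score is uncorrelated with the last hidden feature, which is exactly the stationarity condition; nothing is implied about features of earlier layers.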

The proposition only covers the last hidden layer. Orthogonality with arbitrary measurable input functions requires the conditional mean-zero score property of the following theorem:

###### Theorem 3 (General score orthogonality theorem)

Let $\mathbf{a}^{\star}(X)$ be an optimal predictor at the population level for a differentiable loss $\mathcal{L}(\mathbf{y},\mathbf{a})$, and let $\boldsymbol{\delta}^{\star}=\nabla_{\mathbf{a}}\mathcal{L}(\mathbf{Y},\mathbf{a}^{\star}(X))$ be the corresponding optimal score. Assume that the optimal score satisfies the conditional mean-zero property $\mathbb{E}[\boldsymbol{\delta}^{\star}\mid X]=\mathbf{0}$. Then for every measurable function $g(X)$ such that $\boldsymbol{\delta}^{\star}\,g(X)^{T}$ is integrable (equivalently, $\mathbb{E}\bigl[\|\boldsymbol{\delta}^{\star}\,g(X)^{T}\|\bigr]<\infty$),

$$\mathbb{E}\bigl[\boldsymbol{\delta}^{\star}g(X)^{T}\bigr]=\mathbf{0}.$$

In particular, taking $g(X)=g^{(k)}(\mathbf{h}^{(k)}(X))$ yields layerwise score orthogonality.

*The proof is deferred to Appendix [A.5](https://arxiv.org/html/2605.30638#A1.SS5).*

*Remark.* The conditional mean-zero hypothesis assumed in this theorem is broadly satisfied: in Section [6](https://arxiv.org/html/2605.30638#S6) and Appendix [B](https://arxiv.org/html/2605.30638#A2) we show that it holds whenever the conditional risk has an interior minimizer in an open parameter domain. This characterization covers the standard differentiable-loss families used in supervised learning, including Bregman divergences, exponential-family negative log-likelihoods, and proper scoring rules through unconstrained links. MSE and cross-entropy enter the framework as the canonical instances of the first and third families, respectively, rather than as the loss families themselves: SBD is the general principle, and these familiar losses are two cases among many.

##### Sufficiency under a dense-feature assumption.

The dense-feature argument for cross-entropy in Section [4](https://arxiv.org/html/2605.30638#S4) depends on the loss only through its score and therefore extends to general losses. Let $\mathbf{a}(X)$ denote a *candidate* predictor (not assumed optimal) and let $\boldsymbol{\delta}(X):=\nabla_{\mathbf{a}}\mathcal{L}(\mathbf{Y},\mathbf{a}(X))$ be the corresponding candidate score. By the tower property, decorrelating $\boldsymbol{\delta}(X)$ against a hidden-layer feature family whose linear span is dense in $L^{2}(P_{X})$ forces $\mathbb{E}[\boldsymbol{\delta}(X)\mid X]=\mathbf{0}$ almost surely. By the conditional-risk characterization of Section [6](https://arxiv.org/html/2605.30638#S6), this conditional mean-zero identity is sufficient for $\mathbf{a}(X)$ to be a population-risk optimum, i.e., $\mathbf{a}(X)=\mathbf{a}^{\star}(X)$ and hence $\boldsymbol{\delta}(X)=\boldsymbol{\delta}^{\star}(X)$, whenever the corresponding conditional risk is convex. The cross-entropy case of Section [4](https://arxiv.org/html/2605.30638#S4) and the MSE case of EBD (Erdogan et al., [2025](https://arxiv.org/html/2605.30638#bib.bib7), Appendix B.2) are two instances; under SBD the same argument applies uniformly across the differentiable-loss families covered by the conditional-risk characterization.

The resulting generalized score-broadcast objective is

$$\mathcal{J}_{\mathrm{Score}}^{(k)}=\frac{1}{2}\left\|\mathbb{E}\bigl[g^{(k)}(\mathbf{h}^{(k)})\boldsymbol{\delta}^{T}\bigr]\right\|_{F}^{2},$$

with minibatch estimator

$$\hat{\mathbf{R}}_{\mathrm{Score}}^{(k)}[m]=\lambda\hat{\mathbf{R}}_{\mathrm{Score}}^{(k)}[m-1]+(1-\lambda)B^{-1}\,\mathbf{G}^{(k)}[m]\mathbf{\Delta}[m]^{T},$$

where $\mathbf{\Delta}[m]$ collects arbitrary score vectors $\boldsymbol{\delta}[mB+1],\ldots,\boldsymbol{\delta}[mB+B]$. The layerwise broadcast is

$$\mathbf{q}_{\mathrm{Score}}^{(k)}[n]=\hat{\mathbf{R}}_{\mathrm{Score}}^{(k)}[m]\,\boldsymbol{\delta}[n],\qquad n=mB+1,\ldots,(m+1)B,$$

and for $B=1$, the weight update keeps the generic local form

$$\Delta W_{ij}^{(k)}[n]=\zeta\,g_{i}^{\prime(k)}\!\left(h_{i}^{(k)}[n]\right)f^{\prime(k)}\!\left(u_{i}^{(k)}[n]\right)q_{\mathrm{Score},i}^{(k)}[n]\,h_{j}^{(k-1)}[n].$$

The update rule above provides a derivation of the three-factor learning rule, a long-standing model of synaptic plasticity in computational neuroscience (Frémaux and Gerstner, [2016](https://arxiv.org/html/2605.30638#bib.bib8); Kuśmierz et al., [2017](https://arxiv.org/html/2605.30638#bib.bib10)), under a broad family of differentiable losses. In the neuroscience literature the neuromodulatory factor is typically postulated; under SBD it is derived *from the loss itself* as the broadcast loss score. Because the derivation depends on the loss only through its score, the same three-factor update covers every loss to which the conditional-mean-zero characterization applies, including Bregman divergences, exponential-family negative log-likelihoods, and proper scoring rules through unconstrained links. SBD is thus a unified derivation of the three-factor rule across the standard differentiable losses of supervised learning, with MSE-based EBD recovered as one instance.

As in the EBD implementation (Erdogan et al., [2025](https://arxiv.org/html/2605.30638#bib.bib7)), the score-broadcast objective is augmented with a layer-entropy regularizer (Ozsoy et al., [2022](https://arxiv.org/html/2605.30638#bib.bib12); Bozkurt et al., [2023a](https://arxiv.org/html/2605.30638#bib.bib13)) to prevent hidden-layer activations from collapsing into a low-dimensional subspace, and with a small output-layer $\ell_{1}$ term for training stabilization. Full implementation details are in Appendix [E.1](https://arxiv.org/html/2605.30638#A5.SS1).

## 6 When does an unconstrained loss parametrization yield a conditionally mean-zero score?

Theorem [3](https://arxiv.org/html/2605.30638#Thmtheorem3) shows that full layerwise orthogonality follows once the population-optimal score satisfies $\mathbb{E}[\boldsymbol{\delta}^{\star}\mid X]=\mathbf{0}$. We now ask when this hypothesis holds. The key step is to forget the network parameterization and study the loss as a function of the prediction variable alone.

For each input value $x$, define the conditional risk

$$C_{x}(\mathbf{a})=\mathbb{E}\bigl[\mathcal{L}(\mathbf{Y},\mathbf{a})\mid X=x\bigr],$$

where $\mathbf{a}$ is the output parameter with respect to which the SBD score $\boldsymbol{\delta}=\nabla_{\mathbf{a}}\mathcal{L}(\mathbf{Y},\mathbf{a})$ is taken, and assume that $\mathbf{a}$ ranges over an open set $\mathcal{A}\subseteq\mathbb{R}^{m}$. Losses with constrained natural prediction variables (e.g., probability losses on $\mathbf{p}\in\Delta_{D}$) are brought into this setting by composing with an unconstrained link $\mathbf{p}=\psi(\mathbf{a})$ and writing $\mathcal{L}(\mathbf{Y},\mathbf{a})=\ell(\mathbf{Y},\psi(\mathbf{a}))$. The following theorem then characterizes when the conditional-mean-zero score property holds.

###### Theorem 4 (Conditional-risk characterization, informal)

Suppose $C_{x}$ has an interior minimizer $\mathbf{a}^{\star}(x)\in\mathcal{A}$ for almost every $x$, with mild regularity allowing differentiation under the conditional expectation. Then the optimal score satisfies

$$\mathbb{E}\bigl[\nabla_{\mathbf{a}}\mathcal{L}(\mathbf{Y},\mathbf{a}^{\star}(X))\mid X\bigr]=\mathbf{0}.$$

If, in addition, $C_{x}(\mathbf{a})$ is convex in $\mathbf{a}$ for almost every $x$, the converse also holds: any predictor whose score has zero conditional mean almost surely minimizes $C_{x}$ pointwise.

The full statement of the theorem and its proof are deferred to Appendix [B](https://arxiv.org/html/2605.30638#A2) (Theorem [5](https://arxiv.org/html/2605.30638#Thmtheorem5)). Three canonical loss families satisfy the hypothesis of Theorem [4](https://arxiv.org/html/2605.30638#Thmtheorem4) and are worked out in Appendix [B](https://arxiv.org/html/2605.30638#A2): Bregman divergences (with MSE as the special case $\phi(\mathbf{u})=\tfrac{1}{2}\|\mathbf{u}\|_{2}^{2}$), proper scoring rules through unconstrained links (with softmax cross-entropy as the canonical instance), and exponential-family negative log-likelihoods (where the conditional-mean-zero property reduces to moment matching). The main takeaway is that SBD subsumes the standard differentiable losses of supervised learning under a single principle, with MSE-based EBD recovered as one instance.
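The moment-matching reading of the exponential-family case can be illustrated with a Poisson negative log-likelihood under a log link, $\mathcal{L}(y,a)=e^{a}-ya+\log y!$, whose score is $e^{a}-y$. The sketch below is an independent toy check, not the Poisson regression demonstration of Appendix B.1; the conditional mean `mu` and the truncation level `k_max` are assumptions.

```python
import math

def poisson_pmf(mu, k_max=200):
    """Poisson(mu) probabilities for k = 0..k_max, built iteratively."""
    pmf = [math.exp(-mu)]
    for k in range(1, k_max + 1):
        pmf.append(pmf[-1] * mu / k)
    return pmf

def conditional_mean_score(a, mu):
    """E[exp(a) - Y | X = x] when Y | X = x is Poisson with mean mu."""
    return sum(pk * (math.exp(a) - k) for k, pk in enumerate(poisson_pmf(mu)))

mu = 3.5                  # hypothetical conditional mean E[Y | X = x]
a_star = math.log(mu)     # interior minimizer of the conditional risk
at_optimum = conditional_mean_score(a_star, mu)
off_optimum = conditional_mean_score(a_star + 0.5, mu)
```

The conditional mean of the score vanishes at $a^{\star}=\log\mu$, where the predicted mean $e^{a}$ matches $\mathbb{E}[Y\mid X=x]$, and is nonzero at any other value of $a$, consistent with the convex converse of Theorem 4.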

## 7 Score vector expansion

The output score $\boldsymbol{\delta}(\mathbf{Y},\mathbf{a})$ has dimension $D_{\mathrm{out}}$, which is typically much less than the hidden layer width $d_{k}$. Hence the layerwise correlation matrix has rank at most $D_{\mathrm{out}}$, and this bottleneck limits the number of independent decorrelation constraints imposed by SBD on the hidden representation. This limitation can be relaxed by enlarging the broadcast signal with deterministic modulations that preserve the conditional mean-zero property. The modulator need only be a function of $X$. Given any modulator $\boldsymbol{\phi}:\mathcal{X}\to\mathbb{R}^{D_{\mathrm{out}}}$, define the modulated score at the population optimum by

$$\boldsymbol{\eta}^{\boldsymbol{\phi},\star}(X)\;:=\;\boldsymbol{\phi}(X)\odot\boldsymbol{\delta}\bigl(\mathbf{Y},\mathbf{a}^{\star}(X)\bigr).$$

It also has zero conditional mean, since $\boldsymbol{\phi}(X)$ factors out of the conditional expectation. We stack $M$ such modulated scores into the expanded broadcast vector

$$\tilde{\boldsymbol{\delta}}(X)\;:=\;\bigl[\boldsymbol{\phi}_{1}(X)\odot\boldsymbol{\delta}(\mathbf{Y},\mathbf{a}(X));\;\ldots;\;\boldsymbol{\phi}_{M}(X)\odot\boldsymbol{\delta}(\mathbf{Y},\mathbf{a}(X))\bigr]\in\mathbb{R}^{MD_{\mathrm{out}}}.$$

This preserves the layerwise orthogonality, providing up to $M$ times more decorrelation directions at linear cost (see Appendix [C.3](https://arxiv.org/html/2605.30638#A3.SS3)), with output-dependent choices $\boldsymbol{\phi}(X)=\psi(\mathbf{a}(X))$ as a special case.

The general construction only requires the modulators $\boldsymbol{\phi}_{\ell}$ to be deterministic functions of $X$, enabling substantial freedom in their design; a natural and computationally inexpensive choice is to build them from quantities already produced by the forward pass, such as the predictive distribution $\mathbf{p}(X)$, which is itself a nonlinear deterministic function of $X$. Following this guideline, for the experiments in Section [8](https://arxiv.org/html/2605.30638#S8) we adopt the rank-$3D$ instance $\tilde{\boldsymbol{\delta}}=\bigl[\boldsymbol{\delta};\;\mathbf{p}(X)\odot\boldsymbol{\delta};\;\mathrm{roll}_{5}(\mathbf{p}(X))\odot\boldsymbol{\delta}\bigr]\in\mathbb{R}^{3D}$, which augments the raw score with a confidence-weighted residual and a shifted cross-class interaction term. These blocks are heuristic choices among many admissible families; Appendix [E.5](https://arxiv.org/html/2605.30638#A5.SS5) reports an ablation over alternatives, and more principled, data-adaptive constructions are left as future work. The numerical experiments of Section [8](https://arxiv.org/html/2605.30638#S8), most clearly the $3.1$-point gain on Tiny ImageNet, nevertheless confirm that even this empirically motivated instance delivers substantial improvements over unexpanded SBD. Full derivations and the expanded algorithm are deferred to Appendix [D](https://arxiv.org/html/2605.30638#A4).
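The three-block expansion used in the experiments can be sketched as follows. The semantics of $\mathrm{roll}_{5}$ are not spelled out in this section, so the code assumes a circular shift by five positions in the convention of `numpy.roll`; the helper names and the example probabilities are likewise illustrative.

```python
def roll(v, shift):
    """Circular shift with out[i] = v[(i - shift) % n] (numpy.roll convention, assumed)."""
    n = len(v)
    return [v[(i - shift) % n] for i in range(n)]

def expand_score(p, delta, shift=5):
    """Stack [delta; p * delta; roll_shift(p) * delta] into a vector of length 3D."""
    rolled = roll(p, shift)
    return (list(delta)
            + [pi * di for pi, di in zip(p, delta)]
            + [ri * di for ri, di in zip(rolled, delta)])

# Example with D = 10 classes: predicted probabilities and a one-hot label.
p = [0.05] * 8 + [0.2, 0.4]
y = [0.0] * 9 + [1.0]
delta = [pi - yi for pi, yi in zip(p, y)]   # cross entropy score p - y
expanded = expand_score(p, delta)           # length 3 * D = 30
```

Each block multiplies the same score elementwise by a quantity that depends on the input only through the forward pass, which is the property the conditional mean-zero argument relies on.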

## 8 Numerical experiments

Because our main contribution is conceptual and theoretical, we designed our experiments as controlled tests of the score broadcast principle within broadcast learning. BP with cross entropy is included as a reference optimizer; the main comparisons are to MSE-based EBD and DFA-based broadcast baselines. For CIFAR-10 (Krizhevsky, [2009](https://arxiv.org/html/2605.30638#bib.bib36)), we use the CNN architecture and training setup of Erdogan et al. ([2025](https://arxiv.org/html/2605.30638#bib.bib7)), replacing MSE by cross entropy where appropriate. Table [1](https://arxiv.org/html/2605.30638#S8.T1) shows that among broadcast methods, SBD with score expansion (SBD Exp) performs best: replacing the MSE residual by the cross entropy score improves EBD from 66.4% to 69.2%, and score expansion further improves performance to 70.0%. A 4× width run raises BP to 83.1% and SBD Exp to 74.5%, indicating that both benefit from additional capacity. Appendix [E.6](https://arxiv.org/html/2605.30638#A5.SS6) also reports that SBD updates have positive layerwise cosine similarity with exact BP gradients on the same minibatches, which supports the view that the local no-propagation approximation retains a first-order, descent-aligned component.

Table 1: Test accuracy (%; mean $\pm$ standard deviation over 5 independent seed runs) for the CIFAR-10 CNN experiment. BP (MSE), DFA (MSE), MS-GEVB, and EBD (MSE) are taken from Erdogan et al. ([2025](https://arxiv.org/html/2605.30638#bib.bib7)); BP (CE), DFA (CE), SBD (CE, Ours), and SBD Exp (CE, Ours) are obtained here under the same CNN setup with cross entropy and score broadcast, averaged over 5 independent seed runs. BP (CE) is the best result and SBD Exp (CE, Ours) is the second best.

| BP (MSE) | DFA (MSE) | GEVB (MSE) | EBD (MSE) | BP (CE) | DFA (CE) | SBD (CE, Ours) | SBD Exp (CE, Ours) |
|---|---|---|---|---|---|---|---|
| $75.2 \pm 0.3$ | $58.4 \pm 1.6$ | $61.57$ | $66.4 \pm 0.4$ | $\mathbf{78.5 \pm 0.5}$ | $65.3 \pm 1.2$ | $69.2 \pm 0.7$ | $\underline{70.0 \pm 0.7}$ |

Table 2: Test accuracy (%; mean $\pm$ standard deviation over 5 independent seed runs) for the Tiny ImageNet CNN experiment. BP (MSE) and EBD (MSE) report MSE-trained baselines under the same Tiny ImageNet setup; BP (CE), DFA (CE), SBD (CE, Ours), and SBD Exp (CE, Ours) are obtained here under the same CNN setup with cross entropy, with SBD using score broadcast. BP (CE) is the best result and SBD Exp (CE, Ours) is the second best.

| BP (MSE) | EBD (MSE) | BP (CE) | DFA (CE) | SBD (CE, Ours) | SBD Exp (CE, Ours) |
|---|---|---|---|---|---|
| $36.0 \pm 0.7$ | $18.5 \pm 0.7$ | $\mathbf{39.9 \pm 0.3}$ | $17.5 \pm 0.4$ | $28.3 \pm 0.4$ | $\underline{31.4 \pm 0.4}$ |

For Tiny ImageNet (Le and Yang, [2015](https://arxiv.org/html/2605.30638#bib.bib37)), we use the 200-class, 6-layer CNN setup described in Appendix [F](https://arxiv.org/html/2605.30638#A6). Table [2](https://arxiv.org/html/2605.30638#S8.T2) shows the same qualitative ordering on a harder benchmark. BP remains the strongest reference optimizer, while SBD improves substantially over DFA and EBD (MSE). Score expansion again gives the best broadcast result.

The experiments thus support the loss score as a better broadcast signal than the MSE residual under matched settings, with score vector expansion delivering consistent further gains across both benchmarks; the larger Tiny ImageNet improvement is consistent with its role of enriching the decorrelation directions of the layerwise objective. Beyond cross-entropy loss, Appendix [B.1](https://arxiv.org/html/2605.30638#A2.SS1) additionally reports a Poisson regression demonstration verifying the conditional-mean-zero property of Theorem [5](https://arxiv.org/html/2605.30638#Thmtheorem5) on an exponential-family NLL loss. Appendices [E](https://arxiv.org/html/2605.30638#A5)–[F](https://arxiv.org/html/2605.30638#A6) provide experimental details, and the code is included in the supplementary material for reproducibility.

## 9 Conclusions, limitations, and future work

We introduced *Score Broadcast and Decorrelation* (SBD), a broadcast credit assignment framework for general differentiable losses, in which the quantity broadcast to hidden layers is the *output score*, the gradient of the loss with respect to the network output. At the population optimum, the score is conditionally mean zero and hence orthogonal to any input-measurable feature, including hidden-layer activations. This principle applies across the standard differentiable-loss families, e.g., cross-entropy, Bregman divergences, and exponential-family negative log-likelihoods, giving a principled answer to what should be broadcast: a signal derived from the task loss itself. The framework supplies a unified theoretical grounding for the three-factor learning rule of computational neuroscience, with the neuromodulatory factor *derived* as the broadcast loss score rather than postulated; the MSE-based EBD update is recovered as one instance. We further introduce *score vector expansion*, which enriches the broadcast signal with deterministic modulators that preserve the population orthogonality and provide a richer set of decorrelation directions for the layerwise objective. Experiments on CIFAR-10 and Tiny ImageNet show that SBD improves over MSE-based and random-feedback broadcast rules, with consistent further gains from score vector expansion. The framework thus offers both a practical broadcast-learning algorithm and a theoretical lens on biologically observed neuromodulatory plasticity.

##### Limitations\.

The strongest results are population statements and rely on assumptions such as realizability or conditional\-mean\-zero scores, which need not hold exactly in finite models and finite data\. The practical hidden\-layer update also keeps the local no\-propagation approximation of EBD, so its dynamics do not generally coincide with exact backpropagation\. Finally, the experiments are controlled proof\-of\-concept studies on standard classification benchmarks rather than large\-scale evaluations\.

##### Future work\.

This line of work has two long\-term aims: developing broadcast\-based credit assignment as a practical alternative to backpropagation, and providing a theoretical foundation for biologically observed three\-factor learning rules driven by neuromodulatory signals\. The contribution of this paper is the orthogonality framework that supports both\. Natural next steps include scaling SBD to modern architectures, sharper characterization of when the local no\-propagation approximation aligns with exact gradients, data\-adaptive score\-vector\-expansion modulators with provable rank guarantees, and connections between the SBD orthogonality principle and biological measurements of neuromodulatory plasticity\.

## Acknowledgements

This work was supported by KUIS AI Center Research Award\. C\.P\. was supported by an NSF CAREER Award \(IIS\-2239780\) and a Sloan Research Fellowship\. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence\.

## References

- Rumelhart et al\. \(1986\) David E\. Rumelhart, Geoffrey E\. Hinton, and Ronald J\. Williams\. Learning representations by back\-propagating errors\. *Nature*, 323:533–536, 1986\.
- Crick \(1989\) Francis Crick\. The recent excitement about neural networks\. *Nature*, 337:129–132, 1989\.
- Lillicrap et al\. \(2020\) Timothy P\. Lillicrap, Adam Santoro, Luke Marris, Colin J\. Akerman, and Geoffrey Hinton\. Backpropagation and the brain\. *Nature Reviews Neuroscience*, 21\(6\):335–346, 2020\.
- Akrout et al\. \(2019\) Mohamed Akrout, Collin Wilson, Peter C\. Humphreys, Timothy Lillicrap, and Douglas Tweed\. Deep learning without weight transport\. In *Advances in Neural Information Processing Systems 32 \(NeurIPS\)*, pages 974–982, 2019\.
- Whittington and Bogacz \(2019\) James C\. R\. Whittington and Rafal Bogacz\. Theories of error back\-propagation in the brain\. *Trends in Cognitive Sciences*, 23\(3\):235–250, 2019\.
- Golkar et al\. \(2022\) S\. Golkar, T\. Tesileanu, Y\. Bahroun, A\. Sengupta, and D\. Chklovskii\. Constrained predictive coding as a biologically plausible model of the cortical hierarchy\. In *Advances in Neural Information Processing Systems 35 \(NeurIPS\)*, pages 14155–14169, 2022\.
- Erdogan et al\. \(2025\) Mete Erdogan, Cengiz Pehlevan, and Alper Tunga Erdogan\. Error Broadcast and Decorrelation as a Potential Artificial and Natural Learning Mechanism\. In *Advances in Neural Information Processing Systems 38 \(NeurIPS\)*, 2025\.
- Frémaux and Gerstner \(2016\) Nicolas Frémaux and Wulfram Gerstner\. Neuromodulated spike\-timing\-dependent plasticity, and theory of three\-factor learning rules\. *Frontiers in Neural Circuits*, 9:85, 2016\.
- Gerstner et al\. \(2018\) Wulfram Gerstner, Marco Lehmann, Vasiliki Liakoni, Dane Corneil, and Johanni Brea\. Eligibility traces and plasticity on behavioral time scales: experimental support of neo\-Hebbian three\-factor learning rules\. *Frontiers in Neural Circuits*, 12:53, 2018\.
- Kuśmierz et al\. \(2017\) Łukasz Kuśmierz, Takuya Isomura, and Taro Toyoizumi\. Learning with three factors: modulating Hebbian plasticity with errors\. *Current Opinion in Neurobiology*, 46:170–177, 2017\.
- Schultz \(1998\) Wolfram Schultz\. Predictive reward signal of dopamine neurons\. *Journal of Neurophysiology*, 80\(1\):1–27, 1998\.
- Ozsoy et al\. \(2022\) Serdar Ozsoy, Shadi Hamdan, Sercan Arik, Deniz Yuret, and Alper Erdogan\. Self\-Supervised Learning with an Information Maximization Criterion\. In *Advances in Neural Information Processing Systems 35 \(NeurIPS\)*, pages 35240–35253, 2022\.
- Bozkurt et al\. \(2023a\) Bariscan Bozkurt, Cengiz Pehlevan, and Alper T\. Erdogan\. Correlative Information Maximization: A Biologically Plausible Approach to Supervised Deep Neural Networks without Weight Symmetry\. In *Advances in Neural Information Processing Systems 36 \(NeurIPS\)*, pages 34928–34941, 2023\.
- Bozkurt et al\. \(2023b\) Bariscan Bozkurt, Ateş İsfendiyaroğlu, Cengiz Pehlevan, and Alper T\. Erdogan\. Correlative Information Maximization Based Biologically Plausible Neural Networks for Correlated Source Separation\. In *International Conference on Learning Representations \(ICLR\)*, 2023\.
- Scellier and Bengio \(2017\) Benjamin Scellier and Yoshua Bengio\. Equilibrium Propagation: Bridging the Gap between Energy\-Based Models and Backpropagation\. *Frontiers in Computational Neuroscience*, 11:24, 2017\.
- Scellier and Bengio \(2019\) Benjamin Scellier and Yoshua Bengio\. Equivalence of equilibrium propagation and recurrent backpropagation\. *Neural Computation*, 31\(2\):312–329, 2019\.
- Bengio \(2014\) Yoshua Bengio\. How Auto\-Encoders Could Provide Credit Assignment in Deep Networks via Target Propagation\. *CoRR*, abs/1407\.7906, 2014\.
- Lee et al\. \(2015\) Dong\-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio\. Difference Target Propagation\. In *Machine Learning and Knowledge Discovery in Databases*, pages 498–515, 2015\.
- Hinton \(2022\) Geoffrey E\. Hinton\. The Forward\-Forward Algorithm: Some Preliminary Investigations\. *CoRR*, abs/2212\.13345, 2022\.
- Rao and Ballard \(1999\) Rajesh P\. N\. Rao and Dana H\. Ballard\. Predictive coding in the visual cortex: a functional interpretation of some extra\-classical receptive\-field effects\. *Nature Neuroscience*, 2:79–87, 1999\.
- Whittington and Bogacz \(2017\) James C\. R\. Whittington and Rafal Bogacz\. An Approximation of the Error Backpropagation Algorithm in a Predictive Coding Network with Local Hebbian Synaptic Plasticity\. *Neural Computation*, 29\(5\):1229–1262, 2017\.
- Qin et al\. \(2021\) Shanshan Qin, Nayantara Mudur, and Cengiz Pehlevan\. Contrastive Similarity Matching for Supervised Learning\. *Neural Computation*, 33\(5\):1300–1328, 2021\.
- Lillicrap et al\. \(2016\) Timothy P\. Lillicrap, Daniel Cownden, Douglas B\. Tweed, and Colin J\. Akerman\. Random synaptic feedback weights support error backpropagation for deep learning\. *Nature Communications*, 7:13276, 2016\.
- Nøkland \(2016\) Arild Nøkland\. Direct Feedback Alignment Provides Learning in Deep Neural Networks\. In *Advances in Neural Information Processing Systems 29 \(NeurIPS\)*, pages 1037–1045, 2016\.
- Bartunov et al\. \(2018\) Sergey Bartunov, Adam Santoro, Blake Richards, Luke Marris, Geoffrey E\. Hinton, and Timothy Lillicrap\. Assessing the Scalability of Biologically\-Motivated Deep Learning Algorithms and Architectures\. In *Advances in Neural Information Processing Systems 31 \(NeurIPS\)*, 2018\.
- Han and Yoo \(2019\) Donghyeon Han and Hoi\-jun Yoo\. Efficient Convolutional Neural Network Training with Direct Feedback Alignment\. *CoRR*, abs/1901\.01986, 2019\.
- Launay et al\. \(2019\) Julien Launay, Iacopo Poli, and Florent Krzakala\. Principled Training of Neural Networks with Direct Feedback Alignment\. *CoRR*, abs/1906\.04554, 2019\.
- Launay et al\. \(2020\) Julien Launay, Iacopo Poli, François Boniface, and Florent Krzakala\. Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures\. In *Advances in Neural Information Processing Systems 33 \(NeurIPS\)*, 2020\.
- Bordelon and Pehlevan \(2023\) Blake Bordelon and Cengiz Pehlevan\. The Influence of Learning Rule on Representation Dynamics in Wide Neural Networks\. In *International Conference on Learning Representations \(ICLR\)*, 2023\.
- Clark et al\. \(2021\) David G\. Clark, L\. F\. Abbott, and SueYeon Chung\. Credit Assignment Through Broadcasting a Global Error Vector\. In *Advances in Neural Information Processing Systems 34 \(NeurIPS\)*, 2021\.
- Itakura and Saito \(1968\) Fumitada Itakura and Satoshi Saito\. Analysis synthesis telephony based on the maximum likelihood method\. In *Proceedings of the 6th International Congress on Acoustics*, pages C–17–C–20\. IEEE, 1968\.
- Févotte et al\. \(2009\) C\. Févotte, N\. Bertin, and J\.\-L\. Durrieu\. Nonnegative matrix factorization with the Itakura\-Saito divergence: With application to music analysis\. *Neural Computation*, 21\(3\):793–830, 2009\.
- Bregman \(1967\) L\. M\. Bregman\. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming\. *USSR Computational Mathematics and Mathematical Physics*, 7\(3\):200–217, 1967\.
- Gneiting and Raftery \(2007\) Tilmann Gneiting and Adrian E\. Raftery\. Strictly proper scoring rules, prediction, and estimation\. *Journal of the American Statistical Association*, 102\(477\):359–378, 2007\.
- Paszke et al\. \(2019\) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala\. PyTorch: An Imperative Style, High\-Performance Deep Learning Library\. In *Advances in Neural Information Processing Systems 32 \(NeurIPS\)*, pages 8024–8035, 2019\.
- Krizhevsky \(2009\) Alex Krizhevsky\. Learning Multiple Layers of Features from Tiny Images\. Technical Report, University of Toronto, 2009\.
- Le and Yang \(2015\) Ya Le and Xuan Yang\. Tiny ImageNet Visual Recognition Challenge\. CS 231N course project report, Stanford University, 2015\.
- Friedman \(1991\) Jerome H\. Friedman\. Multivariate adaptive regression splines\. *The Annals of Statistics*, 19\(1\):1–67, 1991\.
- Biewald \(2020\) Lukas Biewald\. Experiment tracking with Weights and Biases\. Software available from wandb\.com, 2020\.

Appendix

## Appendix A Deferred proofs for main\-text results

### A\.1 Proof of Theorem [1](https://arxiv.org/html/2605.30638#Thmtheorem1)

Since cross entropy is strictly proper, its population minimizer is the true conditional label distribution, i\.e\., $\mathbf{p}^{\star}(X)=\mathbf{q}(X)=\mathbb{E}[\mathbf{Y}\mid X]$\. Therefore,

$$\mathbb{E}\bigl[\mathbf{p}^{\star}(X)-\mathbf{Y}\mid X\bigr]=\mathbf{p}^{\star}(X)-\mathbb{E}[\mathbf{Y}\mid X]=\mathbf{q}(X)-\mathbf{q}(X)=\mathbf{0}.$$

Hence the residual associated with the optimal cross entropy predictor is conditionally mean zero\. $\square$

### A\.2 Proof of Lemma [1](https://arxiv.org/html/2605.30638#Thmlemma1)

The integrability assumption $\mathbb{E}\bigl[\|\mathbf{U}\,g(X)^{T}\|\bigr]<\infty$ ensures that the \(vector\-valued\) tower property applies\. Because $g(X)$ is measurable with respect to $X$, it can be pulled out of the conditional expectation\. Hence

$$\mathbb{E}\bigl[\mathbf{U}g(X)^{T}\bigr]=\mathbb{E}\Bigl[\mathbb{E}\bigl[\mathbf{U}g(X)^{T}\mid X\bigr]\Bigr]=\mathbb{E}\Bigl[\mathbb{E}[\mathbf{U}\mid X]\,g(X)^{T}\Bigr].$$

In particular, when $\mathbb{E}[\mathbf{U}\mid X]=\mathbf{0}$ the right\-hand side vanishes\. This is exactly the mechanism by which conditional mean\-zero leads to orthogonality with arbitrary functions of the input\. $\square$

### A\.3 Proof of Theorem [2](https://arxiv.org/html/2605.30638#Thmtheorem2)

By Theorem [1](https://arxiv.org/html/2605.30638#Thmtheorem1), $\mathbf{U}=\mathbf{p}^{\star}(X)-\mathbf{Y}$ satisfies $\mathbb{E}[\mathbf{U}\mid X]=\mathbf{0}$\. Since $\mathbf{p}^{\star}(X)\in\Delta_{D}$ and $\mathbf{Y}\in\{0,1\}^{D}$, the residual $\mathbf{U}$ is uniformly bounded, so any measurable $g$ with $\mathbb{E}\|g(X)\|<\infty$ automatically yields an integrable product $\mathbf{U}\,g(X)^{T}$\. Applying Lemma [1](https://arxiv.org/html/2605.30638#Thmlemma1) to this choice of $\mathbf{U}$ immediately gives

$$\mathbb{E}\bigl[(\mathbf{p}^{\star}(X)-\mathbf{Y})\,g(X)^{T}\bigr]=\mathbf{0}$$

for every such $g(X)$\. Substituting $g(X)=g^{(k)}(\mathbf{h}^{(k)}(X))$ yields the layerwise version used by the decorrelation framework\. $\square$
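The orthogonality statement can be checked numerically. The following is a minimal NumPy sketch under assumptions that are not part of the paper: a hypothetical three-class problem whose true conditional distribution is the softmax of a fixed linear map (`W_TRUE`), and an arbitrary feature map `g` built from the raw input and its `tanh`. It estimates $\mathbb{E}[(\mathbf{p}(X)-\mathbf{Y})\,g(X)^{T}]$ by Monte Carlo for the optimal predictor $\mathbf{p}^{\star}=\mathbf{q}$ and, for contrast, for a deliberately suboptimal uniform predictor.

```python
import numpy as np

# Hypothetical ground-truth logit map (an assumption for illustration only).
W_TRUE = np.array([[1.0, 0.0, -1.0],
                   [0.0, 1.0, -1.0]])


def softmax_rows(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def residual_feature_correlation(n_samples=200_000, seed=0):
    """Monte Carlo estimates of E[(p(X) - Y) g(X)^T] for two predictors."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, 2))
    q = softmax_rows(x @ W_TRUE)  # true conditional class probabilities q(X)

    # Sample labels Y ~ Categorical(q(X)) by inverting the row-wise CDF.
    u = rng.random(n_samples)
    labels = (u[:, None] > np.cumsum(q, axis=1)).sum(axis=1)
    labels = np.minimum(labels, q.shape[1] - 1)  # guard against round-off
    y = np.eye(q.shape[1])[labels]

    # Arbitrary measurable features g(X): the raw input and a tanh of it.
    g = np.concatenate([x, np.tanh(x)], axis=1)

    optimal = (q - y).T @ g / n_samples                  # p*(X) = q(X)
    uniform = (1.0 / q.shape[1] - y).T @ g / n_samples   # suboptimal predictor
    return optimal, uniform


if __name__ == "__main__":
    optimal, uniform = residual_feature_correlation()
    print("max |corr|, optimal predictor:", np.abs(optimal).max())
    print("max |corr|, uniform predictor:", np.abs(uniform).max())
```

In this toy setting the optimal-predictor correlations should shrink toward zero at the usual Monte Carlo rate, while the uniform predictor leaves a visible correlation between its residual and the input features.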

### A\.4 Proof of Proposition [1](https://arxiv.org/html/2605.30638#Thmproposition1)

At a stationary point, by definition, the gradient of the population risk with respect to the last\-layer weights must vanish\. Since

$$\nabla_{\mathbf{W}^{(L)}}\mathcal{R}=\mathbb{E}\bigl[\boldsymbol{\delta}\,\mathbf{h}^{(L-1)T}\bigr],$$

setting the gradient to zero gives the claimed orthogonality relation immediately\. $\square$

### A\.5 Proof of Theorem [3](https://arxiv.org/html/2605.30638#Thmtheorem3)

The proof is the same tower\-property argument used in the cross entropy case\. Since $g(X)$ is measurable with respect to $X$,

$$\mathbb{E}\bigl[\boldsymbol{\delta}^{\star}g(X)^{T}\bigr]=\mathbb{E}\Bigl[\mathbb{E}\bigl[\boldsymbol{\delta}^{\star}g(X)^{T}\mid X\bigr]\Bigr]=\mathbb{E}\Bigl[\mathbb{E}[\boldsymbol{\delta}^{\star}\mid X]\,g(X)^{T}\Bigr].$$

Using the hypothesis $\mathbb{E}[\boldsymbol{\delta}^{\star}\mid X]=\mathbf{0}$ gives

$$\mathbb{E}\bigl[\boldsymbol{\delta}^{\star}g(X)^{T}\bigr]=\mathbb{E}[\mathbf{0}\,g(X)^{T}]=\mathbf{0}.$$

Therefore the optimal score is orthogonal to every measurable function of the input\. $\square$

## Appendix B When does a loss yield a conditionally mean\-zero score?

This appendix section provides the full conditional\-risk characterization summarized in Section [6](https://arxiv.org/html/2605.30638#S6)\.

Theorem [3](https://arxiv.org/html/2605.30638#Thmtheorem3) shows that the generalized SBD framework has its full layerwise orthogonality property whenever the population\-optimal score satisfies $\mathbb{E}[\boldsymbol{\delta}^{\star}\mid X]=\mathbf{0}$\. The key question is therefore: when does a loss produce this conditional mean\-zero score? To perform this characterization, we look at the loss structure\.

For a fixed input value $x$, define the conditional risk

$$C_{x}(\mathbf{a})=\mathbb{E}\bigl[\mathcal{L}(\mathbf{Y},\mathbf{a})\mid X=x\bigr]. \tag{2}$$

Here $\mathbf{a}$ denotes the output parameter with respect to which the SBD score is taken, as we defined earlier\. Throughout the theorem below, $\mathbf{a}$ is assumed to lie in an open subset $\mathcal{A}\subseteq\mathbb{R}^{m}$\. For MSE loss, $\mathbf{a}$ is directly the Euclidean estimate\. For logit\-based classification or natural\-parameter likelihood models, $\mathbf{a}$ is the corresponding unconstrained output parameter\. If a loss is naturally defined on a constrained prediction variable, such as a probability vector $\mathbf{p}\in\Delta_{D}$, we treat it here by composing it with an unconstrained link $\mathbf{p}=\psi(\mathbf{a})$ and applying the theorem to

$$\mathcal{L}(\mathbf{Y},\mathbf{a})=\ell(\mathbf{Y},\psi(\mathbf{a})).$$

The direct constrained formulation is not needed for the SBD results in this paper and is left outside the main theorem\.

We note that the population risk is related to the conditional risk in Eq\. \([2](https://arxiv.org/html/2605.30638#A2.E2)\) simply through

$$\mathcal{R}(\mathbf{a})=\mathbb{E}_{X}\Bigl[\underbrace{\mathbb{E}\bigl[\mathcal{L}(\mathbf{Y},\mathbf{a}(X))\mid X\bigr]}_{\text{conditional risk }C_{X}(\mathbf{a}(X))}\Bigr]=\mathbb{E}_{X}\bigl[C_{X}(\mathbf{a}(X))\bigr].$$

Based on this relationship, the key insight is that minimizing $C_{x}(\mathbf{a})$ separately for every $x$ also minimizes the outer expectation, i\.e\., the population risk\. This is because an expectation is a nonnegatively weighted average, so no predictor can do better than one that minimizes every term\. Hence, to minimize the global population risk $\mathcal{R}(\mathbf{a})$, we can concentrate on minimizing the conditional risk $C_{x}(\mathbf{a})$ for every $x$, which is the basic principle behind the following theorem\.

###### Theorem 5 \(Conditional\-risk characterization in an unconstrained parameterization\)

Let $\mathcal{A}\subseteq\mathbb{R}^{m}$ be open, and let $\mathcal{L}(\mathbf{y},\mathbf{a})$ be differentiable in $\mathbf{a}\in\mathcal{A}$\. For almost every $x$, assume that the conditional risk $C_{x}(\mathbf{a})$ has an interior minimizer $\mathbf{a}^{\star}(x)\in\mathcal{A}$, and that differentiation may be interchanged with conditional expectation\. Then the optimal score

$$\boldsymbol{\delta}^{\star}=\nabla_{\mathbf{a}}\mathcal{L}(\mathbf{Y},\mathbf{a}^{\star}(X))$$

satisfies

$$\mathbb{E}[\boldsymbol{\delta}^{\star}\mid X]=\mathbf{0}.$$

If, in addition, $C_{x}(\mathbf{a})$ is convex in $\mathbf{a}$ for almost every $x$, then the converse also holds: any measurable predictor $\tilde{\mathbf{a}}(X)$ satisfying

$$\mathbb{E}\bigl[\nabla_{\mathbf{a}}\mathcal{L}(\mathbf{Y},\tilde{\mathbf{a}}(X))\mid X\bigr]=\mathbf{0}$$

minimizes $C_{x}(\mathbf{a})$ for almost every $x$; if $C_{x}(\mathbf{a})$ is strictly convex, this minimizer is unique almost surely\.

*Proof\.* The forward direction relies on the first\-order condition for optimality and on exchanging differentiation and expectation under the stated regularity assumption\. For each fixed input value $x$, differentiation of the conditional risk gives

$$\nabla_{\mathbf{a}}C_{x}(\mathbf{a})=\nabla_{\mathbf{a}}\mathbb{E}\bigl[\mathcal{L}(\mathbf{Y},\mathbf{a})\mid X=x\bigr].$$

If $\mathbf{a}^{\star}(x)$ is an interior minimizer of the differentiable function $C_{x}$, then its first\-order optimality condition is

$$\nabla_{\mathbf{a}}C_{x}(\mathbf{a}^{\star}(x))=\mathbf{0}.$$

Exchanging differentiation with expectation yields

$$\mathbb{E}\bigl[\nabla_{\mathbf{a}}\mathcal{L}(\mathbf{Y},\mathbf{a}^{\star}(x))\mid X=x\bigr]=\mathbf{0}$$

for almost every $x$, which is exactly $\mathbb{E}[\boldsymbol{\delta}^{\star}\mid X]=\mathbf{0}$\.

For the reverse direction, suppose $C_{x}$ is differentiable and convex and there exists a predictor $\tilde{\mathbf{a}}(x)$ that satisfies

$$\mathbb{E}\bigl[\nabla_{\mathbf{a}}\mathcal{L}(\mathbf{Y},\tilde{\mathbf{a}}(X))\mid X\bigr]=\mathbf{0}.$$

We show that $\tilde{\mathbf{a}}(x)$ minimizes $C_{x}$ for almost every $x$\.

First, we translate the score condition into a gradient condition on $C_{x}$\. Applying the gradient–expectation interchange used in the forward direction, but now evaluated at $\tilde{\mathbf{a}}(X)$ rather than at $\mathbf{a}^{\star}(X)$, gives

$$\nabla_{\mathbf{a}}C_{x}(\tilde{\mathbf{a}}(x))=\mathbb{E}\bigl[\nabla_{\mathbf{a}}\mathcal{L}(\mathbf{Y},\tilde{\mathbf{a}}(x))\mid X=x\bigr].$$

The right\-hand side is $\mathbf{0}$ for almost every $x$ by assumption\. Therefore

$$\nabla_{\mathbf{a}}C_{x}(\tilde{\mathbf{a}}(x))=\mathbf{0}\quad\text{for almost every }x.$$

As the next step, we use the fact that a zero gradient of a convex function characterizes a global minimizer\. Fix any $x$ for which $C_{x}$ is differentiable, convex, and satisfies $\nabla_{\mathbf{a}}C_{x}(\tilde{\mathbf{a}}(x))=\mathbf{0}$\. By the first\-order characterization of convexity, for every $\mathbf{a}\in\mathcal{A}$,

$$C_{x}(\mathbf{a})\geq C_{x}(\tilde{\mathbf{a}}(x))+\nabla_{\mathbf{a}}C_{x}(\tilde{\mathbf{a}}(x))^{T}(\mathbf{a}-\tilde{\mathbf{a}}(x))=C_{x}(\tilde{\mathbf{a}}(x))+\mathbf{0}^{T}(\mathbf{a}-\tilde{\mathbf{a}}(x))=C_{x}(\tilde{\mathbf{a}}(x)).$$

Hence $\tilde{\mathbf{a}}(x)$ is a global minimizer of $C_{x}$\. Since this holds for almost every $x$, the predictor $\tilde{\mathbf{a}}$ minimizes $C_{x}$ pointwise almost surely\.

Finally, in case of strict convexity of $C_{x}$, the inequality above is strict whenever $\mathbf{a}\neq\tilde{\mathbf{a}}(x)$:

$$C_{x}(\mathbf{a})>C_{x}(\tilde{\mathbf{a}}(x))\quad\text{for all }\mathbf{a}\in\mathcal{A}\setminus\{\tilde{\mathbf{a}}(x)\}.$$

Therefore $\tilde{\mathbf{a}}(x)$ is the *unique* minimizer of $C_{x}$, and the converse predictor is unique almost surely\. $\square$

The theorem clarifies why this condition is stronger than ordinary stationarity of a parameterized model $\mathbf{a}_{\theta}(X)$\. When optimization is carried out only over the parameter vector $\theta$, one generally obtains the weaker feature\-level orthogonality of Proposition [1](https://arxiv.org/html/2605.30638#Thmproposition1)\. The stronger identity $\mathbb{E}[\boldsymbol{\delta}^{\star}\mid X]=\mathbf{0}$ appears when the optimal prediction value is characterized directly from the conditional risk for each input\.

We now show that the standard differentiable\-loss families used in supervised learning all satisfy the conditional\-mean\-zero score property of Theorem [5](https://arxiv.org/html/2605.30638#Thmtheorem5): Bregman divergences \(covering MSE\), proper scoring rules through unconstrained links \(covering cross\-entropy\), and exponential\-family negative log\-likelihoods\.

##### Bregman losses\.

Bregman divergences \(Bregman, [1967](https://arxiv.org/html/2605.30638#bib.bib33)\) form a broad family of losses generated by a strictly convex potential function $\phi$\. Important members include MSE \(from $\phi(\mathbf{u})=\tfrac{1}{2}\|\mathbf{u}\|_{2}^{2}$\), the Itakura–Saito divergence \(Itakura and Saito, [1968](https://arxiv.org/html/2605.30638#bib.bib31); Févotte et al\., [2009](https://arxiv.org/html/2605.30638#bib.bib32)\), and the generalized KL divergence\. We show that for any such loss, the population\-optimal predictor is a conditional expectation, and the score is therefore conditionally mean zero\.

Let $\Omega\subseteq\mathbb{R}^{m}$ be open and convex, and let $\phi:\Omega\to\mathbb{R}$ be twice differentiable with positive definite Hessian $\nabla^{2}\phi$\. Let $\tau:\mathcal{Y}\to\Omega$ be a target encoding satisfying $\tau(\mathbf{Y})\in\Omega$ almost surely\. The associated Bregman divergence and Bregman loss are

$$D_{\phi}(\mathbf{u},\mathbf{v})=\phi(\mathbf{u})-\phi(\mathbf{v})-\nabla\phi(\mathbf{v})^{T}(\mathbf{u}-\mathbf{v}),\qquad\mathcal{L}_{\phi}(\mathbf{y},\mathbf{a})=D_{\phi}(\tau(\mathbf{y}),\mathbf{a}).$$

Differentiating with respect to the second argument gives

$$\nabla_{\mathbf{a}}\mathcal{L}_{\phi}(\mathbf{y},\mathbf{a})=\nabla^{2}\phi(\mathbf{a})\bigl(\mathbf{a}-\tau(\mathbf{y})\bigr).$$

Taking the conditional expectation over $\mathbf{Y}\mid X=x$ and using linearity yields the gradient of the conditional risk:

$$\nabla_{\mathbf{a}}C_{x}(\mathbf{a})=\nabla^{2}\phi(\mathbf{a})\Bigl(\mathbf{a}-\mathbb{E}[\tau(\mathbf{Y})\mid X=x]\Bigr).$$

Setting this equal to zero and using positive\-definiteness of $\nabla^{2}\phi(\mathbf{a})$ to invert it pointwise, the unique interior minimizer is

$$\mathbf{a}^{\star}(X)=\mathbb{E}[\tau(\mathbf{Y})\mid X].$$

This is precisely the form required by Theorem [5](https://arxiv.org/html/2605.30638#Thmtheorem5), so the optimal score $\boldsymbol{\delta}^{\star}=\nabla_{\mathbf{a}}\mathcal{L}_{\phi}(\mathbf{Y},\mathbf{a}^{\star}(X))$ satisfies $\mathbb{E}[\boldsymbol{\delta}^{\star}\mid X]=\mathbf{0}$\. The MSE case $\phi(\mathbf{u})=\tfrac{1}{2}\|\mathbf{u}\|_{2}^{2}$, $\tau(\mathbf{Y})=\mathbf{Y}$ gives $\mathbf{a}^{\star}(X)=\mathbb{E}[\mathbf{Y}\mid X]$ and $\boldsymbol{\delta}^{\star}=\mathbf{a}^{\star}(X)-\mathbf{Y}$, recovering the MMSE residual property used by EBD as a single Bregman instance\.
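The claim that every Bregman loss is minimized at the conditional mean can be illustrated in the scalar case. The sketch below is a toy check rather than anything from the paper: it assumes positive scalar targets drawn from a gamma distribution (standing in for $\tau(\mathbf{Y})$ at one fixed $x$), and the names `POTENTIALS`, `bregman_loss`, and `empirical_minimizers` are hypothetical. For three potentials it locates the grid minimizer of the empirical risk and compares it with the sample mean.

```python
import numpy as np

# (phi, phi') pairs for three scalar potentials on the positive half-line.
POTENTIALS = {
    "mse": (lambda u: 0.5 * u**2, lambda u: u),
    "generalized_kl": (lambda u: u * np.log(u) - u, lambda u: np.log(u)),
    "itakura_saito": (lambda u: -np.log(u), lambda u: -1.0 / u),
}


def bregman_loss(tau, a, phi, dphi):
    """D_phi(tau, a) = phi(tau) - phi(a) - phi'(a) * (tau - a)."""
    return phi(tau) - phi(a) - dphi(a) * (tau - a)


def empirical_minimizers(n=5000, seed=0):
    """Grid-search the minimizer of the empirical Bregman risk per potential."""
    rng = np.random.default_rng(seed)
    # Assumed toy targets: positive draws playing the role of tau(Y) | X = x.
    tau = rng.gamma(shape=3.0, scale=1.0, size=n)
    grid = np.linspace(0.5, 6.0, 1101)  # candidate predictions a, spacing 0.005
    minimizers = {}
    for name, (phi, dphi) in POTENTIALS.items():
        risk = np.array([bregman_loss(tau, a, phi, dphi).mean() for a in grid])
        minimizers[name] = grid[risk.argmin()]
    return tau.mean(), minimizers


if __name__ == "__main__":
    mean, minimizers = empirical_minimizers()
    print("sample mean of targets:", mean)
    for name, a_hat in minimizers.items():
        print(f"{name:>15s} minimizer on grid: {a_hat:.3f}")
```

All three grid minimizers should agree with the sample mean up to the grid spacing. This holds even though the Itakura–Saito loss is not convex in its second argument, because the gradient $\phi''(a)\,(a-\bar{\tau})$ changes sign only at the mean.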

##### Proper probability losses through unconstrained links\.

Probabilistic classification uses losses defined on the probability simplex, such as cross\-entropy \(log loss\) and the Brier score\. These losses share a strong optimality property: their conditional risks are minimized at the true conditional class\-probability vector\. The complication is that the simplex $\Delta_{D}$ is a constrained domain, so Theorem [5](https://arxiv.org/html/2605.30638#Thmtheorem5) does not apply directly\. The standard fix is to compose the simplex\-domain loss with an unconstrained link function \(e\.g\., softmax over logits\), which moves the problem into the open Euclidean setting where Theorem [5](https://arxiv.org/html/2605.30638#Thmtheorem5) applies\.

Let $\ell(\mathbf{y},\mathbf{p})$ be a differentiable strictly proper loss defined on the interior of the probability simplex, and let

$$\mathbf{q}(X)=\mathbb{E}[\mathbf{Y}\mid X]$$

denote the true conditional class\-probability vector, assumed to lie in the interior of the simplex almost surely\. Strict propriety means that, for each fixed $x$, the simplex\-domain conditional risk

$$\mathbf{p}\mapsto\mathbb{E}\bigl[\ell(\mathbf{Y},\mathbf{p})\mid X=x\bigr]$$

is uniquely minimized at $\mathbf{p}=\mathbf{q}(x)$\.

To bring this into the unconstrained setting required by Theorem [5](https://arxiv.org/html/2605.30638#Thmtheorem5), we reparameterize through a differentiable link $\psi:\mathbb{R}^{m}\to\mathrm{int}(\Delta_{D})$ that maps an unconstrained vector $\mathbf{a}$ to a probability vector $\mathbf{p}=\psi(\mathbf{a})$, and define the unconstrained loss

$$\mathcal{L}(\mathbf{y},\mathbf{a})=\ell(\mathbf{y},\psi(\mathbf{a})).$$

The SBD score is then taken with respect to the unconstrained variable:

$$\boldsymbol{\delta}=\nabla_{\mathbf{a}}\mathcal{L}(\mathbf{y},\mathbf{a}).$$

Suppose that for almost every $x$, the unconstrained conditional risk

$$C_{x}(\mathbf{a})=\mathbb{E}\bigl[\ell(\mathbf{Y},\psi(\mathbf{a}))\mid X=x\bigr]$$

has an interior minimizer $\mathbf{a}^{\star}(x)$ satisfying $\psi(\mathbf{a}^{\star}(x))=\mathbf{q}(x)$, and that the gradient–expectation interchange holds\. Then by Theorem [5](https://arxiv.org/html/2605.30638#Thmtheorem5),

$$\mathbb{E}\bigl[\nabla_{\mathbf{a}}\ell(\mathbf{Y},\psi(\mathbf{a}^{\star}(X)))\mid X\bigr]=\mathbf{0},$$

which is exactly the conditional\-mean\-zero score property used by SBD\.

Softmax cross\-entropy\. The canonical instance is $\psi=\operatorname{softmax}$, $\ell(\mathbf{y},\mathbf{p})=-\sum_{d=1}^{D}y_{d}\log p_{d}$\. A direct computation gives

$$\mathbf{p}=\operatorname{softmax}(\mathbf{a}),\qquad\nabla_{\mathbf{a}}\mathcal{L}_{\mathrm{CE}}(\mathbf{y},\mathbf{a})=\mathbf{p}-\mathbf{y}.$$

At the population optimum, $\mathbf{p}^{\star}(X)=\psi(\mathbf{a}^{\star}(X))=\mathbf{q}(X)$, so

$$\mathbb{E}\bigl[\mathbf{p}^{\star}(X)-\mathbf{Y}\mid X\bigr]=\mathbf{p}^{\star}(X)-\mathbf{q}(X)=\mathbf{0}.$$

This is the cross\-entropy score $\mathbf{p}-\mathbf{y}$ used throughout the main text, recovered as the unconstrained\-logit specialization of the proper\-loss construction\.
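The "direct computation" behind the score $\mathbf{p}-\mathbf{y}$ can be verified with a central finite\-difference check. This is a small illustrative sketch; the helper names `softmax`, `cross_entropy`, and `numerical_grad` are not from the paper, and the logit vector used in the demo is arbitrary.

```python
import numpy as np


def softmax(a):
    """Softmax of a 1-D logit vector, shifted by its max for stability."""
    e = np.exp(a - a.max())
    return e / e.sum()


def cross_entropy(y, a):
    """L_CE(y, a) = -sum_d y_d * log softmax(a)_d for a one-hot (or soft) y."""
    return -np.sum(y * np.log(softmax(a)))


def numerical_grad(y, a, eps=1e-6):
    """Central finite-difference approximation of the gradient in the logits."""
    grad = np.zeros_like(a)
    for i in range(a.size):
        step = np.zeros_like(a)
        step[i] = eps
        grad[i] = (cross_entropy(y, a + step) - cross_entropy(y, a - step)) / (2 * eps)
    return grad


if __name__ == "__main__":
    a = np.array([0.3, -1.2, 2.0, 0.5])   # arbitrary logits
    y = np.eye(4)[2]                      # one-hot label for class 2
    print("analytic  p - y :", softmax(a) - y)
    print("finite difference:", numerical_grad(y, a))
```

The two printed vectors should agree to several decimal places, which confirms the closed\-form score without relying on an automatic\-differentiation library.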

##### Negative log\-likelihood losses\.

Maximum\-likelihood estimation in a parametric family yields a third source of differentiable losses\. We show that whenever the parameter space is open and the conditional risk has an interior minimizer, the score is conditionally mean zero\. Furthermore, for exponential families, this conditional\-mean\-zero property reduces to the classical moment\-matching identity\.

Let $\{p_{\eta}:\eta\in\mathcal{H}\}$ be a differentiable parametric family of densities with open parameter space $\mathcal{H}\subseteq\mathbb{R}^{m}$, and define the negative log\-likelihood loss

$$\mathcal{L}_{\mathrm{NLL}}(\mathbf{y},\eta)=-\log p_{\eta}(\mathbf{y}).$$

The corresponding conditional risk is the conditional cross\-entropy from the true distribution to $p_{\eta}$,

$$C_{x}(\eta)=\mathbb{E}\bigl[-\log p_{\eta}(\mathbf{Y})\mid X=x\bigr].$$

Suppose $C_{x}$ has an interior minimizer $\eta^{\star}(x)\in\mathcal{H}$ for almost every $x$, and the gradient–expectation interchange holds\. Theorem [5](https://arxiv.org/html/2605.30638#Thmtheorem5) then gives

$$\mathbb{E}\bigl[-\nabla_{\eta}\log p_{\eta^{\star}(X)}(\mathbf{Y})\mid X\bigr]=\mathbf{0}.$$

The conditionally vanishing object on the left is the \(negated\) Fisher score evaluated at the population\-optimal parameter, i\.e\., the maximum\-likelihood\-estimator score conditioned on $X$\.

Exponential families: moment matching\. For an exponential family in canonical form,

$$p_{\eta}(\mathbf{y})=h(\mathbf{y})\exp\bigl(\eta^{T}T(\mathbf{y})-A(\eta)\bigr),$$

the score has a particularly clean form: $\nabla_{\eta}\log p_{\eta}(\mathbf{y})=T(\mathbf{y})-\nabla A(\eta)$, so

$$\boldsymbol{\delta}_{\mathrm{NLL}}=\nabla A(\eta)-T(\mathbf{Y}).$$

The conditional\-mean\-zero score condition $\mathbb{E}[\boldsymbol{\delta}_{\mathrm{NLL}}^{\star}\mid X]=\mathbf{0}$ then reads

$$\nabla A(\eta^{\star}(X))=\mathbb{E}[T(\mathbf{Y})\mid X],$$

which is exactly the classical moment\-matching condition for maximum\-likelihood estimation in exponential families: the model expected sufficient statistic equals the conditional expected sufficient statistic\. Important instances include Gaussian NLL \(with $T(\mathbf{Y})=\mathbf{Y}$, recovering the conditional\-mean predictor\), Bernoulli logistic loss, Poisson NLL, and multinomial softmax cross\-entropy when written in natural\-parameter coordinates\.
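As a worked instance of the moment\-matching identity, consider the Bernoulli logistic loss listed above. The derivation below is a standard textbook computation written out in this appendix's notation, not a result taken from the paper; $\sigma$ denotes the logistic sigmoid and $y\in\{0,1\}$.

$$
\begin{aligned}
p_{\eta}(y) &= \exp\bigl(\eta\,y-\log(1+e^{\eta})\bigr), &&\text{so } h(y)=1,\; T(y)=y,\; A(\eta)=\log(1+e^{\eta}),\\
\nabla A(\eta) &= \frac{e^{\eta}}{1+e^{\eta}}=\sigma(\eta), &&\\
\boldsymbol{\delta}_{\mathrm{NLL}} &= \sigma(\eta)-Y, &&\\
\mathbb{E}[\boldsymbol{\delta}_{\mathrm{NLL}}^{\star}\mid X]=0 &\iff \sigma(\eta^{\star}(X))=\mathbb{E}[Y\mid X]=\Pr(Y=1\mid X). &&
\end{aligned}
$$

Whenever $0<\Pr(Y=1\mid X)<1$, the optimal natural parameter is therefore the log\-odds $\eta^{\star}(X)=\log\frac{\Pr(Y=1\mid X)}{1-\Pr(Y=1\mid X)}$, and the broadcast score is the familiar logistic residual $\sigma(\eta)-Y$\.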

### B\.1 Proof\-of\-concept demonstration: Poisson NLL regression

The main paper’s experiments target cross\-entropy classification, which together with the MSE setting of Erdogan et al\. \([2025](https://arxiv.org/html/2605.30638#bib.bib7)\) covers two of the loss families admitted by SBD\. To illustrate the framework on an exponential\-family negative log\-likelihood outside these two, we run a small synthetic Poisson regression experiment\. This loss family is covered by Theorem [5](https://arxiv.org/html/2605.30638#Thmtheorem5) and Appendix [B](https://arxiv.org/html/2605.30638#A2) but has not previously been demonstrated empirically\. The goal is theoretical verification rather than a performance comparison: we check that \(i\) the conditional\-mean\-zero score property predicted by Theorem [5](https://arxiv.org/html/2605.30638#Thmtheorem5) holds at convergence, and \(ii\) SBD’s local update drives the network toward the same population optimum that BP reaches\.

#### B\.1\.1 Data\-generating process\.

We sample $X\sim\mathcal{N}(\mathbf{0},I_{8})$ and define the ground\-truth log\-rate

$$f^{\star}(x)=1.0+0.4\sin(x_{1})\cos(x_{2})+0.3\,x_{3}x_{4}-0.15(x_{5}^{2}-1)+0.10(x_{6}+x_{7})+0.10\tanh(x_{8}),$$

as a synthetic regression function in the style of Friedman 1 \(Friedman, [1991](https://arxiv.org/html/2605.30638#bib.bib38)\), so that $Y\mid X\sim\mathrm{Poisson}\bigl(\exp f^{\star}(X)\bigr)$\. We generate $50{,}000$ training and $10{,}000$ test samples, with $f^{\star}$ clipped to $[-1.5,3.5]$ to keep counts finite\. Because $f^{\star}$ is known explicitly, we can compute the irreducible Bayes test NLL $\mathcal{R}^{\star}=\mathbb{E}[\exp f^{\star}(X)-Y\,f^{\star}(X)]$ and the finite\-sample conditional\-mean\-zero \(CMZ\) floor \(see below\) as ground\-truth references\.

#### B\.1\.2 Loss and score\.

The Poisson NLL with log\-rate parameterization is $\mathcal{L}_{\mathrm{NLL}}(y,a)=\exp(a)-y\,a$ \(up to a $y$\-only constant\), giving the SBD score

$$\boldsymbol{\delta}(y,a)=\nabla_{a}\mathcal{L}_{\mathrm{NLL}}(y,a)=\exp(a)-y.$$

This is the canonical exponential\-family score $\nabla A(\eta)-T(Y)$ with $T(Y)=Y$ and $A(\eta)=\exp(\eta)$, matching the structure of Appendix [B](https://arxiv.org/html/2605.30638#A2)\. The conditional\-mean\-zero property at the optimum reduces to the moment\-matching identity $\exp(a^{\star}(X))=\mathbb{E}[Y\mid X]$\.

#### B.1.3 Derivation of the oracle Bayes NLL.

The Poisson NLL on a sample $(X,Y)$ in log-rate parameterization is, up to a $Y$-only constant,

$$\mathcal{L}_{\mathrm{NLL}}(Y,a)=\exp(a)-Y\,a.$$

The conditional risk for a fixed input $x$ is

$$C_{x}(a)=\mathbb{E}\bigl[\mathcal{L}_{\mathrm{NLL}}(Y,a)\mid X=x\bigr]=\exp(a)-\mathbb{E}[Y\mid X=x]\,a.$$

For the synthetic data-generating process $Y\mid X\sim\mathrm{Poisson}\bigl(\exp f^{\star}(X)\bigr)$, the conditional mean is $\mathbb{E}[Y\mid X=x]=\exp f^{\star}(x)$. Because $C_{x}$ is strictly convex in $a$, differentiating and setting the derivative to zero identifies its minimizer: $\exp(a^{\star}(x))=\exp f^{\star}(x)$, so the population-optimal predictor is $a^{\star}(x)=f^{\star}(x)$. Substituting back yields the conditional risk at the optimum

$$C_{x}(a^{\star}(x))=\exp f^{\star}(x)\cdot\bigl(1-f^{\star}(x)\bigr).$$

The oracle Bayes NLL is the marginal expectation of this conditional risk:

$$\mathcal{R}^{\star}=\mathbb{E}_{X}\!\bigl[C_{X}(a^{\star}(X))\bigr]=\mathbb{E}\!\bigl[\exp f^{\star}(X)-Y\,f^{\star}(X)\bigr],$$

where the second equality uses the tower property and $\mathbb{E}[Y\mid X]=\exp f^{\star}(X)$. Because $f^{\star}$ is known, we estimate $\mathcal{R}^{\star}$ by Monte Carlo on the test set, $\widehat{\mathcal{R}}^{\star}=N_{\mathrm{test}}^{-1}\sum_{n=1}^{N_{\mathrm{test}}}\bigl[\exp f^{\star}(X_{n})-Y_{n}\,f^{\star}(X_{n})\bigr]$, which converges to $\mathcal{R}^{\star}$ at rate $N_{\mathrm{test}}^{-1/2}$. For the test set of $10{,}000$ samples used here, this yields $\mathcal{R}^{\star}\approx-0.560$, the reference value against which the excess NLL in Figure [2](https://arxiv.org/html/2605.30638#A2.F2) is measured.
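The identity between the Monte Carlo estimate and the closed-form conditional risk can be checked numerically. The sketch below uses hypothetical stand-in log-rates drawn from a Gaussian (an assumption made here for brevity, not the paper's $f^{\star}$); it compares the sample average of $\exp(f)-Y f$ with the average of $\exp(f)\,(1-f)$ and reports a standard error consistent with the $N^{-1/2}$ rate.

```python
import numpy as np

def poisson_nll(y, a):
    """Poisson NLL in log-rate parameterization, up to a y-only constant."""
    return np.exp(a) - y * a

rng = np.random.default_rng(0)
n = 10_000
# Hypothetical log-rates standing in for f_star(X_n); not the paper's data.
f = rng.normal(1.0, 0.5, size=n)
y = rng.poisson(np.exp(f))

per_sample = poisson_nll(y, f)
mc_estimate = per_sample.mean()                    # Monte Carlo oracle NLL
std_error = per_sample.std(ddof=1) / np.sqrt(n)    # shrinks like n ** -0.5
closed_form = np.mean(np.exp(f) * (1.0 - f))       # average of C_x(a_star(x))

print(f"MC estimate {mc_estimate:.4f}, closed form {closed_form:.4f}, "
      f"standard error {std_error:.4f}")
```

The two averages differ only by the zero-mean term $f\,(\exp f - Y)$, so their gap should be on the order of the reported standard error.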

#### B.1.4 Network and methods.

We use a three-layer MLP with widths $8\to 128\to 64\to 1$, ReLU activations, bias terms, and Kaiming-He initialization. The network output is the scalar log-rate $a(x)$. Two methods are compared:

- BP: backpropagation of the exact Poisson NLL gradient.
- SBD: the rank-$1$ score broadcast of Section [5](https://arxiv.org/html/2605.30638#S5) (no score-vector expansion). The output layer uses the standard NLL gradient; the two hidden layers use the local SBD update with broadcast cross-correlation $\widehat{\mathbf{R}}^{(k)}$ projecting $\boldsymbol{\delta}=\exp(a)-y$ to each layer.

Both methods use Adam with initial learning rate $\eta_{0}=3\times 10^{-3}$, a per-epoch decay factor of $0.99$, weight decay $5\times 10^{-4}$, batch size $64$, and $200$ epochs. SBD uses $\lambda=0.99999$ for the broadcast exponential moving average (EMA). All results are reported over $5$ independent seeds.

#### B.1.5 Verification metrics.

We track three quantities per epoch.

- *Test Poisson NLL.* Convergence sanity check; the Bayes NLL $\mathcal{R}^{\star}$ is the irreducible lower bound.
- *Conditional-mean-zero metric (CMZ).* A binned estimate of $|\mathbb{E}[\boldsymbol{\delta}\mid X]|$. We bin the test set into $K=20$ quantile bins of the predicted log-rate $\hat{a}(X)$, compute $|\widehat{\mathbb{E}}[\boldsymbol{\delta}\mid\text{bin}_{k}]|$ within each bin, and average the absolute values. The *finite-sample CMZ floor* is the value this metric takes when $\hat{a}=f^{\star}$, attributable purely to test-set sampling noise; it is the lower bound a converged predictor can attain on a finite sample. For our test set the floor is approximately $0.063$.
- *Score-activation correlations.* Average absolute Pearson correlation between $\boldsymbol{\delta}$ and each hidden-layer activation, evaluated post-training. Theorem [3](https://arxiv.org/html/2605.30638#Thmtheorem3) predicts these are small at convergence.
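The binned CMZ metric described in the list above can be sketched as follows. This is one plausible reading of the description, not the authors' implementation: the helper name `binned_cmz`, the use of equal-count bins via sorting, and the toy Gaussian log-rates are all assumptions made here.

```python
import numpy as np

def binned_cmz(a_hat, y, num_bins=20):
    """Mean over quantile bins of |average Poisson score| within each bin."""
    delta = np.exp(a_hat) - y                  # score exp(a) - y per sample
    order = np.argsort(a_hat, kind="stable")   # sort by predicted log-rate
    bins = np.array_split(order, num_bins)     # near-equal-count quantile bins
    return float(np.mean([abs(delta[idx].mean()) for idx in bins]))

rng = np.random.default_rng(0)
# Hypothetical log-rates standing in for f_star(X); not the paper's data.
f_true = rng.normal(1.0, 0.5, size=10_000)
y = rng.poisson(np.exp(f_true))

floor = binned_cmz(f_true, y)           # finite-sample floor: a_hat = f_star
biased = binned_cmz(f_true + 0.3, y)    # a deliberately miscalibrated predictor
print(f"floor {floor:.3f}, biased predictor {biased:.3f}")
```

Even the oracle predictor gives a strictly positive value because each bin average carries sampling noise, which is why the text compares trained networks against a floor rather than against zero.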

##### Results.

Both methods converge to near-oracle test Poisson NLL: BP attains a final test NLL of approximately $-0.5491\pm 0.0008$ (mean $\pm$ std over the $5$ seeds) and SBD approximately $-0.5452\pm 0.0023$, against the Bayes lower bound $\mathcal{R}^{\star}=-0.5595$. Figure [2](https://arxiv.org/html/2605.30638#A2.F2) shows the test NLL excess over the oracle ($\mathcal{R}-\mathcal{R}^{\star}$) on a logarithmic scale, the natural display for measuring convergence to a known lower bound. Both BP and SBD descend by roughly two and a half orders of magnitude during training and plateau at small residual gaps, approximately $1\times 10^{-2}$ for BP and $1.5\times 10^{-2}$ for SBD.

![Refer to caption](https://arxiv.org/html/2605.30638v1/figures/excessnll.png)

Figure 2: Poisson proof-of-concept: test excess negative log-likelihood relative to the Bayes oracle, $\mathcal{R}-\mathcal{R}^{\star}$, on a logarithmic scale. Both BP and SBD converge close to the oracle, with SBD plateauing slightly above BP.

Figure [3](https://arxiv.org/html/2605.30638#A2.F3) shows the binned conditional-mean-zero metric versus epoch on a logarithmic scale. Both methods drive the metric down by more than an order of magnitude. BP's trajectory settles at $0.0726\pm 0.0146$, near the finite-sample CMZ floor of $0.063$, and SBD's settles at $0.0872\pm 0.0156$, slightly above it. The shapes of the CMZ trajectories closely track the shapes of the excess-NLL trajectories of Figure [2](https://arxiv.org/html/2605.30638#A2.F2), which is the empirical signature of the prediction of Theorem [5](https://arxiv.org/html/2605.30638#Thmtheorem5) that NLL convergence and CMZ convergence proceed together. Score-activation correlations at the trained network are small for both methods ($\sim 0.01$ in both hidden layers), consistent with the orthogonality property of Theorem [3](https://arxiv.org/html/2605.30638#Thmtheorem3) at the empirical optimum.

![Refer to caption](https://arxiv.org/html/2605.30638v1/figures/cmz.png)

Figure 3: Poisson proof-of-concept: binned conditional-mean-zero metric versus epoch on a logarithmic scale. Both BP and SBD approach the finite-sample CMZ floor, with SBD settling slightly above it.

The main observation is that the conditional-mean-zero score property predicted by Theorem [5](https://arxiv.org/html/2605.30638#Thmtheorem5) holds empirically for an exponential-family NLL loss, with both an exact-gradient method (BP) and the SBD local broadcast update driving the metric toward its irreducible floor in lock-step with the NLL convergence. SBD comes close to the population optimum on this task without computing exact gradients for the hidden layers, providing a proof-of-concept demonstration of the framework on a loss outside the cross-entropy and MSE families covered in the main paper.

## Appendix C Cross entropy SBD algorithm

This appendix collects the cross entropy SBD procedure used in Section [4](https://arxiv.org/html/2605.30638#S4) into a single self-contained reference. Table [3](https://arxiv.org/html/2605.30638#A3.T3) lists the per-minibatch operations performed at every hidden layer $k$, where $\mathbf{h}^{(k)}\in\mathbb{R}^{N^{(k)}}$ is the layer-$k$ activation, $\mathbf{a}\in\mathbb{R}^{D}$ are the output logits, $\mathbf{p}=\operatorname{softmax}(\mathbf{a})$, and $\boldsymbol{\delta}_{\mathrm{CE}}=\mathbf{p}-\mathbf{y}$ is the cross entropy output score. The layerwise correlation estimate $\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}\in\mathbb{R}^{N^{(k)}\times D}$ tracks the running cross-correlation between the hidden feature map $g^{(k)}(\mathbf{h}^{(k)})$ and the score, and supplies the broadcast modulator $\mathbf{q}_{\mathrm{CE}}^{(k)}$ used in the local three-factor weight update. The output layer is trained with the standard cross entropy gradient.

Table 3: Cross entropy SBD algorithm.

| Step | Operation |
| --- | --- |
| 1 | Perform a forward pass to compute the hidden activations $\mathbf{h}^{(k)}$, the logits $\mathbf{a}$, the probabilities $\mathbf{p}=\operatorname{softmax}(\mathbf{a})$, and the score $\boldsymbol{\delta}_{\mathrm{CE}}=\mathbf{p}-\mathbf{y}$. |
| 2 | Form the batch score matrix $\mathbf{\Delta}_{\mathrm{CE}}[m]=\mathbf{P}[m]-\mathbf{Y}[m]$ and update the layerwise correlation estimate $\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m]=\lambda\,\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m-1]+\tfrac{1-\lambda}{B}\,\mathbf{G}^{(k)}[m]\,\mathbf{\Delta}_{\mathrm{CE}}[m]^{T}$. |
| 3 | Project the broadcast score to hidden layer $k$ by computing $\mathbf{q}_{\mathrm{CE}}^{(k)}[m]=\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m]\,(\mathbf{p}[m]-\mathbf{y}[m])$. |
| 4 | Update the hidden-layer parameters with the local three-factor rule $\Delta W_{ij}^{(k)}[m]=\zeta\,g_{i}^{\prime(k)}\!\bigl(h_{i}^{(k)}[m]\bigr)\,f^{\prime(k)}\!\bigl(u_{i}^{(k)}[m]\bigr)\,q_{\mathrm{CE},i}^{(k)}[m]\,h_{j}^{(k-1)}[m]$, and update the biases analogously. |
| 5 | Update the output layer with the standard cross entropy gradient. |

### C.1 Derivation of the SBD update rule

This subsection derives the practical hidden-layer SBD update for cross entropy by following the same decomposition used in Erdogan et al. ([2025](https://arxiv.org/html/2605.30638#bib.bib7)).

For minibatch $m$, let

$$\begin{aligned}
\mathbf{H}^{(k-1)}[m]&=\bigl[\mathbf{h}^{(k-1)}[mB+1],\ldots,\mathbf{h}^{(k-1)}[(m+1)B]\bigr],\\
\mathbf{G}^{(k)}[m]&=\bigl[g^{(k)}(\mathbf{h}^{(k)}[mB+1]),\ldots,g^{(k)}(\mathbf{h}^{(k)}[(m+1)B])\bigr],\\
\mathbf{\Delta}_{\mathrm{CE}}[m]&=\bigl[\boldsymbol{\delta}_{\mathrm{CE}}[mB+1],\ldots,\boldsymbol{\delta}_{\mathrm{CE}}[(m+1)B]\bigr],
\end{aligned}$$

where $\boldsymbol{\delta}_{\mathrm{CE}}[n]=\mathbf{p}[n]-\mathbf{y}[n]$. Treating $\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m-1]$ as a fixed state when differentiating the current minibatch objective, define

$$\alpha:=\frac{1-\lambda}{B},\qquad\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m]=\lambda\,\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m-1]+\alpha\,\mathbf{G}^{(k)}[m]\,\mathbf{\Delta}_{\mathrm{CE}}[m]^{T},$$

and recall that

$$\mathcal{J}_{\mathrm{CE}}^{(k)}[m]=\frac{1}{2}\left\|\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m]\right\|_{F}^{2}=\frac{1}{2}\,\mathrm{Tr}\!\left(\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m]^{T}\,\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m]\right).$$

Hence, for a hidden-layer weight $W_{ij}^{(k)}$,

$$\begin{aligned}
\frac{\partial\mathcal{J}_{\mathrm{CE}}^{(k)}[m]}{\partial W_{ij}^{(k)}}&=\mathrm{Tr}\!\left(\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m]^{T}\,\frac{\partial\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m]}{\partial W_{ij}^{(k)}}\right)\\
&=\alpha\,\mathrm{Tr}\!\left(\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m]^{T}\,\frac{\partial\mathbf{G}^{(k)}[m]}{\partial W_{ij}^{(k)}}\,\mathbf{\Delta}_{\mathrm{CE}}[m]^{T}\right)+\alpha\,\mathrm{Tr}\!\left(\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m]^{T}\,\mathbf{G}^{(k)}[m]\,\frac{\partial\mathbf{\Delta}_{\mathrm{CE}}[m]^{T}}{\partial W_{ij}^{(k)}}\right).
\end{aligned}$$

The first term depends only on the local Jacobian of layer $k$, whereas the second term depends on $\partial\mathbf{\Delta}_{\mathrm{CE}}[m]/\partial W_{ij}^{(k)}$ and therefore on how the logits change through all deeper layers. Indeed, because the label vector is constant with respect to the hidden-layer parameters,

$$\frac{\partial\boldsymbol{\delta}_{\mathrm{CE}}[n]}{\partial W_{ij}^{(k)}}=\frac{\partial\mathbf{p}[n]}{\partial W_{ij}^{(k)}}=\left(\operatorname{diag}(\mathbf{p}[n])-\mathbf{p}[n]\,\mathbf{p}[n]^{T}\right)\frac{\partial\mathbf{a}[n]}{\partial W_{ij}^{(k)}},$$

and $\partial\mathbf{a}[n]/\partial W_{ij}^{(k)}$ expands into the chain of Jacobians through layers $k+1,\ldots,L$. This is the cross entropy analog of the backpropagation-like term identified in EBD; the practical SBD rule drops it and retains only the local, propagation-free contribution.
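The softmax Jacobian $\operatorname{diag}(\mathbf{p})-\mathbf{p}\mathbf{p}^{T}$ that appears in the nonlocal term can be verified with a small finite-difference check. The sketch below is illustrative only; the logit values and helper names are arbitrary choices made here.

```python
import numpy as np

def softmax(a):
    z = np.exp(a - a.max())        # subtract the max for numerical stability
    return z / z.sum()

def softmax_jacobian(a):
    """Analytic Jacobian dp/da = diag(p) - p p^T."""
    p = softmax(a)
    return np.diag(p) - np.outer(p, p)

a = np.array([0.5, -1.0, 2.0, 0.0])    # arbitrary example logits
eps = 1e-6
# Column j holds the central-difference estimate of dp / da_j.
numeric = np.column_stack([
    (softmax(a + eps * e) - softmax(a - eps * e)) / (2 * eps)
    for e in np.eye(a.size)
])
analytic = softmax_jacobian(a)
print(np.max(np.abs(numeric - analytic)))
```

Each column of the Jacobian sums to zero because the probabilities always sum to one, so any logit perturbation only redistributes mass across classes.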

Define the layerwise projected-score matrix

$$\mathbf{Q}_{\mathrm{CE}}^{(k)}[m]:=\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}[m]\,\mathbf{\Delta}_{\mathrm{CE}}[m]=\bigl[\mathbf{q}_{\mathrm{CE}}^{(k)}[mB+1],\ldots,\mathbf{q}_{\mathrm{CE}}^{(k)}[(m+1)B]\bigr].$$

Using cyclicity of the trace and its invariance under transposition, the retained local term becomes

$$\left.\frac{\partial\mathcal{J}_{\mathrm{CE}}^{(k)}[m]}{\partial W_{ij}^{(k)}}\right|_{\mathrm{local}}=\alpha\,\mathrm{Tr}\!\left(\mathbf{Q}_{\mathrm{CE}}^{(k)}[m]\,\frac{\partial\mathbf{G}^{(k)}[m]^{T}}{\partial W_{ij}^{(k)}}\right).$$

Let $\mathbf{e}_{i}\in\mathbb{R}^{N^{(k)}}$ denote the $i$th canonical basis vector. Then

$$\frac{\partial\mathbf{G}^{(k)}[m]}{\partial W_{ij}^{(k)}}=\mathbf{e}_{i}\begin{bmatrix}g_{i}^{\prime(k)}\!\left(h_{i}^{(k)}[mB+1]\right)f^{\prime(k)}\!\left(u_{i}^{(k)}[mB+1]\right)h_{j}^{(k-1)}[mB+1]\\ \vdots\\ g_{i}^{\prime(k)}\!\left(h_{i}^{(k)}[(m+1)B]\right)f^{\prime(k)}\!\left(u_{i}^{(k)}[(m+1)B]\right)h_{j}^{(k-1)}[(m+1)B]\end{bmatrix}^{T}.$$

Substituting this derivative yields

$$\left.\frac{\partial\mathcal{J}_{\mathrm{CE}}^{(k)}[m]}{\partial W_{ij}^{(k)}}\right|_{\mathrm{local}}=\alpha\sum_{n=mB+1}^{(m+1)B}g_{i}^{\prime(k)}\!\left(h_{i}^{(k)}[n]\right)f^{\prime(k)}\!\left(u_{i}^{(k)}[n]\right)q_{\mathrm{CE},i}^{(k)}[n]\,h_{j}^{(k-1)}[n].$$

An analogous calculation for the bias gives

$$\left.\frac{\partial\mathcal{J}_{\mathrm{CE}}^{(k)}[m]}{\partial b_{i}^{(k)}}\right|_{\mathrm{local}}=\alpha\sum_{n=mB+1}^{(m+1)B}g_{i}^{\prime(k)}\!\left(h_{i}^{(k)}[n]\right)f^{\prime(k)}\!\left(u_{i}^{(k)}[n]\right)q_{\mathrm{CE},i}^{(k)}[n].$$

To write the update compactly, define the derivative matrices

$$\begin{aligned}
\mathbf{G}_{d}^{(k)}[m]&=\bigl[\mathbf{g}^{\prime(k)}(\mathbf{h}^{(k)}[mB+1]),\ldots,\mathbf{g}^{\prime(k)}(\mathbf{h}^{(k)}[(m+1)B])\bigr],\\
\mathbf{F}_{d}^{(k)}[m]&=\bigl[\mathbf{f}^{\prime(k)}(\mathbf{u}^{(k)}[mB+1]),\ldots,\mathbf{f}^{\prime(k)}(\mathbf{u}^{(k)}[(m+1)B])\bigr],
\end{aligned}$$

where the derivatives are applied elementwise, and set

$$\mathbf{Z}_{\mathrm{CE}}^{(k)}[m]=\mathbf{G}_{d}^{(k)}[m]\odot\mathbf{F}_{d}^{(k)}[m]\odot\mathbf{Q}_{\mathrm{CE}}^{(k)}[m].$$

Then the retained minibatch-local gradients are

$$\left.\nabla_{\mathbf{W}^{(k)}}\mathcal{J}_{\mathrm{CE}}^{(k)}[m]\right|_{\mathrm{local}}=\alpha\,\mathbf{Z}_{\mathrm{CE}}^{(k)}[m]\,\mathbf{H}^{(k-1)}[m]^{T},\qquad\left.\nabla_{\mathbf{b}^{(k)}}\mathcal{J}_{\mathrm{CE}}^{(k)}[m]\right|_{\mathrm{local}}=\alpha\,\mathbf{Z}_{\mathrm{CE}}^{(k)}[m]\,\mathbf{1}_{B\times 1}.$$

Using the same sign convention as in the main text and absorbing the scalar $\alpha$ into the update coefficient $\zeta$, the single-sample ($B=1$) hidden-layer rule becomes

$$\Delta W_{ij}^{(k)}[n]=\zeta\,g_{i}^{\prime(k)}\!\left(h_{i}^{(k)}[n]\right)f^{\prime(k)}\!\left(u_{i}^{(k)}[n]\right)q_{\mathrm{CE},i}^{(k)}[n]\,h_{j}^{(k-1)}[n],$$

with the bias update obtained by omitting the presynaptic factor $h_{j}^{(k-1)}[n]$. This is exactly the cross entropy SBD rule reported in Section [4](https://arxiv.org/html/2605.30638#S4).

For general $B$, the update can be written as

$$\Delta W_{ij}^{(k)}[m]=\frac{\zeta}{B}\sum_{n=mB+1}^{mB+B}g_{i}^{\prime(k)}\!\left(h_{i}^{(k)}[n]\right)f^{\prime(k)}\!\left(u_{i}^{(k)}[n]\right)q_{\mathrm{CE},i}^{(k)}[n]\,h_{j}^{(k-1)}[n].$$
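The matrix form of the retained local gradients can be sketched in NumPy for a single hidden layer and minibatch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sbd_local_grads`, the tanh activation, the identity feature map $g$, the toy sizes, and $\lambda=0.9$ are all hypothetical choices made here, and the score matrix is a random stand-in rather than the output of a trained network. How the result is scaled by $\zeta$ and applied follows the main text's sign convention and is not encoded below.

```python
import numpy as np

def sbd_local_grads(W, b, H_prev, Delta, R_prev, lam, f, f_prime, g, g_prime):
    """Retained local gradients for one hidden layer; columns index samples."""
    B = H_prev.shape[1]
    alpha = (1.0 - lam) / B
    U = W @ H_prev + b[:, None]               # pre-activations u, shape (N, B)
    H = f(U)                                  # activations h
    G = g(H)                                  # feature map g(h)
    R = lam * R_prev + alpha * G @ Delta.T    # EMA correlation update, (N, D)
    Q = R @ Delta                             # projected score, (N, B)
    Z = g_prime(H) * f_prime(U) * Q           # elementwise modulator
    grad_W = alpha * Z @ H_prev.T             # alpha * Z H^T
    grad_b = alpha * Z.sum(axis=1)            # alpha * Z 1
    return grad_W, grad_b, R

rng = np.random.default_rng(0)
N_prev, N, D, B, lam = 5, 4, 3, 6, 0.9        # toy sizes (assumed)
W = 0.5 * rng.standard_normal((N, N_prev))
b = np.zeros(N)
H_prev = rng.standard_normal((N_prev, B))
logits = rng.standard_normal((D, B))          # stand-in logits, held fixed
P = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
Y = np.eye(D)[:, rng.integers(0, D, size=B)]  # one-hot label columns
Delta = P - Y                                 # cross entropy score matrix
R_prev = 0.1 * rng.standard_normal((N, D))

grad_W, grad_b, R_new = sbd_local_grads(
    W, b, H_prev, Delta, R_prev, lam,
    f=np.tanh, f_prime=lambda u: 1.0 - np.tanh(u) ** 2,
    g=lambda h: h, g_prime=np.ones_like)
```

Because the score matrix is held fixed, `grad_W` equals the exact gradient of $\tfrac{1}{2}\|\widehat{\mathbf{R}}\|_{F}^{2}$ with the nonlocal term removed, which is what makes a finite-difference comparison against the objective a useful sanity check.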

### C.2 Extension to convolutional layers

The MLP derivation of the previous subsection extends naturally to convolutional architectures. Following Erdogan et al. ([2025](https://arxiv.org/html/2605.30638#bib.bib7), Appendix D), which gives the corresponding extension for the EBD update under MSE, we adapt the construction to the SBD setting in which the broadcast signal is the loss score $\boldsymbol{\delta}$ rather than the MSE residual.

Let $\mathbf{H}^{(k)}\in\mathbb{R}^{P^{(k)}\times M^{(k)}\times N^{(k)}}$ denote the output of the $k$th convolutional layer, where $P^{(k)}$ is the number of channels and $M^{(k)}\times N^{(k)}$ is the spatial map size. For channel $p$, the filter tensor is $\mathbf{W}^{(k,p)}\in\mathbb{R}^{P^{(k-1)}\times\Omega^{(k)}\times\Omega^{(k)}}$ with bias $\mathbf{b}^{(k,p)}\in\mathbb{R}$, and the convolutional layer output is

$$\mathbf{H}^{(k,p)}=f\!\bigl(\mathcal{U}^{(k,p)}\bigr),\qquad\mathcal{U}^{(k,p)}=\bigl(\mathbf{H}^{(k-1)}\ast\mathbf{W}^{(k,p)}\bigr)+\mathbf{b}^{(k,p)},$$

where $\ast$ is the (channel-wise and spatial) convolution operation.

##### Score-broadcast objective.

Let $\boldsymbol{\delta}\in\mathbb{R}^{D_{\mathrm{out}}}$ denote the SBD output score (or the expanded score $\widetilde{\boldsymbol{\delta}}$ when score-vector expansion is used; see Appendix [D](https://arxiv.org/html/2605.30638#A4)). For each channel $p$ and each spatial location $(r,s)$, the score-feature cross-correlation is

$$\mathbf{R}^{(k,p)}_{\mathbf{g}\boldsymbol{\delta}}[q,r,s]=\mathbb{E}\bigl[\,\mathbf{g}^{(k)}\!\bigl(\mathbf{H}^{(k,p)}[r,s]\bigr)\,\delta_{q}\,\bigr],\qquad q=1,\ldots,D_{\mathrm{out}}.$$

The conditional-mean-zero score property $\mathbb{E}[\boldsymbol{\delta}^{\star}\mid X]=\mathbf{0}$ established in Theorem [3](https://arxiv.org/html/2605.30638#Thmtheorem3) implies that this cross-correlation vanishes at the population optimum. The layerwise score-broadcast objective at minibatch $m$ is

$$\mathcal{J}^{(k)}_{\mathrm{Score}}[m]=\frac{1}{2}\sum_{q=1}^{D_{\mathrm{out}}}\bigl\|\widehat{\mathbf{R}}^{(k,p)}_{\mathbf{g}\boldsymbol{\delta}}[m,q,:,:]\bigr\|_{F}^{2},$$

with $\widehat{\mathbf{R}}^{(k,p)}_{\mathbf{g}\boldsymbol{\delta}}$ maintained by the same exponentially weighted moving-average estimator as in the MLP case.

##### Weight update.

Differentiating $\mathcal{J}^{(k)}_{\mathrm{Score}}[m]$ with respect to the weight tensor $\mathbf{W}^{(k,p)}_{h,i,j}$ (input channel $h$, kernel spatial indices $(i,j)$) and applying the chain rule through the convolution yields, after the same manipulations as in Erdogan et al. ([2025](https://arxiv.org/html/2605.30638#bib.bib7), Appendix D.1) with the MSE residual replaced by the loss score, the update

$$\frac{\partial\mathcal{J}^{(k)}_{\mathrm{Score}}[m]}{\partial\mathbf{W}^{(k,p)}_{h}}=\zeta\sum_{n=mB+1}^{(m+1)B}\bigl(\boldsymbol{\phi}[n,p,:,:]\ast\mathbf{H}^{(k-1,h)}[n,:,:]\bigr),$$

where the per-sample postsynaptic modulator is

$$\boldsymbol{\phi}[n,p,:,:]=\sum_{q=1}^{D_{\mathrm{out}}}\delta_{q}[n]\,\cdot\,\Bigl(\widehat{\mathbf{R}}^{(k,p)}_{\mathbf{g}\boldsymbol{\delta}}[m,q,:,:]\odot\mathbf{g}^{(k)}\!\bigl(\mathbf{H}^{(k,p)}[n,:,:]\bigr)\odot f^{\prime}\!\bigl(\mathcal{U}^{(k,p)}[n,:,:]\bigr)\Bigr).$$

The bias update is

$$\frac{\partial\mathcal{J}^{(k)}_{\mathrm{Score}}[m]}{\partial\mathbf{b}^{(k,p)}}=\zeta\sum_{n=mB+1}^{(m+1)B}\sum_{r,s}\boldsymbol{\phi}[n,p,r,s].$$

##### Three-factor structure.

The convolutional update preserves the three-factor structure of the MLP case: $\mathbf{H}^{(k-1,h)}$ is the presynaptic activity (spatially shifted), $\mathbf{g}^{(k)}\odot f^{\prime}$ is the postsynaptic sensitivity, and the score-projected term $\sum_{q}\delta_{q}\cdot\widehat{\mathbf{R}}^{(k,p)}_{\mathbf{g}\boldsymbol{\delta}}$ is the broadcast modulator. The same three-factor structure depicted in Figure [1(a)](https://arxiv.org/html/2605.30638#S1.F1.sf1) therefore applies to convolutional layers, with the convolution operator replacing the matrix product of the MLP case.

##### Auxiliary regularization.

For convolutional layers, the layer-entropy regularizer of Section [E.1](https://arxiv.org/html/2605.30638#A5.SS1) becomes computationally cumbersome because the activation tensor has multiple spatial dimensions. Following Erdogan et al. ([2025](https://arxiv.org/html/2605.30638#bib.bib7), Appendix D.2), we replace it by the weight-entropy objective

$$J^{(k)}_{E}\bigl(\mathbf{W}^{(k)}\bigr)=\tfrac{1}{2}\log\det\!\bigl(\mathbf{R}_{\overline{\mathbf{W}}^{(k)}}+\eta I\bigr),$$

where $\overline{\mathbf{W}}^{(k)}$ is the unraveling of the filter tensor into a $P^{(k)}\times P^{(k-1)}(\Omega^{(k)})^{2}$ matrix. The CIFAR-10 and Tiny ImageNet experiments in this paper set $c^{(k)}_{\mathrm{cov}}=0$ on convolutional layers (see Tables [7](https://arxiv.org/html/2605.30638#A5.T7) and [10](https://arxiv.org/html/2605.30638#A6.T10)), so this regularizer is not active in the reported runs; we include it here for completeness and to match the Erdogan et al. ([2025](https://arxiv.org/html/2605.30638#bib.bib7)) formulation.

### C.3 Update complexity

We summarize the per-minibatch update cost of the rule in Table [3](https://arxiv.org/html/2605.30638#A3.T3) and compare it to backpropagation (BP) and Direct Feedback Alignment (DFA) under a fully connected layer model. Let $L$ denote the number of layers, $N^{(k)}$ the width of layer $k$ (so $N^{(0)}$ is the input dimension and $N^{(L)}=D$ the number of classes), $B$ the minibatch size, and $D$ the output-score dimension; for the score-expansion variant of Section [7](https://arxiv.org/html/2605.30638#S7) the score dimension is replaced by $D_{\varepsilon}=MD$.

The forward pass is shared by all three methods and costs $\mathcal{O}\!\bigl(B\sum_{k=1}^{L}N^{(k)}N^{(k-1)}\bigr)$. The local outer-product weight update at layer $k$, of cost $\mathcal{O}\!\bigl(B\,N^{(k)}N^{(k-1)}\bigr)$, is also identical in form across the three methods; the methods differ only in how the postsynaptic modulator at layer $k$ is produced. We focus on this modulator cost.

##### BP.

The error signal $\boldsymbol{\delta}^{(k)}$ at hidden layer $k$ is obtained by propagating $\boldsymbol{\delta}^{(k+1)}$ through $\mathbf{W}^{(k+1)\,T}$, costing $\mathcal{O}\!\bigl(B\,N^{(k)}N^{(k+1)}\bigr)$ per layer. The total backward-pass cost is therefore $\mathcal{O}\!\bigl(B\sum_{k=1}^{L-1}N^{(k)}N^{(k+1)}\bigr)$, comparable to the forward pass, and this step requires weight transport.

##### DFA.

The output error of dimension $D$ is broadcast to layer $k$ through a *fixed random* feedback matrix $\mathbf{B}^{(k)}\in\mathbb{R}^{N^{(k)}\times D}$, costing $\mathcal{O}\!\bigl(B\,N^{(k)}D\bigr)$ per layer with $\mathcal{O}\!\bigl(N^{(k)}D\bigr)$ storage. No backward chain is computed.

##### SBD.

The output score, also of dimension $D$, is broadcast through the *adaptive* correlation matrix $\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}\in\mathbb{R}^{N^{(k)}\times D}$. Two operations contribute at layer $k$: the EMA correlation update (Step 2), which is an outer product of cost $\mathcal{O}\!\bigl(B\,N^{(k)}D\bigr)$, and the projection $\mathbf{q}_{\mathrm{CE}}^{(k)}=\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}\boldsymbol{\delta}_{\mathrm{CE}}$ (Step 3), of cost $\mathcal{O}\!\bigl(B\,N^{(k)}D\bigr)$. Storage of $\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}$ is $\mathcal{O}\!\bigl(N^{(k)}D\bigr)$, matching DFA.

##### Comparison.

Per layer, the modulator cost is $\mathcal{O}\!\bigl(B\,N^{(k)}N^{(k+1)}\bigr)$ for BP versus $\mathcal{O}\!\bigl(B\,N^{(k)}D\bigr)$ for DFA and SBD. Because $D$ is typically much smaller than the hidden widths $N^{(k+1)}$, both broadcast methods enjoy the same asymptotic advantage over BP, with the gap widening as the network is made wider. SBD and DFA share identical asymptotic complexity, but SBD performs roughly twice the work of DFA per layer (one EMA outer product plus one projection, versus DFA's single projection through a fixed matrix); in exchange, $\widehat{\mathbf{R}}_{\mathrm{CE}}^{(k)}$ is *learned*. Replacing $D$ by $D_{\varepsilon}=MD$ in the SBD expressions above gives the cost of the score-expansion variant of Section [7](https://arxiv.org/html/2605.30638#S7), which scales linearly in the expansion factor $M$ while preserving the orthogonality framework.
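The leading-order modulator costs above can be tallied with a short helper. This is a rough counting sketch, not a benchmark: the function name `modulator_costs`, the example layer widths, the batch size, and the expansion factor of 3 are hypothetical values chosen here for illustration, and constant factors other than SBD's factor of two are ignored.

```python
def modulator_costs(widths, batch, score_dim):
    """Leading-order multiply-add counts for producing hidden-layer modulators.

    widths = [N0, N1, ..., NL]; hidden layers are k = 1, ..., L-1.
    """
    hidden = range(1, len(widths) - 1)
    bp = sum(batch * widths[k] * widths[k + 1] for k in hidden)   # B N_k N_{k+1}
    dfa = sum(batch * widths[k] * score_dim for k in hidden)      # B N_k D
    sbd = 2 * dfa                     # EMA outer product plus projection
    return {"BP": bp, "DFA": dfa, "SBD": sbd}

# Hypothetical fully connected network with 10 output classes.
widths = [3072, 1024, 1024, 1024, 10]
costs = modulator_costs(widths, batch=64, score_dim=10)
# Score expansion with an assumed factor M = 3 replaces D by M * D.
expanded = modulator_costs(widths, batch=64, score_dim=3 * 10)

for name, value in costs.items():
    print(f"{name}: {value:,} multiply-adds")
print(f"SBD with expansion: {expanded['SBD']:,} multiply-adds")
```

For the last hidden layer the BP and DFA counts coincide, since the next layer's width is the output dimension itself; the savings come from the wider layers beneath it.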

##### Convolutional layers\.

The same analysis extends to the convolutional layers of Appendix[C\.2](https://arxiv.org/html/2605.30638#A3.SS2)\. LetP\(k\)P^\{\(k\)\}be the number of output channels at layerkk,Ω\(k\)\\Omega^\{\(k\)\}the kernel size, andM\(k\)×N\(k\)M^\{\(k\)\}\\times N^\{\(k\)\}the spatial output size; the corresponding fully connected width isNFC\(k\)=P\(k\)​M\(k\)​N\(k\)N^\{\(k\)\}\_\{\\mathrm\{FC\}\}=P^\{\(k\)\}M^\{\(k\)\}N^\{\(k\)\}\. The forward and backward operations are convolutions, costing𝒪​\(B​P\(k\)​P\(k−1\)​\(Ω\(k\)\)2​M\(k\)​N\(k\)\)\\mathcal\{O\}\\\!\\bigl\(B\\,P^\{\(k\)\}P^\{\(k\-1\)\}\(\\Omega^\{\(k\)\}\)^\{2\}M^\{\(k\)\}N^\{\(k\)\}\\bigr\)per layer for both BP and the SBD broadcast convolution\. The SBD modulator at convolutional layerkkrequires maintaining a per\-channel cross\-correlation tensor𝐑^𝐠​𝜹\(k,p\)∈ℝD×M\(k\)×N\(k\)\\widehat\{\\mathbf\{R\}\}^\{\(k,p\)\}\_\{\\mathbf\{g\}\\boldsymbol\{\\delta\}\}\\in\\mathbb\{R\}^\{D\\times M^\{\(k\)\}\\times N^\{\(k\)\}\}, of total size𝒪​\(P\(k\)​D​M\(k\)​N\(k\)\)\\mathcal\{O\}\\\!\\bigl\(P^\{\(k\)\}DM^\{\(k\)\}N^\{\(k\)\}\\bigr\), with the EMA update and the projection each costing𝒪​\(B​P\(k\)​D​M\(k\)​N\(k\)\)\\mathcal\{O\}\\\!\\bigl\(B\\,P^\{\(k\)\}DM^\{\(k\)\}N^\{\(k\)\}\\bigr\)\. The asymptotic ordering BP\>\>SBD≈\\approxDFA in modulator cost therefore carries over to convolutional layers, withDDreplacingN\(k\+1\)N^\{\(k\+1\)\}in the broadcast case and the convolution structure shared by all three methods\. ReplacingDDbyDε=M​DD\_\{\\varepsilon\}=MDfor the score\-expansion variant scales the SBD modulator cost linearly in the expansion factor, exactly as in the fully connected case\.

##### Crossover regime for score expansion\.

The score-expansion broadcast remains cheaper than BP at layer $k$ exactly when $MD<N^{(k+1)}$ (FC layers) or $MD<P^{(k-1)}\bigl(\Omega^{(k)}\bigr)^{2}$ (conv layers). For the architectures used in this paper this margin is large: $MD=30$ on CIFAR-10 and $MD=600$ on Tiny ImageNet, against hidden widths in the range $10^{3}$–$10^{5}$, so SBD with expansion is one to three orders of magnitude cheaper per layer than BP. The inequality tightens as $D$ grows.
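The crossover conditions above can be checked mechanically for a given layer. The following is a minimal sketch, not part of the paper's codebase: the function names are hypothetical, and the layer sizes in the usage note are taken from the CIFAR-10 architecture described in Appendix E.2.

```python
def expansion_cheaper_than_bp_fc(M, D, next_width):
    """FC layer: the expanded broadcast (dimension M*D) is cheaper than BP
    exactly when M*D < N^(k+1), the width of the next layer."""
    return M * D < next_width


def expansion_cheaper_than_bp_conv(M, D, in_channels, kernel_size):
    """Conv layer: the condition is M*D < P^(k-1) * (Omega^(k))^2."""
    return M * D < in_channels * kernel_size ** 2


def max_expansion_factor_fc(D, next_width):
    """Largest integer M for which M*D is still strictly below next_width
    (hypothetical helper; returns 0 if no positive M qualifies)."""
    return max((next_width - 1) // D, 0)
```

For example, with $M=3$ and $D=10$, `expansion_cheaper_than_bp_fc(3, 10, 1024)` and `expansion_cheaper_than_bp_conv(3, 10, 64, 2)` both hold, and `max_expansion_factor_fc(10, 1024)` indicates how far $M$ could grow before the FC condition fails.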

## Appendix D Score vector expansion: general theory and cross entropy example

This appendix provides the full details behind the main-text summary in Section [7](https://arxiv.org/html/2605.30638#S7). We first develop the score vector expansion for a general differentiable loss whose population-optimal score has conditional mean zero, and then specialize the construction to cross entropy.

### D.1 General formulation

Let $\mathcal{L}(\mathbf{y},\mathbf{a})$ be a differentiable loss with output score $\boldsymbol{\delta}(\mathbf{y},\mathbf{a})=\nabla_{\mathbf{a}}\mathcal{L}(\mathbf{y},\mathbf{a})\in\mathbb{R}^{D_{\mathrm{out}}}$, and assume that the population-optimal predictor $\mathbf{a}^{\star}(X)$ satisfies the conditional mean-zero property

$$\mathbb{E}\bigl[\boldsymbol{\delta}(\mathbf{Y},\mathbf{a}^{\star}(X))\,\big|\,X\bigr]=\mathbf{0},$$

so that the orthogonality machinery of Theorem [3](https://arxiv.org/html/2605.30638#Thmtheorem3) applies. The layerwise correlation matrix

$$\mathbf{R}^{(k)}=\mathbb{E}\bigl[g^{(k)}(\mathbf{h}^{(k)})\boldsymbol{\delta}^{T}\bigr]\in\mathbb{R}^{d_{k}\times D_{\mathrm{out}}}$$

has column rank at most $D_{\mathrm{out}}$, regardless of the hidden-layer width $d_{k}$. When $d_{k}$ greatly exceeds $D_{\mathrm{out}}$, this caps the number of independent decorrelation constraints and motivates enlarging the broadcast signal.

##### Modulated scores preserve conditional mean\-zero\.

Let $\boldsymbol{\phi}:\mathcal{X}\to\mathbb{R}^{D_{\mathrm{out}}}$ be any measurable vector-valued modulator of the input, and define the *modulated score* at the population optimum by

$$\boldsymbol{\eta}^{\boldsymbol{\phi},\star}(X)\;:=\;\boldsymbol{\phi}(X)\odot\boldsymbol{\delta}\bigl(\mathbf{Y},\mathbf{a}^{\star}(X)\bigr),$$

where $\odot$ denotes the Hadamard (elementwise) product. Because $\boldsymbol{\phi}(X)$ is measurable with respect to $X$, it can be pulled out of the conditional expectation. Hence

$$\mathbb{E}\!\left[\boldsymbol{\eta}^{\boldsymbol{\phi},\star}(X)\,\big|\,X\right]=\boldsymbol{\phi}(X)\odot\mathbb{E}\!\left[\boldsymbol{\delta}(\mathbf{Y},\mathbf{a}^{\star}(X))\,\big|\,X\right]=\mathbf{0}.$$

Applying the tower-property orthogonality lemma (Lemma [1](https://arxiv.org/html/2605.30638#Thmlemma1)), we conclude

$$\mathbb{E}\!\left[\boldsymbol{\eta}^{\boldsymbol{\phi},\star}(X)g(X)^{T}\right]=\mathbf{0}\quad\text{for every measurable }g(X)\text{ with finite expectation.}$$

Thus any Hadamard modulator that is $X$-measurable produces a broadcast signal for which the layerwise orthogonality condition holds at the population optimum. Modulators of the form $\boldsymbol{\phi}(X)=\psi(\mathbf{a}^{\star}(X))$ are only one special case; the construction also permits modulators built from the raw input, intermediate activations, or any other deterministic function of $X$.
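The pull-out argument can be verified numerically on a toy example. The sketch below is illustrative only: it fixes a single input $X$, assumes a one-hot label $\mathbf{Y}$ drawn from a categorical distribution `q`, and uses the cross-entropy score $\mathbf{p}-\mathbf{y}$ as the concrete score (an assumption made for the example; the general argument does not depend on it). The function name is hypothetical.

```python
def conditional_mean_modulated_score(q, phi, p=None):
    """Exact E[phi ⊙ (p - Y) | X] for one-hot Y ~ Categorical(q).

    q   : true class probabilities given X
    phi : modulator values phi(X), fixed once X is fixed
    p   : predicted probabilities; defaults to q (the population optimum)
    """
    if p is None:
        p = q
    D = len(q)
    mean = [0.0] * D
    for c in range(D):  # enumerate the outcome Y = e_c, which has probability q[c]
        for d in range(D):
            y_d = 1.0 if d == c else 0.0
            mean[d] += q[c] * phi[d] * (p[d] - y_d)
    return mean
```

With `p` left at its default, every entry of the result vanishes up to rounding for any choice of `phi`. Passing a suboptimal `p` yields `phi[d] * (p[d] - q[d])` in coordinate `d`, which is nonzero wherever the prediction deviates from `q`.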

##### Rank expansion by stacking modulators\.

Given $M$ $X$-measurable modulators $\boldsymbol{\phi}_{1},\ldots,\boldsymbol{\phi}_{M}$, define the *expanded score* during training by

$$\tilde{\boldsymbol{\delta}}(X)\;:=\;\begin{bmatrix}\boldsymbol{\phi}_{1}(X)\odot\boldsymbol{\delta}(\mathbf{Y},\mathbf{a}(X))\\ \boldsymbol{\phi}_{2}(X)\odot\boldsymbol{\delta}(\mathbf{Y},\mathbf{a}(X))\\ \vdots\\ \boldsymbol{\phi}_{M}(X)\odot\boldsymbol{\delta}(\mathbf{Y},\mathbf{a}(X))\end{bmatrix}\in\mathbb{R}^{MD_{\mathrm{out}}},$$

obtained by stacking the $M$ modulated scores. Replacing $\mathbf{a}(X)$ by the population optimum $\mathbf{a}^{\star}(X)$, the argument above shows that each block of length $D_{\mathrm{out}}$ satisfies the conditional mean-zero property, and hence

$$\mathbb{E}\!\left[\tilde{\boldsymbol{\delta}}^{\star}(X)\,\big|\,X\right]=\mathbf{0},\qquad\mathbb{E}\!\left[\tilde{\boldsymbol{\delta}}^{\star}(X)\,g^{(k)}(\mathbf{h}^{(k)}(X))^{T}\right]=\mathbf{0}.\tag{3}$$

The expanded correlation matrix

$$\tilde{\mathbf{R}}^{(k)}\;=\;\mathbb{E}\!\left[g^{(k)}(\mathbf{h}^{(k)})\,\tilde{\boldsymbol{\delta}}^{T}\right]\in\mathbb{R}^{d_{k}\times MD_{\mathrm{out}}}$$

now has column rank at most $MD_{\mathrm{out}}$, providing the layerwise decorrelation objective $\tilde{\mathcal{J}}^{(k)}=\tfrac{1}{2}\|\tilde{\mathbf{R}}^{(k)}\|_{F}^{2}$ with up to $M$ times more independent directions along which to suppress dependence between the hidden representation and the broadcast score, at the cost of a linear increase in parameters.

##### Algorithmic formulation\.

The expanded-score procedure is identical to plain SBD except that the $D_{\mathrm{out}}$-dimensional score and its batch matrix are replaced by their $MD_{\mathrm{out}}$-dimensional counterparts:

$$\begin{aligned}
\tilde{\boldsymbol{\delta}}[n] &= \bigl[\boldsymbol{\phi}_{1}(\mathbf{x}[n])\odot\boldsymbol{\delta}(\mathbf{y}[n],\mathbf{a}(\mathbf{x}[n]));\;\ldots;\;\boldsymbol{\phi}_{M}(\mathbf{x}[n])\odot\boldsymbol{\delta}(\mathbf{y}[n],\mathbf{a}(\mathbf{x}[n]))\bigr],\\
\tilde{\mathbf{\Delta}}[m] &= \begin{bmatrix}\tilde{\boldsymbol{\delta}}[mB+1]&\cdots&\tilde{\boldsymbol{\delta}}[mB+B]\end{bmatrix}\in\mathbb{R}^{MD_{\mathrm{out}}\times B},\\
\tilde{\hat{\mathbf{R}}}^{(k)}[m] &= \lambda\,\tilde{\hat{\mathbf{R}}}^{(k)}[m-1]+\frac{1-\lambda}{B}\,\mathbf{G}^{(k)}[m]\,\tilde{\mathbf{\Delta}}[m]^{T},\\
\tilde{\mathbf{q}}^{(k)}[mB+n] &= \tilde{\hat{\mathbf{R}}}^{(k)}[m]\,\tilde{\boldsymbol{\delta}}[mB+n],\quad\text{for }n=1,\ldots,B,
\end{aligned}$$

$$\Delta W_{ij}^{(k)}[m]=\frac{\zeta}{B}\sum_{l=mB+1}^{mB+B}g_{i}^{\prime(k)}\!\bigl(h_{i}^{(k)}[l]\bigr)\,f^{\prime(k)}\!\bigl(u_{i}^{(k)}[l]\bigr)\,\tilde{q}_{i}^{(k)}[l]\,h_{j}^{(k-1)}[l].\tag{4}$$

The output layer continues to use the raw $D_{\mathrm{out}}$-dimensional score for its standard gradient update.
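The EMA update and projection steps above reduce to two small matrix operations. The following sketch writes them with plain Python lists so that the shapes are explicit; it is an illustration under the stated shape conventions, not the paper's implementation, and the function names are hypothetical.

```python
def ema_correlation_update(R_prev, G, Delta, lam):
    """One EMA step: R[m] = lam * R[m-1] + (1 - lam) / B * G @ Delta^T.

    R_prev : d_k x E matrix (E = M * D_out), previous estimate
    G      : d_k x B matrix whose columns are g(h[l]) for the batch
    Delta  : E x B matrix whose columns are the expanded scores
    """
    d_k, B, E = len(G), len(G[0]), len(Delta)
    scale = (1.0 - lam) / B
    return [
        [
            lam * R_prev[i][j]
            + scale * sum(G[i][b] * Delta[j][b] for b in range(B))
            for j in range(E)
        ]
        for i in range(d_k)
    ]


def project_score(R, delta_tilde):
    """Projection q = R @ delta_tilde, giving one modulator value per hidden unit."""
    return [sum(r * d for r, d in zip(row, delta_tilde)) for row in R]
```

The projected vector `q` has length $d_{k}$ regardless of the expansion factor, which is why only the correlation estimate (and not the weight update in Eq. (4)) changes shape when the score is expanded.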

### D.2 Cross entropy instantiation

For cross entropy classification, the output dimension is $D_{\mathrm{out}}=D$ (the number of classes) and the output score is $\boldsymbol{\delta}_{\mathrm{CE}}=\mathbf{p}-\mathbf{y}$, with $\mathbf{p}^{\star}(X)=\mathbf{q}(X)$ as the population minimizer. The conditional mean-zero property required by the general construction is established in Theorem [1](https://arxiv.org/html/2605.30638#Thmtheorem1), so the modulated and expanded score machinery above applies to any deterministic $X$-measurable modulator. In our experiments, the modulators are built from the predictive distribution $\mathbf{p}(X)$, which is itself a deterministic function of $X$ for fixed network parameters.

For classification networks in which the hidden width $d_{k}$ greatly exceeds $D$ (for example, CIFAR-10 with $D=10$ and $d_{k}\in\{10^{3},10^{5}\}$ in the fully connected and convolutional layers, respectively), the rank bottleneck in $\mathbf{R}_{\mathrm{CE}}^{(k)}\in\mathbb{R}^{d_{k}\times D}$ is severe, and score expansion provides a natural remedy.

##### A concrete rank-$3D$ instance.

For the experiments in Section [8](https://arxiv.org/html/2605.30638#S8) we adopt a minimal $M=3$ instance built from two natural $X$-measurable modulators obtained from $\mathbf{p}(X)$, together with the identity. Let $\mathbf{1}_{D}\in\mathbb{R}^{D}$ denote the all-ones vector, and let $\mathrm{roll}_{s}(\mathbf{v})$ denote the cyclic shift of $\mathbf{v}$ by $s$ positions, that is, $[\mathrm{roll}_{s}(\mathbf{v})]_{d}=v_{((d-s-1)\bmod D)+1}$. The three modulators are

$$\boldsymbol{\phi}_{1}(X)=\mathbf{1}_{D},\qquad\boldsymbol{\phi}_{2}(X)=\mathbf{p}(X),\qquad\boldsymbol{\phi}_{3}(X)=\mathrm{roll}_{s}(\mathbf{p}(X)),$$

where the shift amount $s\in\{1,\ldots,D-1\}$ is chosen to maximize linear independence between blocks two and three; for $D=10$ we use $s=5$. The resulting expanded score is

$$\tilde{\boldsymbol{\delta}}_{\mathrm{CE}}\;=\;\begin{bmatrix}\boldsymbol{\delta}_{\mathrm{CE}}\\ \mathbf{p}\odot\boldsymbol{\delta}_{\mathrm{CE}}\\ \mathrm{roll}_{s}(\mathbf{p})\odot\boldsymbol{\delta}_{\mathrm{CE}}\end{bmatrix}\in\mathbb{R}^{3D}.$$

Block one reproduces the raw cross entropy score. Block two, a *confidence-weighted residual*, amplifies errors at classes that the network currently predicts with high probability, and therefore carries information about the output distribution that is linearly independent of block one whenever $\mathbf{p}$ is not proportional to $\mathbf{1}_{D}$. Block three induces a cross-class coupling between residuals and class-index-shifted probabilities, producing a direction that is generically linearly independent of the first two blocks as soon as $\mathbf{p}$ is not invariant under the chosen shift. All three modulators are deterministic functions of $X$ through the forward pass, so ([3](https://arxiv.org/html/2605.30638#A4.E3)) applies and the expanded decorrelation condition remains an exact orthogonality relation at the population optimum.
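The three-block construction can be written out for a single sample as follows. This is a minimal sketch with hypothetical function names; the cyclic shift follows the index convention stated above, which in zero-based indexing reads `roll(v, s)[i] = v[(i - s) % D]`.

```python
def roll(v, s):
    """Cyclic shift by s positions: entry i of the result is v[(i - s) mod D]."""
    D = len(v)
    return [v[(i - s) % D] for i in range(D)]


def expanded_ce_score(p, y, s):
    """Stack [delta; p ⊙ delta; roll_s(p) ⊙ delta] with delta = p - y."""
    delta = [p_d - y_d for p_d, y_d in zip(p, y)]
    confidence_block = [p_d * d_d for p_d, d_d in zip(p, delta)]
    shifted_block = [r_d * d_d for r_d, d_d in zip(roll(p, s), delta)]
    return delta + confidence_block + shifted_block
```

For a $D$-class problem the returned list has length $3D$; with $D=10$ and $s=5$ this gives the 30-dimensional broadcast vector used in the CIFAR-10 experiments.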

##### Algorithm\.

Specializing the general expanded-SBD procedure above to the cross entropy case yields the algorithm in Table [4](https://arxiv.org/html/2605.30638#A4.T4). The only modifications from Appendix [C](https://arxiv.org/html/2605.30638#A3), Table [3](https://arxiv.org/html/2605.30638#A3.T3) are Step 2, in which $\mathbf{\Delta}_{\mathrm{CE}}[m]$ is replaced by $\tilde{\mathbf{\Delta}}_{\mathrm{CE}}[m]\in\mathbb{R}^{3D\times B}$, and Step 3, in which the projection uses $\tilde{\hat{\mathbf{R}}}_{\mathrm{CE}}^{(k)}$ and $\tilde{\boldsymbol{\delta}}_{\mathrm{CE}}$.

Table 4: Cross entropy SBD with rank-$3D$ score expansion. Differences from Appendix [C](https://arxiv.org/html/2605.30638#A3), Table [3](https://arxiv.org/html/2605.30638#A3.T3) are shown in italics.

| Step | Operation |
|---|---|
| 1 | Forward pass to obtain $\mathbf{h}^{(k)}$, $\mathbf{a}$, $\mathbf{p}=\operatorname{softmax}(\mathbf{a})$, and the score $\boldsymbol{\delta}_{\mathrm{CE}}=\mathbf{p}-\mathbf{y}$. |
| 2 | *Form the expanded score $\tilde{\boldsymbol{\delta}}_{\mathrm{CE}}=[\boldsymbol{\delta}_{\mathrm{CE}};\,\mathbf{p}\odot\boldsymbol{\delta}_{\mathrm{CE}};\,\mathrm{roll}_{s}(\mathbf{p})\odot\boldsymbol{\delta}_{\mathrm{CE}}]\in\mathbb{R}^{3D}$ and stack into the batch matrix $\tilde{\mathbf{\Delta}}_{\mathrm{CE}}[m]\in\mathbb{R}^{3D\times B}$. Update the expanded correlation estimate $\tilde{\hat{\mathbf{R}}}_{\mathrm{CE}}^{(k)}[m]$.* |
| 3 | *Project the broadcast score to hidden layer $k$ by $\tilde{\mathbf{q}}_{\mathrm{CE}}^{(k)}[n]=\tilde{\hat{\mathbf{R}}}_{\mathrm{CE}}^{(k)}[m]\,\tilde{\boldsymbol{\delta}}_{\mathrm{CE}}[n]$ for $n=mB+1,\ldots,(m+1)B$.* |
| 4 | Update hidden-layer parameters with the three-factor rule $\Delta W_{ij}^{(k)}[m]=\zeta\,g_{i}^{\prime(k)}(h_{i}^{(k)}[m])\,f^{\prime(k)}(u_{i}^{(k)}[m])\,\tilde{q}_{\mathrm{CE},i}^{(k)}[m]\,h_{j}^{(k-1)}[m]$ for $B=1$ (use Eq. ([4](https://arxiv.org/html/2605.30638#A4.E4)) for $B>1$). |
| 5 | Update the output layer with the standard $D$-dimensional cross entropy gradient if desired. |

## Appendix E CIFAR-10 experimental setup

This appendix documents the codebase used for the CIFAR-10 broadcast-learning experiments reported in the main text. The expanded SBD configuration is implemented by the per-seed launch scripts `run_cemseN_s*.sh`, which call `CIFAR10_CEMSEN.py` with cross-entropy loss, `method=ebd`, and `err_expand=2`. This setting uses the rank-$3D$ score expansion, so that the broadcast dimension is $3D=30$ for CIFAR-10. The main-text result is averaged over five independent seed runs using the 5-layer CNN described in Section [E.2](https://arxiv.org/html/2605.30638#A5.SS2). All CIFAR-10 experiments run in `float32` on a single NVIDIA V100 (32GB) GPU using PyTorch. For the reported baseline-width CIFAR-10 experiments, the approximate times to complete $200$ epochs were $75$, $91$, $96$, and $101$ minutes for BP, DFA, SBD, and SBD with score expansion, respectively. For the reported $4\times$-width CIFAR-10 experiments, the approximate times to complete $200$ epochs were $170$ and $420$ minutes for BP and SBD with score expansion, respectively.

For CIFAR-10, we use the standard dataset partition of $50{,}000$ training images and $10{,}000$ test images, and all reported CIFAR-10 accuracies are top-1 accuracies on the official test split. The architecture and base optimization settings are matched to the original EBD CNN setup, with cross entropy replacing MSE. For the expanded-SBD variant, Appendix [E.5](https://arxiv.org/html/2605.30638#A5.SS5) documents the chosen score-expansion modulator set together with the single-seed design exploration used to select it.

PyTorch (Paszke et al., [2019](https://arxiv.org/html/2605.30638#bib.bib35)) is distributed under the BSD-3-Clause license, while the CIFAR-10 dataset (Krizhevsky, [2009](https://arxiv.org/html/2605.30638#bib.bib36)) is publicly released by the University of Toronto for research use without a formal license declaration on its official distribution page; both are used here in accordance with their stated terms.

### E.1 Score broadcast and decorrelation

SBD replaces the global backpropagation chain with local, per-layer learning rules driven by a broadcast score signal. At each mini-batch, the forward pass produces pre-softmax logits $\mathbf{a}\in\mathbb{R}^{C}$ and probabilities $\mathbf{p}=\operatorname{softmax}(\mathbf{a}/T)$, where $C=10$ for CIFAR-10 and $T$ is the implementation variable `SOFTMAX_TEMP`. In cross-entropy mode, the implementation uses the broadcast residual

$$\boldsymbol{\delta}=\alpha(\mathbf{p}-\mathbf{y})\in\mathbb{R}^{C},\qquad\alpha=\texttt{ERR\_SCALE},$$

where $\alpha$ is the score scaling and $\mathbf{y}$ is the one-hot label vector. The scalar cross-entropy loss is computed from the temperature-scaled logits, while the SBD broadcast signal used in the layerwise updates is the probability residual above.

For hidden-layer SBD updates, this $C$-dimensional signal is expanded into a higher-dimensional broadcast vector $\boldsymbol{\varepsilon}\in\mathbb{R}^{D_{\varepsilon}}$ by concatenating blocks of the form $\boldsymbol{\phi}_{\ell}(\mathbf{x})\odot\boldsymbol{\delta}$, where each $\boldsymbol{\phi}_{\ell}$ is an $X$-measurable deterministic function of the input $\mathbf{x}$ (in the reported setting, through $\mathbf{p}$ and $\mathbf{a}$). In the main expanded-SBD configuration, implemented with `err_expand=2`, the hidden-layer broadcast vector is

$$\boldsymbol{\varepsilon}(\mathbf{x})=\bigl[\boldsymbol{\delta};\;\mathbf{p}\odot\boldsymbol{\delta};\;\mathrm{roll}_{5}(\mathbf{p})\odot\boldsymbol{\delta}\bigr]\in\mathbb{R}^{30}.\tag{5}$$

Thus the output layer receives the raw 10-dimensional residual $\boldsymbol{\delta}$, while the hidden layers receive the expanded rank-$3D$ broadcast signal. Since each modulator is deterministic conditional on $X$, the conditional mean-zero property is preserved at the Bayes optimum:

$$\mathbb{E}\!\left[\boldsymbol{\phi}_{\ell}(X)\odot\boldsymbol{\delta}^{\star}\,\middle|\,X\right]=\boldsymbol{\phi}_{\ell}(X)\odot\mathbb{E}\!\left[\boldsymbol{\delta}^{\star}\mid X\right]=\mathbf{0}.$$

Thus score expansion preserves the population orthogonality property used by SBD.

For each trainable layer $k$ with pre-activation $\mathbf{u}^{(k)}=W^{(k)}\mathbf{h}^{(k-1)}$, post-activation $\mathbf{h}^{(k)}=\mathrm{ReLU}(\mathbf{u}^{(k)})$, and incoming activation $\mathbf{h}^{(k-1)}$, SBD maintains two mini-batch EMAs:

$$\begin{aligned}
\widehat{R}^{(k)}_{g\varepsilon}[m] &\leftarrow \lambda\,\widehat{R}^{(k)}_{g\varepsilon}[m-1]+(1-\lambda)\cdot\tfrac{1}{B}\sum_{l=mB+1}^{mB+B}\boldsymbol{\varepsilon}[l]\,\mathbf{h}^{(k)}[l]^{T} &&\text{(cross-covariance)}\\
\widehat{R}^{(k)}_{hh}[m] &\leftarrow \lambda_{2}\,\widehat{R}^{(k)}_{hh}[m-1]+(1-\lambda_{2})\cdot\tfrac{1}{B}\sum_{l=mB+1}^{mB+B}\mathbf{h}^{(k)}[l]\,\mathbf{h}^{(k)}[l]^{T} &&\text{(auto-covariance; FC only)}
\end{aligned}$$

with decay rates $\lambda=\texttt{Reh\_lambda}$ and $\lambda_{2}=\texttt{Reh\_lambda2}$. Both decays are annealed toward $1$ at the end of every epoch according to $\lambda\leftarrow\lambda+\rho(1-\lambda)$ with $\rho=\texttt{Reh\_lambda\_drop}$.
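The annealing rule $\lambda\leftarrow\lambda+\rho(1-\lambda)$ shrinks the gap $1-\lambda$ by a factor $(1-\rho)$ per epoch, so after $n$ epochs $1-\lambda_{n}=(1-\lambda_{0})(1-\rho)^{n}$. The sketch below illustrates this schedule; the function names are hypothetical and the code is not taken from the paper's implementation.

```python
def anneal_decay(lam, rho):
    """One end-of-epoch annealing step: lam <- lam + rho * (1 - lam)."""
    return lam + rho * (1.0 - lam)


def decay_after_epochs(lam0, rho, n_epochs):
    """Apply the annealing step n_epochs times, starting from lam0."""
    lam = lam0
    for _ in range(n_epochs):
        lam = anneal_decay(lam, rho)
    return lam


def decay_closed_form(lam0, rho, n_epochs):
    """Equivalent closed form: 1 - (1 - lam0) * (1 - rho) ** n_epochs."""
    return 1.0 - (1.0 - lam0) * (1.0 - rho) ** n_epochs
```

As $\lambda$ approaches $1$, each new mini-batch contributes less to the running estimate, so the correlation estimates become progressively more stable late in training.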

The local weight update at layer $k$ is a weighted sum of three terms:

$$\Delta W^{(k)}\;=\;-\,c_{\mathrm{score}}^{(k)}\,\nabla^{\mathrm{score}}_{W^{(k)}}\;-\;c_{\mathrm{cov}}^{(k)}\,\nabla^{\mathrm{cov}}_{W^{(k)}}\;-\;c_{\ell_{1}}^{(k)}\,\nabla^{\ell_{1}}_{W^{(k)}}.$$

The coefficient $c_{\mathrm{score}}^{(k)}$ denotes the weight of the score-broadcast decorrelation term. In the implementation, this coefficient corresponds to the `CMSE_*` hyperparameters, retaining the naming convention of the original MSE/EBD codebase. With `loss_type=mse`, this term corresponds to the standard MSE/EBD residual update; with `loss_type=ce`, the same local update form is driven by the cross-entropy score residual. The three terms are:

- Three-factor score-broadcast update

  $$\nabla^{\mathrm{score}}_{W^{(k)}}\;=\;\tfrac{1}{B}\,\bigl[\,f^{\prime}(\mathbf{u}^{(k)})\odot\bigl(\widehat{R}^{(k)\top}_{g\varepsilon}\,\boldsymbol{\varepsilon}\bigr)\,\bigr]\,\mathbf{h}^{(k-1)\top}.$$

  This is the product of three factors: the ReLU derivative $f^{\prime}(\mathbf{u}^{(k)})$, the broadcast modulator $\widehat{R}^{(k)\top}_{g\varepsilon}\boldsymbol{\varepsilon}$, and the presynaptic activation $\mathbf{h}^{(k-1)}$, accumulated across the batch. At the output layer, $\boldsymbol{\varepsilon}$ is replaced by the raw $\boldsymbol{\delta}$, $f^{\prime}$ is set to the identity, and $\widehat{R}^{(k)}_{g\varepsilon}$ is the identity so that the update reduces to the standard single-layer cross entropy gradient.
- Activation decorrelation (entropy) update (FC layers only) (Erdogan et al., [2025](https://arxiv.org/html/2605.30638#bib.bib7); Ozsoy et al., [2022](https://arxiv.org/html/2605.30638#bib.bib12); Bozkurt et al., [2023a](https://arxiv.org/html/2605.30638#bib.bib13), [b](https://arxiv.org/html/2605.30638#bib.bib14))

  $$\nabla^{\mathrm{cov}}_{W^{(k)}}\;=\;-\tfrac{2}{B\,D_{k}}\,\bigl[\,f^{\prime}(\mathbf{u}^{(k)})\odot(\widehat{R}^{(k)}_{hh}+\eta I)^{-1}\mathbf{h}^{(k)}\,\bigr]\,\mathbf{h}^{(k-1)\top},$$

  derived from Jacobi's formula applied to $\log\det(\widehat{R}^{(k)}_{hh}+\eta I)$. Here $\eta=\texttt{R\_eps\_weight}$ regularizes the inverse and $D_{k}$ is the layer width. For conv layers we set $c_{\mathrm{cov}}^{(k)}=0$ (`CCOV_HIDDEN`).
- Activation $\ell_{1}$ sparsity

  $$\nabla^{\ell_{1}}_{W^{(k)}}\;=\;\tfrac{1}{B\,D_{k}}\,\mathrm{sign}(\mathbf{h}^{(k)})\,\mathbf{h}^{(k-1)\top},$$

  applied only to the hidden fully connected layer fc3 in the reported CIFAR-10 expanded-SBD implementation. It is disabled for the convolutional layers and is not applied to the final classifier fc4.

The scalar coefficients $(c_{\mathrm{score}}^{(k)},c_{\mathrm{cov}}^{(k)},c_{\ell_{1}}^{(k)})$ are given in Table [7](https://arxiv.org/html/2605.30638#A5.T7). The local SBD gradient terms are computed inside a `torch.no_grad()` context, manually written into `param.grad`, and applied via a standard Adam optimizer. In the SBD runs, hidden-layer updates are supplied by these local broadcast-decorrelation rules rather than by a global backpropagation chain.

### E.2 Network architecture

The network is a minimalist 5-layer CNN: three convolutional blocks followed by two fully connected layers. It uses no bias terms, no BatchNorm, no dropout, and no residual connections. Average pooling is used throughout. This minimal choice isolates the SBD update rule from confounds introduced by normalization layers.

Table 5: Baseline CNN architecture. Input: CIFAR-10 images of shape $(3,32,32)$. Total trainable parameters: $1{,}289{,}600$.

| Layer | Operation | Output shape | Parameters |
|---|---|---|---|
| conv1 | $5\times 5$ conv, stride 1, pad 2 | $(128,\,32,\,32)$ | $9{,}600$ |
| act1 | ReLU | $(128,\,32,\,32)$ | – |
| pool1 | AvgPool $2\times 2$ | $(128,\,16,\,16)$ | – |
| conv2 | $5\times 5$ conv, stride 1, pad 2 | $(64,\,16,\,16)$ | $204{,}800$ |
| act2 | ReLU | $(64,\,16,\,16)$ | – |
| pool2 | AvgPool $2\times 2$ | $(64,\,8,\,8)$ | – |
| conv3 | $2\times 2$ conv, stride 2, pad 0 | $(64,\,4,\,4)$ | $16{,}384$ |
| act3 | ReLU | $(64,\,4,\,4)$ | – |
| flatten | reshape | $(1024,)$ | – |
| fc3 | linear | $(1024,)$ | $1{,}048{,}576$ |
| act4 | ReLU | $(1024,)$ | – |
| fc4 | linear | $(10,)$ | $10{,}240$ |
| softmax | softmax at $T=1$ | $(10,)$ | – |

Layer widths are parameterized via the CLI flags `--P0`, `--P1`, `--P2`, and `--fc_hidden` (default $128, 64, 64, 1024$), which in turn set the channel counts of conv1, conv2, and conv3 and the width of fc3, respectively. The fc3 input dimension $P_{2}\cdot 4\cdot 4$ is derived automatically. Weights are initialized from a Gaussian distribution with standard deviation $\sqrt{2}\sqrt{1/\texttt{scale\_factor}}/\sqrt{\mathrm{fan\_in}}$, giving a Kaiming-like scaling with `scale_factor` $=6$. In the reported CIFAR-10 runs, data augmentation consists of per-channel normalization, random cropping after 2-pixel reflection padding, and horizontal flipping.

For the additional width-scaling experiments discussed in the main text, we also evaluated a $4\times$-width variant of the same CNN.

Table 6: CNN architecture at $4\times$ width. Input: CIFAR-10 images of shape $(3,32,32)$. Total trainable parameters: $20{,}395{,}520$ ($\approx 16\times$ the baseline model count).

| Layer | Operation | Output shape | Parameters |
|---|---|---|---|
| conv1 | $5\times 5$ conv, stride 1, pad 2 | $(512,\,32,\,32)$ | $38{,}400$ |
| act1 | ReLU | $(512,\,32,\,32)$ | – |
| pool1 | AvgPool $2\times 2$ | $(512,\,16,\,16)$ | – |
| conv2 | $5\times 5$ conv, stride 1, pad 2 | $(256,\,16,\,16)$ | $3{,}276{,}800$ |
| act2 | ReLU | $(256,\,16,\,16)$ | – |
| pool2 | AvgPool $2\times 2$ | $(256,\,8,\,8)$ | – |
| conv3 | $2\times 2$ conv, stride 2, pad 0 | $(256,\,4,\,4)$ | $262{,}144$ |
| act3 | ReLU | $(256,\,4,\,4)$ | – |
| flatten | reshape | $(4096,)$ | – |
| fc3 | linear | $(4096,)$ | $16{,}777{,}216$ |
| act4 | ReLU | $(4096,)$ | – |
| fc4 | linear | $(10,)$ | $40{,}960$ |
| softmax | softmax at $T=1$ | $(10,)$ | – |

##### Why this is called "$4\times$ width."

The original architecture has four hyperparameters that control the widths of its intermediate representations: the three convolutional channel counts $P_{0}$, $P_{1}$, $P_{2}$ and the fully connected hidden size $d_{\mathrm{fc}}$. The default configuration uses

$$(P_{0},\,P_{1},\,P_{2},\,d_{\mathrm{fc}})\,=\,(128,\,64,\,64,\,1024),$$

while the $4\times$-width configuration uses

$$(P_{0},\,P_{1},\,P_{2},\,d_{\mathrm{fc}})\,=\,(512,\,256,\,256,\,4096),$$

i.e. every width hyperparameter is multiplied by exactly $4$. The relative ratios $P_{0}:P_{1}:P_{2}=2:1:1$ that determine the channel hierarchy of the network are preserved, as are the input dimension ($3$ RGB channels of size $32\times 32$) and the output dimension ($10$ classes). The spatial sizes of feature maps ($32\to 16\to 8\to 4$) are therefore unchanged; only the depth of each feature map and the FC hidden dimension are scaled.

##### Parameter\-count scaling\.

Although the term "$4\times$ width" describes the linear scaling of feature widths, the trainable parameter count grows *quadratically* rather than linearly, because most layers have parameter counts proportional to the product of input and output widths. Specifically:

- `conv1` grows by $4\times$ ($9{,}600\to 38{,}400$): the input channel count is fixed at $3$ (RGB), so only the output width scales.
- `conv2`, `conv3`, and `fc3` each grow by $16\times$: both the input and output widths scale by $4$, giving a $4\times 4=16$ factor for the kernel parameters $C_{\mathrm{in}}\cdot C_{\mathrm{out}}\cdot k^{2}$ (or $d_{\mathrm{in}}\cdot d_{\mathrm{out}}$ for the dense layer).
- `fc4` grows by $4\times$ ($10{,}240\to 40{,}960$): the output is fixed at $10$ classes, so only the input width scales.

The overall parameter count therefore scales by approximately $16\times$, from $1{,}289{,}600$ in the original architecture to $20{,}395{,}520$ at $4\times$ width. The dense `fc3` layer alone accounts for approximately $82\%$ of the new parameter count and therefore dominates the memory budget at this width.
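The totals above follow directly from the layer shapes in Tables 5 and 6. As a quick check, the sketch below recomputes the per-layer counts for a bias-free network with the stated kernel sizes; the function name and argument names are hypothetical, and the $4\times 4$ spatial size at the input of fc3 is taken from the tables.

```python
def cnn_param_counts(p0, p1, p2, fc_hidden, in_channels=3, num_classes=10):
    """Per-layer parameter counts for the bias-free 5-layer CNN.

    conv1, conv2 use 5x5 kernels; conv3 uses a 2x2 kernel; the flattened
    conv3 output has p2 * 4 * 4 features feeding fc3.
    """
    return {
        "conv1": 5 * 5 * in_channels * p0,
        "conv2": 5 * 5 * p0 * p1,
        "conv3": 2 * 2 * p1 * p2,
        "fc3": (p2 * 4 * 4) * fc_hidden,
        "fc4": fc_hidden * num_classes,
    }


baseline = cnn_param_counts(128, 64, 64, 1024)
wide = cnn_param_counts(512, 256, 256, 4096)
baseline_total = sum(baseline.values())
wide_total = sum(wide.values())
fc3_share = wide["fc3"] / wide_total  # fraction of parameters in fc3 at 4x width
```

Evaluating `wide_total / baseline_total` gives a ratio slightly below $16$, because `conv1` and `fc4` scale only linearly in the width multiplier.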

### E.3 Training hyperparameters

Table [7](https://arxiv.org/html/2605.30638#A5.T7) lists the hyperparameters used by the expanded-SBD CIFAR-10 launch scripts `run_cemseN_s*.sh`. Unless otherwise noted, expanded-SBD runs inherit these values; control runs use the corresponding values specified in their own launch scripts.

Table 7: CEMSEN training hyperparameters corresponding to the expanded-SBD per-seed launch scripts `run_cemseN_s*.sh`. Column *Flag* is the CLI name in the launch script; column *Symbol* names the variable used in the equations of Section [E.1](https://arxiv.org/html/2605.30638#A5.SS1).

| Group | Flag | Symbol | Value |
|---|---|---|---|
| *General* | `--logger_name` | seed | 5 independent seeds |
| | `--n_epochs` | – | $201$ |
| | `--batch_size` | $B$ | $64$ |
| | `--loss_type` | – | cross entropy |
| | `--augmentation` | – | $1$ (crop $+$ flip) |
| *Method* | `--method` | – | sbd |
| | `--err_expand` | – | $2$ (rank 30) |
| | `--ERR_SCALE` | $\alpha$ | $1.0$ |
| | `--SOFTMAX_TEMP` | $T$ | $1.0$ |
| | effective $D_{\varepsilon}$ | $D_{\varepsilon}$ | $30$ |
| *Optimizer* | `--lr` | $\eta_{0}$ | $4\times 10^{-4}$ |
| | `--lr_drop_rate` | – | $0.97$ |
| | `--lr_drop_every` | – | $1$ epoch |
| | `--weight_decay` | – | $1\times 10^{-5}$ |
| | Adam $\beta_{1}$, $\beta_{2}$ | – | $0.9$, $0.999$ |
| | precision | – | float32 |
| *SBD coefficients* | `--CMSE_OUT` | $c_{\mathrm{score}}^{(\mathrm{fc4})}$ | $10$ |
| | `--CMSE_OUT2` | $c_{\mathrm{score}}^{(\mathrm{fc3})}$ | $0.1$ |
| | `--CMSE_HIDDEN` | $c_{\mathrm{score}}^{(\mathrm{conv})}$ | $0.1$ |
| | `--CCOV_OUT` | $c_{\mathrm{cov}}^{(\mathrm{fc4})}$ | $1\times 10^{-7}$ |
| | `--CCOV_OUT2` | $c_{\mathrm{cov}}^{(\mathrm{fc3})}$ | $1\times 10^{-7}$ |
| | `--CCOV_HIDDEN` | $c_{\mathrm{cov}}^{(\mathrm{conv})}$ | $0$ |
| | `--CL1_OUT` | $c_{\ell_{1}}^{(\mathrm{fc3})}$ | $1\times 10^{-11}$ |
| | `--CL1_HIDDEN` | $c_{\ell_{1}}^{(\mathrm{hidden})}$ | $0$ |
| *Covariance EMA* | `--Reh_lambda` | $\lambda$ | $0.99999$ |
| | `--Reh_lambda2` | $\lambda_{2}$ | $0.99999$ |
| | `--Reh_lambda_drop` | $\rho$ | $0.04$ |
| | `--Reh_lambda_drop_every` | – | $1$ epoch |
| | `--Reh_gain` | $\mathrm{std}(\widehat{R}^{(\mathrm{conv})}_{g\varepsilon})$ | $0.01$ |
| | `--Reh_gain_lin` | $\mathrm{std}(\widehat{R}^{(\mathrm{FC})}_{g\varepsilon})$ | $0.01$ |
| | `--Reh_ini` | init diagonal of $\widehat{R}_{hh}$ | $1\times 10^{-8}$ |
| *Architecture* | `--P0`, `--P1`, `--P2` | channel counts | $128,\,64,\,64$ |
| | `--fc_hidden` | fc3 width $F_{h}$ | $1024$ |
| | `--use_bias` | – | $0$ (disabled) |
| | `--scale_factor`, `--init_dist` | init scale, dist. | $6$, Gaussian |
### E\.4Accuracy curves

Figure [4](https://arxiv.org/html/2605.30638#A5.F4) shows the CIFAR-10 training and test accuracy curves for BP, DFA, SBD, and the score-expansion variant of SBD on the CNN architecture used in Erdogan et al. ([2025](https://arxiv.org/html/2605.30638#bib.bib7)) and described in Section [E.2](https://arxiv.org/html/2605.30638#A5.SS2). The expansion variant uses the confidence-weighted and `roll(5)` blocks defined in Eq. ([5](https://arxiv.org/html/2605.30638#A5.E5)).

![Refer to caption](https://arxiv.org/html/2605.30638v1/figures/CIFAR10CNNAccuracy.png)

Figure 4: Training and test accuracy curves on CIFAR-10 for BP, DFA, SBD, and SBD with score expansion on the CNN architecture of Section [E.2](https://arxiv.org/html/2605.30638#A5.SS2). The expanded-broadcast variant uses the confidence-weighted $\mathbf{p}\odot\boldsymbol{\delta}$ block together with the $\mathrm{roll}_{5}(\mathbf{p})\odot\boldsymbol{\delta}$ block from Eq. ([5](https://arxiv.org/html/2605.30638#A5.E5)).

The curves are consistent with the final accuracies reported in Table [1](https://arxiv.org/html/2605.30638#S8.T1). In particular, SBD trained with cross entropy attains higher test accuracy than conventional DFA trained with the same loss, indicating that the score-broadcast update is more effective than the standard direct-feedback alternative in this setting. The score-expansion variant closely tracks plain SBD early in training and then provides a modest additional improvement later in training, yielding a small but consistent gain over using the score vector alone.

Figure [5](https://arxiv.org/html/2605.30638#A5.F5) shows the corresponding CIFAR-10 training and test accuracy curves for the $4\times$-width CNN of Table [6](https://arxiv.org/html/2605.30638#A5.T6). In this higher-capacity setting, we report BP and SBD with the expanded broadcast vector, where the expansion again uses the confidence-weighted and `roll5` blocks from Eq. ([5](https://arxiv.org/html/2605.30638#A5.E5)).

![Refer to caption](https://arxiv.org/html/2605.30638v1/figures/CIFAR10CNN4WidthAccuracy.png)

Figure 5: Training and test accuracy curves on CIFAR-10 for BP and SBD with score expansion on the $4\times$-width CNN architecture in Table [6](https://arxiv.org/html/2605.30638#A5.T6). The SBD variant uses the same expanded broadcast construction as in the baseline setting, namely the confidence-weighted $\mathbf{p}\odot\boldsymbol{\delta}$ block together with the $\mathrm{roll}_{5}(\mathbf{p})\odot\boldsymbol{\delta}$ block.

Compared with the baseline-width model, both methods benefit substantially from the $4\times$-width scaling. Using the final accuracies reported in Section [8](https://arxiv.org/html/2605.30638#S8), BP increases from 78.51% to 83.1%, a gain of 4.59 percentage points, while SBD with expanded broadcast increases from 70.03% to 74.46%, a gain of 4.43 percentage points. The improvement is thus between 4 and 5 percentage points for both methods. The wider network therefore preserves the same qualitative ordering while improving the absolute performance of both BP and expanded SBD.

##### Reproducibility\.

All source code and hyperparameters used in the experiments are provided in the supplementary materials\.

### E.5 Design exploration for score-expansion modulators

The general theory of Section [7](https://arxiv.org/html/2605.30638#S7) (Appendix [D](https://arxiv.org/html/2605.30638#A4)) shows that any deterministic $X$-measurable map $\phi:\mathcal{X}\to\mathbb{R}^{D_{\mathrm{out}}}$ yields a modulated score $\eta^{\phi}(X)=\phi(X)\odot\boldsymbol{\delta}(Y,\mathbf{a}(X))$ whose population-optimal version preserves the conditional mean-zero property. This leaves substantial freedom in the choice of modulators: they may be functions of the raw input, the logits, the probabilities, hidden activations, or any other deterministic quantity generated from $X$. Different choices lead to broadcast vectors of different effective rank and qualitatively different couplings to the predictive distribution. This appendix documents the design exploration that led to the rank-$3D$ instance reported in Section [8](https://arxiv.org/html/2605.30638#S8) and Appendix [D](https://arxiv.org/html/2605.30638#A4).
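The mean-zero claim rests on a one-line conditioning argument, sketched here for convenience. The symbol $\mathbf{a}^{\star}$ for the population-optimal output map is notation introduced only for this illustration. The sketch assumes that the expectations exist and that the optimal score satisfies $\mathbb{E}[\boldsymbol{\delta}(Y,\mathbf{a}^{\star}(X))\mid X]=\mathbf{0}$, as required by the orthogonality principle. Because $\phi(X)$ is fixed once $X$ is given, it factors out of the conditional expectation coordinate by coordinate:

$$
\mathbb{E}\bigl[\phi(X)\odot\boldsymbol{\delta}(Y,\mathbf{a}^{\star}(X))\,\big|\,X\bigr]
=\phi(X)\odot\mathbb{E}\bigl[\boldsymbol{\delta}(Y,\mathbf{a}^{\star}(X))\,\big|\,X\bigr]
=\mathbf{0}.
$$

The same argument applies to each block of a concatenated broadcast vector, since every block is of this modulated form.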

##### Expansion methods\.

In the formulas below, $\mathbf{a}=\mathbf{a}(X)$ and $\mathbf{p}=\mathbf{p}(X)$ are deterministic functions of the input for fixed network parameters, so each candidate is an $X$-measurable special case of the general construction.

- i. confidence-weighted: $\mathbf{\eta}_{\text{conf}}(X)\;=\;\mathbf{p}(X)\odot\boldsymbol{\delta}\;\in\;\mathbb{R}^{D}.$ The class-$d$ residual is multiplied by the model's predicted probability for class $d$. Coordinates corresponding to confidently predicted classes contribute proportionally more to the broadcast; coordinates where $p_{d}\approx 0$ are suppressed.
- ii. logit-weighted: $\mathbf{\eta}_{\text{logit}}(X)\;=\;\mathbf{a}(X)\odot\boldsymbol{\delta}\;\in\;\mathbb{R}^{D}.$ The same idea as confidence-weighted, but using the raw logits $\mathbf{a}(X)$ instead of the softmax probabilities. Logit magnitudes are unbounded, so this block can produce comparatively large modulators once the network has converged.
- iii. rollk: $\mathbf{\eta}_{\text{rollk}}(X;k)\;=\;\mathrm{roll}_{k}(\mathbf{p}(X))\odot\boldsymbol{\delta}\;\in\;\mathbb{R}^{D}.$ A single cyclic shift of $\mathbf{p}(X)$ by $k$ positions, with $k\in\{0,1,\dots,D-1\}$, provides cross-class coupling: coordinate $d$ is modulated by the probability of the class $k$ positions away in the class index. For $k=0$ the block is identical to confidence-weighted; for $k=5$ it is identical to roll5. The pair $(k,\,D-k)$ is information-equivalent up to a permutation of class pairs.
- iv. full-cyclic: $\mathbf{\eta}_{\text{full}}(X)\;=\;\Bigl[\,\mathrm{roll}_{1}(\mathbf{p}(X))\odot\boldsymbol{\delta}\,;\ \dots\,;\ \mathrm{roll}_{D-1}(\mathbf{p}(X))\odot\boldsymbol{\delta}\Bigr]\;\in\;\mathbb{R}^{D(D-1)}.$ Concatenates all $D-1$ non-trivial cyclic shifts of $\mathbf{p}(X)$ into one high-rank block. For $D=10$ this contributes $90$ extra dimensions and captures all binary class-pairings $(d,(d+k)\bmod D)$ for $k=1,\dots,9$ simultaneously.
- v. `boundary_weight`: A band-pass modulator built from two softmax distributions at different temperatures: $\mathbf{\phi}_{\text{bd}}(X)\;=\;\operatorname{softmax}(\mathbf{a}(X)/T_{1})\;-\;\operatorname{softmax}(\mathbf{a}(X)/T_{2})$ with $T_{1}=0.5,\,T_{2}=2.0$, and $\mathbf{\eta}_{\text{bd}}(X)\;=\;\mathbf{\phi}_{\text{bd}}(X)\odot\boldsymbol{\delta}\;\in\;\mathbb{R}^{D}.$ At a sharp temperature ($T=0.5$) the softmax concentrates on the top class; at a soft temperature ($T=2.0$) it spreads. Their difference is small when the network is either very confident or very uniform across classes, and large when the prediction is genuinely on a class boundary. The block therefore band-pass filters the residual by "where the prediction is most uncertain on the class-discrimination axis."
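The five candidate blocks listed above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names `softmax`, `expansion_blocks`, and `selected_broadcast` are hypothetical, the score is taken to be the cross-entropy score $\boldsymbol{\delta}=\mathbf{p}-\mathbf{y}$, and $\mathrm{roll}_{k}$ is assumed to correspond to `np.roll` along the class axis (the shift direction used in the actual code is not stated in the text).

```python
import numpy as np


def softmax(a, T=1.0):
    """Row-wise softmax of logits `a` at temperature T (max-subtracted for stability)."""
    z = a / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def expansion_blocks(a, y, k=5, T1=0.5, T2=2.0):
    """Candidate broadcast blocks for logits a (N, D) and one-hot labels y (N, D).

    Hypothetical helper: the names and the roll direction are assumptions.
    """
    p = softmax(a)
    delta = p - y  # cross-entropy score
    D = a.shape[-1]

    def roll(v, shift):
        # ASSUMPTION: roll_k is a cyclic shift along the class axis, as in np.roll.
        return np.roll(v, shift, axis=-1)

    return {
        "score": delta,                      # plain SBD broadcast
        "conf": p * delta,                   # i.   confidence-weighted
        "logit": a * delta,                  # ii.  logit-weighted
        "rollk": roll(p, k) * delta,         # iii. single cyclic shift
        "full": np.concatenate(              # iv.  all D-1 non-trivial shifts
            [roll(p, s) * delta for s in range(1, D)], axis=-1
        ),
        "bd": (softmax(a, T1) - softmax(a, T2)) * delta,  # v. boundary weight
    }


def selected_broadcast(a, y):
    """Rank-3D vector [delta; p * delta; roll_5(p) * delta] used in the main text."""
    b = expansion_blocks(a, y, k=5)
    return np.concatenate([b["score"], b["conf"], b["rollk"]], axis=-1)
```

For $D=10$, `selected_broadcast` returns a vector of width $30$ per sample and the `"full"` block has width $90$, matching the dimensions quoted in the list.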

##### Selection protocol\.

Because the cost of a full multi-seed sweep over all subsets of these candidates for several values of $M$ would be prohibitive, the design exploration was conducted at a single seed. We evaluated a representative set of subsets on the CIFAR-10 CNN of Appendix [E](https://arxiv.org/html/2605.30638#A5), then retained the best-performing rank-$3D$ instance, and only that selected configuration was carried forward to the multi-seed evaluation reported in Table [1](https://arxiv.org/html/2605.30638#S8.T1). The single-seed numbers below are intended to document the search rather than to provide statistical comparisons among variants.

##### Modulator\-combination ablation\.

Table [8](https://arxiv.org/html/2605.30638#A5.T8) reports CIFAR-10 test accuracy for the candidate combinations evaluated. The selected combination $[\boldsymbol{\delta};\,\mathbf{p}\odot\boldsymbol{\delta};\,\mathrm{roll}_{5}(\mathbf{p})\odot\boldsymbol{\delta}]$ used in the main text is highlighted.

Table 8: Single-seed CIFAR-10 test accuracy for candidate score-expansion modulator combinations on the CNN of Appendix [E.2](https://arxiv.org/html/2605.30638#A5.SS2). Entries report top-1 test accuracy (%) from a single seed; these numbers document the design search and are not intended as statistical comparisons. The configuration selected for the multi-seed evaluation in Table [1](https://arxiv.org/html/2605.30638#S8.T1) is shown in bold.

| Modulator combination | $M$ | Test acc. (%, 1 seed) |
| --- | --- | --- |
| score + confidence-weighted | 2 | 69.46 |
| score + logit-weighted | 2 | 62 (terminated early) |
| score + boundary-weighted | 2 | 69.72 |
| full-cyclic | 10 | 69.24 |
| score + roll5 + confidence-weighted | 3 | **70.46** |
| score + confidence-weighted + boundary-weighted + roll5 | 4 | 69.44 |

### E.6 Cosine alignment between SBD updates and backpropagation gradients

To quantify the directional relationship between the local SBD gradient and the standard backpropagation (BP) gradient, we compute a signed cosine similarity for each hidden layer of the CIFAR-10 CNN during SBD training. Let $S^{(k)}_{e,b}$ denote the unregularized three-factor local SBD gradient for layer $k$ at epoch $e$ and mini-batch $b$, i.e., the tensor that is written into `param.grad` and consumed by the Adam optimizer (see Appendix [E.3](https://arxiv.org/html/2605.30638#A5.SS3)), evaluated at the current network parameters before the optimizer step. The actual weight increment is $-\eta\,\mathrm{Adam}(S^{(k)}_{e,b})$; $S^{(k)}_{e,b}$ itself is the gradient term consumed by the optimizer, not the update step. Let $G^{(k)}_{\mathrm{BP},e,b}=\nabla_{W^{(k)}}\mathcal{L}_{\mathrm{CE}}$ denote the BP gradient of the cross entropy loss computed on the same mini-batch and at the same parameter values. The BP gradient is used only as a diagnostic; the network weights are still updated only by the SBD rule. For each hidden layer, both tensors are flattened and compared as

$$
c^{(k)}_{e,b}=\frac{\left\langle\operatorname{vec}\!\left(S^{(k)}_{e,b}\right),\operatorname{vec}\!\left(G^{(k)}_{\mathrm{BP},e,b}\right)\right\rangle}{\left\|\operatorname{vec}\!\left(S^{(k)}_{e,b}\right)\right\|_{2}\left\|\operatorname{vec}\!\left(G^{(k)}_{\mathrm{BP},e,b}\right)\right\|_{2}+\epsilon_{\mathrm{num}}},\tag{6}
$$

where $\epsilon_{\mathrm{num}}$ is a small numerical stabilizer. We do not take an absolute value in Eq. ([6](https://arxiv.org/html/2605.30638#A5.E6)); hence positive values denote alignment between the local SBD gradient and the BP gradient, while negative values would denote anti-alignment. Because both quantities are gradient-type tensors that the optimizer subtracts, positive cosine corresponds to a shared descent direction in the resulting weight updates. The corresponding angle, when reported, is $\arccos(\mathrm{clip}(c^{(k)}_{e,b},-1,1))$, but Figure [6](https://arxiv.org/html/2605.30638#A5.F6) plots the cosine similarities themselves. The mini-batch cosine values are averaged within each epoch and the experiment is repeated over five independent seeds. Solid lines in the plot show the across-seed mean, and the shaded envelopes show one standard deviation.
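The diagnostic in Eq. (6) is simple to compute once both gradient tensors are available. The NumPy sketch below illustrates the flattening, the stabilized signed cosine, the clipped angle, and the within-epoch averaging. The function names and the default `eps_num=1e-12` are assumptions made for illustration; the value of $\epsilon_{\mathrm{num}}$ used in the experiments is not stated in the text.

```python
import numpy as np


def signed_cosine(S, G, eps_num=1e-12):
    """Signed cosine between two gradient tensors of the same shape, as in Eq. (6).

    eps_num is a hypothetical default for the numerical stabilizer.
    """
    s = np.ravel(S)  # vec(S)
    g = np.ravel(G)  # vec(G_BP)
    return float(s @ g / (np.linalg.norm(s) * np.linalg.norm(g) + eps_num))


def angle_from_cosine(c):
    """Angle in radians; clipping guards against round-off outside [-1, 1]."""
    return float(np.arccos(np.clip(c, -1.0, 1.0)))


def epoch_mean_cosine(pairs, eps_num=1e-12):
    """Average the per-mini-batch cosines over one epoch.

    `pairs` is an iterable of (S, G) tensors for one layer, one pair per mini-batch.
    """
    return float(np.mean([signed_cosine(S, G, eps_num) for S, G in pairs]))
```

Keeping the sign, rather than taking an absolute value, is what allows the plot to distinguish alignment from anti-alignment.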

![Refer to caption](https://arxiv.org/html/2605.30638v1/figures/cos_sim_plot.png)

Figure 6: Cosine similarity between the local SBD gradient and the diagnostic BP gradient for each hidden layer of the CIFAR-10 CNN. Curves show means over five seeds, and shaded envelopes denote one standard deviation. Across all hidden layers, the local SBD gradient has a consistently positive projection onto the corresponding BP gradient throughout training.

Figure [6](https://arxiv.org/html/2605.30638#A5.F6) shows that the local SBD gradient remains positively aligned with the BP gradient in every hidden layer. The deepest hidden fully connected layer exhibits the largest cosine similarity, while the convolutional layers also maintain positive alignment throughout training. Since both $S^{(k)}_{e,b}$ and $G^{(k)}_{\mathrm{BP},e,b}$ are gradient-type quantities that the optimizer subtracts, this positive cosine implies that the actual SBD weight update shares a non-trivial descent component with the BP update at the same training instant, even though SBD does not use BP to update the weights.

## Appendix F Tiny ImageNet experimental setup

This appendix documents the codebase used for the Tiny ImageNet experiments reported in the main text, including BP, DFA, plain SBD, and score-expanded SBD. Tiny ImageNet (Le and Yang, [2015](https://arxiv.org/html/2605.30638#bib.bib37)), a downsized 200-class subset of ImageNet, was released as a Stanford CS231N course project without a formal open-source license declaration on its original distribution page. It is used here for non-commercial academic research in accordance with the underlying ImageNet terms of use.

The Tiny ImageNet benchmark consists of $200$ classes with $500$ training images and $50$ validation images per class, all resized to $64\times 64$ RGB. The reported SBD configurations are implemented by per-seed launch scripts in the Tiny ImageNet codebase: plain SBD uses `CNN_tinyImageNet/SBD/run_tinet_sbd_ce_s*.sh`, while score-expanded SBD uses `CNN_tinyImageNet/SBD_exp/run_tinet_sbd_exp_ce_s*.sh`. These scripts call the corresponding Tiny ImageNet trainers, `cnn_tinet.py` for plain SBD and `cnn_tinete.py` for score-expanded SBD, applied to the 6-layer CNN described in Section [F.1](https://arxiv.org/html/2605.30638#A6.SS1).

The forward and local update rules follow the CIFAR-10 SBD formulation of Section [E.1](https://arxiv.org/html/2605.30638#A5.SS1); this appendix therefore focuses on the Tiny-ImageNet-specific architecture, data pipeline, and hyperparameter choices. All experiments run in `float32` on a single NVIDIA V100 (32GB) GPU. For comparison on a common budget, the approximate times to complete $200$ epochs on Tiny ImageNet were $14$, $20$, $30$, and $32$ hours for BP, DFA, SBD, and SBD with score expansion, respectively.

For plain SBD, the broadcast vector equals the raw cross-entropy score

$$
\boldsymbol{\varepsilon}=\boldsymbol{\delta}=\mathbf{p}-\mathbf{y}\in\mathbb{R}^{200}.
$$

For score-expanded SBD, the implementation uses the rank-$3D$ expansion described in Section [E.1](https://arxiv.org/html/2605.30638#A5.SS1),

$$
\boldsymbol{\varepsilon}=[\boldsymbol{\delta};\;\mathbf{p}\odot\boldsymbol{\delta};\;\mathrm{roll}_{5}(\mathbf{p})\odot\boldsymbol{\delta}]\in\mathbb{R}^{600},
$$

selected by `--err_expand=2` in the `SBD_exp` launch scripts. The softmax temperature and broadcast scale are fixed at $T=1$ and $\alpha=1$ in the Tiny ImageNet implementation.

### F.1 Network architecture

The Tiny ImageNet network is a wider and deeper variant of the CIFAR-10 backbone. It uses three $5\times 5$ convolutional blocks with max-pooling, followed by three fully connected layers with ReLU and dropout. Bias terms are enabled in every convolution and linear layer. A single `--width_multiplier` flag scales all conv channels and the FC hidden width jointly.

Table 9: Tiny ImageNet CNN architecture at `--width_multiplier` $=1.0$. Input: Tiny ImageNet images of shape $(3,\,64,\,64)$. Total trainable parameters: $14{,}130{,}888$.

| Layer | Operation | Output shape | Parameters |
| --- | --- | --- | --- |
| conv1 | $5\times 5$ conv, stride 2, pad 2 | $(96,\,32,\,32)$ | 7,296 |
| act1 | ReLU | $(96,\,32,\,32)$ | – |
| pool1 | MaxPool $2\times 2$ | $(96,\,16,\,16)$ | – |
| conv2 | $5\times 5$ conv, stride 1, pad 2 | $(128,\,16,\,16)$ | 307,328 |
| act2 | ReLU | $(128,\,16,\,16)$ | – |
| pool2 | MaxPool $2\times 2$ | $(128,\,8,\,8)$ | – |
| conv3 | $5\times 5$ conv, stride 1, pad 2 | $(256,\,8,\,8)$ | 819,456 |
| act3 | ReLU | $(256,\,8,\,8)$ | – |
| pool3 | MaxPool $2\times 2$ | $(256,\,4,\,4)$ | – |
| flatten | reshape | $(4096,)$ | – |
| fc3 | linear | $(2048,)$ | 8,390,656 |
| act4 | ReLU + Dropout($p$) | $(2048,)$ | – |
| fc4 | linear | $(2048,)$ | 4,196,352 |
| act5 | ReLU + Dropout($p$) | $(2048,)$ | – |
| fc5 | linear | $(200,)$ | 409,800 |
| softmax | softmax at $T=1$ | $(200,)$ | – |

The conv channels $P_{0},P_{1},P_{2}$ and FC hidden width $F_{h}$ are parameterized as

$$
P_{0}=\mathrm{round}(96\cdot w),\quad P_{1}=\mathrm{round}(128\cdot w),\quad P_{2}=\mathrm{round}(256\cdot w),\quad F_{h}=\mathrm{round}(2048\cdot w),
$$

where $w=\texttt{width\_multiplier}$. The fc3 input dimension $P_{2}\cdot 4\cdot 4$ is derived automatically from the conv stack, giving $4096$ at `--width_multiplier` $=1.0$. Weights are initialized from a Gaussian distribution with standard deviation $\sqrt{2/(\texttt{scale\_factor}\cdot\text{fan\_in})}$, giving a Kaiming-like scaling with $\texttt{scale\_factor}=6$; biases are initialized to zero.
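The width parameterization and the parameter counts of Table 9 can be cross-checked with a short script. The sketch below is an illustration only: the function names are hypothetical, Python's built-in `round` (which rounds exact halves to even) stands in for the unspecified rounding in the authors' code, and every layer is assumed to carry a bias, as stated in the text.

```python
import math


def tiny_imagenet_cnn_params(w=1.0, n_classes=200, in_channels=3, kernel=5, pooled_hw=4):
    """Per-layer trainable parameter counts for the CNN of Table 9 at width multiplier w."""
    # Channel counts and FC width scale jointly with w.
    # ASSUMPTION: Python's round() matches the rounding used in the real code.
    P0, P1, P2 = round(96 * w), round(128 * w), round(256 * w)
    Fh = round(2048 * w)

    def conv(c_in, c_out):
        # kernel x kernel weights per (input, output) channel pair, plus one bias per output channel
        return c_in * c_out * kernel * kernel + c_out

    def fc(n_in, n_out):
        # dense weight matrix plus bias vector
        return n_in * n_out + n_out

    fc3_in = P2 * pooled_hw * pooled_hw  # 4096 at w = 1.0
    return {
        "conv1": conv(in_channels, P0),
        "conv2": conv(P0, P1),
        "conv3": conv(P1, P2),
        "fc3": fc(fc3_in, Fh),
        "fc4": fc(Fh, Fh),
        "fc5": fc(Fh, n_classes),
    }


def init_std(fan_in, scale_factor=6):
    """Gaussian init standard deviation sqrt(2 / (scale_factor * fan_in))."""
    return math.sqrt(2.0 / (scale_factor * fan_in))


if __name__ == "__main__":
    counts = tiny_imagenet_cnn_params(1.0)
    for name, n in counts.items():
        print(f"{name}: {n:,}")
    print(f"total: {sum(counts.values()):,}")
```

At `w = 1.0` the per-layer counts and their sum agree with the entries of Table 9, including the total of 14,130,888 trainable parameters.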

### F.2 Data pipeline and augmentation

All images are normalized per-channel using the standard ImageNet statistics

$$
\boldsymbol{\mu}=(0.485,\,0.456,\,0.406),\qquad\boldsymbol{\sigma}=(0.229,\,0.224,\,0.225).
$$

Training augmentation consists of (i) a random crop to $64\times 64$ from a $4$-pixel reflection-padded image and (ii) random horizontal flipping, applied independently per sample and per epoch. No color jittering, mixup, or RandAugment is used in the main-text configuration.

At evaluation time we apply a deterministic mirror-and-translation ensemble implemented in `infer_mirror_translate`. For each validation image, the code first averages the model outputs over the original image and its horizontal flip. It then applies two deterministic reflection-padded translation shifts and averages the mirror-averaged outputs of these translated views. The final prediction used to compute test accuracy is

$$
0.5\,\bar{\mathbf{y}}_{\mathrm{mirror}}+0.5\,\bar{\mathbf{y}}_{\mathrm{trans}},
$$

where $\bar{\mathbf{y}}_{\mathrm{mirror}}$ is the mean of two forward passes and $\bar{\mathbf{y}}_{\mathrm{trans}}$ is the mean over two translated mirror pairs. The evaluation data loader itself applies no stochastic augmentation beyond normalization.
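The evaluation ensemble described above can be illustrated with a small NumPy sketch. It is not the code of `infer_mirror_translate`: the helper names are hypothetical, the two shift offsets `(2, 2)` and `(-2, -2)` are placeholders because the actual offsets are not given in the text, and `model` is assumed to be any callable that maps a batch of shape `(N, C, H, W)` to an array of outputs.

```python
import numpy as np


def hflip(x):
    """Mirror a batch of images (N, C, H, W) along the width axis."""
    return x[..., ::-1]


def translate_reflect(x, dy, dx):
    """Shift image content by (dy, dx) pixels, filling the exposed border by reflection."""
    pad = max(abs(dy), abs(dx))
    if pad == 0:
        return x
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="reflect")
    H, W = x.shape[-2:]
    top, left = pad - dy, pad - dx
    return xp[..., top:top + H, left:left + W]


def mirror_avg(model, x):
    """Average the model outputs over an image batch and its horizontal flip."""
    return 0.5 * (model(x) + model(hflip(x)))


def mirror_translate_predict(model, x, shifts=((2, 2), (-2, -2))):
    """Return 0.5 * y_mirror + 0.5 * y_trans for a batch x.

    ASSUMPTION: `shifts` holds placeholder offsets; the real values are not stated.
    """
    y_mirror = mirror_avg(model, x)  # two forward passes
    y_trans = np.mean(
        [mirror_avg(model, translate_reflect(x, dy, dx)) for dy, dx in shifts],
        axis=0,
    )  # mean over two translated mirror pairs
    return 0.5 * y_mirror + 0.5 * y_trans
```

Whether the averaged outputs are logits or probabilities is not specified in the text, so the sketch simply averages whatever `model` returns.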

For the results reported in this paper, Tiny ImageNet uses the standard split: training is performed on the official training set and evaluation is performed on the official validation set, since the standard Tiny ImageNet release does not provide labels for the test set. The dataset loader expects the original Tiny ImageNet directory structure: training images under `train/<wnid>/images/` and validation images in the flat `val/images/` directory indexed by `val/val_annotations.txt`. The root resolver accepts either the Tiny ImageNet root itself or a parent directory containing `tiny-imagenet-200`. If a separate labeled test split with annotations is supplied, the code can also construct a dedicated test loader, but none of the reported numbers in this paper use that option.

### F.3 Training hyperparameters

Table [10](https://arxiv.org/html/2605.30638#A6.T10) lists the training hyperparameters used by the Tiny ImageNet score-expanded SBD launch scripts `CNN_tinyImageNet/SBD_exp/run_tinet_sbd_exp_ce_s*.sh`. The *SBD coefficients* block follows the layer indexing of Section [E.1](https://arxiv.org/html/2605.30638#A5.SS1), with `CMSE_OUT` now applied to the deeper output layer fc5, `CMSE_OUT2` shared between the two hidden FC layers fc3 and fc4, and `CMSE_HIDDEN` shared across all three convolutional layers. The covariance coefficients use the analogous grouping, while `CL1_OUT` is applied to the two hidden FC layers fc3 and fc4 in the Tiny ImageNet implementation.

The Tiny ImageNet expanded-SBD hyperparameters in Table [10](https://arxiv.org/html/2605.30638#A6.T10) were selected by a grid search over the SBD learning-rate and regularization coefficients. The search was managed using Weights & Biases sweeps (Biewald, [2020](https://arxiv.org/html/2605.30638#bib.bib39)), including automatic early termination of clearly underperforming configurations.

Table 10: Tiny ImageNet expanded-SBD training hyperparameters. Values shown correspond to the score-expanded SBD CE launch scripts `CNN_tinyImageNet/SBD_exp/run_tinet_sbd_exp_ce_s*.sh`. Column *Flag* is the CLI name in the launch script; column *Symbol* names the variable used in the equations of Section [E.1](https://arxiv.org/html/2605.30638#A5.SS1).

| Group | Flag | Symbol | Value |
| --- | --- | --- | --- |
| *General* | `--logger_name` | seed | 5 independent seeds |
| | `--n_epochs` | – | 251 |
| | `--batch_size` | $B$ | 28 |
| | `--loss_out` | – | cross entropy (ce) |
| | augmentation | – | RandomCrop pad 4 + HFlip |
| *Method* | `--method` | – | sbd |
| | `--err_expand` | – | 2 |
| | broadcast vector | $\boldsymbol{\varepsilon}$ | rank-$3D$ expansion of $\boldsymbol{\delta}$ |
| | softmax temperature | $T$ | 1.0 (fixed) |
| | effective $D_{\varepsilon}$ | $D_{\varepsilon}$ | 600 |
| *Optimizer* | `--lr` | $\eta_{0}$ | $8.5\times 10^{-5}$ |
| | `--lr_drop_rate` | – | 0.98 |
| | `--lr_drop_every` | – | 1 epoch |
| | `--weight_decay` | – | $1.25\times 10^{-5}$ |
| | Adam $\beta_{1}$, $\beta_{2}$ | – | 0.9, 0.999 |
| | precision | – | float32 |
| *SBD coefficients* | `--CMSE_OUT` | $c_{\mathrm{score}}^{(\mathrm{fc5})}$ | 10 |
| | `--CMSE_OUT2` | $c_{\mathrm{score}}^{(\mathrm{fc3,fc4})}$ | 0.1 |
| | `--CMSE_HIDDEN` | $c_{\mathrm{score}}^{(\mathrm{conv})}$ | 0.1 |
| | `--CCOV_OUT` | $c_{\mathrm{cov}}^{(\mathrm{fc5})}$ | $1\times 10^{-7}$ |
| | `--CCOV_OUT2` | $c_{\mathrm{cov}}^{(\mathrm{fc3,fc4})}$ | $6\times 10^{-7}$ |
| | `--CCOV_HIDDEN` | $c_{\mathrm{cov}}^{(\mathrm{conv})}$ | 0 |
| | `--CL1_OUT` | $c_{\ell_{1}}^{(\mathrm{fc3,fc4})}$ | $1\times 10^{-7}$ |
| | `--CL1_HIDDEN` | $c_{\ell_{1}}^{(\mathrm{hidden})}$ | 0 |
| *Covariance EMA* | `--Reh_lambda` | $\lambda$ | 0.999992 |
| | `--Reh_lambda2` | $\lambda_{2}$ | 0.999992 |
| | `--Reh_lambda_drop` | $\rho$ | 0.015 |
| | `--Reh_lambda_drop_every` | – | 1 epoch |
| | `--Reh_gain` | $\mathrm{std}(\widehat{R}^{(\mathrm{conv})}_{g\varepsilon})$ | 0.01 |
| | `--Reh_gain_lin` | $\mathrm{std}(\widehat{R}^{(\mathrm{FC})}_{g\varepsilon})$ | 0.01 |
| | `--Reh_ini` | init diagonal of $\widehat{R}_{hh}$ | $1\times 10^{-8}$ |
| *Architecture* | `--P0`, `--P1`, `--P2` | channel counts | $(96,\,128,\,256)$ |
| | `--dropout` | $p$ | 0.08 |
| | `--scale_factor`, `--init_dist` | init scale, dist. | 6, Gaussian |

### F.4 Score expansion

In the Tiny ImageNet experiments, we employed the same expansion modulators used in the CIFAR-10 experiments: confidence-weighted $\mathbf{p}\odot\boldsymbol{\delta}$ and $\mathrm{roll}_{5}(\mathbf{p})\odot\boldsymbol{\delta}$. Note that the numerical class indices in the Tiny ImageNet labeling are arbitrary and have no semantic meaning. Therefore, the use of $5$ as the roll index is also arbitrary (i.e., we do not need to use $100$). Since there are $200$ outputs, the score size is $200$, and therefore the hidden-layer broadcast dimension for SBD Exp is $600$.

![Refer to caption](https://arxiv.org/html/2605.30638v1/figures/TinyImageNetCNN_Train_Accuracy.png)

Figure 7: Training accuracy on Tiny ImageNet for BackProp, SBD, and SBD with score expansion (SBD_Exp) across 250 epochs. Each curve represents the mean over five runs, and the shaded region denotes one standard deviation.

### F.5 Accuracy curves

Figure [8](https://arxiv.org/html/2605.30638#A6.F8) and Figure [7](https://arxiv.org/html/2605.30638#A6.F7) show the Tiny ImageNet test and training accuracy curves for BP, SBD, and SBD with score expansion. The curves show a clear qualitative separation between the methods. BP obtains the highest test accuracy, reaching 39.89%, while SBD with score expansion reaches 31.43% and plain SBD reaches 28.29%. Thus, as in the CIFAR-10 experiments, expanding the broadcast vector improves the SBD update, yielding a gain of 3.14 percentage points over using the score vector alone.

![Refer to caption](https://arxiv.org/html/2605.30638v1/figures/TinyImageNetCNN_Test_Accuracy.png)

Figure 8: Test accuracy on Tiny ImageNet for BackProp, SBD, and SBD with score expansion (SBD_Exp) across 250 epochs. Each curve represents the mean over five runs, and the shaded region denotes one standard deviation. The expanded SBD variant consistently improves over baseline SBD, while BackProp achieves the highest test accuracy.

Figure [9](https://arxiv.org/html/2605.30638#A6.F9) shows the corresponding DFA training and test accuracy curves. DFA improves slowly over the longer 500-epoch run, reaching 17.49% test accuracy and 21.02% training accuracy at the end of training. Even with this longer training horizon, DFA remains well below the SBD variants, indicating that the score-broadcast update is more effective than the standard direct-feedback alternative on the Tiny ImageNet CNN experiment as well.

![Refer to caption](https://arxiv.org/html/2605.30638v1/figures/TinyImageNetCNN_DFA_Train_Test_Accuracy.png)

Figure 9: Training and test accuracy on Tiny ImageNet for DFA across 500 epochs. Each curve represents the mean over five runs, and the shaded region denotes one standard deviation. DFA improves slowly throughout training and remains substantially below the accuracies obtained by BackProp and the SBD variants.

A notable difference from the SBD and DFA curves is the strong overfitting observed for BP. In the training plot, BP rapidly approaches nearly perfect training accuracy, reaching 99.98%, whereas its test accuracy saturates at 39.89%. This corresponds to a train-test gap of 60.09 percentage points. In contrast, the SBD variants achieve lower training accuracy and also lower test accuracy, but exhibit a much less extreme separation between training and test performance. The expanded SBD model, for example, reaches 53.77% training accuracy and 31.43% test accuracy. This suggests that BP fits the Tiny ImageNet training set very aggressively in this architecture, while the broadcast-based learning rules act as a more constrained optimization procedure.
