Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training

arXiv cs.AI Papers

Summary

This paper shows that layer-local training methods like Forward-Forward (FF) do not scale to realistic image sizes and datasets, and that synthetic benchmarks overstate their performance. The authors introduce a strong FF variant (DTG-FF) and demonstrate that on real data (e.g., ImageNet-100 at 224x224) FF achieves only 49.4% versus typical BP above 75%, while on synthetic tasks the gap narrows or reverses.

arXiv:2606.06539v1 Announce Type: cross Abstract: Forward-Forward (FF) learning [Hinton, 2022] replaces backpropagation with strictly layer-local goodness updates. Recent FF-CNN work has narrowed the gap to BP on 32x32 benchmarks, raising the question of whether layer-local training is becoming a viable alternative at realistic scale. To probe this rigorously, we develop DTG-FF -- dynamic temperature goodness, decoupled normalization, and multi-layer fusion -- as an instrument that sets FF-family state of the art across nine real-data benchmarks (91.8% CIFAR-10 and the first FF baseline at ImageNet-100 224x224), and use it to audit how far layer-local training actually scales. (1) Real-data scaling. Under identical recipe and backbone, an architecture-matched BP-DeepSup baseline beats DTG-FF by 2.40/5.93 pp on CIFAR-10/CIFAR-100, and the gap widens with class count. At 224x224 the same instrument reaches only 49.4% -- the first FF baseline at this scale, versus typical BP above 75% [Tian et al., 2020] -- exposing a real-data ceiling invisible at 32x32. (2) Synthetic vs. real K-conflict. DTG-FF increasingly outperforms BP as class count K grows on synthetic teacher-student tasks, yet on real images the FF-BP gap reverses sign and widens with K. A within-dataset CIFAR-100 coarse vs. fine probe isolates label-hierarchy from image distribution: synthetic K-sweeps confound output dimensionality with fine-grained discrimination difficulty and thereby overstate FF transferability. (3) Systems audit. FF can be implemented without storing depth-wide activations, but on commodity 8 GB hardware standard BP+gradient-accumulation reaches 4.18 GB / 157 imgs/s versus DTG-FF's 7.90 GB / 138 imgs/s, so a memory-based justification for FF at this scale is not supported under fair baselines.

Cached at: 06/08/26, 09:16 AM

# Real-Data Limits of Layer-Local Training
Source: [https://arxiv.org/html/2606.06539](https://arxiv.org/html/2606.06539)
## Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training

###### Abstract

Forward-Forward (FF) learning [Hinton, [2022](https://arxiv.org/html/2606.06539#bib.bib1)] replaces backpropagation with strictly layer-local goodness updates. Recent FF-CNN work has narrowed the gap to BP on 32×32 benchmarks, raising the question of whether layer-local training is becoming a viable alternative at realistic scale. To probe this rigorously, we develop DTG-FF (dynamic temperature goodness, decoupled normalization, and multi-layer fusion) as an instrument that sets FF-family state of the art across nine real-data benchmarks (91.8% CIFAR-10 and the first FF baseline at ImageNet-100 224×224), and use it to audit how far layer-local training actually scales. (1) Real-data scaling. Under identical recipe and backbone, an architecture-matched BP-DeepSup baseline beats DTG-FF by 2.40/5.93 pp on CIFAR-10/CIFAR-100, and the gap widens with class count. At 224×224 the same instrument reaches only 49.4%, the first FF baseline at this scale, versus typical BP above 75% [Tian et al., [2020](https://arxiv.org/html/2606.06539#bib.bib2)], exposing a real-data ceiling invisible at 32×32. (2) Synthetic vs. real $K$-conflict. DTG-FF increasingly outperforms BP as class count $K$ grows on synthetic teacher-student tasks, yet on real images the FF–BP gap reverses sign and widens with $K$. A within-dataset CIFAR-100 coarse vs. fine probe isolates label-hierarchy from image distribution: synthetic $K$-sweeps confound output dimensionality with fine-grained discrimination difficulty and thereby overstate FF transferability. (3) Systems audit. FF can be implemented without storing depth-wide activations, but on commodity 8 GB hardware standard BP+gradient-accumulation reaches 4.18 GB / 157 imgs/s versus DTG-FF’s 7.90 GB / 138 imgs/s, so a memory-based justification for FF at this scale is not supported under fair baselines.

## 1 Introduction

Backpropagation’s global backward pass couples all layers and stores activations across depth, motivating strictly layer-local training rules as potential accuracy–memory tradeoffs rather than direct BP replacements. The Forward-Forward (FF) algorithm [Hinton, [2022](https://arxiv.org/html/2606.06539#bib.bib1)] is one such rule: it replaces the backward pass with layer-local goodness-based learning, training each layer to produce high “goodness” (squared activation norm) for positive data and low goodness for negative data, with no cross-layer gradient flow. FF is sometimes also motivated by biological considerations [Crick, [1989](https://arxiv.org/html/2606.06539#bib.bib3), Lillicrap et al., [2020](https://arxiv.org/html/2606.06539#bib.bib4)], but our interest here is empirical: is layer-local training scalable enough to be a useful alternative on real-data workloads, and what systems property does it actually deliver?

FF has not yet demonstrated this competitiveness. On CIFAR-10 the original FF algorithm reaches roughly 60% with MLPs, a 30-point deficit relative to BP. Subsequent work has narrowed the gap through architectural innovations, namely convolutional extensions [Tosato et al., [2023](https://arxiv.org/html/2606.06539#bib.bib5), Lee et al., [2024](https://arxiv.org/html/2606.06539#bib.bib6)], deeper architectures [Sezener et al., [2025](https://arxiv.org/html/2606.06539#bib.bib7)], and adaptive goodness evaluation [Zhao et al., [2024](https://arxiv.org/html/2606.06539#bib.bib8)], reaching 90.62% (ASGE VGG11) but still trailing BP baselines, and almost exclusively on 32×32 inputs.

We test this with a strong FF\-family instrument and make four contributions:

1. An instrument for stress-testing layer-local training. We develop DTG-FF as a compact FF-family architecture combining three mechanism-level improvements (dynamic temperature goodness, decoupled three-path normalization, multi-layer fusion). It sets FF-family state of the art on nine real-world benchmarks (Sec. [5.2](https://arxiv.org/html/2606.06539#S5.SS2)), including CIFAR-10 (91.79% logit-sum / 91.33% concat), CIFAR-100 (67.28%), Tiny ImageNet (48.17%), and the first FF-family ImageNet-100 baseline at 224×224 (49.4% on VGG11). We treat this not as the paper’s headline claim but as the credibility floor for the audit that follows: only a strong FF instrument can tell us whether the residual FF–BP gap reflects fundamental limits of layer-local training or merely the weakness of prior FF baselines.

2. Real-data scaling diagnosis with architecture-matched BP controls. Despite FF-family SOTA, DTG-FF (concat) trails an architecture-matched BP-DeepSup baseline under identical recipe and backbone by 2.40 pp on CIFAR-10 ($K=10$) and 5.93 pp on CIFAR-100 ($K=100$); the FF–BP gap widens with class count. A deeper-and-narrower VGG11 backbone underperforms VGG8 on 32×32 inputs by 6.97/12.49 pp on CIFAR-10/100 (App. [D.5](https://arxiv.org/html/2606.06539#A4.SS5)), but VGG11 differs from VGG8 in both depth (8 vs. 7 conv layers) and early-channel width (64 vs. 128 starting channels), so we cannot strictly attribute the drop to depth alone. At 224×224 the same instrument reaches only 49.4% versus typical BP above 75% [Tian et al., [2020](https://arxiv.org/html/2606.06539#bib.bib2)], exposing a real-data ceiling invisible at 32×32.

3. Synthetic–real $K$-axis conflict. In paired-seed teacher–student synthetic tasks DTG-FF’s advantage over BP *grows* with $K$; on real images the FF–BP gap reverses sign and *widens* with $K$. A within-dataset CIFAR-100 coarse-20 vs. fine-100 probe (Sec. [5.3](https://arxiv.org/html/2606.06539#S5.SS3)) shows that $K$ in synthetic sweeps tracks output dimensionality whereas $K$ on real data also tracks fine-grained discrimination difficulty; current synthetic FF validation overstates real-data transferability.

4. A fair-baseline systems audit. A natural defense of FF is its $\mathcal{O}(1)$-in-depth activation-memory property. Pipelined per-layer training realizes this bound, but on commodity 8 GB hardware standard BP+gradient-accumulation reaches 4.18 GB / 157 imgs/s versus DTG-FF’s 7.90 GB / 138 imgs/s, so a memory-based justification for FF at this scale is not supported (Sec. [6.2](https://arxiv.org/html/2606.06539#S6.SS2), App. [D.7](https://arxiv.org/html/2606.06539#A4.SS7)). We additionally offer an interpretive synthesis (Sec. [6.1](https://arxiv.org/html/2606.06539#S6.SS1)) reading several FF-family improvements (label overlay, BP-trained classifier heads, spatial goodness, multi-layer fusion) as partial substitutes for the supervised cross-layer signal that BP provides natively.

![Refer to caption](https://arxiv.org/html/2606.06539v1/x1.png)

Figure 1: DTG-FF method overview. The method combines three mechanisms: layer-local FF losses with detached propagation, a learnable per-layer temperature $T_l$ that scales spatial goodness before the fixed random readout, and a detached multi-layer classifier that fuses GAP features through BN+Linear without updating the convolutional backbone.

Section [2](https://arxiv.org/html/2606.06539#S2) surveys related work. Section [3](https://arxiv.org/html/2606.06539#S3) presents the per-layer signal diagnostic and the synthetic validation with architecture-matched BP controls. Section [4](https://arxiv.org/html/2606.06539#S4) describes DTG-FF. Section [5](https://arxiv.org/html/2606.06539#S5) reports real-data scaling and the within-dataset $K$-disambiguation. Section [6](https://arxiv.org/html/2606.06539#S6) reports the fair-baseline systems audit and offers an interpretive reading of FF-family improvements as partial substitutes for BP’s cross-layer supervised signal.

## 2 Related Work

Forward-Forward and variants. FF was introduced by Hinton [[2022](https://arxiv.org/html/2606.06539#bib.bib1)] with each layer trained to produce high goodness on positive and low on negative data. Subsequent work has reduced the FF–BP accuracy gap: LSFF [Tosato et al., [2023](https://arxiv.org/html/2606.06539#bib.bib5)] extended FF to CNNs (81.12% CIFAR-10); SCFF [Lee et al., [2024](https://arxiv.org/html/2606.06539#bib.bib6)] introduced self-recurrence (80.75%); DeeperForward [Sezener et al., [2025](https://arxiv.org/html/2606.06539#bib.bib7)] independently observed that batch normalization disrupts goodness-based learning and proposed removing it (88.72%); ASGE [Zhao et al., [2024](https://arxiv.org/html/2606.06539#bib.bib8)] used per-layer classifiers with logit summation (90.62% on VGG11, the prior FF-family best). Ororbia and Mali [[2023](https://arxiv.org/html/2606.06539#bib.bib9)] combined FF with predictive coding. Outside FF, SoftHebb [Journé et al., [2023](https://arxiv.org/html/2606.06539#bib.bib10)] achieves 80.3% via soft Winner-Take-All Hebbian learning.

Other biologically motivated alternatives to BP. Feedback Alignment [Lillicrap et al., [2016](https://arxiv.org/html/2606.06539#bib.bib11), Nøkland, [2016](https://arxiv.org/html/2606.06539#bib.bib12), Launay et al., [2020](https://arxiv.org/html/2606.06539#bib.bib13)], Target Propagation [Bengio, [2014](https://arxiv.org/html/2606.06539#bib.bib14), Lee et al., [2015a](https://arxiv.org/html/2606.06539#bib.bib15), Meulemans et al., [2024](https://arxiv.org/html/2606.06539#bib.bib16)], Equilibrium Propagation [Scellier and Bengio, [2017](https://arxiv.org/html/2606.06539#bib.bib17), Laborieux and Zenke, [2024](https://arxiv.org/html/2606.06539#bib.bib18), Scellier, [2023](https://arxiv.org/html/2606.06539#bib.bib19)], and Predictive Coding [Rao and Ballard, [1999](https://arxiv.org/html/2606.06539#bib.bib20), Whittington and Bogacz, [2017](https://arxiv.org/html/2606.06539#bib.bib21), Millidge et al., [2022](https://arxiv.org/html/2606.06539#bib.bib22), Salvatori et al., [2023](https://arxiv.org/html/2606.06539#bib.bib23)] all retain some form of structured backward signal, in contrast with FF’s complete elimination of backward gradient flow. Perturbation-based methods [Dellaferrera and Kreiman, [2022](https://arxiv.org/html/2606.06539#bib.bib24), Ren et al., [2023](https://arxiv.org/html/2606.06539#bib.bib25)] use input modulation instead of gradients. Auxiliary-classifier heads with local losses have a long history [Szegedy et al., [2015](https://arxiv.org/html/2606.06539#bib.bib26), Lee et al., [2015b](https://arxiv.org/html/2606.06539#bib.bib27), Belilovsky et al., [2019](https://arxiv.org/html/2606.06539#bib.bib28), [2020](https://arxiv.org/html/2606.06539#bib.bib29), Nøkland and Eidnes, [2019](https://arxiv.org/html/2606.06539#bib.bib30)]; our multi-layer classifier builds on this lineage.

Information theory, normalization, and temperature scaling. Information-bottleneck perspectives on deep learning [Tishby et al., [2000](https://arxiv.org/html/2606.06539#bib.bib31), Tishby and Zaslavsky, [2015](https://arxiv.org/html/2606.06539#bib.bib32), Shwartz-Ziv and Tishby, [2017](https://arxiv.org/html/2606.06539#bib.bib33)] have received critiques on estimator dependence [Saxe et al., [2018](https://arxiv.org/html/2606.06539#bib.bib34), Belghazi et al., [2018](https://arxiv.org/html/2606.06539#bib.bib35), Poole et al., [2019](https://arxiv.org/html/2606.06539#bib.bib36), McAllester and Stratos, [2020](https://arxiv.org/html/2606.06539#bib.bib37)]; we sidestep these by using the KSG estimator [Kraskov et al., [2004](https://arxiv.org/html/2606.06539#bib.bib38)] for scalar MI and linear-probe lower bounds for vectors. Normalization’s effect on optimization has been extensively studied [Ioffe and Szegedy, [2015](https://arxiv.org/html/2606.06539#bib.bib39), Ba et al., [2016](https://arxiv.org/html/2606.06539#bib.bib40), Santurkar et al., [2018](https://arxiv.org/html/2606.06539#bib.bib41), Yang et al., [2019](https://arxiv.org/html/2606.06539#bib.bib42)]; our contribution is three-path decoupling rather than a new normalization. Temperature scaling appears in knowledge distillation [Hinton et al., [2015](https://arxiv.org/html/2606.06539#bib.bib43)], calibration [Guo et al., [2017](https://arxiv.org/html/2606.06539#bib.bib44)], and curriculum-based distillation [Li et al., [2023](https://arxiv.org/html/2606.06539#bib.bib45), Zhou et al., [2023](https://arxiv.org/html/2606.06539#bib.bib46)], all operating on softmax outputs; DTG instead modulates the layer-local learning signal. Extended discussion is in Appendix [B](https://arxiv.org/html/2606.06539#A2).

## 3 Diagnostic and Synthetic Validation

### 3.1 Per-Layer Signal Diagnostic

A short empirical diagnostic on a trained DTG-FF VGG8 (CIFAR-10, 91.33%) motivates the three DTG-FF components and grounds the BP-shadow lens of Sec. [6.1](https://arxiv.org/html/2606.06539#S6.SS1). We measure scalar goodness $I(g_l^{\mathrm{scalar}};Y)\approx 0.24$ bits per layer (KSG [Kraskov et al., [2004](https://arxiv.org/html/2606.06539#bib.bib38)], mean across layers, range 0.16–0.31; 0.22–0.38 bits across 50–500-bin histograms), spatial goodness vectors at 1.1–2.5 bits per layer via linear probe with Fano’s inequality, and GAP features at 2.52 bits at layer 6. Estimator details, bin-sensitivity, and the per-layer figure are in App. [C](https://arxiv.org/html/2606.06539#A3). FF’s inter-layer gradient path is also *zero by construction* via `detach`, which is a gradient-flow property, not an information-theoretic bound. These observations motivate three design pathways: *signal quality* (scalar → spatial goodness, providing higher probe-accessible task signal per layer), *signal utilization* (dynamic temperature; ablation cost −0.72 to −1.34 pp at $T=1$), and *cross-layer coordination* (multi-layer fusion; pairwise per-layer prediction disagreement 25.1% on CIFAR-10 test, indicating non-redundant per-layer hypotheses to aggregate). We test these design choices in a controlled synthetic setup before committing to real-data scaling.
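The pairwise per-layer prediction disagreement used in this diagnostic can be sketched as follows. This is a minimal illustration, assuming disagreement is the fraction of test samples on which two layers' predicted classes differ, averaged over all unordered layer pairs; the function name `pairwise_disagreement` and the toy predictions are hypothetical and are not taken from the paper's code or data.

```python
from itertools import combinations


def pairwise_disagreement(preds):
    """preds[l][i] is the class predicted by layer l for sample i.

    Returns the symmetric disagreement matrix and its off-diagonal mean.
    """
    n_layers = len(preds)
    n_samples = len(preds[0])
    mat = [[0.0] * n_layers for _ in range(n_layers)]
    pairs = list(combinations(range(n_layers), 2))
    for a, b in pairs:
        differing = sum(pa != pb for pa, pb in zip(preds[a], preds[b]))
        mat[a][b] = mat[b][a] = differing / n_samples
    mean_off = sum(mat[a][b] for a, b in pairs) / len(pairs)
    return mat, mean_off


# Toy predictions for 3 hypothetical layers on 4 samples (not paper data).
toy = [
    [0, 1, 2, 3],
    [0, 1, 2, 0],
    [0, 1, 1, 0],
]
matrix, mean_off = pairwise_disagreement(toy)
```

A high off-diagonal mean suggests the per-layer readouts make different errors, which is the condition under which aggregating them can help.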

### 3.2 Synthetic Validation with Architecture-Matched BP Controls

Setup. A 3-layer ReLU teacher ($d_{\mathrm{in}}=50$, $d_{\mathrm{hidden}}=128$) labels 20,000 train / 5,000 test samples. All students have 4 hidden layers ($d_{\mathrm{hidden}}=128$) and are trained for 8,000 Adam steps with batch size 256 and $\mathrm{lr}=10^{-3}$, over 5 seeds and $K\in\{5,10,15,20,30,50\}$. We compare DTG-FF against single BP, BP-DeepSup (backbone- and depth-matched, differs from DTG-FF only in detach), and BP-Ensemble (4× params, softmax-averaged). Teacher variance dominates across seeds (std 4–17%), so we report paired differences. Full table, baseline definitions, param counts, and FLOPs are in App. [E](https://arxiv.org/html/2606.06539#A5).

Results. The paired DTG-FF − BP-DeepSup difference grows from −0.23 pp at $K=5$ to +2.00 pp at $K=50$; a pre-specified low-$K$ ($\{5,10,15,20\}$) vs. high-$K$ ($\{30,50\}$) contrast yields +1.37 pp in favor of high $K$ (bootstrap $n=10{,}000$, 95% CI $[+0.59,+2.15]$). DTG-FF reliably exceeds single BP at $K\geq 15$ (5/5 seeds, +1.2 to +3.4 pp) but loses to the 4×-parameter BP-Ensemble at every $K$, as expected for a pure capacity comparison. The narrow finding: *the FF–BP gap in this synthetic regime is not solely explained by the absence of end-to-end gradients*. The synthetic advantage does not transfer to real images: on CIFAR-10/100, architecture-matched BP-DeepSup outperforms DTG-FF by 2.40/5.93 pp (Sec. [5.2](https://arxiv.org/html/2606.06539#S5.SS2)).
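One way to compute a low-$K$ vs. high-$K$ contrast with a bootstrap interval is sketched below. The text does not specify the resampling unit or the interval type, so this sketch assumes a percentile bootstrap over seeds; the per-seed numbers are placeholders invented for illustration and are not the paper's results.

```python
import random
from statistics import mean

LOW_K, HIGH_K = (5, 10, 15, 20), (30, 50)


def seed_contrast(diff_by_k):
    """High-K mean minus low-K mean of one seed's paired (FF - BP) differences."""
    high = mean(diff_by_k[k] for k in HIGH_K)
    low = mean(diff_by_k[k] for k in LOW_K)
    return high - low


def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, resampling `values` with replacement."""
    rng = random.Random(seed)
    boots = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical paired differences in pp, one dict per seed (placeholders).
seeds = [
    {5: -0.4, 10: 0.7, 15: 0.9, 20: 1.2, 30: 1.9, 50: 2.3},
    {5: -0.1, 10: 0.9, 15: 0.6, 20: 0.8, 30: 1.5, 50: 1.8},
    {5: -0.3, 10: 1.0, 15: 1.1, 20: 1.0, 30: 2.4, 50: 2.6},
    {5: 0.0, 10: 0.6, 15: 0.8, 20: 1.3, 30: 1.6, 50: 1.4},
    {5: -0.5, 10: 0.8, 15: 1.0, 20: 0.9, 30: 2.0, 50: 2.1},
]
contrasts = [seed_contrast(s) for s in seeds]
point = mean(contrasts)
lo, hi = bootstrap_ci(contrasts)
```

Working on paired differences rather than raw accuracies removes the shared teacher variance from the estimate, which is the reason the text gives for reporting them.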

## 4 Method: DTG-FF

### 4.1 Temperature-Scaled Local Goodness (DTG)

Let layer $l$ compute activation $\mathbf{z}_l(x)=\mathrm{ReLU}(W_l\hat{\mathbf{x}}_{l-1})$, where $\hat{\mathbf{x}}_{l-1}$ is the (detached) input from layer $l-1$. We define a nonnegative goodness representation

$$\mathbf{u}_l(x)=\phi_l(\mathbf{z}_l(x))\in\mathbb{R}^{m_l}_{\geq 0},\tag{1}$$

where $\phi_l$ depends on architecture: $\phi_l^{\mathrm{MLP}}(\mathbf{z})=\tfrac{1}{d_l}\|\mathbf{z}\|_2^2\in\mathbb{R}_{\geq 0}$ (scalar; $m_l=1$); $\phi_l^{\mathrm{CNN}}(\mathbf{z})=\mathrm{flatten}(\mathrm{AdaptiveAvgPool2d}(\mathbf{z}^2,(h_l,w_l)))\in\mathbb{R}^{m_l}_{\geq 0}$ (vector; here and in subsequent goodness expressions $\mathbf{z}^2$ denotes the element-wise square). DTG-FF introduces a layer-local learnable temperature $T_l>0$ and uses the temperature-scaled goodness

$$\tilde{\mathbf{u}}_l(x)=\mathbf{u}_l(x)/T_l,\qquad T_l=T_{\min}+(T_{\max}-T_{\min})\,\sigma(\alpha_l),\tag{2}$$

where $\alpha_l\in\mathbb{R}$ is a per-layer learnable scalar (one parameter per layer) and $\sigma$ is the logistic sigmoid. Each layer therefore learns its own temperature jointly with its weights, via gradient descent on the layer-local loss.
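The temperature parameterization in Eq. (2) can be illustrated numerically. The sketch below is plain Python rather than the authors' implementation; the default range $[0.5, 2.0]$ is the CNN range quoted later in the text, and the function names are hypothetical.

```python
import math


def sigmoid(a):
    """Logistic sigmoid; adequate for moderate |a|."""
    return 1.0 / (1.0 + math.exp(-a))


def temperature(alpha, t_min=0.5, t_max=2.0):
    """Eq. (2): T = T_min + (T_max - T_min) * sigmoid(alpha)."""
    return t_min + (t_max - t_min) * sigmoid(alpha)


def scaled_goodness(u, alpha, t_min=0.5, t_max=2.0):
    """Divide every goodness component by the layer temperature."""
    t = temperature(alpha, t_min, t_max)
    return [x / t for x in u]


t_mid = temperature(0.0)                  # midpoint of the allowed range
u_tilde = scaled_goodness([2.5, 5.0], 0.0)
```

Because the sigmoid is bounded, $T_l$ stays strictly inside $(T_{\min}, T_{\max})$ for finite $\alpha_l$ and only approaches the endpoints as $|\alpha_l|$ grows, so a layer reported at $T_{\min}$ or $T_{\max}$ corresponds to a saturated $\alpha_l$.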

Locality. $T_l$ is a learnable parameter of layer $l$ and does not depend on other layers or batch statistics. DTG-FF’s FF (goodness) loss therefore satisfies strict inter-layer gradient locality (enforced by `detach`): updates to $W_l$ require no gradient propagated through layers $j\neq l$. The MLP classifier head is a separate hybrid that backpropagates through shared layer weights (App. [A.3](https://arxiv.org/html/2606.06539#A1.SS3)); the CNN classifier is fully detached from the conv backbone.

Learned temperatures. Figure [2](https://arxiv.org/html/2606.06539#S4.F2) (left panel) shows that the learned $T_l$ in a trained DTG-FF VGG8 spans the full allowed range: layer 0 learns $T_l=T_{\min}=0.5$, middle layers occupy intermediate values (1.4–1.9), and deep layers saturate at $T_{\max}=2.0$. The temperature range is used meaningfully rather than collapsing to a single value. At test time, the goodness loss is not computed; temperature still scales logits for the per-layer random-projection readout at inference (App. [A.4](https://arxiv.org/html/2606.06539#A1.SS4)).

![Refer to caption](https://arxiv.org/html/2606.06539v1/x2.png)

Figure 2: DTG mechanism diagnostics on the trained DTG-FF VGG8 *concat-classifier* checkpoint (CIFAR-10, 91.33%). Left: learned temperature $T_l$ per layer (bars, left axis) and per-layer random-projection classifier accuracy (line, right axis). Layer 0 learns $T=T_{\min}=0.5$; middle layers occupy intermediate values (1.4–1.9); deep layers saturate at $T_{\max}=2.0$. Right: pairwise per-layer prediction disagreement on CIFAR-10 test. Layer 0 disagrees with deeper layers on 54–56% of samples, whereas late layers are nearly redundant (layers 4–6 disagree by only 3–6%); the off-diagonal mean disagreement is 25.1%.

Local loss. DTG-FF instantiates a shared temperature-scaled form $\mathcal{L}_l=\ell_l(\psi_l(\tilde{\mathbf{u}}_l),y)$ with two architecture-dependent readouts:

*MLP: scalar margin loss.* For positive/negative supervision $s\in\{+1,-1\}$ [Hinton, [2022](https://arxiv.org/html/2606.06539#bib.bib1)],

$$\psi_l^{\mathrm{MLP}}(\tilde{u}_l)=\tilde{u}_l-\theta_l,\quad\ell_l(\psi_l,s)=\log(1+\exp(-s\,\psi_l)),\tag{3}$$

where $\theta_l=\mathrm{softplus}(\theta_{0,l})$ is a learnable scalar margin per layer, initialized so that $\theta_l$ matches the average goodness measured on the first batch (App. [A.2](https://arxiv.org/html/2606.06539#A1.SS2)).

*CNN: $K$-way local cross-entropy.*

$$\psi_l^{\mathrm{CNN}}(\tilde{\mathbf{u}}_l)=\tilde{\mathbf{u}}_l^{\top}R_l+\mathbf{b}_l,\quad\ell_l(\boldsymbol{\psi},y)=\mathrm{CE}(\boldsymbol{\psi},y),\tag{4}$$

with fixed (non-learned) random projection $R_l\in\mathbb{R}^{m_l\times K}$, $R_{l,ij}\sim\mathcal{N}(0,1/K)$, and $\mathbf{b}_l\sim\mathcal{N}(\mathbf{0},(1/K)I)$ also fixed. No margin term is needed.
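The two readouts in Eqs. (3) and (4) can be written out for a single sample as a sanity check. This is a pure-Python sketch under stated assumptions, not the paper's code: it takes the already temperature-scaled goodness as input, omits gradients and the first-batch margin initialization, and all function names are hypothetical.

```python
import math
import random


def softplus(x):
    return math.log1p(math.exp(x))


def mlp_margin_loss(u_tilde, theta0, s):
    """Eq. (3): log(1 + exp(-s * (u_tilde - softplus(theta0)))), s in {+1, -1}."""
    psi = u_tilde - softplus(theta0)
    return math.log1p(math.exp(-s * psi))


def make_readout(m, k, seed=0):
    """Fixed random R (m x k) and bias b (k), entries drawn from N(0, 1/k)."""
    rng = random.Random(seed)
    std = math.sqrt(1.0 / k)
    R = [[rng.gauss(0.0, std) for _ in range(k)] for _ in range(m)]
    b = [rng.gauss(0.0, std) for _ in range(k)]
    return R, b


def cnn_local_ce(u_tilde, R, b, y):
    """Eq. (4): cross-entropy of logits u_tilde^T R + b against label y."""
    m, k = len(u_tilde), len(b)
    logits = [sum(u_tilde[i] * R[i][j] for i in range(m)) + b[j] for j in range(k)]
    mx = max(logits)  # subtract the max for a stable log-sum-exp
    log_z = mx + math.log(sum(math.exp(v - mx) for v in logits))
    return log_z - logits[y]


R, b = make_readout(m=4, k=10)
loss = cnn_local_ce([0.3, 1.2, 0.0, 0.7], R, b, y=3)
```

Since `R` and `b` are never updated, the only way a layer can lower this loss is by reshaping its own goodness vector, which keeps the objective local to that layer.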

The choice of loss reflects convention: Hinton [[2022](https://arxiv.org/html/2606.06539#bib.bib1)] established scalar goodness + margin for MLPs; the FF-CNN literature [Tosato et al., [2023](https://arxiv.org/html/2606.06539#bib.bib5), Zhao et al., [2024](https://arxiv.org/html/2606.06539#bib.bib8)] uses vector goodness + cross-entropy. In the scalar-goodness case, a $K$-way linear readout $a_k=r_{lk}g_l/T_l+b_{lk}$ is rank-1 (all classes depend on the same scalar), and the margin objective circumvents this degeneracy; we do not claim this forces the design choice, but it explains why the two instantiations use different loss families.

Gradient scale. For the CNN loss, $\nabla_{\mathbf{u}}\mathcal{L}=T^{-1}R_l(\mathbf{p}-\mathbf{e}_y)$, so $\|\nabla_{\mathbf{u}}\mathcal{L}\|_2\leq(\sqrt{2}/T)\|R_l\|_{\mathrm{op}}$. With $T\in[0.5,2.0]$ for CNN, DTG modulates gradient magnitude by at most $T_{\max}/T_{\min}=4$; for MLP with $T\in[0.1,2.0]$ the range is 20. This modulation does not change task-relevant mutual information $I(\mathbf{u};Y)$: temperature adapts the optimization signal, not the representation content. Appendix [A.5](https://arxiv.org/html/2606.06539#A1.SS5) discusses the random-projection scale confound and reports an ablation with column-normalized $R_l$.
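The $\sqrt{2}$ factor in this bound can be recovered in a few lines. The derivation below is our reconstruction of the intermediate steps, assuming $\mathbf{p}$ is the softmax output over the $K$ logits and $\mathbf{e}_y$ is the one-hot label vector:

$$
\begin{aligned}
\|\mathbf{p}-\mathbf{e}_y\|_2^2
&=(1-p_y)^2+\sum_{k\neq y}p_k^2
\;\leq\;(1-p_y)^2+\Big(\sum_{k\neq y}p_k\Big)^2
=2(1-p_y)^2\;\leq\;2,\\
\|\nabla_{\mathbf{u}}\mathcal{L}\|_2
&=T^{-1}\,\|R_l(\mathbf{p}-\mathbf{e}_y)\|_2
\;\leq\;T^{-1}\,\|R_l\|_{\mathrm{op}}\,\|\mathbf{p}-\mathbf{e}_y\|_2
\;\leq\;\frac{\sqrt{2}}{T}\,\|R_l\|_{\mathrm{op}}.
\end{aligned}
$$

The first inequality uses $p_k\geq 0$, and the following equality uses $\sum_{k\neq y}p_k=1-p_y$. Evaluating the bound at the two ends of the temperature range gives the quoted ratios: $2.0/0.5=4$ for the CNN range and $2.0/0.1=20$ for the MLP range.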

### 4.2 DTG-CNN: Decoupled Normalization and Multi-Layer Fusion

Architecture. We use a VGG8 backbone (7 convolutional layers: $3\to128\to256\to256\to512\to512\to512\to512$, with 2×2 average pooling after layers 1, 3, 4, 5). Each convolutional layer has three functional paths:

*FF (goodness) path.* No normalization. The raw post-ReLU activation $\mathbf{z}_l$ feeds directly into spatial goodness $\mathbf{u}_l$ and the local loss $\mathcal{L}_l$. Batch normalization here collapses class-conditional variance in squared activation norms, reducing the signal the goodness objective relies on, consistent with the independent observation by Sezener et al. [[2025](https://arxiv.org/html/2606.06539#bib.bib7)] that BN disrupts goodness-based learning.

*Inter-layer propagation path.* Channel-wise LayerNorm applied at each spatial location (i.e., for $\mathbf{x}\in\mathbb{R}^{B\times C\times H\times W}$, normalize along $C$ per $(b,h,w)$), followed by dropout and `detach`. The LayerNorm module includes affine parameters $(\gamma,\beta)$ but its application sits inside a `torch.no_grad` region in our pipelined per-layer training step (the FF goodness loss is computed on the pre-norm activation $\mathbf{z}_l$, and the propagated input $\mathbf{z}_{l+1}^{\text{in}}$ is derived only after the autograd graph has been freed); these affine parameters therefore receive no gradient in either path and remain at their initialization $(\gamma=1,\beta=0)$, so the path is operationally equivalent to non-affine LayerNorm. We retain the module form for code-level interchangeability with affine variants. This path provides the next convolutional filter with inputs of controlled scale, without affecting the FF training signal.

*Classifier path.* Global Average Pooling per layer produces $\mathbf{f}_l\in\mathbb{R}^{C_l}$. Features from layers 1 through $L-1$ (i.e., excluding the first conv layer; using 0-based indexing this is layers $1,2,\dots,L-1$) are concatenated, normalized by BatchNorm1d, passed through dropout, and fed to a learnable linear classifier trained with cross-entropy (label smoothing 0.1). For VGG8 ($L=7$) the concat dimension is $256+256+512+512+512+512=2560$. An alternative logit-sum aggregation (per-layer random-projection logits summed across *all* $L$ layers, no trainable classifier head) is also evaluated and yields the best CIFAR-10 accuracy; Sec. [5.4](https://arxiv.org/html/2606.06539#S5.SS4) compares both.
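Two details of the paths above are easy to check by hand: the per-location channel normalization of the propagation path, and the concat width of the classifier path. The sketch below uses nested Python lists instead of tensors so that it stays dependency-free. It assumes the usual LayerNorm convention (biased variance, small `eps`, no affine terms, matching the text's remark that $\gamma=1,\beta=0$ throughout); the function names and the `eps` value are assumptions.

```python
import math


def channel_layernorm(x, eps=1e-5):
    """Normalize x[b][c][h][w] along the channel axis at every (b, h, w)."""
    B, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    out = [[[[0.0] * W for _ in range(H)] for _ in range(C)] for _ in range(B)]
    for b in range(B):
        for h in range(H):
            for w in range(W):
                vals = [x[b][c][h][w] for c in range(C)]
                mu = sum(vals) / C
                var = sum((v - mu) ** 2 for v in vals) / C  # biased variance
                inv = 1.0 / math.sqrt(var + eps)
                for c in range(C):
                    out[b][c][h][w] = (vals[c] - mu) * inv
    return out


# Output widths of the 7 VGG8 conv layers, as listed in the text.
VGG8_CHANNELS = [128, 256, 256, 512, 512, 512, 512]


def concat_dim(channels):
    """GAP yields one C_l-vector per layer; layer 0 is left out of the concat."""
    return sum(channels[1:])


# One sample, two channels, a single spatial location.
normed = channel_layernorm([[[[1.0]], [[3.0]]]])
width = concat_dim(VGG8_CHANNELS)
```

Each spatial position is normalized independently of the batch and of every other position, so this path uses no batch statistics.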

Locality scope. Convolutional feature learning is fully inter-layer local: per-layer AdamW optimizers, per-layer cosine schedulers, `detach` at every layer boundary. The concatenated classifier head uses global backpropagation through BN and Linear, but not into the conv layers (features are extracted under `no_grad`). The logit-sum variant requires no learnable classifier parameters, keeping the full pipeline layer-local. We follow the auxiliary-classifier tradition [Belilovsky et al., [2019](https://arxiv.org/html/2606.06539#bib.bib28)] in treating a global classifier head as a practical, well-understood concession.

Training details. Per-layer AdamW (lr $2\times10^{-4}$, weight decay $10^{-3}$) with cosine annealing to $10^{-5}$. 400 epochs for 32×32 datasets; 200 for Tiny ImageNet and ImageNet-100. Augmentation: random crop (padding 4), horizontal flip, color jitter, and Cutout (16×16 for 32×32). Classifier dropout 0.2, inter-layer dropout 0.1. Full hyperparameters are in App. [D.1](https://arxiv.org/html/2606.06539#A4.SS1).
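For reference, the learning-rate trajectory implied by these settings can be sketched with the standard closed-form cosine annealing rule. This assumes the schedule is stepped once per epoch with no warmup or restarts, which the text does not state; `cosine_lr` is a hypothetical helper, not the authors' scheduler.

```python
import math


def cosine_lr(epoch, total_epochs=400, lr_max=2e-4, lr_min=1e-5):
    """Cosine annealing from lr_max at epoch 0 down to lr_min at the final epoch."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)


# Learning rate at the start, midpoint, and end of a 400-epoch run.
schedule = [cosine_lr(e) for e in (0, 200, 400)]
```

Under these assumptions the midpoint rate is the average of the two endpoints, and each layer's optimizer would follow the same curve independently.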

## 5 Experiments

### 5.1 Setup

We evaluate DTG-FF on nine real-world classification datasets: CIFAR-10/100, Fashion-MNIST, STL-10, Tiny ImageNet, ImageNet-100, and three MedMNIST datasets (PathMNIST, DermaMNIST, BloodMNIST). VGG8 is used for 32×32 inputs and for Tiny ImageNet at native 64×64 resolution; VGG11 is used only for ImageNet-100 (224×224). Full training details are in App. [D.1](https://arxiv.org/html/2606.06539#A4.SS1); secondary datasets (STL-10, Fashion-MNIST, MedMNIST) are summarized in App. [D.2](https://arxiv.org/html/2606.06539#A4.SS2).

### 5.2 FF-Family State of the Art

Table 1: FF-family comparison on CIFAR-10, CIFAR-100, Tiny ImageNet (200 classes), and ImageNet-100 (224×224). Entries as reported; ‘—’ = not reported. † Tiny ImageNet at native 64×64 resolution. Notably, on tractable texture-based medical classification DTG-FF reaches 96.61% on BloodMNIST, within the typical BP range; full secondary-benchmark results (Fashion-MNIST, STL-10, PathMNIST, DermaMNIST) in App. [D.2](https://arxiv.org/html/2606.06539#A4.SS2).

Table [1](https://arxiv.org/html/2606.06539#S5.T1) summarizes FF-family results. DTG-FF (concat) exceeds the prior best FF-family CIFAR-10 result by +0.71% with a smaller backbone (VGG8 vs. ASGE’s VGG11), on CIFAR-100 by +1.86%, and achieves 48.17% on Tiny ImageNet, well above the next-highest FF-family result we are aware of (≈35%). An alternative logit-sum aggregation variant (no trainable classifier head; per-layer random-projection logits summed at inference) reaches 91.79% on CIFAR-10; we report this as a CIFAR-10-specific ablation rather than a cross-dataset headline, since we evaluate only the concat variant on the remaining datasets.

Architecture-matched BP comparison. We report two BP reference points on the same VGG8 backbone and identical training recipe (AdamW, lr $2\times10^{-4}$, cosine to $10^{-5}$, 400 epochs, same augmentation): (i) BP-VGG8-DeepSup, end-to-end BP with per-layer auxiliary heads and logit-sum inference, matched to DTG-FF in backbone, depth, and per-layer head count but differing in detach boundary, local objective, and head parameterization (see Sec. [6.3](https://arxiv.org/html/2606.06539#S6.SS3)) (93.73% / 73.21% on CIFAR-10/100); (ii) BP-VGG8, a single-head classifier (94.80% on CIFAR-10). The FF–BP gap on the concat operating point is **2.40** pp on CIFAR-10 and **5.93** pp on CIFAR-100; under strict aggregation matching (DTG-FF logit-sum 91.79% vs. BP-DeepSup logit-sum), the CIFAR-10 gap is 1.94 pp (CIFAR-100 logit-sum not reported). The gap widens with task complexity in either framing.

Multi-seed stability. Headline numbers use seed 42; reproducing on seed 43 gives DTG-FF 91.18% / 66.95% and BP-DeepSup 93.63% / 72.80% on CIFAR-10/100, with paired per-seed gaps {2.40, 2.45} pp on CIFAR-10 and {5.93, 5.85} pp on CIFAR-100. Both methods reproduce within 0.4 pp std and the FF–BP gap is stable to 0.05 / 0.08 pp.
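The stability figures above can be rechecked with a few lines of arithmetic. The sketch below is illustrative only: it takes the four paired gaps quoted in the text and computes their mean, full range, and half-range. The helper name `summarize` is hypothetical and not part of the authors' code. The resulting means (about 2.43 and 5.89 pp) appear consistent, up to rounding, with the −2.42 pp and −5.89 pp real-data points in Figure 3.

```python
# Paired per-seed FF-BP gaps in percentage points, copied from the text.
gaps_pp = {
    "CIFAR-10": [2.40, 2.45],   # seeds 42 and 43
    "CIFAR-100": [5.93, 5.85],  # seeds 42 and 43
}


def summarize(gaps):
    """Return (mean, full range, half-range) of a list of paired gaps."""
    mean = sum(gaps) / len(gaps)
    full_range = max(gaps) - min(gaps)
    return mean, full_range, full_range / 2


summary = {name: summarize(gaps) for name, gaps in gaps_pp.items()}

for name, (mean, full_range, half_range) in summary.items():
    print(f"{name}: mean {mean:.3f} pp, range {full_range:.2f} pp, "
          f"half-range {half_range:.3f} pp")
```

The full ranges reproduce the 0.05 / 0.08 pp stability figures quoted in the paragraph.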

![Refer to caption](https://arxiv.org/html/2606.06539v1/x3.png)

Figure 3: Synthetic vs. real $K$-axis conflict. Paired DTG-FF − BP-DeepSup accuracy difference on a shared $K$-axis. Synthetic (green, 5 teacher–student seeds): DTG-FF advantage *grows* with $K$, reaching +2 pp at $K=50$. Real (orange, CIFAR-10/100, 2 seeds, half-range error bars): the gap is reversed and *widens* from −2.42 pp at $K=10$ to −5.89 pp at $K=100$. At the matched $K=10$ point the synthetic regime predicts a +0.84 pp DTG-FF advantage; the real-data outcome is −2.42 pp, a 3.3 pp sign-reversed disagreement on identical class count. This visualization motivates the methodology critique of Sec. [5.3](https://arxiv.org/html/2606.06539#S5.SS3): synthetic $K$-sweeps confound output dimensionality with fine-grained discrimination difficulty and overstate FF transferability.

ImageNet-100 ceiling. At 224×224 on ImageNet-100, DTG-FF on VGG11 reaches 49.4% in 200 epochs (12.6 hours on a single RTX 4090), to our knowledge the first FF-family report at this scale, trailing typical BP above 75% \[Tian et al., [2020](https://arxiv.org/html/2606.06539#bib.bib2)\] by a wide margin. The cited 75% uses ResNet-50, so the precise gap conflates algorithmic and architectural capacity; the conservative reading is that DTG-FF at 224×224 remains substantially below standard BP’s operating range, with the per-architecture penalty left to a future BP-VGG11 control.

MLP and depth-scaling supplementaries. DTG also improves FF-MLP on CIFAR-10 to 63.72%, exceeding Hinton-FF MLP baselines by ∼4 pp (App. [D.3](https://arxiv.org/html/2606.06539#A4.SS3)). A deeper VGG11 backbone on CIFAR-10/100 underperforms VGG8 by 6.97 / 12.49 pp respectively (App. [D.5](https://arxiv.org/html/2606.06539#A4.SS5)); VGG11 also differs in early-channel width (64 vs. 128) and downsampling, so this bundles depth and width effects. We retain VGG11 only for ImageNet-100, where higher per-sample signal partially compensates.

### 5.3 Within-Dataset $K$ Probe: CIFAR-100 Coarse vs Fine

The cross-dataset CIFAR-10/CIFAR-100 comparison confounds class count with image distribution. CIFAR-100 supplies both fine ($K=100$) and coarse ($K=20$, semantic superclass) labels over the *same* images. At matched recipe (VGG8, 50 epochs, seed 114514; App. [D.6](https://arxiv.org/html/2606.06539#A4.SS6)) the FF–BP gap shrinks from 11.35 pp (fine) to 7.99 pp (coarse), a 3.36 pp differential. Both runs are non-converged at 50 epochs, and the fine gap drifts 5.42 pp by 400 epochs (exceeding the 3.36 pp differential), so the probe is direction-suggestive rather than asymptotic; matched convergence is left to future work.
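The reasoning behind "direction-suggestive rather than asymptotic" can be made explicit with the reported numbers. The sketch below is only an illustration: the decision rule (treat the effect as resolved only if it exceeds the convergence drift) is an assumption introduced here, not a criterion stated by the authors.

```python
# Reported values in percentage points (50-epoch probe, VGG8).
fine_gap_50 = 11.35     # FF-BP gap with fine labels (K = 100)
coarse_gap_50 = 7.99    # FF-BP gap with coarse labels (K = 20)
fine_gap_drift = 5.42   # reported drift of the fine gap by 400 epochs

# Coarse-vs-fine differential measured on the same images.
differential = fine_gap_50 - coarse_gap_50

# Hypothetical rule: call the effect "resolved" only when it is larger than
# the amount the gap itself still moves before convergence.
resolved = differential > fine_gap_drift

print(f"differential = {differential:.2f} pp, drift = {fine_gap_drift:.2f} pp, "
      f"resolved = {resolved}")
```

Under this rule the differential is smaller than the drift, which matches the cautious reading in the text.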

### 5.4 Ablation on CIFAR-10

Table 2: CIFAR-10 ablation, headline rows (VGG8, 400 epochs, single seed). Cutout, RMSNorm, and regularization variants in App. [D.4](https://arxiv.org/html/2606.06539#A4.SS4).

| Configuration | Accuracy | Δ |
| --- | --- | --- |
| DTG-FF (logit-sum) | 91.79% | — |
| − DTG (fixed $T=1.0$) | 90.45% | −1.34 |
| DTG-FF (concat) | 91.33% | — |
| − DTG (fixed $T=1.0$) | 90.61% | −0.72 |
| − inter-layer LayerNorm | 90.53% | −0.80 |

Removing DTG (fixed $T=1$) is the dominant single-component effect, consistent across aggregation schemes (−0.72 pp concat, −1.34 pp logit-sum). Removing inter-layer LayerNorm contributes −0.80 pp; replacing it with RMSNorm contributes a further −0.18 pp. Cutout and explicit regularization (label smoothing + classifier dropout) have marginal effect on final accuracy under cosine annealing, though they substantially affect training-loss trajectories (App. [D.4](https://arxiv.org/html/2606.06539#A4.SS4)).

## 6 Discussion and Limitations

### 6.1 A BP-Shadow Lens: FF Improvements as Partial Substitutes

We propose a unifying lens for our four scaling observations: several established FF\-family improvements can be interpreted as partial substitutes for the supervised cross\-layer signal that BP provides natively, and the residual architecture\-matched gap reflects coordination and supervision that no current substitute fully recovers\. The architecture\-matched audit \(Sec\.[5\.2](https://arxiv.org/html/2606.06539#S5.SS2), App\.[D\.5](https://arxiv.org/html/2606.06539#A4.SS5)\) provides empirical grounding for this reading\.

Mappings are interpretive: each row pairs an FF\-family mechanism with a BP information channel it appears to substitute for\. The mappings are not causally validated by these measurements alone \(Sec\.[6\.3](https://arxiv.org/html/2606.06539#S6.SS3)\)\.

Under this lens, BP-DeepSup’s matched advantage of 2.40 / 5.93 pp, the larger real-data gap at higher $K$, the depth degradation, and the synthetic–real $K$ reversal are coherent consequences of incomplete substitutes for supervised cross-layer coordination. Two further FF mechanisms, BatchNorm in goodness-adjacent paths and the dynamic temperature in this work, are not obvious substitutes for any single BP channel; we treat them as adjacent rather than load-bearing for this lens. Causally isolating which substitute carries the dominant load requires controls that swap one channel at a time (Sec. [6.3](https://arxiv.org/html/2606.06539#S6.SS3)).

### 6.2 Auditing the Systems Advantage of Layer-Local Training

A natural defense of FF research is its structural systems property: pipelined layer-local training admits an $\mathcal{O}(1)$-in-depth activation-memory bound, whereas BP saves activations across all $L$ layers. We measure this directly to test whether it translates into practical dominance. Implementation matters: a naive forward-once-with-grad / forward-once-no-grad split duplicates conv computation and underperforms BP-VGG8 by 5% in memory and 30% in throughput; a pipelined schedule (single conv pass, autograd graph released by local backward before deriving the next-layer input) recovers the structural property. Within VRAM (VGG8 / 32×32, batch 256), pipelined DTG-FF reaches 1.535 GB / 1726 imgs/s versus BP-VGG8 at 1.589 GB / 1586 imgs/s (−3.4% memory, +8% throughput). The harder test is the memory-cliff regime, VGG11 / 224×224 on a commodity 8 GB GPU, where we compare four methods at effective batch 128 (full per-batch sweep and protocol in App. [D.7](https://arxiv.org/html/2606.06539#A4.SS7)):

∗ Vanilla BP at b=128 spills to host memory; remaining rows fit in VRAM. RTX 4060 Laptop, 8 GB. † Identical throughput at b=64 and b=32 reflects compute-bound saturation on this GPU: per-step latency scales linearly with micro-batch size while effective imgs/s stays constant.

Finding. On commodity 8 GB hardware, the structural $\mathcal{O}(L)\to\mathcal{O}(1)$ activation-memory property of pipelined FF is realized but does not translate into measured systems dominance over memory-optimized BP. BP+gradient-accumulation at micro-batch 64 matches DTG-FF’s effective batch with −47% peak memory and +14% throughput; BP+activation-checkpointing recovers in-VRAM operation at 1.5× slowdown; DTG-FF dominates only vanilla BP at its spill point, a regime practitioners avoid via standard tooling. The systems-feasibility argument for FF research at this scale is, on this hardware, not supported under fair baselines. Whether the structural property becomes load-bearing in deeper architectures or multi-device pipelines is an open question (App. [D.7](https://arxiv.org/html/2606.06539#A4.SS7)).
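The −47% and +14% figures can be reproduced from the peak-memory and throughput numbers quoted in the abstract (4.18 GB / 157 imgs/s for BP+gradient-accumulation, 7.90 GB / 138 imgs/s for DTG-FF). The sketch below is a plain arithmetic check; the `Run` dataclass and `relative_change` helper are hypothetical names introduced for illustration. It also shows the number of accumulation steps implied by micro-batch 64 at effective batch 128.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    peak_gb: float      # peak GPU memory in GB
    imgs_per_s: float   # training throughput


# Figures quoted in the abstract for the 8 GB, VGG11 / 224x224 regime.
dtg_ff = Run("DTG-FF (pipelined)", 7.90, 138.0)
bp_accum = Run("BP + gradient accumulation", 4.18, 157.0)


def relative_change(new, ref):
    """Percentage change of `new` relative to `ref`."""
    return 100.0 * (new - ref) / ref


mem_change = relative_change(bp_accum.peak_gb, dtg_ff.peak_gb)
thr_change = relative_change(bp_accum.imgs_per_s, dtg_ff.imgs_per_s)

# Gradient accumulation reaches the effective batch by summing gradients
# over several micro-batches before each optimizer step.
effective_batch = 128
micro_batch = 64
accum_steps = effective_batch // micro_batch

print(f"memory {mem_change:+.1f}%, throughput {thr_change:+.1f}%, "
      f"{accum_steps} accumulation steps")
```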

### 6.3 Limitations and Other Notes

Causal underidentification. Our scaling diagnosis compares DTG-FF against BP-DeepSup, which differs along three per-layer dimensions (detach boundary, goodness vs. CE objective, fixed random projection vs. learned head). The architecture-matched gap is real, but cleanly isolating which dimension dominates requires a sequence of single-dimension ablations (e.g., a Local-CE-Detach control to isolate detach, plus learned-head DTG and detached BP-DeepSup variants for the remaining two dimensions) left to future work. The BP-shadow lens (Sec. [6.1](https://arxiv.org/html/2606.06539#S6.SS1)) is therefore interpretive synthesis, not causal validation.

Ensemble scope. A BP ensemble with 4× parameters exceeds DTG-FF on synthetic tasks at every $K$. DTG-FF is a local-learning mechanism, not a parameter-efficiency argument; claims about FF-family advantages are regime-qualified.

Relationship to DeeperForward. Sezener et al. \[[2025](https://arxiv.org/html/2606.06539#bib.bib7)\] independently observe that BatchNorm disrupts goodness-based learning and remove normalization entirely. We decouple normalization across functional paths (Sec. [4.2](https://arxiv.org/html/2606.06539#S4.SS2)). Both lines identify BN’s conflict with goodness; the design difference is whether to remove or decouple by path.

Biological plausibility. DTG-FF uses BatchNorm in the classifier, dropout, and AdamW; learnable temperatures are engineering parameters. We describe it as *biologically motivated* (layer-local gradient flow) rather than strictly biologically plausible.

## 7 Conclusion

We developed DTG-FF as an FF-family instrument that sets state of the art across nine benchmarks (CIFAR-10/100, Tiny ImageNet, the first FF-family ImageNet-100 baseline at 224×224, and others), and used it as a stress test for strictly layer-local training. Even at FF-family SOTA, DTG-FF trails an architecture-matched BP-DeepSup baseline under identical recipe by 2.40 / 5.93 pp on CIFAR-10/100, with the gap widening in class count; on ImageNet-100 at 224×224 it reaches 49.4% against typical BP above 75% \[Tian et al., [2020](https://arxiv.org/html/2606.06539#bib.bib2)\], a real-data ceiling that persists at higher resolution. A within-dataset CIFAR-100 coarse-vs-fine probe combined with the synthetic-vs-real $K$-axis reversal supports an interpretation in which synthetic $K$-sweeps confound output dimensionality with fine-grained discrimination, leading current FF benchmarks to overstate real-data transferability. A unifying lens reads several FF-family improvements as partial substitutes for the supervised cross-layer signal that BP provides natively; under fair systems baselines, even the structural $\mathcal{O}(1)$-in-depth activation property of pipelined FF does not translate into measured advantage over BP+gradient-accumulation on commodity hardware.

## References

- Hinton \[2022\] Geoffrey Hinton. The forward-forward algorithm: Some preliminary investigations. In *NeurIPS 2022 Workshop*, 2022. arXiv:2212.13345.
- Tian et al. \[2020\] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In *European Conference on Computer Vision (ECCV)*, 2020. Establishes the ImageNet-100 benchmark; reports supervised ResNet-50 baseline above 75%.
- Crick \[1989\] Francis Crick. The recent excitement about neural networks. *Nature*, 337:129–132, 1989.
- Lillicrap et al. \[2020\] Timothy P. Lillicrap, Adam Santoro, Luke Marris, Colin J. Akerman, and Geoffrey Hinton. Backpropagation and the brain. *Nature Reviews Neuroscience*, 21:335–346, 2020.
- Tosato et al. \[2023\] Davide Tosato, Mark Shann, Halis Erdogan, and Bertrand Lebichot. Local signal adaptation in the forward-forward algorithm. *arXiv preprint arXiv:2305.12466*, 2023.
- Lee et al. \[2024\] Junyeol Lee, Kyungbae Park, Seungkyu Lee, and Yousun Shin. Symbiosis of forward-forward and back-propagation for self-recurrent forward-forward networks. *Nature Communications*, 2024.
- Sezener et al. \[2025\] Eren Sezener, Charlotte Magister, and Timothy Lillicrap. Deeperforward: Training deeper forward-forward networks. In *International Conference on Learning Representations*, 2025.
- Zhao et al. \[2024\] Hao Zhao et al. Adaptive symmetric goodness evaluation for forward-forward learning. *arXiv preprint*, 2024.
- Ororbia and Mali \[2023\] Alexander G. Ororbia and Ankur Mali. The predictive forward-forward algorithm. *arXiv preprint*, 2023.
- Journé et al. \[2023\] Adrien Journé, Hector Garcia Rodriguez, Qinghai Guo, and Timoleon Moraitis. Hebbian deep learning without feedback. In *Advances in Neural Information Processing Systems*, 2023.
- Lillicrap et al. \[2016\] Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. Random synaptic feedback weights support error backpropagation for deep learning. *Nature Communications*, 7:13276, 2016.
- Nøkland \[2016\] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In *Advances in Neural Information Processing Systems*, 2016.
- Launay et al. \[2020\] Julien Launay, Iacopo Poli, Kilian Muller, Gordon Wetzstein, et al. Direct feedback alignment scales to modern deep learning tasks and architectures. In *Advances in Neural Information Processing Systems*, 2020.
- Bengio \[2014\] Yoshua Bengio. How auto-encoders could provide credit assignment in deep networks via target propagation. *arXiv preprint arXiv:1407.7906*, 2014.
- Lee et al. \[2015a\] Dong-Hyun Lee, Saizheng Zhang, Antoine Biard, and Yoshua Bengio. Difference target propagation. In *Joint European Conference on Machine Learning and Knowledge Discovery in Databases*, 2015a.
- Meulemans et al. \[2024\] Alexander Meulemans, Matilde Tristany Farinha, Javier Garcia Ordonez, Pau Vilimelis Aceituno, João Sacramento, and Benjamin F. Grewe. Theoretical analysis of learned target propagation. *Journal of Machine Learning Research*, 2024.
- Scellier and Bengio \[2017\] Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. *Frontiers in Computational Neuroscience*, 11:24, 2017.
- Laborieux and Zenke \[2024\] Axel Laborieux and Friedemann Zenke. Coupled learning: An alternative to equilibrium propagation. In *International Conference on Learning Representations*, 2024.
- Scellier \[2023\] Benjamin Scellier. Backpropagation at the infinitesimal inference limit of energy-based models: Unifying predictive coding, equilibrium propagation, and contrastive Hebbian learning. In *Advances in Neural Information Processing Systems*, 2023.
- Rao and Ballard \[1999\] Rajesh P. N. Rao and Dana H. Ballard. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. *Nature Neuroscience*, 2(1):79–87, 1999.
- Whittington and Bogacz \[2017\] James C. R. Whittington and Rafal Bogacz. An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. *Neural Computation*, 29(5):1229–1262, 2017.
- Millidge et al. \[2022\] Beren Millidge, Alexander Tschantz, and Christopher L. Buckley. Predictive coding approximates backprop along arbitrary computation graphs. *Neural Computation*, 34(6):1329–1368, 2022.
- Salvatori et al. \[2023\] Tommaso Salvatori, Luca Pinchetti, Beren Millidge, et al. Incremental predictive coding: A parallel and fully automatic learning algorithm. In *Advances in Neural Information Processing Systems*, 2023.
- Dellaferrera and Kreiman \[2022\] Giorgia Dellaferrera and Gabriel Kreiman. Error-driven input modulation: Solving the credit assignment problem without a backward pass. In *International Conference on Machine Learning*, 2022.
- Ren et al. \[2023\] Mengye Ren, Simon Kornblith, Renjie Liao, and Geoffrey Hinton. Scaling forward gradient with local losses. In *International Conference on Learning Representations*, 2023.
- Szegedy et al. \[2015\] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In *IEEE Conference on Computer Vision and Pattern Recognition*, 2015.
- Lee et al. \[2015b\] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In *International Conference on Artificial Intelligence and Statistics*, 2015b.
- Belilovsky et al. \[2019\] Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. Greedy layerwise learning can scale to ImageNet. In *International Conference on Machine Learning*, 2019.
- Belilovsky et al. \[2020\] Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. Decoupled greedy learning of CNNs. In *International Conference on Machine Learning*, 2020.
- Nøkland and Eidnes \[2019\] Arild Nøkland and Lars Hiller Eidnes. Training neural networks with local error signals. In *International Conference on Machine Learning*, 2019.
- Tishby et al. \[2000\] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In *Allerton Conference on Communication, Control, and Computing*, 2000.
- Tishby and Zaslavsky \[2015\] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In *IEEE Information Theory Workshop*, 2015.
- Shwartz-Ziv and Tishby \[2017\] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. *arXiv preprint arXiv:1703.00810*, 2017.
- Saxe et al. \[2018\] Andrew M. Saxe, Yamini Bansal, Joel Dykstra, J. Zico Kolter, Samy Bengio, and David Sussillo. On the information bottleneck theory of deep learning. In *International Conference on Learning Representations*, 2018.
- Belghazi et al. \[2018\] Mohamed Ishmael Belghazi, Aristide Barber, Sherjil Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R. Devon Hjelm. Mutual information neural estimation. In *International Conference on Machine Learning*, 2018.
- Poole et al. \[2019\] Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, and George Tucker. On variational bounds of mutual information. In *International Conference on Machine Learning*, 2019.
- McAllester and Stratos \[2020\] David McAllester and Karl Stratos. Formal limitations on the measurement of mutual information. In *International Conference on Artificial Intelligence and Statistics*, 2020.
- Kraskov et al. \[2004\] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. *Physical Review E*, 69(6):066138, 2004.
- Ioffe and Szegedy \[2015\] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In *International Conference on Machine Learning*, 2015.
- Ba et al. \[2016\] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. *arXiv preprint arXiv:1607.06450*, 2016.
- Santurkar et al. \[2018\] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In *Advances in Neural Information Processing Systems*, 2018.
- Yang et al. \[2019\] Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. A mean field theory of batch normalization. In *International Conference on Learning Representations*, 2019.
- Hinton et al. \[2015\] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In *NeurIPS Deep Learning Workshop*, 2015.
- Guo et al. \[2017\] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In *International Conference on Machine Learning*, 2017.
- Li et al. \[2023\] Zheng Li, Xiang Li, Lingfeng Yang, Borui Zhao, Renjie Song, Lei Luo, Jun Li, and Jian Yang. Curriculum temperature for knowledge distillation. In *AAAI Conference on Artificial Intelligence*, 2023.
- Zhou et al. \[2023\] Yefan Zhou, Tianyu Pang, Keqin Liu, Charles Martin, Michael W. Mahoney, and Yaoqing Yang. Temperature balancing, layer-wise weight analysis, and neural network training. In *Advances in Neural Information Processing Systems*, 2023.
- Srinivasan et al. \[2023\] Aditya Srinivasan, Tomaso Poggio, and Rahul Bhatt. Forward-forward training of an optical neural network. *arXiv preprint*, 2023.
- Ross \[2014\] Brian C. Ross. Mutual information between discrete and continuous data sets. *PLoS ONE*, 9(2):e87357, 2014. doi:10.1371/journal.pone.0087357.
- Rajbhandari et al. \[2020\] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In *SC20: International Conference for High Performance Computing, Networking, Storage and Analysis*, pages 1–16, 2020.
- Zhao et al. \[2023\] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: experiences on scaling fully sharded data parallel. *arXiv preprint arXiv:2304.11277*, 2023.

## Appendix A Formal DTG-FF Description

This appendix formalizes DTG\-FF components referenced in Sec\.[4](https://arxiv.org/html/2606.06539#S4)\. Subsections cover \(A\.1\) the unified temperature\-scaled goodness form and the two loss instances, \(A\.2\) MLP\-specific additions \(warm\-start and label overlay\), \(A\.3\) the locality taxonomy including MLP’s hybrid classifier path, \(A\.4\) the temperature parametrization, and \(A\.5\) random\-projection scaling and the gradient\-magnitude bound\.

### A.1 Unified Temperature-Scaled Local Goodness and Loss Instances

Let layer $l$ produce post-ReLU activation $\mathbf{z}_l(x)=\mathrm{ReLU}(W_l\hat{\mathbf{x}}_{l-1})$, where $\hat{\mathbf{x}}_{l-1}$ is the detached input from layer $l-1$. Define a nonnegative goodness representation $\mathbf{u}_l(x)=\phi_l(\mathbf{z}_l(x))\in\mathbb{R}^{m_l}_{\geq 0}$ and a layer-local learnable temperature $T_l>0$ (Sec. [4.1](https://arxiv.org/html/2606.06539#S4.SS1)). The temperature-scaled goodness is $\tilde{\mathbf{u}}_l=\mathbf{u}_l/T_l$. DTG-FF instances share the form

$$\mathcal{L}_l^{\mathrm{FF}}(x,\cdot)=\ell_l\!\big(\psi_l(\tilde{\mathbf{u}}_l),\,\cdot\big),\qquad\nabla_{\mathbf{u}_l}\mathcal{L}_l^{\mathrm{FF}}=T_l^{-1}\,J_{\psi_l}(\tilde{\mathbf{u}}_l)^{\!\top}\,\nabla_{\psi}\ell_l,\tag{5}$$

so temperature modulates the magnitude of the local feature-learning signal in every instance, without altering $I(\mathbf{u}_l;Y)$ under a fixed-batch conditioning.

#### MLP instance \(scalar goodness, margin loss\)\.

With $m_l=1$ and $u_l=\tfrac{1}{d_l}\|\mathbf{z}_l\|_2^2$, let $s\in\{+1,-1\}$ encode positive/negative supervision (Hinton FF convention):

$$\psi_l^{\mathrm{MLP}}(\tilde{u}_l)=\tilde{u}_l-\theta_l,\qquad\ell_l^{\mathrm{MLP}}(\psi,s)=\log\!\big(1+\exp(-s\psi)\big).\tag{6}$$
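For the scalar MLP instance, the $T_l^{-1}$ prefactor of Eq. (5) can be checked numerically. The sketch below is a minimal illustration, not the authors' code: it implements Eq. (6) directly, derives the gradient with respect to the unscaled goodness $u_l$ by hand, and compares it against a central finite difference. All function names and the test values of `u`, `theta`, and `T` are assumptions chosen for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def mlp_local_loss(u, T, theta, s):
    """Eq. (6) applied to psi = u / T - theta."""
    psi = u / T - theta
    return math.log1p(math.exp(-s * psi))


def analytic_grad_u(u, T, theta, s):
    """Eq. (5) in the scalar case: (1 / T) * d(loss)/d(psi)."""
    psi = u / T - theta
    dloss_dpsi = -s * sigmoid(-s * psi)
    return dloss_dpsi / T


def numeric_grad_u(u, T, theta, s, eps=1e-6):
    """Central finite difference with respect to the unscaled goodness u."""
    hi = mlp_local_loss(u + eps, T, theta, s)
    lo = mlp_local_loss(u - eps, T, theta, s)
    return (hi - lo) / (2 * eps)


# Illustrative values only.
u, theta, s = 3.0, 1.5, +1
errors = []
for T in (0.5, 1.0, 2.0):
    a = analytic_grad_u(u, T, theta, s)
    n = numeric_grad_u(u, T, theta, s)
    errors.append(abs(a - n))
    print(f"T={T}: analytic {a:+.6f}, numeric {n:+.6f}")
```

Note that $T_l$ also enters through $\psi$ itself, so the gradient is not exactly proportional to $1/T_l$; the prefactor is one of two places where temperature acts.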

#### CNN instance ($K$-way local cross-entropy).

With $m_l\gg 1$ and $\mathbf{u}_l=\mathrm{flatten}(\mathrm{AdaptiveAvgPool2d}(\mathbf{z}_l^{2},(h_l,w_l)))$:

$$\psi_l^{\mathrm{CNN}}(\tilde{\mathbf{u}}_l)=\tilde{\mathbf{u}}_l^{\!\top}R_l+\mathbf{b}_l,\qquad\ell_l^{\mathrm{CNN}}(\boldsymbol{\psi},y)=\mathrm{CE}(\boldsymbol{\psi},y),\tag{7}$$

with fixed (non-learned) $R_l\in\mathbb{R}^{m_l\times K}$, $R_{l,ij}\sim\mathcal{N}(0,1/K)$, and $\mathbf{b}_l\in\mathbb{R}^{K}$, $b_{l,k}\sim\mathcal{N}(0,1/K)$, registered as buffers.

#### Why two loss families\.

The objective switch follows convention (Hinton \[[2022](https://arxiv.org/html/2606.06539#bib.bib1)\] for MLP; Tosato et al. \[[2023](https://arxiv.org/html/2606.06539#bib.bib5)\], Zhao et al. \[[2024](https://arxiv.org/html/2606.06539#bib.bib8)\] for FF-CNN), supported by a dimensionality argument: when $m_l=1$, the linear readout $a_k=r_{lk}\,g_l/T_l+b_{lk}$ from the scalar goodness $g_l$ (the $u_l$ of the MLP instance) to $K$ logits has rank 1 by construction (its image is 1-dimensional), so softmax realizes only a 1-parameter family of class distributions and margin supervision is the natural choice. When $m_l\gg 1$, $K$-way CE is well-defined and we adopt it. We do not claim the dimensionality argument uniquely determines the loss family, only that it explains why the two instantiations differ.

### A\.2MLP\-Specific Additions: Warm\-Start and Label Overlay

The MLP loss in `main.py` adds a warm-start initialization and the standard Hinton label-overlay procedure on top of the unified form in Eq. ([5](https://arxiv.org/html/2606.06539#A1.E5)).

#### Full MLP per\-layer loss\.

$$\mathcal{L}_l^{\mathrm{MLP,full}}=-\log\sigma(g_l^{+}/T_l-\theta_l)-\log\sigma(\theta_l-g_l^{-}/T_l),\tag{8}$$

where $g_l^{+}$ and $g_l^{-}$ are the scalar goodness values of the positive and negative passes and $\theta_l=\mathrm{softplus}(\theta_{0,l})$ is a learnable scalar margin per layer.
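Eq. (8) is the sum of two instances of the margin loss in Eq. (6), one with $s=+1$ on the positive goodness and one with $s=-1$ on the negative goodness. The sketch below checks that identity numerically. It is an illustration under assumed values: the function names are hypothetical, and the goodness values, temperature, and `theta0` are arbitrary.

```python
import math


def softplus(v):
    """Numerically stable log(1 + exp(v))."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def log_sigmoid(v):
    """log(sigma(v)) = -softplus(-v)."""
    return -softplus(-v)


def margin_loss(psi, s):
    """Eq. (6): log(1 + exp(-s * psi))."""
    return softplus(-s * psi)


def full_mlp_loss(g_pos, g_neg, T, theta0):
    """Eq. (8) with theta = softplus(theta0)."""
    theta = softplus(theta0)
    return -log_sigmoid(g_pos / T - theta) - log_sigmoid(theta - g_neg / T)


def sum_of_instances(g_pos, g_neg, T, theta0):
    """Eq. (6) evaluated once with s = +1 and once with s = -1."""
    theta = softplus(theta0)
    return margin_loss(g_pos / T - theta, +1) + margin_loss(g_neg / T - theta, -1)


# Arbitrary illustrative values.
args = dict(g_pos=4.2, g_neg=1.1, T=1.05, theta0=2.0)
direct = full_mlp_loss(**args)
composed = sum_of_instances(**args)
print(f"Eq. (8): {direct:.6f}, sum of Eq. (6) instances: {composed:.6f}")
```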

#### Warm\-start initialization\.

On the first forward pass, $\theta_{0,l}$ is initialized from the measured average goodness $\bar{g}_l^{(0)}$: if $\bar{g}_l^{(0)}>20$, $\theta_{0,l}\leftarrow\bar{g}_l^{(0)}$ (the regime where softplus is numerically linear, $\mathrm{softplus}(x)\approx x$); otherwise $\theta_{0,l}\leftarrow\log(\exp(\bar{g}_l^{(0)})-1)$ (inverse softplus). This places the initial decision boundary at layer-specific activation scales rather than at $\theta=0$.
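The warm-start rule amounts to inverting softplus so that the initial margin equals the measured mean goodness. A minimal sketch follows, assuming a strictly positive mean goodness. The function name `warm_start_theta0` and the constant name are hypothetical; only the threshold of 20 and the two branches come from the text.

```python
import math

LINEAR_THRESHOLD = 20.0  # threshold quoted in the text


def softplus(v):
    """Numerically stable log(1 + exp(v))."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def warm_start_theta0(mean_goodness):
    """Return theta_0 such that softplus(theta_0) is close to mean_goodness.

    Assumes mean_goodness > 0, since the inverse softplus is undefined at 0.
    """
    if mean_goodness > LINEAR_THRESHOLD:
        # softplus(x) is numerically indistinguishable from x in this regime.
        return mean_goodness
    # Inverse softplus: log(exp(g) - 1), written with expm1 for small g.
    return math.log(math.expm1(mean_goodness))


recovered = {g: softplus(warm_start_theta0(g)) for g in (0.05, 1.3, 7.0, 35.0)}
for g, theta in recovered.items():
    print(f"mean goodness {g}: initial margin {theta:.6f}")
```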

#### Label overlay\.

In the MLP, the input to each FF forward pass is not the raw image $x$ but $x$ with its first $K$ entries overwritten by $\lambda\mathbf{e}_y$ (one-hot target scaled by $\lambda=1.0$ by default). The positive pass uses $y^{+}=y$ (true label) and the negative pass uses $y^{-}=\mathrm{random}(\{0,\dots,K-1\}\setminus\{y\})$. The overlay is standard Hinton FF procedure and is omitted from the CNN instance, which uses CE on all $K$ labels and therefore does not require a separate negative construction.
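The overlay and negative-label construction can be sketched on a flattened input represented as a plain list. This is an illustration under assumptions: the function names are hypothetical, and the negative label is drawn uniformly from the wrong classes, which the text does not specify beyond "random".

```python
import random


def overlay_label(x, label, num_classes, scale=1.0):
    """Copy x and overwrite its first num_classes entries with scale * one_hot(label)."""
    if len(x) < num_classes:
        raise ValueError("input is shorter than the number of classes")
    out = list(x)
    for k in range(num_classes):
        out[k] = scale if k == label else 0.0
    return out


def sample_negative_label(label, num_classes, rng):
    """Draw a label uniformly from the classes other than the true one (assumed uniform)."""
    wrong = rng.randrange(num_classes - 1)
    return wrong if wrong < label else wrong + 1


rng = random.Random(0)
K = 10
x = [0.5] * 16          # toy flattened input, not a real image
y = 3

x_pos = overlay_label(x, y, K)              # positive pass: true label
y_neg = sample_negative_label(y, K, rng)    # negative pass: a wrong label
x_neg = overlay_label(x, y_neg, K)
```

Entries beyond the first $K$ are left untouched, so the two passes differ only in the overlaid label slots.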

### A.3 Locality Taxonomy and MLP Hybrid Classifier

We distinguish three locality properties:

Inter-layer gradient locality. Updates to $W_l$ require no gradient propagated through layers $j\neq l$ in the relevant loss.

Batch locality. Updates may use statistics aggregated across samples within the same minibatch (as in BatchNorm).

Sample locality. Updates for sample $x_b$ use only statistics derivable from $x_b$ alone.

The FF (goodness) loss in both MLP and CNN DTG-FF satisfies *inter-layer gradient locality* (enforced by `detach` of $\mathbf{z}_l$ before passing to layer $l+1$) and *sample locality* in the FF path: since $T_l$ and $\theta_l$ are learnable parameters of layer $l$ rather than batch statistics, the per-layer loss for sample $x_b$ depends only on $x_b$ and layer-$l$ parameters.

#### Hybrid classifier head \(MLP\)\.

The MLP pipeline combines the FF per-layer loss with a global CE over a classifier built from the *same* layer-wise linear weights: for each layer $i<L-1$, the classifier input is $\mathrm{BN}_i(W_i z_{i-1})$ (no `detach`), concatenated across layers and fed into a linear readout. The joint objective is

$$\mathcal{L}^{\mathrm{MLP,total}}=\lambda_{\mathrm{ff}}\sum_{l=0}^{L-1}\mathcal{L}_l^{\mathrm{MLP,full}}\;+\;\mathcal{L}^{\mathrm{CE}}_{\mathrm{cls}},\tag{9}$$

with default $\lambda_{\mathrm{ff}}=0.1$. Consequently, the classifier loss backpropagates end-to-end through the layer weights, and the MLP variant is a *hybrid* FF+BP system (consistent with the auxiliary-classifier tradition \[Belilovsky et al., [2019](https://arxiv.org/html/2606.06539#bib.bib28)\]) rather than a strictly inter-layer-local training procedure. We state this explicitly to avoid readers inferring pure biological plausibility.

#### CNN pipeline\.

The CNN classifier path is separated from conv training by `torch.no_grad()` wrappers during feature extraction (see Sec. [4.2](https://arxiv.org/html/2606.06539#S4.SS2)); the conv backbone receives gradients only from its own per-layer FF loss. The CNN is therefore strictly inter-layer-gradient-local in the conv backbone.

#### MLP vs\. CNN classifier index sets differ\.

The MLP hybrid classifier above uses layers $i<L-1$ (i.e., excludes the *last* hidden layer). The CNN concat classifier (Sec. [4.2](https://arxiv.org/html/2606.06539#S4.SS2)) uses layers $1,\ldots,L-1$ in 0-based indexing (i.e., excludes the *first* conv layer). The two variants therefore drop opposite ends, following the auxiliary-classifier-head conventions of the respective architecture families \[Belilovsky et al., [2019](https://arxiv.org/html/2606.06539#bib.bib28)\]; both choices are matched by the implementations and are not load-bearing claims.

### A.4 Temperature Parametrization

Each layer’s temperature is parametrized as $T_l=T_{\min}+(T_{\max}-T_{\min})\,\sigma(\alpha_l)$, where $\alpha_l\in\mathbb{R}$ is a single learnable scalar per layer (initialized to 0, giving $T_l=(T_{\min}+T_{\max})/2$ at the start of training) and $\sigma$ is the logistic sigmoid. The bound $T_l\in[T_{\min},T_{\max}]$, with $T_{\min}=0.5$, $T_{\max}=2.0$ for CNN and $T_{\min}=0.1$, $T_{\max}=2.0$ for MLP, prevents $T_l$ from collapsing to 0 or diverging. $T_l$ is updated alongside $W_l$ via gradient descent on the layer-local loss (Eq. ([5](https://arxiv.org/html/2606.06539#A1.E5))).

At inference on the CNN, the per-layer logit $\psi_l^{\mathrm{CNN}}(\mathbf{u}_l/T_l)=(\mathbf{u}_l/T_l)^{\!\top}R_l+\mathbf{b}_l$ retains the temperature scaling (since $T_l$ is a fixed learned parameter). The MLP goodness loss is only evaluated during training; at test time the MLP uses its classifier head without invoking the margin objective.
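The sigmoid-bounded parametrization is easy to check in isolation. The sketch below is a minimal illustration with hypothetical names (`temperature`, `CNN_RANGE`, `MLP_RANGE`); only the formula and the two ranges come from the text. It evaluates the initial temperature at $\alpha_l=0$ and confirms that extreme values of $\alpha_l$ stay within the bounds.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def temperature(alpha, t_min, t_max):
    """T = t_min + (t_max - t_min) * sigmoid(alpha)."""
    return t_min + (t_max - t_min) * sigmoid(alpha)


CNN_RANGE = (0.5, 2.0)  # (T_min, T_max) quoted for the CNN
MLP_RANGE = (0.1, 2.0)  # (T_min, T_max) quoted for the MLP

init_cnn = temperature(0.0, *CNN_RANGE)   # midpoint of the CNN range
init_mlp = temperature(0.0, *MLP_RANGE)   # midpoint of the MLP range

for alpha in (-50.0, -2.0, 0.0, 2.0, 50.0):
    print(f"alpha={alpha:+.0f}: T_cnn={temperature(alpha, *CNN_RANGE):.4f}, "
          f"T_mlp={temperature(alpha, *MLP_RANGE):.4f}")
```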

### A\.5Random Projection Scale and Gradient\-Magnitude Bound

For the CNN instance, $\hat{\mathbf{y}}_l = (\mathbf{u}_l/T_l)^{\top} R_l + \mathbf{b}_l$ with fixed $R_l \in \mathbb{R}^{m_l \times K}$, $R_{l,ij} \sim \mathcal{N}(0, 1/K)$. The CE gradient w.r.t. goodness is

$$\nabla_{\mathbf{u}_l}\mathcal{L}_l^{\mathrm{CNN}} = T_l^{-1}\, R_l\, (\mathbf{p} - \mathbf{e}_y), \quad \mathbf{p} = \mathrm{softmax}(\hat{\mathbf{y}}_l), \quad \|\mathbf{p} - \mathbf{e}_y\|_2 \leq \sqrt{2}, \tag{10}$$

so

$$\|\nabla_{\mathbf{u}_l}\mathcal{L}_l^{\mathrm{CNN}}\|_2 \;\leq\; \frac{\sqrt{2}}{T_l}\, \|R_l\|_{\mathrm{op}}, \tag{11}$$

where $\|R_l\|_{\mathrm{op}}$ is the spectral norm. For class columns $r_k \in \mathbb{R}^{m_l}$, $\mathbb{E}\|r_k\|_2^2 = m_l/K$; with typical FF-CNN sizes $m_l \in \{512, 1024, 2048\}$ and $K = 10$, typical column norms are $O(\sqrt{m_l/K}) \sim 7$ to $15$.
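The two numeric ingredients of this bound can be checked with a short stdlib-only sketch: the $\sqrt{2}$ cap on the softmax residual and the expected column norm $\sqrt{m_l/K}$ for the sizes quoted above. The helper names are hypothetical, and the logits in the worst-case example are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def residual_norm(logits, y):
    """L2 norm of (p - e_y), the class-space factor of the CE gradient."""
    p = softmax(logits)
    return math.sqrt(sum((pk - (1.0 if k == y else 0.0)) ** 2
                         for k, pk in enumerate(p)))

def expected_column_norm(m_l, num_classes):
    """sqrt(E||r_k||^2) for entries drawn i.i.d. from N(0, 1/K)."""
    return math.sqrt(m_l / num_classes)

# A confidently wrong prediction approaches the sqrt(2) worst case.
worst_case = residual_norm([50.0, 0.0, 0.0], y=1)

# Column-norm scale for the layer widths quoted in the text, K = 10.
column_norms = {m: expected_column_norm(m, 10) for m in (512, 1024, 2048)}
```

The resulting column norms fall between roughly 7 and 15, matching the range stated in the text.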

#### Temperature ranges differ by architecture\.

For CNN, $T_l \in [0.5, 2.0]$, so the DTG prefactor $1/T_l$ varies by at most $T_{\max}/T_{\min} = 4$. For MLP, $T_l \in [0.1, 2.0]$, with a ratio of $20$; the stronger MLP modulation is consistent with the coarser scalar-goodness signal needing larger relative gradient-scale adjustment.

#### Scale\-confound ablation\.

To check whether DTG’s gain is merely compensating for per-column norm variance in the Gaussian-sampled $R_l$, we run a 4-cell ablation on CIFAR-10 (VGG8, 400 epochs, seed 42) crossing $\{T = 1, \mathrm{DTG}\} \times \{\text{unnormalized } R_l, \text{unit-column-norm } \tilde{R}_l\}$:

| Temperature | unnormalized $R_l$ | normalized $\tilde{R}_l$ | $\Delta$ (unnorm $-$ norm) |
| --- | --- | --- | --- |
| $T = 1$ (no DTG) | 90.61% | 89.71% | $+0.90$ |
| DTG (learnable $T_l$) | 91.33% | 90.22% | $+1.11$ |
| $\Delta$ (DTG $-$ $T = 1$) | $+0.72$ | $+0.51$ | n/a |

Two findings. First, DTG’s gain over $T = 1$ persists under column-normalized $R_l$ ($+0.51$ pp), so DTG is not purely compensating for random-column-norm variation: roughly $0.51/0.72 \approx 71\%$ of DTG’s advantage is independent of $R_l$ scale. Second, normalizing $R_l$ columns hurts both the $T = 1$ and DTG configurations by $\approx 1$ percentage point, suggesting the default Gaussian column-norm variation acts as a mild implicit regularizer. We therefore do not adopt column normalization in the default implementation, and we report the findings here to pre-empt the scale-compensation concern.

## Appendix B Extended Related Work

This appendix expands Sec. [2](https://arxiv.org/html/2606.06539#S2) with coverage too long for the 9-page main text.

#### Forward\-Forward variants\.

Srinivasan et al. \[[2023](https://arxiv.org/html/2606.06539#bib.bib47)\] proposed symmetric forward-forward training for stability. Tosato et al. \[[2023](https://arxiv.org/html/2606.06539#bib.bib5)\] (LSFF) introduced local-signal adaptation, the first FF-CNN. Lee et al. \[[2024](https://arxiv.org/html/2606.06539#bib.bib6)\] (SCFF, Nature Comm) used self-recurrence, where each layer’s output feeds back before propagating forward. Sezener et al. \[[2025](https://arxiv.org/html/2606.06539#bib.bib7)\] (DeeperForward, ICLR 2025) independently observed that BN disrupts goodness and removed normalization entirely; combined with progressive layerwise training, they reach 88.72% on CIFAR-10. Zhao et al. \[[2024](https://arxiv.org/html/2606.06539#bib.bib8)\] (ASGE) introduced per-layer spatial features and logit summation across layers, reaching 90.62%, the prior FF-family best. Ororbia and Mali \[[2023](https://arxiv.org/html/2606.06539#bib.bib9)\] (PFF) combined FF with predictive coding, re-introducing partial inter-layer coordination.

#### Spatial goodness attribution\.

We want to clearly credit prior work: spatial goodness vectors (as opposed to scalar goodness) originated in the FF-CNN literature, notably ASGE \[Zhao et al., [2024](https://arxiv.org/html/2606.06539#bib.bib8)\] and LSFF \[Tosato et al., [2023](https://arxiv.org/html/2606.06539#bib.bib5)\]. Our Pathway 1 (signal quality) in Sec. [3.1](https://arxiv.org/html/2606.06539#S3.SS1) identifies vector goodness as one of the three DTG-FF components, but the *contribution* of this work is DTG (temperature mechanism), decoupled normalization, and multi-layer integration, not spatial goodness per se.

#### Biologically plausible BP alternatives\.

Feedback alignment replaces the transpose of forward weights with random feedback \[Lillicrap et al., [2016](https://arxiv.org/html/2606.06539#bib.bib11)\]; direct feedback alignment projects errors directly to each hidden layer \[Nøkland, [2016](https://arxiv.org/html/2606.06539#bib.bib12), Launay et al., [2020](https://arxiv.org/html/2606.06539#bib.bib13)\]. Target propagation computes layer-wise targets via learned inverse mappings \[Bengio, [2014](https://arxiv.org/html/2606.06539#bib.bib14), Lee et al., [2015a](https://arxiv.org/html/2606.06539#bib.bib15), Meulemans et al., [2024](https://arxiv.org/html/2606.06539#bib.bib16)\]. Equilibrium propagation uses two-phase Hebbian updates at energy-based equilibrium \[Scellier and Bengio, [2017](https://arxiv.org/html/2606.06539#bib.bib17), Laborieux and Zenke, [2024](https://arxiv.org/html/2606.06539#bib.bib18)\]; Scellier \[[2023](https://arxiv.org/html/2606.06539#bib.bib19)\] showed predictive coding, EqProp, and contrastive Hebbian learning approximate BP in a unified limit. Predictive coding networks provide structured backward signals via prediction errors \[Rao and Ballard, [1999](https://arxiv.org/html/2606.06539#bib.bib20), Whittington and Bogacz, [2017](https://arxiv.org/html/2606.06539#bib.bib21), Millidge et al., [2022](https://arxiv.org/html/2606.06539#bib.bib22), Salvatori et al., [2023](https://arxiv.org/html/2606.06539#bib.bib23)\]. Perturbation-based methods use input modulation \[Dellaferrera and Kreiman, [2022](https://arxiv.org/html/2606.06539#bib.bib24), Ren et al., [2023](https://arxiv.org/html/2606.06539#bib.bib25)\]. An *informal* ordering by retained backward-signal richness (BP $>$ FA/DFA $>$ TP $>$ PC $>$ FF) helps position FF as the most constrained end of the design space; this ordering is intuitive rather than formally established, since “retained information” is not uniformly defined across methods. DTG-FF preserves the zero-backward-gradient property in its FF goodness path.

#### Auxiliary\-classifier heads with local losses\.

Deeply-supervised networks \[Lee et al., [2015b](https://arxiv.org/html/2606.06539#bib.bib27)\], GoogLeNet \[Szegedy et al., [2015](https://arxiv.org/html/2606.06539#bib.bib26)\], and greedy layer-wise training \[Belilovsky et al., [2019](https://arxiv.org/html/2606.06539#bib.bib28), [2020](https://arxiv.org/html/2606.06539#bib.bib29)\] attach classifiers to intermediate layers. Nøkland and Eidnes \[[2019](https://arxiv.org/html/2606.06539#bib.bib30)\] used similarity matching and prediction auxiliary losses. These differ from FF in that classifiers’ gradients typically backpropagate into convolutional layers (providing richer supervision than scalar goodness). DTG-FF’s CNN classifier is `detach`-separated from the conv backbone; the MLP classifier is a hybrid (App. [A.3](https://arxiv.org/html/2606.06539#A1.SS3)).

#### Information\-theoretic perspectives on deep learning\.

Tishby et al. \[[2000](https://arxiv.org/html/2606.06539#bib.bib31)\] introduced the information bottleneck; Tishby and Zaslavsky \[[2015](https://arxiv.org/html/2606.06539#bib.bib32)\] proposed deep networks implicitly optimize it; Shwartz-Ziv and Tishby \[[2017](https://arxiv.org/html/2606.06539#bib.bib33)\] provided empirical support. Saxe et al. \[[2018](https://arxiv.org/html/2606.06539#bib.bib34)\] showed compression depends on activation function: ReLU networks do not compress in the same sense as tanh networks. MI estimation in high dimensions is notoriously hard \[Belghazi et al., [2018](https://arxiv.org/html/2606.06539#bib.bib35), Poole et al., [2019](https://arxiv.org/html/2606.06539#bib.bib36), McAllester and Stratos, [2020](https://arxiv.org/html/2606.06539#bib.bib37)\]. Our measurements (App. [C](https://arxiv.org/html/2606.06539#A3)) avoid deep estimator reliance by using scalar KSG \[Kraskov et al., [2004](https://arxiv.org/html/2606.06539#bib.bib38)\] and linear-probe Fano bounds.

#### Normalization and temperature scaling\.

BN’s role in optimization smoothing rather than covariate shift was established by Santurkar et al. \[[2018](https://arxiv.org/html/2606.06539#bib.bib41)\] and Yang et al. \[[2019](https://arxiv.org/html/2606.06539#bib.bib42)\]. Ba et al. \[[2016](https://arxiv.org/html/2606.06539#bib.bib40)\] introduced LayerNorm as a batch-independent alternative. Temperature scaling originated in distillation \[Hinton et al., [2015](https://arxiv.org/html/2606.06539#bib.bib43)\] and calibration \[Guo et al., [2017](https://arxiv.org/html/2606.06539#bib.bib44)\]; curriculum-scheduled temperatures \[Li et al., [2023](https://arxiv.org/html/2606.06539#bib.bib45)\] and spectral-driven temperatures \[Zhou et al., [2023](https://arxiv.org/html/2606.06539#bib.bib46)\] operate within BP-trained networks. DTG differs by scaling the local *learning signal* in a layer-local loss, not the softmax output.

## Appendix C Information Diagnostic Methodology

### C.1 Setup

![Refer to caption](https://arxiv.org/html/2606.06539v1/x4.png)

Figure 4: Per-layer task-information diagnostic on a trained DTG-FF VGG8 (CIFAR-10, $91.33\%$). Scalar goodness uses the KSG estimator; spatial goodness and GAP report linear-probe Fano lower bounds.

Measurements use the DTG-FF VGG8 trained on CIFAR-10 (concat-classifier variant, test accuracy $91.33\%$). We extract per-layer activations from the test set ($10{,}000$ images) after the ReLU and before inter-layer normalization. For each layer $l \in \{0, \dots, 6\}$ we compute:

- scalar goodness $g_l = \tfrac{1}{d_l}\|\mathbf{z}_l\|_2^2 \in \mathbb{R}$ per sample,
- spatial goodness vector $\mathbf{g}_l^{\mathrm{vec}} = \mathrm{flatten}(\mathrm{AAP}(\mathbf{z}_l^2, (h_l, w_l))) \in \mathbb{R}^{m_l}$,
- GAP features $\mathbf{f}_l^{\mathrm{GAP}} = \tfrac{1}{H_l W_l}\sum_{h,w} z_{l,c,h,w} \in \mathbb{R}^{C_l}$.

### C.2 Estimators

#### Scalar MI\.

We use the Kraskov-Stögbauer-Grassberger (KSG) estimator \[Kraskov et al., [2004](https://arxiv.org/html/2606.06539#bib.bib38)\] with $k = 4$ neighbors via `sklearn.feature_selection.mutual_info_classif` (`random_state=42`) for $I(g_l; Y)$; this routine implements the Ross \[[2014](https://arxiv.org/html/2606.06539#bib.bib48)\] extension of KSG for continuous-feature/discrete-target MI. As a sensitivity check we also compute histogram-based MI with bin counts $\{50, 100, 200, 500\}$.

#### Vector MI lower bound\.

For $\mathbf{g}^{\mathrm{vec}}_l$ and $\mathbf{f}^{\mathrm{GAP}}_l$, we train a linear probe (softmax regression) on an 80/20 split of the test set (50 epochs Adam, $\mathrm{lr} = 0.01$, inputs standardized by training-split statistics). Let $\hat{Y}$ be the probe’s prediction on held-out samples with error $P_e = \Pr[\hat{Y} \neq Y]$. Since $\hat{Y}$ is a deterministic function of the layer features (call them $F$), the data-processing inequality gives $I(F; Y) \geq I(\hat{Y}; Y)$. By Fano’s inequality with uniform prior,

$$I(\hat{Y}; Y) \geq \log_2 K - H_2(P_e) - P_e \log_2(K-1), \tag{12}$$

where $H_2$ is binary entropy. This lower bound depends on linear-probe separability and is not representation-intrinsic.
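A minimal sketch of the Fano lower bound in Eq. (12) as a function of probe error and class count follows. The function names are hypothetical, the clamp at zero is an addition made here (mutual information is non-negative, so the clamp only tightens a vacuous bound), and the error rate of 0.10 is an illustrative value rather than a measured one.

```python
import math

def binary_entropy(p):
    """H_2(p) in bits, with the convention 0 * log(0) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def fano_lower_bound_bits(error_rate, num_classes):
    """Lower bound on I(Y_hat; Y) in bits under a uniform class prior."""
    bound = (math.log2(num_classes)
             - binary_entropy(error_rate)
             - error_rate * math.log2(num_classes - 1))
    # Assumption added here: clamp at zero because MI cannot be negative.
    return max(bound, 0.0)

# A perfect probe recovers the full label entropy log2(K).
perfect = fano_lower_bound_bits(0.0, 10)

# A chance-level probe (error (K-1)/K) certifies no information.
chance = fano_lower_bound_bits(0.9, 10)

# Illustrative error rate, not taken from the measurements.
example = fano_lower_bound_bits(0.10, 10)
```

Because the bound is driven entirely by the probe's error rate, a stronger probe class can only raise it, which is why the text treats it as probe-dependent rather than intrinsic to the representation.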

### C.3 Per-layer Measurements

#### Histogram sensitivity\.

Mean $I(g; Y)$ across layers: 0.22 bits (50 bins), 0.24 (100), 0.28 (200), 0.38 (500). Histogram estimates inflate with bin count (finite-sample bias); KSG at $k = 4$ gives 0.20 bits, consistent with the lower end of the histogram range.
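The finite-sample inflation of histogram MI can be reproduced on synthetic data where the true mutual information is zero. The sketch below is a plain plug-in estimator written for illustration; it uses made-up independent inputs, not the paper's activations, and the function name and sample sizes are assumptions.

```python
import math
import random
from collections import Counter

def histogram_mi_bits(values, labels, num_bins):
    """Plug-in MI estimate (bits) between values in [0, 1) and discrete labels."""
    n = len(values)
    bins = [min(int(v * num_bins), num_bins - 1) for v in values]
    joint = Counter(zip(bins, labels))
    bin_counts = Counter(bins)
    label_counts = Counter(labels)
    mi = 0.0
    for (b, y), c in joint.items():
        mi += (c / n) * math.log2(c * n / (bin_counts[b] * label_counts[y]))
    return mi

# Synthetic data: the scalar feature is independent of the label,
# so any positive estimate is pure estimator bias.
rng = random.Random(0)
n_samples, n_classes = 10_000, 10
values = [rng.random() for _ in range(n_samples)]
labels = [rng.randrange(n_classes) for _ in range(n_samples)]

estimates = {b: histogram_mi_bits(values, labels, b) for b in (50, 100, 200, 500)}
```

Even with no dependence at all, the estimate grows with the number of bins, which is the pattern the histogram sweep above exhibits and the reason the lower, bin-free KSG value is the more conservative reading.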

![Refer to caption](https://arxiv.org/html/2606.06539v1/x5.png)

Figure 5: Sensitivity of scalar-goodness mutual-information estimates to estimator choice. Histogram estimates increase with bin count, consistent with finite-sample bias; the KSG estimates remain near the lower end of the histogram range.

#### Per-layer observables for Pathway 3.

Using logit-sum inference on the concat-classifier checkpoint (CIFAR-10, $91.33\%$), per-layer accuracies are $\{43.5, 77.6, 85.1, 88.7, 91.2, 91.3, 91.0\}\%$. Pairwise prediction disagreement averages $25.1\%$ (off-diagonal mean). The oracle-ensemble accuracy (“any layer correct”) is $96.9\%$ against the logit-sum-evaluated ensemble’s $91.5\%$, a gap of $5.4$ percentage points. The $91.5\%$ figure is distinct from the $91.79\%$ headline, which is obtained by training a dedicated logit-sum model from scratch; here the random-projection heads were never the optimization target. Sec. [3.1](https://arxiv.org/html/2606.06539#S3.SS1) reports these values as observable support for the three-pathway lens.
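The two ensemble observables used here, oracle accuracy and mean pairwise disagreement, can be computed from per-layer predictions as sketched below. The helper names and the toy predictions are hypothetical and serve only to make the definitions concrete.

```python
from itertools import combinations

def oracle_accuracy(per_layer_preds, targets):
    """Fraction of samples that at least one layer classifies correctly."""
    n = len(targets)
    hits = sum(
        any(preds[i] == targets[i] for preds in per_layer_preds)
        for i in range(n)
    )
    return hits / n

def mean_pairwise_disagreement(per_layer_preds):
    """Mean over unordered layer pairs of the fraction of differing predictions."""
    n = len(per_layer_preds[0])
    rates = [
        sum(a[i] != b[i] for i in range(n)) / n
        for a, b in combinations(per_layer_preds, 2)
    ]
    return sum(rates) / len(rates)

# Toy example: three layers, four samples (made-up values).
targets = [0, 1, 2, 1]
preds = [
    [0, 1, 0, 0],  # layer A
    [0, 2, 2, 0],  # layer B
    [1, 1, 2, 0],  # layer C
]

oracle = oracle_accuracy(preds, targets)
disagreement = mean_pairwise_disagreement(preds)
```

The gap between the oracle figure and the accuracy of any actual fusion rule measures how much complementary information the layers hold that the fusion rule fails to exploit.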

#### Limitations of the diagnostic\.

These numbers are architecture- and checkpoint-specific. The scalar MI estimate agrees between KSG and the low-bin-count histograms (0.20 to 0.24 bits) but is only measured on DTG-FF VGG8 (CIFAR-10). The vector numbers are linear-probe lower bounds: a different probe class could yield a different bound. We do not claim the MI measurements upper-bound achievable FF performance; they are diagnostic of the current trained representations.

## Appendix D Extended Experimental Details

### D.1 Full Hyperparameters

### D.2 Secondary Benchmarks

DTG-FF’s SOTA pattern generalizes beyond CIFAR and ImageNet-scale datasets. The table below reports results on 5 additional benchmarks (VGG8, 400 epochs, same hyperparameters unless noted).

The BloodMNIST result is the standout: DTG-FF approaches BP-level performance (typical 95–97% BP range) on this tractable texture-based task. DermaMNIST’s lower accuracy reflects dataset difficulty (small training set, high intra-class visual similarity); BP baselines report 73–77%. These five datasets are on the $32\times 32$ grid (MedMNIST is natively $28\times 28$, upsampled).

### D.3 MLP Ablation (CIFAR-10)

We report the MLP progressive ablation on CIFAR\-10 \(4 layers, hidden dim 2048, 100 epochs\)\. This is distinct from the CNN variant used in Sec\.[5\.4](https://arxiv.org/html/2606.06539#S5.SS4)\.

Per-layer BN in the classifier path is the largest single-step MLP improvement ($+8.29\%$), consistent with the decoupled-normalization principle: BN helps the classifier path on detached features while being destructive in the FF goodness path. The final 63.72% exceeds Hinton FF’s original $\sim 60\%$ MLP result on CIFAR-10.

### D.4 Extended CNN Ablation Analysis

Table 3: Full CIFAR-10 ablation rows including Cutout, RMSNorm, and regularization variants. Headline rows in Sec. [5.4](https://arxiv.org/html/2606.06539#S5.SS4). † RMSNorm delta measured at 200 epochs (vs. LayerNorm at 200 epochs); other rows at 400 epochs.

We note two complementary observations:

#### Cosine annealing as implicit regularizer\.

With a cosine learning-rate schedule from $2\times 10^{-4}$ to $10^{-5}$ over 400 epochs, explicit regularization (label smoothing 0.1 + classifier dropout 0.2) has a small effect on test accuracy ($+0.17\%$ when removed) but dramatically changes training classifier loss ($0.25$ without regularization vs. $0.87$ with). This suggests the cosine schedule provides substantial implicit regularization on CIFAR-10. Whether this holds on larger datasets (Tiny ImageNet, ImageNet-100) where overfitting pressure is higher requires separate study.

#### Label smoothing interaction\.

Ablating label smoothing (classifier CE loss) yields $+0.17\%$ on CIFAR-10 at 400 epochs, i.e., *removing* it slightly improves accuracy. We interpret this as suggestive that label smoothing and DTG’s temperature scaling overlap functionally: both soften the effective decision boundary during training. Under strong cosine annealing, the implicit regularization already suffices, and label smoothing becomes slightly redundant. The effect size is within the expected single-seed noise band; whether label smoothing consistently harms DTG-FF in this regime requires multi-seed follow-up.

### D.5 Depth Scaling: VGG8 vs VGG11 on $32\times 32$ Inputs

We probe DTG-FF’s depth-scaling behavior by training the deeper VGG11 backbone (8 conv layers, $9.3$M params, vs. VGG8’s $5.5$M and 7 conv layers) on CIFAR-10/100 under identical hyperparameters (AdamW lr $2\times 10^{-4}$, cosine to $10^{-5}$, 400 epochs, batch 128, same augmentation). Results:

The VGG11 accuracy drop is $-6.97$ pp on CIFAR-10 ($K = 10$) and $-12.49$ pp on CIFAR-100 ($K = 100$); the cost nearly doubles as class count increases.

Width-vs-depth confound. VGG11 in our setup starts with 64 channels (vs. VGG8’s 128) and applies $2\times 2$ pooling at the first layer, while VGG8 retains full resolution through layer 1; VGG11 thus differs from VGG8 in depth (8 vs. 7 conv layers), early-channel width (64 vs. 128), and downsampling schedule. The accuracy drop reported here therefore bundles depth, width, and downsampling effects, and we cannot strictly attribute the cost to depth alone. A depth-only ablation (matched widths, varying number of layers) is left to future work. Combined with the synthetic $K$-scaling observed in Sec. [3.2](https://arxiv.org/html/2606.06539#S3.SS2), we observe that local-learning cost grows with task complexity and with deeper/narrower architectures together; this is consistent with prior FF-family reports that depth alone does not monotonically improve FF accuracy \[Sezener et al., [2025](https://arxiv.org/html/2606.06539#bib.bib7)\]. Depth-specific calibration (layer-wise learning rate scaling, warmup, or progressive layer-wise training as in DeeperForward) may mitigate this cost but is left to future work.

#### Why does VGG11 recover on ImageNet\-100?

The same VGG11 backbone reaches $49.4\%$ on ImageNet-100 at $224\times 224$ (Sec. [5.3](https://arxiv.org/html/2606.06539#S5.SS3)). Although the absolute accuracy is lower than VGG11’s $54.79\%$ on CIFAR-100, ImageNet-100 has $100$ classes at substantially higher visual complexity, and each $224\times 224$ image carries far more per-sample information than a $32\times 32$ CIFAR image (representing dozens to hundreds of “effective receptive fields” worth of content). We hypothesize that higher input signal per sample partially compensates the depth cost that manifests acutely on $32\times 32$ inputs. Under this view, VGG11 is a defensible choice for high-resolution tasks even though it underperforms VGG8 on $32\times 32$.

#### Hyperparameter comparison with ASGE\.

We apply identical hyperparameters to VGG8 and VGG11 without the dataset-specific tuning that ASGE’s public recipe \[Zhao et al., [2024](https://arxiv.org/html/2606.06539#bib.bib8)\] implies. Specifically, ASGE uses weight decay $10^{-4}$ (vs. our $10^{-3}$, a $10\times$ difference), validation-based best-checkpoint selection, and averages over three independent seeds; we use single-seed last-epoch evaluation with fixed hyperparameters tuned for VGG8. ASGE does *not* use warmup, layer-wise learning rates, or progressive layer-wise training; their depth robustness appears protocol-driven (weight decay + validation-based selection) rather than algorithmic. Replicating their protocol on VGG11 may close a substantial portion of the $6.26$ pp CIFAR-10 gap to ASGE-VGG11 ($90.62\%$ vs. our $84.36\%$); we did not pursue this because VGG8 remains our operating point for headline results.

### D.6 Within-Dataset $K$ Probe: Full Table

| Granularity | $K$ | DTG-FF (concat) | BP-DeepSup | FF–BP gap |
| --- | --- | --- | --- | --- |
| Coarse | $20$ | $66.14\%$ | $74.13\%$ | $7.99$ pp |
| Fine | $100$ | $51.28\%$ | $62.63\%$ | $11.35$ pp |

*Within-dataset gap differential:* $3.36$ pp

VGG8, batch $128$, AdamW lr $2\times 10^{-4}$, cosine schedule to $10^{-5}$, $50$ epochs, single seed (114514). Identical augmentation across granularities; only the label mapping changes. Both granularities are non-converged at $50$ epochs (the $400$-epoch CIFAR-100 fine result in Sec. [5.2](https://arxiv.org/html/2606.06539#S5.SS2) is a $5.93$ pp gap, smaller than the $11.35$ pp at $50$ epochs because BP and DTG-FF benefit from extended training at different rates). The relative ordering (coarse gap $<$ fine gap) is the load-bearing observation.
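The gap and differential columns follow directly from the four accuracies. The short sketch below recomputes them from the reported numbers; the dictionary layout and variable names are illustrative.

```python
# Accuracies (%) as reported for the within-dataset probe.
results = {
    "coarse": {"K": 20, "dtg_ff": 66.14, "bp_deepsup": 74.13},
    "fine": {"K": 100, "dtg_ff": 51.28, "bp_deepsup": 62.63},
}

# FF-BP gap per granularity, in percentage points.
gaps = {
    name: round(r["bp_deepsup"] - r["dtg_ff"], 2)
    for name, r in results.items()
}

# How much the gap widens when only the label granularity changes.
gap_differential = round(gaps["fine"] - gaps["coarse"], 2)
```

Since images, augmentation, and backbone are held fixed, the differential isolates the effect of moving from 20 coarse labels to 100 fine labels.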

### D.7 Memory-Feasibility Frontier: Detailed Measurements and Fair BP Baselines

#### Hardware and protocol\.

NVIDIA RTX 4060 Laptop (8.0 GB VRAM), PyTorch 2.9.1 + CUDA 13.0, CPU pinned dataloader off (synthetic random tensors). For each (method, batch, arch, input size) cell we run $5$ warmup steps then time $20$ measured steps under `torch.cuda.synchronize`; peak memory uses `torch.cuda.max_memory_allocated` after `reset_peak_memory_stats`. We report peak allocated (not reserved) memory and per-step wallclock.

#### Naive vs\. pipelined DTG\-FF\.

The naive implementation in our initial release runs `forward_ff` (with grad) and `forward_features` (`torch.no_grad`) sequentially, recomputing the convolution twice. The pipelined schedule integrates the two into a single `train_step`: forward $\to$ backward $\to$ optimizer step $\to$ derive next-layer input from `z_act.detach()` *after* the autograd graph has been freed. The order matters: deriving the next-layer input before the local backward causes the propagation tensors to coexist with the autograd state and increases peak memory. The pipelined version is mathematically identical for each layer’s FF loss; only the propagation semantics differ by one optimizer step (the propagated input reflects pre-step weights of the just-trained layer rather than post-step weights), a perturbation of order $\mathrm{lr}$ on activations. We verified empirically that this does not affect convergence trajectory on CIFAR-10 (5-epoch sanity train: $59.84\%$ test, normal trajectory).

#### Depth scaling at $32\times 32$, batch $128$.

At small input size, BP runs as a single fused forward and backward and amortizes per-iteration overhead across all layers; DTG-FF performs $L$ separate backward calls and thus pays per-layer kernel-launch overhead. The per-architecture comparisons here are confounded by width and downsampling differences across our VGG variants: VGG11 starts with narrower channels (64) and pools earlier than VGG8 (128, no first-layer pool), which dominates the per-architecture peak-memory difference at $32\times 32$ and explains why VGG11’s footprint sits below VGG8’s despite having more conv layers. The structural $\mathcal{O}(L)$-vs-$\mathcal{O}(1)$ activation-memory contrast between BP and pipelined DTG-FF only manifests cleanly when input size and per-layer activation volume are large; at $224\times 224$ (Sec. [6.2](https://arxiv.org/html/2606.06539#S6.SS2)) DTG-FF’s bound becomes the binding factor, while at $32\times 32$ both methods sit in a regime where width allocation matters more than depth.

#### Throughput dynamics at the memory cliff\.

The dramatic throughput collapse for BP at batch $\geq 96$ on VGG11 $224\times 224$ (Sec. [6.2](https://arxiv.org/html/2606.06539#S6.SS2)) is consistent with PyTorch’s CUDA caching allocator falling back to host-memory paged allocations once device fragmentation exceeds free space. Latency per step shifts from device-bound ($\sim 400$ ms at batch $64$) to host-IO-bound ($> 8$ s at batch $128$). Smaller models or larger device memory (e.g., RTX 4090, $24$ GB) push the cliff further out and the same-batch advantage shrinks to the modest in-VRAM numbers reported in the main text.

#### Fair BP baselines\.

We compare DTG-FF at batch $128$ on VGG11 $224\times 224$ against three BP variants designed to recover memory headroom (all measured directly):

BP-VGG11 (vanilla, b=128). $14$ imgs/s, $8.18$ GB peak (host-memory spill).

BP+grad-accum (micro-batch $64\times 2$). Effective batch $128$, $4.18$ GB peak, $157$ imgs/s. Achieves identical effective batch with no device-memory penalty but at the cost of doubling the number of forward/backward passes.

BP+activation-checkpointing (b=128, every conv block). $6.35$ GB peak, $92$ imgs/s ($1.5\times$ slowdown from recomputation).

Honest reading. Under same effective batch and at the no-spill operating point, BP+grad-accum at $4.18$ GB / $157$ imgs/s dominates DTG-FF ($7.90$ GB / $138$ imgs/s) on both memory and throughput; BP+ckpt at $6.35$ GB / $92$ imgs/s recovers in-VRAM operation at a recompute penalty. DTG-FF beats only vanilla BP at its spill point, a regime practitioners avoid via standard tooling. The structural $\mathcal{O}(L) \to \mathcal{O}(1)$ activation-memory property of pipelined FF is realized but does not translate into measured systems dominance over memory-optimized BP on this hardware.
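The "dominates" claim above is a Pareto comparison on two axes, lower peak memory and higher throughput. The sketch below applies that definition to the four measured operating points; the `OperatingPoint` type and `dominates` helper are hypothetical names, while the numbers are the ones reported in this section.

```python
from typing import NamedTuple

class OperatingPoint(NamedTuple):
    name: str
    peak_gb: float      # peak allocated memory, lower is better
    imgs_per_s: float   # throughput, higher is better

def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.peak_gb <= b.peak_gb and a.imgs_per_s >= b.imgs_per_s
    strictly_better = a.peak_gb < b.peak_gb or a.imgs_per_s > b.imgs_per_s
    return no_worse and strictly_better

points = [
    OperatingPoint("BP vanilla (b=128)", 8.18, 14),
    OperatingPoint("BP+grad-accum (64x2)", 4.18, 157),
    OperatingPoint("BP+ckpt (b=128)", 6.35, 92),
    OperatingPoint("DTG-FF (b=128)", 7.90, 138),
]

# For each method, list the methods that Pareto-dominate it.
dominated_by = {
    p.name: [q.name for q in points if dominates(q, p)]
    for p in points
}
```

Under this definition only the gradient-accumulation baseline is undominated, and DTG-FF is dominated by it alone; checkpointing trades throughput for memory and so neither dominates nor is dominated by DTG-FF.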

#### Caveats\.

We report wall-clock on a single laptop GPU at modest scale; we do not claim parity with engineered memory-saving stacks like ZeRO/FSDP \[Rajbhandari et al., [2020](https://arxiv.org/html/2606.06539#bib.bib49), Zhao et al., [2023](https://arxiv.org/html/2606.06539#bib.bib50)\]. The experiment establishes that the in-principle $\mathcal{O}(L) \to \mathcal{O}(1)$ activation-memory property of strict layer-local training is realizable in practice with a trivial code path (single-conv-pass pipelining) and shifts the feasibility frontier on commodity hardware relative to vanilla BP, though not relative to the memory-optimized BP baselines above.

### D.8 Optimizer Protocol Asymmetry Control

DTG-FF uses seven per-layer AdamW optimizers with per-layer cosine schedules and per-layer gradient clipping; the BP-DeepSup baseline in Sec. [5.2](https://arxiv.org/html/2606.06539#S5.SS2) uses a single global AdamW optimizer and global cosine schedule. To rule out the optimizer protocol as a confound, we train BP-DeepSup with the same per-layer protocol (seven per-layer AdamWs matching DTG-FF exactly). The result: 93.74% on CIFAR-10, within $0.01$ pp of the global-optimizer variant (93.73%). The optimizer protocol is not a meaningful source of the $2.40$ pp FF–BP gap on CIFAR-10. The gap reflects genuine algorithmic differences (detach vs. end-to-end gradients, goodness vs. learned linear head, random projection vs. learned classifier), not a training-setup artifact.

## Appendix E Synthetic Experiment Details

#### Setup\.

A 3-layer ReLU teacher network with $d_{\mathrm{in}} = 50$, $d_{\mathrm{hidden}} = 128$, and random output layer of size $K$ (different random draw per seed) generates $(x, y)$ pairs: 20,000 train + 5,000 test. Inputs are Gaussian $\mathcal{N}(0, I_d)$; labels are $\arg\max$ of the teacher’s logits. Students have 4 hidden layers of size 128 each. All methods share identical data, optimizer family (Adam, $\mathrm{lr} = 10^{-3}$, batch 256), and training budget (8,000 steps). Seeds: $\{42, 123, 456, 789, 1024\}$. Class counts: $K \in \{5, 10, 15, 20, 30, 50\}$.

#### Baselines\.

Single BP. One MLP with 4 hidden layers + $K$-way output, end-to-end CE.

BP-Ensemble. Four independently initialized single-BP students, softmax-averaged at inference. $\sim 4\times$ the parameter count of DTG-FF.

BP-DeepSup. Single MLP with 4 auxiliary linear heads (one per hidden layer). Training loss is summed CE over all heads. Inference is logit-sum over heads. Matched to DTG-FF in backbone, depth, per-layer head count, and inference aggregation; the methods differ in detach boundary, local objective (CE vs. goodness), and head parameterization (learned linear vs. fixed random projection). See Sec. [6.3](https://arxiv.org/html/2606.06539#S6.SS3) for the three-dimensional difference.

DTG-FF. Per-layer local CE on $\tilde{\mathbf{u}}_l^{\top} R_l + \mathbf{b}_l$ with spatial goodness $\mathbf{u}_l$ (though for the 1-D synthetic input we use $\mathbf{u}_l = \mathbf{z}_l^2$ directly, a fixed-dimension analog of the CNN spatial goodness), fixed $R_l, \mathbf{b}_l \sim \mathcal{N}(0, 1/K)$, logit-sum inference, `detach` between layers.

#### Parameter counts (per method, $K = 10$).

All four students share the same 4-layer backbone ($d_{\mathrm{in}} = 50$, $d_{\mathrm{hidden}} = 128$); each `nn.Linear` carries the default learnable bias. DTG-FF’s per-layer random readouts $R_l \in \mathbb{R}^{128 \times 10}$ and $\mathbf{b}_l \in \mathbb{R}^{10}$ are registered as buffers (not learned). Counts measured directly from the implementations:

- DTG-FF: backbone $50 \cdot 128 + 128 + 3(128 \cdot 128 + 128) = 56{,}064$ plus $4$ learnable temperatures $\alpha_l$ = $\mathbf{56{,}068}$ learnable; fixed buffers $4(128 \cdot 10 + 10) = 5{,}160$; total $61{,}228$.
- BP-DeepSup: backbone $56{,}064$ plus $4$ learned heads $4(128 \cdot 10 + 10) = 5{,}160$ = $\mathbf{61{,}224}$ learnable.
- Single BP: $56{,}064 + (128 \cdot 10 + 10) = \mathbf{57{,}354}$ learnable.
- BP-Ensemble: $4 \cdot 57{,}354 = \mathbf{229{,}416}$ learnable ($\approx 4\times$ DTG-FF).

The DTG-FF vs. BP-DeepSup learnable-parameter mismatch is $5{,}156$ params (the four per-layer heads, less the four $\alpha_l$ scalars), about $9\%$ of backbone size, in BP-DeepSup’s favor. We retain this mismatch rather than tying $R_l$ in DTG-FF because doing so would change the algorithm; it should be noted when interpreting capacity-matched comparisons. The ratio is stable across $K$.
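The counts above can be reproduced from the layer shapes alone. The sketch below does the arithmetic in plain Python; `linear_params` and the dictionary keys are hypothetical names, and the shapes are those stated in the text.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of a dense layer with optional bias."""
    return d_in * d_out + (d_out if bias else 0)

D_IN, D_HIDDEN, N_LAYERS, K = 50, 128, 4, 10

# Four hidden layers: one input layer plus three hidden-to-hidden layers.
backbone = (linear_params(D_IN, D_HIDDEN)
            + (N_LAYERS - 1) * linear_params(D_HIDDEN, D_HIDDEN))
head = linear_params(D_HIDDEN, K)  # one K-way readout with bias

counts = {
    "dtg_ff_learnable": backbone + N_LAYERS,    # backbone + one alpha per layer
    "dtg_ff_buffers": N_LAYERS * head,          # fixed random readouts
    "bp_deepsup": backbone + N_LAYERS * head,   # learned per-layer heads
    "single_bp": backbone + head,
}
counts["dtg_ff_total"] = counts["dtg_ff_learnable"] + counts["dtg_ff_buffers"]
counts["bp_ensemble"] = 4 * counts["single_bp"]

# Learnable-parameter mismatch in BP-DeepSup's favor, and its share of the backbone.
mismatch = counts["bp_deepsup"] - counts["dtg_ff_learnable"]
mismatch_fraction = mismatch / backbone
```

Changing `K` in this sketch rescales the head size while the backbone stays fixed, which makes it easy to see how the mismatch moves for other class counts.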

#### Paired\-difference methodology\.

The random teacher varies by seed, so absolute accuracies have high variance (std $4$–$17\%$ across seeds). We control this by reporting per-seed paired differences (DTG-FF minus baseline) in Table [4](https://arxiv.org/html/2606.06539#A5.T4); this cancels the teacher-induced shared variance.

Table 4: Paired differences (DTG-FF minus baseline accuracy, per seed) on synthetic teacher–student tasks, 5 seeds. $(n/5)$ shows the number of seeds where DTG-FF wins. Summary in Sec. [3.2](https://arxiv.org/html/2606.06539#S3.SS2).

![Refer to caption](https://arxiv.org/html/2606.06539v1/x6.png)

Figure 6: Architecture-matched synthetic controls. Points show seed-level paired differences; markers and error bars show mean $\pm$ 95% normal-approximation CI across five seeds. DTG-FF becomes favorable relative to BP-DeepSup at larger $K$ in this controlled teacher–student setting, but remains below the $4\times$-parameter BP ensemble at every $K$.

#### Pre-specified trend test.

Following the hypothesis “DTG-FF’s advantage over BP-DeepSup emerges more clearly at larger $K$,” we partition $K\in\{5,10,15,20\}$ (low-$K$, $n=20$ seed-$K$ observations) vs. $K\in\{30,50\}$ (high-$K$, $n=10$) and compute the mean paired DTG-FF $-$ BP-DeepSup difference. Low-$K$ mean: $+0.41\%$. High-$K$ mean: $+1.78\%$. Difference of means: $+1.37$ pp. Bootstrap 95% CI (paired observations resampled with replacement, $n_{\mathrm{boot}}=10{,}000$, seed $42$): $[+0.59,+2.15]$. The CI does not cross zero, supporting the regime-qualified claim.
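A bootstrap of this kind can be sketched as follows. The text does not state which interval construction was used, so this sketch assumes a percentile interval with resampling done independently within the low-$K$ and high-$K$ groups; the function name and the placeholder inputs are hypothetical and are not the paper's data.

```python
import random
import statistics


def bootstrap_diff_of_means(low, high, n_boot=10_000, seed=42, alpha=0.05):
    """Percentile-bootstrap CI for mean(high) - mean(low).

    `low` and `high` hold paired differences (method minus baseline),
    one value per (seed, K) observation in each group.
    """
    rng = random.Random(seed)
    point = statistics.fmean(high) - statistics.fmean(low)
    stats = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        low_s = rng.choices(low, k=len(low))
        high_s = rng.choices(high, k=len(high))
        stats.append(statistics.fmean(high_s) - statistics.fmean(low_s))
    stats.sort()
    lo_idx = max(0, round(alpha / 2 * n_boot))
    hi_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return point, (stats[lo_idx], stats[hi_idx])


# Placeholder inputs with the same group sizes as in the text (20 and 10).
low_k = [0.0, 1.0] * 10
high_k = [2.0, 3.0] * 5
point, (ci_lo, ci_hi) = bootstrap_diff_of_means(low_k, high_k)
print(f"difference of means: {point:+.2f}, 95% CI: [{ci_lo:+.2f}, {ci_hi:+.2f}]")
```

Fixing the random seed, as the text does with seed 42, makes the reported interval reproducible for a given implementation, though a different resampling scheme or random number generator will give slightly different endpoints.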

#### Train–test gap analysis\.

To assess whether DTG-FF’s advantage reflects implicit regularization rather than a purely representational effect, we record training accuracy alongside test accuracy. Table [5](https://arxiv.org/html/2606.06539#A5.T5) reports per-method train–test gaps averaged over 5 seeds. BP and BP-DeepSup overfit to near-100% training accuracy at all $K\geq10$, producing train–test gaps of $22$–$34$ percentage points. DTG-FF reaches only $86$–$94\%$ on training, with train–test gaps of $15$–$18$ pp. The gap *difference* between BP and DTG-FF grows with $K$ ($1.4$ pp at $K=5$; $16.7$ pp at $K=30$). This is consistent with local per-layer training acting as implicit regularization (each layer cannot freely coordinate with others to memorize the training set), and partially explains why DTG-FF can reach comparable or higher test accuracy despite fitting the training set less tightly than end-to-end BP variants.

Table 5: Test accuracy and train–test gap ($\Delta$ = train acc $-$ test acc, both in %) on synthetic teacher–student tasks, 5 seeds. Last column: how much smaller the DTG-FF gap is than the BP gap. DTG-FF overfits less at every $K$; the effect grows with $K$.
