Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training

arXiv cs.AI 2026/06/08 04:00 Paper

forward-forward layer-local-training scaling-limits synthetic-benchmarks real-data cnn deep-learning

Summary

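This paper shows that layer-local training methods such as Forward-Forward (FF) do not scale to realistic image sizes and datasets, and that synthetic benchmarks overstate their performance. The authors introduce a strong FF variant (DTG-FF) and show that on real data (e.g., ImageNet-100 at 224x224) FF reaches only 49.4% versus above 75% for typical BP, whereas on synthetic tasks the gap narrows or even reverses.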

arXiv:2606.06539v1 Announce Type: cross Abstract: Forward-Forward (FF) learning [Hinton, 2022] replaces backpropagation with strictly layer-local goodness updates. Recent FF-CNN work has narrowed the gap to BP on 32x32 benchmarks, raising the question of whether layer-local training is becoming a viable alternative at realistic scale. To probe this rigorously, we develop DTG-FF -- dynamic temperature goodness, decoupled normalization, and multi-layer fusion -- as an instrument that sets FF-family state of the art across nine real-data benchmarks (91.8% CIFAR-10 and the first FF baseline at ImageNet-100 224x224), and use it to audit how far layer-local training actually scales. (1) Real-data scaling. Under identical recipe and backbone, an architecture-matched BP-DeepSup baseline beats DTG-FF by 2.40/5.93 pp on CIFAR-10/CIFAR-100, and the gap widens with class count. At 224x224 the same instrument reaches only 49.4% -- the first FF baseline at this scale, versus typical BP above 75% [Tian et al., 2020] -- exposing a real-data ceiling invisible at 32x32. (2) Synthetic vs. real K-conflict. DTG-FF increasingly outperforms BP as class count K grows on synthetic teacher-student tasks, yet on real images the FF-BP gap reverses sign and widens with K. A within-dataset CIFAR-100 coarse vs. fine probe isolates label-hierarchy from image distribution: synthetic K-sweeps confound output dimensionality with fine-grained discrimination difficulty and thereby overstate FF transferability. (3) Systems audit. FF can be implemented without storing depth-wide activations, but on commodity 8 GB hardware standard BP+gradient-accumulation reaches 4.18 GB / 157 imgs/s versus DTG-FF's 7.90 GB / 138 imgs/s, so a memory-based justification for FF at this scale is not supported under fair baselines.

Cached: 2026/06/08 09:16

# Real-Data Limits of Layer-Local Training Source: https://arxiv.org/html/2606.06539

## Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training

###### Abstract

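Forward-Forward (FF) learning [Hinton, 2022 (https://arxiv.org/html/2606.06539#bib.bib1)] replaces backpropagation with strictly layer-local "goodness" updates. Recent FF-CNN work has narrowed the gap to BP on 32×32 benchmarks, raising the question of whether layer-local training is becoming a viable alternative at realistic scale. To probe this rigorously, we develop DTG-FF (dynamic temperature goodness, decoupled normalization, and multi-layer fusion) as an instrument that sets the FF-family state of the art across nine real-data benchmarks (91.8% on CIFAR-10 and the first FF baseline on ImageNet-100 at 224×224), and use it to audit how far layer-local training actually scales.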

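(1) **Real-data scaling**. Under an identical recipe and backbone, an architecture-matched BP-DeepSup baseline beats DTG-FF by 2.40/5.93 pp on CIFAR-10/CIFAR-100, and the gap widens with class count. At 224×224 the same instrument reaches only 49.4%, the first FF baseline at this scale, versus typical BP above 75% [Tian et al., 2020 (https://arxiv.org/html/2606.06539#bib.bib2)], exposing a real-data ceiling that is invisible at 32×32.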

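(2) **Synthetic vs. real K-conflict**. On synthetic teacher-student tasks, DTG-FF increasingly outperforms BP as the class count K grows; on real images, the FF-BP gap reverses sign and widens with K. A within-dataset CIFAR-100 coarse (20-class) vs. fine (100-class) probe separates label hierarchy from image distribution: synthetic K-sweeps confound output dimensionality with fine-grained discrimination difficulty and thereby overstate FF transferability.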

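(3) **Systems audit**. FF can be implemented without storing depth-wide activations, but on commodity 8 GB hardware standard BP + gradient accumulation reaches 4.18 GB / 157 imgs/s versus 7.90 GB / 138 imgs/s for DTG-FF, so a memory-based justification for FF at this scale is not supported under fair baselines.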

## 1 Introduction

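Backpropagation's global backward pass couples all layers and stores activations across depth, which motivates treating strictly layer-local training rules as a potential accuracy-memory trade-off rather than a drop-in BP replacement. The Forward-Forward (FF) algorithm [Hinton, 2022 (https://arxiv.org/html/2606.06539#bib.bib1)] is one such rule: it replaces the backward pass with learning based on layer-local goodness, training each layer to produce high "goodness" (squared activation norm) on positive data and low goodness on negative data, with no gradient flowing across layers. FF is also sometimes motivated by biological considerations [Crick, 1989 (https://arxiv.org/html/2606.06539#bib.bib3), Lillicrap et al., 2020 (https://arxiv.org/html/2606.06539#bib.bib4)], but our interest here is empirical: is layer-local training scalable enough to be a useful alternative on real-data workloads, and which systems properties does it actually deliver?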
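The goodness objective described above can be sketched in a few lines. This is a minimal illustration, not the DTG-FF loss: the threshold `theta` and the softplus form are assumptions borrowed from the common formulation of FF, and the activation vectors are made-up numbers.

```python
import math

def goodness(activations):
    """Goodness of one layer: the squared norm of its activation vector."""
    return sum(a * a for a in activations)

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def ff_layer_loss(pos_acts, neg_acts, theta):
    """Layer-local loss: push positive goodness above theta, negative below.

    theta is an assumed threshold hyperparameter; it is not given in this text.
    """
    g_pos = goodness(pos_acts)
    g_neg = goodness(neg_acts)
    return softplus(theta - g_pos) + softplus(g_neg - theta)

# Illustrative activations only: a layer that separates the two cases well
# should incur a lower loss than one that has them the wrong way round.
loss_good = ff_layer_loss([1.5, 2.0, 0.5], [0.1, 0.2, 0.0], theta=2.0)
loss_bad = ff_layer_loss([0.1, 0.2, 0.0], [1.5, 2.0, 0.5], theta=2.0)
print(loss_good, loss_bad)
```

Because the loss depends only on one layer's own activations, each layer can be updated without any gradient arriving from the layers above it.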

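FF has not yet demonstrated this competitiveness. On CIFAR-10, the original FF algorithm reaches about 60% with an MLP, trailing BP by 30 pp. Follow-up work has narrowed the gap through architectural innovations, namely convolutional extensions [Tosato et al., 2023 (https://arxiv.org/html/2606.06539#bib.bib5), Lee et al., 2024 (https://arxiv.org/html/2606.06539#bib.bib6)], deeper architectures [Sezener et al., 2025 (https://arxiv.org/html/2606.06539#bib.bib7)], and adaptive goodness evaluation [Zhao et al., 2024 (https://arxiv.org/html/2606.06539#bib.bib8)], reaching 90.62% (ASGE VGG11), but FF still trails BP baselines and is almost entirely confined to 32×32 inputs. We put this to the test with a strong FF-family instrument and make four contributions: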

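1. **An instrument for stress-testing layer-local training**. We develop DTG-FF, a compact FF-family architecture that combines three mechanism-level improvements (dynamic temperature goodness, decoupled three-path normalization, multi-layer fusion). It sets the FF-family state of the art on nine real-world benchmarks (Section 5.2 (https://arxiv.org/html/2606.06539#S5.SS2)): CIFAR-10 (logit-sum 91.79% / concat 91.33%), CIFAR-100 (67.28%), Tiny ImageNet (48.17%), and the first FF-family baseline on ImageNet-100 at 224×224 (49.4% with VGG11). We treat this not as the paper's headline claim but as a credibility baseline for the audit that follows: only a strong FF instrument can tell us whether the residual FF-BP gap reflects a fundamental limit of layer-local training or merely the weakness of earlier FF baselines.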

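2. **Real-data scaling diagnosis with architecture-matched BP controls**. Despite being the FF-family state of the art, DTG-FF (concat) trails the architecture-matched BP-DeepSup baseline under the same recipe and backbone: by 2.40 pp on CIFAR-10 (K=10) and by 5.93 pp on CIFAR-100 (K=100); the FF-BP gap widens with class count. On 32×32 inputs, the deeper and narrower VGG11 backbone is 6.97/12.49 pp worse than VGG8 on CIFAR-10/100 (Appendix D.5 (https://arxiv.org/html/2606.06539#A4.SS5)), but VGG11 and VGG8 differ in both depth (8 vs. 7 convolutional layers) and initial channel width (64 vs. 128 starting channels), so we cannot attribute the drop strictly to depth. At 224×224 the same instrument reaches only 49.4%, versus typical BP above 75% [Tian et al., 2020 (https://arxiv.org/html/2606.06539#bib.bib2)], exposing a real-data ceiling that is invisible at 32×32.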

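3. **Synthetic vs. real K-axis conflict**. On seed-paired teacher-student synthetic tasks, DTG-FF's advantage over BP **grows** with K; on real images, the FF-BP gap reverses sign and **widens** with K. A within-dataset CIFAR-100 coarse (20-class) vs. fine (100-class) probe (Section 5.3 (https://arxiv.org/html/2606.06539#S5.SS3)) shows that K in synthetic sweeps tracks output dimensionality, whereas K on real data also tracks fine-grained discrimination difficulty; current synthetic FF validation therefore overstates transferability to real data.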

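4. **Fair-baseline systems audit**. A natural defense of FF is that its activation memory is O(1) in depth. Pipelined per-layer training can achieve this bound, but on commodity 8 GB hardware standard BP + gradient accumulation reaches 4.18 GB / 157 imgs/s versus 7.90 GB / 138 imgs/s for DTG-FF, so a memory-based justification does not hold at this scale (Section 6.2 (https://arxiv.org/html/2606.06539#S6.SS2), Appendix D.7 (https://arxiv.org/html/2606.06539#A4.SS7)). We also provide an interpretive synthesis (Section 6.1 (https://arxiv.org/html/2606.06539#S6.SS1)) that reads several FF-family improvements (label overlays, BP-trained classifier heads, spatial goodness, multi-layer fusion) as partial substitutes for the supervised cross-layer signal that BP provides natively.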
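The BP + gradient accumulation baseline in the systems audit rests on a simple identity: summing suitably weighted micro-batch gradients reproduces the full-batch gradient, while only one micro-batch of activations needs to be live at a time. The sketch below checks that identity on a toy 1-D least-squares model; the model, data, and function names are illustrative assumptions, not the paper's setup.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch):
    """Accumulate micro-batch gradients, each weighted by its share of the batch."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_batch):
        mx = xs[start:start + micro_batch]
        my = ys[start:start + micro_batch]
        # Only this slice would need its activations in memory at this point.
        total += mse_grad(w, mx, my) * len(mx) / n
    return total

# Made-up data for illustration.
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 0.0, 1.0]
ys = [1.0, -2.1, 3.9, 3.2, -0.8, 6.1, 0.1, 2.0]
full = mse_grad(0.3, xs, ys)
accum = accumulated_grad(0.3, xs, ys, micro_batch=2)
print(full, accum)
```

This is why accumulation is a fair memory baseline: it trades extra forward and backward passes for a smaller peak footprint without changing the update.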

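Figure 1: Overview of the DTG-FF method. The method combines three mechanisms: a layer-local FF loss with detached propagation, a learnable per-layer temperature T_l that scales the spatial goodness before a fixed random readout, and a detached multi-layer classifier that fuses GAP features through BN+Linear without updating the convolutional backbone.

Section 2 (https://arxiv.org/html/2606.06539#S2) surveys related work. Section 3 (https://arxiv.org/html/2606.06539#S3) presents the layer-wise signal diagnostics and the synthetic validation with architecture-matched BP controls. Section 4 (https://arxiv.org/html/2606.06539#S4) describes DTG-FF. Section 5 (https://arxiv.org/html/2606.06539#S5) reports real-data scaling and within-dataset K disambiguation. Section 6 (https://arxiv.org/html/2606.06539#S6) reports the fair-baseline systems audit and offers an interpretive reading of FF-family improvements as partial substitutes for BP's cross-layer supervised signal.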

## 2 Related Work

**Forward-Forward and its variants**. FF was introduced by Hinton [2022 (https://arxiv.org/html/2606.06539#bib.bib1)]; each layer is trained to produce high goodness on positive data and low goodness on negative data. Follow-up work has narrowed the FF-BP accuracy gap: LSFF [Tosato et al., 2023 (https://arxiv.org/html/2606.06539#bib.bib5)] extends FF to CNNs (81.12% on CIFAR-10); SCFF [Lee et al., 2024 (https://arxiv.org/html/2606.06539#bib.bib6)] introduces self-recursion (80.75%); DeeperForward [Sezener et al., 2025 (https://arxiv.org/html/2606.06539#bib.bib7)] independently observes that batch normalization disrupts goodness-based learning and proposes removing it (88.72%); ASGE [Zhao et al., 2024 (https://arxiv.org/html/2606.06539#bib.bib8)] uses per-layer classifiers with logit summation (90.62% with VGG11, the previous FF-family best). Ororbia and Mali [2023 (https://arxiv.org/html/2606.06539#bib.bib9)] combine FF with predictive coding. Outside FF, SoftHebb [Journé et al., 2023 (https://arxiv.org/html/2606.06539#bib.bib10)] reaches 80.3% through soft winner-take-all Hebbian learning.

**Other biologically motivated BP alternatives**. Feedback alignment [Lillicrap et al., 2016 (https://arxiv.org/html/2606.06539#bib.bib11), Nøkland, 2016 (https://arxiv.org/html/2606.06539#bib.bib12), Launay et al., 2020 (https://arxiv.org/html/2606.06539#bib.bib13)], target propagation [Bengio, 2014 (https://arxiv.org/html/2606.06539#bib.bib14), Lee et al., 2015a (https://arxiv.org/html/2606.06539#bib.bib15), Meulemans et al., 2024 (https://arxiv.org/html/2606.06539#bib.bib16)], equilibrium propagation [Scellier and Bengio, 2017 (https://arxiv.org/html/2606.06539#bib.bib17), Laborieux and Zenke, 2024 (https://arxiv.org/html/2606.06539#bib.bib18), Scellier, 2023 (https://arxiv.org/html/2606.06539#bib.bib19)], and predictive coding [Rao and Ballard, 1999 (https://arxiv.org/html/2606.06539#bib.bib20), Whittington and Bogacz, 2017 (https://arxiv.org/html/2606.06539#bib.bib21), Millidge et al., 2022 (https://arxiv.org/html/2606.06539#bib.bib22), Salvatori et al., 2023 (https://arxiv.org/html/2606.06539#bib.bib23)] all retain some form of structured backward signal, in contrast to FF, which eliminates backward gradient flow entirely. Perturbation-based methods [Dellaferrera and Kreiman, 2022 (https://arxiv.org/html/2606.06539#bib.bib24), Ren et al., 2023 (https://arxiv.org/html/2606.06539#bib.bib25)] use input modulation rather than gradients. Auxiliary classifier heads with local losses have a long history [Szegedy et al., 2015 (https://arxiv.org/html/2606.06539#bib.bib26), Lee et al., 2015b (https://arxiv.org/html/2606.06539#bib.bib27), Belilovsky et al., 2019 (https://arxiv.org/html/2606.06539#bib.bib28), 2020 (https://arxiv.org/html/2606.06539#bib.bib29), Nøkland and Eidnes, 2019 (https://arxiv.org/html/2606.06539#bib.bib30)]; our multi-layer classifier builds on this lineage.

**Information theory, normalization, and temperature scaling**. Information-theoretic views of deep learning [Tishby et al., 2000 (https://arxiv.org/html/2606.06539#bib.bib31), Tishby and Zaslavsky, 2015 (https://arxiv.org/html/2606.06539#bib.bib32), Shwartz-Ziv and Tishby, 2017 (https://arxiv.org/html/2606.06539#bib.bib33)] have drawn criticism over estimator dependence [Saxe et al., 2018 (https://arxiv.org/html/2606.06539#bib.bib34), Belghazi et al., 2018 (https://arxiv.org/html/2606.06539#bib.bib35), Poole et al., 2019 (https://arxiv.org/html/2606.06539#bib.bib36), McAllester and Stratos, 2020 (https://arxiv.org/html/2606.06539#bib.bib37)]; we sidestep this by computing scalar MI with the KSG estimator [Kraskov et al., 2004 (https://arxiv.org/html/2606.06539#bib.bib38)] and a linear-probe lower bound for vectors. The effect of normalization on optimization has been studied extensively [Ioffe and Szegedy, 2015 (https://arxiv.org/html/2606.06539#bib.bib39), Ba et al., 2016 (https://arxiv.org/html/2606.06539#bib.bib40), Santurkar et al., 2018 (https://arxiv.org/html/2606.06539#bib.bib41), Yang et al., 2019 (https://arxiv.org/html/2606.06539#bib.bib42)]; our contribution is the decoupling of three paths, not a new normalization. Temperature scaling appears in knowledge distillation [Hinton et al., 2015 (https://arxiv.org/html/2606.06539#bib.bib43)], calibration [Guo et al., 2017 (https://arxiv.org/html/2606.06539#bib.bib44)], and curriculum-style distillation [Li et al., 2023 (https://arxiv.org/html/2606.06539#bib.bib45), Zhou et al., 2023 (https://arxiv.org/html/2606.06539#bib.bib46)], in each case acting on softmax outputs; DTG instead modulates the layer-local learning signal. See Appendix B (https://arxiv.org/html/2606.06539#A2) for an extended discussion.

## 3 Diagnostics and Synthetic Validation

### 3.1 Layer-wise Signal Diagnostics

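A short empirical diagnostic, run on a trained DTG-FF VGG8 (CIFAR-10, 91.33%), motivates DTG-FF's three components and lays the groundwork for the BP-shadow perspective of Section 6.1 (https://arxiv.org/html/2606.06539#S6.SS1). We measure scalar goodness at I(g_l^scalar; Y) ≈ 0.24 bits/layer (KSG [Kraskov et al., 2004 (https://arxiv.org/html/2606.06539#bib.bib38)], mean over layers, range 0.16–0.31 bits; 0.22–0.38 bits with 50–500-bin histograms), spatial goodness vectors at 1.1–2.5 bits/layer (via a linear probe combined with Fano's inequality), and layer-6 GAP features at 2.52 bits. Estimator details, bin sensitivity, and per-layer plots are in Appendix C (https://arxiv.org/html/2606.06539#A3). FF's inter-layer gradient path is **zero by construction** because of detach; this is a gradient-flow property, not an information-theoretic bound. These observations motivate three design paths: **signal quality** (scalar → spatial goodness, giving each layer more probe-accessible task signal), **signal utilization** (dynamic temperature; fixing T=1 in ablations costs 0.72 to 1.34 pp), and **cross-layer coordination** (multi-layer fusion; per-layer predictions disagree on 25.1% of the CIFAR-10 test set, indicating non-redundant per-layer hypotheses worth aggregating). Before committing to real-data scaling, we test these design choices in a controlled synthetic setting.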
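The "linear probe combined with Fano's inequality" estimate mentioned above can be sketched as follows. This is one standard way to turn a probe's accuracy into a mutual-information lower bound, assuming a uniform label distribution over K classes; whether the authors use exactly this form is not stated here, and the function names are illustrative.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def fano_mi_lower_bound(probe_accuracy, num_classes):
    """Lower bound in bits on I(Z; Y) from a probe's accuracy, for uniform Y.

    Fano: H(Y | Z) <= h(P_e) + P_e * log2(K - 1), hence
    I(Z; Y) = H(Y) - H(Y | Z) >= log2(K) - h(P_e) - P_e * log2(K - 1).
    """
    p_err = 1.0 - probe_accuracy
    h_cond_max = binary_entropy(p_err) + p_err * math.log2(num_classes - 1)
    return max(0.0, math.log2(num_classes) - h_cond_max)

# Illustrative accuracies for a 10-class problem, not measurements from the paper.
for acc in (0.10, 0.50, 0.90, 1.00):
    print(acc, round(fano_mi_lower_bound(acc, 10), 3))
```

The bound is zero at chance accuracy and reaches log2(K) for a perfect probe, so it reports only the task information that a linear readout can actually access.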

### 3.2 Synthetic Validation with Architecture-Matched BP Controls

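**Setup**. A 3-layer ReLU teacher (d_in=50, d_hidden=128) labels 20,000 training / 5,000 test samples. All students have 4 hidden layers (d_hidden=128) and are trained for 8,000 Adam steps with batch size 256, lr=10^{-3}, 5 seeds, and K ∈ {5,10,15,20,30,50}. We compare DTG-FF against single BP, BP-DeepSup (backbone- and depth-matched, differing from DTG-FF only in detach), and BP-Ensemble (4× parameters, softmax averaging). Teacher variance dominates across seeds (std 4–17%), so we report paired differences. Full tables, baseline definitions, parameter counts, and FLOPs are in Appendix E (https://arxiv.org/html/2606.06539#A5).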
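A teacher-student dataset of this kind might be generated roughly as below. Several details are assumptions made for illustration, since the setup paragraph does not specify them: "3-layer" is read as three weight matrices, inputs and weights are Gaussian, and the label is the argmax of the teacher's K output logits. The sample count is kept small so that the pure-Python sketch runs quickly.

```python
import random

def make_teacher(d_in, d_hidden, num_classes, rng):
    """Random ReLU teacher with layers d_in -> d_hidden -> d_hidden -> num_classes."""
    sizes = [d_in, d_hidden, d_hidden, num_classes]
    return [
        [[rng.gauss(0.0, fan_in ** -0.5) for _ in range(fan_in)]
         for _ in range(fan_out)]
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    ]

def teacher_label(weights, x):
    """Forward pass with ReLU on hidden layers; the label is the argmax logit."""
    h = x
    for i, layer in enumerate(weights):
        h = [sum(w * v for w, v in zip(row, h)) for row in layer]
        if i < len(weights) - 1:
            h = [max(0.0, v) for v in h]
    return max(range(len(h)), key=h.__getitem__)

def make_dataset(weights, n, d_in, rng):
    """Draw n Gaussian inputs and label each one with the teacher."""
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(n)]
    return xs, [teacher_label(weights, x) for x in xs]

rng = random.Random(0)  # one seed fixes both the teacher and the data
teacher = make_teacher(d_in=50, d_hidden=128, num_classes=10, rng=rng)
xs, ys = make_dataset(teacher, n=100, d_in=50, rng=rng)
print(len(xs), sorted(set(ys)))
```

Because every seed draws a different teacher, task difficulty varies from seed to seed, which is consistent with the setup's choice to report paired differences rather than raw accuracies.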

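**Results**. The paired DTG-FF − BP-DeepSup difference grows from -0.23 pp at K=5 to +2.00 pp at K=50; a pre-specified low-K ({5,10,15,20}) vs. high-K ({30,50}) contrast shows a high-K advantage of +1.37 pp (bootstrap n=10,000, 95% CI [+0.59, +2.15]). DTG-FF reliably beats single BP for K≥15 (5/5 seeds, +1.2 to +3.4 pp) but loses to the 4×-parameter BP-Ensemble at every K, as expected in a pure capacity comparison. The narrow finding: **in this synthetic regime, the FF-BP gap cannot be explained by the absence of end-to-end gradients alone**. However, the synthetic advantage does not transfer to real data.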
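A percentile bootstrap over paired differences, of the kind behind a confidence interval like the one reported above, can be sketched as follows. The resampling unit, the percentile method, and the input values are assumptions made for illustration; the differences below are made-up numbers, not the paper's data.

```python
import random

def paired_bootstrap_ci(diffs, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample the paired differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lo, hi

# Hypothetical per-seed accuracy differences in pp (method A minus method B).
diffs = [1.8, 0.9, 2.4, 1.1, 0.6]
mean, lo, hi = paired_bootstrap_ci(diffs)
print(mean, lo, hi)
```

Working on per-seed differences removes the variance that both methods share, such as a hard or easy teacher, which is why pairing can give a usable interval even when raw accuracies vary widely across seeds.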
