# SURGE: Surrogate Gradient Adaptation in Binary Neural Networks
Source: [https://arxiv.org/html/2605.10989](https://arxiv.org/html/2605.10989)
Boyu Liu, Linlin Yang, Yanjing Li, Yuguang Yang, Xuhui Liu, Canyu Chen, Zhongqian Fu, Baochang Zhang
###### Abstract
The training of Binary Neural Networks (BNNs) is fundamentally based on gradient approximation for non-differentiable binarization operations (e.g., the sign function). However, prevailing methods, including the Straight-Through Estimator (STE) and its improved variants, rely on hand-crafted designs that suffer from the gradient mismatch problem and from information loss induced by fixed-range gradient clipping. To address this, we propose SURrogate GradiEnt Adaptation (SURGE), a novel learnable gradient compensation framework with theoretical grounding. SURGE mitigates gradient mismatch through auxiliary backpropagation. Specifically, we design a Dual-Path Gradient Compensator (DPGC) that constructs a parallel full-precision auxiliary branch for each binarized layer, decoupling gradient flow via output decomposition during backpropagation. DPGC enables bias-reduced gradient estimation by leveraging the full-precision branch to estimate components beyond STE's first-order approximation. To further enhance training stability, we introduce an Adaptive Gradient Scaler (AGS) based on an optimal scale factor to dynamically balance inter-branch gradient contributions via norm-based scaling. Experiments on image classification, object detection, and language understanding tasks demonstrate that SURGE outperforms state-of-the-art methods.
Quantization-Aware Training, Binary Neural Networks, Model Quantization, Model Compression
## 1 Introduction
Deep neural networks (DNNs) have achieved remarkable success across various domains (He et al., 2016; Vaswani, 2017), with model parameters scaling from millions to billions in state-of-the-art architectures (Brown et al., 2020; Yang et al., 2024). However, their escalating computational complexity and memory requirements pose significant challenges for deployment in resource-limited scenarios. To address this challenge, numerous model compression techniques have been developed to enhance deployment efficiency (He and Xiao, 2023; Hinton et al., 2014; Liu et al., 2025; Yu et al., 2017), each offering distinct trade-offs among compression ratio, inference speedup, and accuracy retention. Different from structural compression methods (e.g., pruning), quantization (Esser et al., 2019; Hubara et al., 2021; Wang et al., 2022; Xu et al., 2023) achieves compression through bit-width reduction without modifying the network architecture. The reduced bit-width representation significantly decreases storage requirements while enabling computational acceleration via low-precision operations.
As an extreme form of quantization, binarization (Courbariaux et al., 2015, 2016; Gong et al., 2019; Xu et al., 2021b, 2022a) represents weights and activations with 1-bit values, theoretically enabling 32× memory reduction and 58× computational acceleration compared to full-precision networks (Rastegari et al., 2016). These efficiency advantages make binarization especially practical for edge computing devices with severely limited computational resources, and its effectiveness has been proven in diverse tasks such as classification (Xu et al., 2021c), object detection (Xu et al., 2022b), and natural language understanding (Qin et al., 2022).
Figure 1: (a-b) Activation gradient patterns without/with SURGE (left/right); (c) gradient distribution comparison; (d) cumulative probability of gradients. STE provides a first-order approximation of the sign function's gradient and clips out-of-range activation gradients, while SURGE compensates them with a Dual-Path Gradient Compensator (a-b). SURGE also right-shifts the gradient distributions of activations (c-d), validating its effectiveness in rectifying STE-induced mismatch.

Despite considerable advances, there remains a non-negligible performance gap between binary neural networks (BNNs) and their full-precision counterparts (Rastegari et al., 2016). This discrepancy primarily stems from the substantial representation divergence between binary and continuous-valued weights and activations. Specifically, the training of BNNs incorporates quantization of real-valued tensors with deterministic or stochastic binarization operations (Courbariaux et al., 2016). However, the non-differentiable nature and vanishing gradients of binarization operations introduce significant challenges in backpropagation.
To solve the training problem, the Straight-Through Estimator (STE) (Bengio et al., 2013) provides an effective gradient approximation method for binarization operations. Specifically, STE directly substitutes the gradient of binarization operations (e.g., the sign function) with the derivative of the identity function during backpropagation, thereby enabling stable parameter optimization. Despite its prevalent application in training BNNs and low-bit networks, STE suffers from several inherent limitations that remain to be addressed. On the one hand, since the sign function's gradient vanishes everywhere except at zero, employing a fixed-value gradient approximation inevitably introduces estimation bias and optimization instability (Qin et al., 2020). To reduce the gradient error of STE, subsequent approaches predominantly rely on heuristic quantizer designs (Liu et al., 2019; Gong et al., 2019), such as piecewise polynomial functions (Liu et al., 2018b) and SignSwish activation functions (Darabi et al., 2018), which cannot guarantee finding the optimal gradient approximation.
On the other hand, during the backpropagation of STE, gradient clipping is adopted to preserve the gradient only for inputs within the vicinity of zero (typically [-1, 1]), which empirically improves model accuracy (Courbariaux et al., 2016). However, applying fixed-range gradient clipping is suboptimal for binarized representations, particularly for activation quantization, since the gradient information is discarded for values outside the clipping range (Qin et al., 2020). Existing binarization methods largely overlook the impact of the gradient clipping range; only a few studies propose handcrafted asymptotic functions to gradually approximate the hard binarization function (Gong et al., 2019; Qin et al., 2020). Consequently, merely employing STE and improved estimators (Rastegari et al., 2016; Gong et al., 2019; Xu et al., 2022a; Jin et al., 2025) fails to obtain an accurate gradient approximation for binarization operations, as non-negligible gradient mismatch (Qin et al., 2020) accumulates in the backward pass, necessitating explicit gradient rectification.
This paper proposes SURrogate GradiEnt Adaptation (SURGE), a novel learnable gradient compensation strategy that addresses gradient mismatch through auxiliary backpropagation. While STE and its improved variants provide surrogate gradients for binarization operations, SURGE offers enhanced gradient adaptation for binary neural networks. Specifically, we design a Dual-Path Gradient Compensator (DPGC), which constructs a parallel full-precision parameterized branch (the auxiliary branch) for each binarized layer (the main branch). In particular, DPGC decomposes each layer's output into contributions from the main branch and the auxiliary branch, thus decoupling the gradient flow into two parts during backpropagation. DPGC thereby ensures that the auxiliary branch only affects the backward gradient while preserving the original layer outputs during the forward pass. Compared with the binary branch, the full-precision branch can provide less biased gradients (Stock et al., 2021) that compensate for the error of STE's first-order approximation (Liu et al., 2023) by learning higher-order terms. As shown in Figure 1, (a) STE's fixed clipping zeros a vast area of activation gradients; (b) with SURGE, the auxiliary branch injects compensation gradients while keeping the forward output unchanged, visibly recovering the clipped regions. Aggregated statistics in (c)-(d) show a right-shifted gradient distribution and heavier tails in the cumulative curves, indicating that SURGE restores informative gradients beyond STE's first-order surrogate.
Moreover, large-magnitude gradients from the auxiliary path may adversely affect the convergence of the main branch. To address this problem, we propose an Adaptive Gradient Scaler (AGS) that dynamically balances inter-branch gradient contributions via norm-based scaling, thereby ensuring stable and effective compensation. To validate the effectiveness of SURGE, we conduct comprehensive comparative experiments on two image classification benchmarks, one object detection benchmark, and a suite of language understanding benchmarks; our proposed method achieves the best performance over the state of the art. In summary, the main contributions of this work are as follows:
- We propose SURrogate GradiEnt Adaptation (SURGE), a novel gradient compensation framework employing a Dual-Path Gradient Compensator to address gradient mismatch. Our method does not modify the forward-pass output and introduces no additional overhead at inference.
- We introduce an Adaptive Gradient Scaler (AGS) that dynamically equilibrates gradient contributions from the binary and auxiliary branches based on a theoretically derived optimal scaling factor.
- Extensive experiments demonstrate that SURGE achieves state-of-the-art performance across four standard benchmarks for BNN training. Specifically, a SURGE-trained binarized ResNet-18 attains 62.0% top-1 accuracy on ImageNet with one-stage training, surpassing previous SOTA methods by significant margins (e.g., +1.0% and +3.9% top-1 accuracy improvements over ReCU and IR-Net, respectively, on ImageNet).
## 2 Related Work

### 2.1 Gradient Approximation
Gradient approximation serves as a cornerstone for training neural networks with non-differentiable operators, addressing challenges in discrete sampling (Sutton et al., 1999; Schulman et al., 2015; Athalye et al., 2018; Rezende et al., 2014), architecture search (Xie et al., 2018; Liu et al., 2018a; Cai et al., 2018), and especially quantization (Esser et al., 2020; Gong et al., 2019; Liu et al., 2018b, 2020; Xu et al., 2022a). A popular family of gradient estimators is the Straight-Through Estimator (STE), which directly propagates gradients through non-differentiable functions. The straight-through idea originates from the perceptron algorithm (Rosenblatt, 1957), which leverages a modified chain rule and utilizes the identity function as a proxy for the original derivative of a binary output function. Bengio et al. (2013) improve this method by using non-linear functions like the sigmoid, and Jang et al. (2016) further incorporate Gumbel reparameterization, which reparameterizes discrete variables via temperature-annealed continuous relaxation, enabling low-variance gradient estimation for categorical sampling. In the field of quantization, DSQ (Gong et al., 2019) employs parameterized sigmoid functions to progressively approximate the gradients of the non-differentiable quantization function, while LSQ (Esser et al., 2020) introduces scaling factors for end-to-end gradient propagation, advancing low-bit quantization. BONN (Zhao et al., 2022) integrates Bayesian optimization to guide differentiable binarization policies, and FDA-BNN (Xu et al., 2021b) converts the sign function into the frequency domain to mitigate the gradient mismatch.
### 2.2 Binary Neural Networks
Pioneering works in binary neural networks focused either on binarization architecture design (Liu et al., 2018b; Xu et al., 2021b; Liu et al., 2020; Bulat et al., 2020; Yang et al., 2020) or on training strategies (Courbariaux et al., 2015; Rastegari et al., 2016; Qin et al., 2020; Xu et al., 2021c, 2022a). In terms of architecture design, Bi-Real Net (Liu et al., 2018b) enhances skip connections, and FDA-BNN (Xu et al., 2021b) introduces differentiable binarization units in the frequency domain. Moreover, ReActNet (Liu et al., 2020) substitutes the sign function and PReLU (He et al., 2015) with RSign and RPReLU based on learnable thresholds. Approaches like BATS (Bulat et al., 2020) and SLB (Yang et al., 2020) combine BNNs with neural architecture search. In terms of training strategies, BinaryConnect (Courbariaux et al., 2015) and XNOR-Net (Rastegari et al., 2016) use the sign function with gradient approximation, but cause severe information loss in forward propagation. Later training strategies followed: IR-Net (Qin et al., 2020) and ReCU (Xu et al., 2021c) use progressive quantization and feature distribution alignment, but still face gradient mismatch in deep networks. RBONN (Xu et al., 2022a) introduces recurrent bilinear optimization for BNNs.
Unlike prior work, ours is the first attempt to employ a Dual-Path Gradient Compensator to correct gradient mismatch in STE-based binarized networks, coupled with an Adaptive Gradient Scaler that dynamically equilibrates the gradient contributions of the binary and auxiliary branches.
Figure 2: Overall architecture of SURGE. (a) Integration into common backbones (left: convolution block; right: transformer block). (b) Component details. DPGC constructs a parallel full-precision parameterized branch (auxiliary branch, shown with red arrows for the forward pass and blue arrows for backpropagation) for each binarized layer (main branch, represented by black arrows in the forward pass and green arrows for backpropagation). This ensures identical output to standard BNNs while providing less biased gradients for compensation. AGS takes gradients from both branches as input (visualized through the corresponding colored arrows) and dynamically balances inter-branch gradient contributions via norm-based scaling. SURGE is architecture-agnostic and applies to arbitrary binarized linear operators.
## 3 Preliminaries
Consider a neural layer with weight vector $W \in \mathbb{R}^{d}$ and input vector $x \in \mathbb{R}^{d}$. The main operation in deep neural networks is expressed as:

$$f(x; W) = W^{\top} x. \tag{1}$$

In binary neural networks (BNNs), we quantize $W$ and $x$ to $\{-1, +1\}^{d}$, so that efficient XNOR and bit-count operations can replace real-valued operations. Let $\mathbf{B}_{W} \in \{-1, +1\}^{d}$ and $\mathbf{B}_{x} \in \{-1, +1\}^{d}$ denote the binarized counterparts. Network binarization aims to represent the floating-point weights and/or activations with 1 bit. In general, the quantization can be formulated as $Q_{x}(x) = \alpha_{x}\mathbf{B}_{x}$, $Q_{W}(W) = \alpha_{W}\mathbf{B}_{W}$, where $\alpha_{(\cdot)}$ denotes the scalars for the binary values, i.e., $\alpha_{W}$ for weights and $\alpha_{x}$ for inputs. The sign function is usually used to binarize $W$ and $x$: $\mathbf{B}_{x} = \mathrm{sign}(x)$, $\mathbf{B}_{W} = \mathrm{sign}(W)$. Following Rastegari et al. (2016), the binary operation is formulated as:

$$f_{b}(x; \mathbf{B}_{W}) = Q_{W}(W)^{\top} Q_{x}(x) = \alpha_{W}\alpha_{x} \cdot (\mathbf{B}_{W} \odot \mathbf{B}_{x}), \tag{2}$$

where $\odot$ denotes the inner product for vectors, implemented with the bitwise operations XNOR and bit-count.
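To make Eq. (2) concrete, here is a minimal PyTorch sketch of the binarization step. The per-tensor scale $\alpha = \mathrm{mean}(|\cdot|)$ is the XNOR-Net choice and is assumed here for illustration; the exact scale computation is deferred to the cited work.

```python
import torch

def binarize(t: torch.Tensor):
    """Return (alpha, B) with alpha * B ~ t, where B = sign(t) in {-1, +1}
    and alpha is a per-tensor scalar (XNOR-Net choice, assumed here)."""
    alpha = t.abs().mean()
    # sign with sign(0) mapped to +1, so B stays in {-1, +1}
    B = torch.where(t >= 0, torch.ones_like(t), -torch.ones_like(t))
    return alpha, B

# Eq. (2): f_b = alpha_W * alpha_x * (B_W . B_x). Floating-point ops here
# emulate what 1-bit hardware computes with XNOR and bit-count.
W, x = torch.randn(64), torch.randn(64)
alpha_W, B_W = binarize(W)
alpha_x, B_x = binarize(x)
out = alpha_W * alpha_x * (B_W @ B_x)
```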
In backpropagation, the derivative of the sign function is zero almost everywhere, which makes it incompatible with backpropagation, since exact gradients for the original values before binarization would be zeroed out. Thus, the Straight-Through Estimator (STE) (Bengio et al., 2013), which propagates the gradient through the identity function, is generally used to train BNNs. The gradient of the loss $L$ w.r.t. $W$ is approximated as

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \mathbf{B}_{W}} \cdot \frac{\partial \mathbf{B}_{W}}{\partial W} \approx \frac{\partial L}{\partial \mathbf{B}_{W}}. \tag{3}$$

As for the gradient w.r.t. the activations, it can be formulated as

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial \mathbf{B}_{x}} \cdot \frac{\partial \mathbf{B}_{x}}{\partial x} \approx \frac{\partial L}{\partial \mathbf{B}_{x}} \cdot \mathbf{1}_{\{|x| \leq 1\}}, \tag{4}$$

where $\mathbf{1}_{\{|x| \leq 1\}}$ is the indicator function that equals 1 when $|x| \leq 1$ and 0 otherwise. This expression corresponds to STE's first-order approximation of the sign function's gradient.
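Eqs. (3)-(4) map directly onto a custom autograd function; the following is a minimal sketch of sign-with-STE (our illustration, not the authors' released code):

```python
import torch

class SignSTE(torch.autograd.Function):
    """sign() with the Straight-Through Estimator of Eq. (4): the forward
    pass uses the non-differentiable sign, the backward pass lets the
    incoming gradient through unchanged inside the clipping range."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Indicator 1_{|x| <= 1}: zero the gradient outside [-1, 1]
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

x = torch.randn(8, requires_grad=True)
SignSTE.apply(x).sum().backward()
print(x.grad)  # 1 where |x| <= 1, else 0
```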
## 4 Methodology

In this section, we describe SURGE in detail. We first introduce the Dual-Path Gradient Compensator (DPGC) module to address the gradient mismatch in STE-based training (Sec. 4.1), then present the Adaptive Gradient Scaler (AGS) for stable optimization (Sec. 4.2). The complete training paradigm integrates these components while preserving standard BNN inference.

### 4.1 Dual-Path Gradient Compensator (DPGC)
To handle the intrinsic gradient mismatch in STE (Qin et al., 2020), we propose a layer-wise dual-path architecture that preserves the original forward computations while introducing auxiliary gradient pathways. As shown in Figure 2, for each binarized layer (the main branch), DPGC constructs a parallel full-precision parameterized branch (the auxiliary branch) consisting of a full-precision operator (e.g., convolution, linear, attention projection) with dimensions identical to the main branch (e.g., kernel size, dimension), augmented with an Adaptive Gradient Scaler (AGS) module (Section 4.2). DPGC decomposes each layer's output into contributions from both the main branch (black arrow) and the auxiliary branch (red arrow), thus decoupling the gradient flow into two parts during backpropagation (Eq. 6; green arrow for the main branch, blue arrow for the auxiliary branch). Given $f_{b}(x) = Q_{W}(W_{b})^{\top} Q_{x}(x)$ as the binarized computation, $f_{a}(x) = W_{a}^{\top} x$ as the full-precision computation, and $W_{a}, W_{b}$ as the weight parameters of the auxiliary branch (full-precision branch) and the main branch (binary branch), respectively, the combined output is:
$$\textit{output} = \underbrace{f_{b}(x; W_{b})}_{\text{Binary output}} - \underbrace{f_{ao}(x; W_{a})\downarrow}_{\text{Detached compensator}} + \underbrace{f_{ao}(x; W_{a})}_{\text{Active compensator}}, \tag{5}$$

where $f_{ao}(x) = \lambda f_{a}(x)$ is the scaled full-precision computation, $\lambda$ is the scale factor, and $\downarrow$ is the gradient-stop operator. This design ensures outputs identical to standard BNNs, while gradients flow through both pathways, thus providing less biased gradient estimates (Stock et al., 2021) while preserving STE gradients. Upon completion of training, the auxiliary branch can be discarded, introducing no additional computational overhead during inference. Backpropagation aggregates gradients from both paths:

$$\frac{\partial \mathcal{L}}{\partial x} = \underbrace{\frac{\partial \mathcal{L}}{\partial f_{b}} \frac{\partial f_{b}}{\partial x} \Big|_{\text{STE}}}_{\text{Binary gradients } g_{b}} + \lambda \underbrace{\frac{\partial \mathcal{L}}{\partial f_{a}} \frac{\partial f_{a}}{\partial x}}_{\text{Compensator gradients } g_{a}}. \tag{6}$$

Here, $\left.\frac{\partial f_{b}}{\partial x}\right|_{\text{STE}}$ serves as the STE surrogate for the theoretically intractable binary-branch derivative (Liu et al., 2023). The term $g_{b}$ is thereby a first-order approximation, while the full-precision auxiliary branch yields $g_{a}$, which captures higher-order terms and recovers gradients removed by clipping.
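In code, the entire decomposition of Eq. (5) reduces to one gradient-stop (detach) call. A minimal PyTorch sketch, with `f_b` and `f_a` as stand-ins for the binarized and full-precision operators:

```python
import torch
import torch.nn as nn

class DPGC(nn.Module):
    """Sketch of Eq. (5). The detached and attached copies of the scaled
    auxiliary output cancel in the forward value, so the output equals
    f_b(x) exactly; the backward pass additionally receives the auxiliary
    gradient lam * (dL/df_a)(df_a/dx) of Eq. (6)."""

    def __init__(self, f_b: nn.Module, f_a: nn.Module, lam: float = 0.01):
        super().__init__()
        self.f_b, self.f_a, self.lam = f_b, f_a, lam

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_ao = self.lam * self.f_a(x)              # scaled auxiliary output
        return self.f_b(x) - f_ao.detach() + f_ao  # value == f_b(x)
```

Because the forward value never depends on the auxiliary operator, the auxiliary branch can simply be dropped after training, consistent with the zero-inference-overhead claim above.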
### 4.2 Adaptive Gradient Scaler (AGS)
The raw combination of $g_{b}$ and $g_{a}$ risks unstable training due to varying magnitude ratios between the paths, and large-magnitude gradients from the auxiliary path may adversely affect the convergence of the main branch. We address this through a novel mechanism that dynamically balances inter-branch gradient contributions with a norm-based adaptive scaling factor $\lambda_{\mathrm{AGS}}$, thereby ensuring stable and effective compensation:

$$\frac{\partial \mathcal{L}}{\partial x} = g_{b} + \lambda_{\mathrm{AGS}} \cdot g_{a}, \qquad \lambda_{\mathrm{AGS}} := \eta \frac{\|g_{b}\|_{2}}{\|g_{a}\|_{2} + \epsilon}, \tag{7}$$

where $\eta$ is the base scaling coefficient, $\epsilon = 10^{-8}$ is a numerical stabilizer, and $\lambda_{\mathrm{AGS}}$ is a practical plug-in approximation of the theoretical optimum (Theorem 5.3, Corollary 5.4). This dynamic scaling preserves the directional consistency of the primary binary gradient $g_{b}$ while allowing the auxiliary gradient $g_{a}$ to provide magnitude-aware compensation. The design guarantees that the STE-based gradients dominate the parameter update process, while the auxiliary path serves as an adaptive compensator that injects higher-order gradient information without destabilizing the primary learning dynamics. In practice, the scale factor derived from the gradient computation of the current iteration is applied in the subsequent AGS step, trading a one-step delay for computational efficiency. The complete training procedure is summarized in Appendix A.
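The rule in Eq. (7) is a one-line computation on the two branch gradients. The sketch below (ours) illustrates how it ties the compensation magnitude to a fixed fraction $\eta$ of the binary gradient's norm:

```python
import torch

def ags_combine(g_b: torch.Tensor, g_a: torch.Tensor,
                eta: float = 0.01, eps: float = 1e-8):
    """AGS rule of Eq. (7): g = g_b + lambda_AGS * g_a,
    with lambda_AGS = eta * ||g_b||_2 / (||g_a||_2 + eps)."""
    lam = eta * g_b.norm(p=2) / (g_a.norm(p=2) + eps)
    return g_b + lam * g_a, lam

# Toy check: even if the auxiliary gradient is 100x larger in magnitude,
# the injected compensation has norm ~ eta * ||g_b||, so the STE
# direction keeps dominating the update.
g_b = torch.randn(256)
g_a = 100.0 * torch.randn(256)
g, lam = ags_combine(g_b, g_a)
print((lam * g_a).norm().item() / g_b.norm().item())  # ~ eta
```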
## 5 Theoretical Analysis

This section formally establishes the theoretical foundation of gradient compensation in dual-path architectures. We begin by formulating the gradient propagation mechanism under our proposed compensation framework, followed by moment-based notation for gradient statistics (Definition 5.1) and a moment model that captures the bias/noise structure of the two gradient components (Assumption 5.2). We then derive the theoretically optimal scaling factor for gradient compensation (Theorem 5.3) and obtain a practical norm-ratio approximation that directly motivates the AGS update rule (Corollary 5.4).
Let $\mathcal{X}$ denote the input space and $W = (W_{b}, W_{a}) \in \mathbb{R}^{2d}$ represent the binarized and full-precision weights. The forward propagation becomes:

$$f(x; W) = \underbrace{Q_{W}(W_{b})^{\top} Q_{x}(x)}_{\text{Binary path}} + \lambda \left( \underbrace{W_{a}^{\top} x}_{\text{Compensator path}} - \underbrace{W_{a}^{\top} x \downarrow}_{\text{Detached path}} \right), \tag{8}$$

where $\lambda$ follows the adaptive scaling in Section 4.2. Letting $\mathrm{Approx}$ denote an STE-based gradient approximation (e.g., STE itself), the composite gradient combines:

$$\frac{\partial \mathcal{L}}{\partial x} = \underbrace{\frac{\partial \mathcal{L}}{\partial f_{b}} \frac{\partial f_{b}}{\partial x} \Big|_{\mathrm{Approx}}}_{g_{b}} + \underbrace{\frac{\partial \mathcal{L}}{\partial f_{ao}} \frac{\partial f_{ao}}{\partial x}}_{\lambda g_{a}}. \tag{9}$$
###### Definition 5.1 (Notation for gradient statistics).

Let $\mu_{b} := \mathbb{E}[g_{b}]$ and $\mu_{a} := \mathbb{E}[g_{a}]$. Define the approximation-induced bias vector of the baseline surrogate as $\delta_{b} := g^{*} - \mu_{b}$.

###### Assumption 5.2 (Moment model for gradient components).

Let $\mu_{b} := \mathbb{E}[g_{b}]$ and $\mu_{a} := \mathbb{E}[g_{a}]$. There exists an ideal (unobserved) reference gradient $g^{*}$ such that

$$\mu_{b} = g^{*} - \delta_{b}, \qquad \|\delta_{b}\|_{2} \leq C\sqrt{d},$$

and $g_{b}, g_{a}$ have finite second moments. The definition of $g^{*}$ and the noise structure assumptions are detailed in Appendix B.1.
###### Theorem 5.3 (Optimal scaling factor).

Let $\tilde{g}(\lambda) := g_{b} + \lambda g_{a}$. Under Assumption 5.2, assume $\mu_{a} := \mathbb{E}[g_{a}]$ exists and $\mathrm{Var}(g_{a})$ has finite trace. Assume additionally that the mini-batch noises in $g_{b}$ and $g_{a}$ are uncorrelated in the dot-product sense:

$$\mathbb{E}\big[(g_{b} - \mathbb{E}[g_{b}])^{\top}(g_{a} - \mathbb{E}[g_{a}])\big] = 0.$$

Then any minimizer of $\mathbb{E}\|\tilde{g}(\lambda) - g^{*}\|_{2}^{2}$ is

$$\lambda^{*} = \frac{\langle \delta_{b}, \mu_{a} \rangle}{\|\mu_{a}\|_{2}^{2} + \mathrm{tr}(\mathrm{Var}(g_{a}))}.$$

If $\|\mu_{a}\|_{2}^{2} + \mathrm{tr}(\mathrm{Var}(g_{a})) > 0$, this minimizer is unique. In particular, under the isotropic noise model $\mathrm{Var}(g_{a}) = \sigma_{a}^{2} I_{d}$,

$$\lambda^{*} = \frac{\langle \delta_{b}, \mu_{a} \rangle}{\|\mu_{a}\|_{2}^{2} + d\sigma_{a}^{2}}.$$
###### Corollary 5.4 (Practical norm-ratio approximation).

Suppose in addition that during the main training phase (after a short transient): (i) the alignment $\cos\theta := \frac{\langle \delta_{b}, \mu_{a} \rangle}{\|\delta_{b}\|_{2}\|\mu_{a}\|_{2}} \approx c_{\theta}$ is approximately stable; (ii) the relative bias ratio $\beta := \frac{\|\delta_{b}\|_{2}}{\|\mu_{b}\|_{2}}$ with $\mu_{b} := \mathbb{E}[g_{b}]$ is bounded and slowly varying, so $\beta \approx \kappa$; and (iii) the noise ratio $\rho := \frac{d\sigma_{a}^{2}}{\|\mu_{a}\|_{2}^{2}}$ is approximately stable. Then

$$\lambda^{*} \approx \eta \frac{\|\mu_{b}\|_{2}}{\|\mu_{a}\|_{2}}, \qquad \eta := \frac{\kappa c_{\theta}}{1 + \rho}.$$

Replacing population quantities by mini-batch estimates and adding a numerical stabilizer $\epsilon > 0$ yields the AGS rule

$$\lambda_{\mathrm{AGS}} := \eta \frac{\|g_{b}\|_{2}}{\|g_{a}\|_{2} + \epsilon}.$$
The proof is detailed in Appendix B.2. The practical expression $\lambda_{\mathrm{AGS}}$ thus serves as a norm-based, adaptive estimate of the optimal scaling factor $\lambda^{*}$ and is the form adopted in our AGS module (Sec. 4.2). This analysis establishes a principled gradient scaling factor that improves the resulting gradient update and alleviates the gradient mismatch.
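As a sanity check on Theorem 5.3, the short script below (our construction, with synthetic moments chosen to satisfy Assumption 5.2 and the uncorrelated-noise condition) compares the closed-form $\lambda^{*}$ under the isotropic model against the empirical minimizer of $\mathbb{E}\|g_{b} + \lambda g_{a} - g^{*}\|_{2}^{2}$:

```python
import torch

torch.manual_seed(0)
d, n = 128, 50_000                        # dimension, Monte Carlo samples
g_star = torch.randn(d)                   # ideal reference gradient g*
delta_b = 0.3 * torch.randn(d)            # surrogate bias: mu_b = g* - delta_b
mu_a = delta_b + 0.1 * torch.randn(d)     # auxiliary mean, roughly aligned with delta_b
sigma_b, sigma_a = 0.05, 0.05             # isotropic mini-batch noise levels

g_b = (g_star - delta_b) + sigma_b * torch.randn(n, d)  # E[g_b] = mu_b
g_a = mu_a + sigma_a * torch.randn(n, d)                # Var(g_a) = sigma_a^2 I_d

# Closed form (Theorem 5.3, isotropic case)
lam_theory = (delta_b @ mu_a) / (mu_a @ mu_a + d * sigma_a ** 2)

# Empirical minimizer of E||g_b + lam * g_a - g*||^2, quadratic in lam
err = g_b - g_star
lam_mc = -(err * g_a).sum(dim=1).mean() / (g_a * g_a).sum(dim=1).mean()

print(f"lambda* theory = {lam_theory.item():.4f}, Monte Carlo = {lam_mc.item():.4f}")
```

The two estimates agree up to Monte Carlo error, and the norm-ratio form of Corollary 5.4 follows by factoring $\lambda^{*}$ into the stable ratios $c_{\theta}$, $\kappa$, and $\rho$.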
Table 1: Performance comparison with the state of the art on CIFAR-10. W/A denotes the bit length of the weights and activations.

| Network | Method | W/A | Top-1 |
| --- | --- | --- | --- |
| ResNet-18 | Real-Valued | 32/32 | 94.8% |
| | RAD | 1/1 | 90.5% |
| | IR-Net | 1/1 | 91.5% |
| | RBNN | 1/1 | 92.2% |
| | ReCU | 1/1 | 92.8% |
| | SURGE (Ours) | 1/1 | 93.1% |
| ResNet-20 | Real-Valued | 32/32 | 92.1% |
| | DoReFa | 1/1 | 79.3% |
| | DSQ | 1/1 | 84.1% |
| | SLB | 1/1 | 85.5% |
| | IR-Net | 1/1 | 86.5% |
| | ReCU | 1/1 | 87.4% |
| | SURGE (Ours) | 1/1 | 88.0% |
| VGG-Small | Real-Valued | 32/32 | 94.1% |
| | XNOR-Net | 1/1 | 89.8% |
| | DoReFa | 1/1 | 90.2% |
| | IR-Net | 1/1 | 90.4% |
| | RBNN | 1/1 | 91.3% |
| | DSQ | 1/1 | 91.7% |
| | SLB | 1/1 | 92.0% |
| | ReCU | 1/1 | 92.2% |
| | SURGE (Ours) | 1/1 | 92.5% |
## 6 Experiments

### 6.1 Datasets and Implementation Details
**Datasets.** We evaluate on two standard image classification benchmarks, one object detection benchmark, and a suite of language understanding tasks to demonstrate the effectiveness of our method: CIFAR-10 (Krizhevsky et al., 2009), ImageNet-1K (Russakovsky et al., 2015), PASCAL VOC (Everingham et al., 2010), and GLUE (Wang et al., 2018). More details of the datasets, data augmentation, and evaluation metrics are provided in Appendix D.
**Implementation Details.** On CIFAR-10, we evaluate our method with ResNet-18/20 (He et al., 2016) and VGG-Small (Simonyan and Zisserman, 2014). On PASCAL VOC, we binarize Faster-RCNN (Ren et al., 2016) with a ResNet-18 backbone (with minor structural modifications shared by the FP and BNN models). On GLUE, we evaluate our method with BERT-base (Devlin et al., 2019). More training details are provided in Appendix E.
### 6.2 Image Classification
**CIFAR-10.** We first show the experimental results on CIFAR-10 with the ResNet-18, ResNet-20, and VGG-Small backbones in Table 1. Specifically, we compare SURGE with state-of-the-art methods including RAD (Ding et al., 2019), IR-Net (Qin et al., 2020), RBNN (Lin et al., 2020), ReCU (Xu et al., 2021c), DoReFa (Zhou et al., 2016), DSQ (Gong et al., 2019), SLB (Yang et al., 2020), and XNOR-Net (Rastegari et al., 2016). SURGE outperforms all other methods on all backbones. Compared to the recent ReCU, SURGE obtains a 0.3% accuracy increase with ResNet-18, a 0.6% increase with ResNet-20, and a 0.3% increase with VGG-Small.
Table 2: Performance comparison with SOTAs on ImageNet with one-stage training. W/A denotes the bit length of weights and activations. We report the Top-1 (%) and Top-5 (%) accuracies.

| Network | Method | W/A | OPs (×10⁸) | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- |
| ResNet-18 | Real-valued | 32/32 | 18.19 | 69.6 | 89.2 |
| | DoReFa | 1/4 | 2.44 | 59.2 | 81.5 |
| | TBN | 1/2 | 1.81 | 55.6 | 79.0 |
| | BNN | 1/1 | 1.63 | 42.2 | 67.1 |
| | XNOR-Net | 1/1 | 1.63 | 51.2 | 73.2 |
| | Bi-Real Net | 1/1 | 1.63 | 56.4 | 79.5 |
| | IR-Net | 1/1 | 1.63 | 58.1 | 80.0 |
| | BONN | 1/1 | 1.63 | 59.3 | 81.6 |
| | RBNN | 1/1 | 1.63 | 59.6 | 81.6 |
| | ReCU | 1/1 | 1.63 | 61.0 | 82.6 |
| | RBONN | 1/1 | 1.63 | 61.4 | 83.5 |
| | SURGE (Ours) | 1/1 | 1.63 | 62.0 | 83.7 |

Table 3: Performance comparison with SOTAs on ImageNet with two-stage training. W/A denotes the bit length of weights and activations. We report the Top-1 (%) and Top-5 (%) accuracies. * denotes that the result is from the official checkpoint.

| Network | Method | W/A | OPs (×10⁸) | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- |
| ResNet-18 | Real-valued | 32/32 | 18.19 | 69.6 | 89.2 |
| | ReActNet | 1/1 | 1.63 | 65.9 | - |
| | ReCU | 1/1 | 1.63 | 66.4 | 86.5 |
| | RBONN* | 1/1 | 1.63 | 66.5 | 86.7 |
| | SURGE (Ours) | 1/1 | 1.63 | 66.7 | 86.7 |

**One-Stage Training on ImageNet.** Table 2 displays the performance comparison for binarizing ResNet-18 with one-stage training on ImageNet. We compare SURGE with DoReFa (Zhou et al., 2016), TBN (Wan et al., 2018), BNN (Courbariaux et al., 2016), XNOR-Net (Rastegari et al., 2016), Bi-Real Net (Liu et al., 2018b), IR-Net (Qin et al., 2020), BONN (Zhao et al., 2022), RBNN (Lin et al., 2020), and RBONN (Xu et al., 2022a). SURGE leads in both top-1 and top-5 accuracy. Specifically, SURGE outperforms RBONN by 0.6% in top-1 accuracy, achieving the best performance.
**Two-Stage Training on ImageNet.** Table 3 displays the performance comparison for binarizing ResNet-18 with two-stage training on ImageNet. We compare SURGE with ReActNet (Liu et al., 2020), ReCU (Xu et al., 2021c), and RBONN (Xu et al., 2022a). Results show that SURGE outperforms all other methods in top-1 accuracy, with a 0.2% top-1 increase over RBONN. SURGE thus demonstrates superior overall performance compared to existing approaches.
Table 4: Performance comparison of different methods in the Faster-RCNN framework with input resolution 1000×600. † denotes that the result is from our re-implementation.

| Framework | Backbone | Method | W/A | Memory Usage (MB) | OPs (×10⁹) | mAP |
| --- | --- | --- | --- | --- | --- | --- |
| Faster-RCNN | ResNet-18 | Real-valued | 32/32 | 112.88 | 96.40 | 78.8 |
| | | DoReFa-Net | 4/4 | 21.59 | 27.15 | 73.3 |
| | | ReActNet | 1/1 | 16.61 | 18.49 | 69.6 |
| | | LWS-Det | 1/1 | 16.61 | 18.49 | 73.2 |
| | | IDa-Det† | 1/1 | 16.61 | 18.49 | 76.5 |
| | | SURGE (Ours) | 1/1 | 16.61 | 18.49 | 77.0 |
### 6.3 Object Detection
On the PASCAL VOC dataset, we compare the proposed SURGE against existing state-of-the-art binarized detection methods, such as ReActNet (Liu et al., 2020), LWS-Det (Xu et al., 2021a), and IDa-Det (Xu et al., 2022b), in the Faster-RCNN framework for object detection. The detection result of the multi-bit quantized network DoReFa-Net (Zhou et al., 2016) is also reported. As shown in Table 4, compared with the prior state-of-the-art IDa-Det, our method gains a 0.5% mAP increase with the same FLOPs and memory usage. Compared with the real-valued detector, SURGE surpasses the raw real-valued Faster-RCNN with ResNet-18 backbone (77.0% vs. 76.4%) while offering 5.21× computational acceleration and 6.80× storage savings.
### 6.4 Language Understanding
On the GLUE benchmark, we compare SURGE against existing state-of-the-art methods, such as BinaryBERT (Bai et al., 2020), BiBERT (Qin et al., 2022), and BiT (Liu et al., 2022), on BERT. SURGE outperforms all other methods: it obtains a 1.4% average performance increase over BiT and outperforms BiBERT by 8.9%, achieving the best performance.
Table 5: Performance comparison of BERT quantization on the GLUE dev set. FP is short for full precision. † denotes our re-implementation without multi-distillation techniques for fair comparison.

| Quant | Size (MB) | FLOPs (G) | MNLI m/mm | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT (FP) | 418 | 22.5 | 84.9/85.5 | 91.4 | 92.1 | 93.2 | 59.7 | 90.1 | 86.3 | 72.2 | 83.9 |
| BinaryBERT | 16.5 | 0.4 | 35.6/35.3 | 66.2 | 51.5 | 53.2 | 0 | 6.1 | 68.3 | 52.7 | 41.0 |
| BiBERT | 13.4 | 0.4 | 66.1/67.5 | 84.8 | 72.6 | 88.7 | 25.4 | 33.6 | 72.5 | 57.4 | 63.2 |
| BiT† | 13.4 | 0.4 | 77.0/77.5 | 85.4 | 85.5 | 87.8 | 23.6 | 68.0 | 79.4 | 58.1 | 70.6 |
| SURGE (Ours) | 13.4 | 0.4 | 77.3/77.5 | 87.1 | 86.2 | 88.6 | 24.1 | 71.7 | 80.6 | 60.6 | 72.0 |
### 6.5 Ablation Study
**Ablation on Components.** We ablate each component on CIFAR-10 using ResNet-20. As shown in Table 6(a), the baseline achieves 87.4% accuracy. Introducing the Dual-Path Gradient Compensator (DPGC) alone improves performance by +0.4%, validating its capability to balance gradient conflicts. Subsequent integration of the Adaptive Gradient Scaler (AGS) adds another +0.2%, demonstrating that AGS effectively modulates gradient magnitudes without disrupting DPGC's compensation. The hierarchical gains confirm that the two mechanisms address distinct aspects of gradient optimization.
Table 6: Ablation study on CIFAR-10 with ResNet-20.

(a) Ablation on components

| Method | Accuracy (%) |
| --- | --- |
| Baseline | 87.4 |
| + DPGC | 87.8 |
| + DPGC + AGS | 88.0 |
(b) Ablation on gradient compensation scope

| Method | Scope | Accuracy (%) |
| --- | --- | --- |
| Baseline | / | 87.4 |
| SURGE | clipped gradients | 87.7 |
| SURGE | unclipped gradients | 87.6 |
| SURGE | all gradients | 88.0 |
Figure 3: Ablation study on parameter scaling strategies. (a) Fixed scaling with constant factors across training iterations. (b) Adaptive scaling via the parameter $\eta$, which dynamically adjusts the compensation strength (Eq. 7).

**Ablation on Parameter $\eta$.** As shown in Figure 3, the performance degradation of fixed scaling (scale $> 0.05$, up to -17.7%) highlights the necessity of dynamic adaptation, while our adaptive scaling achieves peak accuracy (87.95%) at $\eta = 0.01$. The result confirms that our theory-driven design (Theorem 5.3) successfully balances gradient compensation and training stability.
**Ablation on Gradient Compensation Scope of DPGC.** We ablate the gradient compensation scope on CIFAR-10 using ResNet-20. As detailed in Table 6(b), compensating only gradients outside STE's clipping range ($|x| > 1$) yields 87.7% accuracy (a 0.3% improvement over the baseline), while compensating solely within-range gradients ($|x| \leq 1$) achieves 87.6% (a 0.2% improvement). This verifies that both clipped and preserved gradient components contribute to parameter optimization. When jointly compensating all activation gradients through SURGE's adaptive integration, accuracy rises to 88.0% (a 0.6% improvement). This confirms that SURGE's design overcomes the fixed-range clipping limitation of STE, enabling comprehensive gradient utilization.
## 7 Conclusion
This paper proposes a novel gradient compensation strategy that mitigates the STE-induced gradient mismatch through auxiliary backpropagation. The proposed Dual-Path Gradient Compensator (DPGC) employs a dual-path architecture that ensures output identical to standard BNNs while providing less biased gradients for compensation, and the Adaptive Gradient Scaler (AGS) dynamically balances inter-branch gradient contributions via norm-based scaling. SURGE achieves the best performance over existing methods on major benchmarks in image classification, object detection, and language understanding.
## Acknowledgements
The work was supported by the National Key Research and Development Program of China (No. 2023YFC3306401) and Beijing Natural Science Foundation L244043. This research was also supported by the National Natural Science Foundation of China under Grants 623B2016 and 62576018, and the Zhejiang Provincial Natural Science Foundation of China under Grant No. LD24F020007.
## Impact Statement
This paper proposes a training-time gradient compensation framework for Binary Neural Networks (BNNs) to improve optimization stability and accuracy under extreme quantization. The primary expected impact is enabling more efficient deployment of neural models on resource-constrained devices through reduced memory footprint and computation, which may lower energy consumption and operational cost for practical applications.
We do not anticipate direct safety or security risks introduced by the proposed method beyond those already associated with deploying machine learning models in real\-world settings\. The method does not introduce new data collection, does not require sensitive information, and does not change the functional scope of the underlying models; it mainly affects the training dynamics and can be removed at inference time\. As with other model compression techniques, improved efficiency could facilitate wider deployment, and responsible use should follow standard best practices for dataset governance, evaluation, and monitoring in downstream applications\.
## References
- A. Athalye, N. Carlini, and D. Wagner (2018). Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In Int. Conf. Mach. Learn., pp. 274–283.
- H. Bai, W. Zhang, L. Hou, L. Shang, J. Jin, X. Jiang, Q. Liu, M. Lyu, and I. King (2020). BinaryBERT: pushing the limit of BERT quantization. arXiv preprint arXiv:2012.15701.
- Y. Bengio, N. Léonard, and A. Courville (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020). Language models are few-shot learners. Adv. Neural Inform. Process. Syst. 33, pp. 1877–1901.
- A. Bulat, B. Martinez, and G. Tzimiropoulos (2020). BATS: binary architecture search. In Eur. Conf. Comput. Vis., pp. 309–325.
- H. Cai, L. Zhu, and S. Han (2018). ProxylessNAS: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332.
- M. Courbariaux, Y. Bengio, and J. David (2015). BinaryConnect: training deep neural networks with binary weights during propagations. Adv. Neural Inform. Process. Syst. 28.
- M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio (2016). Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
- S. Darabi, M. Belbahri, M. Courbariaux, and V. P. Nia (2018). BNN+: improved binary network training.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
- R. Ding, T. Chin, Z. Liu, and D. Marculescu (2019). Regularizing activation distribution for training binarized deep networks. In IEEE/CVF Conf. Comput. Vis. Pattern Recog., pp. 11408–11417.
- S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha (2019). Learned step size quantization. In Int. Conf. Learn. Represent.
- S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha (2020). Learned step size quantization. Int. Conf. Learn. Represent.
- M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010). The PASCAL Visual Object Classes (VOC) challenge. Int. J. Comput. Vis. 88, pp. 303–338.
- J. Feng (2021). Bolt. https://github.com/huawei-noah/bolt
- R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan (2019). Differentiable soft quantization: bridging full-precision and low-bit neural networks. In IEEE Int. Conf. Comput. Vis., pp. 4852–4861.
- K. He, X. Zhang, S. Ren, and J. Sun (2015). Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In IEEE Int. Conf. Comput. Vis., pp. 1026–1034.
- K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In IEEE/CVF Conf. Comput. Vis. Pattern Recog.
- Y. He and L. Xiao (2023). Structured pruning for deep convolutional neural networks: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 46(5), pp. 2900–2919.
- G. Hinton, O. Vinyals, and J. Dean (2014). Distilling the knowledge in a neural network. In Adv. Neural Inform. Process. Syst. Worksh.
- I. Hubara, Y. Nahshan, Y. Hanani, R. Banner, and D. Soudry (2021). Accurate post training quantization with small calibration sets. In Int. Conf. Mach. Learn.
- E. Jang, S. Gu, and B. Poole (2016). Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144.
- L. Jin, J. Ma, Z. Liu, A. Gromov, A. Defazio, and L. Xiao (2025). PARQ: piecewise-affine regularized quantization. arXiv preprint arXiv:2503.15748.
- A. Krizhevsky, G. Hinton, et al. (2009). Learning multiple layers of features from tiny images.
- M. Lin, R. Ji, Z. Xu, B. Zhang, Y. Wang, Y. Wu, F. Huang, and C. Lin (2020). Rotated binary neural network. Adv. Neural Inform. Process. Syst. 33, pp. 7474–7485.
- B. Liu, H. Huang, L. Yang, Y. Li, G. Guo, X. Cao, and B. Zhang (2025). Efficient low-bit quantization with adaptive scales for multi-task co-training. In Int. Conf. Learn. Represent.
- C. Liu, W. Ding, X. Xia, B. Zhang, J. Gu, J. Liu, R. Ji, and D. Doermann (2019). Circulant binary convolutional networks: enhancing the performance of 1-bit DCNNs with circulant back propagation. In IEEE/CVF Conf. Comput. Vis. Pattern Recog., pp. 2691–2699.
- H. Liu, K. Simonyan, and Y. Yang (2018a). DARTS: differentiable architecture search. arXiv preprint arXiv:1806.09055.
- L. Liu, C. Dong, X. Liu, B. Yu, and J. Gao (2023). Bridging discrete and backpropagation: straight-through and beyond. Adv. Neural Inform. Process. Syst. 36, pp. 12291–12311.
- Z. Liu, B. Oguz, A. Pappu, L. Xiao, S. Yih, M. Li, R. Krishnamoorthi, and Y. Mehdad (2022). BiT: robustly binarized multi-distilled transformer. Adv. Neural Inform. Process. Syst. 35, pp. 14303–14316.
- Z. Liu, Z. Shen, M. Savvides, and K. Cheng (2020). ReActNet: towards precise binary neural network with generalized activation functions. In Eur. Conf. Comput. Vis., pp. 143–159.
- Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K. Cheng (2018b). Bi-Real Net: enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In Eur. Conf. Comput. Vis., pp. 722–737.
- H. Qin, Y. Ding, M. Zhang, Y. Qinghua, A. Liu, Q. Dang, Z. Liu, and X. Liu (2022). BiBERT: accurate fully binarized BERT. In Int. Conf. Learn. Represent.
- H. Qin, R. Gong, X. Liu, M. Shen, Z. Wei, F. Yu, and J. Song (2020). Forward and backward information retention for accurate binary neural networks. In IEEE/CVF Conf. Comput. Vis. Pattern Recog., pp. 2250–2259.
- M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016). XNOR-Net: ImageNet classification using binary convolutional neural networks. In Eur. Conf. Comput. Vis., pp. 525–542.
- S. Ren, K. He, R. Girshick, and J. Sun (2016). Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell.
- D. J. Rezende, S. Mohamed, and D. Wierstra (2014). Stochastic backpropagation and approximate inference in deep generative models. In Int. Conf. Mach. Learn., pp. 1278–1286.
- F. Rosenblatt (1957). The perceptron, a perceiving and recognizing automaton (Project PARA). Cornell Aeronautical Laboratory.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015). ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, pp. 211–252.
- J. Schulman, N. Heess, T. Weber, and P. Abbeel (2015). Gradient estimation using stochastic computation graphs. Adv. Neural Inform. Process. Syst. 28.
- K. Simonyan and A. Zisserman (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- P. Stock, A. Fan, B. Graham, E. Grave, R. Gribonval, H. Jegou, and A. Joulin (2021). Training with quantization noise for extreme model compression. In Int. Conf. Learn. Represent.
- R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour (1999). Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inform. Process. Syst. 12.
- A. Vaswani (2017). Attention is all you need. Adv. Neural Inform. Process. Syst.
- D. Wan, F. Shen, L. Liu, F. Zhu, J. Qin, L. Shao, and H. T. Shen (2018). TBN: convolutional neural network with ternary inputs and binary weights. In Eur. Conf. Comput. Vis., pp. 315–332.
- A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018). GLUE: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
- L. Wang, X. Dong, Y. Wang, L. Liu, W. An, and Y. Guo (2022). Learnable lookup table for neural network quantization. In IEEE/CVF Conf. Comput. Vis. Pattern Recog.
- Z. Wang, Z. Wu, J. Lu, and J. Zhou (2020). BiDet: an efficient binarized object detector. In IEEE/CVF Conf. Comput. Vis. Pattern Recog., pp. 2049–2058.
- S. Xie, H. Zheng, C. Liu, and L. Lin (2018). SNAS: stochastic neural architecture search. arXiv preprint arXiv:1812.09926.
- S. Xu, Y. Li, M. Lin, P. Gao, G. Guo, J. Lü, and B. Zhang (2023). Q-DETR: an efficient low-bit quantized detection transformer. In IEEE/CVF Conf. Comput. Vis. Pattern Recog.
- S. Xu, Y. Li, T. Wang, T. Ma, B. Zhang, P. Gao, Y. Qiao, J. Lü, and G. Guo (2022a). Recurrent bilinear optimization for binary neural networks. In Eur. Conf. Comput. Vis., pp. 19–35.
- S\. Xu, Y\. Li, B\. Zeng, T\. Ma, B\. Zhang, X\. Cao, P\. Gao, and J\. Lü \(2022b\)Ida\-det: an information discrepancy\-aware distillation for 1\-bit detectors\.InEur\. Conf\. Comput\. Vis\.,pp\. 346–361\.Cited by:[§E\.2](https://arxiv.org/html/2605.10989#A5.SS2.p3.2),[§1](https://arxiv.org/html/2605.10989#S1.p2.2),[§6\.3](https://arxiv.org/html/2605.10989#S6.SS3.p1.3)\.
- S\. Xu, J\. Zhao, J\. Lu, B\. Zhang, S\. Han, and D\. Doermann \(2021a\)Layer\-wise searching for 1\-bit detectors\.InIEEE/CVF Conf\. Comput\. Vis\. Pattern Recog\.,pp\. 5682–5691\.Cited by:[§6\.3](https://arxiv.org/html/2605.10989#S6.SS3.p1.3)\.
- Y\. Xu, K\. Han, C\. Xu, Y\. Tang, C\. Xu, and Y\. Wang \(2021b\)Learning frequency domain approximation for binary neural networks\.Adv\. Neural Inform\. Process\. Syst\.34,pp\. 25553–25565\.Cited by:[§1](https://arxiv.org/html/2605.10989#S1.p2.2),[§2\.1](https://arxiv.org/html/2605.10989#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2605.10989#S2.SS2.p1.1)\.
- Z\. Xu, M\. Lin, J\. Liu, J\. Chen, L\. Shao, Y\. Gao, Y\. Tian, and R\. Ji \(2021c\)Recu: reviving the dead weights in binary neural networks\.InIEEE Int\. Conf\. Comput\. Vis\.,pp\. 5198–5208\.Cited by:[§E\.2](https://arxiv.org/html/2605.10989#A5.SS2.p1.2),[§1](https://arxiv.org/html/2605.10989#S1.p2.2),[§2\.2](https://arxiv.org/html/2605.10989#S2.SS2.p1.1),[§6\.2](https://arxiv.org/html/2605.10989#S6.SS2.p1.1),[§6\.2](https://arxiv.org/html/2605.10989#S6.SS2.p3.1)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei,et al\.\(2024\)Qwen2\. 5 technical report\.arXiv preprint arXiv:2412\.15115\.Cited by:[§1](https://arxiv.org/html/2605.10989#S1.p1.1)\.
- Z\. Yang, Y\. Wang, K\. Han, C\. Xu, C\. Xu, D\. Tao, and C\. Xu \(2020\)Searching for low\-bit weights in quantized neural networks\.Adv\. Neural Inform\. Process\. Syst\.33,pp\. 4091–4102\.Cited by:[§2\.2](https://arxiv.org/html/2605.10989#S2.SS2.p1.1),[§6\.2](https://arxiv.org/html/2605.10989#S6.SS2.p1.1)\.
- X\. Yu, T\. Liu, X\. Wang, and D\. Tao \(2017\)On compressing deep models by low rank and sparse decomposition\.InIEEE/CVF Conf\. Comput\. Vis\. Pattern Recog\.,pp\. 7370–7379\.Cited by:[§1](https://arxiv.org/html/2605.10989#S1.p1.1)\.
- J\. Zhao, S\. Xu, B\. Zhang, J\. Gu, D\. Doermann, and G\. Guo \(2022\)Towards compact 1\-bit cnns via bayesian learning\.Int\. J\. Comput\. Vis\.,pp\. 1–25\.Cited by:[§2\.1](https://arxiv.org/html/2605.10989#S2.SS1.p1.1),[§6\.2](https://arxiv.org/html/2605.10989#S6.SS2.p2.1)\.
- S\. Zhou, Y\. Wu, Z\. Ni, X\. Zhou, H\. Wen, and Y\. Zou \(2016\)Dorefa\-net: training low bitwidth convolutional neural networks with low bitwidth gradients\.arXiv preprint arXiv:1606\.06160\.Cited by:[§6\.2](https://arxiv.org/html/2605.10989#S6.SS2.p1.1),[§6\.2](https://arxiv.org/html/2605.10989#S6.SS2.p2.1),[§6\.3](https://arxiv.org/html/2605.10989#S6.SS3.p1.3)\.
## Appendix A Training Procedure
The complete training procedure is summarized in Algorithm [A](https://arxiv.org/html/2605.10989#alg1).
**Algorithm A** Layer-wise Training with DPGC & AGS

**Input:** layer input $x^{(l)}$, target $y$, learning rate $\alpha$, base scaling coefficient $\eta$, loss function $\mathcal{F}$
**Output:** trained binarized weights $\{W_b^{(l)}\}_{l=1}^{L}$

1: Initialize layer parameters $W_b^{(l)}, W_a^{(l)}$ {binary and auxiliary full-precision paths}
2: Initialize $\lambda^{(l,0)} \leftarrow 1/\sqrt{|W_a^{(l)}|}$ {reciprocal square root of the auxiliary weight cardinality $|W_a^{(l)}|$}
3: **for** iteration $t = 1$ **to** $T$ **do**
4:  **Forward propagation (layer $l$):**
5:  Compute the binary path: $f_b^{(l)} \leftarrow W_b^{(l)} \odot \mathrm{Sign}(x^{(l)})$
6:  Compute the auxiliary path: $f_a^{(l)} \leftarrow W_a^{(l)} \odot x^{(l)}$
7:  Generate the compensator: $f_{ao}^{(l)} \leftarrow \lambda^{(l,t-1)} \odot f_a^{(l)}$ {previous scaling factor}
8:  Synthesize the output: $\mathrm{out}^{(l)} \leftarrow f_b^{(l)} - f_{ao}^{(l)}\!\downarrow + f_{ao}^{(l)}$ {forward synthesis via gradient-decoupled decomposition; the gradient is detached at $\downarrow$}
9:  **Loss computation:**
10: Calculate the loss $\mathcal{L}$
11: **Backward propagation (layer $l$):**
12: Compute the main-branch gradients:
13: $g_b \leftarrow \frac{\partial \mathcal{L}}{\partial f_b}\frac{\partial f_b}{\partial x}\big|_{\mathrm{STE}}$, $\quad g_{wb}^{(l)} \leftarrow \frac{\partial \mathcal{L}}{\partial W_b^{(l)}}\big|_{\mathrm{STE}}$
14: Compute the auxiliary-branch gradients:
15: $g_a \leftarrow \frac{\partial \mathcal{L}}{\partial f_a}\frac{\partial f_a}{\partial x}$, $\quad g_{wa}^{(l)} \leftarrow \lambda^{(l,t-1)} \odot \frac{\partial \mathcal{L}}{\partial W_a^{(l)}}$
16: Then we have $\frac{\partial \mathcal{L}}{\partial x^{(l)}} = g_b + \lambda g_a$
17: **Adaptive Gradient Scaler:**
18: Calculate the norm ratio: $r \leftarrow \|g_b\|_2 / (\|g_a\|_2 + \epsilon)$
19: Update the scaling factor: $\lambda^{(l,t)} \leftarrow \eta \cdot r$
20: **Parameter update (layer $l$):**
21: $W_b^{(l)} \leftarrow W_b^{(l)} - \alpha \cdot g_{wb}^{(l)}$
22: $W_a^{(l)} \leftarrow W_a^{(l)} - \alpha \cdot g_{wa}^{(l)}$
23: **end for**
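For readers who prefer code, the following is a minimal PyTorch sketch of one DPGC linear layer with the detach-based output synthesis (step 8) and an AGS update (steps 18–19). The module name, the initialization, and the way the branch gradients are handed to the AGS update are our own illustrative choices, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """sign(x) forward; clipped straight-through gradient backward."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)  # zero outside [-1, 1]

class DPGCLinear(nn.Module):
    """Binary layer with a parallel full-precision auxiliary branch (DPGC)."""
    def __init__(self, in_f, out_f, eta=0.01, eps=1e-8):
        super().__init__()
        self.w_b = nn.Parameter(torch.randn(out_f, in_f) * 0.1)  # latent binary-path weights
        self.w_a = nn.Parameter(torch.randn(out_f, in_f) * 0.1)  # auxiliary FP weights
        self.eta, self.eps = eta, eps
        # lambda^{(l,0)} = 1 / sqrt(|W_a|), step 2 of Algorithm A
        self.register_buffer("lam", torch.tensor(self.w_a.numel() ** -0.5))

    def forward(self, x):
        f_b = F.linear(SignSTE.apply(x), SignSTE.apply(self.w_b))  # binary path
        f_ao = self.lam * F.linear(x, self.w_a)                    # scaled auxiliary path
        # out = f_b - detach(f_ao) + f_ao: the forward value equals f_b, so the
        # auxiliary branch never changes the output, but gradients flow through
        # both the STE path and the compensator (step 8).
        return f_b - f_ao.detach() + f_ao

    @torch.no_grad()
    def ags_step(self, g_b, g_a):
        # AGS: lambda <- eta * ||g_b||_2 / (||g_a||_2 + eps), steps 18-19.
        self.lam.copy_(self.eta * g_b.norm() / (g_a.norm() + self.eps))
```

In practice the branch gradients $g_b$ and $g_a$ needed by `ags_step` can be captured during the backward pass, e.g., with tensor hooks registered on $f_b$ and $f_{ao}$.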
## Appendix B Theoretical Foundations and Proofs
### B.1 Assumptions Underlying the Moment Model and Theorem [5.3](https://arxiv.org/html/2605.10989#S5.Thmtheorem3)
In this subsection we make explicit the assumptions underlying Definition [5.1](https://arxiv.org/html/2605.10989#S5.Thmtheorem1) and Theorem [5.3](https://arxiv.org/html/2605.10989#S5.Thmtheorem3). Intuitively, since the derivative of sign is zero almost everywhere and corresponds to a Dirac delta distribution at the origin in the sense of distributions, it is natural to view $g^*$ as the gradient induced by an "ideal" surrogate that captures this behavior, while practical rules such as STE provide tractable but biased approximations.
###### Assumption B.1 (Ideal reference gradient from a surrogate family).
Fix a binarization node whose pre-binarization activation is denoted by $x \in \mathbb{R}^d$. Consider a family $\mathcal{S}$ of smooth surrogate functions $s:\mathbb{R}\to\mathbb{R}$ that approximate the non-differentiable sign function used in binarization. For each $s \in \mathcal{S}$, let $\ell_s(\cdot\,; W)$ denote the population loss of the corresponding surrogate network, and define the population risk as a function of this node input:

$$\mathcal{L}_s(x; W) := \mathbb{E}_{\xi}\big[\ell_s(\xi; W)\big],$$

with the backpropagated gradient taken w.r.t. $x$. We assume that there exists a surrogate $s^* \in \mathcal{S}$ that attains the smallest population loss within this family, and define the associated reference ("better") gradient at the current parameter $W$ as

$$g^* := \nabla_x \mathcal{L}_{s^*}(x; W).$$

This $g^*$ is not observable in practice and we never require a closed-form expression for it; it serves as an ideal target that practical surrogate gradients aim to approximate. We assume that $g^*$ has finite second moments.
###### Assumption B.2 (Empirical gradients as random vectors).
At a fixed parameter $W$, the empirical gradients $g_b, g_a \in \mathbb{R}^d$ obtained from a single mini-batch are modelled as random vectors whose randomness comes from mini-batch sampling, data noise, and the stochastic optimization procedure. All expectations $\mathbb{E}[\cdot]$ and variances $\mathrm{Var}(\cdot)$ in our analysis are taken with respect to this randomness, and the empirical gradients have finite second moments.
###### Assumption B.3 (Directional consistency and bounded relative bias).
We acknowledge that STE-based gradients may suffer magnitude distortion due to clipping. We assume that the baseline surrogate gradient $g_b$ remains statistically correlated with the descent direction of the ideal reference gradient $g^*$, i.e., $\mathbb{E}[\cos(g_b, g^*)] \geq c > 0$ during the main training phase. Moreover, we assume the *relative bias ratio* $\beta := \|\delta_b\|_2 / \|\mu_b\|_2$ is bounded and varies slowly after a short transient (layer-wise), so it can be treated as approximately constant when deriving practical scaling rules.
###### Assumption B.4 (Isotropic, homoscedastic gradient noise).
We decompose the empirical gradients as

$$g_b = \mathbb{E}[g_b] + \varepsilon_b, \qquad g_a = \mathbb{E}[g_a] + \varepsilon_a,$$

where the noise terms satisfy $\mathbb{E}[\varepsilon_b] = \mathbb{E}[\varepsilon_a] = 0$. We assume an isotropic, homoscedastic noise model: there exist scalars $\sigma_b^2, \sigma_a^2 \geq 0$ such that

$$\mathrm{Var}(g_b) = \mathbb{E}[\varepsilon_b \varepsilon_b^\top] = \sigma_b^2 I_d, \qquad \mathrm{Var}(g_a) = \mathbb{E}[\varepsilon_a \varepsilon_a^\top] = \sigma_a^2 I_d,$$

where $I_d$ is the $d$-dimensional identity matrix.
### B.2 Proof of Theorem [5.3](https://arxiv.org/html/2605.10989#S5.Thmtheorem3)
###### Proof.
Expand the error expectation:

$$\mathbb{E}\big[\|\tilde{g} - g^*\|_2^2\big] = \mathbb{E}\big[\|g_b + \lambda g_a - g^*\|_2^2\big] = \|\mathbb{E}[g_b] + \lambda\mathbb{E}[g_a] - g^*\|_2^2 + \mathrm{tr}(\mathrm{Var}(g_b)) + \lambda^2\,\mathrm{tr}(\mathrm{Var}(g_a)) + 2\lambda\,\mathbb{E}\big[(g_b - \mathbb{E}[g_b])^\top (g_a - \mathbb{E}[g_a])\big]. \tag{B.1}$$

By the dot-product uncorrelated assumption in Theorem [5.3](https://arxiv.org/html/2605.10989#S5.Thmtheorem3), the last term in (B.1) vanishes. Using $\mathbb{E}[g_b] = g^* - \delta_b$, we obtain

$$\mathbb{E}\big[\|\tilde{g} - g^*\|_2^2\big] = \|\lambda\mathbb{E}[g_a] - \delta_b\|_2^2 + \mathrm{tr}(\mathrm{Var}(g_b)) + \lambda^2\,\mathrm{tr}(\mathrm{Var}(g_a)). \tag{B.2}$$

Differentiating w.r.t. $\lambda$ gives

$$\nabla_\lambda\, \mathbb{E}\big[\|\tilde{g} - g^*\|_2^2\big] = -2\langle\delta_b, \mu_a\rangle + 2\lambda\big(\|\mathbb{E}[g_a]\|_2^2 + \mathrm{tr}(\mathrm{Var}(g_a))\big), \tag{B.3}$$

which yields the minimizer

$$\lambda^* = \frac{\langle\delta_b, \mathbb{E}[g_a]\rangle}{\|\mathbb{E}[g_a]\|_2^2 + \mathrm{tr}(\mathrm{Var}(g_a))}. \tag{B.4}$$

In particular, under $\mathrm{Var}(g_a) = \sigma_a^2 I_d$, we have $\mathrm{tr}(\mathrm{Var}(g_a)) = d\sigma_a^2$.

**Practical approximation via dynamic analysis.** Eq. (B.4) can be rewritten as

$$\lambda^* = \frac{\|\delta_b\|_2}{\|\mu_a\|_2} \cdot \frac{\cos\theta}{1+\rho}, \qquad \cos\theta := \frac{\langle\delta_b, \mu_a\rangle}{\|\delta_b\|_2\|\mu_a\|_2}, \qquad \rho := \frac{d\sigma_a^2}{\|\mu_a\|_2^2}. \tag{B.5}$$

To obtain a computable rule, we parameterize the unobserved bias magnitude via the *relative bias ratio* $\beta := \|\delta_b\|_2 / \|\mu_b\|_2$ (with $\mu_b = \mathbb{E}[g_b]$) and treat $\beta \approx \kappa$ during the main training phase. Plugging $\|\delta_b\|_2 \approx \kappa\|\mu_b\|_2$ into (B.5) yields

$$\lambda^* \approx \eta\, \frac{\|\mu_b\|_2}{\|\mu_a\|_2}, \qquad \eta := \frac{\kappa\, c_\theta}{1+\rho}. \tag{B.6}$$

Finally, replacing population quantities by mini-batch estimates gives

$$\lambda_{\mathrm{AGS}} \approx \eta\, \frac{\|g_b\|_2}{\|g_a\|_2 + \epsilon}, \tag{B.7}$$

where $\epsilon > 0$ is a numerical stabilizer. Empirically, the inter-branch cosine similarity stays high over most of training and drops only near convergence (Figure [F](https://arxiv.org/html/2605.10989#A3.F6)), while $\lambda$ quickly reaches a plateau in the toy study (Figure [A](https://arxiv.org/html/2605.10989#A2.F1)), supporting the use of a slowly varying $\eta$ in practice.
∎
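As a quick sanity check on (B.4), one can simulate the moment model of Assumption B.4 with synthetic moments and verify that the closed-form minimizer matches a grid search. The NumPy sketch below uses arbitrary synthetic values for $g^*$, $\delta_b$, $\mu_a$, and the noise scale; it is illustrative, not part of the training pipeline.

```python
# Numerical sanity check of the minimizer in (B.4) under Assumption B.4.
import numpy as np

rng = np.random.default_rng(0)
d = 64
g_star = rng.normal(size=d)                 # ideal reference gradient (synthetic)
delta_b = 0.3 * rng.normal(size=d)          # bias of the baseline gradient
mu_b = g_star - delta_b                     # E[g_b] = g* - delta_b
mu_a = delta_b + 0.1 * rng.normal(size=d)   # auxiliary mean, roughly aligned with delta_b
sigma_b, sigma_a = 0.05, 0.05

def expected_err(lam, n=20000):
    """Monte-Carlo estimate of E[||g_b + lam * g_a - g*||^2]."""
    g_b = mu_b + sigma_b * rng.normal(size=(n, d))
    g_a = mu_a + sigma_a * rng.normal(size=(n, d))
    return np.mean(np.sum((g_b + lam * g_a - g_star) ** 2, axis=1))

lam_star = delta_b @ mu_a / (mu_a @ mu_a + d * sigma_a**2)  # Eq. (B.4)
lams = np.linspace(lam_star - 0.5, lam_star + 0.5, 21)
errs = [expected_err(l) for l in lams]
print("closed-form lam* =", lam_star)
print("empirical argmin =", lams[int(np.argmin(errs))])  # matches lam* up to grid/MC resolution
```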
Figure A: Comparison of five methods (FP, STE, STE+SURGE, Bi-Real, Bi-Real+SURGE) on the Beale function: (a) trajectories, (b) loss, (c) distance to optimum, (d) SURGE's $\lambda$ adaptation of the hidden layer.

Figure B: Comparison of three methods (FP, STE, STE+SURGE) on the Beale function: (a) trajectories, (b) loss, (c) distance to optimum, (d) SURGE's $\lambda$ adaptation of the hidden layer.
## Appendix C SURGE for Toy Problem and Visualizations
We implement a simple yet illustrative toy model to optimize the non-convex Beale function. The model architecture consists of an input layer, a hidden layer with ReLU activation, and an output layer that produces 2D coordinates. In our experiments, we binarize the first and second linear layers.
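A minimal sketch of this kind of toy setup is given below; the hidden width, the fixed latent input, the optimizer, and the step count are our own assumptions, and the plain `nn.Linear` layers would be swapped for the DPGC layers sketched in Appendix A for the SURGE runs.

```python
# Toy setup sketch: a tiny MLP whose 2-D output is driven toward the Beale minimum.
import torch

def beale(p):
    """Beale function; global minimum 0 at (3, 0.5)."""
    x, y = p[..., 0], p[..., 1]
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)

torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(2, 16),   # replace with DPGCLinear(2, 16) for SURGE runs
    torch.nn.ReLU(),
    torch.nn.Linear(16, 2),   # replace with DPGCLinear(16, 2) for SURGE runs
)
z = torch.randn(1, 2)  # fixed latent input
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for step in range(2000):
    loss = beale(net(z)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item(), net(z))  # loss should approach 0 near the optimum (3, 0.5)
```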
**Convergence and loss curves of the toy model.** As illustrated in Figure [A](https://arxiv.org/html/2605.10989#A2.F1), we compare five training methods under identical initialization: FP (full-precision network), STE, STE+SURGE, Bi-Real, and Bi-Real+SURGE. We also provide a focused three-method subset (FP, STE, STE+SURGE) in Figure [B](https://arxiv.org/html/2605.10989#A2.F2) for clearer visualization. Specifically, for each binarized layer (based on STE / Bi-Real), SURGE adds a parallel full-precision layer and merges their outputs. We plot the optimization trajectory, the loss curve, the distance to the optimum, and the adaptive scaling factor ($\lambda$). SURGE achieves better convergence, yielding lower loss than the control group without SURGE integration, and the scale factor $\lambda$ also converges.
**Parameter evolution of the toy model.** As illustrated in Figure [C](https://arxiv.org/html/2605.10989#A3.F3), we provide a parameter-evolution plot tracking the Frobenius norms of the binary and auxiliary full-precision weights, as well as the learnable scaling factors $\alpha_w$ and $\alpha_a$. We also provide a focused three-method subset (FP, STE, STE+SURGE) in Figure [D](https://arxiv.org/html/2605.10989#A3.F4) for clearer visualization. As the weight-norm plots show, compared to the FP model, all binarized variants lie in a narrow band, but variants with SURGE maintain a slightly larger and more stable weight norm after the initial transient. As the scaling-factor plots show, SURGE consistently pushes these scales slightly higher.
**Weight distribution.** As illustrated in Figure [E](https://arxiv.org/html/2605.10989#A3.F5), we show the weight distribution of one layer of a ResNet-18 trained on CIFAR-10. Fewer weights concentrate around zero with SURGE than without it. Binarization without SURGE is therefore less robust to small disturbances (Xu et al., [2022a](https://arxiv.org/html/2605.10989#bib.bib16)), since $\mathrm{sign}(w)$ flips more frequently for near-zero weights.
**Cosine similarity.** As shown in Figure [F](https://arxiv.org/html/2605.10989#A3.F6), we plot the cosine similarity (on ResNet-20 trained on CIFAR-10) between the weight gradients of the main and auxiliary branches, averaged over layers and mini-batches. The cosine similarity is relatively high: during the main training phase it slowly decreases but remains in the range 0.8–0.9. Towards the very end of training, when the model has already entered a small local basin and gradients become very small, the cosine similarity drops more sharply. This is expected: the full-precision branch can still perform fine-grained adjustments inside the basin, whereas the binary branch is constrained by quantization, so the auxiliary branch likely compensates in different directions.
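A simple way to produce such a curve is to compare the per-layer weight gradients of the two branches after each backward pass. The snippet below is our own instrumentation sketch, assuming the `DPGCLinear` module from the Appendix A sketch (with weight attributes `w_b` and `w_a`).

```python
# Log the inter-branch weight-gradient cosine similarity, averaged over layers.
import torch
import torch.nn.functional as F

def branch_grad_cosine(model):
    sims = []
    for m in model.modules():
        if hasattr(m, "w_b") and m.w_b.grad is not None and m.w_a.grad is not None:
            sims.append(F.cosine_similarity(
                m.w_b.grad.flatten(), m.w_a.grad.flatten(), dim=0))
    # Call after loss.backward(); average over mini-batches outside this function.
    return torch.stack(sims).mean()
```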
**Noise contrast experiment.** As shown in Figure [H](https://arxiv.org/html/2605.10989#A3.F8), we conducted experiments with added noise on the toy model to demonstrate more intuitively that our compensation is not merely noise. The convergence process with added noise becomes significantly more volatile, and the final convergence is worse.
Figure C: Comparison of five methods (FP, STE, STE+SURGE, Bi-Real, Bi-Real+SURGE) on the Beale function: (a) weight norm of the main branch, (b) weight norm of the auxiliary branch, (c) scaling factor of weights, (d) scaling factor of activations.

Figure D: Comparison of three methods (FP, STE, STE+SURGE) on the Beale function: (a) weight norm of the main branch, (b) weight norm of the auxiliary branch, (c) scaling factor of weights, (d) scaling factor of activations.

Figure E: Weight-distribution comparison of a ResNet-18 layer trained on CIFAR-10. (left) Baseline method; (right) SURGE (ours).

Figure F: Cosine similarity (on ResNet-20 trained on CIFAR-10) between the weight gradients of the main and auxiliary branches, averaged over layers and mini-batches.

Figure G: Cosine similarity (on ResNet-20 trained on CIFAR-10) between the input gradients of the main and auxiliary branches, averaged over layers and mini-batches.

Figure H: Comparison of four methods (FP, STE, STE+Noise, STE+SURGE) on the Beale function: (a) trajectories, (b) loss, (c) distance to optimum, (d) SURGE's $\lambda$ adaptation of the hidden layer.
## Appendix D Dataset, Data Augmentation, and Evaluation Metrics
We evaluate on two standard image classification benchmarks, one object detection benchmark, and one suite of language understanding tasks: CIFAR-10 (Krizhevsky et al., [2009](https://arxiv.org/html/2605.10989#bib.bib14)) (50k training and 10k test 32×32 images, with random cropping and flipping), ImageNet-1K (Russakovsky et al., [2015](https://arxiv.org/html/2605.10989#bib.bib12)) (1.28M training and 50k validation images at 224×224 resolution via center crop), PASCAL VOC (Everingham et al., [2010](https://arxiv.org/html/2605.10989#bib.bib13)) (around 16k training and 5k validation images across 20 classes, with multi-scale resizing to 1500×900, 1000×600, and 666×400 and random flipping at a 0.5 ratio), and GLUE (Wang et al., [2018](https://arxiv.org/html/2605.10989#bib.bib126)) (covering CoLA, SST-2, MRPC, STS-B, QQP, MNLI (m/mm), QNLI, and RTE, without data augmentation).
We evaluate models using Top-1 and Top-5 accuracy for image classification, mean Average Precision at IoU=0.5 (mAP@0.5) for object detection, and task-specific GLUE metrics following the official protocol: Matthews correlation coefficient (MCC) for CoLA; accuracy for SST-2, MNLI (matched/mismatched), QNLI, and RTE; F1/accuracy for MRPC and QQP; and Spearman correlation for STS-B.
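For reference, these task-specific GLUE metrics can be computed with standard scipy/scikit-learn routines; the mapping below is a sketch of the protocol described above, with our own task keys.

```python
# Sketch of the task-specific GLUE metrics listed above.
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

def glue_metric(task, preds, labels):
    if task == "cola":
        return matthews_corrcoef(labels, preds)        # Matthews correlation
    if task == "stsb":
        return spearmanr(preds, labels).correlation    # Spearman correlation
    if task in ("mrpc", "qqp"):                        # reported as F1 / accuracy
        return f1_score(labels, preds), accuracy_score(labels, preds)
    return accuracy_score(labels, preds)               # sst2, mnli, qnli, rte
```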
## Appendix E Implementation Details
### E.1 Model Details
**CIFAR-10.** On CIFAR-10, we evaluate our method with ResNet-18/20 (He et al., [2016](https://arxiv.org/html/2605.10989#bib.bib112)) and VGG-Small (Simonyan and Zisserman, [2014](https://arxiv.org/html/2605.10989#bib.bib15)). We binarize all convolutional and fully-connected layers except the first and last ones.

**ImageNet-1K.** On ImageNet-1K, we binarize ResNet-18 and keep the first layer, the shortcuts, and the last layer real-valued, following (Liu et al., [2018b](https://arxiv.org/html/2605.10989#bib.bib9)). We adopt the same model modification scheme as described in (Liu et al., [2020](https://arxiv.org/html/2605.10989#bib.bib17)).

**PASCAL VOC.** On PASCAL VOC, we binarize Faster R-CNN with a ResNet-18 backbone. After implementing 1-bit CNNs, we keep the shortcuts, the first layer, and the last layers (the 1×1 convolution of the RPN and the FC layer of the bbox head) real-valued. Following (Wang et al., [2020](https://arxiv.org/html/2605.10989#bib.bib25)), we modify the ResNet-18 network with an extra shortcut and PReLU (He et al., [2015](https://arxiv.org/html/2605.10989#bib.bib29)).

**GLUE.** On GLUE, we evaluate our method with BERT-base (Devlin et al., [2019](https://arxiv.org/html/2605.10989#bib.bib127)). Following previous work, we binarize the word embedding layer and the MHA and FFN modules in the transformer layers, but keep the classifier, the position embedding layer, and the token-type embedding layer in full precision (Qin et al., [2022](https://arxiv.org/html/2605.10989#bib.bib115); Liu et al., [2022](https://arxiv.org/html/2605.10989#bib.bib128)).
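A sketch of this layer-selection convention is shown below: it walks a model's Conv/Linear layers in registration order and swaps all but the first and last for a 1-bit replacement. The helper name and the `make_binary` factory are our own illustrative choices, not the paper's code.

```python
# Binarize all Conv/Linear layers except the first and last.
import torch.nn as nn

def binarize_model(model, make_binary):
    """Replace eligible layers in place; `make_binary` maps a layer to its 1-bit version."""
    eligible = [(name, m) for name, m in model.named_modules()
                if isinstance(m, (nn.Conv2d, nn.Linear))]
    for name, m in eligible[1:-1]:  # keep first and last layers real-valued
        parent = model
        *path, leaf = name.split(".")
        for p in path:              # walk down to the layer's parent module
            parent = getattr(parent, p)
        setattr(parent, leaf, make_binary(m))
    return model
```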
### E.2 Training Details
**CIFAR-10.** On CIFAR-10, we train our models from scratch following the setting in (Xu et al., [2021c](https://arxiv.org/html/2605.10989#bib.bib8)); the base scaling coefficient $\eta$ is set to 0.01.

**ImageNet-1K.** On ImageNet-1K, we follow two implementation setups for fair comparison. First, we employ one-stage training on ResNet-18 following the setting in (Xu et al., [2022a](https://arxiv.org/html/2605.10989#bib.bib16)), using Adam as the optimizer with a weight decay of 1e-5. The initial learning rate is set to 5e-4. The model is trained from scratch for 200 epochs, with the learning rate annealed by a cosine schedule. Second, we employ two-stage training following the setting in (Liu et al., [2020](https://arxiv.org/html/2605.10989#bib.bib17)), using Adam as the optimizer. The network is supervised by a real-valued ResNet-34 teacher. In the first stage, the model is trained from scratch with binarized activations and real-valued convolution weights. We then load the state dict from the first stage, and both activations and weights are binarized in the second stage. The initial learning rate is set to 5e-4, the same as in one-stage training, and annealed to 0 by a linear descent scheduler. The base scaling coefficient $\eta$ is set to 0.001.

**PASCAL VOC.** On PASCAL VOC, we use ImageNet to pre-train the backbone of the 1-bit student, following (Liu et al., [2020](https://arxiv.org/html/2605.10989#bib.bib17)). The SGD optimizer is used, and the batch size is set to 4 for Faster R-CNN. We train the model in two stages: only the backbone is binarized in the first stage, and all layers are binarized in the second stage. Each stage lasts 12 epochs. The learning rate is set to 0.004 and decays by a factor of 0.1 at the 9th and 11th epochs, following (Xu et al., [2022b](https://arxiv.org/html/2605.10989#bib.bib26)). The base scaling coefficient $\eta$ is set to 0.001.

**GLUE.** On GLUE, we follow (Liu et al., [2022](https://arxiv.org/html/2605.10989#bib.bib128)) in adopting the experimental setting of (Devlin et al., [2019](https://arxiv.org/html/2605.10989#bib.bib127)). We use Adam as our optimizer and train every quantization method longer on each task to ensure sufficient training: 50 epochs for CoLA; 20 for MRPC, STS-B, and RTE; 10 for SST-2 and QNLI; and 5 for MNLI and QQP. We distill binary models using a full-precision teacher, without the multi-distillation technique.
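The one-stage ImageNet recipe above reduces to a small amount of configuration code; a sketch under the stated hyper-parameters (Adam, lr 5e-4, weight decay 1e-5, 200 epochs, cosine annealing) follows. Exact details of the released training scripts may differ.

```python
# Optimizer/scheduler sketch for the one-stage ImageNet setup described above.
import torch

def make_optimizer_and_scheduler(model, epochs=200):
    opt = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-5)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched  # call sched.step() once per epoch
```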
Furthermore,*we have provided our code*in the supplementary materials, which contains the full implementation of our method and training scripts to facilitate easy replication and future research\.
## Appendix F Additional GLUE Analyses
### F.1 Training under Fixed Wall-Clock Budgets
To further examine whether the improvement of SURGE comes from more effective optimization rather than simply longer training, we additionally evaluate BiT and BiT+SURGE under the same fixed wall-clock budget. Specifically, for each GLUE task, both methods are trained on identical hardware with the same task-specific training duration, and we compare the performance reached within that budget.
As shown in Table [A](https://arxiv.org/html/2605.10989#A6.T1), BiT+SURGE still outperforms the BiT baseline under the same time budget, improving the average GLUE score from 65.2 to 66.7. This indicates that, despite its higher per-step cost, SURGE provides more effective optimization updates within a controlled training-time budget.
Table A: Results under fixed wall-clock budgets on GLUE. For each task, both methods are trained on the same hardware with the same task-specific training duration.

| Method | MNLI-m (10800s) | MNLI-mm (10800s) | QQP (7200s) | QNLI (10800s) | SST-2 (7200s) | CoLA (3600s) | STS-B (1800s) | MRPC (1500s) | RTE (900s) | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| BiT | 72.0 | 72.7 | 82.7 | 81.5 | 85.6 | 23.0 | 45.0 | 74.8 | 57.0 | 65.2 |
| BiT+SURGE | 72.1 | 72.3 | 83.3 | 82.4 | 86.5 | 21.4 | 54.3 | 78.7 | 54.5 | 66.7 |
### F.2 Effect of Explicit Auxiliary-Branch Alignment
SURGE uses an auxiliary full-precision branch to compensate for the truncated first-order gradient caused by binarization, while the detach trick prevents the auxiliary branch from changing the forward output value. The auxiliary branch is therefore designed to provide complementary gradient information, rather than to mimic the binary branch in the forward computation.
We further test a variant that explicitly encourages the auxiliary branch output to align with the binary branch output. As shown in Table [B](https://arxiv.org/html/2605.10989#A6.T2), this explicit alignment does not further improve performance over BiT+SURGE. This suggests that forcing the auxiliary branch to behave too similarly to the binary branch may reduce the diversity of the compensation signal, thereby weakening its ability to correct the STE-induced gradient bias.
Table B: GLUE dev-set results for BiT, BiT+SURGE, and BiT+SURGE+Align. All values are percentages.

| Method | MNLI-m | MNLI-mm | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| BiT | 77.0 | 77.5 | 85.4 | 85.5 | 87.8 | 23.6 | 68.0 | 79.4 | 58.1 | 70.6 |
| BiT+SURGE | 77.3 | 77.5 | 87.1 | 86.2 | 88.6 | 24.1 | 71.7 | 80.6 | 60.6 | 72.0 |
| BiT+SURGE+Align | 77.1 | 77.3 | 85.9 | 85.9 | 87.8 | 23.6 | 71.3 | 79.2 | 58.8 | 71.2 |
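For concreteness, the alignment variant tested above can be expressed as an auxiliary regularizer of the form sketched below; the loss weight `gamma` and the stop-gradient placement are our own assumptions about a reasonable instantiation, not the exact configuration used for Table B.

```python
# Sketch of an explicit alignment term between branch outputs.
import torch.nn.functional as F

def alignment_loss(f_b, f_ao, gamma=0.1):
    # Pull the (scaled) auxiliary output toward the binary output,
    # with a stop-gradient on the binary branch.
    return gamma * F.mse_loss(f_ao, f_b.detach())
```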
## Appendix G Overhead and Deployment Efficiency
SURGE introduces modest additional overhead during training while incurring no extra inference cost, since the auxiliary branch is discarded after training.
### G.1 Training Overhead
**CNNs.** We compare training time (10 epochs) and memory (batch size 256/GPU) among SURGE, Bi-Real Net, ReActNet, and RBONN under one-stage training on ImageNet. As shown in Table [C](https://arxiv.org/html/2605.10989#A7.T3), while the full-precision branch introduces modest overhead (+25% training time and +7.6% memory vs. RBONN), SURGE delivers significant accuracy improvements over other SOTA methods (+0.63% accuracy vs. RBONN).
Table C: Overhead comparison of training time (10 epochs) and memory (batch size 256/GPU). Accuracy is reported after 10 epochs of training. * denotes a simple cost-reducing variant.

| Method | Training Time (min) | GPU Memory (2 GPUs) | Accuracy (%) |
|---|---|---|---|
| Bi-Real Net | 143 | 19153 MiB ×2 | 38.73 |
| ReActNet | 156 | 20923 MiB ×2 | 45.86 |
| RBONN | 160 | 21005 MiB ×2 | 46.65 |
| SURGE | 200 | 22597 MiB ×2 | 47.28 |
| SURGE* | 177 | 22295 MiB ×2 | 47.21 |

**Transformers.** We compare training time (1 epoch) and memory consumption between SURGE and the baseline (BiT) for BERT quantization on each GLUE task. Following BiT, we employ task-specific batch sizes during training to optimize performance across tasks. As shown in Table [D](https://arxiv.org/html/2605.10989#A7.T4), SURGE introduces acceptable additional training overhead (+17% average time, +22% average memory) while delivering significant accuracy improvements (+1.38% average accuracy). Notably, when the baseline employs a larger batch size (32) with substantial memory consumption (12429 MB) on the QQP dataset, SURGE adds only minimal additional overhead (+10% memory).
Table D: Comparison of time, memory, and final accuracy between the baseline and SURGE when training binarized BERT across GLUE tasks.

| Method | CoLA | MNLI (m/mm) | MRPC | QNLI | QQP | RTE | SST-2 | STS-B |
|---|---|---|---|---|---|---|---|---|
| Batch size | 16 | 16 | 8 | 8 | 32 | 8 | 8 | 8 |
| Time (1 epoch)/s, Baseline | 107 | 5025 | 92 | 2590 | 3263 | 61 | 1583 | 144 |
| Time (1 epoch)/s, SURGE | 122 | 6220 | 109 | 3050 | 4110 | 72 | 1755 | 158 |
| Memory (MB), Baseline | 5701 | 8175 | 6051 | 6051 | 12429 | 6051 | 4617 | 6049 |
| Memory (MB), SURGE | 7227 | 9649 | 7475 | 7483 | 13731 | 7475 | 5957 | 7473 |
| Final accuracy, Baseline | 23.56 | 77.05/77.46 | 79.41 | 85.48 | 85.40 | 58.12 | 87.84 | 67.97 |
| Final accuracy, SURGE | 24.11 | 77.27/77.53 | 80.64 | 86.23 | 87.12 | 60.65 | 88.65 | 71.70 |

Current BNN deployments predominantly target edge devices, where inference efficiency is critical. Consequently, state-of-the-art methods in this domain prioritize two key metrics: (1) achievable accuracy under extreme quantization constraints, and (2) real-world inference latency on resource-limited hardware. Training efficiency remains secondary in established BNN research paradigms. Our method introduces modest additional overhead during training and does not impact deployment.
### G.2 Deployment Efficiency
SURGE discards all auxiliary branches after training, maintaining identical resource requirements to standard binary networks while delivering stable accuracy gains. We implement the 1-bit models on an ODROID C4, which has a 2.016 GHz 64-bit quad-core ARM Cortex-A55, and evaluate real inference speed on this real-world mobile device to demonstrate SURGE's deployment efficiency. We leverage the SIMD instruction SSHL on ARM NEON to make the inference library BOLT (Feng, [2021](https://arxiv.org/html/2605.10989#bib.bib130)) compatible with SURGE. We compare SURGE to the real-valued backbone in Table [E](https://arxiv.org/html/2605.10989#A7.T5): SURGE's inference is substantially faster with the highly efficient BOLT library, achieving an acceleration rate of about 4.1× on ResNet-18.
Table E: Deployment efficiency.

| Backbone | Method | #bit (W/A) | Size (MB) | Memory Saving | Latency (ms) | Acceleration Rate |
|---|---|---|---|---|---|---|
| ResNet-18 | Real-valued | 32/32 | 42.7 | – | 276.8 | – |
| ResNet-18 | SURGE | 1/1 | 1.7 | 25.1× | 67.8 | 4.1× |
## Appendix H Limitations and Future Work
While SURGE improves gradient estimation through its dual-path design, it inherently requires retaining the auxiliary full-precision parameters $W_a^{(l)}$ throughout the training phase to enable gradient compensation. This architectural choice introduces two practical considerations: (1) a temporary increase in parameter memory footprint during backward propagation compared to conventional BNN implementations, and (2) additional computational overhead from parallel-path gradient calculations during optimization. In Appendix [G](https://arxiv.org/html/2605.10989#A7), we have conducted experiments quantifying the training overhead of binarizing CNNs and Transformers. Notably, these costs are strictly confined to the training phase. During inference, the auxiliary branches are discarded, restoring the original binary architecture's computational efficiency and memory footprint without any residual overhead.
For architecture, our framework enables exploration of auxiliary structure designs (*e.g.*, efficient structure design, low-rank decomposition, or saliency-aware layer compensation) to reduce computational overhead. Here we propose one way to decrease training overhead: simply replacing the 3×3 auxiliary convolutions with 1×1 kernels (the SURGE* variant in Table [C](https://arxiv.org/html/2605.10989#A7.T3)) reduces training time by 14.4% relative to the RBONN baseline (177 min vs. 200 min, with RBONN at 160 min) and saves 1.5% memory (22295 MB vs. 22597 MB, with RBONN at 21005 MB), while still improving accuracy over other SOTA methods (+0.56% accuracy vs. RBONN). This validates that our framework inherently supports efficient re-engineering, and such architectural explorations constitute promising future directions.