Attribution-Based Neuron Utility for Plasticity Restoration in Deep Networks

arXiv cs.LG Papers

Summary

This paper introduces Gradient Times Difference from Reference (GXD), a theoretically motivated measure of neuron utility for restoring plasticity in deep networks during continual learning. It argues that GXD provides more reliable intervention-cost estimation than existing proxy signals such as activation magnitude.

arXiv:2605.06834v1 Announce Type: new

# Attribution-Based Neuron Utility for Plasticity Restoration in Deep Networks
Source: [https://arxiv.org/html/2605.06834](https://arxiv.org/html/2605.06834)
Patrick Elisii, Lucas Beauchemin, Dawer Jamshed

The Vanguard Group, Inc.

###### Abstract

Continual learning research attempts to conserve two fundamental capabilities: new knowledge acquisition and the preservation of previously acquired knowledge. While knowledge in this case can be measured through performance over an implicit or explicit task space, model plasticity generally concerns adaptability as data distributions evolve. Though much of the literature has focused on catastrophic forgetting, deep networks can also suffer from loss of plasticity, becoming progressively harder to update under continued training. Recent research has identified multiple mechanisms underlying this phenomenon, including neuron saturation, parameter norm growth, and loss of useful curvature directions. Adaptive reset-based interventions, which selectively reinitialize low-utility network parameters, have emerged as practical solutions to restore trainability. Existing utility measures used to guide resets, such as activation magnitude, contribution utility, or gradient-based activity, rely on proxy signals that can become misaligned with the intervention they are meant to guide. In this paper, we introduce gradient times difference from reference (GXD), a theoretically motivated utility measure based on reference-based gradient attribution that estimates the first-order functional cost of replacing a unit. Our results show that utility measures aligned with the functional cost of the reset can make interventions more reliable in settings where existing reset criteria degrade. GXD reframes adaptive resetting as an intervention cost estimation problem, providing a practical path toward more robust continual learning systems.

## 1 Introduction

Continual learning is an increasingly important area of machine learning research, with applications spanning computer vision, time series forecasting, and natural language processing (Wang et al., [2024](https://arxiv.org/html/2605.06834#bib.bib26)). A major challenge of continual learning is the loss of plasticity in deep neural networks, where models trained incrementally on non-stationary data become progressively less able to adapt to new information (Dohare et al., [2024](https://arxiv.org/html/2605.06834#bib.bib5)). This failure mode is distinct from, though often intertwined with, catastrophic forgetting. Rather than only losing previously acquired knowledge, the model also loses the capacity to acquire new knowledge efficiently. Prior work has linked this degradation to dormant or saturated units (Sokar et al., [2023](https://arxiv.org/html/2605.06834#bib.bib24)), pre-activation and target-distribution shift (Lyle et al., [2025](https://arxiv.org/html/2605.06834#bib.bib17)), parameter norm growth and reduced effective rank (Dohare et al., [2024](https://arxiv.org/html/2605.06834#bib.bib5); Lyle et al., [2025](https://arxiv.org/html/2605.06834#bib.bib17)), and loss of useful curvature directions (Lewandowski et al., [2024a](https://arxiv.org/html/2605.06834#bib.bib12)). Because plasticity loss has been shown to arise from multiple interacting mechanisms, adaptive interventions that restore capacity during training have become an important practical direction.

Reset-based methods are one such intervention. These methods periodically reinitialize parts of the network judged to have low utility, injecting fresh capacity while attempting to preserve the current function. ReDo resets neurons with low normalized activity (Sokar et al., [2023](https://arxiv.org/html/2605.06834#bib.bib24)), Continual Backpropagation (CBP) continuously replaces a small fraction of low-utility hidden units (Dohare et al., [2024](https://arxiv.org/html/2605.06834#bib.bib5)), Selective Weight Reinitialization extends this idea to individual weights (Hernandez-Garcia et al., [2025](https://arxiv.org/html/2605.06834#bib.bib8)), and ReGraMa uses gradient information to identify neurons for recycling in reinforcement learning (Liu et al., [2025](https://arxiv.org/html/2605.06834#bib.bib14)). These methods differ not only in how they measure utility, but also in how utility is used to allocate resets.

We focus on the utility problem in rank-based reset methods such as CBP. In these methods, the reset rate fixes how much plasticity is injected, while the utility score determines where that intervention is applied. This separates two questions that are often conflated: what are the frequency and magnitude of resets, and which units should be reset? From this perspective, unit utility is a ranking signal for allocating a fixed reset budget under a plasticity–stability tradeoff. The reset should provide a plasticity benefit while minimizing disruption to the current function; a central role of utility is therefore to estimate the functional consequences of replacing a candidate unit.

Existing reset utilities capture different parts of this tradeoff, but none directly estimates the downstream cost of the reset intervention. Activation-based dormancy scores measure local expression (Sokar et al., [2023](https://arxiv.org/html/2605.06834#bib.bib24)), while CBP's contribution utility adds outgoing weights and a running reference activation (Dohare et al., [2024](https://arxiv.org/html/2605.06834#bib.bib5)). These local proxies can become unreliable when downstream influence is decoupled from local magnitude, as can occur with non-ReLU activations, normalization layers, skip connections, and multi-branch computation (Liu et al., [2025](https://arxiv.org/html/2605.06834#bib.bib14)). Loss-gradient utilities add downstream awareness, but they rank units by their effect on the current loss rather than by the functional perturbation caused by replacing them. For rank-based resets, where the reset budget is fixed, utility should help allocate replacement toward units whose reset preserves trainability without unnecessarily disrupting the current function. We focus on the directly estimable part of this problem: the functional cost of the actual intervention, moving a unit from its current activation to the reset reference used by the reset mechanism.

We propose gradient times difference from reference (GXD), a utility score that estimates this intervention cost. GXD weights a unit's displacement from the reset reference by the sensitivity of a task-relevant output to that displacement. In the CBP setting, the reference is the running activation value used by reset compensation (Dohare et al., [2024](https://arxiv.org/html/2605.06834#bib.bib5)), so GXD estimates the first-order downstream effect of moving a unit from its current activation to the effective value it will take after reset. This connects reset utility to reference-based attribution methods such as DeepLIFT (Shrikumar et al., [2017](https://arxiv.org/html/2605.06834#bib.bib22)) and Integrated Gradients (Sundararajan et al., [2017](https://arxiv.org/html/2605.06834#bib.bib25)), but uses attribution as an online ranking signal for low-cost plasticity injection rather than for post hoc interpretation.

We make three contributions. First, we formulate utility estimation for rank-based reset methods as a plasticity–stability tradeoff and identify downstream reset cost as the key quantity needed to allocate a fixed reset budget. Second, we introduce GXD as a simple reference-relative attribution utility that aligns the score with the reset intervention used by CBP. Third, we evaluate GXD in settings where local utility measures become unreliable. We show that GXD better predicts realized reset shock, improves Continual Backpropagation under smooth and non-ReLU activations, mitigates plasticity loss in networks with layer normalization, and improves feature stability in residual architectures. These results support the view that reset utility should estimate the downstream functional cost of a reset, rather than only local expression, local contribution, or learning signal.

## 2 Related work

Loss of plasticity has been linked to several interacting mechanisms, including dormant or saturated units, pre-activation and target-distribution shift, parameter norm growth, reduced effective rank, and loss of useful curvature directions (Sokar et al., [2023](https://arxiv.org/html/2605.06834#bib.bib24); Lyle et al., [2025](https://arxiv.org/html/2605.06834#bib.bib17); Dohare et al., [2024](https://arxiv.org/html/2605.06834#bib.bib5); Lewandowski et al., [2024a](https://arxiv.org/html/2605.06834#bib.bib12)). Existing mitigations either modify global training dynamics, through normalization, weight decay, regenerative or spectral regularization, and shrink-and-perturb style noise injection (Ash and Adams, [2020](https://arxiv.org/html/2605.06834#bib.bib2); Kumar et al., [2025](https://arxiv.org/html/2605.06834#bib.bib10); Lewandowski et al., [2024b](https://arxiv.org/html/2605.06834#bib.bib13)), or intervene directly on network components through resets and plasticity injection (Nikishin et al., [2022](https://arxiv.org/html/2605.06834#bib.bib19), [2023](https://arxiv.org/html/2605.06834#bib.bib20); Dohare et al., [2021](https://arxiv.org/html/2605.06834#bib.bib4); Sokar et al., [2023](https://arxiv.org/html/2605.06834#bib.bib24); Hernandez-Garcia et al., [2025](https://arxiv.org/html/2605.06834#bib.bib8); Liu et al., [2025](https://arxiv.org/html/2605.06834#bib.bib14)). Our work focuses on this second family, and more specifically on the utility signal used to choose which units a rank-based reset method should replace.
Prior reset utilities rely mainly on activation statistics, activation-weight contribution heuristics, or loss-gradient activity (Dohare et al., [2024](https://arxiv.org/html/2605.06834#bib.bib5); Sokar et al., [2023](https://arxiv.org/html/2605.06834#bib.bib24); Liu et al., [2025](https://arxiv.org/html/2605.06834#bib.bib14)); in contrast, we draw on feature-attribution methods that estimate how internal components affect model outputs, including DeepLIFT, Integrated Gradients, gradient-based attribution, conductance, and efficient internal-neuron importance scores (Shrikumar et al., [2017](https://arxiv.org/html/2605.06834#bib.bib22); Sundararajan et al., [2017](https://arxiv.org/html/2605.06834#bib.bib25); Ancona et al., [2018](https://arxiv.org/html/2605.06834#bib.bib1); Dhamdhere et al., [2019](https://arxiv.org/html/2605.06834#bib.bib3); Shrikumar et al., [2018](https://arxiv.org/html/2605.06834#bib.bib23)). While attribution-based importance scores have been used mainly for interpretation and pruning (Yeom et al., [2021](https://arxiv.org/html/2605.06834#bib.bib28); Yvinec et al., [2022](https://arxiv.org/html/2605.06834#bib.bib29)), we use attribution as an online utility estimator for low-impact reset selection.

## 3 Preliminaries

### 3.1 Continual Backpropagation as rank-based reset

We consider continual learning settings in which a network is trained on a long sequence of changing input distributions or tasks. In such settings, standard backpropagation can cause networks to progressively lose plasticity, meaning that the network becomes less able to adapt to new data as training proceeds. Continual Backpropagation (CBP) addresses this problem by augmenting ordinary gradient-based learning with a continual generate-and-test mechanism that reinitializes a small fraction of mature low-utility hidden units during training (Dohare et al., [2021](https://arxiv.org/html/2605.06834#bib.bib4), [2024](https://arxiv.org/html/2605.06834#bib.bib5)).

CBP is a rank-based generate-and-test reset method: after each training step, it updates a tracked utility for each hidden unit and replaces mature units with the lowest utilities according to a set replacement rate $\rho$. A reset samples new incoming weights and resets the unit's optimizer state and age. After a reset, outgoing weights are set to zero and the bias of each downstream consumer is adjusted by the removed unit's average contribution, $w_{i,k,t}^{(l)}\hat{f}_{l,i,t}$, to reduce the immediate functional effect of removal.

The contribution utility used by Dohare et al. ([2024](https://arxiv.org/html/2605.06834#bib.bib5)) tracks local expression weighted by outgoing connection magnitude, $u_i^{\mathrm{Cont}} \approx \mathbb{E}\big[\,|h_i| \sum_k |w_{i,k}^{\mathrm{out}}|\,\big]$. Another proposed utility, mean-corrected adaptable contribution (Dohare et al., [2023](https://arxiv.org/html/2605.06834#bib.bib6)), replaces $|h_i|$ with displacement from a running activation reference, $|h_i - r_i|$, and includes an adaptation factor inversely proportional to incoming weight magnitude. Full CBP algorithm and pseudocode are given in Appendix [A.1](https://arxiv.org/html/2605.06834#A1.SS1).

These utilities make CBP an ideal testbed for studying reset selection because the intervention is fixed, but different utility scores induce different rankings of which units to replace. We therefore keep the CBP reset mechanism fixed and modify only the utility estimator used to rank eligible neurons.
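As a concrete illustration, the two utilities above can be sketched in a few lines of numpy. The layer shapes, weight values, and the running reference `r` below are toy assumptions, not the paper's implementation.

```python
import numpy as np

def contribution_utility(h, w_out):
    """CBP contribution utility: E[ |h_i| * sum_k |w_ik^out| ] over a batch."""
    return (np.abs(h) * np.abs(w_out).sum(axis=1)).mean(axis=0)

def mean_corrected_utility(h, w_out, r):
    """Mean-corrected variant: displacement |h_i - r_i| from a running
    reference replaces the raw activation magnitude."""
    return (np.abs(h - r) * np.abs(w_out).sum(axis=1)).mean(axis=0)

rng = np.random.default_rng(0)
# Unit 0: large but constant activation; unit 1: small but varying.
h = np.column_stack([np.full(8, 5.0), rng.normal(size=8)])
w_out = np.ones((2, 3))          # outgoing weights of the 2 units
r = h.mean(axis=0)               # running-mean activation reference

u_cont = contribution_utility(h, w_out)     # ranks the constant unit high
u_mc = mean_corrected_utility(h, w_out, r)  # ranks it low: reset is nearly free
```

Under the mean-corrected score, a unit that is large but nearly constant around its reference receives low utility, matching the discussion above.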

### 3.2 Reset utility as a cost–benefit tradeoff

Rank-based reset methods choose which mature units should receive a fixed reset intervention. Resetting a neuron can be beneficial because it can restore future trainability by replacing a unit whose current state limits adaptation. However, the same reset can also be harmful because it removes learned features that currently support the network's represented function. Reset selection can therefore be viewed as a cost–benefit problem that balances reset cost against future trainability.

Let $R_S(\theta_t)$ denote the parameters obtained by applying the reset operator to a set $S$ of mature units at time $t$. Let $\mathcal{D}_t$ denote the current or recent online data distribution and let $z_\theta(x) \in \mathbb{R}^C$ denote the logits. We define the immediate functional cost of resetting $S$ as

$$C_t(S) = \mathbb{E}_{x \sim \mathcal{D}_t}\left[ d\big(z_{\theta_t}(x),\, z_{R_S(\theta_t)}(x)\big) \right], \tag{1}$$

where $d$ is a task-relevant output distance, such as logit distance or KL divergence between predictive distributions. Let $B_t(S)$ denote the expected future plasticity benefit of resetting $S$: the improvement in subsequent adaptation obtained by replacing those units and continuing training on future data. An ideal fixed-budget reset selector would solve

$$S_t^{\star} = \arg\max_{\substack{S \subseteq \mathcal{M}_t \\ |S| = k}} \left[ B_t(S) - \lambda\, C_t(S) \right], \tag{2}$$

where $\mathcal{M}_t$ is the set of mature reset-eligible units, $k$ is the reset budget, and $\lambda$ controls the plasticity–stability tradeoff.

The two terms in Equation [2](https://arxiv.org/html/2605.06834#S3.E2) differ in how directly they can be estimated. The future benefit $B_t(S)$ is a delayed counterfactual quantity whose value depends on the subsequent use of the reset units after training continues. Existing plasticity-oriented utilities, such as loss-gradient magnitude or incoming-weight norm, therefore act as proxies for this benefit term by trying to identify units whose current state may limit future adaptation. These signals incentivize replacing units that receive low learning signal or are difficult to adapt. By contrast, the cost $C_t(S)$ has a direct interventional target at selection time, since one can apply the same reset operation to a candidate unit and measure how much the current function changes. Contribution-style metrics, including CBP's original contribution and mean-corrected contribution terms, can therefore be understood as proxies for this cost term. They locally estimate how much the current computation depends on a unit and discourage replacing units whose removal would perturb downstream activity.

Because unit-level plasticity benefit is delayed, online reset methods can only incentivize it through proxies. Functional reset cost, however, has a direct interventional target. Since fixed-budget resets of poorly chosen units can hinder training or cause performance collapse (Sokar et al., [2023](https://arxiv.org/html/2605.06834#bib.bib24)), we primarily focus on estimating this cost term. Section 4 then shows that our proposed attribution score still retains a conservative connection to trainability through the downstream Jacobian.

### 3.3 Why existing utilities can mis-rank reset candidates

Activation-based utilities measure local expression, which can be a useful dormancy signal in simple feed-forward ReLU networks. However, activation magnitude is already known to be a weak replacement-suitability proxy outside this regime, and prior work has improved on it by incorporating contribution (Dohare et al., [2024](https://arxiv.org/html/2605.06834#bib.bib5)) or gradient information (Liu et al., [2025](https://arxiv.org/html/2605.06834#bib.bib14); Hernandez-Garcia et al., [2025](https://arxiv.org/html/2605.06834#bib.bib8)).

CBP's mean-corrected contribution utility is a stronger stability proxy because it is tied to the reset compensation. Let $r_i = \hat{f}_i$ be the running reference activation used by CBP. For a downstream preactivation

$$q_k(x) = b_k + w_{i,k}\, h_i(x) + \sum_{j \neq i} w_{j,k}\, h_j(x),$$

CBP removes unit $i$ by setting its outgoing weights to zero and transferring $w_{i,k} r_i$ into the downstream bias. The immediate preactivation change is therefore

$$q_k(x) - q'_k(x) = w_{i,k}\big(h_i(x) - r_i\big).$$

Thus, CBP's mean-corrected contribution estimates the local next-layer perturbation induced by the compensated reset. This explains why it improves over raw activation magnitude: units whose outputs are nearly constant around their recent reference can be assigned low utility even if their absolute activations are large.
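The compensated-reset identity above is easy to check numerically. The following numpy sketch (toy sizes; all quantities illustrative) applies CBP's removal of unit $i$ with bias compensation and confirms that the preactivation changes by exactly $w_{i,k}(h_i(x) - r_i)$.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(16, 4))      # activations of 4 units over a batch of 16
w = rng.normal(size=4)            # outgoing weights into one consumer k
b = 0.3                           # consumer's bias
i, r_i = 2, h[:, 2].mean()        # unit to reset and its running reference

q = b + h @ w                                # original preactivation
w_reset = w.copy(); w_reset[i] = 0.0         # zero the outgoing weight...
q_prime = (b + w[i] * r_i) + h @ w_reset     # ...and compensate the bias

# The per-example change is exactly w_ik * (h_i(x) - r_i).
assert np.allclose(q - q_prime, w[i] * (h[:, i] - r_i))
```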

The limitation is that this estimate remains local. It assumes that the measured activation is the component being reset, that the next consumer is an ordinary preactivation, and that one-hop perturbation magnitude is a faithful proxy for output impact. These assumptions can fail when branch fusion, normalization, downstream nonlinearities, or task-relevant output directions amplify, suppress, or redistribute the perturbation. Two units can therefore have the same mean-corrected contribution at the next layer while producing very different changes in the logits.

Loss-gradient utilities capture a complementary signal. They can encourage plasticity by favoring replacement of units with weak current learning signals, but they are not reliable as a standalone rank-based reset-cost metric. Let $J_i(x) = \partial z_\theta(x) / \partial h_i(x)$ be the downstream Jacobian from unit $i$ to the logits, let $e(x) = \nabla_z \mathcal{L}(z_\theta(x), y)$, and let $\Delta z_i(x)$ denote the linearized reset impact on the logits from replacing $h_i(x)$ with $r_i$. The loss gradient and this linearized reset impact are

$$\left\| \frac{\partial \mathcal{L}}{\partial h_i} \right\| = \left\| e(x)^\top J_i(x) \right\|, \qquad \|\Delta z_i(x)\| \approx \left\| J_i(x)\big(h_i(x) - r_i\big) \right\|. \tag{3}$$

Both quantities involve downstream sensitivity, but they answer different questions. The loss gradient is gated by the current residual direction $e(x)$, whereas reset impact is gated by the displacement $h_i(x) - r_i$. A unit can have a small loss gradient because the current example is already fit, or because its output direction is nearly orthogonal to the residual, while replacing it would still perturb the logits. Conversely, a large loss gradient can occur for a unit whose reset displacement is small. Thus, loss-gradient utility can be valuable as a plasticity heuristic, but by itself it can rank as disposable units that still have a high functional reset cost.
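A minimal toy example makes the divergence concrete. Assuming a linear readout $z = Vh$ (so the downstream Jacobian of unit $i$ is the column $V[:, i]$; all values below are illustrative), a unit whose output direction is orthogonal to the residual has zero loss gradient but a large linearized reset impact:

```python
import numpy as np

V = np.array([[1.0, 0.0],
              [0.0, 1.0]])          # logits z = V h, so J_i = V[:, i]
e = np.array([0.0, 1.0])            # residual dL/dz for the current example
h = np.array([3.0, 0.1])            # current activations
r = np.array([0.0, 0.0])            # reset references

loss_grad = np.abs(e @ V)           # |e(x)^T J_i(x)| per unit
# ||J_i (h_i - r_i)||_1 per unit: scale each column by its displacement.
reset_impact = np.abs(V * (h - r)).sum(axis=0)
```

Unit 0 gets no learning signal (its output direction is orthogonal to the residual) yet carries the larger reset cost; unit 1 is the opposite.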

These cases point to the same missing structure. A reset-cost utility should combine the displacement induced by the reset with the downstream sensitivity of the model output. Activation scores are too local, CBP contribution estimates a one-hop compensated perturbation, and loss-gradient scores emphasize plasticity pressure rather than reset impact. The next section derives a utility that combines reset displacement and downstream output sensitivity directly through reference-based attribution.

![Refer to caption](https://arxiv.org/html/2605.06834v1/combined_mc_bad.png)

Figure 1: Failure cases of existing utility measures on Online Permuted MNIST. Left: Under Leaky ReLU, existing utilities (defined in Table [2](https://arxiv.org/html/2605.06834#A1.T2)) fail to prevent plasticity loss. Center: In a network with LayerNorm, existing utilities also degrade. Right: In the ReLU feedforward case, Loss Gradient degrades to near-backprop performance.

## 4 Attribution-based neuron utility

### 4.1 From reset cost to reference-based attribution

Section 3.2 frames reset selection as a tradeoff between future plasticity benefit and immediate functional cost. We now derive a tractable approximation to the cost term $C_t(\{i\})$. Exact evaluation would require applying a candidate reset to each mature unit and measuring the resulting change in the network output. Instead, our proposed method approximates this intervention directly in function space.

This function-space approximation turns reset selection into an attribution problem. Shapley-style attribution defines a component's contribution by its average marginal effect over coalitions of present and absent components (Shapley, [1953](https://arxiv.org/html/2605.06834#bib.bib21); Lundberg and Lee, [2017](https://arxiv.org/html/2605.06834#bib.bib15)). This provides a principled notion of marginal contribution, but it is more general than the intervention used by CBP, which applies a concrete reset operator to a small set of units in the current network state. We therefore use Shapley-style attribution as motivation for a marginal-effect view of utility, while specializing the attribution target to the reset actually performed by CBP.

Reference-based attribution methods provide the closer analogue. DeepLIFT, Integrated Gradients, and gradient $\times$ difference-from-reference attribute output changes to the displacement of an internal variable from a chosen baseline (Shrikumar et al., [2017](https://arxiv.org/html/2605.06834#bib.bib22); Sundararajan et al., [2017](https://arxiv.org/html/2605.06834#bib.bib25); Ancona et al., [2018](https://arxiv.org/html/2605.06834#bib.bib1); Shrikumar et al., [2018](https://arxiv.org/html/2605.06834#bib.bib23)). CBP's bias-compensated reset has the same local form: removing a unit while adding back its average contribution to downstream biases is equivalent to replacing its current activation $h_i(x)$ with the running activation estimate used by the reset. We therefore set $r_i = r_{l,i,t} = \hat{f}_{l,i,t}$, making the attribution baseline the endpoint of the reset intervention rather than an arbitrary zero state.

Let $\delta_i(x) = h_i(x) - r_i$ denote the reset displacement. Let

$$J_i(x) = \frac{\partial z_\theta(x)}{\partial h_i(x)} \in \mathbb{R}^C \tag{4}$$

be the downstream Jacobian from unit $i$ to the logits. The first-order logit perturbation induced by the reset is

$$\Delta z_i(x) \approx J_i(x)\, \delta_i(x). \tag{5}$$

The corresponding logit-vector reset-cost proxy is

$$U_i^{\mathrm{all\text{-}logit}} = \mathbb{E}_x\left[ \left\| J_i(x)\big(h_i(x) - r_i\big) \right\| \right]. \tag{6}$$

This expression is the output-impact principle behind GXD: estimate the functional perturbation caused by moving a unit to the state used by the reset mechanism.
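For intuition, the all-logit proxy can be sketched for the special case of a linear readout $z = Vh$, where $J_i$ is simply column $i$ of $V$; in a deep network the Jacobian would instead come from backward passes. The shapes and values below are illustrative assumptions.

```python
import numpy as np

def all_logit_utility(h, r, V):
    """E_x[ ||J_i(x) (h_i(x) - r_i)|| ] per unit, with J_i = V[:, i].
    For a rank-1 perturbation, ||J_i * delta_i|| = ||J_i|| * |delta_i|."""
    disp = h - r                                   # (batch, n_units)
    return (np.linalg.norm(V, axis=0) * np.abs(disp)).mean(axis=0)

V = np.array([[2.0, 0.1],
              [1.0, 0.1]])       # logits z = V h
h = np.array([[0.5, 3.0]])       # one example: unit 1 has the larger displacement
r = np.array([0.0, 0.0])         # reset references
u = all_logit_utility(h, r, V)
```

A small displacement routed through a high-sensitivity column can outrank a large displacement routed through a near-zero one, which is exactly what local displacement measures miss.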

![Refer to caption](https://arxiv.org/html/2605.06834v1/shock5_full_l1_all_relu.png)

![Refer to caption](https://arxiv.org/html/2605.06834v1/shock5_full_l1_all_silu.png)

![Refer to caption](https://arxiv.org/html/2605.06834v1/shock5_full_l1_all_all.png)

Figure 2: Shock@5%: mean output perturbation from individually setting each neuron of the bottom 5% of a utility ranking to its reset reference. Lower is better. Left: MLP with ReLU; Loss Gradient has the largest shock. Center: MLP with SiLU; local utilities become unreliable. Right: ResNet-18 (100 heads); GXD consistently produces the lowest reset shock across architectures.
### 4.2 Target-logit GXD

The implementation studied in this paper uses a scalar output functional rather than the full logit vector. For a scalar score $S(x)$, the first-order perturbation is

$$\Delta S_i(x) \approx \big(h_i(x) - r_i\big)\, \frac{\partial S(x)}{\partial h_i(x)}. \tag{7}$$

For supervised classification, we use the target logit, $S_y(x) = z_y(x)$. This avoids scaling the attribution computation with the number of output dimensions while keeping the approximation focused on a task-relevant output. The target logit is already produced by the ordinary forward pass, so computing its attribution requires only one additional backward pass. Averaging the resulting score over datapoints also makes the utility follow the class balance of the sampled data, avoiding an additional class-weighting rule. GXD therefore estimates the scalar perturbation as

$$u_i^{\mathrm{GXD}}(x, y) = \left| \big(h_i(x) - r_i\big)\, \frac{\partial z_y(x)}{\partial h_i(x)} \right|. \tag{8}$$

This estimates the magnitude of the first-order perturbation to the correct-class logit caused by moving neuron $i$ from its current activation to its reset reference. Appendix [B](https://arxiv.org/html/2605.06834#A2) compares this efficient target-logit approximation with an all-logit GXD variant and finds nearly identical reset-cost performance.
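As a sketch, again assuming a single linear readout $z = Vh$ so that $\partial z_y / \partial h_i$ is just an entry of $V$ (in a deep network this gradient comes from one backward pass on $z_y$); the values are illustrative:

```python
import numpy as np

def gxd_score(h, r, V, y):
    """Per-example target-logit GXD: |(h_i - r_i) * dz_y/dh_i| per unit,
    where dz_y/dh_i = V[y, i] for a linear readout z = V h."""
    return np.abs((h - r) * V[y])

V = np.array([[0.5, -2.0, 0.0],
              [1.0,  0.3, 4.0]])   # logits z = V h
h = np.array([1.0, 1.0, 0.2])      # current activations
r = np.array([1.0, 0.0, 0.2])      # units 0 and 2 sit at their references
u = gxd_score(h, r, V, y=1)        # attribution w.r.t. the target logit z_1
```

Units already at their reset reference get zero estimated reset cost, regardless of how large their raw activations or gradients are.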

The output gradient also connects GXD to the learning signal available to the unit. As shown in Equation [3](https://arxiv.org/html/2605.06834#S3.E3), $J_i(x)$ is the downstream sensitivity map through which output-level error signals are assigned to neuron $i$. GXD uses this same downstream sensitivity, but weights it by the reset displacement $h_i(x) - r_i$ rather than by the current loss residual. This makes output-gradient sensitivity a promising incentive for trainability, similar in spirit to gradient-based recycling signals such as ReGraMa (Liu et al., [2025](https://arxiv.org/html/2605.06834#bib.bib14)). In this paper, however, we validate the combined GXD score as a reset-cost estimator; whether output-gradient sensitivity is useful as a standalone trainability signal requires separate evaluation.

### 4.3 Batch aggregation and online ranking

In CBP, the instantaneous utility is converted into a stable online ranking signal. For each minibatch $B_t$, we compute the absolute target-logit GXD score from Equation [8](https://arxiv.org/html/2605.06834#S4.E8) for every example and average those scores over the minibatch, giving $\hat{u}_{i,t}^{\mathrm{GXD}}$. Averaging absolute per-example perturbation estimates avoids cancellation between positive and negative attributions. We then maintain an EWMA utility estimate,

$$u_{i,t} = \gamma\, u_{i,t-1} + (1 - \gamma)\, \hat{u}_{i,t}^{\mathrm{GXD}}, \tag{9}$$

where $\gamma \in [0, 1)$ is the utility decay parameter. This running score is used exactly as in standard CBP: after the maturity threshold is reached, the lowest-utility units are selected for replacement under the fixed replacement schedule.
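A minimal sketch of this aggregation step, with toy values and an illustrative $\gamma$:

```python
import numpy as np

def update_utility(u_prev, gxd_batch, gamma=0.99):
    """One EWMA step (Eq. 9): fold the minibatch mean of absolute
    per-example GXD scores into the running utility estimate."""
    return gamma * u_prev + (1.0 - gamma) * np.abs(gxd_batch).mean(axis=0)

u = np.zeros(3)                          # running utilities for 3 units
batch = np.array([[0.2, -1.0, 0.0],
                  [0.4,  1.0, 0.0]])     # per-example signed attributions
u = update_utility(u, batch, gamma=0.9)
```

Taking absolute values before averaging keeps unit 1's utility high even though its signed attributions cancel; unit 2, with no attribution at all, ends up lowest-ranked and hence first in line for replacement.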

## 5 Experiments

We evaluate GXD across three experimental settings that progressively test different aspects of reset utility quality. First, a controlled lesion study (§[5.1](https://arxiv.org/html/2605.06834#S5.SS1)) directly measures how well each utility predicts the output perturbation caused by a reset, comparing rankings on MLP and ResNet architectures. Second, a Permuted MNIST continual learning benchmark (§[5.2](https://arxiv.org/html/2605.06834#S5.SS2)) tests whether better reset-cost estimation translates to sustained plasticity over hundreds of sequential tasks. Third, a Continuous CIFAR-100 experiment (§[5.3](https://arxiv.org/html/2605.06834#S5.SS3)) evaluates feature stability in a ResNet under repeated class exposure.

Table 1: Spearman rank correlation ($\rho$) between each utility score and realized output perturbation (mean $\pm$ SE across checkpoints). Higher is better; **bold** marks the best in each column.

### 5.1 Minimizing Reset Cost in MLPs and ResNets

We first isolate the ranking problem by measuring reset cost directly at frozen checkpoints. For each utility, we rank mature hidden units on a calibration set and then evaluate the same units on a disjoint probe set by clamping one unit at a time to the reset reference used by CBP, $h_i(x) \leftarrow r_i$, while leaving all parameters unchanged. This produces perturbed logits $z^{[i \rightarrow r_i]}(x)$, which we compare to the original logits $z(x)$ using logit $L_1$ distance, $\|z(x) - z^{[i \rightarrow r_i]}(x)\|_1$, and KL divergence. We summarize low-utility selection quality with Shock@5%: the mean perturbation from clamping each of the bottom 5% of units per layer under each ranking. This perturb-and-measure protocol follows prior ablation-based evaluations of internal-neuron importance, which clamp units to a reference value and compare the resulting output changes (Dhamdhere et al., [2019](https://arxiv.org/html/2605.06834#bib.bib3); Shrikumar et al., [2018](https://arxiv.org/html/2605.06834#bib.bib23)). We measure full-ranking quality with the Spearman correlation between utility score and realized perturbation. We run the procedure on ReLU and SiLU MLPs trained on Permuted MNIST and on a ResNet-18 architecture (He et al., [2016](https://arxiv.org/html/2605.06834#bib.bib7)) trained on CIFAR-100. Across all settings, the reset intervention is fixed and only the utility ranking changes.
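The clamping protocol can be sketched on a toy network. The following is an illustrative sketch, not the paper's implementation: a tiny random ReLU MLP stands in for a trained checkpoint, and `forward_clamped` plays the role of the $h_i(x) \leftarrow r_i$ intervention whose logit $L_1$ shock is then recorded per unit.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # input -> hidden
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)   # hidden -> logits

def forward_clamped(X, unit=None, ref=0.0):
    """Forward pass; optionally clamp one hidden unit to its reset reference."""
    h = np.maximum(X @ W1 + b1, 0.0)             # ReLU hidden layer
    if unit is not None:
        h[:, unit] = ref                         # the reset intervention
    return h @ W2 + b2

X = rng.normal(size=(32, 4))                     # stands in for the probe set
z = forward_clamped(X)                           # original logits z(x)
refs = np.maximum(X @ W1 + b1, 0.0).mean(axis=0) # r_i ~= E[h_i(x)]

# Realized per-unit reset cost: mean logit L1 shock over the probe set.
shock = np.array([
    np.abs(z - forward_clamped(X, unit=i, ref=refs[i])).sum(axis=1).mean()
    for i in range(8)
])
```

A utility is then scored by how well its ranking of units correlates (Spearman) with `shock`, and Shock@5% averages `shock` over the bottom-ranked units.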

#### Results

GXD is the most consistent reset-cost estimator across the architectures and activation functions tested. In the ReLU MLP, where local contribution scores are already strong, GXD nearly saturates the rank correlation with the measured logit $L_1$ shock. The separation is larger in the SiLU MLP: activation and CBP contribution correlate weakly with realized perturbation, while GXD remains highly correlated. Figure [2](https://arxiv.org/html/2605.06834#S4.F2) shows the practical consequence of this ranking difference: the bottom 5% of units selected by GXD produce a much smaller output shock than those selected by activation-based utilities. In the ResNet-18, GXD achieves the best Spearman correlation and yields the Shock@5% closest to the oracle (minimum possible impact). These results show that GXD minimizes reset perturbation precisely where existing utilities falter: in networks with non-ReLU activations and in residual architectures whose skip connections decouple local activity from downstream impact.

### 5.2 Maintaining Plasticity on Online Permuted MNIST

We evaluate GXD within the Continual Backpropagation (CBP) framework on 800 sequential Permuted MNIST tasks, comparing utility estimators while keeping the reset mechanism fixed. Online Permuted MNIST constructs a sequence of classification tasks by applying a new fixed random permutation to the pixels of MNIST images (LeCun et al., [1998](https://arxiv.org/html/2605.06834#bib.bib11)) for each task. This benchmark has been used in recent plasticity studies to expose degradation in a network's ability to adapt over long non-stationary training streams (Dohare et al., [2024](https://arxiv.org/html/2605.06834#bib.bib5); Kumar et al., [2025](https://arxiv.org/html/2605.06834#bib.bib10)).

The model is a fully connected network trained for one epoch per task, and we report test accuracy on each task as a measure of maintained plasticity\. To test whether reset utility remains reliable when local activity is less directly tied to functional importance, we evaluate ReLU, SiLU, Leaky ReLU, and tanh activations\. Results are averaged over 15 random seeds\.

![Refer to caption](https://arxiv.org/html/2605.06834v1/permuted_mnist_per_hp_act.png)

Figure 3: Test accuracy on Permuted MNIST across activation functions. GXD maintains plasticity across ReLU, Tanh, SiLU, and Leaky ReLU, while original CBP's MC Adaptable Contribution and plain Contribution face degradation or instability for non-ReLU activations. Averaged over 15 seeds.

#### Results

Figure [3](https://arxiv.org/html/2605.06834#S5.F3) shows that GXD maintains plasticity more reliably than the local utility baselines across the activation regimes tested. In the standard ReLU setting, CBP's original contribution-based utility is already well matched to the activation geometry, and GXD performs comparably. The advantage of GXD becomes clear for SiLU and Leaky ReLU, where local activation levels are less reliable estimates of reset cost, as shown in §[5.1](https://arxiv.org/html/2605.06834#S5.SS1). In these settings, both contribution utilities degrade over the task sequence, while GXD continues to sustain high accuracy. Tanh provides a complementary case where units can saturate at either positive or negative values, and subtracting a running reference helps correct the offset problem faced by plain activation contribution. As a result, MC Adaptable Contribution degrades less severely than the original contribution score. GXD still performs best, indicating that reference correction is a key improvement, but that the strongest reset criterion also accounts for output sensitivity.

We also evaluate a ReLU network with layer normalization, shown in Figure [4](https://arxiv.org/html/2605.06834#S5.F4). All local utility baselines collapse toward standard backpropagation, while GXD is the only tested utility that substantially mitigates plasticity loss. Layer normalization couples units through shared centering and scaling statistics, making one-hop or activation-only scores poor proxies for the cost of replacing a feature. Together, these results show that when local utilities misestimate reset cost, GXD makes CBP's fixed plasticity injection more useful: it preserves learned features while selecting low-impact units for replacement, allowing subsequent training to build new capacity rather than repair reset-induced damage.

![Refer to caption](https://arxiv.org/html/2605.06834v1/permuted_mnist_per_hp_layernorm.png)

![Refer to caption](https://arxiv.org/html/2605.06834v1/cont_cifar.png)

Figure 4: Left: Test accuracy on Permuted MNIST with ReLU and LayerNorm. GXD is the only tested utility that mitigates plasticity loss. Right: Continuous CIFAR-100 with a ResNet. Test accuracy relative to backprop baseline (50-task MA). GXD consistently outperforms Contribution and MC Adaptable Contribution in this feature stability test.

### 5.3 Maintaining Stability on Continuous CIFAR-100

We next evaluate GXD in a ResNet architecture using Continuous CIFAR-100 (Krizhevsky, [2009](https://arxiv.org/html/2605.06834#bib.bib9)), an adaptation of the Continual ImageNet protocol introduced by Dohare et al. ([2024](https://arxiv.org/html/2605.06834#bib.bib5)). Continual ImageNet trains a binary classifier on a sequence of two-class tasks sampled from 1000 classes, allowing hundreds of tasks before classes repeat. CIFAR-100 has only 100 classes, so this adaptation necessarily revisits classes later in training. The experiment is therefore primarily a test of *feature stability*: the model must adapt to each new binary task while preserving features that may be useful when a class reappears.

We design the experiment so that this stability pressure falls on the shared representation rather than the task-specific classifier. The model is a small 8-layer residual network, trained under a fixed stability-test protocol: learning rate 0.1, 200 epochs per task, and replacement rate $\rho = 10^{-4}$ for all reset methods. As in Dohare et al. ([2024](https://arxiv.org/html/2605.06834#bib.bib5)), the binary head is reset at each task boundary and only the feature extractor persists task to task. Thus, the utility ranking is the central intervention: all reset methods use the same nonzero replacement budget and reset mechanism, but choose different feature units to preserve or replace.

Standard backpropagation does not exhibit the monotonic plasticity loss seen in other related experimental settings, as shown in Figure [7](https://arxiv.org/html/2605.06834#A3.F7). This is because binary CIFAR-100 tasks are relatively small for this architecture, and repeated exposure lets the shared feature extractor learn representations for most classes early in training. This absence of collapse makes the benchmark a stability stress test for rank-based resets: because the head is reinitialized at every task boundary, performance on a revisited class depends on whether the persistent feature extractor has retained useful class structure.

#### Results

GXD performs best in this stability setting (Figure [4](https://arxiv.org/html/2605.06834#S5.F4), right). Relative to the backpropagation baseline, it maintains higher accuracy and exhibits smaller performance drawdowns than Contribution and MC Adaptable Contribution. This is consistent with the reset-cost view: GXD ranks eligible units by an estimate of the functional perturbation induced by resetting them, so its replacements are biased toward units whose removal least changes the current predictor. On Continuous CIFAR-100, that stability advantage translates into better reuse of previously learned features when classes recur.

## 6 Discussion

GXD improves the reset-selection component of Continual Backpropagation by estimating the functional cost of resetting a unit, aligned with the reset mechanism itself. This is useful when functional importance is obscured by non-ReLU activations, normalization, or skip connections. Empirically, GXD better predicts reset-induced output changes and improves Continual Backpropagation performance in the settings tested. Future work will extend this intervention-aligned view of utility to more complex environments such as transformers and reinforcement learning. These settings may also require moving beyond neuron-level resets toward weight-level resets (Hernandez-Garcia et al., [2025](https://arxiv.org/html/2605.06834#bib.bib8)) or partial resets (McCutcheon et al., [2026](https://arxiv.org/html/2605.06834#bib.bib18)), with attribution methods aligned to the corresponding reset operation.

#### Limitations\.

Our experiments are limited to supervised continual learning with MLPs (across activation and normalization variants), convolutional layers, and small ResNets. We have not yet evaluated GXD on attention-based models, LLMs, large-scale vision models, or RL agents. Our experiments are intentionally an isolated study of the utility measure within rank-based neuron resets, and future work should compare CBP-GXD to more recent state-of-the-art plasticity mitigation methods. GXD is also a first-order, per-unit approximation: it does not account for nonlinear curvature in the effect of a reset and does not estimate the combined effect of resetting multiple units at once. Future work could explore higher-order or set-aware reset-cost estimators that account for these effects. In addition, the current implementation uses an efficient scalar target-logit attribution, which may need to be extended for architectures with a large number of output heads. Finally, our experiments validate GXD primarily as a reset-cost estimator; future work could empirically evaluate output-gradient sensitivity as a standalone signal for trainability.

## Acknowledgements

The views expressed here are those of the authors alone and not of The Vanguard Group, Inc\.

## References

- Ancona et al. [2018] Ancona, M., Ceolini, E., Öztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In *International Conference on Learning Representations*, 2018.
- Ash and Adams [2020] Ash, J. T. and Adams, R. P. On warm-starting neural network training. In *Advances in Neural Information Processing Systems*, 2020.
- Dhamdhere et al. [2019] Dhamdhere, K., Sundararajan, M., and Yan, Q. How important is a neuron? In *International Conference on Learning Representations*, 2019.
- Dohare et al. [2021] Dohare, S., Sutton, R. S., and Mahmood, A. R. Continual backprop: Stochastic gradient descent with persistent randomness. *arXiv preprint arXiv:2108.06325*, 2021.
- Dohare et al. [2024] Dohare, S., Hernandez-Garcia, J. F., Lan, Q., Rahman, P., Mahmood, A. R., and Sutton, R. S. Loss of plasticity in deep continual learning. *Nature*, 632:768–774, 2024.
- Dohare et al. [2023] Dohare, S., Hernandez-Garcia, J. F., Rahman, P., Mahmood, A. R., and Sutton, R. S. Maintaining plasticity in deep continual learning. *arXiv preprint arXiv:2306.13812*, 2023.
- He et al. [2016] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In *IEEE Conference on Computer Vision and Pattern Recognition*, pp. 770–778, 2016.
- Hernandez-Garcia et al. [2025] Hernandez-Garcia, J. F., Dohare, S., Luo, J., and Sutton, R. S. Reinitializing weights vs units for maintaining plasticity in neural networks. *arXiv preprint arXiv:2508.00212v2*, 2025.
- Krizhevsky [2009] Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- Kumar et al. [2025] Kumar, S., Marklund, H., and Van Roy, B. Maintaining plasticity in continual learning via regenerative regularization. In *Proceedings of the 3rd Conference on Lifelong Learning Agents*, volume 274 of *Proceedings of Machine Learning Research*, pp. 410–430. PMLR, 2025.
- LeCun et al. [1998] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. *Proceedings of the IEEE*, 86(11):2278–2324, 1998.
- Lewandowski et al. [2024a] Lewandowski, A., Tanaka, H., Schuurmans, D., and Machado, M. C. Directions of curvature as an explanation for loss of plasticity. *arXiv preprint arXiv:2312.00246v4*, 2024.
- Lewandowski et al. [2024b] Lewandowski, A., Kumar, S., Schuurmans, D., György, A., and Machado, M. C. Learning continually by spectral regularization. *arXiv preprint arXiv:2406.06811v2*, 2024.
- Liu et al. [2025] Liu, J., Wu, Z., Obando-Ceron, J., Castro, P. S., Courville, A., and Pan, L. Measure gradients, not activations! Enhancing neuronal activity in deep reinforcement learning. In *Advances in Neural Information Processing Systems*, 2025.
- Lundberg and Lee [2017] Lundberg, S. M. and Lee, S.-I. A unified approach to interpreting model predictions. In *Advances in Neural Information Processing Systems*, 2017.
- Luo and Wu [2020] Luo, J.-H. and Wu, J. Neural network pruning with residual-connections and limited-data. In *IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 1458–1467, 2020.
- Lyle et al. [2025] Lyle, C., Zheng, Z., Khetarpal, K., van Hasselt, H., Pascanu, R., Martens, J., and Dabney, W. Disentangling the causes of plasticity loss in neural networks. In *Proceedings of the 3rd Conference on Lifelong Learning Agents*, volume 274 of *Proceedings of Machine Learning Research*, pp. 750–783. PMLR, 2025.
- McCutcheon et al. [2026] McCutcheon, L., Chatzaroulas, E., and Fallah, S. Learning continually at peak performance with continuous continual backpropagation. *OpenReview preprint, submitted to ICLR 2026*, 2026.
- Nikishin et al. [2022] Nikishin, E., Schwarzer, M., D'Oro, P., Bacon, P.-L., and Courville, A. The primacy bias in deep reinforcement learning. In *International Conference on Machine Learning*, pp. 16828–16847, 2022.
- Nikishin et al. [2023] Nikishin, E., Oh, J., Ostrovski, G., Lyle, C., Pascanu, R., Dabney, W., and Barreto, A. Deep reinforcement learning with plasticity injection. In *Advances in Neural Information Processing Systems*, 2023.
- Shapley [1953] Shapley, L. S. A value for $n$-person games. In Kuhn, H. W. and Tucker, A. W. (eds.), *Contributions to the Theory of Games*, volume II, pp. 307–317. Princeton University Press, 1953.
- Shrikumar et al. [2017] Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. In *International Conference on Machine Learning*, pp. 3145–3153, 2017.
- Shrikumar et al. [2018] Shrikumar, A., Su, J., and Kundaje, A. Computationally efficient measures of internal neuron importance. *arXiv preprint arXiv:1807.09946*, 2018.
- Sokar et al. [2023] Sokar, G., Agarwal, R., Castro, P. S., and Evci, U. The dormant neuron phenomenon in deep reinforcement learning. In *International Conference on Machine Learning*, volume 202 of *Proceedings of Machine Learning Research*, pp. 32145–32168. PMLR, 2023.
- Sundararajan et al. [2017] Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In *International Conference on Machine Learning*, pp. 3319–3328, 2017.
- Wang et al. [2024] Wang, L., Zhang, X., Su, H., and Zhu, J. A comprehensive survey of continual learning: Theory, method and application. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 46(8):5362–5383, 2024.
- Xu et al. [2018] Xu, W., Evans, D., and Qi, Y. Feature squeezing: Detecting adversarial examples in deep neural networks. In *Network and Distributed System Security Symposium*, 2018.
- Yeom et al. [2021] Yeom, S.-K., Seegerer, P., Lapuschkin, S., Binder, A., Wiedemann, S., Müller, K.-R., and Samek, W. Pruning by explaining: A novel criterion for deep neural network pruning. *Pattern Recognition*, 115:107899, 2021.
- Yvinec et al. [2022] Yvinec, E., Dapogny, A., Cord, M., and Bailly, K. SInGE: Sparsity via integrated gradients estimation of neuron relevance. In *Advances in Neural Information Processing Systems*, 2022.

## Appendix A Experimental Details

Table [2](https://arxiv.org/html/2605.06834#A1.T2) summarizes all utility methods compared across experiments. All use EMA-tracked statistics with decay $\beta = 0.99$ and bias correction $u_i/(1-\beta^{a_i})$ where applicable. GXD variants are detailed in Appendix [B](https://arxiv.org/html/2605.06834#A2).

Table 2: Utility methods. $h_i$: post-activation; $r_i$: bias-corrected EMA reference; $w^{\mathrm{out}}, w^{\mathrm{in}}$: outgoing/incoming weights; $z_y$: target logit; $\mathcal{L}$: loss.

### A.1 Continual Backpropagation Details

For hidden unit $i$ in layer $l$ at update $t$, let $h_{l,i,t}(x)$ be the post-activation value and $a_{i,t}^{(l)}$ be the unit age. CBP utilities maintain recent activation statistics. The signed running activation reference used by mean-corrected utilities is

$$f_{l,i,t} = (1-\eta)\,h_{l,i,t} + \eta\, f_{l,i,t-1}, \qquad r_{l,i,t} = \hat{f}_{l,i,t} = \frac{f_{l,i,t}}{1-\eta^{a_{i,t}^{(l)}}}, \qquad (10)$$

where the bias correction removes the zero-initialization bias of the EMA, making $r_{l,i,t}$ an age-corrected estimate of the unit's recent average activation.
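The bias correction in Eq. (10) can be checked with a few lines of code. This is an illustrative sketch with our own function names; with a constant activation stream, the corrected reference recovers the mean immediately while the raw zero-initialized EMA is still warming up.

```python
def update_reference(f_prev: float, h: float, age: int, eta: float = 0.99):
    """One EMA step plus bias correction (Eq. 10); returns (f_t, r_t)."""
    f = (1.0 - eta) * h + eta * f_prev       # zero-initialized EMA
    r = f / (1.0 - eta ** age)               # age-corrected estimate of E[h_i]
    return f, r

# Feed a constant activation h = 2.0 to a freshly initialized unit.
f, age = 0.0, 0
for _ in range(3):
    age += 1
    f, r = update_reference(f, 2.0, age)
# r equals 2.0 at every step, while the raw EMA f is still far below 2.0.
```

The same correction is what makes young units comparable to mature ones when the reference is used inside a utility score.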

Contribution utility estimates local expression weighted by outgoing connection magnitude\. It is calculated as an EMA of instantaneous outgoing contribution,

$$u_{l,i,t}^{\mathrm{Cont}} = \eta\, u_{l,i,t-1}^{\mathrm{Cont}} + (1-\eta)\,\lvert h_{l,i,t}\rvert \sum_{k=1}^{n_{l+1}} \lvert w_{i,k,t}^{(l)}\rvert, \qquad (11)$$

with $\eta = 0.99$ in Dohare et al. ([2024](https://arxiv.org/html/2605.06834#bib.bib5)). Mean-corrected contribution replaces $\lvert h_{l,i,t}\rvert$ with the displacement from the signed reference, $\lvert h_{l,i,t} - r_{l,i,t}\rvert$. Mean-corrected adaptable contribution, the overall contribution utility of Dohare et al. ([2023](https://arxiv.org/html/2605.06834#bib.bib6)), further divides by incoming weight magnitude,

$$y_{l,i,t}^{\mathrm{MCAdapt}} = \frac{\lvert h_{l,i,t} - r_{l,i,t}\rvert \sum_{k=1}^{n_{l+1}} \lvert w_{i,k,t}^{(l)}\rvert}{\sum_{j=1}^{n_{l-1}} \lvert w_{j,i,t}^{(l-1)}\rvert}. \qquad (12)$$

The tracked utilities are an EMA of the instantaneous score, and mature units are ranked by its bias-corrected value.
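The instantaneous scores in Eqs. (11)–(12) can be sketched for a single unit as follows; the function names and the demo weights are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def contribution(h: float, w_out: np.ndarray) -> float:
    """Instantaneous contribution: |h| times outgoing weight magnitude (Eq. 11)."""
    return abs(h) * np.abs(w_out).sum()

def mc_adaptable_contribution(h: float, r: float,
                              w_out: np.ndarray, w_in: np.ndarray) -> float:
    """Mean-corrected adaptable contribution: displacement |h - r| weighted by
    outgoing magnitude and divided by incoming magnitude (Eq. 12)."""
    return abs(h - r) * np.abs(w_out).sum() / np.abs(w_in).sum()

w_in = np.array([0.5, -0.5])   # incoming weights of the unit
w_out = np.array([1.0, -2.0])  # outgoing weights of the unit
c = contribution(3.0, w_out)                          # 3 * (1 + 2) = 9
m = mc_adaptable_contribution(3.0, 2.5, w_out, w_in)  # 0.5 * 3 / 1 = 1.5
```

In the tracked utilities these instantaneous values are fed into the EMA of Eq. (11) rather than used directly.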

Algorithm 1: Continual Backpropagation for an MLP

1: Input: replacement rate $\rho$, decay rate $\eta$, maturity threshold $m$
2: Initialize weights with $w \sim d_l$, where $d_l$ is the initialization distribution
3: Initialize utilities $u$, replacement counters $c$, and ages $a$ to zero
4: for each input $x_t$ do
5: — Train step
6: — for each hidden layer $l = 1, \ldots, L-1$ do
7: —— Update ages: $a_l \leftarrow a_l + 1$
8: —— Update utilities $u_l$ with decay rate $\eta$
9: —— Count mature units: $n_{\mathrm{eligible}} \leftarrow |\{i : a_{l,i} > m\}|$
10: —— Accumulate replacements: $c_l \leftarrow c_l + n_{\mathrm{eligible}}\,\rho$
11: —— if $c_l > 1$ then
12: ——— Select $r = \arg\min_{i : a_{l,i} > m} u_{l,i}$
13: ——— Reinitialize incoming weights: resample $w_{l-1}[:, r] \sim d_l$
14: ——— Compensate downstream bias: $b_l \leftarrow b_l + w_l[r, :]\,\hat{f}_{l,r}$
15: ——— Reinitialize outgoing weights: set $w_l[r, :] \leftarrow 0$
16: ——— Reset utility and age: $u_{l,r} \leftarrow 0$, $a_{l,r} \leftarrow 0$
17: ——— Update replacement counter: $c_l \leftarrow c_l - 1$
18: —— end if
19: — end for
20: end for
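The replacement step (Algorithm 1, steps 11–17) can be sketched compactly for one hidden layer. This is a minimal illustration under our own naming (`cbp_reset_step`, a normal init distribution); it is not the paper's implementation.

```python
import numpy as np

def cbp_reset_step(W_in, W_out, b_out, util, age, f_hat,
                   counter, maturity=100, rng=np.random.default_rng(0)):
    """If the accumulated counter exceeds 1, reset the lowest-utility mature unit."""
    if counter <= 1.0:
        return counter
    mature = np.where(age > maturity)[0]
    if mature.size == 0:
        return counter
    r = mature[np.argmin(util[mature])]          # lowest-utility mature unit
    b_out += W_out[r, :] * f_hat[r]              # compensate downstream bias
    W_in[:, r] = rng.normal(size=W_in.shape[0])  # resample incoming weights
    W_out[r, :] = 0.0                            # zero outgoing weights
    util[r], age[r] = 0.0, 0                     # reset tracking state
    return counter - 1.0

# Toy layer with 2 hidden units: unit 0 has the lower utility and gets reset.
W_in = np.ones((3, 2)); W_out = np.ones((2, 4)); b_out = np.zeros(4)
util = np.array([0.1, 0.9]); age = np.array([200, 200]); f_hat = np.array([1.0, 2.0])
counter = cbp_reset_step(W_in, W_out, b_out, util, age, f_hat, counter=1.5)
```

Note the ordering: the downstream bias is compensated with the unit's old outgoing weights and reference before those weights are zeroed, so the layer's output is approximately unchanged at the moment of the reset.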

### A.2 Reset Cost Assay (§[5.1](https://arxiv.org/html/2605.06834#S5.SS1))

Models are trained and checkpointed at fixed intervals. For each checkpoint, utility scores are computed on a calibration set (2,048 samples), and the output perturbation from clamping low-utility neurons to their reference, $r_i \approx \mathbb{E}[h_i(x)]$, is measured on a disjoint probe set (2,048 samples). We measure output shock using logit $L_1$ distance and KL divergence between the original and perturbed model outputs. $L_1$ measures absolute functional displacement, while KL measures the effective change in the predictive distribution. Similar output-distance criteria have been used to quantify changes in model predictions (Luo and Wu, [2020](https://arxiv.org/html/2605.06834#bib.bib16); Xu et al., [2018](https://arxiv.org/html/2605.06834#bib.bib27)). The assay additionally evaluates the GXD variants listed in Appendix [B](https://arxiv.org/html/2605.06834#A2).
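The two shock metrics can be sketched directly; function names here are illustrative, and the eps smoothing in the KL is our own numerical-safety assumption.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def logit_l1(z, z_pert):
    """Mean absolute functional displacement of the logits."""
    return np.abs(z - z_pert).sum(axis=-1).mean()

def kl_shock(z, z_pert, eps=1e-12):
    """Mean KL(p || q) between original and perturbed predictive distributions."""
    p, q = softmax(z), softmax(z_pert)
    return (p * np.log((p + eps) / (q + eps))).sum(axis=-1).mean()

z = np.array([[2.0, 0.0, -1.0]])
z_pert = z + np.array([[0.5, 0.0, 0.0]])  # e.g. the effect of clamping one unit
```

The two metrics can disagree: a large logit shift on a confidently predicted class barely moves the softmax, so $L_1$ registers it while KL does not.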

MLP metrics: mean $\pm$ SE across 3 seeds $\times$ 5 checkpoints. ResNet-18 metrics: 3 seeds $\times$ 4 checkpoints. For all ResNet architectures, a BasicBlock denotes the residual block of He et al. ([2016](https://arxiv.org/html/2605.06834#bib.bib7)): two 3$\times$3 convolution–BatchNorm–ReLU layers with an identity shortcut when dimensions match and a projection shortcut when channels or spatial resolution change.

Table 3: Reset Cost Assay: architecture and training.

| | MLP (Permuted MNIST) | ResNet-18 (CIFAR-100) |
| --- | --- | --- |
| Architecture | 784 → [256]×4 → 10 | ResNet-18 (He et al., [2016](https://arxiv.org/html/2605.06834#bib.bib7)), also used in Dohare et al. ([2024](https://arxiv.org/html/2605.06834#bib.bib5)): 4 stages with [2,2,2,2] BasicBlocks; [64, 128, 256, 512] channels |
| Activations | ReLU, SiLU, LReLU, Tanh | ReLU |
| Normalization | None | BatchNorm2d |
| Init | Kaiming uniform | Kaiming normal |
| Optimizer | SGD (lr=0.01) | SGD (lr=0.1, mom=0.9, wd=5×10⁻⁴) |
| LR schedule | None | ×0.2 at epochs 60, 120, 160 |
| Batch size | 64 | 90 / 100 |
| Tasks / epochs | 30 tasks, 1 epoch each | 200 epochs, all 100 classes |
| Data split | Full train / test | 450 train + 50 val per class |
| Augmentation | None | Flip, crop (pad=4), rot (0–15°) |
| Seeds | 3 | 3 |
| Checkpoints | Tasks {0, 5, 10, 20, 30} | Epochs {60, 120, 160, 200} |
| Compute | CPU | NVIDIA A10G (24 GB) |
### A.3 Online Permuted MNIST (§[5.2](https://arxiv.org/html/2605.06834#S5.SS2))

Table 4: Permuted MNIST: configuration. MNIST pixels normalized to $[0,1]$; permutations generated from a fixed seed shared across methods. Hyperparameters were selected from 5-seed sweeps before running the final reported seeds. All reset-based methods received the same learning-rate and replacement-rate candidate set. Hyperparameters not explicitly swept or fixed above follow the prior CBP/plasticity-loss settings used for Permuted MNIST (Dohare et al., [2024](https://arxiv.org/html/2605.06834#bib.bib5)). Error bars: SE across 15 seeds. GXD's additional backward pass for the target-logit gradient adds minimal overhead on CPU compared to the original CBP ($\sim$6.1 h vs. $\sim$6 h per seed for Activation CBP).

### A.4 Continuous CIFAR-100 (§[5.3](https://arxiv.org/html/2605.06834#S5.SS3))

Follows the Continual ImageNet protocol of Dohare et al. ([2024](https://arxiv.org/html/2605.06834#bib.bib5)) adapted for CIFAR-100: the model trains on a sequence of 500 binary tasks (random class pairs). The 2-unit head is zeroed and optimizer state is fully reset between tasks; conv layer parameters carry over. Loss gradient utility was not included in this experiment due to its poor performance in Online Permuted MNIST.

Table 5: Continuous CIFAR-100: architecture and training. The Continuous CIFAR-100 experiment is designed as a fixed-budget stability test. We report all reset methods at replacement rate $\rho = 10^{-4}$ to compare how different utility rankings allocate the same amount of feature turnover. Hyperparameters not explicitly swept or fixed above follow the Continual ImageNet protocol of Dohare et al. ([2024](https://arxiv.org/html/2605.06834#bib.bib5)). Error bars: SE across 10 seeds.

We use the same layerwise notion of a reset unit as Dohare et al. ([2024](https://arxiv.org/html/2605.06834#bib.bib5)): in convolutional layers, one unit corresponds to an output channel/filter, so resetting a unit reinitializes the incoming filter weights and removes its outgoing connections. In linear layers, one unit corresponds to one hidden feature vector.

## Appendix B GXD Variants

The reset cost experiment (§[5.1](https://arxiv.org/html/2605.06834#S5.SS1)) compares the target-logit GXD used in the main experiments against a full all-logit version and GXI (Gradient $\times$ Input), a related attribution baseline. GXD computes $f(h_i - r_i, \nabla_i)$ where $r_i$ is the bias-corrected EMA reference and $\nabla_i$ denotes a gradient with respect to the target logit or, in the all-logit case, averaged over all $C$ output logits. GXI uses the same gradient but replaces the displacement with the raw activation $h_i$, equivalent to a zero reference.
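The relationship among the three variants can be sketched for a single unit; the function names and demo values are illustrative, with $h$ the activation, $r$ the reference, $g_y$ the target-logit gradient, and $G$ the per-logit gradient vector over $C$ logits.

```python
import numpy as np

def gxd_target(h, r, g_y):
    """Target-logit GXD: gradient times displacement from the reset reference."""
    return abs((h - r) * g_y)

def gxi_target(h, g_y):
    """Gradient x Input: same gradient, but zero reference (r = 0)."""
    return abs(h * g_y)

def gxd_all_logit(h, r, G):
    """All-logit GXD: attribution magnitude averaged over all C logits."""
    return np.mean(np.abs((h - r) * G))

h, r, g_y = 1.5, 1.0, -0.4
G = np.array([-0.4, 0.2, 0.1])
# For a unit resting at r = 0 (e.g. an inactive ReLU unit) the two target
# scores coincide; for a nonzero reference GXI misprices the reset.
```

This makes the ablation concrete: GXI differs from target-logit GXD only in the implicit reference, which is exactly the component that becomes misaligned for smooth activations.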

Table 6: Variants evaluated in the Reset Cost Experiment.

Section [4.2](https://arxiv.org/html/2605.06834#S4.SS2) motivates the target-logit formulation for its efficiency: it requires only one additional backward pass regardless of the number of output classes. Table [7](https://arxiv.org/html/2605.06834#A2.T7) confirms that the ranking quality of GXD (target) nearly matches GXD (all-logit) across all three settings. The gap is especially small for the ResNet-18 (C=100), where Spearman $\rho$ on logit $L_1$ is .926 vs. .937, and GXD (target) actually achieves the higher correlation on KL (.803 vs. .783). Figure [5](https://arxiv.org/html/2605.06834#A2.F5) further shows the alignment of target-logit GXD with the all-logit version: resetting the bottom 5% under either ranking produces nearly identical perturbation, which is also the minimum among all estimation methods.

![Refer to caption](https://arxiv.org/html/2605.06834v1/APP_shock5_full_l1_all_relu.png)

![Refer to caption](https://arxiv.org/html/2605.06834v1/APP_shock5_full_l1_all_silu.png)

![Refer to caption](https://arxiv.org/html/2605.06834v1/APP_shock5_full_l1_all_all.png)

Figure 5:Shock@5% for GXD variants: GXI \(target\), GXD \(target\), and GXD \(full\-L1\)\.Left:MLP with ReLU\.Center:MLP with SiLU\.Right:ResNet\-18\.GXI \(Gradient×\\timesInput\) is a standard feature attribution method\(Ancona et al\.,[2018](https://arxiv.org/html/2605.06834#bib.bib1)\)that multiplies a unit’s activation by its output gradient\. In the GXD framework, GXI corresponds to setting the reference to zero \(ri=0r\_\{i\}=0\)\. For ReLU networks, zero is the natural inactive state, so GXI’s implicit reference coincides with the reset endpoint\. This explains its competitive performance in the ReLU MLP \(Table[7](https://arxiv.org/html/2605.06834#A2.T7):\.970\.970vs\.\.996\.996Spearman on logitL1L\_\{1\}; Figure[5](https://arxiv.org/html/2605.06834#A2.F5)left: comparable Shock@5% bars; Figure[6](https://arxiv.org/html/2605.06834#A2.F6)matches performance of GXD on ReLU\)\.

Table 7: Spearman rank correlation ($\rho$) for GXD variants (mean $\pm$ SE across checkpoints). GXD (target) is the scalar target-logit formulation used in the main experiments; GXI uses zero as the reference; GXD (all-logit) averages attribution over all output logits.

For smooth activations and residual architectures, zero is no longer the inactive state and GXI's reference becomes misaligned with the actual reset intervention. In the SiLU MLP, GXI's Shock@5% is comparable to Contribution and substantially worse than GXD (Figure [5](https://arxiv.org/html/2605.06834#A2.F5), center), and its Spearman $\rho$ on logit $L_1$ drops to .469 versus .980 for GXD (Table [7](https://arxiv.org/html/2605.06834#A2.T7)). The same pattern appears in the ResNet-18, where GXI's Shock@5% nearly matches Contribution while GXD approaches the oracle (Figure [5](https://arxiv.org/html/2605.06834#A2.F5), right). Figure [6](https://arxiv.org/html/2605.06834#A2.F6) shows this gap carries over to online continual learning: GXI matches GXD on ReLU but degrades on Tanh, SiLU, and Leaky ReLU, mirroring the failure mode of activation-based utilities that assume a zero baseline. This ablation is important because it shows that output sensitivity alone is not a meaningful improvement in our explored settings; the gains require the intervention-aware difference-from-reference component.

![Refer to caption](https://arxiv.org/html/2605.06834v1/permuted_mnist_per_hp_act_gxi_grad.png)

Figure 6: Test accuracy on Permuted MNIST comparing GXI (target), GXD (target), and Loss Gradient utilities across activation functions. Results averaged over 15 seeds.
## Appendix C Extended Figures

![Refer to caption](https://arxiv.org/html/2605.06834v1/cont_cifar_abs.png)

Figure 7: Absolute test accuracy on Continuous CIFAR-100 with ResNet.
