# Balancing Stability and Plasticity in Sequentially Trained Early-Exiting Neural Networks
Source: [https://arxiv.org/html/2605.05358](https://arxiv.org/html/2605.05358)
###### Abstract
Early-exiting neural networks enable adaptive inference by allowing inputs to exit at intermediate classifiers, reducing computation for easy samples while maintaining high accuracy. In practice, exits can be trained sequentially by incrementally adding them to a shared backbone; however, this sequential training can cause newly introduced exits to interfere with previously learned ones, degrading the performance of earlier classifiers. We address this problem by retaining the knowledge embedded in existing exits while allowing new ones to specialize. We propose two alternative approaches that operate at different levels of the model. The first constrains learning by protecting parameters that are important for previously trained exits, while the second preserves the output distributions of earlier exits as the network adapts. These alternatives directly reflect the stability-plasticity trade-off studied in continual learning. Accordingly, we leverage Elastic Weight Consolidation to constrain critical weights and Learning without Forgetting to preserve output distributions. Experiments on standard benchmarks show that our approaches consistently improve early-exit performance, achieving higher accuracy than existing sequential training methods and significant speedups at low computational budgets.
Index Terms—Early-exiting neural networks, Sequential training, Catastrophic forgetting, Continual learning
## 1 Introduction
The increasing deployment of deep learning models under strict latency, energy, and memory constraints has motivated the development of resource-efficient dynamic architectures. Rather than executing a fixed computation graph for all inputs, such architectures adapt their computational cost to input difficulty, enabling efficient inference while maintaining high accuracy [[5](https://arxiv.org/html/2605.05358#bib.bib5)]. Early-exiting neural networks (EENNs) are a representative instance of this paradigm. EENNs augment a backbone network with multiple internal classifiers (ICs) placed at increasing depths, allowing inference to terminate early once sufficient confidence is reached. Easy samples can be classified using shallow representations, while more challenging inputs propagate deeper into the network. This exit-based adaptivity significantly reduces average inference cost and makes EENNs well-suited for resource-constrained applications [[2](https://arxiv.org/html/2605.05358#bib.bib15)].
To realize these benefits in practice, a critical design choice concerns how to train the multiple exits within an EENN. Most prior work has focused on joint training strategies, which optimize the backbone along with all exits simultaneously in an end-to-end manner. Joint training has been shown to deliver strong empirical performance and has therefore become the predominant approach [[17](https://arxiv.org/html/2605.05358#bib.bib26), [4](https://arxiv.org/html/2605.05358#bib.bib14), [3](https://arxiv.org/html/2605.05358#bib.bib4), [11](https://arxiv.org/html/2605.05358#bib.bib1)]. In contrast, sequential training performs the training incrementally as exits are added to an existing backbone. Sequential training naturally aligns with pretrained backbones by enabling faster design iteration and an efficient analysis of where meaningful predictions emerge within the network – a useful property for large models where the computational cost of full joint training can be prohibitive or even undesirable [[24](https://arxiv.org/html/2605.05358#bib.bib16), [22](https://arxiv.org/html/2605.05358#bib.bib17)].
Sequential training approaches for EENNs are categorized into three main groups based on the specific training methodology employed for each successive exit [[11](https://arxiv.org/html/2605.05358#bib.bib1)]. **Disjoint training** (also referred to as two-stage training) first trains the backbone independently and then freezes it while learning ICs [[20](https://arxiv.org/html/2605.05358#bib.bib13), [23](https://arxiv.org/html/2605.05358#bib.bib19)]. This approach is particularly effective for retrofitting pretrained models with early exits, at the cost of limited adaptability: new exits cannot refine earlier representations to better support their specific objectives. **Branch-wise training** introduces exits sequentially while unfreezing only the backbone segment associated with the current exit, offering moderate flexibility and helping to alleviate vanishing or exploding gradients in deep architectures [[7](https://arxiv.org/html/2605.05358#bib.bib24)]. However, this forward-thinking approach suffers from premature freezing: once an exit is trained and the training process moves deeper, that exit's parameters become fixed, preventing further refinement even as later exits reveal useful representational structures. **Separate training** further relaxes these constraints by unfreezing the full network [[11](https://arxiv.org/html/2605.05358#bib.bib1)]. The exits are also introduced sequentially; however, the training of each exit is done jointly with previous exits, allowing each branch to capture features at different abstraction levels through potentially distinct objectives. Although this approach provides greater flexibility, it applies uniform updates to all backbone parameters during each iteration, without identifying which parameters are essential for earlier exits and which can be modified for new ones.
A key challenge unifies these methods: introducing new exits disrupts previously learned representations through gradient interference, with degradation intensifying as more exits are added [[14](https://arxiv.org/html/2605.05358#bib.bib25)]. This results from current approaches disregarding the heterogeneous importance of parameters, particularly those that have already been tuned for earlier exits. Indeed, some weights encode critical features that earlier exits depend on, while others are less consequential and could be freely adapted. Distinguishing between these parameter types is therefore essential, as it enables selective protection of critical knowledge while maintaining the plasticity needed for new exits to specialize effectively. Importantly, this challenge is not unique to early-exit networks. In continual learning (CL), models face a similar dilemma when learning sequential tasks; they must acquire new capabilities (plasticity) while preserving previously learned knowledge (stability) to mitigate catastrophic forgetting [[19](https://arxiv.org/html/2605.05358#bib.bib27)]. We thus recognize that sequential early-exit training exhibits this same stability-plasticity trade-off. This conceptual alignment suggests that CL techniques can be adapted to protect salient features of earlier exits from being forgotten during sequential training. It is worth noting that the goal here is not to implement CL on EENNs (e.g., Szatkowski et al. [[16](https://arxiv.org/html/2605.05358#bib.bib8)]), but to benefit from some of the regularization techniques of CL to mitigate representation degradation in the sequential training of EENNs.
Building on this perspective, we propose a novel two-stage sequential training approach. In the first stage, we pretrain the backbone using only the final exit so that deep layers develop rich abstract representations before early exits are introduced [[21](https://arxiv.org/html/2605.05358#bib.bib18), [13](https://arxiv.org/html/2605.05358#bib.bib12)]. In the second stage, we perform sequential training while explicitly preserving the knowledge acquired by earlier exits as new exits are introduced. Consistent with the CL perspective, our goal is to counteract the forgetting induced by gradient interference. Without protection mechanisms, optimizing a new exit inevitably overwrites representations that previous exits depend on, leading to progressive degradation. Therefore, we aim to regularize sequential training from two alternative standpoints, namely at the parameter and output-distribution levels, each addressing forgetting from a different angle. At the parameter level, forgetting occurs because optimization updates treat all weights equally, although some are critical for the performance of existing exits. By identifying which parameters are essential, we can selectively protect them, allowing the network to adapt safely while preserving knowledge that earlier exits rely on. This motivates the use of Elastic Weight Consolidation (EWC) [[9](https://arxiv.org/html/2605.05358#bib.bib7), [15](https://arxiv.org/html/2605.05358#bib.bib6)], which estimates parameter importance via Fisher information and constrains updates on crucial weights, thus weighting the learning process to balance stability and plasticity. At the output distribution level, what ultimately matters is that the predictive behavior of the earlier exits remains stable as new exits are trained. To achieve this, we adopt Learning without Forgetting (LwF) [[12](https://arxiv.org/html/2605.05358#bib.bib3)], which regularizes the outputs of earlier exits to match their original predictions, maintaining functional consistency as new exits are trained. Both EWC and LwF are governed by tunable hyperparameters, giving explicit control over the trade-off between preserving early-exit performance and enabling adaptation for new exits.
The main contributions of this work are threefold:
- We propose a novel technique to mitigate the degradation of previously learned representations, addressing the stability-plasticity trade-off faced during sequential training of EENNs.
- We adapt two CL regularization strategies, EWC for parameter-level protection and LwF for output-distribution-level consistency, to enable principled sequential training that balances exit specialization with knowledge preservation.
- We demonstrate consistent improvements over existing sequential training methods across benchmark architectures, validating the effectiveness of the CL approach for EENNs.
## 2 Proposed Methodology
Fig. 1: Flow of the proposed sequential training scheme. Grey refers to regularized parameters or output distributions; dashed lines refer to frozen segments. The backbone is represented by rotated blocks $\Theta_\mu$ and intermediate classifiers by horizontal blocks $IC_\mu$. Each $IC_\mu$ outputs a distribution $\mathbf{\hat{y}}_\mu$.

We consider an EENN consisting of $M$ ICs, where each backbone subnetwork $\mu$ is parameterized by $\{\Theta_1,\dots,\Theta_\mu\}$ and produces the class probability vector $\mathbf{\hat{y}}_\mu$ (see [Figure 1](https://arxiv.org/html/2605.05358#S2.F1)), with $\hat{y}_\mu(c)$ (for $c\in\{1,\dots,C\}$) being the probability associated with class $c$. We frame the sequential training of ICs as a sequence of learning tasks indexed by $\mu\in\{1,2,\ldots,M\}$, where task $\mu$ corresponds to training $IC_\mu$ while preserving the predictive capabilities of all previously trained ICs $\{1,\ldots,\mu-1\}$. The shared backbone parameters must accommodate the objectives of all trained ICs, balancing plasticity (i.e., the ability to refine representations for the current task) against stability (i.e., maintaining performance on earlier tasks).
To address this stability-plasticity trade-off, we augment the standard classification loss with a regularization term that protects previously acquired knowledge. The loss function for training exit $\mu$ takes the general form:
$$\mathcal{L}_{\mathrm{total}}^{(\mu)} = \mathcal{L}_{\mathrm{CE}}^{(\mu)} + s\,\lambda\,\mathcal{R}_{\mathrm{EWC}}^{(\mu)} + (1-s)\,\rho\,\mathcal{R}_{\mathrm{LwF}}^{(\mu)} \qquad (1)$$
where $\mathcal{L}_{\mathrm{CE}}^{(\mu)}$ denotes the cross-entropy loss for exit $\mu$, which depends on $\{\Theta_1,\dots,\Theta_\mu\}$ and the parameters of $IC_\mu$; $\mathcal{R}_{\mathrm{EWC}}^{(\mu)}$ and $\mathcal{R}_{\mathrm{LwF}}^{(\mu)}$ are regularization terms that mitigate the degradation of earlier exits, corresponding to EWC (parameter-level) and LwF (output-distribution-level), respectively. Their relative strengths are controlled by the hyperparameters $\lambda$ and $\rho$. The binary variable $s\in\{0,1\}$ indicates which regularization strategy is active.
To illustrate the training procedure, [Figure 1](https://arxiv.org/html/2605.05358#S2.F1) shows the status of each subnetwork when training exit 3. At this stage, the first two blocks (parameterized by $\Theta_1$ and $\Theta_2$) and their respective classifiers (producing predictions $\mathbf{\hat{y}}_1$ and $\mathbf{\hat{y}}_2$) have already been trained. We now add IC 3 and therefore need to train the full network up to exit 3. We minimize the cross-entropy loss $\mathcal{L}_{\mathrm{CE}}^{(3)}$ augmented with a regularization term $\mathcal{R}^{(3)}$ applied to the previously trained blocks, either constraining their parameters (EWC) or preserving their predictions (LwF). The final block ($\Theta_4$) and its corresponding classifier remain frozen throughout this stage.
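To make the stage-wise procedure concrete, the sketch below illustrates one training stage under the combined loss of Eq. (1). This is a minimal PyTorch illustration under our own assumptions, not the authors' released code: `forward_to_exit(x, nu)` is a hypothetical helper returning the logits of IC $\nu$ from blocks $\Theta_1,\dots,\Theta_\nu$, `ewc_state` holds the anchors and accumulated Fisher values from previous stages, `stored_probs_fn` returns the recorded pre-stage outputs of earlier exits, and the flag `use_ewc` plays the role of the binary switch $s$.

```python
import torch

def train_exit_stage(mu, forward_to_exit, loader, optimizer,
                     lam, rho, use_ewc, ewc_state=None, stored_probs_fn=None):
    """One sequential-training stage: fit IC_mu while regularizing earlier exits (Eq. 1)."""
    ce = torch.nn.CrossEntropyLoss()
    for x, y in loader:
        optimizer.zero_grad()
        loss = ce(forward_to_exit(x, mu), y)                 # L_CE^(mu) for the new exit
        if use_ewc:                                          # s = 1: parameter-level protection (Eq. 2)
            params, anchors, fisher = ewc_state
            loss = loss + lam * sum((f * (p - a) ** 2).sum()
                                    for p, a, f in zip(params, anchors, fisher))
        else:                                                # s = 0: output-distribution protection (Eq. 4)
            kl = 0.0
            for nu in range(1, mu):
                p_cur = torch.softmax(forward_to_exit(x, nu), dim=1).clamp_min(1e-8)
                p_old = stored_probs_fn(x, nu).detach().clamp_min(1e-8)
                kl = kl + (p_cur * (p_cur.log() - p_old.log())).sum(dim=1).mean()
            loss = loss + rho * kl
        loss.backward()
        optimizer.step()
```

Freezing the untrained deeper blocks and their classifiers (e.g., $\Theta_4$ and $IC_4$ in Fig. 1) is assumed to be handled by passing only the trainable parameters up to exit $\mu$ to the optimizer.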
### 2.1 Parameter-level Regularization
The first approach employs parameter-level regularization through EWC, which protects parameters that are critical for previously trained exits by penalizing changes to important weights. This importance score is computed using Fisher information [[1](https://arxiv.org/html/2605.05358#bib.bib10)]. Parameters with large Fisher values correspond to directions in parameter space where small perturbations would strongly increase the cross-entropy loss $\mathcal{L}_{\mathrm{CE}}^{(\nu)}$ on previous exits and are therefore prioritized for protection.
Following the online EWC framework [[15](https://arxiv.org/html/2605.05358#bib.bib6)], the parameter-level regularization is defined as
$$\mathcal{R}_{\mathrm{EWC}}^{(\mu)}=\sum_{\theta_k\in\cup_{i=1}^{\mu-1}\Theta_i}\left(\sum_{\nu=1}^{\mu-1}F_k^{(\nu)}\right)\left(\theta_k-\theta_k^{*}\right)^{2} \qquad (2)$$

Here, $\theta_k^{*}\in\mathbb{R}$ denotes the value of parameter $\theta_k\in\cup_{i=1}^{\mu-1}\Theta_i$ after training up to exit $\mu-1$. The term $F_k^{(\nu)}\in\mathbb{R}$ represents the Fisher information measuring the importance of parameter $\theta_k$ for exit $\nu$. In practice, we use the empirical Fisher information [[18](https://arxiv.org/html/2605.05358#bib.bib20)], which is computed as:
$$F_k^{(\nu)}=\frac{1}{N}\sum_{\mathbf{x}\in\mathcal{D}}\left(\left.\frac{\partial}{\partial\theta_k}\log\hat{y}_{\nu}(\mathcal{C}(\mathbf{x}))\right|_{\theta_k=\theta_k^{*}}\right)^{2} \qquad (3)$$

where $\mathcal{D}$ denotes the training dataset, $N$ is the total number of training samples, and $\hat{y}_{\nu}(\mathcal{C}(\mathbf{x}))$ is the probability associated with the true label $\mathcal{C}(\mathbf{x})\in\{1,\dots,C\}$ for input $\mathbf{x}$ at exit $\nu$.
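A straightforward (if slow) way to estimate the empirical Fisher of Eq. (3) is to average the squared per-sample gradients of the log-probability of the true label. The sketch below is our own illustration under that assumption; `forward_to_exit` and `shared_params` (the tensors in $\Theta_1,\dots,\Theta_\nu$ to protect) are hypothetical handles, and the returned per-exit estimates would be accumulated into the Fisher sum used in Eq. (2).

```python
import torch
import torch.nn.functional as F

def empirical_fisher(nu, forward_to_exit, shared_params, data_loader):
    """Empirical Fisher (Eq. 3): mean squared per-sample gradient of log y_nu(true label)."""
    fisher = [torch.zeros_like(p) for p in shared_params]
    n_samples = 0
    for x_batch, y_batch in data_loader:
        for x, y in zip(x_batch, y_batch):            # per-sample scores, as in Eq. (3)
            log_probs = F.log_softmax(forward_to_exit(x.unsqueeze(0), nu), dim=1)
            grads = torch.autograd.grad(log_probs[0, y], shared_params)
            for f, g in zip(fisher, grads):
                f += g.detach() ** 2
            n_samples += 1
    return [f / n_samples for f in fisher]
```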
### 2.2 Output Distribution-level Regularization
The second proposed approach takes a fundamentally different perspective by operating at the output distribution level. Rather than protecting individual parameters, LwF preserves knowledge by enforcing that the current model maintains the input-output mappings learned by previously trained exits. In contrast to EWC, LwF imposes a functional constraint that is agnostic to specific parameter values and instead focuses on preserving decision boundaries. Under LwF, parameters are free to change substantially as long as the output distributions of previous exits are only minimally shifted by the training of the new exit, allowing the network to adapt its internal representations while maintaining stable behavior. Importantly, the LwF loss enables gradient flow through both the shared backbone and the classifier heads of earlier exits; this added flexibility allows earlier classifiers to compensate for moderate backbone drift, potentially mitigating performance degradation. The output distribution-level regularization is thus defined as:
$$\mathcal{R}_{\mathrm{LwF}}^{(\mu)}=\sum_{\nu=1}^{\mu-1}\mathrm{KL}\left(\mathbf{\hat{y}}_{\nu}\,\|\,\mathbf{\hat{y}}^{*}_{\nu}\right) \qquad (4)$$

where $\mathbf{\hat{y}}_{\nu}$ denotes the output distribution of IC $\nu$ when training exit $\mu$, $\mathbf{\hat{y}}^{*}_{\nu}$ denotes the output distribution obtained after the training of IC $\nu$, and $\mathrm{KL}(\cdot\,\|\,\cdot)$ is the Kullback-Leibler divergence.
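A complementary piece, again only a hedged sketch under our own assumptions, is caching the reference distributions $\mathbf{\hat{y}}^{*}_{\nu}$ of all previously trained exits before a new stage starts, so they can serve as fixed targets in Eq. (4); `forward_to_exit` is the same hypothetical helper as above, and the loader is assumed to iterate in a fixed order so cached outputs can be matched back to samples.

```python
import torch

@torch.no_grad()
def record_exit_distributions(mu, forward_to_exit, loader):
    """Cache softmax outputs y*_nu of exits nu < mu before training IC_mu (targets of Eq. 4)."""
    stored = {nu: [] for nu in range(1, mu)}
    for x, _ in loader:
        for nu in range(1, mu):
            stored[nu].append(torch.softmax(forward_to_exit(x, nu), dim=1).cpu())
    return {nu: torch.cat(chunks, dim=0) for nu, chunks in stored.items()}
```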
## 3 Experiments
We evaluate our approach on standard early-exit benchmarks following the experimental setup proposed in [[11](https://arxiv.org/html/2605.05358#bib.bib1)]. Our evaluation focuses on comparing the proposed EWC and LwF regularization strategies against existing sequential training methods across different architectures and computational budgets.
Table 1: Top-1 accuracy for ResNet-34 and MSDNet on CIFAR-100 compared to baselines.

### 3.1 Experimental Setup
**Dataset.** We conduct experiments on CIFAR-100 [[10](https://arxiv.org/html/2605.05358#bib.bib21)], a widely used image classification benchmark consisting of 60,000 32×32 RGB images across 100 fine-grained classes. The dataset is split into 50,000 training images and 10,000 test images.
**Architectures.** We adopt the same architectures as in [[11](https://arxiv.org/html/2605.05358#bib.bib1)]. We evaluate two complementary architectures: ResNet-34 [[6](https://arxiv.org/html/2605.05358#bib.bib22)] and MSDNet [[7](https://arxiv.org/html/2605.05358#bib.bib24)]. For ResNet-34, we augment the standard architecture with 8 internal classifiers positioned at layers $\{2, 4, \dots, 16\}$, where each exit consists of an SDN-type pooling [[8](https://arxiv.org/html/2605.05358#bib.bib23)], except for the last classifier, which uses adaptive average pooling, followed by a linear classifier. For MSDNet, we employ the CIFAR variant with 7 blocks.
**Baselines.** All baseline accuracy scores are taken from [[11](https://arxiv.org/html/2605.05358#bib.bib1)]. We primarily compare our method against three sequential training baselines: disjoint, branch-wise, and separate training, and include joint training for completeness. For our proposed approach, we set $\lambda=100$ and $\rho=0.7$ for ResNet-34, and $\lambda=1000$ and $\rho=0.2$ for MSDNet, unless stated otherwise.
**Training Details.** We first pretrain each backbone network with only the final exit using conventional supervised training. Then, following [[11](https://arxiv.org/html/2605.05358#bib.bib1)], for ResNet-34, we use a batch size of 128, a learning rate of $5\cdot 10^{-4}$, and no weight decay. For MSDNet, we use a batch size of 512 and a learning rate of $10^{-3}$ with no weight decay. Both models employ CutMix and Mixup augmentation strategies and early stopping with a patience of 50 epochs. After pretraining the backbones, we add ICs sequentially from shallow to deep, applying our proposed regularization at each stage.
### 3.2 Results and Discussion
[Table 1](https://arxiv.org/html/2605.05358#S3.T1) presents top-1 accuracy across exits corresponding to different computational budgets (25%, 50%, 75%, and 100% FLOPs) for our two EENN models on CIFAR-100. Our proposed method consistently outperforms all baseline sequential training approaches. Compared to the strongest sequential baseline (separate training), both EWC and LwF achieve substantial improvements, particularly at low budgets: +4.25% and +4.46% at the 25% budget, respectively. This demonstrates that our approach effectively balances early exit specialization with deep representation learning.
The comparison across architectures highlights the interplay between regularization type and backbone topology. In ResNet-34, both EWC and LwF improve early-exit performance, with EWC slightly outperforming LwF at the deepest exit, suggesting that parameter-level stabilization is sufficient in sequential residual blocks. The skip connections facilitate feature reuse, so anchoring parameters preserves shallow-exit performance without constraining deeper layers. In contrast, MSDNet's densely connected, multi-scale structure links many layers to multiple exits. Here, LwF consistently outperforms EWC across both early and late exits, indicating that output-distribution regularization better accommodates functional dependencies among exits. By constraining predictions rather than parameters, LwF maintains early-exit accuracy while allowing the backbone to adapt flexibly for later classifiers, effectively balancing stability and plasticity in a highly entangled multi-exit architecture.
Furthermore, we consistently observe improvements over joint training at earlier exits while maintaining comparable performance at later exits. This difference stems from the distinct optimization dynamics of the two approaches. Joint training drives all exits toward a shared compromise solution, whereas our method progressively increases regularization as new exits are added, limiting deeper layers' ability to modify earlier representations. Although this can slightly constrain later adaptation, it yields substantially stronger early exits, aligning with the core goal of EENNs. Further gains may be possible by dynamically scheduling the regularization strength to better balance early and late exits.
Table 2: Ablation study on MSDNet for different hyperparameters. Reg. stands for Regularization and defines which type of regularization is used. W.U. stands for Warm-up and defines whether the backbone is pretrained.

To better understand the impact of our design choices, we conduct an ablation study on MSDNet (see [Table 2](https://arxiv.org/html/2605.05358#S3.T2)). Our ablation explores two main dimensions: (i) the regularization strengths, namely $\lambda$ and $\rho$, and (ii) the impact of an initial pretraining, or warm-up, phase in which the full model is trained while optimizing only the final classifier. Across both methods, the strength of the regularization plays a central role in controlling the stability–plasticity trade-off: stronger regularization protects early exits but reduces the backbone's flexibility for deeper layers, while weaker regularization allows adaptation at the cost of shallow-exit performance. The key difference lies in how this constraint is applied. EWC rigidly anchors parameters important for earlier exits, stabilizing low-level features but limiting the backbone's ability to reorganize representations for deeper classifiers. This explains why early-exit performance improves steadily with larger $\lambda$, while high-budget exits are slightly constrained. In contrast, LwF constrains the outputs of earlier exits rather than the parameters themselves, allowing internal representations to adjust as long as the output distribution is preserved. A moderate $\rho$ achieves a balance, maintaining early-exit accuracy while enabling deep exits to benefit from flexible feature adaptation, whereas a high $\rho$ overemphasizes shallow classifiers and reduces high-budget performance. Backbone warm-up interacts with both methods by providing an initial alignment of features across layers before regularization is applied. Without warm-up, both methods struggle: EWC may freeze suboptimal parameters, and LwF may over-constrain features that are not yet meaningfully structured, leading to lower accuracy across all exits. Overall, this ablation study validates the effectiveness of our proposed two-stage strategy: pretraining the final exit establishes rich hierarchical features, and selective regularization allows each exit to adapt effectively without compromising the network's full representational capacity.
Fig. 2: Top-1 accuracy given different FLOPs budgets for MSDNet.

[Figure 2](https://arxiv.org/html/2605.05358#S3.F2) follows the budgeted classification approach (e.g., [[3](https://arxiv.org/html/2605.05358#bib.bib4), [11](https://arxiv.org/html/2605.05358#bib.bib1)]) by evaluating the top-1 accuracy of a subset of configurations on MSDNet within a given computational budget. EWC with $\lambda=10$ (blue curve) exhibits delayed accuracy growth, starting at only 55% and requiring approximately 40M FLOPs to reach 70% accuracy, nearly double the computation needed by well-regularized methods. This indicates that weak regularization fails to protect early-layer representations and forces reliance on deeper layers, hence a heavier computational load. In contrast, stronger regularization enables substantially faster convergence: LwF variants and well-tuned EWC configurations reach 70% accuracy at roughly 20-25M FLOPs, yielding about a 2x computational speedup.
Fig. 3: Exit ratios of MSDNet on the CIFAR-100 test set for different thresholds for EWC ($\lambda=1000$) and LwF ($\rho=0.5$) models. For each threshold, exits 1 through 7 are shown from left to right.

Finally, [Figure 3](https://arxiv.org/html/2605.05358#S3.F3) further analyzes how confidence thresholds control the dynamic behavior of MSDNet by reporting exit ratios across all intermediate classifiers. As the threshold increases, predictions are progressively shifted toward deeper exits, indicating that harder samples are routed to later classifiers with richer representations. Comparing EWC ($\lambda=1000$) and LwF ($\rho=0.5$) reveals a systematic difference: LwF exits fewer samples at early classifiers under the same thresholds, while EWC commits earlier. Interpreted together with [Figure 2](https://arxiv.org/html/2605.05358#S3.F2), where LwF achieves higher accuracy at substantially lower FLOPs, this shows that LwF provides a higher effective speedup despite exiting later on average. This behavior suggests that LwF learns more accurate but less confident early classifiers, deferring decisions more often, whereas EWC produces earlier but less accurate exits.
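For context, the threshold-based routing analyzed in Figure 3 amounts to exiting at the first classifier whose maximum softmax confidence exceeds the chosen threshold, with the final exit as fallback. A minimal sketch, assuming the same hypothetical `forward_to_exit` helper and a single-sample input batch:

```python
import torch

@torch.no_grad()
def early_exit_predict(x, forward_to_exit, num_exits, threshold):
    """Return (predicted class, exit index) for the first IC whose confidence exceeds the threshold."""
    for nu in range(1, num_exits + 1):
        probs = torch.softmax(forward_to_exit(x, nu), dim=1)
        confidence, prediction = probs.max(dim=1)
        if nu == num_exits or confidence.item() >= threshold:
            return prediction.item(), nu
```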
## 4 Conclusion
In this work, we propose a novel strategy for training EENNs to mitigate the degradation of previously learned representations caused by incremental exit addition. By incorporating parameter-level protection via EWC and output-distribution consistency via LwF, we establish a principled sequential training framework that balances exit specialization with knowledge preservation. Our experimental study demonstrates that the proposed approach consistently outperforms existing sequential training methods and achieves better early exit accuracy than joint training. Future work will investigate adaptive regularization scheduling and more structured protection mechanisms, e.g., at the level of convolutional kernels or attention heads.
## References
- [1] S. Amari (1998). Natural gradient works efficiently in learning. Neural Computation 10(2), pp. 251–276.
- [2] R. Dong, Y. Mao, and J. Zhang (2022). Resource-constrained edge AI with early exit prediction. Journal of Communications and Information Networks 7(2), pp. 122–134.
- [3] C. Gong, Y. Chen, Q. Luo, Y. Lu, T. Li, Y. Zhang, Y. Sun, and L. Zhang (2025). Deep feature surgery: towards accurate and efficient multi-exit networks. In Computer Vision – ECCV 2024, Cham, pp. 435–451.
- [4] D. Han, J. Park, S. Ham, N. Lee, and J. Moon (2024). Improving low-latency predictions in multi-exit neural networks via block-dependent losses. IEEE Transactions on Neural Networks and Learning Systems 35(11), pp. 16927–16935.
- [5] Y. Han, G. Huang, S. Song, L. Yang, H. Wang, and Y. Wang (2022). Dynamic neural networks: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(11), pp. 7436–7456.
- [6] K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- [7] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger (2018). Multi-scale dense networks for resource efficient image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 284–293.
- [8] Y. Kaya, S. Hong, and T. Dumitras (2019). Shallow-deep networks: understanding and mitigating network overthinking. In Proceedings of the 36th International Conference on Machine Learning, PMLR Vol. 97, pp. 3301–3310.
- [9] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13), pp. 3521–3526.
- [10] A. Krizhevsky (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
- [11] P. Kubaty, B. Wójcik, B. T. Krzepkowski, M. Michaluk, T. Trzcinski, J. Pomponi, and K. Adamczewski (2025). How to train your multi-exit model? Analyzing the impact of training strategies. In Forty-second International Conference on Machine Learning.
- [12] Z. Li and D. Hoiem (2018). Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12), pp. 2935–2947.
- [13] W. Liu, P. Zhou, Z. Wang, Z. Zhao, H. Deng, and Q. Ju (2020). FastBERT: a self-distilling BERT with adaptive inference time. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6035–6044.
- [14] H. Rahmath P, V. Srivastava, K. Chaurasia, R. G. Pacheco, and R. S. Couto (2024). Early-exit deep neural network – a comprehensive survey. ACM Computing Surveys 57(3).
- [15] J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell (2018). Progress & compress: a scalable framework for continual learning. In Proceedings of the 35th International Conference on Machine Learning, PMLR Vol. 80, pp. 4528–4537.
- [16] F. Szatkowski, F. Yang, B. Twardowski, T. Trzcinski, and J. van de Weijer (2024). Accelerated inference and reduced forgetting: the dual benefits of early-exit networks in continual learning. In 5th Workshop on Continual Learning in Computer Vision, CVPR, Seattle, USA.
- [17] S. Teerapittayanon, B. McDanel, and H. T. Kung (2016). BranchyNet: fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2464–2469.
- [18] G. M. van de Ven (2025). On the computation of the Fisher information in continual learning. In ICLR Blogposts 2025. https://iclr-blogposts.github.io/2025/blog/fisher/
- [19] L. Wang, X. Zhang, H. Su, and J. Zhu (2024). A comprehensive survey of continual learning: theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(8), pp. 5362–5383.
- [20] B. Wojcik, M. Przewiezlikowski, F. Szatkowski, M. Wolczyk, K. Balazy, B. Krzepkowski, I. Podolak, J. Tabor, M. Smieja, and T. Trzcinski (2023). Zero time waste in pre-trained early exit neural networks. Neural Networks 168, pp. 580–601.
- [21] J. Xin, R. Tang, J. Lee, Y. Yu, and J. Lin (2020). DeeBERT: dynamic early exiting for accelerating BERT inference. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2246–2251.
- [22] J. Xin, R. Tang, Y. Yu, and J. Lin (2021). BERxiT: early exiting for BERT with better fine-tuning and extension to regression. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 91–104.
- [23] G. Xu, J. Hao, L. Shen, H. Hu, Y. Luo, H. Lin, and J. Shen (2023). LGViT: dynamic early exiting for accelerating vision transformer. In Proceedings of the 31st ACM International Conference on Multimedia (MM '23), pp. 9103–9114.
- [24] W. Zhu, X. Wang, Y. Ni, and G. Xie (2021). GAML-BERT: improving BERT early exiting by gradient aligned mutual learning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3033–3044.