Forgetting is Not Erasure: Recovering Latent Knowledge via Transport Keys

arXiv cs.LG Papers

Summary

This paper argues that catastrophic forgetting in neural networks is not erasure but an interface alignment problem. It introduces 'transport keys' to recover latent task-specific features from sequentially trained models, demonstrating significant performance recovery on split CIFAR-100.

arXiv:2606.02860v1 (announce type: new)

# Forgetting is Not Erasure: Recovering Latent Knowledge via Transport Keys
Source: [https://arxiv.org/html/2606.02860](https://arxiv.org/html/2606.02860)
###### Abstract

Catastrophic forgetting is often framed as a representational problem: after sequential training, a model appears to lose the features that supported performance on earlier tasks\. We challenge the stronger form of this view\. Across controlled continual\-learning settings, we find that a significant portion of apparent forgetting can be attributed to interface drift between internal stages rather than permanent erasure of task\-relevant computation\. We study this phenomenon through a stitched evaluation protocol that combines early computation from a post\-update network with late computation from its predecessor, optionally mediated by a compact, task\-specific transport key\. We describe transport keys at a systems level as compact interface\-alignment operators estimated from a small set of paired anchor activations and evaluated through model stitching\. On split CIFAR\-100 with a ResNet\-style network, transport keys recover most of the original Task A performance after sequential training on Task B\. On a compact vision transformer, we observe a similar recovery pattern\. These results suggest that continual learning may require better mechanisms for indexing and re\-accessing latent computations, not only methods that prevent weight change\.

## 1 Introduction

Continual learning (CL) refers to the ability of an autonomous agent or machine to learn multiple different types of knowledge and behave as if it had learned all of them at once Wang et al. ([2024](https://arxiv.org/html/2606.02860#bib.bib16)). Enabling CL remains one of the largest open problems in contemporary deep networks, where fine-tuning or training on a new task often causes accuracy on a prior task to collapse. This collapse has historically been called catastrophic forgetting, or erasure, where a model’s ability to generalize is limited by its internal representations McCloskey and Cohen ([1989](https://arxiv.org/html/2606.02860#bib.bib12)). Catastrophic forgetting is often framed as the result of a fundamental architectural compromise between stability and plasticity: the only way to enable a model to learn new tasks without losing performance on prior tasks is to scale it French ([1999](https://arxiv.org/html/2606.02860#bib.bib6)). In this work, we show that in high-dimensional neural representations, forgetting can be modeled as an *access problem* that results from drift between network stages, and that training on a new task need not permanently erase a model’s ability to perform a task it learned previously. By modeling a neural network’s representation of a newly learned task as a staged computation in which later layers learn to interpret intermediate activations, we define forgetting as a dimensional problem: the internal features of a model are altered significantly when learning a new task, leaving the original set of features associated with a task invalid. Under this view, a model’s performance on a task, or its functionality over a wide range of scenarios, can be recovered through the use of a small key that realigns its internal features at specific internal interfaces. Drawing upon model stitching Bansal et al. ([2021](https://arxiv.org/html/2606.02860#bib.bib3)), we construct such a key, allowing us to recover latent task-specific features and lost performance.

#### Contributions

Our specific contributions are as follows\.

\(i\) We present evidence that a substantial component of catastrophic forgetting can be interpreted as an interface\-access failure rather than complete representational erasure\.

\(ii\) We introduce the transport\-key framing: a compact, task\-specific alignment object that restores compatibility between stages of sequentially trained networks by correcting activation\-space interface drift\.

\(iii\) We provide a stitched evaluation protocol that directly tests whether post\-update networks preserve task\-relevant latent computation, and we use controls to distinguish genuine interface alignment from generic adaptation\.

\(iv\) We report preliminary results on ResNet\-style and transformer\-style vision models showing substantial recovery of Task A performance after sequential training\.

#### Related Work

Continual learning (CL) has been studied extensively across supervised and reinforcement learning settings, with a focus on addressing the inherent tradeoff between long-term stability and plasticity French ([1999](https://arxiv.org/html/2606.02860#bib.bib6)). Contemporary approaches to forgetting have mainly focused on constraining weight updates to specific parameters, such as Elastic Weight Consolidation (EWC) Kirkpatrick et al. ([2017](https://arxiv.org/html/2606.02860#bib.bib8)), Synaptic Intelligence (SI) Zenke et al. ([2017](https://arxiv.org/html/2606.02860#bib.bib17)), and Memory Aware Synapses Aljundi et al. ([2018](https://arxiv.org/html/2606.02860#bib.bib2)). Alternative methodologies store and replay prior-task examples while training on new ones, as seen in iCaRL Rebuffi et al. ([2017](https://arxiv.org/html/2606.02860#bib.bib13)) and Experience Replay Rolnick et al. ([2019](https://arxiv.org/html/2606.02860#bib.bib14)). Meanwhile, gradient-based methods such as GEM Lopez-Paz and Ranzato ([2017](https://arxiv.org/html/2606.02860#bib.bib10)) and A-GEM Chaudhry et al. ([2019](https://arxiv.org/html/2606.02860#bib.bib4)) constrain gradient updates so that they do not increase the loss on prior tasks.

We implement the stitching evaluation as an extension of model stitching, initially defined by Lenc and Vedaldi ([2015](https://arxiv.org/html/2606.02860#bib.bib9)); Bansal et al. ([2021](https://arxiv.org/html/2606.02860#bib.bib3)). Traditionally, stitching has been used to compare architectures or to merge models in weight space Entezari et al. ([2022](https://arxiv.org/html/2606.02860#bib.bib5)); Ainsworth et al. ([2023](https://arxiv.org/html/2606.02860#bib.bib1)). We adapt stitching to a continual-learning setting by treating the post-update network as the sender and its predecessor as the receiver. As such, we are able to test directly for the existence of latent knowledge without retraining or gradient updates, in contrast to parameter-based methods such as those of Rusu et al. ([2016](https://arxiv.org/html/2606.02860#bib.bib15)); Mallya and Lazebnik ([2018](https://arxiv.org/html/2606.02860#bib.bib11)), or adapter-dependent methods such as Houlsby et al. ([2019](https://arxiv.org/html/2606.02860#bib.bib7)).

## 2 Background and Formalization

We start by defining some preliminaries, particularly centered around formalizing sequential training as it applies to a standard CL setting. We also provide a generalized definition of representational drift, which is essentially a value that quantifies the difference in a model’s ability to perform a Task A once it has been subsequently trained on a different Task B.

### 2.1 Sequential Training

We start with a standard neural network $f(x;\theta)$ that maps inputs $x$ to outputs, and train it on a sequence of tasks. For simplicity, we restrict our focus to two tasks, although in practice this extends to any number of tasks trained one after the other.

Let $\theta_A$ denote the parameters after training on an initial Task A. We then continue training on Task B, yielding parameters $\theta_{AB}$. In classic CL, the accuracy on Task A of the model with parameters $\theta_{AB}$ degrades significantly, and the most common explanation is that training on Task B overwrites the weights that defined the features associated with Task A.

### 2.2 Network Decomposition

Contemporary deep networks can be viewed as a composition of stages. For example, in a traditional ResNet-style network, these stages are the stem followed by four progressively deeper residual blocks.

At any individual stage $\ell$, we can split the network into two parts:

- The early network $f_{\leq\ell}$, which maps an input $x$ to an intermediate activation tensor $h_\ell \in \mathbb{R}^{C \times H \times W}$.
- The late network $f_{>\ell}$, which maps $h_\ell$ to a prediction, including the classification head.

This decomposition allows us to precisely determine the stage at which performance on Task A drops once the network’s parameters have been updated\.
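The split at a stage boundary can be illustrated with a toy sketch. The scalar "stages" and the helper names `compose` and `split_at` below are hypothetical stand-ins for real network blocks, chosen only to show that composing the early and late parts reproduces the full network.

```python
from typing import Callable, List, Sequence, Tuple

Stage = Callable[[float], float]


def compose(stages: Sequence[Stage]) -> Stage:
    """Chain stages left to right into a single function."""
    def run(x: float) -> float:
        for stage in stages:
            x = stage(x)
        return x
    return run


def split_at(stages: Sequence[Stage], ell: int) -> Tuple[Stage, Stage]:
    """Return (f_<=ell, f_>ell): the first `ell` stages and the remainder."""
    if not 0 < ell < len(stages):
        raise ValueError("ell must leave at least one stage on each side")
    return compose(stages[:ell]), compose(stages[ell:])


# Hypothetical four-stage "network" on scalars; the last stage plays the head.
stages: List[Stage] = [
    lambda x: x + 1.0,
    lambda x: 2.0 * x,
    lambda x: x - 3.0,
    lambda x: x * x,
]

early, late = split_at(stages, ell=2)
full = compose(stages)

h = early(0.5)  # intermediate activation at the interface after stage 2
assert late(h) == full(0.5)
```

In a real model the interface value `h` would be a $C \times H \times W$ tensor rather than a scalar, but the composition property is the same.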

### 2.3 Interface Drift

Our primary hypothesis is that the catastrophic forgetting often seen in neural networks occurs not at a particular stage, but at the interface, such that the later stages in a network are no longer able to recognize the features for a prior task produced by the earlier stages\.

To formalize this, suppose there exists a transformation $T_\ell$ such that

$$h_\ell(x;\theta_A) \approx T_\ell\bigl(h_\ell(x;\theta_{AB})\bigr) \tag{1}$$

for inputs $x$ from Task A. If $T_\ell$ is simple, then the post-update network is still able to compute and encode information associated with Task A: the problem arises during decoding, when the later stages of the network (which have now been updated) expect a different computational result for the features they are decoding.

Figure [1](https://arxiv.org/html/2606.02860#S2.F1) showcases the phenomenon. Rather than the features associated with Task A being permanently deleted as a result of training on Task B, they are simply uninterpretable by the later stages of the network. We call this phenomenon *interface drift*.

![Refer to caption](https://arxiv.org/html/2606.02860v1/figures/ijcai_paper_1.jpg)

Figure 1: Comparison of traditional erasure versus interface drift
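A toy illustration of the distinction, under strong simplifying assumptions: the two-channel "networks" and the hand-written key below are hypothetical, and the drift is an exactly invertible per-channel rescaling and shift. The post-update early network still encodes everything the old decoder needs, yet feeding its output directly to the old decoder gives chance-level accuracy until the interface is realigned.

```python
from typing import Callable, List, Optional

Vector = List[float]


def early_A(x: float) -> Vector:
    """Pre-update early network: two channels."""
    return [x, 1.0 - x]


def early_AB(x: float) -> Vector:
    """Post-update early network: same information, drifted calibration."""
    a = early_A(x)
    return [4.0 * a[0] + 0.5, 0.25 * a[1] - 1.0]


def late_A(h: Vector) -> int:
    """Pre-update decoder: predicts class 1 when channel 0 exceeds channel 1."""
    return 1 if h[0] > h[1] else 0


def key(h: Vector) -> Vector:
    """Hand-written inverse of the drift (a real key would be estimated)."""
    return [(h[0] - 0.5) / 4.0, (h[1] + 1.0) / 0.25]


def stitched_accuracy(xs: List[float], labels: List[int],
                      T: Optional[Callable[[Vector], Vector]] = None) -> float:
    """Accuracy of late_A applied to (optionally keyed) early_AB activations."""
    correct = 0
    for x, y in zip(xs, labels):
        h = early_AB(x)
        if T is not None:
            h = T(h)
        correct += int(late_A(h) == y)
    return correct / len(xs)


xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [late_A(early_A(x)) for x in xs]  # Task A labels from the old model

no_key_acc = stitched_accuracy(xs, labels)        # old decoder misreads inputs
keyed_acc = stitched_accuracy(xs, labels, T=key)  # realigned interface
```

Here the no-key stitch predicts class 1 for every input, while the keyed stitch matches the pre-update model on all eight inputs.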

## 3 Methodology

This section explains transport keys at the level needed to understand the mechanism and the empirical claim\. The central object is an interface\-level alignment operator: it is inserted between an early portion of a post\-update network and a late portion of the pre\-update network, and it maps the new activation coordinate system back into a form that the old downstream computation can decode\.

### 3.1 Overview: Transport Keys as Interface Alignment

Let $f(x;\theta)$ be a neural network trained first on Task A and then on Task B. Let $\theta_A$ denote the parameters after Task A training and $\theta_{AB}$ denote the parameters after subsequent Task B training. At an internal stage $\ell$, we decompose the network into an early map $f_{\leq\ell}$ and a late map $f_{>\ell}$. For an input $x$ from Task A, the two checkpoints induce paired activations

$$h^{A}_{\ell}(x) = f^{A}_{\leq\ell}(x), \qquad h^{AB}_{\ell}(x) = f^{AB}_{\leq\ell}(x). \tag{2}$$

A transport key $T_\ell$ is a compact transformation that acts on $h^{AB}_{\ell}(x)$ so that the transformed activation is readable by the old Task A decoder:

$$\widetilde{h}^{A}_{\ell}(x) = T_{\ell}\bigl(h^{AB}_{\ell}(x)\bigr). \tag{3}$$

Rather than being a new trained object or replay buffer, a transport key is simply an activation-space alignment object attached to a particular interface. Its role is to correct the mismatch between what the post-update early network now emits and what the pre-update late network expects to receive.

Intuitively, sequential training can change an internal representation in at least two ways\. First, it can shift or rescale individual channels, leaving the semantic content mostly intact but changing calibration\. Second, it can rotate or mix features across channels, leaving the information present but expressed in a different basis\. Transport keys are designed to correct these interface\-level changes while preserving the original downstream decision rule\.

### 3.2 Anchor Sets

To estimate a key, we use a small anchor set drawn from Task A\. Anchors are ordinary examples from the earlier task and are passed through both checkpoints\. This produces paired activations at the same interface: one activation from the Task A checkpoint and one from the post\-update checkpoint\. The use of paired anchors is important because the key is intended to align corresponding computations, not merely match aggregate activation statistics\.

Balanced anchor selection is used when the earlier task is a classification problem, so that the alignment signal is not dominated by a small subset of classes\. The anchor set is used only to build the interface key\. It does not retrain the backbone, does not update the Task A head, and does not introduce new labels at evaluation time\.
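Balanced anchor selection could look like the following sketch. The function name `balanced_anchors`, the per-class budget, and the seeded sampler are illustrative assumptions; the paper does not specify how anchors are drawn beyond class balance.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List


def balanced_anchors(labels: List[int], per_class: int, seed: int = 0) -> List[int]:
    """Pick `per_class` example indices from every class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    chosen: List[int] = []
    for y in sorted(by_class):
        pool = by_class[y]
        if len(pool) < per_class:
            raise ValueError(f"class {y} has only {len(pool)} examples")
        chosen.extend(rng.sample(pool, per_class))
    return sorted(chosen)


# Deliberately imbalanced toy label set: 50, 5, and 20 examples per class.
labels = [0] * 50 + [1] * 5 + [2] * 20
anchors = balanced_anchors(labels, per_class=4)
counts = Counter(labels[i] for i in anchors)
```

Each selected index would then be passed through both checkpoints to produce one paired activation at the chosen interface.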

### 3.3 Key Families

We consider two conceptual families of transport keys\. The first is a channel\-calibration key\. It corrects per\-channel drift, such as changes in activation scale or offset\. This form is deliberately small and is most effective when sequential training preserves the original channel basis but changes its calibration\. In the same\-domain CIFAR experiments, this compact correction explains most of the recovered accuracy\.
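A minimal sketch of how a channel-calibration key might be estimated, assuming a per-channel affine form and an ordinary least-squares fit on paired anchors. Both the form and the estimator are assumptions made for illustration, and the activations are treated as spatially pooled vectors of length $C$ for brevity.

```python
from typing import List, Tuple

Key = Tuple[List[float], List[float]]  # (per-channel scales, per-channel offsets)


def fit_channel_key(h_ab: List[List[float]], h_a: List[List[float]]) -> Key:
    """Least-squares fit of h_a[:, c] ~ a_c * h_ab[:, c] + b_c for each channel."""
    n, channels = len(h_ab), len(h_ab[0])
    scales, offsets = [], []
    for c in range(channels):
        xs = [row[c] for row in h_ab]
        ys = [row[c] for row in h_a]
        mx, my = sum(xs) / n, sum(ys) / n
        var = sum((x - mx) ** 2 for x in xs)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        a = cov / var if var > 0 else 1.0  # constant channel: leave scale alone
        scales.append(a)
        offsets.append(my - a * mx)
    return scales, offsets


def apply_channel_key(key: Key, h: List[float]) -> List[float]:
    """Map one post-update activation back toward the Task A coordinates."""
    scales, offsets = key
    return [a * v + b for a, b, v in zip(scales, offsets, h)]


# Paired anchor activations (toy values): channel 0 drifted as 2h + 1,
# channel 1 drifted as 0.5h - 1.
h_a = [[1.0, 2.0], [2.0, 0.0], [3.0, 4.0], [4.0, 2.0]]
h_ab = [[3.0, 0.0], [5.0, -1.0], [7.0, 1.0], [9.0, 0.0]]

key = fit_channel_key(h_ab, h_a)
restored = apply_channel_key(key, [5.0, -1.0])
```

With exactly affine drift, as in this toy data, the fitted key inverts the drift and `restored` equals the second Task A anchor. The key itself holds only two numbers per channel.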

The second is a cross-channel alignment key. It allows structured mixing between channels and is useful when the post-update representation has changed basis more substantially. This form is still applied at the activation interface rather than through end-to-end retraining. It becomes more important under domain shift, where a network trained on SVHN after CIFAR-10 may preserve useful CIFAR-10 information but express it in a less directly compatible coordinate system.

These key families correspond to two interpretable modes of interface drift: *calibration drift*, where individual channels remain meaningful but change scale, and *mixing drift*, where information is distributed across channels in a different basis. The experiments below use this distinction to explain why small keys suffice in some settings while more expressive keys matter in others.
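One way to make the two families concrete, as an assumption rather than the paper's stated parameterization, is to write both as affine maps applied independently at each spatial position of $h \in \mathbb{R}^{C \times H \times W}$:

```latex
\begin{aligned}
\text{channel calibration:}\quad & T_\ell(h)_c = a_c\, h_c + b_c,
  && a, b \in \mathbb{R}^{C}
  && (2C \text{ parameters}) \\
\text{cross-channel alignment:}\quad & T_\ell(h) = W h + b,
  && W \in \mathbb{R}^{C \times C},\ b \in \mathbb{R}^{C}
  && (C^2 + C \text{ parameters})
\end{aligned}
```

Under this assumed form, an interface with $C = 64$ channels would need 128 numbers for a calibration key and 4,160 for a cross-channel key. The calibration key is the special case of the cross-channel key in which $W$ is diagonal, which matches the reading of calibration drift as a change of scale and mixing drift as a change of basis.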

### 3.4 Stitched Evaluation

We evaluate transport keys using model stitching. Given the post-update early network $f_{\leq\ell}^{AB}$, the pre-update late network $f_{>\ell}^{A}$, and a transport key $T_\ell$, the stitched model is

$$\bigl(f_{\leq\ell}^{AB} \oplus_{T_\ell} f_{>\ell}^{A}\bigr)(x) = f_{>\ell}^{A}\bigl(T_\ell(f_{\leq\ell}^{AB}(x))\bigr). \tag{4}$$

For Task A evaluation, the stitched prediction is

$$\hat{y} = \mathrm{head}_A\bigl(f_{>\ell}^{A}(T_\ell(f_{\leq\ell}^{AB}(x)))\bigr). \tag{5}$$

The Task A head is preserved and is not retrained. This design makes the evaluation strict: recovery must come from restoring compatibility with the old internal decoder, not from learning a fresh classifier on top of post-update features.

This protocol separates three cases\. If the post\-update early network has destroyed the information needed for Task A, keyed stitching should fail\. If the information remains but has drifted out of the coordinate system expected by the old decoder, no\-key stitching should perform poorly while keyed stitching should recover\. If the interface remains mostly compatible, no\-key stitching itself can recover substantial performance\.
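The three cases can be read off the measured accuracies. The sketch below is a hypothetical decision rule: the function names and the 0.5 threshold are assumptions, not part of the paper's protocol. The first two calls use accuracies reported in Section 5; the third uses made-up numbers to illustrate the erasure case.

```python
def gap_closed(pre: float, post: float, stitched: float) -> float:
    """Fraction of the Task A accuracy gap closed by a stitched model."""
    if pre <= post:
        raise ValueError("no forgetting to recover: pre must exceed post")
    return (stitched - post) / (pre - post)


def classify_interface(pre: float, post: float, no_key: float, keyed: float,
                       threshold: float = 0.5) -> str:
    """Label an interface; `threshold` is an illustrative cut-off."""
    if gap_closed(pre, post, no_key) >= threshold:
        return "compatible"  # the old decoder already reads the new activations
    if gap_closed(pre, post, keyed) >= threshold:
        return "drifted"     # information retained, recovered only with a key
    return "erased"          # even keyed stitching fails


# Split CIFAR-100, stage 1 (reported accuracies).
split_cifar = classify_interface(pre=0.750, post=0.392, no_key=0.390, keyed=0.721)

# Mini ViT, block 0 (reported accuracies).
mini_vit = classify_interface(pre=0.590, post=0.315, no_key=0.541, keyed=0.542)

# Hypothetical numbers, not from the paper, illustrating the erasure case.
made_up = classify_interface(pre=0.80, post=0.30, no_key=0.30, keyed=0.32)
```

Any fixed threshold is a simplification; in practice the three cases lie on a continuum, and the stage-wise results show recovery varying gradually across boundaries.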

### 3.5 Controls

We use controls to test whether the key is exploiting real interface structure. The no-key control sends $h^{AB}_{\ell}$ directly into the old downstream network. Channel-disruption controls test whether channel identity matters for compact calibration keys. Correspondence-breaking controls test whether the mapping depends on paired examples rather than only marginal statistics. Stage-wise controls evaluate the same procedure at multiple internal boundaries.

The expected signature is specific: valid keyed stitching should outperform the forgotten model and the no\-key stitch; disrupting channel structure or example correspondence should reduce recovery; and recovery should vary systematically by stage\. This is the pattern observed in the experiments below\.
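The two structure-disrupting controls can be sketched on toy data. Everything here is an illustrative assumption: the per-channel affine key, the least-squares fit, the use of alignment error instead of stitched accuracy, and applying both controls to a calibration key (the paper applies the correspondence-breaking control to cross-channel keys).

```python
from typing import List, Tuple

Affine = Tuple[float, float]


def fit_affine(xs: List[float], ys: List[float]) -> Affine:
    """Least-squares fit of ys ~ a * xs + b for one channel."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var > 0 else 1.0
    return a, my - a * mx


def fit_key(h_ab: List[List[float]], h_a: List[List[float]]) -> List[Affine]:
    channels = range(len(h_ab[0]))
    return [fit_affine([r[c] for r in h_ab], [r[c] for r in h_a]) for c in channels]


def alignment_error(key: List[Affine], h_ab: List[List[float]],
                    h_a: List[List[float]]) -> float:
    """Mean squared distance between keyed activations and Task A activations."""
    total = 0.0
    for src, tgt in zip(h_ab, h_a):
        out = [a * v + b for (a, b), v in zip(key, src)]
        total += sum((o - t) ** 2 for o, t in zip(out, tgt))
    return total / len(h_ab)


# Toy paired anchors: channel 0 drifted as 2h + 1, channel 1 as 0.5h - 1.
h_a = [[1.0, 2.0], [2.0, 0.0], [3.0, 4.0], [4.0, 2.0]]
h_ab = [[3.0, 0.0], [5.0, -1.0], [7.0, 1.0], [9.0, 0.0]]

valid_key = fit_key(h_ab, h_a)
perm_key = list(reversed(valid_key))        # channel-permutation control
broken_key = fit_key(h_ab, h_a[1:] + h_a[:1])  # pairing shifted by one example

err_valid = alignment_error(valid_key, h_ab, h_a)
err_perm = alignment_error(perm_key, h_ab, h_a)
err_broken = alignment_error(broken_key, h_ab, h_a)
```

On this toy data the valid key aligns the anchors exactly, while both disrupted keys leave a large residual, which mirrors the expected signature in miniature. For simplicity the error is measured on the anchors themselves rather than on held-out examples.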

![Refer to caption](https://arxiv.org/html/2606.02860v1/figures/figure_2.png)

Figure 2: Transport-key evaluation workflow. Paired anchor activations from the Task A checkpoint and the post-update checkpoint are used to build an interface key. The key is then inserted into a stitched model that combines post-update early computation with the original Task A downstream decoder.

## 4 Experimental Setup

We evaluate the transport\-key hypothesis in controlled image\-classification continual\-learning settings\. The purpose of these experiments is not to establish a production\-ready continual\-learning system, but to test whether post\-update networks retain latent computations that can be re\-accessed by an internal alignment mechanism\.

### 4.1 Datasets and Task Constructions

We use CIFAR-100, CIFAR-10, and SVHN as standard vision benchmarks. CIFAR-100 enables same-domain task splits; CIFAR-10 to SVHN provides a stronger domain-shift test because natural-object recognition is followed by digit recognition. We consider four settings, summarized in Table [1](https://arxiv.org/html/2606.02860#S4.T1): a two-task CIFAR-100 split, a longer three-task CIFAR-100 sequence, a CIFAR-10 to SVHN domain-shift sequence, and a compact vision-transformer variant of the CIFAR-100 split.

Table 1: Summary of experimental settings.

### 4.2 Models and Training

The main convolutional experiments use a ResNet\-style network adapted to small images\. The transformer experiment uses a compact ViT\-style architecture\. In each setting, the model is trained sequentially: first on Task A, then on one or more subsequent tasks\. The Task A head is preserved for evaluation so that degradation in Task A performance is attributable to changes in the shared representation rather than replacement of the classifier\.

The baseline sequential\-training setup follows ordinary supervised continual\-learning practice\. Each model is trained on Task A, then continued on the subsequent task or tasks\. Transport keys are computed after the sequential update from paired internal activations and are evaluated only through the stitched protocol described above\. This keeps the recovery test separate from ordinary retraining or adapter fine\-tuning\.

### 4.3 Evaluation Metrics

Our primary metric is classification accuracy on the Task A test set. We report four quantities:

1. Pre-update accuracy: Task A accuracy immediately after Task A training, using parameters $\theta_A$.
2. Post-update accuracy: Task A accuracy after subsequent training, using the post-update shared network with the preserved Task A head.
3. Stitched accuracy: Task A accuracy using the stitched network defined in Section [3](https://arxiv.org/html/2606.02860#S3).
4. Recovery rate: the fraction of lost accuracy recovered by stitching,

   $$\text{Recovery} = \frac{\text{Stitched} - \text{Post-update}}{\text{Pre-update} - \text{Post-update}}. \tag{6}$$

A recovery rate of 100% means the stitched network matches the original Task A accuracy\. A recovery rate near 0% means stitching provides no improvement over direct post\-update evaluation\.
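As a worked instance of Eq. (6), take the Split CIFAR-100 accuracies reported in Section 5.1 (pre-update 0.750, post-update 0.392, keyed stitch 0.721, no-key stitch 0.390):

```latex
\text{Recovery}_{\text{keyed}}
  = \frac{0.721 - 0.392}{0.750 - 0.392}
  = \frac{0.329}{0.358} \approx 0.92,
\qquad
\text{Recovery}_{\text{no-key}}
  = \frac{0.390 - 0.392}{0.750 - 0.392}
  = \frac{-0.002}{0.358} \approx -0.006 .
```

The keyed value corresponds to the 92% figure quoted for stage 1, and the no-key value is effectively zero, which is the near-0% case described above.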

### 4.4 Stage Selection

We evaluate stitching at multiple internal stage boundaries. Earlier stages test whether low-level and mid-level features remain accessible after sequential training. Later stages test whether more task-specific representations can be realigned. We report stage identifiers such as $s1$–$s4$ for convolutional networks and $b0$–$b3$ for transformer blocks.

## 5 Results

We present our results across all aforementioned experimental settings. Table [2](https://arxiv.org/html/2606.02860#S5.T2) summarizes the primary findings.

Table 2: Task A accuracy before training on subsequent tasks (Pre-update), after training (Post-update), and after applying a transport key at the indicated stage (Keyed). All settings show substantial recovery.

### 5.1 Split CIFAR-100

Figure [3](https://arxiv.org/html/2606.02860#S5.F3) showcases our primary result. After training on Task B, Task A accuracy drops from 0.750 to 0.392. No-key stitching at stage 1 yields 0.390, indicating that the interface is not already compatible. A compact transport key recovers accuracy to 0.721, restoring 92% of lost performance. A more expressive cross-channel key family provides only marginal additional benefit, suggesting that the relevant drift in this setting is mostly structured channel-calibration drift rather than arbitrary feature rewriting.

![Refer to caption](https://arxiv.org/html/2606.02860v1/figures/fig1_splitcifar100_recovery_bars.png)

Figure 3: Split CIFAR-100 recovery via stitching at stage 1. A compact transport key recovers 92% of lost Task A accuracy. A more expressive cross-channel variant provides only marginal additional benefit.

Recovery degrades significantly at later stages. Table [3](https://arxiv.org/html/2606.02860#S5.T3) shows that keyed stitching at $s1$ achieves 92% recovery, dropping to 71% at $s2$, 50% at $s3$, and failing entirely at $s4$. This indicates that the early features associated with a task remain largely intact after the update and carry most of the recoverable latent computation.

Table 3: Stage-wise recovery in Split CIFAR-100.

### 5.2 Extended Task Sequences

The three-task experiment ($A \to B \to C$) tests whether the interface-drift pattern persists beyond two-task sequences. Task A accuracy degrades from 0.735 after initial training to 0.450 after Task B and 0.337 after Task C. Despite two updates, a transport key at $s1$ recovers accuracy to 0.674, restoring 85% of lost performance. The stage-wise gradient replicates: $s1$ achieves 85% recovery, falling to 71% at $s2$, 50% at $s3$, and failing at $s4$.

### 5.3 Domain Shift

In the domain-shift setting (CIFAR-10 $\to$ SVHN), the drop in performance is severe, with Task A accuracy collapsing from 0.877 to 0.154. Yet no-key stitching at $s1$ already recovers to 0.619, indicating substantial latent retention even when the subsequent task involves a different visual domain.

Table [4](https://arxiv.org/html/2606.02860#S5.T4) presents the control analysis. The transport key substantially improves accuracy at the earliest stage, while structure-disrupting controls reduce the gain. At the next stage, the same pattern becomes sharper: keyed recovery remains strong, while a correspondence-breaking control collapses close to the post-update baseline. These results support the claim that recovery depends on structured alignment rather than generic post-hoc adaptation.

Table 4: Control analysis under domain shift (CIFAR-10 $\to$ SVHN). Perm is a channel-permutation control for compact calibration keys. Rand is a correspondence-breaking control for cross-channel keys.

### 5.4 Architectural Generality

The Mini ViT experiment tests whether interface drift is specific to convolutional architectures. Task A accuracy drops from 0.590 to 0.315 after Task B training. Stitching at block 0 with a transport key recovers to 0.542 (83% recovery). Notably, no-key stitching already achieves 0.541, suggesting that early transformer representations are highly stable under fine-tuning, with drift concentrated in late blocks. The stage gradient replicates: recovery degrades from 83% at $b0$ to 68% at $b1$, 42% at $b2$, and fails at $b3$.

## 6 Discussion

Our results show that forgetting in continual\-learning settings is not necessarily permanent, and that substantial recovery can be achieved through a compact transport key\. Across all settings, our stitched evaluation protocol showed that task\-specific computations persist as latent features that are no longer directly accessible, rather than being permanently erased\. This was most evident in the cross\-domain experiment, where stitched evaluation recovered a large portion of the lost accuracy\.

Second, forgetting appears to be localized, and is primarily the result of representational drift between interfaces rather than erasure. In all experiments, transport keys applied at the early stages of the network were able to recover most of the pre-update performance, while keys at later stages were less effective. This suggests that early layers are more general, and often retain latent features associated with individual prior tasks.

Third, interface drift appears to have internal structure\. The same\-domain setting can be corrected with a highly compact key, while the domain\-shift setting benefits from a more expressive cross\-channel variant\. The control experiments indicate that recovery depends on the relationship between corresponding examples and internal interfaces, not merely on aggregate activation statistics\.

These findings suggest that forgetting may be an access problem rather than a representational one. Future systems in CL settings could store individual keys for each task, rather than freezing their weights at a particular checkpoint or maintaining replay buffers. In our main experiment, a compact key was able to recover most lost accuracy in a ResNet-style network while remaining small relative to the base model. This suggests that there need not be a tradeoff between stability and plasticity; indeed, deep networks of any size may be able to achieve a high degree of plasticity if they maintain an index that allows them to access knowledge of prior tasks.

#### Limitations and Future Work

We note several limitations of this work. For simplicity, our experiments use at most three tasks in a single sequence. Longer sequences may exhibit drift that our lightweight transport keys cannot correct. The stitched evaluation requires access to pre-update components, which makes it primarily a diagnostic rather than a deployment recipe. Future work will investigate in-place recovery and online variants that preserve the same access-based principle without requiring a full predecessor network at inference time.

## Ethical Statement

A relevant ethical consideration is that improved recovery of prior capabilities could be used to preserve or restore capabilities that a deployer intended to remove\. We therefore treat transport\-key systems as controlled research infrastructure\. Broader release should be paired with safety evaluations, access controls, and clear restrictions on applications that restore harmful or intentionally deprecated capabilities\.

## References

- Ainsworthet al\.\[2023\]Samuel K Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa\.Git re\-basin: Merging models modulo permutation symmetries\.InInternational Conference on Learning Representations \(ICLR\), 2023\.
- Aljundiet al\.\[2018\]Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars\.Memory aware synapses: Learning what \(not\) to forget\.InProceedings of the European Conference on Computer Vision \(ECCV\), pages 139–154, 2018\.
- Bansalet al\.\[2021\]Yamini Bansal, Gal Kaplun, Jing Yang, Preetum Raghavan, Shreya Medipally, Doina Precup, and Irina Rish\.Revisiting model stitching to compare neural representations\.InAdvances in Neural Information Processing Systems \(NeurIPS\), volume 34, pages 225–236, 2021\.
- Chaudhryet al\.\[2019\]Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny\.Efficient lifelong learning with a\-gem\.InInternational Conference on Learning Representations \(ICLR\), 2019\.
- Entezariet al\.\[2022\]Rahim Entezari, Hanie Sedghi, Olga Saukh, and Neyshabur Behnam\.The role of permutation invariance in linear mode connectivity of neural networks\.InInternational Conference on Learning Representations \(ICLR\), 2022\.
- French \[1999\]Robert M\. French\.Catastrophic forgetting in connectionist networks\.Trends in Cognitive Sciences, 3\(4\):128–135, 1999\.
- Houlsbyet al\.\[2019\]Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Brunslo, Anna Degtyareva, Gal Kaplyn, et al\.Parameter\-efficient transfer learning for nlp\.Proceedings of the 36th International Conference on Machine Learning \(ICML\), 2019\.
- Kirkpatricket al\.\[2017\]James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Pashapi Milan, John Quan, Tiago Ramalho, Agnieszka Grabska\-Barwinska, et al\.Overcoming catastrophic forgetting in neural networks\.InProceedings of the National Academy of Sciences \(PNAS\), volume 114, pages 3521–3526, 2017\.
- Lenc and Vedaldi \[2015\]Karel Lenc and Andrea Vedaldi\.Understanding image representations by measuring their equivariance and equivalence\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\), pages 991–999, 2015\.
- Lopez\-Paz and Ranzato \[2017\]David Lopez\-Paz and Marc’Aurelio Ranzato\.Gradient episodic memory for continual learning\.InAdvances in Neural Information Processing Systems \(NeurIPS\), volume 30, 2017\.
- Mallya and Lazebnik \[2018\]Arun Mallya and Svetlana Lazebnik\.Packnet: Adding multiple tasks to a single network by iterative pruning\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\), pages 7765–7773, 2018\.
- McCloskey and Cohen \[1989\]Michael McCloskey and Neal J\. Cohen\.Catastrophic interference in connectionist networks: The sequential learning problem\.volume 24 ofPsychology of Learning and Motivation, pages 109–165\. Academic Press, 1989\.
- Rebuffiet al\.\[2017\]Sylvestre\-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert\.icarl: Incremental classifier and representation learning\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\), pages 2001–2010, 2017\.
- Rolnick et al. [2019] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Greg Wayne. Experience replay for continual learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.
- Rusu et al. [2016] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
- Wang et al. [2024] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application, 2024.
- Zenke et al. [2017] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3987–3995, 2017.

## Appendix A Anchor Efficiency on Split CIFAR-100

Figure [4](https://arxiv.org/html/2606.02860#A1.F4) reports the sensitivity of Task A recovery to the amount of Task A calibration data used to estimate the transport key. The important qualitative result is that recovery saturates rapidly: beyond a small balanced set of Task A examples, adding substantially more examples produces little additional improvement. This supports the claim that the dominant failure mode is a structured interface shift rather than a high-capacity relearning problem.

The graph should be read as follows. The horizontal axis varies the amount of calibration evidence available for key estimation. The vertical axis reports recovered Task A accuracy under the stitched evaluation. A steep early rise followed by a plateau means that the post-update network already retains much of the Task A computation; the key only needs enough evidence to identify the interface mismatch. If the method were simply retraining Task A, we would expect a stronger dependence on additional examples.
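To make the saturation argument concrete, the following is a minimal sketch on synthetic data, not the paper's implementation. The paper describes transport keys only at a systems level, so the sketch assumes that a key is a ridge-regularized linear map fitted from paired anchor activations, and it models interface drift as a random rotation plus noise. The names `fit_linear_key`, `relative_error`, and `anchor_sweep`, along with all dimensions and noise levels, are hypothetical.

```python
import numpy as np


def fit_linear_key(h_new, h_old, ridge=1e-3):
    """Ridge least-squares map W such that h_new @ W approximates h_old.

    Assumption: the key is linear. The paper does not specify its form.
    """
    dim = h_new.shape[1]
    gram = h_new.T @ h_new + ridge * np.eye(dim)
    return np.linalg.solve(gram, h_new.T @ h_old)


def relative_error(h_new, h_old, key):
    """Relative Frobenius error of the aligned activations."""
    return float(np.linalg.norm(h_new @ key - h_old) / np.linalg.norm(h_old))


def anchor_sweep(anchor_counts, dim=32, n_test=2000, noise=0.05, seed=0):
    """Held-out alignment error as a function of the number of anchors."""
    rng = np.random.default_rng(seed)
    # Synthetic stand-in for interface drift: a fixed random rotation.
    rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

    def sample(n):
        h_old = rng.normal(size=(n, dim))  # pre-update activations
        h_new = h_old @ rotation + noise * rng.normal(size=(n, dim))
        return h_new, h_old

    test_new, test_old = sample(n_test)
    errors = {}
    for n in anchor_counts:
        anchors_new, anchors_old = sample(n)
        key = fit_linear_key(anchors_new, anchors_old)
        errors[n] = relative_error(test_new, test_old, key)
    return errors
```

Under these assumptions, `anchor_sweep([8, 64, 512, 4096])` should show the held-out alignment error dropping sharply once the anchor count exceeds the feature dimension and then flattening, which mirrors the qualitative shape of a plateau in recovered accuracy. Alignment error is used here only as a proxy; the paper measures stitched Task A accuracy.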

![Refer to caption](https://arxiv.org/html/2606.02860v1/figures/fig2_splitcifar100_anchor_efficiency.png)

Figure 4: Anchor-efficiency analysis for Split CIFAR-100. Recovery reaches its effective plateau with a small balanced calibration set, suggesting that the relevant drift is low-dimensional or otherwise structured. The plateau indicates that the key is identifying a structured interface mismatch rather than relearning the task.

## Appendix B Control Experiments Under Domain Shift

Figures [5](https://arxiv.org/html/2606.02860#A2.F5) and [6](https://arxiv.org/html/2606.02860#A2.F6) expand the CIFAR-10 → SVHN control analysis from Table [4](https://arxiv.org/html/2606.02860#S5.T4). This setting is intentionally more difficult than the same-domain CIFAR split: after learning SVHN, the model’s preserved CIFAR-10 head receives internal representations that have shifted under a different visual distribution. The controls ask whether recovery comes from meaningful interface alignment or from a generic improvement produced by inserting an additional module.

The stage-1 graph in Figure [5](https://arxiv.org/html/2606.02860#A2.F5) compares keyed recovery against structure-disrupting controls. The transport key recovers a large fraction of the lost CIFAR-10 accuracy, while disrupting the relationship between the key and the intended interface sharply reduces performance. This pattern is important: it shows that the improvement depends on preserving the correspondence between the post-update representation and the pre-update decoder.
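The logic of a correspondence-breaking control can be illustrated with a small synthetic sketch. It is not the paper's protocol: the specific control used here (shuffling which pre-update activation is paired with which post-update activation before fitting) is an assumed example of a structure-disrupting control, the key is assumed to be a ridge-regularized linear map, and drift is modeled as a random rotation plus noise. The names `fit_linear_key` and `control_comparison` are hypothetical.

```python
import numpy as np


def fit_linear_key(h_new, h_old, ridge=1e-3):
    """Ridge least-squares map W such that h_new @ W approximates h_old."""
    dim = h_new.shape[1]
    gram = h_new.T @ h_new + ridge * np.eye(dim)
    return np.linalg.solve(gram, h_new.T @ h_old)


def control_comparison(n_anchors=256, dim=32, n_test=2000, noise=0.05, seed=0):
    """Held-out alignment error for no key, a valid key, and a shuffled control."""
    rng = np.random.default_rng(seed)
    # Synthetic stand-in for interface drift: a fixed random rotation.
    rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

    def sample(n):
        h_old = rng.normal(size=(n, dim))  # pre-update activations
        h_new = h_old @ rotation + noise * rng.normal(size=(n, dim))
        return h_new, h_old

    anchors_new, anchors_old = sample(n_anchors)
    test_new, test_old = sample(n_test)

    def test_error(key):
        residual = test_new @ key - test_old
        return float(np.linalg.norm(residual) / np.linalg.norm(test_old))

    valid_key = fit_linear_key(anchors_new, anchors_old)
    # Control: same anchors, but the old/new pairing is destroyed.
    shuffled_old = anchors_old[rng.permutation(n_anchors)]
    shuffled_key = fit_linear_key(anchors_new, shuffled_old)

    return {
        "no_key": test_error(np.eye(dim)),  # stitch without any alignment
        "valid_key": test_error(valid_key),
        "shuffled_pairs": test_error(shuffled_key),
    }
```

In this toy setting the shuffled control has the same capacity and sees the same activations as the valid key, so any gap between the two can only come from the pairing itself. That is the property a correspondence-breaking control is meant to isolate.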

![Refer to caption](https://arxiv.org/html/2606.02860v1/figures/fig3a_cifar10_to_svhn_controls_s1.png)

Figure 5: Control analysis at stage 1 for CIFAR-10 → SVHN. The transport key substantially improves stitched Task A accuracy, while controls that disrupt the relevant alignment structure reduce the gain.

The stage-2 graph in Figure [6](https://arxiv.org/html/2606.02860#A2.F6) shows a sharper version of the same phenomenon. At this later interface, simple compatibility is weaker, so the difference between valid keyed alignment and disrupted controls becomes more diagnostic. The result supports the interpretation that recovery is not an artifact of the classifier head or a generic smoothing effect. It depends on the internal geometry of the source and target interfaces.

![Refer to caption](https://arxiv.org/html/2606.02860v1/figures/fig3b_cifar10_to_svhn_controls_s2.png)

Figure 6: Control analysis at stage 2 for CIFAR-10 → SVHN. The valid keyed condition remains substantially above the post-update baseline, while correspondence-breaking controls collapse toward the forgotten model.

## Appendix C How to Interpret the Stage-Wise Results

Across the main and supplementary figures, the same stage-wise pattern recurs: early interfaces exhibit the strongest recovery, middle interfaces show partial recovery, and late interfaces often fail. This is consistent with the access-based view of forgetting. Early and mid-level computation remains partially reusable after sequential training, but the coordinate system or interface expected by the original downstream model has shifted. Later representations are more task-specific and more strongly rewritten by subsequent training, leaving less recoverable structure for stitching.

This interpretation also explains why no-key stitching can sometimes perform well, especially in the Mini ViT experiment and the first stage of the domain-shift experiment. In those cases, the post-update early representation remains partly compatible with the pre-update decoder even without a key. The transport key is most informative when no-key stitching fails but keyed stitching succeeds, because that pattern isolates interface drift from direct feature preservation.
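The reading rule described above can be written down as a small decision function. This is an illustrative sketch rather than a procedure from the paper: the `StageResult` container, the `interpret_stage` function, the label strings, and the `recovered_fraction` threshold of 0.8 are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    stage: str          # name of the stitching interface, e.g. "stage1"
    no_key_acc: float   # stitched Task A accuracy without a key
    keyed_acc: float    # stitched Task A accuracy with a transport key


def interpret_stage(result, reference_acc, recovered_fraction=0.8):
    """Classify one interface by comparing no-key and keyed stitching.

    reference_acc is the original Task A accuracy before the update.
    The 0.8 cutoff is an arbitrary illustrative threshold.
    """
    threshold = recovered_fraction * reference_acc
    if result.no_key_acc >= threshold:
        # Compatible even without alignment: features directly preserved.
        return "direct feature preservation"
    if result.keyed_acc >= threshold:
        # Fails without a key, succeeds with one: the diagnostic case.
        return "interface drift"
    # Neither condition recovers: little reusable structure at this stage.
    return "little recoverable structure"
```

Only the middle branch isolates interface drift. When the first branch fires, a key may still help, but the result cannot distinguish realignment from features that were never displaced.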

## Appendix D Methodological Takeaway

Taken together, the supplementary graphs support the central mechanism proposed in the paper. Small anchor sets are sufficient because the model is not relearning Task A from scratch; controls degrade performance because the key depends on real correspondence between old and new internal representations; and stage-wise degradation shows that recoverable latent structure is concentrated earlier in the network. This combination of anchor efficiency, control sensitivity, and stage localization is the empirical signature of transport-key recovery.
