Reliability-Gated Source Anchoring for Continual Test-Time Adaptation
Summary
This paper proposes RMemSafe, a reliability-gated extension for continual test-time adaptation that attenuates source anchoring when the frozen source's predictive entropy becomes high, preventing blind anchoring under source collapse. The method achieves state-of-the-art error reduction on the CCC benchmark.
# Reliability-Gated Source Anchoring for Continual Test-Time Adaptation
Source: [https://arxiv.org/html/2605.14063](https://arxiv.org/html/2605.14063)
Vikash Singh¹, Debargha Ganguly¹, Weicong Chen¹, Sabyasachi Sahoo²,³, Sreehari Sankar¹, Biyao Zhang¹, Mohsen Hariri¹, Shouren Wang¹, Osama Zafar¹, Christian Gagné²,³, Vipin Chaudhary¹
¹Case Western Reserve University, ²Université Laval, ³Mila - Québec AI Institute
vikash@case.edu
###### Abstract
Continual test-time adaptation (CTTA) updates a pretrained model online on an unlabeled, non-stationary stream while anchoring it to a frozen source checkpoint. This anchor is useful only when the source remains reliable. On CCC-Hard, however, a ResNet-50 source falls to approximately 1.3% top-1 accuracy, while existing source-anchored CTTA methods continue applying the same anchor strength. We call this failure mode *blind anchoring* and propose RMemSafe, a reliability-gated extension of ROID that uses the frozen source’s normalized predictive entropy to attenuate all explicit source-coupled uses in the objective. When the source posterior approaches uniformity, the gate closes: the source anchor and agreement filter vanish, and the objective reduces to a source-agnostic fallback comprising ROID’s base losses plus marginal calibration. Combined with ASR, RMemSafe achieves the lowest error on 8 of 9 matched-split continual-corruption cells and is the best reset-based method on all 9, improving ROID+ASR by 1.05 pp on ResNet-50 and 0.48 pp on ViT-B/16. A controlled source-degradation sweep shows a $1.13\times$ shallower harm slope than ROID+ASR, consistent with the graceful-decay prediction. The entropy gate detects high-entropy source collapse, not confidently wrong low-entropy sources; this scope is explicitly evaluated and discussed.
## 1 Introduction
A model deployed against a shifting, unlabeled data stream faces a basic safety question: can the adaptation procedure be prevented from driving the model into states worse than those it started in? Continual test-time adaptation (CTTA) addresses this by updating a pre-trained model online while anchoring it to its frozen source \[[32](https://arxiv.org/html/2605.14063#bib.bib32),[19](https://arxiv.org/html/2605.14063#bib.bib19),[20](https://arxiv.org/html/2605.14063#bib.bib20),[16](https://arxiv.org/html/2605.14063#bib.bib16),[14](https://arxiv.org/html/2605.14063#bib.bib14),[21](https://arxiv.org/html/2605.14063#bib.bib21)\], the anchor preventing runaway drift under noisy pseudo-labels. The anchor carries a silent, runtime-unchecked precondition: that the source remains a meaningful reference throughout adaptation. On the hardest level of the CCC benchmark \[[22](https://arxiv.org/html/2605.14063#bib.bib22),[14](https://arxiv.org/html/2605.14063#bib.bib14)\], the ResNet-50 source’s top-1 accuracy is $\sim 1\%$; yet every prior CTTA method we evaluate continues to pull the adapting model toward this degraded reference with an unconditional $\ell_2$ penalty of fixed strength. This blind anchoring is a systematic failure mode of current CTTA methods, and fixed-strength trust in the source can be replaced with a lightweight runtime check that recovers an analytical graceful-decay property.
Figure 1: Continual test-time adaptation under collapsing source reliability. (Plot omitted. Horizontal axis: Clean, CCC-Easy, CCC-Med, CCC-Hard. Left axis: source top-1 acc. (%). Right axis: effective anchor strength $\lambda_{\mathrm{eff}}$. Series: source accuracy; prior methods (fixed $\lambda=2$); RMemSafe ($\lambda\cdot\mathcal{R}_{\mathrm{src}}$), measured over 3,128 batches. Annotations: catastrophic anchor regime, $3.8\times$ gap.) CTTA anchors the adapter to its frozen source, presuming the source stays a meaningful reference. Blue: on CCC, frozen RN-50 source top-1 collapses from 76% (clean ImageNet) to 1.3% (CCC-Hard). Red dashed: prior reset-based methods (ROID/ETA/EATA+ASR, ROID+RDumb) hold $\lambda=2$ across all severities, pulling the adapter toward near-noise output. Green: RMemSafe multiplies $\lambda$ by the runtime reliability $\mathcal{R}_{\mathrm{src}}=\max(0,1-\mathcal{H}_{\mathrm{src}})$ from the source’s own posterior entropy. The CCC-Hard $\lambda_{\mathrm{eff}}=0.53$ is measured (3,128-batch trace, App. Fig. [6](https://arxiv.org/html/2605.14063#A11.F6)); the others follow the monotone entropy-accuracy relation. In the catastrophic regime (shaded), RMemSafe attenuates the anchor $3.8\times$; this attenuation is the analytical safety property demonstrated under controlled degradation (§[4](https://arxiv.org/html/2605.14063#S4), App. [M](https://arxiv.org/html/2605.14063#A13)). Matched-split gains on standard CCC reflect the integrated objective; the gate’s safety role is isolated on the controlled source-degradation axis (App. [I](https://arxiv.org/html/2605.14063#A9)).

RMemSafe instantiates this check.
It gates all explicit source-coupled uses in the ROID \[[16](https://arxiv.org/html/2605.14063#bib.bib16)\] optimization: the $\ell_2$ anchor toward $\theta^*_{\mathrm{src}}$, the cosine-based source-expert agreement filter, and the source-divergence factor inside the anchor. All three are multiplicatively modulated by a single source-reliability scalar $\mathcal{R}_{\mathrm{src}}=\max(0,1-\mathcal{H}_{\mathrm{src}})$ derived from source entropy. Because gating alone is not sufficient (the remaining loss must not itself collapse when the gate closes), the gate is combined with three auxiliary stabilizers adapted from standard practice: marginal calibration, a confidence-scaled learning rate, and a decoupled confidence-interpolated flip at inference. The resulting method admits an analytical graceful-decay property (Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1)): when source entropy saturates, the source-dependent terms vanish, and the total objective reduces to the base CTTA loss plus the marginal-calibration term, independent of the frozen source parameters. On CCC-Hard, the gate attenuates the runtime anchor by a factor of $3.8\times$ relative to prior methods with a fixed $\lambda=2$. This attenuation is the method’s analytical safety property; its empirical signature is a shallower harm slope under controlled source degradation, while the matched-split benchmark improvements reflect the integrated objective operating jointly.
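As a worked check of the attenuation factor, the arithmetic below combines the fixed $\lambda=2$ with the measured CCC-Hard value $\lambda_{\mathrm{eff}}=0.53$ reported in Figure 1. It assumes that the effective strength is simply $\lambda$ times a trace-averaged reliability $\bar{\mathcal{R}}_{\mathrm{src}}$ (the bar notation and the averaging are our simplification, and the inner bracket of the anchor is ignored):

```latex
\lambda_{\mathrm{eff}} = \lambda\,\bar{\mathcal{R}}_{\mathrm{src}}
\;\Rightarrow\;
\bar{\mathcal{R}}_{\mathrm{src}} = \frac{0.53}{2} = 0.265,
\qquad
\frac{\lambda}{\lambda_{\mathrm{eff}}} = \frac{2}{0.53} \approx 3.77 \approx 3.8 .
```

Under that reading, the gate is on average about a quarter open on CCC-Hard rather than fully closed.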
Two evaluation axes confirm the property. On matched-split evaluations across nine continual-corruption benchmark cells (two architectures, three CCC difficulty levels, two CIN-C orderings, and IN-C 20-revisit), RMemSafe achieves the lowest error on 8 of 9 cells, reducing mean CCC error over ROID+ASR, the strongest matched-split prior baseline, by 1.05 pp on ResNet-50 and 0.48 pp on ViT-B/16 (paired $t$, pooled $p<10^{-16}$, 95% CI $[-1.16,-0.94]$, Cohen’s $d_z=-3.65$; App. [E](https://arxiv.org/html/2605.14063#A5), Table [7](https://arxiv.org/html/2605.14063#A5.T7)). In a controlled source-degradation experiment that varies clean-test source accuracy from $S=0.75$ down to $S=0.12$ via Gaussian weight noise, RMemSafe degrades more gracefully than ROID+ASR across every tested severity, with a harm slope of 11.43 pp per unit of $S$ versus 12.92 pp for ROID+ASR, a $1.13\times$ ratio in the direction predicted by Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1).
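The slope ratio can be reproduced directly from the two reported slopes. The second line is only an illustration: it assumes the harm is exactly linear in $S$ over the swept range, which the text does not claim.

```latex
\frac{12.92}{11.43} \approx 1.130,
\qquad
(12.92 - 11.43)\times(0.75 - 0.12) = 1.49 \times 0.63 \approx 0.94\ \text{pp}.
```

Under the linearity assumption, the slope difference corresponds to roughly 0.94 pp less accumulated harm for RMemSafe across the full sweep.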
An explicit scope is stated upfront. The reliability gate uses source *entropy* as its sole reliability signal. This handles the dominant CTTA failure mode observed in practice, high-entropy source collapse as exemplified by CCC-Hard, but not *confidently miscalibrated* sources: a source that is wrong without being uncertain would be deemed reliable by $\mathcal{R}_{\mathrm{src}}$. Designing a correctness-aware reliability signal is a direction for future work (§[5](https://arxiv.org/html/2605.14063#S5)). We flag this scope here so the method’s claims are understood narrowly. We confirm this empirically: under a permuted-class low-entropy wrong source RMemSafe loses its advantage as predicted, while in the high-entropy regime it preserves it (Table [16](https://arxiv.org/html/2605.14063#A16.T16)). RMemSafe is graceful-decay *with respect to entropy-detectable source failure*.
This paper advances graceful\-decay CTTA through the following key contributions:
- Diagnosis of blind anchoring. On CCC-Hard, prior methods anchor at full strength to a source with $\sim 1\%$ top-1 accuracy: a systematic CTTA failure mode we name and characterize (Figure [1](https://arxiv.org/html/2605.14063#S1.F1)).
- Reliability-gated CTTA with a graceful-decay guarantee. To our knowledge, RMemSafe is the first CTTA method to gate all explicit source-coupled uses by a runtime reliability signal derived from the frozen source and to admit an analytical graceful-decay property (Prop. [1](https://arxiv.org/html/2605.14063#Thmproposition1)); the integrated objective keeps the fallback regime well-defined.
- Empirical confirmation at real and controlled degradation. 8/9 wins on matched-split benchmarks (per-cell paired $t$, CIs, $d_z$ in App. [E](https://arxiv.org/html/2605.14063#A5)); shallower harm slope than ROID+ASR under a controlled source-quality sweep, within an explicitly stress-tested scope (App. [P](https://arxiv.org/html/2605.14063#A16)).
- Diagnosis of a reset-paradigm failure mode. On CCC-Hard with ViT-B/16, every reset-based method underperforms non-reset ROID, characterized at matched-split granularity for the first time (§[4.3](https://arxiv.org/html/2605.14063#S4.SS3)); a reliability-gated $\tau$-sweep partially closes the gap (App. [O](https://arxiv.org/html/2605.14063#A15)).
## 2 Related Work
Test-Time Adaptation. TTA \[[28](https://arxiv.org/html/2605.14063#bib.bib28)\] adapts pre-trained models from unlabeled streams via test-time batch-norm recalibration \[[24](https://arxiv.org/html/2605.14063#bib.bib24)\], entropy minimization \[[30](https://arxiv.org/html/2605.14063#bib.bib30)\], Fisher-regularized updates \[[19](https://arxiv.org/html/2605.14063#bib.bib19)\], self-supervised auxiliary updates \[[15](https://arxiv.org/html/2605.14063#bib.bib15)\], prototype-based classifier adjustment \[[10](https://arxiv.org/html/2605.14063#bib.bib10)\] and per-instance test-time augmentation \[[36](https://arxiv.org/html/2605.14063#bib.bib36)\]; non-saturating surrogates, such as the soft likelihood ratio \[[18](https://arxiv.org/html/2605.14063#bib.bib18)\] inherited via ROID, replace entropy where it saturates. DeYO \[[11](https://arxiv.org/html/2605.14063#bib.bib11)\] shows entropy-based selection fails under disentangled spurious factors, and COME \[[37](https://arxiv.org/html/2605.14063#bib.bib37)\] addresses entropy’s overconfidence pathology via Dirichlet-prior conservative entropy. We use entropy as a frozen-source reliability proxy rather than as a self-training target for the adapting expert; this changes the failure mode, but does not eliminate low-entropy miscalibration, which we evaluate in §[5](https://arxiv.org/html/2605.14063#S5) and Appendix [P](https://arxiv.org/html/2605.14063#A16). Standalone TTA still suffers catastrophic forgetting and error accumulation under continuous, non-stationary shifts \[[13](https://arxiv.org/html/2605.14063#bib.bib13)\].
Continual Test-Time Adaptation. CTTA stabilizes continuous adaptation through stochastic restoration \[[32](https://arxiv.org/html/2605.14063#bib.bib32)\], sample filtering \[[20](https://arxiv.org/html/2605.14063#bib.bib20)\], instance-aware normalization under temporal correlation \[[7](https://arxiv.org/html/2605.14063#bib.bib7),[35](https://arxiv.org/html/2605.14063#bib.bib35)\], symmetric-cross-entropy mean-teacher training \[[2](https://arxiv.org/html/2605.14063#bib.bib2)\] (whose SymCE term we inherit in our source-agnostic fallback), momentum ensembling \[[16](https://arxiv.org/html/2605.14063#bib.bib16)\], Kalman filtering \[[12](https://arxiv.org/html/2605.14063#bib.bib12)\], memory-efficient self-distilled regularization \[[27](https://arxiv.org/html/2605.14063#bib.bib27)\], adaptive collapse-sensing \[[9](https://arxiv.org/html/2605.14063#bib.bib9)\], and robust pseudo-labeling self-training \[[23](https://arxiv.org/html/2605.14063#bib.bib23),[29](https://arxiv.org/html/2605.14063#bib.bib29)\]. A critical limitation along this line is the implicit assumption that the static source remains a reliable anchor, which fails when severe corruption catastrophically confuses the source. RMemSafe addresses this “blind anchoring” flaw by dynamically gating source agreement and anchor strength based on localized source reliability, with an analytically derived graceful-decay property under source collapse.
Long-Term Adaptation and Resets. Long-term severe shifts (e.g., CCC-Hard \[[22](https://arxiv.org/html/2605.14063#bib.bib22),[17](https://arxiv.org/html/2605.14063#bib.bib17)\]) cause CTTA models to collapse, motivating the reset paradigm: periodic parameter restoration (RDumb \[[22](https://arxiv.org/html/2605.14063#bib.bib22)\]) or adaptive resets (ASR \[[14](https://arxiv.org/html/2605.14063#bib.bib14)\]). Resets arrest collapse but may erase acquired target knowledge, and a reset toward a catastrophically confused source can itself be harmful (§[4.3](https://arxiv.org/html/2605.14063#S4.SS3)). RMemSafe is complementary: it attenuates unreliable source-guided updates before they accumulate, improving adaptation-phase stability, and combines effectively with ASR. The CCC and ImageNet-C benchmarks \[[22](https://arxiv.org/html/2605.14063#bib.bib22),[8](https://arxiv.org/html/2605.14063#bib.bib8)\] provide the corruption streams used in our evaluation.
## 3 Method: Reliability-Gated Test-Time Adaptation
When the source posterior is uniform, every source\-dependent term should vanish multiplicatively, thereby dropping the source from the gradient \(Prop\.[1](https://arxiv.org/html/2605.14063#Thmproposition1)\); we now construct the gate and stabilizers that keep this fallback well\-behaved\.
Figure 2: Overview of RMemSafe. The reliability engine derives the source-reliability gate $\mathcal{R}_{\mathrm{src}}=\max(0,1-\mathcal{H}_{\mathrm{src}})$ from the frozen source’s entropy. $\mathcal{R}_{\mathrm{src}}$ gates all explicit source-coupled uses: the Dynamic Anchor, the agreement filter (interpolating between source agreement and pass-through), and the source-divergence scaling inside the anchor, while leaving the base ROID losses and marginal calibration ungated. At inference, a confidence-interpolated flip average produces the final prediction. When the source collapses, $\mathcal{R}_{\mathrm{src}}\to 0$ and the total loss reduces to a non-collapsing source-agnostic fallback regime (Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1)).

Let $f_{\theta^*_{\mathrm{src}}}$ denote a pre-trained source model with frozen parameters $\theta^*_{\mathrm{src}}$ trained on a source distribution $\mathcal{D}_S$. During continual test-time adaptation, the model observes an unlabeled stream $x_t\in\mathcal{X}_T$ drawn from a shifted, non-stationary target distribution $\mathcal{D}_T$ and continuously updates expert parameters $\theta_{\mathrm{exp}}$ (initialized as $\theta_{\mathrm{exp}}\leftarrow\theta^*_{\mathrm{src}}$) to minimize empirical risk on the stream. We build on the ROID framework \[[16](https://arxiv.org/html/2605.14063#bib.bib16)\], which optimizes a soft-likelihood-ratio loss $\mathcal{L}_{\mathrm{slr}}$ and a consistency loss $\mathcal{L}_{\mathrm{cons}}$ on augmented views under diversity and certainty weighting, and additionally anchors the adapting parameters to the frozen source via a fixed-strength $\ell_2$ penalty. The design choice that motivates our method is that this anchor strength is *unconditional* on source quality: if the frozen source’s predictions collapse under distribution shift (on CCC-Hard, source top-1 accuracy is approximately 1%), the anchor continues to pull the adapting model toward a useless reference with full strength.
The Reliability Gate. For each batch $x_t$ we compute the source and expert predictive distributions $\mathbf{p}_{\mathrm{src}}=f_{\theta^*_{\mathrm{src}}}(x_t)$ and $\mathbf{p}_{\mathrm{exp}}=f_{\theta_{\mathrm{exp}}}(x_t)$, and the normalized source entropy

$$\mathcal{H}_{\mathrm{src}}=-\tfrac{1}{\log C}\sum_{c=1}^{C}p_{\mathrm{src}}^{(c)}\log p_{\mathrm{src}}^{(c)}\;\in\;[0,1], \tag{1}$$

where $C$ is the number of classes. The *source reliability gate* is $\mathcal{R}_{\mathrm{src}}=\max\bigl(0,\;1-\mathcal{H}_{\mathrm{src}}\bigr)$. $\mathcal{R}_{\mathrm{src}}$ takes value 1 when the source is maximally confident and 0 when its posterior is uniform. RMemSafe uses $\mathcal{R}_{\mathrm{src}}$ as a single scalar signal to modulate every source-dependent optimization term.
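A minimal sketch of Eq. (1) and the gate for a single predictive distribution, in plain Python. The function names are ours, and how per-sample values are aggregated into one per-batch scalar (for example by averaging) is not specified in this section, so the sketch stays at the level of one distribution.

```python
import math

def normalized_entropy(p):
    """Shannon entropy of p divided by log C, so the result lies in [0, 1].

    Zero-probability classes contribute nothing (0 * log 0 is taken as 0).
    """
    num_classes = len(p)
    h = -sum(q * math.log(q) for q in p if q > 0.0)
    return h / math.log(num_classes)

def source_reliability(p_src):
    """R_src = max(0, 1 - H_src); the max only guards against rounding."""
    return max(0.0, 1.0 - normalized_entropy(p_src))

uniform = [0.25, 0.25, 0.25, 0.25]   # collapsed source: gate closes
one_hot = [1.0, 0.0, 0.0, 0.0]       # maximally confident source: gate open
print(source_reliability(uniform), source_reliability(one_hot))
```

Note that the one-hot case is "reliable" under this signal whether or not the confident class is correct, which is the scope limitation stated in the introduction.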
Source-coupled uses gated by reliability. $\mathcal{R}_{\mathrm{src}}$ is applied to all explicit places where ROID’s optimization uses source information: the anchor, the agreement filter, and the source-divergence factor inside the anchor.

(i) Divergence-aware dynamic anchor. Replacing the fixed $\ell_2$ penalty of ROID, we define

$$\mathcal{L}_{\mathrm{anch}}\;=\;\lambda\cdot\mathcal{R}_{\mathrm{src}}\bigl(1+\alpha\,\mathcal{H}_{\mathrm{exp}}+\beta\,D_{\mathrm{JS}}(\mathbf{p}_{\mathrm{src}}\,\|\,\mathbf{p}_{\mathrm{exp}})\bigr)\,\|\theta_{\mathrm{exp}}-\theta^*_{\mathrm{src}}\|_2^2, \tag{2}$$

where $\mathcal{H}_{\mathrm{exp}}$ is the expert’s normalized entropy, $D_{\mathrm{JS}}$ is Jensen–Shannon divergence, and $\lambda,\alpha,\beta$ are fixed scalars. The outer $\mathcal{R}_{\mathrm{src}}$ prefactor makes the anchor pull toward $\theta^*_{\mathrm{src}}$ only when the source is reliable; the inner bracket makes the pull stronger when the expert is uncertain or diverges from the source.

(ii) Source-expert agreement gating. The ROID base losses are gated by a cosine-agreement weight $w_{\mathrm{raw}}=w_{\min}+(1-w_{\min})\max(0,\cos(\mathbf{p}_{\mathrm{src}},\mathbf{p}_{\mathrm{exp}}))$ (footnote 1: the $\max(\cdot,0)$ is redundant since $p_{\mathrm{src}},p_{\mathrm{exp}}$ are softmax outputs and so $\cos\geq 0$; we keep it as a numerical safeguard), and the gate itself is gated by $\mathcal{R}_{\mathrm{src}}$:

$$w_{\mathrm{cos}}=\mathcal{R}_{\mathrm{src}}\,w_{\mathrm{raw}}+(1-\mathcal{R}_{\mathrm{src}})\cdot 1, \tag{3}$$

which smoothly interpolates between source-expert agreement filtering (when $\mathcal{R}_{\mathrm{src}}\to 1$) and unfiltered adaptation (when $\mathcal{R}_{\mathrm{src}}\to 0$). Under an unreliable source, the agreement signal itself is untrustworthy, so the filter is disabled.

(iii) Divergence-scaled anchor strength. The $\beta D_{\mathrm{JS}}$ term in ([2](https://arxiv.org/html/2605.14063#S3.E2)) is itself source-dependent: a large JS divergence between source and expert carries no information when the source is uninformative. It is gated by the same $\mathcal{R}_{\mathrm{src}}$ prefactor via its location inside $\mathcal{L}_{\mathrm{anch}}$, so its contribution decays to zero with source reliability.
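The three gated uses can be sketched together in plain Python for a single sample. This is an illustration under stated assumptions, not the authors’ implementation: parameters are flat lists, the Jensen–Shannon divergence uses natural logarithms, and the default `w_min=0.1` is a hypothetical value (the section does not give one). The defaults for `lam`, `alpha`, and `beta` follow the values reported in the implementation details.

```python
import math

def normalized_entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0) / math.log(len(p))

def reliability(p_src):
    return max(0.0, 1.0 - normalized_entropy(p_src))

def kl(p, q):
    # Only called with q = mixture of p and another distribution, so q > 0 wherever p > 0.
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def js_divergence(p, q):
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

def anchor_loss(theta_exp, theta_src, p_src, p_exp, lam=2.0, alpha=2.0, beta=1.0):
    """Eq. (2): reliability-gated, divergence-aware L2 anchor."""
    bracket = 1.0 + alpha * normalized_entropy(p_exp) + beta * js_divergence(p_src, p_exp)
    sq_dist = sum((a - b) ** 2 for a, b in zip(theta_exp, theta_src))
    return lam * reliability(p_src) * bracket * sq_dist

def agreement_weight(p_src, p_exp, w_min=0.1):
    """Eq. (3): interpolate between cosine filtering and pass-through."""
    r = reliability(p_src)
    w_raw = w_min + (1.0 - w_min) * max(0.0, cosine(p_src, p_exp))
    return r * w_raw + (1.0 - r) * 1.0

uniform = [0.25] * 4
peaked = [1.0, 0.0, 0.0, 0.0]
print(anchor_loss([1.0, 2.0], [0.0, 0.0], uniform, peaked))  # gate closed: ~0
print(agreement_weight(uniform, peaked))                      # gate closed: ~1
```

With a confident source that disagrees with the expert, the same functions give a strong anchor and a weight near `w_min`, which is exactly the behavior that becomes harmful when the confident source is wrong.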
Auxiliary Stabilizers for the Fallback Regime. Gating the source-dependent terms is necessary but not sufficient: when $\mathcal{R}_{\mathrm{src}}\to 0$ the gated terms vanish, and the remaining loss must not itself collapse. We therefore include three auxiliary stabilizers, each adapted from standard practice in test-time adaptation. These are not claimed as contributions of this work; we describe them so that the fallback regime is fully specified.
Marginal calibration. To prevent class collapse without enforcing uniform batch marginals, we penalize the KL divergence between the current batch’s class marginal $\bar{\mathbf{p}}_{\mathrm{exp}}=\tfrac{1}{B}\sum_{b}\mathbf{p}_{\mathrm{exp},b}$ and an EMA of historical class priors $\mathbf{p}_{\mathrm{prior}}$: $\mathcal{L}_{\mathrm{marg}}=D_{\mathrm{KL}}(\bar{\mathbf{p}}_{\mathrm{exp}}\,\|\,\mathbf{p}_{\mathrm{prior}})$. This term is not source-dependent and is active regardless of $\mathcal{R}_{\mathrm{src}}$.
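A small sketch of the marginal-calibration term follows. The EMA momentum of 0.9, the `eps` guard, and the choice to update the prior after computing the loss are our assumptions; the section specifies only that the prior is an EMA of historical class marginals.

```python
import math

def batch_marginal(batch_probs):
    """Average the per-sample class distributions over the batch."""
    b = len(batch_probs)
    c = len(batch_probs[0])
    return [sum(p[k] for p in batch_probs) / b for k in range(c)]

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) with natural log; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def ema_update(prior, marginal, momentum=0.9):
    """Exponential moving average of class marginals (momentum is hypothetical)."""
    return [momentum * a + (1.0 - momentum) * b for a, b in zip(prior, marginal)]

prior = [1 / 3, 1 / 3, 1 / 3]
batch = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]
marginal = batch_marginal(batch)            # approximately [0.7, 0.2, 0.1]
loss_marg = kl_divergence(marginal, prior)  # positive: batch is skewed vs. prior
prior = ema_update(prior, marginal)         # prior drifts slowly toward the stream
print(loss_marg, prior)
```

Because the target is a running prior rather than the uniform distribution, a stream with genuinely imbalanced classes is not pushed toward uniform predictions.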
Confidence-scaled learning rate. To dampen destructive updates under high expert uncertainty, the effective learning rate is scaled as $\eta_{\mathrm{eff}}=\eta\cdot\bigl(\eta_{\min}+(1-\eta_{\min})(1-\mathcal{H}_{\mathrm{exp}})\bigr)$. This depends on expert uncertainty, not source reliability.
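The scaling rule is a linear interpolation between a floor and the base rate. A sketch, using the base learning rate $2.5\times10^{-4}$ and floor $\eta_{\min}=0.2$ reported in the implementation details, and assuming the expert entropy has already been reduced to one scalar per batch (the reduction is not specified here):

```python
def effective_lr(h_exp, eta=2.5e-4, eta_min=0.2):
    """eta_eff = eta * (eta_min + (1 - eta_min) * (1 - H_exp)).

    h_exp is the expert's normalized entropy in [0, 1].
    """
    if not 0.0 <= h_exp <= 1.0:
        raise ValueError("normalized entropy must lie in [0, 1]")
    return eta * (eta_min + (1.0 - eta_min) * (1.0 - h_exp))

for h in (0.0, 0.5, 1.0):
    print(h, effective_lr(h))
```

A fully confident expert trains at the base rate, while a maximally uncertain expert still trains at 20% of it, so adaptation never stalls completely.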
Decoupled flip interpolation at inference. For a test input $x_t$ and its horizontal flip $\tilde{x}_t$, the final prediction interpolates logits by a confidence-adaptive weight:

$$\mathbf{p}_{\mathrm{final}}=\mathrm{Softmax}\bigl((1-\gamma)\,z_{\mathrm{orig}}+\gamma\,z_{\mathrm{flip}}\bigr), \tag{4}$$

with $\gamma=\gamma_{\min}+(\gamma_{\max}-\gamma_{\min})\,\mathcal{H}_{\mathrm{exp}}(z_{\mathrm{orig}})$. Confident predictions are dominated by the original view; uncertain predictions benefit from the augmented view.
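Eq. (4) can be sketched for one sample as below. The defaults `gamma_min=0.0` and `gamma_max=0.5` are hypothetical placeholders, since this section does not report their values, and $\mathcal{H}_{\mathrm{exp}}(z_{\mathrm{orig}})$ is read as the normalized entropy of the softmax of the original-view logits.

```python
import math

def softmax(z):
    m = max(z)                        # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def normalized_entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0) / math.log(len(p))

def flip_interpolate(z_orig, z_flip, gamma_min=0.0, gamma_max=0.5):
    """Blend original and flipped logits; the blend weight grows with uncertainty."""
    h = normalized_entropy(softmax(z_orig))
    gamma = gamma_min + (gamma_max - gamma_min) * h
    mixed = [(1.0 - gamma) * a + gamma * b for a, b in zip(z_orig, z_flip)]
    return softmax(mixed), gamma

p_conf, g_conf = flip_interpolate([20.0, 0.0, 0.0], [0.0, 5.0, 0.0])
p_unc, g_unc = flip_interpolate([0.0, 0.0, 0.0], [0.0, 5.0, 0.0])
print(g_conf, g_unc)   # tiny weight for the confident case, gamma_max for the uncertain one
```

The step is "decoupled" in the sense that it changes only the reported prediction; nothing in this function feeds back into the training objective.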
Total Objective and Graceful Decay. The total training objective is

$$\mathcal{L}_{\mathrm{total}}=w_{\mathrm{cos}}\bigl(\mathcal{L}_{\mathrm{slr}}+\mathcal{L}_{\mathrm{cons}}\bigr)\;+\;\lambda_{\mathrm{marg}}\,\mathcal{L}_{\mathrm{marg}}\;+\;\mathcal{L}_{\mathrm{anch}}. \tag{5}$$

The design is structured so that the novel mechanism, reliability gating, has a well-defined limiting behavior.
###### Proposition 1 (Graceful decay under source collapse).
When the source posterior becomes uniform, $\mathcal{H}_{\mathrm{src}}\to 1$ and hence $\mathcal{R}_{\mathrm{src}}\to 0$. Under this limit: *(i)* $\mathcal{L}_{\mathrm{anch}}\to 0$, so the $\ell_2$ anchor toward $\theta^*_{\mathrm{src}}$ is completely removed; *(ii)* $w_{\mathrm{cos}}\to 1$, so the source-expert agreement filter reverts to an unfiltered pass-through; *(iii)* the total objective reduces to $\mathcal{L}_{\mathrm{slr}}+\mathcal{L}_{\mathrm{cons}}+\lambda_{\mathrm{marg}}\mathcal{L}_{\mathrm{marg}}$, which is independent of $\theta^*_{\mathrm{src}}$ and well-defined.
Proof. Substituting $\mathcal{R}_{\mathrm{src}}=0$ into Eq. ([2](https://arxiv.org/html/2605.14063#S3.E2)) zeroes the outer prefactor of the anchor loss, so $\mathcal{L}_{\mathrm{anch}}=0$ regardless of the value of $\lVert\theta_{\mathrm{exp}}-\theta^*_{\mathrm{src}}\rVert_2^2$ and regardless of the inner divergence and entropy terms; the $\ell_2$ pull toward $\theta^*_{\mathrm{src}}$ is therefore removed, not merely attenuated. Substituting $\mathcal{R}_{\mathrm{src}}=0$ into Eq. ([3](https://arxiv.org/html/2605.14063#S3.E3)) yields $w_{\mathrm{cos}}=0\cdot w_{\mathrm{raw}}+(1-0)\cdot 1=1$, so the source-expert agreement filter reverts to an unfiltered pass-through. Substituting both into Eq. ([5](https://arxiv.org/html/2605.14063#S3.E5)) gives $\mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{slr}}+\mathcal{L}_{\mathrm{cons}}+\lambda_{\mathrm{marg}}\mathcal{L}_{\mathrm{marg}}$, which depends only on the expert predictions $\mathbf{p}_{\mathrm{exp}}$ and the running marginal $\mathbf{p}_{\mathrm{prior}}$, and is therefore independent of $\theta^*_{\mathrm{src}}$.
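The limit in Proposition 1 can be checked numerically by treating the loss values as plain scalars. In the sketch below, `anchor_bracket` stands for the inner factor $1+\alpha\mathcal{H}_{\mathrm{exp}}+\beta D_{\mathrm{JS}}$ and `sq_dist` for $\lVert\theta_{\mathrm{exp}}-\theta^*_{\mathrm{src}}\rVert_2^2$; the function and argument names are ours, and the numeric inputs are arbitrary illustrative values.

```python
def total_objective(r_src, w_raw, l_slr, l_cons, l_marg,
                    anchor_bracket, sq_dist, lam=2.0, lam_marg=0.1):
    """Eq. (5) assembled from Eqs. (2) and (3), with scalar stand-ins for each loss."""
    w_cos = r_src * w_raw + (1.0 - r_src) * 1.0
    l_anch = lam * r_src * anchor_bracket * sq_dist
    return w_cos * (l_slr + l_cons) + lam_marg * l_marg + l_anch

# With the gate closed, the result ignores w_raw, the bracket, and the parameter distance.
a = total_objective(0.0, w_raw=0.3, l_slr=1.0, l_cons=0.5, l_marg=0.2,
                    anchor_bracket=4.0, sq_dist=1e6)
b = total_objective(0.0, w_raw=0.9, l_slr=1.0, l_cons=0.5, l_marg=0.2,
                    anchor_bracket=1.0, sq_dist=0.0)
fallback = 1.0 + 0.5 + 0.1 * 0.2
print(a, b, fallback)
```

Both calls return the fallback value $\mathcal{L}_{\mathrm{slr}}+\mathcal{L}_{\mathrm{cons}}+\lambda_{\mathrm{marg}}\mathcal{L}_{\mathrm{marg}}$, even with a parameter distance of $10^6$, which is the "removed, not merely attenuated" statement in the proof.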
Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1) states the load-bearing property of the method: in the limit of a catastrophically confused source, RMemSafe does not enforce a corrupted prior; instead, it reduces to a source-agnostic adaptation regime comprising the ROID base losses plus a marginal-calibration term. The confidence-scaled learning rate and decoupled flip remain active throughout the regime because they depend on expert rather than source statistics.
Scope of Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1). The proposition concerns only the regime in which source entropy is *high*. It does not claim the gate detects all forms of source failure; in particular, a source that is *confidently wrong* has low entropy and would be deemed reliable by $\mathcal{R}_{\mathrm{src}}$. We return to this scope boundary in §[5](https://arxiv.org/html/2605.14063#S5). The per-batch update combining equations \([3](https://arxiv.org/html/2605.14063#S3)\)–\([5](https://arxiv.org/html/2605.14063#S3.E5)\) with the ASR adaptive-selective reset controller \[[14](https://arxiv.org/html/2605.14063#bib.bib14)\] is given as Algorithm [1](https://arxiv.org/html/2605.14063#alg1) in Appendix [A](https://arxiv.org/html/2605.14063#A1). Hyperparameters are fixed across all nine benchmark cells; specific values are in Table [3](https://arxiv.org/html/2605.14063#A2.T3) of Appendix [B](https://arxiv.org/html/2605.14063#A2).
## 4 Experiments
### 4.1 Experimental Setup
Benchmarks. We evaluate on three continual-TTA benchmark families. CCC \[[22](https://arxiv.org/html/2605.14063#bib.bib22)\] is a non-stationary ImageNet-scale stream of continuously transitioning corruptions; we use the three difficulty levels of Lim et al. \[[14](https://arxiv.org/html/2605.14063#bib.bib14)\] (Easy, Medium, Hard; source accuracies $\sim 34\%$, 17%, 1%), each over 9 split seeds of 50,000 samples. CIN-C (CIFAR-10-C, 20 revisits) cycles 15 corruptions $20\times$; we report i.i.d. and correlated (Dirichlet $\alpha=0.1$) orderings over 10 seeds. IN-C is the analogous ImageNet-C 20-revisit protocol. These nine cells span two architectures (ResNet-50, ViT-B/16). The CCC and IN-C cells use ImageNet-pretrained ResNet-50 and ViT-B/16; the CIN-C cells follow the standard CIFAR-10-C protocol with a CIFAR-10-trained WRN-28-10 backbone (App. [L](https://arxiv.org/html/2605.14063#A12)). §[4.4](https://arxiv.org/html/2605.14063#S4.SS4) adds a controlled *source-degradation* experiment that varies clean-test source accuracy via Gaussian weight noise, probing Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1)’s graceful-decay regime directly.
Baselines. We compare against seven baselines, all re-run locally on identical splits for matched comparison: Source (frozen, no adaptation); ROID \[[16](https://arxiv.org/html/2605.14063#bib.bib16)\], the strongest soft-weighting CTTA method; ROID+RDumb \[[22](https://arxiv.org/html/2605.14063#bib.bib22)\], adding periodic hard reset every 1,000 updates; ETA+ASR, EATA+ASR \[[19](https://arxiv.org/html/2605.14063#bib.bib19),[14](https://arxiv.org/html/2605.14063#bib.bib14)\], combining entropy-filtered adaptation with Adaptive And Selective Reset; ROID+ASR \[[14](https://arxiv.org/html/2605.14063#bib.bib14)\], the strongest prior reset baseline in our matched reruns; and RMemSafe-noASR, our gate without reset to isolate the gating contribution. On our shards, ROID+ASR’s CCC-Hard error differs from the streamed-data number in Lim et al. \[[14](https://arxiv.org/html/2605.14063#bib.bib14)\] by $\sim 7$ pp; the offset is approximately constant across methods and preserves ranking (Appendix [Q](https://arxiv.org/html/2605.14063#A17)), so we evaluate all methods on the same shards.
Implementation. We use ResNet-50 and ViT-B/16 backbones with publicly released ImageNet weights. Following ROID, we adapt only the affine parameters of normalization layers using SGD (learning rate $2.5\times10^{-4}$, momentum 0.9, batch size 64, no weight decay). RMemSafe introduces five additional hyperparameters, all fixed across all nine benchmarks: anchor strength $\lambda=2.0$, entropy scale $\alpha=2.0$, divergence scale $\beta=1.0$, marginal weight $\lambda_{\mathrm{marg}}=0.1$, and confidence-LR floor $\eta_{\min}=0.2$.
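For reference, the reported settings can be collected into one immutable configuration object. The class and field names below are hypothetical; only the numeric values come from the text above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RMemSafeConfig:
    # Optimizer settings inherited from ROID, as reported in the text.
    lr: float = 2.5e-4
    momentum: float = 0.9
    batch_size: int = 64
    weight_decay: float = 0.0
    # The five additional RMemSafe hyperparameters, fixed across all nine cells.
    anchor_strength: float = 2.0      # lambda
    entropy_scale: float = 2.0        # alpha
    divergence_scale: float = 1.0     # beta
    marginal_weight: float = 0.1      # lambda_marg
    lr_floor: float = 0.2             # eta_min

cfg = RMemSafeConfig()
print(cfg)
```

Freezing the dataclass mirrors the protocol of using one setting everywhere: any attempt to change a value per benchmark raises an error instead of silently diverging.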
### 4.2 Matched-Split Benchmark Results
RMemSafe+ASR achieves the lowest error on 8 of 9 benchmark cells (Table [1](https://arxiv.org/html/2605.14063#S4.T1)) and is the *uniquely best method in the reset-based family on every one of the 9 cells*; the single exception (CCC-Hard ViT, against non-reset ROID) is a property of the reset paradigm itself, which we analyze in §[4.3](https://arxiv.org/html/2605.14063#S4.SS3). Against ROID+ASR, the strongest matched-split prior baseline, RMemSafe reduces the CCC-ResNet-50 mean by **1.05** pp and the CCC-ViT mean by **0.48** pp. On the long-horizon benchmarks, where twenty revisit cycles amplify error accumulation, RMemSafe reduces CIN-C error by 0.80 to 0.88 pp and IN-C error by 0.47 pp relative to ROID+ASR. No other baseline, including ETA+ASR and EATA+ASR, achieves the best score on any cell. Per-split paired $t$-tests (Appendix [E](https://arxiv.org/html/2605.14063#A5)) confirm the ResNet-50 improvements are statistically significant: the pooled comparison across the three CCC levels yields $p<10^{-16}$ against ROID+ASR. The pooled CCC RN-50 effect is $\Delta=-1.05$ pp, 95% CI $[-1.16,-0.94]$, $d_z=-3.65$ (Table [7](https://arxiv.org/html/2605.14063#A5.T7)); all CIN-C cells are significant at $p<10^{-10}$. The lone non-significant cell (CCC-Hard ViT, $p=0.052$) is the reset-paradigm failure of §[4.3](https://arxiv.org/html/2605.14063#S4.SS3).
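The paired statistics quoted above can be sketched with the standard library. The per-split errors in the example are synthetic placeholders, not values from the paper. The last function checks the reported numbers for internal consistency under two assumptions of ours: that the pooled test uses 27 pairs (three CCC levels times nine splits), and that the interval is a two-sided 95% $t$-interval with critical value about 2.056 for 26 degrees of freedom.

```python
import math
import statistics

def paired_summary(err_ours, err_base):
    """Mean paired difference, Cohen's d_z, and paired t statistic."""
    diffs = [a - b for a, b in zip(err_ours, err_base)]
    n = len(diffs)
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    d_z = mean_d / sd_d
    t_stat = mean_d / (sd_d / math.sqrt(n))
    return mean_d, d_z, t_stat

def ci_half_width(mean_d, d_z, n, t_crit):
    """Recover the CI half-width implied by a reported mean difference and d_z."""
    sd_d = abs(mean_d / d_z)
    return t_crit * sd_d / math.sqrt(n)

# Synthetic per-split errors (%), for illustration only.
ours = [60.0, 61.0, 59.5, 60.5]
base = [61.0, 62.2, 60.4, 61.7]
print(paired_summary(ours, base))

# Consistency check on the reported pooled effect (n = 27 is our assumption).
print(ci_half_width(-1.05, -3.65, n=27, t_crit=2.056))
```

Under those assumptions the implied half-width is about 0.11 pp, which matches the reported interval $[-1.16,-0.94]$ around $-1.05$.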
Table 1: Error \(%, lower is better\) on all nine continual TTA benchmark cells, means across 9 CCC splits, 10 CIN\-C seeds, or a single IN\-C 20\-revisit run; per\-split std in Table [4](https://arxiv.org/html/2605.14063#A4.T4) \(appendix\)\. Bold marks the best method per column\. All baselines are re\-run locally on identical data splits\. Our method \(grey row\) wins 8 of 9 cells; the single non\-win \(CCC\-Hard ViT\) is analyzed in §[4\.3](https://arxiv.org/html/2605.14063#S4.SS3)\.

Adaptation\-gap decomposition\. The sequential progression from Source to RMemSafe\+ASR decomposes the total CCC ResNet\-50 improvement into three stages:

$$\underbrace{78.62}_{\text{Source}} \xrightarrow{-15.98} \underbrace{62.64}_{\text{ROID}} \xrightarrow{-1.36} \underbrace{61.28}_{\text{ROID+ASR}} \xrightarrow{-1.05} \underbrace{\mathbf{60.23}}_{\text{RMemSafe+ASR}} \quad \text{(pp CCC-mean)} \tag{6}$$
Pure adaptation contributes the bulk of the gain \(15\.98 pp\), ASR reset contributes a further 1\.36 pp, and the integrated RMemSafe objective contributes another 1\.05 pp on top of ROID\+ASR\. Decomposing the final 1\.05 pp step of Eq\. [6](https://arxiv.org/html/2605.14063#S4.E6) by component \(detailed in App\. [J](https://arxiv.org/html/2605.14063#A10)\) attributes $\sim 1.00$ pp to the auxiliary stabilizers \(decoupled flip, confidence\-scaled LR, marginal calibration\) and $\sim 0.04$ pp to the reliability\-gated terms \(anchor, source\-expert agreement\) on the CCC mean\. This decomposition reflects the runtime regime: the matched\-split gains come from the integrated objective, while the gate’s safety role is isolated on the controlled source\-degradation axis below and by the confidently\-wrong\-source stress test \(App\. [P](https://arxiv.org/html/2605.14063#A16)\)\.

Head\-to\-head against reset\-based methods\. Across the 9 benchmark cells, RMemSafe\+ASR never loses to any of the three ASR\-augmented baselines \(ETA\+ASR, EATA\+ASR, ROID\+ASR\) or to ROID\+RDumb: $9/9$ wins in the reset\-based family\.
### 4\.3Reset\-Paradigm Failure on ViT\-Hard
Our matched\-split evaluation surfaces a regime in which every reset\-based method we evaluate underperforms the non\-reset ROID baseline\. On CCC\-Hard with ViT\-B/16, the error ordering

$$\underbrace{75.98}_{\text{ROID}} < \underbrace{76.50}_{\text{ROID+RDumb}} < \underbrace{80.11}_{\text{ETA+ASR}} \approx \underbrace{80.58}_{\text{EATA+ASR}} < \underbrace{83.18}_{\text{RMemSafe+ASR}} < \underbrace{84.04}_{\text{ROID+ASR}}$$

is invariant across base adapters \(ROID, ETA, EATA, RMemSafe\) and reset mechanisms \(periodic, adaptive\)\. The phenomenon is specific to ViT\-B/16: on the same CCC\-Hard benchmark with ResNet\-50, the same ASR controller improves on plain ROID by 1\.51 pp \(84\.56 vs 86\.07\), even though the ResNet\-50 source is weaker \(1\.33% top\-1 versus 13\.42% on ViT\)\. Source quality alone, therefore, does not explain the ordering; if it did, the ResNet\-50 reset configuration should suffer at least as much\. Plausible architecture\-specific candidates include attention\-block sensitivity to discrete parameter restoration and LayerNorm\-affine dynamics under repeated resets, but our experiments do not isolate a single mechanism, and we report this as a characterized observation rather than a fully explained one\. Within the reset\-based family, RMemSafe\+ASR is the strongest on CCC\-Hard ViT, improving over ROID\+ASR by 0\.86 pp; gating the reset trigger itself by a time\-averaged $\mathcal{R}_{\mathrm{src}}$ \(App\. [O](https://arxiv.org/html/2605.14063#A15)\) closes most of the residual gap to non\-reset ROID at threshold $\tau_{\mathrm{gate}} = 0.40$, coming within 0\.44 pp of the non\-reset baseline on the mean \(76\.42 vs 75\.98\)\. A full mechanistic account of the architecture\-specific failure and a reset controller that is reliable for every split rather than only on the mean are left to future work \(§[5](https://arxiv.org/html/2605.14063#S5)\)\.
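The reliability-gated reset trigger mentioned above can be sketched as a thin wrapper around any reset controller. This is a hedged illustration only: the class name, the moving-average window, the behavior on an empty history, and the `>=` comparison are assumptions, since the paper defers the details to its Appendix O. Only the threshold value 0.40 is taken from the text.

```python
from collections import deque


class ReliabilityGatedReset:
    """Hypothetical wrapper that suppresses resets toward an unreliable source."""

    def __init__(self, tau_gate=0.40, window=50):
        self.tau_gate = tau_gate              # threshold quoted in the text
        self._recent = deque(maxlen=window)   # window length is an assumption

    def observe(self, r_src):
        """Record the per-batch source reliability R_src in [0, 1]."""
        self._recent.append(float(r_src))

    def averaged_reliability(self):
        # With no observations yet, report 0.0 so that no reset is allowed.
        if not self._recent:
            return 0.0
        return sum(self._recent) / len(self._recent)

    def allow_reset(self, controller_triggered):
        """Pass the controller's reset through only if the source looks reliable."""
        return bool(controller_triggered) and self.averaged_reliability() >= self.tau_gate


# Illustrative values only: a low-reliability stream blocks resets,
# a high-reliability stream lets the controller's decision through.
low_gate, high_gate = ReliabilityGatedReset(), ReliabilityGatedReset()
for _ in range(10):
    low_gate.observe(0.26)
    high_gate.observe(0.84)
print(low_gate.allow_reset(True), high_gate.allow_reset(True))
```

The design choice here is that the gate can only veto a reset, never force one, so the underlying controller's schedule is left unchanged whenever the source is judged reliable.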
Table 2: CCC ViT\-B/16 accuracy \(%\) across difficulty levels, focused view of the reset\-paradigm observation\. Plain ROID \(no reset\) leads on Hard because resets toward the degraded source are harmful in this regime\. Among reset\-based methods, RMemSafe\+ASR is best on Easy and Medium and best within the reset\-based family on Hard\. Bold marks the best per column\.
### 4\.4Controlled Source Degradation
The main benchmark results establish that RMemSafe improves over prior methods on standard continual\-corruption streams, but they do not directly measure how the method responds to *varying* source quality, the axis along which Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1) makes its prediction\. To probe this axis, we conduct a controlled experiment in which the source model’s clean\-test accuracy is varied continuously by injecting Gaussian noise into its convolutional weights and normalization affine parameters\. For each parameter tensor $\theta_k$ with per\-tensor standard deviation $\sigma_k$, we set $\theta_k \leftarrow \theta_k + \epsilon\,\sigma_k\,\mathcal{N}(0, I)$, with $\epsilon$ calibrated by binary search so that the resulting source has clean\-test accuracy $S \in \{0.75, 0.30, 0.12\}$\. Biases and BN running statistics are not perturbed\. The full protocol is in Appendix [L](https://arxiv.org/html/2605.14063#A12)\. We then run ROID\+ASR and RMemSafe\+ASR on CIN\-C \(both i\.i\.d\. and correlated orderings, 3 seeds\) using each of the three degraded sources, for a total of 36 runs\. Figure [3](https://arxiv.org/html/2605.14063#S4.F3) reports the result\.
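The degradation protocol above (per-tensor Gaussian noise plus a binary search over the noise scale) can be sketched in plain Python. This is an illustrative sketch, not the authors' code: tensors are flat lists of floats, `toy_accuracy` is a made-up monotone stand-in for evaluating the perturbed source on clean test data, and the search bounds and tolerance are assumptions.

```python
import math
import random
import statistics


def perturb(tensor, eps, rng):
    """theta_k <- theta_k + eps * sigma_k * N(0, I), with sigma_k the tensor's own std."""
    sigma = statistics.pstdev(tensor)
    return [w + eps * sigma * rng.gauss(0.0, 1.0) for w in tensor]


def calibrate_eps(accuracy_at, target, lo=0.0, hi=4.0, tol=1e-3, max_iter=60):
    """Binary search for eps such that accuracy_at(eps) is within tol of target.

    Assumes accuracy decreases monotonically as eps grows.
    """
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        acc = accuracy_at(mid)
        if abs(acc - target) <= tol:
            return mid
        if acc > target:
            lo = mid  # source still too accurate: increase the noise
        else:
            hi = mid  # source too degraded: decrease the noise
    return 0.5 * (lo + hi)


def toy_accuracy(eps):
    """Hypothetical stand-in for clean-test accuracy of the perturbed source."""
    return 0.80 * math.exp(-eps)


for target in (0.75, 0.30, 0.12):
    eps = calibrate_eps(toy_accuracy, target)
    print(target, round(eps, 3), round(toy_accuracy(eps), 3))
```

In a real run the accuracy evaluation is itself noisy because it depends on the noise draw, so fixing the random seed during calibration is one way to keep the search well behaved.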
\[Figure 3, three panels\. Left panel “Paired per\-split \(54 CCC splits\)”: x\-axis ROID\+ASR err\. \(%\), y\-axis RMemSafe\+ASR err\. \(%\), series H/M/E for RN50 and ViT\. Center panel “CIN\-C iid” and right panel “CIN\-C correlated”: x\-axis source acc\. $S$, y\-axis CIN\-C err\. \(%\), series ROID\+ASR and RMemSafe\+ASR\.\]

Figure 3: Left: paired per\-split CCC comparison \($n = 54$\); RMemSafe is below $y = x$ on 51 splits, on the diagonal \($|\Delta| \leq 0.02$ pp\) on 2, and 0\.17 pp above on 1 \(a CCC\-Hard ViT split where both methods are at $\sim 98\%$ error\)\. Center, right: controlled source\-degradation on CIN\-C, varying source clean\-test accuracy $S$ via Gaussian weight noise \(Appendix [L](https://arxiv.org/html/2605.14063#A12)\)\. Error bars: $\pm 1$ std over 3 seeds; x\-axis reversed\. The RMemSafe\+ASR $-$ ROID\+ASR gap widens monotonically as $S$ decreases \(§[4\.4](https://arxiv.org/html/2605.14063#S4.SS4)\), consistent with graceful decay\.

Interpretation\. RMemSafe\+ASR is better than or tied with ROID\+ASR on all 6 cells and strictly better on 5 \(at $S = 0.75$ correlated the pooled gap is $+0.29$ pp; within\-cell std on this configuration is 0\.34 for ROID\+ASR and 0\.40 for RMemSafe\+ASR, both larger than the observed gap, so we treat this cell as a tie\)\. The gap widens monotonically as $S$ decreases: at $S = 0.75$ the pooled improvement is $-0.23$ pp; at $S = 0.30$, $-0.76$ pp; at $S = 0.12$, $-1.17$ pp\.

Fitting a harm slope $H_m = (\mathrm{err}_m(S_{\min}) - \mathrm{err}_m(S_{\max})) / (S_{\max} - S_{\min})$ across the tested range yields $H_{\mathrm{ROID+ASR}} = 12.92$ pp per unit of $S$ versus $H_{\mathrm{RMemSafe+ASR}} = 11.43$, a ratio of 1\.13 in the direction predicted by Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1)\. Per\-batch reliability traces collected during these runs show $\mathcal{R}_{\mathrm{src}}$ drifting monotonically from approximately 0\.84 at $S = 0.75$ to approximately 0\.78 at $S = 0.12$ \(Appendix [L](https://arxiv.org/html/2605.14063#A12), Table [13](https://arxiv.org/html/2605.14063#A12.T13)\)\. The drift is modest because weight noise does not push the source to the posterior\-uniform limit that saturates the gate, but it is consistent across seeds and both orderings, indicating that the gate is tracking source quality rather than collapsing to a constant\.

Scope\. Weight noise yields uniformly\-confused, high\-entropy sources, directly probing the regime $\mathcal{R}_{\mathrm{src}}$ is designed to detect; low\-entropy miscalibration is not exercised here \(§[5](https://arxiv.org/html/2605.14063#S5)\)\.
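The harm-slope arithmetic can be checked with a few lines of Python. The sketch below uses only the reported slopes and the tested range of S; the function name is hypothetical, and the helper is shown for completeness because the absolute per-cell errors are not quoted in this section.

```python
def harm_slope(err_at_s_min, err_at_s_max, s_min, s_max):
    """Error increase (pp) per unit drop in source accuracy S."""
    return (err_at_s_min - err_at_s_max) / (s_max - s_min)


# Reported slopes (pp per unit of S) and the tested range of S.
H_ROID_ASR, H_RMEMSAFE_ASR = 12.92, 11.43
S_MIN, S_MAX = 0.12, 0.75

ratio = H_ROID_ASR / H_RMEMSAFE_ASR
# Difference in total error growth between the two methods across the range, in pp.
gap_growth = (H_ROID_ASR - H_RMEMSAFE_ASR) * (S_MAX - S_MIN)
print(round(ratio, 2), round(gap_growth, 2))  # 1.13 0.94
```

The 0.94 pp difference in error growth matches the widening of the pooled gap from 0.23 pp at S = 0.75 to 1.17 pp at S = 0.12, so the slope ratio and the per-level gaps are consistent with each other.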
### 4\.5Ablations
To isolate each component of the RMemSafe objective, we disable one contribution at a time and re\-evaluate on all 27 CCC ResNet\-50 splits\. Figure [4](https://arxiv.org/html/2605.14063#S4.F4) shows the degradation of CCC\-mean error\.

\[Figure 4, bar chart\. Axis: $\Delta$ CCC\-mean \(pp\) when component removed\. Bars: source anchor $+0.00$; marg\. calibration $+0.00$; src\-expert agreement $+0.04$; confidence LR $+0.10$; decoupled augmentation $+0.79$\.\]

Figure 4: Component ablation on CCC ResNet\-50 \(mean over 27 splits\)\. Two of the five ablated components \(anchor, source\-expert agreement\) are multiplied by $\mathcal{R}_{\mathrm{src}} \approx 0.26$ at CCC\-Hard runtime \(App\. [K](https://arxiv.org/html/2605.14063#A11)\); the three ungated components \(marg\. calibration, confidence\-scaled LR, decoupled flip\) are not\. The decoupled flip is the only contribution with a large leave\-one\-out effect \($+0.79$ pp\)\.

Gated components interpretation\. Figure [4](https://arxiv.org/html/2605.14063#S4.F4) shows that several leave\-one\-out effects are small on the standard CCC mean\. This is the expected runtime behavior of reliability gating, and is directly predicted by Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1)\. Recall from Eq\. \([2](https://arxiv.org/html/2605.14063#S3.E2)\) that the source anchor, the source\-expert agreement gate, and the divergence term are all multiplied by $\mathcal{R}_{\mathrm{src}} = \max(0,\, 1 - \mathcal{H}_{\mathrm{src}})$\. On CCC\-Hard, the frozen source produces a broad low\-confidence posterior \(top\-1 accuracy about 1%, normalized entropy $\mathcal{H}_{\mathrm{src}} \approx 0.74$\), so $\mathcal{R}_{\mathrm{src}}$ stabilizes at a low floor of roughly 0\.26 for the entire run rather than collapsing to zero \(Figure [6](https://arxiv.org/html/2605.14063#A11.F6), appendix\)\.

Consequently, $\mathcal{L}_{\mathrm{anch}}$ is scaled down by roughly $3.8\times$ at runtime relative to its configured strength, the same $3.8\times$ attenuation previewed in Figure [1](https://arxiv.org/html/2605.14063#S1.F1), and at this effective strength the anchor contributes a small but non\-zero pull\. The matched\-split gains therefore come from the integrated objective, while the gate’s safety role is isolated on the controlled source\-degradation axis in §[4\.4](https://arxiv.org/html/2605.14063#S4.SS4)\. Separately, a $16\times$ sweep of $\lambda$ \(Figure [5](https://arxiv.org/html/2605.14063#A8.F5)\) moves CCC\-Hard error by only 0\.21 pp, because every configured $\lambda$ is multiplied by the same $\sim 0.26$ factor at runtime, so the effective sweep range is narrow\. Under catastrophic shifts, the system gracefully attenuates the source\-anchored terms, and the residual objective is borne by the ROID base losses, marginal calibration, and the decoupled inference\-time flip, in the non\-collapsing fallback regime of Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1)\.
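The gate and the resulting attenuation factor can be illustrated with a short sketch. It assumes the gate is computed from a single class-probability vector; whether the paper averages entropy over a batch before gating is not stated in this section, and the function names are hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a probability vector divided by log C, so it lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def reliability(probs):
    """R_src = max(0, 1 - H_src): 1 for a one-hot posterior, 0 for a uniform one."""
    return max(0.0, 1.0 - normalized_entropy(probs))


def attenuation(h_src):
    """Factor by which a gated term shrinks relative to its configured strength."""
    r_src = max(0.0, 1.0 - h_src)
    return math.inf if r_src == 0.0 else 1.0 / r_src


print(reliability([0.25, 0.25, 0.25, 0.25]))  # uniform posterior: gate closed
print(reliability([1.0, 0.0, 0.0, 0.0]))      # one-hot posterior: gate fully open
print(round(attenuation(0.74), 2))            # 3.85, quoted as roughly 3.8x in the text
```

The last line reproduces the CCC-Hard figure: a normalized entropy of about 0.74 leaves a gate value of about 0.26, which shrinks every gated term by a factor of about 3.8.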
### 4\.6Robustness Analysis
Split\-level paired comparison\. Figure [3](https://arxiv.org/html/2605.14063#S4.F3) plots per\-split ROID\+ASR error against RMemSafe\+ASR error on all 54 CCC splits\. Every point except three lies below the $y = x$ diagonal: RMemSafe reduces error on 51 of 54 individual splits, is essentially on the diagonal on 2, and is 0\.17 pp above it on 1\. All three near\-diagonal points are CCC\-Hard ViT splits on which both methods are pinned near chance accuracy \($\sim 98\%$ error\); the single marginal non\-improvement at that accuracy level is not distinguishable from seed noise\. The clear diagonal shift on the remaining 51 splits demonstrates that the mean improvements in Table [1](https://arxiv.org/html/2605.14063#S4.T1) are consistent across individual splits rather than artifacts of favorable averaging\.

Variance across splits\. Per\-split standard deviations on CCC\-Hard \(ResNet\-50\) are $\pm 9.52$ pp for RMemSafe\+ASR and $\pm 9.39$ pp for ROID\+ASR, a property of the benchmark rather than of either method: both vary by similar amounts across the *same* splits, so the paired comparison of Figure [3](https://arxiv.org/html/2605.14063#S4.F3) and the matched\-split means of Table [1](https://arxiv.org/html/2605.14063#S4.T1) are more informative than any single\-split number\.

Within\-run behavior\. On CCC\-Hard split 3 with RN\-50, sliding\-window accuracy \(window $= 400$, stride $= 30$\) over 3,128 batches shows that RMemSafe\+ASR holds a 1\.1 pp higher mean band than ROID\+ASR \(27\.2% vs\. 26\.1%\), with smaller dips around reset events; the full trace is in Appendix [K](https://arxiv.org/html/2605.14063#A11)\.

Local\-data offset\. Our CCC shards yield error rates roughly 7 pp higher than the streamed data of Lim et al\. \[[14](https://arxiv.org/html/2605.14063#bib.bib14)\] on CCC\-Hard \(RN\-50\); the offset is approximately constant across methods and preserves ranking \(Appendix [Q](https://arxiv.org/html/2605.14063#A17)\)\. Cross\-study absolute comparisons on CCC\-Hard should therefore be interpreted with caution, whereas the matched\-split head\-to\-head of Table [1](https://arxiv.org/html/2605.14063#S4.T1) compares all methods on identical data and is the appropriate basis for judging relative method quality\.
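The per-split bookkeeping behind the 51 / 2 / 1 breakdown can be sketched as below. The tie tolerance of 0.02 pp comes from the Figure 3 caption; the function name and the three example error pairs are made up for illustration.

```python
def classify_splits(baseline_err, method_err, tie_tol=0.02):
    """Count splits below, on, and above the y = x diagonal (errors in %)."""
    counts = {"below": 0, "tie": 0, "above": 0}
    for base, ours in zip(baseline_err, method_err):
        delta = ours - base
        if abs(delta) <= tie_tol:
            counts["tie"] += 1      # on the diagonal within tolerance
        elif delta < 0:
            counts["below"] += 1    # method has lower error than the baseline
        else:
            counts["above"] += 1    # method has higher error than the baseline
    return counts


# Hypothetical per-split errors, not taken from the paper.
baseline = [60.0, 70.0, 98.0]
method = [59.0, 70.01, 98.17]
print(classify_splits(baseline, method))
```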
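A sliding-window trace like the one described above can be computed as follows. This sketch assumes the window and stride are measured in batches and that per-batch accuracies are averaged without weighting; the function name and the constant input are illustrative.

```python
def sliding_window_accuracy(per_batch_acc, window=400, stride=30):
    """Mean accuracy over each full window of consecutive batches."""
    trace = []
    for start in range(0, len(per_batch_acc) - window + 1, stride):
        chunk = per_batch_acc[start:start + window]
        trace.append(sum(chunk) / window)
    return trace


# A stream of 3,128 batches produces 91 full windows at these settings.
trace = sliding_window_accuracy([0.27] * 3128)
print(len(trace))  # 91
```

Only full windows are kept, so the first reported point covers batches 0 to 399 and the trace ends before the final partial window.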
## 5Discussion
Our matched\-split evaluation surfaces two scope boundaries that frame the contribution\. First, the *reset paradigm itself* has a boundary distinct from any particular schedule: on CCC\-Hard with ViT\-B/16, every reset\-based CTTA method we evaluate underperforms non\-reset ROID across base adapters \(ROID, ETA, EATA, RMemSafe\) and reset mechanisms \(periodic, adaptive\); all of them restore parameters toward a degraded frozen source, which is harmful in this regime\. The failure appears inherent to the design pattern rather than to reset timing, and to our knowledge has not been characterized at matched\-split granularity before; a direct extension is to gate the reset trigger itself by a time\-averaged $\mathcal{R}_{\mathrm{src}}$, suppressing resets toward a source already deemed unreliable by the same signal that gates the adaptation loss\. Second, reliability gating must not be conflated with reliability *detection*: the gate uses source entropy as its sole signal, which keeps Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1) analytically available and is correct for high\-entropy source collapse \(on CCC\-Hard ResNet\-50, $\mathcal{H}_{\mathrm{src}} \approx 0.74$ closes the gate to $\mathcal{R}_{\mathrm{src}} \approx 0.26$, attenuating the anchor by $3.8\times$\), but a *confidently miscalibrated* source \(low entropy on wrong classes\) would be deemed reliable and the anchor activated toward it\. Standard CTTA benchmarks produce high\-entropy collapse rather than low\-entropy miscalibration, so this regime is not exercised by the main benchmark but is a real failure mode in principle; the entropy\-based gate is sound under the typical\-shift assumption that catastrophic failure manifests as predictive uncertainty, and a correctness\-aware signal preserving the graceful\-decay guarantee is a direction for future work\.
## 6Conclusion
RMemSafe gates all explicit source\-coupled uses in the CTTA objective by $\mathcal{R}_{\mathrm{src}} = \max(0, 1 - \mathcal{H}_{\mathrm{src}})$, and Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1) shows that the objective reduces to a source\-agnostic fallback when source entropy saturates\. We evaluate the method on two axes kept deliberately distinct\. On matched\-split benchmarks the integrated objective attains the lowest error on 8 of 9 cells and is the best reset\-based method on all 9, improving ROID\+ASR, the strongest matched\-split prior baseline, by 1\.05 pp \(ResNet\-50\) and 0\.48 pp \(ViT\-B/16\); ablations attribute this gain primarily to the stabilizers operating inside the gated objective \(App\. [I](https://arxiv.org/html/2605.14063#A9)\)\. On a controlled source\-quality axis, the gate is load\-bearing: a $1.13\times$ shallower harm slope under Gaussian source\-noise sweeps \(App\. [M](https://arxiv.org/html/2605.14063#A13)\) and, on ResNet\-50, recovery of the ROID\-only fallback under a confidently\-wrong source \($\Delta = -0.17$ pp CCC mean, App\. [P](https://arxiv.org/html/2605.14063#A16)\)\. The entropy signal under\-attenuates a confidently\-wrong ViT source \($\Delta = +1.14$ pp\); a correctness\-aware gate and a reliability\-gated reset trigger that resolves the ViT reset\-paradigm failure without per\-cell tuning are direct next steps\.
## References
- \[1\] Chen, W\., Singh, V\., Rahmani, Z\., Ganguly, D\., Hariri, M\., and Chaudhary, V\. \(2025\)\. K4: Online log anomaly detection via unsupervised typicality learning\. arXiv preprint arXiv:2507\.20051\.
- \[2\] Döbler, M\., Marsden, R\. A\., and Yang, B\. \(2023\)\. Robust mean teacher for continual and gradual test\-time adaptation\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\), pages 7704–7714\.
- \[3\] Ganguly, D\., Iyengar, S\., Chaudhary, V\., and Kalyanaraman, S\. \(2024\)\. PROOF OF THOUGHT: Neurosymbolic program synthesis allows robust and interpretable reasoning\. In The First Workshop on System\-2 Reasoning at Scale, NeurIPS’24\.
- \[4\] Ganguly, D\., Morningstar, W\. R\., Yu, A\. S\., and Chaudhary, V\. \(2025a\)\. Forte: Finding outliers with representation typicality estimation\. In The Thirteenth International Conference on Learning Representations\.
- \[5\] Ganguly, D\., Sankar, S\., Zhang, B\., Singh, V\., Gupta, K\., Kavuru, H\., Luo, A\., et al\. \(2026\)\. Trust the typical: An out\-of\-distribution safety detection framework\. arXiv preprint arXiv:2602\.04581\. ICLR 2026\.
- \[6\] Ganguly, D\., Singh, V\., Sankar, S\., Zhang, B\., Zhang, X\., Iyengar, S\., Han, X\., et al\. \(2025b\)\. Grammars of formal uncertainty: When to trust LLMs in automated reasoning tasks\. arXiv preprint arXiv:2505\.20047\. NeurIPS 2025\.
- \[7\] Gong, T\., Jeong, J\., Kim, T\., Kim, Y\., Shin, J\., and Lee, S\.\-J\. \(2022\)\. NOTE: Robust continual test\-time adaptation against temporal correlation\. In Oh, A\. H\., Agarwal, A\., Belgrave, D\., and Cho, K\., editors, Advances in Neural Information Processing Systems\.
- \[8\] Hendrycks, D\. and Dietterich, T\. \(2019\)\. Benchmarking neural network robustness to common corruptions and perturbations\. In International Conference on Learning Representations\.
- \[9\] Hoang, T\.\-H\., Vo, D\. M\., and Do, M\. N\. \(2024\)\. Persistent test\-time adaptation in recurring testing scenarios\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- \[10\] Iwasawa, Y\. and Matsuo, Y\. \(2021\)\. Test\-time classifier adjustment module for model\-agnostic domain generalization\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- \[11\] Lee, J\., Jung, D\., Lee, S\., Park, J\., Shin, J\., Hwang, U\., and Yoon, S\. \(2024\)\. Entropy is not enough for test\-time adaptation: From the perspective of disentangled factors\. In International Conference on Learning Representations \(ICLR\)\. Spotlight \(top 5%\)\.
- \[12\] Lee, J\.\-H\. and Chang, J\.\-H\. \(2024\)\. Continual momentum filtering on parameter space for online test\-time adaptation\. In The Twelfth International Conference on Learning Representations\.
- \[13\] Liang, J\., He, R\., and Tan, T\. \(2023\)\. A comprehensive survey on test\-time adaptation under distribution shifts\. International Journal of Computer Vision\.
- \[14\] Lim, T\., Hwang, J\.\-W\., and Lee, K\. \(2026\)\. When and where to reset matters for long\-term test\-time adaptation\. arXiv preprint arXiv:2603\.03796\.
- \[15\] Liu, Y\., Kothari, P\., van Delft, B\., Bellot\-Gurlet, B\., Mordan, T\., and Alahi, A\. \(2021\)\. TTT\+\+: When does self\-supervised test\-time training fail or thrive? In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- \[16\] Marsden, R\. A\., Döbler, M\., and Yang, B\. \(2024\)\. Universal test\-time adaptation through weight ensembling, diversity weighting, and prior correction\. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2554–2564\.
- \[17\] Mishra, H\. \(2026\)\. Rdumb\+\+: Drift\-aware continual test\-time adaptation\.
- \[18\] Mummadi, C\. K\., Hutmacher, R\., Rambach, K\., Levinkov, E\., Brox, T\., and Metzen, J\. H\. \(2021\)\. Test\-time adaptation to distribution shift by confidence maximization and input transformation\. arXiv preprint arXiv:2106\.14999\.
- \[19\] Niu, S\., Wu, J\., Zhang, Y\., Chen, Y\., Zheng, S\., Zhao, P\., and Tan, M\. \(2022\)\. Efficient test\-time model adaptation without forgetting\. In International Conference on Machine Learning, pages 16888–16905\. PMLR\.
- \[20\] Niu, S\., Wu, J\., Zhang, Y\., Wen, Z\., Chen, Y\., Zhao, P\., and Tan, M\. \(2023\)\. Towards stable test\-time adaptation in dynamic wild world\. In International Conference on Learning Representations\.
- \[21\] Prabhu, A\., Torr, P\. H\., and Dokania, P\. K\. \(2020\)\. GDumb: A simple approach that questions our progress in continual learning\. In European Conference on Computer Vision \(ECCV\), pages 524–540\.
- \[22\] Press, O\., Schneider, S\., Kümmerer, M\., and Bethge, M\. \(2023\)\. RDumb: A simple approach that questions our progress in continual test\-time adaptation\. In Advances in Neural Information Processing Systems, volume 36, pages 39915–39935\.
- \[23\] Rusak, E\., Schneider, S\., Pachitariu, G\., Eck, L\., Gehler, P\. V\., Bringmann, O\., Brendel, W\., and Bethge, M\. \(2022\)\. If your data distribution shifts, use self\-learning\. Transactions on Machine Learning Research \(TMLR\)\.
- \[24\] Schneider, S\., Rusak, E\., Eck, L\., Bringmann, O\., Brendel, W\., and Bethge, M\. \(2020\)\. Improving robustness against common corruptions by covariate shift adaptation\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- \[25\] Singh, V\., Cassel, D\., Weir, N\., Feng, N\., and Bayless, S\. \(2026a\)\. VERGE: Formal refinement and guidance engine for verifiable LLM reasoning\. arXiv preprint arXiv:2601\.20055\.
- \[26\] Singh, V\., Ganguly, D\., Yu, H\., Zhou, C\., Singh, P\., Lee, B\., Chaudhary, V\., and Datta, G\. \(2026b\)\. Toward guarantees for clinical reasoning in vision language models via formal verification\. arXiv preprint arXiv:2602\.24111\.
- \[27\] Song, J\., Lee, J\., Kweon, I\. S\., and Choi, S\. \(2023\)\. EcoTTA: Memory\-efficient continual test\-time adaptation via self\-distilled regularization\. In IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)\.
- \[28\] Sun, Y\., Wang, X\., Liu, Z\., Miller, J\., Efros, A\. A\., and Hardt, M\. \(2020\)\. Test\-time training with self\-supervision for generalization under distribution shifts\. In Proceedings of the 37th International Conference on Machine Learning \(ICML\)\.
- \[29\] Tzeng, E\., Hoffman, J\., Saenko, K\., and Darrell, T\. \(2017\)\. Adversarial discriminative domain adaptation\. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\), pages 7167–7176\.
- \[30\] Wang, D\., Shelhamer, E\., Liu, S\., Olshausen, B\., and Darrell, T\. \(2021\)\. Tent: Fully test\-time adaptation by entropy minimization\. In International Conference on Learning Representations\.
- \[31\] Wang, N\., Liang, T\., Singh, V\., Song, C\., Yang, V\., Yin, Y\., Ma, J\., Singh, J\., et al\. \(2026a\)\. HugRAG: Hierarchical causal knowledge graph design for RAG\. arXiv preprint arXiv:2602\.05143\.
- \[32\] Wang, Q\., Fink, O\., Van Gool, L\., and Dai, D\. \(2022\)\. Continual test\-time domain adaptation\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7201–7211\.
- \[33\] Wang, S\., Yang, W\., Ma, C\., Ganguly, D\., Singh, V\., Song, C\., Li, X\., Long, X\., Chaudhary, V\., and Han, X\. \(2026b\)\. Path\-lock expert: Separating reasoning mode in hybrid thinking via architecture\-level separation\.
- \[34\] Yang, W\., Ganguly, D\., Li, X\., Song, C\., Wang, S\., Singh, V\., Chaudhary, V\., and Han, X\. \(2026\)\. Mid\-Think: Training\-free intermediate\-budget reasoning via token\-level triggers\. arXiv preprint arXiv:2601\.07036\.
- \[35\] Yuan, L\., Xie, B\., and Li, S\. \(2023\)\. Robust test\-time adaptation in dynamic scenarios\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\), pages 15922–15932\.
- \[36\] Zhang, M\., Levine, S\., and Finn, C\. \(2022\)\. MEMO: Test time robustness via adaptation and augmentation\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- \[37\] Zhang, Q\., Bian, Y\., Kong, X\., Zhao, P\., and Zhang, C\. \(2025\)\. COME: Test\-time adaption by conservatively minimizing entropy\. In International Conference on Learning Representations \(ICLR\)\.
## Appendix AAlgorithm
Algorithm [1](https://arxiv.org/html/2605.14063#alg1) gives the full per\-batch update rule of RMemSafe\+ASR\. The method combines the ROID backbone \(soft\-likelihood\-ratio loss, diversity weighting, and prior correction\) \[[16](https://arxiv.org/html/2605.14063#bib.bib16)\] with five additions, of which the source\-coupled ones are scaled by a single runtime gate, the source reliability $\mathcal{R}_{\mathrm{src}} \in [0, 1]$\. When the frozen source is confident \($\mathcal{R}_{\mathrm{src}} \to 1$\), all safety terms are active; when the source is catastrophically confused on severe corruption \($\mathcal{R}_{\mathrm{src}} \to 0$\), the source\-dependent signals gracefully decay to zero, and the system falls back to a safe source\-agnostic ROID\-style adaptation with marginal calibration, confidence\-scaled updates, and decoupled inference\-time flip averaging \(Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1)\)\.
Algorithm 1 Per\-batch update of RMemSafe\+ASR\.

1: Input: test batch $x_t$; frozen source $f_{\theta^*}$ with parameters $\theta^*$; expert $f_\theta$; optimizer $\mathcal{O}$; ASR controller $\mathcal{A}$; running class prior $p_{\mathrm{prior}}$; hyperparameters $\lambda, \alpha, \beta, \lambda_{\mathrm{marg}}, w_{\min}, \eta_{\min}, \gamma_{\min}, \gamma_{\max}$\.

2: Output: prediction $\hat{y}$ and updated $\theta$\.

3: $p_{\mathrm{src}} \leftarrow \mathrm{softmax}(f_{\theta^*}(x_t))$ $\triangleright$ frozen source, no grad

4: $p_{\mathrm{exp}} \leftarrow \mathrm{softmax}(f_\theta(x_t))$

5: $\mathcal{H}_{\mathrm{src}} \leftarrow -\tfrac{1}{\log C} \sum_c p^{(c)}_{\mathrm{src}} \log p^{(c)}_{\mathrm{src}}$ $\triangleright$ normalized source entropy

6: $\mathcal{R}_{\mathrm{src}} \leftarrow \max(0,\, 1 - \mathcal{H}_{\mathrm{src}})$ $\triangleright$ reliability gate, Eq\. \([3](https://arxiv.org/html/2605.14063#S3)\)

8: $c \leftarrow \cos(p_{\mathrm{src}}, p_{\mathrm{exp}})$

9: $w_{\mathrm{raw}} \leftarrow w_{\min} + (1 - w_{\min}) \max(0, c)$

10: $w_{\mathrm{cos}} \leftarrow \mathcal{R}_{\mathrm{src}}\, w_{\mathrm{raw}} + (1 - \mathcal{R}_{\mathrm{src}}) \cdot \mathbf{1}$ $\triangleright$ agreement gate; decays to no\-op when source unreliable

12: $\mathcal{L}_{\mathrm{slr}} \leftarrow \mathrm{SoftLR}(p_{\mathrm{exp}})$ $\triangleright$ ROID base loss

13: $p_{\mathrm{exp}}^{\mathrm{aug}} \leftarrow \mathrm{softmax}(f_\theta(\mathrm{aug}(x_t)))$ $\triangleright$ aug\. on the image

14: $\mathcal{L}_{\mathrm{cons}} \leftarrow \mathrm{SymCE}(p_{\mathrm{exp}}, p_{\mathrm{exp}}^{\mathrm{aug}})$ $\triangleright$ consistency loss

15: $\bar{p}_{\mathrm{exp}} \leftarrow \tfrac{1}{B} \sum_b p_{\mathrm{exp}}^{(b)}$; update $p_{\mathrm{prior}} \leftarrow (1 - \rho)\, p_{\mathrm{prior}} + \rho\, \bar{p}_{\mathrm{exp}}$

16: $\mathcal{L}_{\mathrm{marg}} \leftarrow D_{\mathrm{KL}}(\bar{p}_{\mathrm{exp}} \,\|\, p_{\mathrm{prior}})$

18: $\mathcal{H}_{\mathrm{exp}} \leftarrow \mathrm{entropy}(p_{\mathrm{exp}}) / \log C$

19: $D_{\mathrm{JS}} \leftarrow \mathrm{JS}(p_{\mathrm{src}} \,\|\, p_{\mathrm{exp}})$

20: $\lambda_{\mathrm{eff}} \leftarrow \lambda \cdot \mathcal{R}_{\mathrm{src}} \cdot (1 + \alpha \mathcal{H}_{\mathrm{exp}} + \beta D_{\mathrm{JS}})$

21: $\mathcal{L}_{\mathrm{anch}} \leftarrow \lambda_{\mathrm{eff}}\, \lVert \theta - \theta^* \rVert_2^2$ $\triangleright$ dynamic anchor, Eq\. \([2](https://arxiv.org/html/2605.14063#S3.E2)\)

23: $\mathcal{L} \leftarrow w_{\mathrm{cos}} (\mathcal{L}_{\mathrm{slr}} + \mathcal{L}_{\mathrm{cons}}) + \lambda_{\mathrm{marg}} \mathcal{L}_{\mathrm{marg}} + \mathcal{L}_{\mathrm{anch}}$

24: $\eta_{\mathrm{eff}} \leftarrow \eta \cdot (\eta_{\min} + (1 - \eta_{\min})(1 - \mathcal{H}_{\mathrm{exp}}))$

25: Update $\theta \leftarrow \mathcal{O}.\mathrm{step}(\theta, \nabla_\theta \mathcal{L}; \eta_{\mathrm{eff}})$

27: if $\mathcal{A}.\mathrm{triggers}(\mathcal{H}_{\mathrm{exp}}, \mathrm{trajectory})$ then

28: $\quad \theta \leftarrow \mathcal{A}.\mathrm{reset}(\theta, \theta^*)$ $\triangleright$ ASR adaptive\-scope reset

29: end if

31: $z_{\mathrm{orig}} \leftarrow f_\theta(x_t); \quad z_{\mathrm{flip}} \leftarrow f_\theta(\mathrm{flip}(x_t))$ $\triangleright$ decoupled

32: $\gamma \leftarrow \gamma_{\min} + (\gamma_{\max} - \gamma_{\min}) (\mathrm{entropy}(z_{\mathrm{orig}}) / \log C)$

33: $\hat{y} \leftarrow \mathrm{softmax}\left((1 - \gamma) z_{\mathrm{orig}} + \gamma z_{\mathrm{flip}}\right)$

34: $\hat{y} \leftarrow \mathrm{PriorCorrection}(\hat{y})$ $\triangleright$ ROID post\-hoc

35: return $\hat{y}, \theta$
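The gated quantities in Algorithm 1 (steps 9, 10, 20, 23, 24, and 32) can be traced with scalar stand-ins. This is a minimal sketch under stated assumptions: all inputs are plain floats rather than per-sample tensors, the loss values passed in are placeholders, the function names are hypothetical, and the defaults are the values listed in Table 3.

```python
def gated_scalars(r_src, cos_sim, h_exp, d_js,
                  lam=2.0, alpha=2.0, beta=1.0,
                  w_min=0.5, eta=2.5e-4, eta_min=0.2):
    """Return (w_cos, lambda_eff, eta_eff) for one batch, using scalar inputs."""
    w_raw = w_min + (1.0 - w_min) * max(0.0, cos_sim)         # step 9
    w_cos = r_src * w_raw + (1.0 - r_src) * 1.0               # step 10: no-op at r_src = 0
    lam_eff = lam * r_src * (1.0 + alpha * h_exp + beta * d_js)   # step 20
    eta_eff = eta * (eta_min + (1.0 - eta_min) * (1.0 - h_exp))   # step 24
    return w_cos, lam_eff, eta_eff


def total_loss(w_cos, l_slr, l_cons, l_marg, lam_eff, sq_dist, lam_marg=0.1):
    """Step 23, with sq_dist standing in for ||theta - theta*||^2."""
    return w_cos * (l_slr + l_cons) + lam_marg * l_marg + lam_eff * sq_dist


def flip_weight(h_norm, gamma_min=0.0, gamma_max=0.5):
    """Step 32: uncertain samples lean more on the flipped view."""
    return gamma_min + (gamma_max - gamma_min) * h_norm


# Closed gate (uniform source posterior): the anchor vanishes and w_cos is 1.
print(gated_scalars(r_src=0.0, cos_sim=0.3, h_exp=1.0, d_js=0.5))
# Fully open gate with a confident expert that disagrees with the source.
print(gated_scalars(r_src=1.0, cos_sim=0.0, h_exp=0.0, d_js=0.0))
```

With the gate closed, the objective depends on the source only through terms multiplied by zero, which is the source-agnostic fallback described in the text.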
## Appendix BHyperparameter Details
Table [3](https://arxiv.org/html/2605.14063#A2.T3) summarizes every hyperparameter introduced by RMemSafe\. All values are fixed across the nine benchmark cells of Table [1](https://arxiv.org/html/2605.14063#S4.T1) in the main text; no per\-benchmark tuning is used\.

Table 3: RMemSafe hyperparameters and their chosen values\. Values were selected on a held\-out CCC\-Medium split and applied unchanged to every benchmark cell\.

| Symbol | Value | Role |
| --- | --- | --- |
| $\lambda$ | 2\.0 | base anchor strength |
| $\alpha$ | 2\.0 | entropy\-to\-anchor scale |
| $\beta$ | 1\.0 | divergence\-to\-anchor scale |
| $\lambda_{\mathrm{marg}}$ | 0\.1 | marginal\-calibration weight |
| $\eta_{\min}$ | 0\.2 | confidence\-LR floor |
| $w_{\min}$ | 0\.5 | source\-expert agreement floor |
| $\gamma_{\min}$ | 0\.0 | min\. flip weight \(confident samples\) |
| $\gamma_{\max}$ | 0\.5 | max\. flip weight \(uncertain samples\) |
| $\rho$ | 0\.01 | class\-prior EMA rate |
| *Inherited from ROID \[[16](https://arxiv.org/html/2605.14063#bib.bib16)\]* | | |
| learning rate $\eta$ | $2.5 \times 10^{-4}$ | SGD with momentum 0\.9 |
| batch size $B$ | 64 | test\-time batch size |
| source EMA momentum | 0\.99 | slow source ensembling |
| *Inherited from ASR \[[14](https://arxiv.org/html/2605.14063#bib.bib14)\]* | | |
| $M_0, P_0, A_0, R_0, R_1$ | per\-level | reset\-controller hyperparameters |
## Appendix C Reproducibility Details
#### Hardware.
All runs are executed on NVIDIA A100 GPUs on an internal SLURM cluster. A single benchmark task uses a single A100 (40 GiB) with 16 CPU cores and 64 GiB system RAM. Total compute for the experiments in the main paper is approximately 540 GPU-hours, dominated by the nine-split $\times$ two-architecture CCC evaluation. The controlled source-degradation experiment of §[4.4](https://arxiv.org/html/2605.14063#S4.SS4) adds approximately 5 GPU-hours (36 runs at roughly 7.5 minutes each plus source-calibration binary search).
#### Software.
Python 3.12.3, PyTorch 2.9.1 with CUDA 12.8 and cuDNN 9.10.2. RMemSafe is implemented on top of the public mariodoebler/test-time-adaptation repository [[16](https://arxiv.org/html/2605.14063#bib.bib16)] and the ASR codebase of Lim et al., [[14](https://arxiv.org/html/2605.14063#bib.bib14)].
#### Data.
CCC [[22](https://arxiv.org/html/2605.14063#bib.bib22)]: we evaluate on locally stored WebDataset tar shards equivalent to the data generated by the official CCC streaming pipeline. Our matched-split comparison guarantees that every method in the paper sees the exact same sequence of corruption transitions on every split. As noted in the main text, this produces a $\sim 7$ pp offset relative to the streamed numbers reported in Lim et al., [[14](https://arxiv.org/html/2605.14063#bib.bib14)]; the offset affects all methods equally and does not alter the method ranking. CIFAR-10-C [[8](https://arxiv.org/html/2605.14063#bib.bib8)]: standard public 20-revisit protocol, 10 random corruption orderings. ImageNet-C [[8](https://arxiv.org/html/2605.14063#bib.bib8)]: 20-revisit with the fixed corruption sequence of Lim et al., [[14](https://arxiv.org/html/2605.14063#bib.bib14)].
#### Code.
An anonymized implementation of RMemSafe, including all the scripts used to produce the tables and figures in this paper, will be released upon acceptance. The main method consists of a single $\sim 400$-line file that subclasses ROID and adds the six loss/gate modifications of Algorithm [1](https://arxiv.org/html/2605.14063#alg1).
## Appendix D Full Results with Standard Deviations
Table [4](https://arxiv.org/html/2605.14063#A4.T4) is the complete version of Table [1](https://arxiv.org/html/2605.14063#S4.T1) in the main text, with per-split standard deviations restored. Variances on CCC-Hard are an order of magnitude larger than on Easy/Medium on both architectures; this is a property of the benchmark, not of the adaptation method.

Table 4: Error (%, lower is better) with per-split standard deviation in parentheses. Means are over 9 CCC splits, 10 CIN-C seeds, or a single IN-C 20-revisit run. Bold marks the best method per column. Data-source offset: our local CCC shards yield systematically harder numbers than the streamed data of Lim et al., [[14](https://arxiv.org/html/2605.14063#bib.bib14)] by about 7 pp on CCC-Hard; every baseline and our method share the same shards.

Table 5: Accuracy (%, higher is better) view of Table [1](https://arxiv.org/html/2605.14063#S4.T1). Same nine cells, same matched splits, complementary units convention (the CTTA literature uses both; ASR [[14](https://arxiv.org/html/2605.14063#bib.bib14)] uses accuracy, ROID [[16](https://arxiv.org/html/2605.14063#bib.bib16)] uses error). Each CCC column is a mean over 9 random splits; CIN-C is the mean over 10 seeds. Bold marks the best per column.

CCC mean (across 3 levels). RN-50: RMemSafe+ASR $\mathbf{39.77}$ > ROID+ASR $38.73$ > ROID $37.36$. ViT: RMemSafe+ASR $46.40$ > ROID+ASR $45.92$; plain ROID $48.26$ leads the column. The ViT mean inversion is a property of the reset paradigm itself on CCC-Hard, analyzed in §[4.3](https://arxiv.org/html/2605.14063#S4.SS3).
## Appendix E Statistical Significance
Table [6](https://arxiv.org/html/2605.14063#A5.T6) reports two-sided paired $t$-tests for RMemSafe+ASR against every baseline on every benchmark cell. Each test has $n=9$ (CCC cells) or $n=10$ (CIN-C) matched samples; pooled CCC rows additionally aggregate across the three difficulty levels for a total of $n=27$ paired splits per backbone. The headline improvement over ROID+ASR on CCC-ResNet-50 is significant at $p<10^{-16}$ in the pooled test, and on five of the nine individual cells at $p<0.001$.

Table 6: Paired $t$-test of RMemSafe+ASR versus each baseline on each benchmark cell. $\Delta$ is the mean per-split error difference in percentage points (negative = RMemSafe has lower error). The Shade column records the significance shading of the $p$-value cell: dark green $p<0.001$, medium green $p<0.01$, light green $p<0.05$, light red not significant. The only non-significant rows are on CCC-Hard ViT, where per-split variance ($\pm 9$–$16$ pp) dominates the observed mean differences.

| Benchmark | Baseline | $n$ | $\Delta$ (pp) | $t$ | $p$ | Shade |
| --- | --- | --- | --- | --- | --- | --- |
| CCC-Easy RN50 | ROID | 9 | −2.34 | −27.2 | $<10^{-8}$ | dark green |
| | ROID+RDumb | 9 | −2.51 | −28.1 | $<10^{-8}$ | dark green |
| | ETA+ASR | 9 | −1.43 | −13.2 | $1.0\times10^{-6}$ | dark green |
| | EATA+ASR | 9 | −1.44 | −13.0 | $1.1\times10^{-6}$ | dark green |
| | ROID+ASR | 9 | −1.19 | −57.4 | $<10^{-11}$ | dark green |
| | RMS-noASR | 9 | −1.53 | −13.6 | $8.0\times10^{-7}$ | dark green |
| CCC-Med RN50 | ROID+ASR | 9 | −1.19 | −73.1 | $<10^{-12}$ | dark green |
| | ROID | 9 | −2.62 | −24.8 | $<10^{-8}$ | dark green |
| | ROID+RDumb | 9 | −2.91 | −21.5 | $<10^{-7}$ | dark green |
| | ETA+ASR | 9 | −1.96 | −9.4 | $1.4\times10^{-5}$ | dark green |
| | EATA+ASR | 9 | −2.01 | −9.5 | $1.2\times10^{-5}$ | dark green |
| | RMS-noASR | 9 | −2.11 | −9.8 | $1.0\times10^{-5}$ | dark green |
| CCC-Hard RN50 | ROID+ASR | 9 | −0.76 | −6.54 | $1.8\times10^{-4}$ | dark green |
| | ROID | 9 | −2.27 | −5.88 | $3.7\times10^{-4}$ | dark green |
| | ROID+RDumb | 9 | −2.96 | −4.60 | $1.8\times10^{-3}$ | medium green |
| | ETA+ASR | 9 | −5.94 | −3.97 | $4.1\times10^{-3}$ | medium green |
| | EATA+ASR | 9 | −5.08 | −4.39 | $2.3\times10^{-3}$ | medium green |
| | RMS-noASR | 9 | −2.80 | −4.30 | $2.6\times10^{-3}$ | medium green |
| CIN-C iid | ROID+ASR | 10 | −0.88 | −40.9 | $<10^{-11}$ | dark green |
| | EATA+ASR | 10 | −3.11 | −112.0 | $<10^{-14}$ | dark green |
| | ETA+ASR | 10 | −3.13 | −90.7 | $<10^{-13}$ | dark green |
| | ROID+RDumb | 10 | −2.37 | −35.4 | $<10^{-10}$ | dark green |
| | ROID | 10 | −1.80 | −31.4 | $<10^{-9}$ | dark green |
| | RMS-noASR | 10 | −2.05 | −33.7 | $<10^{-10}$ | dark green |
| CIN-C corr | ROID+ASR | 10 | −0.80 | −51.2 | $<10^{-11}$ | dark green |
| | EATA+ASR | 10 | −3.09 | −92.2 | $<10^{-13}$ | dark green |
| | ETA+ASR | 10 | −3.10 | −83.5 | $<10^{-13}$ | dark green |
| | ROID+RDumb | 10 | −2.38 | −40.0 | $<10^{-10}$ | dark green |
| | ROID | 10 | −1.82 | −34.6 | $<10^{-10}$ | dark green |
| | RMS-noASR | 10 | −2.00 | −40.2 | $<10^{-10}$ | dark green |
| CCC-Easy ViT | ROID+ASR | 9 | −0.51 | −1.24 | $0.25$ | light red |
| | ROID | 9 | −0.59 | −4.79 | $1.4\times10^{-3}$ | medium green |
| | ROID+RDumb | 9 | −0.84 | −5.18 | $8.4\times10^{-4}$ | dark green |
| | ETA/EATA+ASR | 9 | −5.21 | −8.98 | $1.9\times10^{-5}$ | dark green |
| | RMS-noASR | 9 | −1.40 | −7.11 | $1.0\times10^{-4}$ | dark green |
| CCC-Med ViT | ROID+ASR | 9 | −0.06 | −2.51 | $0.036$ | light green |
| | ROID | 9 | −1.04 | −6.63 | $1.6\times10^{-4}$ | dark green |
| | RMS-noASR | 9 | −2.21 | −11.7 | $2.5\times10^{-6}$ | dark green |
| CCC-Hard ViT | ROID+ASR | 9 | −0.86 | −1.42 | $0.19$ | light red |
| | ROID | 9 | +7.20 | +2.29 | $0.052$ | light red |
| | ETA+ASR | 9 | +3.07 | +1.18 | $0.27$ | light red |
| | EATA+ASR | 9 | +2.60 | +0.99 | $0.35$ | light red |
| CCC RN50 pooled | ROID+ASR | 27 | **−1.05** | −18.97 | $\mathbf{<10^{-16}}$ | dark green |
| | ROID | 27 | −2.41 | −17.94 | $<10^{-15}$ | dark green |
| | RMS-noASR | 27 | −2.15 | −8.77 | $3.0\times10^{-9}$ | dark green |
| CCC ViT pooled | ROID+ASR | 27 | −0.48 | −1.96 | $0.060$ | light red |
| | EATA+ASR | 27 | −2.30 | −2.08 | $0.047$ | light green |

#### ViT-Hard variance note.
The non-significant rows in Table [6](https://arxiv.org/html/2605.14063#A5.T6) are all on CCC-Hard ViT, a regime in which per-split error ranges from roughly $58\%$ to $100\%$ across only 9 splits. At this sample size, mean differences below $\sim 10$ pp cannot be reliably distinguished from zero even when consistent. In particular, the apparent advantage of plain ROID over RMemSafe+ASR on this cell ($\Delta=+7.20$ pp) yields $p=0.052$ by paired $t$-test and therefore should be reported as *not statistically distinguishable* from our method at this sample size, rather than as a confirmed regression. The qualitative interpretation of this cell (reset-paradigm failure) is given in §[4.3](https://arxiv.org/html/2605.14063#S4.SS3).
#### Effect sizes and 95% confidence intervals.
Table [7](https://arxiv.org/html/2605.14063#A5.T7) reports the same paired comparisons against ROID+ASR with two additional columns: a $95\%$ confidence interval on the per-split mean difference $\Delta$ (paired $t$-interval, $n-1$ degrees of freedom), and the standardized paired effect size $d_{z}=\bar{d}/s_{d}$. The $\Delta$ values, intervals, and effect sizes on the four CCC ResNet-50 and two CIN-C cells are uniformly large ($|d_{z}|\geq 2$), with intervals that exclude zero by a wide margin. The two ROID-ResNet variants (CCC-Easy, CCC-Med) yield extraordinarily tight intervals ($\pm 0.05$ pp at $95\%$) because per-split deltas are nearly constant across the nine matched splits. ViT cells, in contrast, are dominated by per-split variance and have intervals that straddle zero on Easy, Hard, and the pooled comparison; this is the same phenomenon as the $p=0.052$ note above.

Table 7: Effect sizes and $95\%$ paired $t$-intervals on $\Delta=\text{RMemSafe+ASR}-\text{ROID+ASR}$. $d_{z}=\bar{d}/s_{d}$ is the standardised paired effect size; by Cohen's convention $|d_{z}|>0.8$ is large, $|d_{z}|>1.5$ is very large.
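The link between the $t$ statistics of Table 6 and the effect sizes and intervals described here can be checked from summary statistics alone. The sketch below relies on the paired-test identities $t=d_{z}\sqrt{n}$ and half-width $=t_{\mathrm{crit}}\,s_{d}/\sqrt{n}$; the function name is hypothetical, and 2.306 is the standard two-sided 95% Student-$t$ quantile for 8 degrees of freedom. The inputs are taken from the CCC-Easy RN50 versus ROID+ASR row of Table 6.

```python
import math

T_CRIT_95_DF8 = 2.306  # two-sided 95% Student-t quantile, 8 degrees of freedom

def paired_summary(mean_diff, t_stat, n, t_crit):
    """Recover d_z, s_d and the CI half-width from one paired t-test row."""
    d_z = t_stat / math.sqrt(n)            # standardized paired effect size
    s_d = mean_diff / d_z                  # std. dev. of per-split differences
    half_width = t_crit * s_d / math.sqrt(n)
    return d_z, s_d, half_width

# CCC-Easy RN50, RMemSafe+ASR vs ROID+ASR (Table 6): delta=-1.19 pp, t=-57.4, n=9
d_z, s_d, half_width = paired_summary(-1.19, -57.4, 9, T_CRIT_95_DF8)
print(f"d_z={d_z:.1f}, s_d={s_d:.3f} pp, CI half-width={half_width:.3f} pp")
```

The recovered half-width is close to 0.05 pp, in line with the tight intervals noted for the CCC-Easy and CCC-Med ResNet-50 cells.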
## Appendix F Per-Split Variance Analysis
Figure [3](https://arxiv.org/html/2605.14063#S4.F3) in the main text plots per-split ROID+ASR error against RMemSafe+ASR error on all 54 CCC splits. The diagonal shift below $y=x$ is systematic: RMemSafe+ASR reduces error on 51 of the 54 splits, is essentially on the diagonal on 2, and is 0.17 pp above it on 1. All three near-diagonal points are CCC-Hard ViT runs in which both methods collapse to near-chance accuracy ($\sim 98\%$ error). Table [8](https://arxiv.org/html/2605.14063#A6.T8) summarizes the per-split distribution of the paired differences that underlies this figure.

Table 8: Distribution of per-split paired differences RMemSafe+ASR − ROID+ASR on CCC. Negative values indicate RMemSafe has lower error. Splits with $|\Delta|\leq 0.02$ pp (the precision of the reported per-split numbers) are considered ties for this summary. The single loss is a 0.17 pp regression on one CCC-Hard ViT split, where both methods are at $\sim 98\%$ error (effectively chance).
## Appendix G Compute Cost: Analytic Bound
RMemSafe+ASR adds one additional forward pass through the frozen source model (to compute the reliability gate, cosine agreement, and Jensen–Shannon divergence) and one additional forward pass through the expert on the horizontally flipped input (for the decoupled inference prediction). Both are performed under `torch.no_grad` and neither retains activations.

A ROID forward-backward pass has cost roughly $3C$, where $C$ is the forward-pass FLOP count of the backbone (one forward, plus one backward $\approx 2$ forwards). RMemSafe adds one frozen-source forward ($+C$, no grad) and one expert flip forward ($+C$, no grad), yielding approximately $5C$ per batch, that is, about $5/3\approx 1.67\times$ the compute of ROID. Memory overhead is smaller because neither added forward retains activations: the incremental peak is the one activation-tensor snapshot needed for the source softmax, on the order of $B\times C$ floats for a $B$-batch, $C$-class problem (in this product $C$ denotes the number of classes, not the FLOP count), which is negligible against the ResNet-50 activation footprint. In practice, the two extra no-grad forwards constitute the entire measurable overhead, and it scales linearly with batch size.
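The cost accounting above reduces to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, costs are expressed in units of one backbone forward pass, and the 1000-class figure used for the memory estimate is an assumption (an ImageNet-scale label space) rather than a value stated in this appendix. The batch size of 64 comes from Table 3.

```python
def relative_compute(backward_in_forwards=2.0, extra_nograd_forwards=2):
    """Return (ROID cost, RMemSafe cost, ratio) in units of one forward pass."""
    baseline = 1.0 + backward_in_forwards        # one forward + one backward
    total = baseline + extra_nograd_forwards     # + source and flip forwards
    return baseline, total, total / baseline

def source_softmax_bytes(batch_size, num_classes, bytes_per_float=4):
    """Size of the one extra B x C posterior snapshot kept for the gate."""
    return batch_size * num_classes * bytes_per_float

baseline, total, ratio = relative_compute()
print(f"ROID ~{baseline:.0f}C, RMemSafe ~{total:.0f}C, ratio {ratio:.2f}x")

# Assumption: 1000 classes; batch size 64 as in Table 3; float32 storage.
print(source_softmax_bytes(64, 1000), "bytes for the source posterior")
```

Under these assumptions the extra posterior snapshot is a few hundred kilobytes, which is why the memory overhead is described as negligible next to backbone activations.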
## Appendix H Hyperparameter Sensitivity
Figure [5](https://arxiv.org/html/2605.14063#A8.F5) sweeps each of the five RMemSafe hyperparameters independently while holding the other four at their paper values. Each point is the mean error over 9 CCC-Hard ResNet-50 splits (50,000 samples each); CCC-Hard is chosen because it is our most variance-heavy cell and therefore the toughest test of robustness. Every sweep is flat to within 0.21 pp across the full range, *including* $16\times$ changes in $\lambda$ and $\alpha$ and $100\times$ changes in $\lambda_{\mathrm{marg}}$. The insensitivity of the three anchor-related parameters ($\lambda,\alpha,\beta$) is expected: on CCC-Hard the runtime reliability $\mathcal{R}_{\mathrm{src}}$ stabilizes around $0.26$ (Figure [6](https://arxiv.org/html/2605.14063#A11.F6)), shrinking their effective contribution.

[Figure 5 panels, each plotting CCC-Hard error (%) against one hyperparameter: anchor $\lambda$ (0.5, 1, 2, 4, 8); entropy scale $\alpha$ (0.5, 1, 2, 4, 8); divergence scale $\beta$ (0, 0.5, 1, 2, 4); marginal weight $\lambda_{\mathrm{marg}}$ (0.01, 0.1, 1); confidence-LR floor $\eta_{\min}$ (0.1, 0.2, 0.3, 0.5, 0.8). Range summary: the maximum spread across any full sweep is $\mathbf{0.21}$ pp; a dashed line marks the paper value. Legend: gated by $\mathcal{R}_{\mathrm{src}}$; not gated.]

Figure 5: Hyperparameter sensitivity of RMemSafe+ASR on CCC-Hard ResNet-50 (9 splits, 50,000 samples/split). Each panel varies a single parameter while holding the others at their values reported in the paper (Table [3](https://arxiv.org/html/2605.14063#A2.T3)). The total y-axis range is 0.6 pp across all panels; the observed spread within each sweep is at most 0.21 pp. Red points ($\lambda,\alpha,\beta$) are *gated* by the runtime source reliability $\mathcal{R}_{\mathrm{src}}$ (cf. Figure [6](https://arxiv.org/html/2605.14063#A11.F6)); violet points ($\lambda_{\mathrm{marg}},\eta_{\min}$) are not gated but still produce flat responses, confirming that the method does not require per-benchmark tuning.
## Appendix I Cumulative-Add Component Ablation
Table [9](https://arxiv.org/html/2605.14063#A9.T9) reports a complementary view of the component analysis in Figure [4](https://arxiv.org/html/2605.14063#S4.F4) of the main text. Whereas Figure [4](https://arxiv.org/html/2605.14063#S4.F4) measures *leave-one-out* effects (remove one component while keeping the others active), Table [9](https://arxiv.org/html/2605.14063#A9.T9) measures *cumulative-add* effects (start from the ROID+ASR baseline and switch on RMemSafe components one at a time). The two views measure different quantities and need not agree numerically; together, they bound the marginal contribution of each component.

The cumulative-add story is consistent with the leave-one-out story in the main text: the decoupled flip produces the largest unconditional gain ($+0.89$ pp on the CCC mean; the leave-one-out estimate in Figure [4](https://arxiv.org/html/2605.14063#S4.F4) is $+0.79$ pp on the same splits, with the $0.10$ pp gap reflecting interaction with the other components). Of the remaining four contributions, the two gated by $\mathcal{R}_{\mathrm{src}}$ (anchor, source-expert agreement) contribute small or zero effects on top of the baseline because $\bar{\mathcal{R}}_{\mathrm{src}}\approx 0.81$ averaged across the three CCC levels attenuates them at runtime; the two ungated contributions (marg. calibration, confidence-scaled LR) add small amounts. This is the intended graceful-decay behavior: under the CCC distribution, the gated terms are partially attenuated at runtime, so their unconditional contribution is by design small.

Table 9: Cumulative-add ablation on CCC ResNet-50, accuracy (%) averaged over 27 splits (3 levels $\times$ 9 random seeds). Each ✓ enables one RMemSafe component on top of ROID+ASR; the bottom row matches the configuration reported in Table [1](https://arxiv.org/html/2605.14063#S4.T1). Anchor: divergence-aware dynamic anchor $\mathcal{L}_{\mathrm{anch}}$. Marg.: marginal-calibration KL. Agree.: source-expert cosine agreement gating. Conf. LR: confidence-scaled learning rate. Decoupled flip: inference-time confidence-interpolated flip prediction.
## Appendix J Matched-Split and Controlled-Degradation Contributions
This appendix clarifies how to read the component decomposition. The 1.05 pp matched-split improvement of RMemSafe+ASR over ROID+ASR (CCC ResNet-50 mean, Table [1](https://arxiv.org/html/2605.14063#S4.T1)) is decomposed by cumulative adds in Appendix [I](https://arxiv.org/html/2605.14063#A9) and by leave-one-out ablations in Figure [4](https://arxiv.org/html/2605.14063#S4.F4). Those matched-split effects measure which parts carry the standard CCC benchmark gain. They are distinct from the controlled-degradation contribution in Appendix [M](https://arxiv.org/html/2605.14063#A13), where the gate's load-bearing role is measured by varying source quality directly. The two quantities answer different questions and should not be compared as if they were the same ablation.

Equation [6](https://arxiv.org/html/2605.14063#S4.E6) reports cumulative-add effects starting from ROID+ASR and switching on RMemSafe components one at a time; Figure [4](https://arxiv.org/html/2605.14063#S4.F4) gives the complementary leave-one-out view.
## Appendix K Reliability Trace
Figure [6](https://arxiv.org/html/2605.14063#A11.F6) visualizes the source reliability $\mathcal{R}_{\mathrm{src}}$ and the gated Jensen–Shannon divergence $D_{\mathrm{JS}}^{\mathrm{gated}}=\mathcal{R}_{\mathrm{src}}\,D_{\mathrm{JS}}$ over the course of a single CCC-Hard split (3,128 batches, RN-50, split 3, paper hyperparameters). Contrary to a naive reading of “$\mathcal{R}_{\mathrm{src}}\to 0$”, the reliability does *not* collapse to zero on CCC-Hard; instead it stabilizes between $0.18$ and $0.37$ with mean $\mathbf{0.26}$. This corresponds to the frozen source producing a broad, low-confidence posterior (normalized entropy $\mathcal{H}_{\mathrm{src}}\approx 0.74$, top-1 accuracy $\approx 1\%$) rather than a maximally uniform one. The practical consequence is that $\lambda_{\mathrm{eff}}$ is *scaled down* by roughly $3.8\times$ relative to its configured value, not eliminated. Combined with the observed divergence signal $D_{\mathrm{JS}}^{\mathrm{gated}}\approx 0.10$, the anchor contributes a small but non-zero pull toward $\theta^{\ast}$ on every step. This nuance is consistent with the sensitivity results in Figure [5](https://arxiv.org/html/2605.14063#A8.F5): the $16\times$ $\lambda$ sweep moves the error by only 0.21 pp because each $\lambda$ value is multiplied by an approximately constant factor of $0.26$.

[Figure 6 plot: x-axis, test batches (0 to 3,000); left axis, $\mathcal{R}_{\mathrm{src}}$ (windowed), range 0 to 0.4, annotated mean $=0.263$; right axis, $\mathcal{R}_{\mathrm{src}}\cdot D_{\mathrm{JS}}$ (windowed), range 0 to 0.15.]

Figure 6: Source reliability $\mathcal{R}_{\mathrm{src}}$ (blue, left axis) and gated Jensen–Shannon divergence $\mathcal{R}_{\mathrm{src}}\,D_{\mathrm{JS}}$ (red, right axis) over a single CCC-Hard ResNet-50 split (split 3, 3,128 test batches). Traces are smoothed using a 25-batch running mean. The reliability stays near a low floor of $0.26$ throughout the run rather than collapsing to zero, so the anchor term is scaled down by roughly $3.8\times$ but is not deactivated. This predicts, and agrees with, the flat $\lambda$-sensitivity observed in Figure [5](https://arxiv.org/html/2605.14063#A8.F5).
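The relationship between source entropy, reliability, and the $3.8\times$ scale-down can be illustrated with a short sketch. It assumes the gate takes the simple form $\mathcal{R}_{\mathrm{src}}=1-\mathcal{H}_{\mathrm{src}}$, which is consistent with the reported pair $\mathcal{H}_{\mathrm{src}}\approx 0.74$, $\mathcal{R}_{\mathrm{src}}\approx 0.26$ but is not the definition reproduced from the main text; the function names and example posteriors are hypothetical.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a posterior divided by log C (range [0, 1])."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def source_reliability(probs):
    """Assumed gate: reliability = 1 - normalized source entropy."""
    return 1.0 - normalized_entropy(probs)

def anchor_scale_down(reliability):
    """Factor by which a reliability-multiplied anchor weight shrinks."""
    return 1.0 / reliability

uniform = [0.25] * 4                 # maximally uncertain source posterior
peaked = [0.97, 0.01, 0.01, 0.01]    # confident source posterior
print(source_reliability(uniform), source_reliability(peaked))
print(round(anchor_scale_down(0.26), 2))
```

A uniform posterior drives the assumed gate to zero, a peaked one leaves it near one, and the CCC-Hard mean of 0.26 corresponds to dividing the configured anchor weight by a little under four.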
## Appendix L Controlled Source-Degradation Protocol
This appendix documents the source-degradation experiment reported in §[4.4](https://arxiv.org/html/2605.14063#S4.SS4). The protocol probes the regime that Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1) targets: source degradation in which the source becomes progressively more *uniformly confused* (high predictive entropy) as severity increases.
#### Source-degradation procedure.
Starting from a publicly available WRN-28-10 model trained on clean CIFAR-10 (clean-test accuracy $94.77\%$), we produce degraded source checkpoints by injecting Gaussian noise into the model weights. For each parameter tensor $\theta_{k}$ with per-tensor standard deviation $\sigma_{k}$, we set

$$\theta_{k}\;\leftarrow\;\theta_{k}+\epsilon\cdot\sigma_{k}\cdot\mathcal{N}(0,I),$$

where $\epsilon$ is a severity scalar calibrated via binary search so that the resulting source has clean-test accuracy within $[S_{\mathrm{target}}-0.02,\;S_{\mathrm{target}}+0.02]$ for each target $S_{\mathrm{target}}\in\{0.75,0.30,0.12\}$. Noise is applied only to convolutional weights and batch-normalization affine parameters; biases and BN running statistics are left untouched. Table [10](https://arxiv.org/html/2605.14063#A12.T10) records the actual achieved accuracy and $\epsilon$ for each source variant. The $S=0.12$ variant landed at the lower edge of the tolerance window (achieved $0.100$ against the $[0.10,0.14]$ band) when the binary search converged at a coarse $\epsilon$ step; results on this endpoint therefore correspond to a slightly more degraded source than the other two targets.

Table 10: Mechanism-N source-variant calibration. Binary search over $\epsilon$ terminates when clean-test accuracy lands within $\pm 2$ pp of the target.
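The calibration loop described above can be sketched without a deep-learning framework. The code below is a toy illustration, not the authors' script: `add_weight_noise` treats one tensor as a flat list and uses the population standard deviation (an assumption about how $\sigma_{k}$ is computed), `calibrate_epsilon` assumes accuracy is non-increasing in $\epsilon$, and `toy_accuracy` is a hypothetical stand-in for evaluating a noised model on clean test data.

```python
import random

def add_weight_noise(weights, epsilon, rng):
    """theta_k <- theta_k + epsilon * sigma_k * N(0, I) for one flat tensor."""
    mean = sum(weights) / len(weights)
    sigma = (sum((w - mean) ** 2 for w in weights) / len(weights)) ** 0.5
    return [w + epsilon * sigma * rng.gauss(0.0, 1.0) for w in weights]

def calibrate_epsilon(accuracy_at, target, tol=0.02, lo=0.0, hi=4.0, max_iter=50):
    """Binary-search epsilon until accuracy_at(epsilon) is within tol of target.

    Assumes accuracy_at is non-increasing in epsilon on [lo, hi].
    """
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        acc = accuracy_at(mid)
        if abs(acc - target) <= tol:
            return mid, acc
        if acc > target:
            lo = mid    # source still too accurate: add more noise
        else:
            hi = mid    # source too degraded: back off
    raise RuntimeError("binary search did not converge")

def toy_accuracy(epsilon):
    """Hypothetical accuracy curve: 94.77% clean, decaying toward 10% chance."""
    return 0.10 + 0.8477 / (1.0 + epsilon ** 4)

for target in (0.75, 0.30, 0.12):
    eps, acc = calibrate_epsilon(toy_accuracy, target)
    print(f"target={target:.2f} epsilon={eps:.3f} achieved={acc:.3f}")
```

Because the search stops as soon as the accuracy enters the tolerance band, the achieved accuracy can sit anywhere inside it, including at an edge, which mirrors the behavior noted for the $S=0.12$ variant.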
#### Evaluation.
We run ROID+ASR and RMemSafe+ASR on the standard 20-revisit CIN-C protocol (CIFAR-10-C with 15 corruptions cycled 20 times), once under the standard i.i.d. ordering and once under the correlated (Dirichlet $\alpha=0.1$) ordering, for 3 random seeds per configuration. This produces 36 runs total: $3\,(S\text{-levels})\times 2\,(\text{methods})\times 2\,(\text{orderings})\times 3\,(\text{seeds})$. Within-cell standard deviations are $0.07$–$0.60$ pp; the signal is not noise.
#### Per-configuration results.
Table [11](https://arxiv.org/html/2605.14063#A12.T11) reports the per-cell paired comparison with two-sided $t$-test $p$-values; Table [12](https://arxiv.org/html/2605.14063#A12.T12) provides the underlying per-seed values. The main-text Figure [3](https://arxiv.org/html/2605.14063#S4.F3) visualizes the means.

Table 11: Source-degradation pilot under Mechanism N (Gaussian weight noise). Errors in %. $\Delta$ is the paired difference RMemSafe+ASR minus ROID+ASR; negative means RMemSafe is better. Bold rows are significant at $p<0.05$ by paired $t$-test ($n=3$ seeds per cell). RMemSafe wins or ties every cell, and the gap widens monotonically as source quality degrades in both stream orderings.

Table 12: Per-configuration CIN-C error (%) for Mechanism-N sources. Seeds are fixed across methods and source variants for matched-split comparison. Columns s0/s1/s2 are individual seed errors; mean and std are across the three seeds.
#### Harm-slope computation.
For each method $m$ we compute the harm slope

$$H_{m}\;=\;\frac{\mathrm{err}_{m}(S=0.12)-\mathrm{err}_{m}(S=0.75)}{0.75-0.12},$$

averaged across seeds and orderings. $H_{\mathrm{ROID+ASR}}=12.92$ pp per unit of $S$. $H_{\mathrm{RMemSafe+ASR}}=11.43$ pp per unit of $S$. The ratio $H_{\mathrm{ROID+ASR}}/H_{\mathrm{RMemSafe+ASR}}=1.13$ points in the direction predicted by Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1) (RMemSafe degrades more slowly than ROID+ASR along this axis).
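As a quick check of the arithmetic, the harm slope and its ratio can be written out directly. The helper names below are hypothetical; the only inputs are the two reported slopes and the two endpoint source accuracies, and the implied error increases are derived from those slopes rather than read from the per-seed tables.

```python
S_GOOD, S_BAD = 0.75, 0.12  # endpoint source accuracies of the sweep

def harm_slope(err_at_bad, err_at_good, s_good=S_GOOD, s_bad=S_BAD):
    """Error increase (pp) per unit drop in source accuracy S."""
    return (err_at_bad - err_at_good) / (s_good - s_bad)

def implied_error_increase(slope, s_good=S_GOOD, s_bad=S_BAD):
    """Invert harm_slope: total error increase (pp) between the endpoints."""
    return slope * (s_good - s_bad)

h_roid, h_rmem = 12.92, 11.43  # reported slopes, pp per unit of S
print(round(h_roid / h_rmem, 2))
print(round(implied_error_increase(h_roid), 2))
print(round(implied_error_increase(h_rmem), 2))
```

Over the 0.63-wide span of $S$, the reported slopes correspond to error increases of roughly 8.1 pp for ROID+ASR and 7.2 pp for RMemSafe+ASR, which is the gap the $1.13\times$ ratio summarizes.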
#### Reliability-gate behavior.
For every RMemSafe run we log the per-batch reliability $\mathcal{R}_{\mathrm{src}}$ and report the mean across each run. Table [13](https://arxiv.org/html/2605.14063#A12.T13) summarizes the measured values across all 18 RMemSafe+ASR runs in the Mechanism-N sweep. $\mathcal{R}_{\mathrm{src}}$ drifts monotonically with $S$ under the i.i.d. ordering (from $0.90$ at $S=0.75$ to $0.82$ at $S=0.12$) and also under the correlated ordering (from $0.77$ at $S=0.75$ to $0.74$ at $S=0.12$). The drift is modest in absolute terms because Gaussian weight noise does not produce the posterior-uniform limit that saturates the gate: the noised sources retain a recognizable class-preference structure even at $S=0.12$, so their predictive entropy remains well below the normalized maximum. The monotone direction and the ordering-level gap ($\sim 0.1$ lower $\mathcal{R}_{\mathrm{src}}$ under correlated streams, which produce higher expert-side noise and mildly raise measured source entropy) are both consistent with the design of ([3](https://arxiv.org/html/2605.14063#S3)) and with the harm-slope separation observed in Figure [3](https://arxiv.org/html/2605.14063#S4.F3).

Table 13: Mean $\mathcal{R}_{\mathrm{src}}$ over the stream for each Mechanism-N RMemSafe+ASR configuration (mean across 3 seeds).
## Appendix MUnderstanding the harm\-slope ratio
The controlled experiment of §[4\.4](https://arxiv.org/html/2605.14063#S4.SS4)yields a1\.13×1\.13\\\!\\timesharm\-slope ratio in the direction predicted by Proposition[1](https://arxiv.org/html/2605.14063#Thmproposition1):HROID\+ASR=12\.92H\_\{\\mathrm\{ROID\+ASR\}\}=12\.92versusHRMemSafe\+ASR=11\.43H\_\{\\textsc\{RMemSafe\}\+\\mathrm\{ASR\}\}=11\.43pp per unit of source accuracySS\. A reader can ask a sharper question than “does the gate help?”: they can ask why the ratio comes out to1\.13×1\.13\\\!\\timesrather than, say,1\.0×1\.0\\\!\\times\(the gate doing nothing\) or2\.0×2\.0\\\!\\times\(the gate dominating\)\. A back\-of\-the\-envelope decomposition shows that the observed value matches the design’s prediction\.
Caveat\.The two equations assume the non\-gated harm contributionHotherH\_\{\\mathrm\{other\}\}is approximately the same for both methods; strictly, this is a first\-order approximation becauseRMemSafeadds ungated stabilizers \(marg\. calibration, confidence\-scaled LR, decoupled flip\) absent from ROID\+ASR\. If those stabilizers contribute an additiveδHstab≈0\\delta H\_\{\\mathrm\{stab\}\}\\\!\\approx\\\!0to the harm slope \(consistent with the leave\-one\-out ablation and the flat sensitivity sweep\), the recoveredHanchor≈8\.0,Hother≈4\.9H\_\{\\mathrm\{anchor\}\}\\\!\\approx\\\!8\.0,H\_\{\\mathrm\{other\}\}\\\!\\approx\\\!4\.9are an unbiased back\-of\-envelope estimate; otherwise they should be read as the gated and combined non\-gated\+\{\+\}stabilizer contributions\. Either reading preserves the qualitative claim that the observed1\.13×1\.13\{\\times\}ratio sits below the analytic upper bound≈2\.6×\\approx\\\!2\.6\{\\times\}becauseℛ¯src≈0\.81\\bar\{\\mathcal\{R\}\}\_\{\\mathrm\{src\}\}\\\!\\approx\\\!0\.81, not0\.
Decomposition\.The total harm slope of either method againstSShas two components: a contribution from the source\-anchored terms, which is gated byℛsrc\\mathcal\{R\}\_\{\\mathrm\{src\}\}inRMemSafebut not in ROID\+ASR, and a contribution from non\-gated terms \(the ROID base losses, marginal calibration, the decoupled flip\), which is identical for both methods\. WritingHanchorH\_\{\\mathrm\{anchor\}\}for the anchored contribution andHotherH\_\{\\mathrm\{other\}\}for the non\-gated contribution, and using the empirical mean reliabilityℛ¯src≈0\.81\\bar\{\\mathcal\{R\}\}\_\{\\mathrm\{src\}\}\\approx 0\.81across the testedSSrange and both orderings \(Table[13](https://arxiv.org/html/2605.14063#A12.T13)\), we have approximately
HROID\+ASR\\displaystyle H\_\{\\mathrm\{ROID\+ASR\}\}≈Hanchor\+Hother,\\displaystyle\\approx H\_\{\\mathrm\{anchor\}\}\+H\_\{\\mathrm\{other\}\},HRMemSafe\+ASR\\displaystyle H\_\{\\textsc\{RMemSafe\}\+\\mathrm\{ASR\}\}≈ℛ¯src⋅Hanchor\+Hother\.\\displaystyle\\approx\\bar\{\\mathcal\{R\}\}\_\{\\mathrm\{src\}\}\\cdot H\_\{\\mathrm\{anchor\}\}\+H\_\{\\mathrm\{other\}\}\.Solving the two equations with the observed slopes givesHanchor≈8\.0H\_\{\\mathrm\{anchor\}\}\\approx 8\.0pp/unitSS\(∼62%\\sim\\\!62\\%of ROID\+ASR’s total harm slope\) andHother≈4\.9H\_\{\\mathrm\{other\}\}\\approx 4\.9pp/unitSS\(∼38%\\sim\\\!38\\%\)\. Substituting back yields a predictedHRMemSafe\+ASR≈11\.4H\_\{\\textsc\{RMemSafe\}\+\\mathrm\{ASR\}\}\\approx 11\.4pp/unitSS, matching the observed11\.4311\.43to two significant figures\. The agreement is not free: it requires the gate to track source quality monotonically acrossSS, as we observe\.
Why $1.13\times$ and not larger. The decomposition explains the upper bound on the gate's effect. With $\bar{\mathcal{R}}_{\mathrm{src}} \approx 0.81$ rather than $0$, the gate is only partially closed, so the anchored contribution is attenuated but not eliminated: Gaussian weight noise produces high-but-not-saturating source entropy ($\mathcal{H}_{\mathrm{src}}$ does not approach $1$). Under regimes that saturate the gate, the entire $H_{\mathrm{anchor}}$ contribution would be eliminated and the harm-slope ratio would approach $H_{\mathrm{ROID+ASR}} / H_{\mathrm{other}} \approx 2.6\times$. The $1.13\times$ ratio observed here therefore reflects the regime our experiment probes: moderate, not catastrophic, source degradation. Catastrophic regimes (CCC-Hard, $\mathcal{R}_{\mathrm{src}} \approx 0.26$) sit closer to that upper bound and are where the matched-split CCC gains in §[4.2](https://arxiv.org/html/2605.14063#S4.SS2) originate.
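The arithmetic behind the decomposition can be checked with a short sketch. It plugs the rounded values quoted in the text ($H_{\mathrm{anchor}} \approx 8.0$, $H_{\mathrm{other}} \approx 4.9$, $\bar{\mathcal{R}}_{\mathrm{src}} \approx 0.81$) into the two-term model; the function and variable names are illustrative and not taken from the authors' code. Because the inputs are rounded, the outputs agree with the reported figures only to the precision quoted.

```python
def harm_slopes(h_anchor, h_other, r_src):
    """Return (ungated, gated) harm slopes under the two-term decomposition."""
    ungated = h_anchor + h_other        # ROID+ASR: anchor applied at full strength
    gated = r_src * h_anchor + h_other  # RMemSafe+ASR: anchor scaled by mean reliability
    return ungated, gated


# Rounded values quoted in the text (pp per unit S); not re-derived from raw runs.
H_ANCHOR, H_OTHER, R_SRC = 8.0, 4.9, 0.81

ungated, gated = harm_slopes(H_ANCHOR, H_OTHER, R_SRC)
ratio = ungated / gated            # harm-slope ratio at the observed mean reliability
upper_bound = ungated / H_OTHER    # limit when the gate is fully closed (r_src = 0)
anchor_share = H_ANCHOR / ungated  # fraction of the ungated slope due to anchoring

print(f"ungated slope : {ungated:.2f} pp/unit S")
print(f"gated slope   : {gated:.2f} pp/unit S")
print(f"ratio         : {ratio:.2f}x")
print(f"upper bound   : {upper_bound:.1f}x")
print(f"anchor share  : {anchor_share:.0%}")
```

Setting `r_src` to 0 in `harm_slopes` reproduces the saturated-gate limit directly, which is one way to see why the upper bound depends only on the ratio of the total slope to $H_{\mathrm{other}}$.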
## Appendix N Reliability-Signal Validation
A natural concern with $\mathcal{R}_{\mathrm{src}}$ as a reliability proxy is whether it actually tracks adaptation outcomes or merely scales the optimization in a way that is benign. We answer this empirically using the controlled source-degradation runs of §[4.4](https://arxiv.org/html/2605.14063#S4.SS4): for each of three target source-quality levels $S \in \{0.75, 0.30, 0.12\}$ and two domain orderings (i.i.d. and correlated), we record the per-run average $\mathcal{R}_{\mathrm{src}}$ alongside the final adaptation error of RMemSafe+ASR ($n = 18$ runs total). Table [14](https://arxiv.org/html/2605.14063#A14.T14) reports linear, rank, and AUC-based agreement between the two.
Table 14: Validation of $\mathcal{R}_{\mathrm{src}}$ as a reliability proxy on the controlled source-degradation runs ($n = 18$). Each run contributes a paired observation $(\bar{\mathcal{R}}_{\mathrm{src}}, \text{final error})$. The reliability scalar separates successful from failed adaptation runs perfectly (AUC $= 1.00$) and ranks runs by adaptation accuracy almost perfectly (Spearman $\rho = 0.99$).

Two observations are worth flagging. First, the high correlation with adaptation accuracy ($\rho = 0.989$) is much sharper than the correlation with the underlying source target $S$ ($\rho = 0.473$): $\mathcal{R}_{\mathrm{src}}$ tracks whether *this* run was able to use the source, not absolute source quality. Second, a threshold at $\bar{\mathcal{R}}_{\mathrm{src}} = 0.80$ *perfectly* separates the 9 adaptation successes (all i.i.d. runs across $S \in \{0.75, 0.30, 0.12\}$, mean error $\approx 21\%$) from the 9 failures (all correlated-stream runs, mean error $\approx 80\%$), in agreement with AUC $= 1.000$. The i.i.d. $S = 0.12$ run is informative: even at the lowest source quality, the i.i.d. stream cycles all corruptions frequently enough that $\bar{\mathcal{R}}_{\mathrm{src}}$ stays above the threshold ($0.818$) and the run adapts (mean error $25.80\%$). This justifies the qualitative claim in §[4.4](https://arxiv.org/html/2605.14063#S4.SS4) that the gate has a usable threshold behavior, and is consistent with the trace-level observation in Appendix [K](https://arxiv.org/html/2605.14063#A11) that $\mathcal{R}_{\mathrm{src}}$ stabilizes to a level rather than drifting continuously.
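For readers who want to reproduce this kind of agreement analysis on their own runs, the following is a minimal sketch of the two rank-based statistics, written in plain Python. The six paired values are made-up toy numbers chosen only to exercise the functions; they are not the paper's 18 runs, and the helper names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def auc(scores, labels):
    """P(score of a positive > score of a negative); ties count one half."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (NOT from the paper): mean reliability and final accuracy per run.
r_src = [0.90, 0.86, 0.82, 0.74, 0.66, 0.60]
accuracy = [0.80, 0.78, 0.74, 0.22, 0.20, 0.15]
success = [a > 0.5 for a in accuracy]  # illustrative success criterion

rho = spearman(r_src, accuracy)
separation = auc(r_src, success)
print(f"Spearman rho = {rho:.3f}, AUC = {separation:.3f}")
```

An AUC of 1 only says that some threshold separates the two groups; whether a fixed value such as 0.80 transfers to other streams is an empirical question the toy data cannot answer.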
## Appendix O Gate-Threshold Sensitivity for Reset-Triggered ASR
§[4.3](https://arxiv.org/html/2605.14063#S4.SS3) notes that the reset paradigm fails on CCC-Hard with a ViT-B/16 backbone, and motivates a reliability-conditioned reset trigger with threshold $\tau_{\mathrm{gate}}$ that suppresses an ASR reset whenever $\mathcal{R}_{\mathrm{src}} < \tau_{\mathrm{gate}}$. Here we report a full sweep of $\tau_{\mathrm{gate}} \in \{0.20, 0.30, 0.40, 0.50\}$ on RMemSafe+ASR with the ViT-B/16 backbone, three CCC levels, and the same nine splits used in the main paper. Table [15](https://arxiv.org/html/2605.14063#A15.T15) summarizes the result.
Table 15: Reset-triggered ASR on ViT-B/16: mean error (%) and per-split standard deviation across nine CCC splits, as a function of the gate threshold $\tau_{\mathrm{gate}}$. The $\tau = 0.30$ Hard cell is taken from a prior nine-split run with identical seeds.

Two findings stand out. First, on CCC-Easy and CCC-Med, the controller is essentially insensitive to $\tau_{\mathrm{gate}}$: the across-$\tau$ spread is below $0.05$ pp at both levels, an order of magnitude smaller than the across-split standard deviation. The reliability gate adds no new tunable knob in the regime where the source expert is informative. Second, on CCC-Hard, the gate becomes operative: raising $\tau_{\mathrm{gate}}$ from $0.20$ to $0.40$ closes roughly $7$ pp of error and brings the reset-paradigm mean (RMemSafe+ASR $\to 76.42$) essentially level with the non-reset ROID baseline reported in Table [4](https://arxiv.org/html/2605.14063#A4.T4) ($75.98$ on CCC-Hard ViT). The Hard cell remains highly multimodal across splits (per-split errors range from $\sim 60\%$ to $\sim 98\%$), so the gate does not by itself eliminate the underlying ViT reset-paradigm failure; it does, however, demonstrate that the runtime reliability scalar is load-bearing on the reset trigger and operates in the expected direction, recovering the non-reset baseline on the mean.
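The reset rule described at the start of this appendix can be sketched in a few lines. This is a minimal illustration under stated assumptions: the reliability is taken to be one minus the normalized entropy of the source posterior (so a uniform posterior gives 0), which matches the qualitative description of the gate but may differ from the paper's exact definition, and `asr_wants_reset` stands in for whatever condition ASR itself uses. All names are hypothetical.

```python
import math


def source_reliability(probs):
    """Assumed form: 1 - H(p) / log K, where K is the number of classes."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(probs))


def gated_reset(asr_wants_reset, r_src, tau_gate=0.40):
    """Suppress the ASR reset whenever r_src < tau_gate.

    Resetting returns the model toward the source, so the reset is allowed
    only when the source is judged reliable enough to return to.
    """
    return asr_wants_reset and r_src >= tau_gate


# A collapsed (uniform) source posterior versus a confident one, 4 classes.
uniform = [0.25, 0.25, 0.25, 0.25]
confident = [0.97, 0.01, 0.01, 0.01]

for name, probs in [("uniform", uniform), ("confident", confident)]:
    r = source_reliability(probs)
    print(f"{name:9s} r_src = {r:.3f} reset allowed = {gated_reset(True, r)}")
```

The default `tau_gate=0.40` mirrors the best-performing value in the sweep on CCC-Hard; on CCC-Easy and CCC-Med the text reports that the choice barely matters.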
## Appendix P Confidently-Wrong Source Stress Test
Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1) is a graceful-decay statement: as the source expert's entropy approaches its maximum, the source-dependent terms vanish and RMemSafe reduces to a fallback objective that is independent of the source. The proposition does *not* claim graceful decay when the source is confidently wrong, i.e., when the source expert is low-entropy on a permuted label set. This appendix demonstrates that boundary empirically.
#### Protocol.
We construct a confidently wrong source by applying a fixed class permutation (a random permutation of the $1{,}000$ ImageNet classes, seed $1729$) to the source expert's logits before they are consumed by the agreement filter and divergence-aware anchor. The permutation is a deterministic function of the original logits, so the source expert remains low-entropy on every input, but its top-1 prediction is uncorrelated with the true label. We rerun ROID+ASR and RMemSafe+ASR (full configuration, paper hyperparameters) on the CCC benchmark with both ResNet-50 and ViT-B/16 backbones, three corruption levels, and three splits per cell ($36$ runs total).
Table 16: Confidently-wrong source stress test on CCC: mean error (%) with per-split standard deviation over three splits per cell. The source expert is replaced by a class-permuted copy of itself (seed $1729$). $\Delta$ is RMemSafe+ASR minus ROID+ASR; positive values indicate RMemSafe is worse. Per-cell deltas at $n = 3$ are observational; we do not run a paired test on this small sample.

On ResNet-50, RMemSafe+ASR is indistinguishable from ROID+ASR at this sample size under a confidently-wrong source: per-cell deltas lie in $[-0.69, +0.36]$ pp and the CCC mean is $0.17$ pp better. The runtime reliability scalar sufficiently suppresses the source-dependent terms that the wrong source no longer leaks harm into the adapted parameters, thereby recovering the ROID-only fallback. On ViT-B/16, the boundary is less clean: RMemSafe+ASR is $\approx 1.1$ pp worse than ROID+ASR on the CCC mean, with the gap concentrated in CCC-Easy and CCC-Med. This is consistent with our finding in Appendix [K](https://arxiv.org/html/2605.14063#A11) that $\mathcal{R}_{\mathrm{src}}$ on ViT does not collapse as aggressively as on ResNet-50, and identifies a regime where the entropy-based gate under-attenuates a confidently-miscalibrated source. We treat this as a known scope boundary for the present reliability signal and discuss it as a limitation (Appendix [R](https://arxiv.org/html/2605.14063#A18)).
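The reason a class permutation is invisible to an entropy-based signal can be shown with a small sketch: permuting logits reorders the softmax probabilities without changing their values, so the entropy is unchanged while the top-1 class moves. The seed 1729 follows the text, but the use of Python's `random` module is an assumption (the authors' RNG is not specified), so this does not reproduce their actual permutation. The toy logits are invented for illustration.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def permute_logits(logits, perm):
    """Move the logit of class c to class perm[c]."""
    out = [0.0] * len(logits)
    for c, z in enumerate(logits):
        out[perm[c]] = z
    return out


NUM_CLASSES = 1000
perm = list(range(NUM_CLASSES))
random.Random(1729).shuffle(perm)  # fixed permutation; RNG choice is an assumption

# Toy confident source output: class 7 dominates (illustrative values only).
logits = [0.0] * NUM_CLASSES
logits[7] = 12.0
permuted = permute_logits(logits, perm)

h_before = normalized_entropy(softmax(logits))
h_after = normalized_entropy(softmax(permuted))
top1_before = max(range(NUM_CLASSES), key=logits.__getitem__)
top1_after = max(range(NUM_CLASSES), key=permuted.__getitem__)

print(f"normalized entropy before/after: {h_before:.4f} / {h_after:.4f}")
print(f"top-1 class before/after: {top1_before} / {top1_after}")
```

Any signal computed from the sorted probability vector alone (entropy, max probability, margin) is permutation-invariant in the same way, which is why the text recommends pairing the gate with an independent correctness signal in high-stakes settings.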
## Appendix Q Discussion of the Local-Data Offset
On our locally stored CCC shards, plain ROID+ASR attains a CCC-Hard ResNet-50 error of $84.56\%$, whereas Lim et al. [[14](https://arxiv.org/html/2605.14063#bib.bib14)] report $77.79\%$ on the original streamed data. We investigated this discrepancy at length and concluded that it reflects a deterministic difference in the underlying shard order and image decoding: the same corruption parameters produce slightly different per-frame pixel content between the streamed and the locally cached versions of CCC. Two facts support interpreting the offset as a data artifact rather than a methodological one. First, the offset is roughly constant across methods: every reset-based baseline we reproduced locally is $\sim 7$ pp worse than its published value on CCC-Hard (RN-50), and the method ranking is preserved. Second, the offset vanishes on CCC-Easy and CCC-Medium, where the streamed and local numbers agree to within $1$ pp.
The consequence for the paper is that *cross-study absolute comparisons on CCC-Hard should not be taken at face value*. Our matched-split head-to-head comparison is the unbiased estimator of relative method quality and is the basis for all our conclusions. Table [17](https://arxiv.org/html/2605.14063#A17.T17) quantifies the offset per method for transparency.
Table 17: Local CCC-Hard ResNet-50 errors against the streamed numbers reported in Lim et al. [[14](https://arxiv.org/html/2605.14063#bib.bib14)] (where available). The offset is approximately constant ($\sim 7$ pp) across reset-based methods.
## Appendix R Broader Impact and Limitations
RMemSafe is designed for *safety* in continual test-time adaptation: it aims to prevent catastrophic forgetting and class collapse in deployed systems that adapt online to unlabeled data, which is a prerequisite for safety-critical applications such as autonomous driving, medical imaging, and continuous monitoring. We are not aware of any direct negative societal impact specific to the method; it does not train any new large models, collect new data, or make deployment decisions itself.
RMemSafe contributes to an ongoing research program on *runtime safety signals for trustworthy machine learning*, where a model's deployment-time behavior is gated by a quantitative reliability measure. Adjacent work in this program addresses out-of-distribution safety detection via typicality [[5](https://arxiv.org/html/2605.14063#bib.bib5), [1](https://arxiv.org/html/2605.14063#bib.bib1), [4](https://arxiv.org/html/2605.14063#bib.bib4)], grammar-based uncertainty quantification for LLMs in formal reasoning tasks [[6](https://arxiv.org/html/2605.14063#bib.bib6), [25](https://arxiv.org/html/2605.14063#bib.bib25), [26](https://arxiv.org/html/2605.14063#bib.bib26), [3](https://arxiv.org/html/2605.14063#bib.bib3)] and hybrid reasoning and RAG systems [[34](https://arxiv.org/html/2605.14063#bib.bib34), [33](https://arxiv.org/html/2605.14063#bib.bib33), [31](https://arxiv.org/html/2605.14063#bib.bib31)]. RMemSafe contributes a third instance of the same design principle, specialized to the continual test-time adaptation regime: source-entropy-derived $\mathcal{R}_{\mathrm{src}}$ as a runtime gate on anchoring-based stability mechanisms, with the analytical graceful-decay guarantee of Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1).
#### Intended use and positive impact.
RMemSafe targets a specific safety failure mode in continual test-time adaptation: catastrophic anchoring to a frozen source that has itself collapsed under distribution shift. CTTA is increasingly deployed in settings where the input stream is non-stationary and labels are unavailable at test time (autonomous perception, medical imaging under acquisition drift, industrial monitoring), and in each of these the failure mode we diagnose (continuing to pull toward a $\sim 1\%$-accurate reference at fixed strength) is a real deployment risk, not a benchmark artifact. The graceful-decay property of Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1) is the safety contribution: when the source signal becomes uninformative, the source-coupled terms vanish rather than propagating a corrupted prior into the adapted parameters.
#### Risks of the method functioning correctly.
The reliability gate uses source predictive entropy as its sole signal. This is sound for the high-entropy collapse regime (§[3](https://arxiv.org/html/2605.14063#S3), App. [P](https://arxiv.org/html/2605.14063#A16)), but a confidently miscalibrated source (low entropy on wrong classes) is deemed reliable by $\mathcal{R}_{\mathrm{src}}$, and the anchor activates toward it. Standard CTTA benchmarks produce high-entropy collapse rather than low-entropy miscalibration, but deployment distributions need not. A practitioner who reads Proposition [1](https://arxiv.org/html/2605.14063#Thmproposition1) as a general safety guarantee, rather than as a guarantee restricted to entropy-detectable failure, will misjudge when the method is protective. We flag this scope explicitly in §[5](https://arxiv.org/html/2605.14063#S5) and recommend that high-stakes deployments pair the entropy gate with an independent correctness signal.
#### Risks of the method functioning incorrectly.
On ViT-B/16 the entropy gate under-attenuates a confidently-wrong source ($\Delta = +1.14$ pp on the CCC mean, App. [P](https://arxiv.org/html/2605.14063#A16)) and the reset paradigm itself underperforms the non-reset baseline on CCC-Hard (§[4.3](https://arxiv.org/html/2605.14063#S4.SS3)). Both are characterized rather than fixed. A deployment using RMemSafe with a ViT backbone under severe shift should not assume the matched-split CCC gains transfer; the reliability-gated reset trigger of App. [O](https://arxiv.org/html/2605.14063#A15) partially addresses the second issue but is not a complete fix.
#### Misuse and dual-use.
RMemSafe trains no new models, collects no data, and makes no deployment decisions on its own. We are not aware of a direct malicious-use pathway specific to this work beyond risks generic to robust-adaptation methods (e.g., a robust deployed system inherits whatever downstream uses its operator chooses).
#### Limitations.
1. Scope of the reliability signal. Entropy-only; does not detect confidently miscalibrated sources. ViT-B/16 under a class-permuted source is the empirical witness ($\Delta = +1.14$ pp, App. [P](https://arxiv.org/html/2605.14063#A16)).
2. Reset-paradigm failure on CCC-Hard ViT-B/16. Every reset-based method we evaluate underperforms non-reset ROID on this cell, across base adapters and reset mechanisms (§[4.3](https://arxiv.org/html/2605.14063#S4.SS3)). The reliability-gated reset trigger ($\tau_{\mathrm{gate}} = 0.40$, App. [O](https://arxiv.org/html/2605.14063#A15)) recovers the non-reset mean but not per-split variance.
3. Local-data offset on CCC. Our shards yield CCC-Hard numbers $\sim 7$ pp harder than the streamed numbers of Lim et al. [[14](https://arxiv.org/html/2605.14063#bib.bib14)]; the offset is approximately constant across methods on ResNet-50. Cross-study absolute comparisons on CCC-Hard should be interpreted with caution; the matched-split head-to-head is the unbiased estimator of relative method quality.
4. Fixed hyperparameters. The five core hyperparameters are held constant across all nine benchmark cells. Per-cell tuning would likely yield further small gains but is discouraged in the unlabeled test-time setting.
5. Marginal-calibration EMA under abrupt label shift. The EMA prior ($\rho = 0.01$) lags abrupt label-distribution shifts; our streams exhibit gradual rather than abrupt shift, so this regime is not exercised.