Separating Expert Retention from Autonomous Source Inference in Raw-ECG-Replay-Free Continual ECG Deployment
Summary
This paper introduces IRFE-ECG, a method for continual ECG deployment that separates expert retention from autonomous source inference using frozen ECGFounder features. Without replaying raw ECGs, source-aware expert selection nearly matches an offline reference, while autonomous source inference remains the main bottleneck.
Source: [https://arxiv.org/html/2607.01674](https://arxiv.org/html/2607.01674)
###### Abstract
In multi-source ECG deployment, models may need to incorporate new data sources when earlier raw ECGs cannot be retained or replayed. Freezing a pretrained backbone and assigning each source an isolated classifier prevents parameter interference, but deployment still requires selecting an expert when source metadata are unavailable. We study this distinction through IRFE-ECG, an incremental expert bank built on frozen 1024-dimensional ECGFounder features. Each arriving domain adds a balanced-softmax linear expert, while a lightweight router is fitted only on retained training features and domain labels from the sources observed so far. A validation-calibrated margin rule fuses the two most likely experts instead of committing to a single routed expert.
On CPSC, PTB-XL, Georgia, and Chapman-Shaoxing, source-aware expert selection reaches $0.7915\pm0.0036$ Macro-F1 and a matched offline independent-head reference reaches $0.7885\pm0.0009$, supporting strong source-aware expert retention. Without source IDs, an MLP router reaches $0.7756\pm0.0027$ and top-2 margin fusion reaches $0.7782\pm0.0022$. The top-2 gain over hard MLP routing is small ($+0.0026$), and its 95% paired-bootstrap confidence interval includes zero. Across three domain orders, the top-2-to-oracle gap remains between $0.0111$ and $0.0133$, identifying autonomous source inference as the main remaining bottleneck. No raw ECGs are replayed, but frozen training features are retained for router updates; the method is therefore not memory-free.
## I. Introduction
ECG systems rarely remain confined to one stationary data distribution. Continual cardiac-signal studies have considered changes across diseases, time, modalities, and institutions \[[6](https://arxiv.org/html/2607.01674#bib.bib20)\], and multicenter ECG continual-learning work has highlighted data-governance and data-sharing constraints in sequential ECG adaptation \[[4](https://arxiv.org/html/2607.01674#bib.bib18)\]. A deployed ECG system may first receive data from one hospital or acquisition device and later encounter records collected with different hardware, cohorts, or annotation protocols. Updating a shared model on each new source can overwrite earlier decision boundaries, while retaining or centralizing raw historical ECGs may be restricted by those same governance constraints. Foundation models offer a useful alternative: a fixed representation can support lightweight domain-specific classifiers without repeatedly optimizing a large backbone.
Parameter isolation alone, however, does not solve the deployment problem. If source metadata are known at inference time, selecting the corresponding frozen expert is straightforward, and earlier experts cannot be overwritten. If the source is unknown, the system must infer which expert, or combination of experts, should process each ECG. This distinction matters in practice: source-aware expert selection can approach a matched offline reference, whereas autonomous routing can still leave a measurable test-time gap. Treating the source-aware result as a complete continual-learning solution would therefore hide the main failure mode of source-unknown deployment.
This paper makes three contributions. First, we formulate raw-ECG-replay-free continual ECG deployment as two coupled but distinct problems: retaining source-specific experts, and inferring the source of a test ECG when metadata are unavailable. We instantiate this formulation with IRFE-ECG, a frozen-feature expert bank built on ECGFounder, in which each arriving domain adds an isolated Balanced-Softmax linear expert and earlier experts are never updated.
Second, we evaluate autonomous source inference under a strict seen-domain protocol. The comparison includes pooled-head prediction, centroid routing, kNN, shrinkage LDA, a linear router, an MLP router, and probability-averaging controls. We also test a validation-calibrated top-2 margin fusion rule, which we present as a practical refinement rather than as the central claim.
Third, we show that the main limitation is autonomous source inference rather than expert retention. Source-aware expert selection remains close to a matched offline independent-head reference, while autonomous routing leaves a reproducible gap across three domain orders and feature-memory budgets from 1% to 100%. This finding clarifies what must improve before isolated expert banks can serve as a complete solution for source-unknown deployment.
## II. Related Work
### II-A. ECG Foundation Models
ECG foundation models are designed to provide transferable representations for downstream ECG analysis. We use ECGFounder only as a frozen single-lead feature extractor and keep the backbone fixed throughout continual learning \[[8](https://arxiv.org/html/2607.01674#bib.bib1)\]. This design shifts the question from representation learning to deployment: whether fixed ECG features can support an expanding set of source-specific experts, and whether those experts can be selected autonomously when source metadata are unavailable.
### II-B. Continual Learning and Parameter Isolation
Regularization methods such as elastic weight consolidation (EWC) \[[5](https://arxiv.org/html/2607.01674#bib.bib2)\] and synaptic intelligence (SI) \[[19](https://arxiv.org/html/2607.01674#bib.bib4)\] constrain changes to parameters considered important for earlier tasks. Learning without forgetting (LwF) distills predictions from a previous model \[[9](https://arxiv.org/html/2607.01674#bib.bib3)\], while rehearsal methods retain a subset of previous examples. iCaRL combines exemplar rehearsal with a nearest-mean classifier \[[13](https://arxiv.org/html/2607.01674#bib.bib5)\]. These approaches primarily address interference in shared parameters. Expert banks instead isolate parameters, which reduces weight interference but shifts the deployment challenge toward task or domain inference.
### II-C. ECG-Specific Continual and Cross-Domain Learning
Continual learning has also been studied directly on ECG and other physiological signals. CLOPS considered cardiac signals across diseases, time, modalities, and institutions using replay-based strategies \[[6](https://arxiv.org/html/2607.01674#bib.bib20)\]. Multicenter ECG continual-learning work has evaluated sequential adaptation across public ECG sources under data-sharing constraints \[[4](https://arxiv.org/html/2607.01674#bib.bib18)\], while ECG-CL investigated parameter-isolation mechanisms for comprehensive ECG interpretation \[[2](https://arxiv.org/html/2607.01674#bib.bib19)\]. Prototype-rehearsal methods such as DREAM-CL further illustrate the role of memory design in continual ECG arrhythmia detection \[[12](https://arxiv.org/html/2607.01674#bib.bib23)\]. Broader surveys summarize replay, regularization, and architecture-based strategies across ECG and related physiological signals \[[7](https://arxiv.org/html/2607.01674#bib.bib21)\]. In parallel, cross-dataset transferability analyses such as MELEP show that ECG representations and classifiers can behave differently across datasets and label spaces \[[11](https://arxiv.org/html/2607.01674#bib.bib22)\]. These studies motivate continual and cross-domain ECG analysis, but they do not directly address the source-unknown deployment question studied here. We use a frozen ECG foundation representation and isolated source-specific experts to separate expert retention from autonomous source inference when source metadata are unavailable.
### II-D. Prompt, Adapter, and Expert Routing
Parameter-efficient continual-learning methods add prompts, adapters, or experts to pretrained backbones. L2P retrieves prompts from a learned prompt pool \[[18](https://arxiv.org/html/2607.01674#bib.bib6)\], while CODA-Prompt learns decomposed attention-based prompts for rehearsal-free continual learning \[[16](https://arxiv.org/html/2607.01674#bib.bib7)\]. Expert Gate adds task experts sequentially and uses learned gating models to route test samples to relevant experts \[[1](https://arxiv.org/html/2607.01674#bib.bib10)\]. CLOM further emphasizes that task-incremental and class-incremental evaluation differ in whether task identity is available at test time \[[3](https://arxiv.org/html/2607.01674#bib.bib13)\].
These methods show that test-time task identity cannot be assumed in general continual-learning settings. Rather than proposing expert routing as a general mechanism, our goal is to instantiate and evaluate this distinction in raw-ECG-replay-free continual ECG deployment. Unlike vision-transformer prompt pools, our router receives a single frozen 1024-dimensional ECG feature vector. We therefore focus on lightweight routing components that can be trained and validated directly in this feature space, combining frozen ECGFounder features, isolated source-specific experts, and an autonomous source router.
### II-E. Imbalanced ECG Classification
Normal/abnormal ECG labels are imbalanced across sources, so accuracy and positive-class F1 are insufficient on their own. Balanced Softmax incorporates class frequencies into the softmax denominator \[[14](https://arxiv.org/html/2607.01674#bib.bib8)\], while logit adjustment shifts decision margins according to label priors \[[10](https://arxiv.org/html/2607.01674#bib.bib9)\]. We train domain experts with Balanced Softmax and calibrate one decision threshold per expert using validation data only. We report Macro-F1 as the primary metric and include class-specific F1, balanced accuracy, AUROC, and AUPRC for the autonomous top-2 model.
Figure 1: Overview of IRFE-ECG. A frozen ECGFounder encoder maps each ECG to a 1024-dimensional feature vector. Each arriving domain adds an isolated balanced-softmax linear expert, while previous experts and the backbone remain frozen. Frozen train features and domain labels from seen domains are retained to update an autonomous source router without replaying raw ECG waveforms. At test time, source metadata are unavailable; the router selects the most likely experts, and the top-2 variant combines validation-calibrated decision margins. The source-aware path is an oracle reference only and is not used in autonomous deployment.
## III. Method
### III-A. Problem Setup
Let $\mathcal{D}_1,\ldots,\mathcal{D}_T$ be a sequence of ECG domains. Domain $\mathcal{D}_t=\{(x_i,y_i)\}_{i=1}^{n_t}$ contains single-lead ECGs $x_i$ and binary normal/abnormal labels $y_i\in\{0,1\}$. At step $t$, the learner receives the train and validation partitions of $\mathcal{D}_t$. Raw ECGs from $\mathcal{D}_{1:t-1}$ are not replayed. For router updates, we retain frozen feature vectors and domain identities from earlier training samples. The test partitions remain inaccessible until final evaluation. The protocol is therefore raw-ECG-replay-free but retains feature-level memory for router updates.
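The feature-level memory can be pictured as a store that accumulates frozen feature rows and domain IDs, never waveforms. The sketch below is a minimal illustration under stated assumptions: the class name `FeatureMemory` is hypothetical, and the uniform random per-domain subsampling used for budgets below 100% is an assumed policy, since the text does not specify how retained rows are chosen.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


class FeatureMemory:
    """Hypothetical store of frozen train features and domain IDs (no raw ECGs)."""

    def __init__(self, fraction: float = 1.0, seed: int = 42) -> None:
        if not 0.0 < fraction <= 1.0:
            raise ValueError("fraction must lie in (0, 1]")
        self.fraction = fraction  # feature-memory budget, e.g. 0.01, 0.1, or 1.0
        self.rng = random.Random(seed)
        self.features: List[Vector] = []
        self.domains: List[int] = []

    def add_domain(self, domain_id: int, train_features: Sequence[Vector]) -> int:
        """Keep a uniform random subset of one domain's train features (assumed policy)."""
        if not train_features:
            return 0
        n_keep = max(1, round(self.fraction * len(train_features)))
        kept = sorted(self.rng.sample(range(len(train_features)), n_keep))
        for i in kept:
            self.features.append(train_features[i])
            self.domains.append(domain_id)
        return n_keep

    def router_training_set(self) -> Tuple[List[Vector], List[int]]:
        """Everything the router may be fitted on at the current step."""
        return list(self.features), list(self.domains)
```

At each step one would call `add_domain(t, features_of_new_train_split)` and then refit the router on `router_training_set()`; validation features, which the protocol keeps outside this budget, are not stored here.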
We distinguish two inference settings. The *source-aware reference* provides the source ID $d$ and directly selects expert $g_d$. The primary *autonomous setting* withholds $d$ and requires a router to select or fuse experts. This separation matters because isolated expert parameters are never updated after their domains are learned, whereas autonomous predictions can still deteriorate through routing errors.
For shared-parameter baselines, we record the conventional result matrix $R\in\mathbb{R}^{T\times T}$, where $R_{t,j}$ is Macro-F1 on domain $j$ after training through domain $t$, and compute

$$\mathrm{BWT}=\frac{1}{T-1}\sum_{j=1}^{T-1}\left(R_{T,j}-R_{j,j}\right).\tag{1}$$

For the routed expert bank, however, changes in $R$ can reflect router changes rather than expert forgetting. We therefore use routed BWT only as a diagnostic and treat final autonomous Macro-F1 and route accuracy as the primary measures.
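Eq. (1) is straightforward to compute from the result matrix. The sketch below uses 0-based indices, so `R[t][j]` corresponds to $R_{t+1,j+1}$; the function name and the example matrix values are hypothetical and serve only as an illustration.

```python
from typing import Sequence


def backward_transfer(R: Sequence[Sequence[float]]) -> float:
    """Eq. (1), 0-based: R[t][j] is Macro-F1 on domain j after training through domain t."""
    T = len(R)
    if T < 2 or any(len(row) != T for row in R):
        raise ValueError("R must be a square T x T matrix with T >= 2")
    final_row = R[T - 1]  # performance after the last domain has been learned
    # Entries above the diagonal are unused. For each earlier domain j, compare the
    # final score with the score measured right after domain j was learned.
    return sum(final_row[j] - R[j][j] for j in range(T - 1)) / (T - 1)
```

For a hypothetical three-domain matrix `[[0.80, 0, 0], [0.75, 0.70, 0], [0.70, 0.68, 0.90]]`, BWT is $((0.70-0.80)+(0.68-0.70))/2=-0.06$, meaning forgetting on the two earlier domains.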
### III-B. Frozen Foundation Experts
The frozen ECGFounder backbone maps an ECG to

$$h=f_{\theta}(x),\qquad h\in\mathbb{R}^{1024},\tag{2}$$

where $\theta$ is never updated. When domain $t$ arrives, a linear expert $g_t$ is added:

$$\ell_t=g_t(h)=W_t h+b_t,\qquad W_t\in\mathbb{R}^{2\times 1024}.\tag{3}$$

Only $g_t$ is optimized; $g_{1:t-1}$ remain frozen. Experts use Balanced Softmax with class counts estimated from the current domain's train split. After model selection, a domain-specific threshold $\tau_t$ is calibrated on that domain's validation probabilities and then fixed for test evaluation.
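The sketch below illustrates Eq. (3) together with a Balanced Softmax loss. It assumes the common formulation in which training weights each class's softmax term by its train count $n_c$, which is equivalent to adding $\log n_c$ to that class's logit, while inference uses the plain softmax. The authors' implementation may differ in detail, and all function names here are hypothetical.

```python
import math
from typing import List, Sequence


def linear_expert(h: Sequence[float], W: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Eq. (3): two logits ell_t = W_t h + b_t (W_t is 2 x 1024 in the paper)."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def balanced_softmax_loss(logits: Sequence[float], label: int, class_counts: Sequence[int]) -> float:
    """Training loss: cross entropy after shifting each class logit by log(n_c)."""
    if any(n <= 0 for n in class_counts):
        raise ValueError("class counts must be positive")
    adjusted = [z + math.log(n) for z, n in zip(logits, class_counts)]
    peak = max(adjusted)  # log-sum-exp shift for numerical stability
    log_norm = peak + math.log(sum(math.exp(a - peak) for a in adjusted))
    return log_norm - adjusted[label]


def positive_probability(logits: Sequence[float]) -> float:
    """Inference: plain softmax (no prior shift); probability of class 1 (abnormal)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    return exps[1] / sum(exps)
```

With equal class counts the loss reduces to ordinary cross entropy. With counts `[9, 1]` and zero logits, predicting the rare class costs $\log 10$ rather than $\log 2$, which pushes the expert to raise minority-class logits during training.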
This architecture deliberately isolates domain decision boundaries. Its source-aware performance is an oracle reference, not evidence that test-time domain inference has been solved. We also compare against a matched offline reference that trains the same independent heads with all domain training sets available at once.
### III-C. Incremental Domain Router
At step $t$, the router is trained only over the seen domain set $\mathcal{S}_t=\{1,\ldots,t\}$. Our discriminative router is a compact MLP with a 1024-dimensional input, LayerNorm, a 128-unit hidden layer, GELU activation, dropout, and a linear output layer whose dimension equals $|\mathcal{S}_t|$:

$$q_{\phi_t}(h)=W_2\,\mathrm{Dropout}\!\left(\mathrm{GELU}(W_1\,\mathrm{LN}(h))\right).\tag{4}$$

It is optimized with cross-entropy on stored train features and domain labels from $\mathcal{S}_t$, and the checkpoint maximizing route accuracy on concatenated seen-domain validation features is restored. The router is re-fitted after each newly observed domain; it never receives future-domain features or test labels. The feature-memory budget reported later counts the stored train-feature rows used for router fitting. Validation features are used for model selection in the experimental protocol and are not counted in this budget.
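The forward pass of Eq. (4) at inference time can be sketched as follows. The sketch makes several simplifying assumptions: LayerNorm has no learnable scale or shift, the linear layers carry bias terms (which Eq. (4) omits), dropout is the identity because it is inactive at inference, and training with cross-entropy plus checkpoint selection is not shown. All names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def layer_norm(h: Sequence[float], eps: float = 1e-5) -> List[float]:
    """LayerNorm without learnable scale/shift (simplification)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]


def gelu(x: float) -> float:
    """Exact (erf-based) GELU."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def affine(W: Matrix, b: Sequence[float], x: Sequence[float]) -> List[float]:
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def router_logits(h: Sequence[float], W1: Matrix, b1: Sequence[float],
                  W2: Matrix, b2: Sequence[float]) -> List[float]:
    """Router scores s_k(h); dropout is the identity at inference."""
    hidden = [gelu(z) for z in affine(W1, b1, layer_norm(h))]  # 128 hidden units in the paper
    return affine(W2, b2, hidden)  # one score per seen domain, |S_t| outputs
```

Because the output dimension equals $|\mathcal{S}_t|$, `W2` gains a row each time a domain is observed, which is one reason the router is re-fitted rather than extended in place.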
For comparison, we evaluate a cosine nearest-centroid router, a cosine kNN router over retained features, a shared-covariance shrinkage-LDA router, and a linear domain classifier. The LDA router estimates one mean per domain and a single shared covariance matrix with shrinkage regularization, using only seen-domain train features. These controls test whether a single centroid, local neighborhoods, a linear boundary, or a nonlinear discriminative boundary is needed to separate domains in the frozen feature space.
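The two simplest controls can be sketched in a few lines. This sketch assumes that each centroid is the plain mean of a domain's retained raw features and that kNN uses majority voting with a hypothetical default of `k=5`. The text specifies neither detail, and the shrinkage-LDA router is omitted here.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def domain_centroids(features: Sequence[Vector], domains: Sequence[int]) -> Dict[int, List[float]]:
    """Mean retained train feature per seen domain (assumed: raw, not normalized, features)."""
    counts: Counter = Counter(domains)
    sums: Dict[int, List[float]] = {}
    for f, d in zip(features, domains):
        acc = sums.setdefault(d, [0.0] * len(f))
        for i, x in enumerate(f):
            acc[i] += x
    return {d: [x / counts[d] for x in acc] for d, acc in sums.items()}


def centroid_route(h: Vector, centroids: Dict[int, List[float]]) -> int:
    """Select the domain whose centroid has the highest cosine similarity to h."""
    return max(centroids, key=lambda d: cosine(h, centroids[d]))


def knn_route(h: Vector, features: Sequence[Vector], domains: Sequence[int], k: int = 5) -> int:
    """Majority vote among the k retained features most cosine-similar to h."""
    ranked = sorted(zip(features, domains), key=lambda fd: cosine(h, fd[0]), reverse=True)
    votes = Counter(d for _, d in ranked[:k])
    return votes.most_common(1)[0][0]
```

A single centroid per domain can only express one direction per source. This matches the later observation that overlapping sources such as CPSC and Georgia benefit from a learned nonlinear boundary.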
### III-D. Top-2 Validation-Margin Fusion
A hard router applies only the highest-scoring expert. To reduce sensitivity to this hard top-1 selection, let $\mathcal{K}(h)$ contain the two experts with the largest router logits $s_k(h)$. Their normalized weights are

$$\alpha_k(h)=\frac{\exp\left(s_k(h)/T_r\right)}{\sum_{j\in\mathcal{K}(h)}\exp\left(s_j(h)/T_r\right)},\tag{5}$$

where the router temperature $T_r=1$ is fixed before evaluation. Expert $k$ produces a positive-class probability $p_k(h)$ and has a domain-specific validation threshold $\tau_k$. We fuse calibrated margins rather than raw probabilities:

$$m(h)=\sum_{k\in\mathcal{K}(h)}\alpha_k(h)\bigl(p_k(h)-\tau_k\bigr),\qquad \hat{y}=\mathbb{I}[m(h)\geq 0].\tag{6}$$

The rule is used only for autonomous inference and uses no source labels, test labels, or future-domain examples at evaluation time.
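Eqs. (5) and (6) translate directly into code. The function below is a hypothetical sketch. For simplicity it takes the positive-class probability of every expert, although in practice only the two selected experts need to be evaluated.

```python
import math
from typing import Sequence, Tuple


def top2_margin_fusion(router_scores: Sequence[float], expert_probs: Sequence[float],
                       thresholds: Sequence[float], temperature: float = 1.0) -> Tuple[int, float]:
    """Eqs. (5)-(6): softmax over the two largest router logits, then fuse p_k - tau_k."""
    if len(router_scores) < 2:
        raise ValueError("top-2 fusion needs at least two seen experts")
    top2 = sorted(range(len(router_scores)), key=lambda k: router_scores[k], reverse=True)[:2]
    scaled = [router_scores[k] / temperature for k in top2]
    peak = max(scaled)  # stability shift; leaves the normalized weights unchanged
    exps = [math.exp(s - peak) for s in scaled]
    weights = [e / sum(exps) for e in exps]  # alpha_k(h) over K(h)
    margin = sum(w * (expert_probs[k] - thresholds[k]) for w, k in zip(weights, top2))
    return int(margin >= 0.0), margin
```

Consider a hypothetical input with router logits `[2.0, 1.0, -1.0, 0.0]`, expert probabilities `[0.45, 0.80, 0.90, 0.10]`, and all thresholds at 0.5. Hard top-1 routing uses expert 0 and predicts normal ($0.45<0.5$). Fusion weights experts 0 and 1 by about 0.731 and 0.269 and obtains $m\approx0.044\geq0$, so it predicts abnormal. This is the mechanism by which fusion can soften an uncertain top-1 decision.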
## IV. Experimental Setup
### IV-A. Datasets and Splits
We use four ECG domains (CPSC, PTB-XL, Georgia, and Chapman-Shaoxing) from the preprocessed PhysioNet/Computing in Cardiology Challenge 2021 cache used in our experiments \[[15](https://arxiv.org/html/2607.01674#bib.bib15)\]. We keep the original source names to denote the domain of each record. PTB-XL and Chapman-Shaoxing are also cited through their original dataset descriptions for provenance \[[17](https://arxiv.org/html/2607.01674#bib.bib16), [20](https://arxiv.org/html/2607.01674#bib.bib17)\]. Each ECG is stored in the cache as a fixed-length single-lead waveform of 2,500 samples and is mapped to a binary normal/abnormal label. The main domain order is CPSC → PTB-XL → Georgia → Chapman-Shaoxing.
TABLE I: Four-domain ECG split. Waveform-hash grouping prevents exact train/test waveform overlap but does not guarantee patient-level isolation.

We use a waveform-hash-grouped record-level split for all four domains. Records with identical waveform hashes are assigned to the same partition, which prevents exact duplicate waveforms from crossing the train/test boundary. The processed splits and frozen-feature caches used in this study do not preserve reliable patient identifiers, so we do not claim patient-level separation. All reported results should be interpreted as record-level benchmark evidence rather than patient-level clinical validation.
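A waveform-hash-grouped split assigns whole hash groups, rather than individual records, to partitions. The sketch below makes several assumptions that the text does not fix: it hashes the waveform by applying SHA-256 to the rounded sample values, and it uses hypothetical 70/10/20 split fractions. The function names are also hypothetical.

```python
import hashlib
import random
from typing import Dict, List, Sequence


def waveform_hash(samples: Sequence[float], decimals: int = 6) -> str:
    """SHA-256 of the rounded sample values (hypothetical hashing scheme)."""
    text = ",".join(f"{s:.{decimals}f}" for s in samples)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def grouped_split(waveforms: Sequence[Sequence[float]],
                  fractions: Sequence[float] = (0.7, 0.1, 0.2),
                  seed: int = 42) -> Dict[str, List[int]]:
    """Assign whole hash groups to train/val/test so identical waveforms never cross partitions."""
    groups: Dict[str, List[int]] = {}
    for idx, wf in enumerate(waveforms):
        groups.setdefault(waveform_hash(wf), []).append(idx)
    keys = sorted(groups)  # deterministic order before shuffling
    random.Random(seed).shuffle(keys)
    n_train = round(fractions[0] * len(keys))
    n_val = round(fractions[1] * len(keys))
    parts = {"train": keys[:n_train],
             "val": keys[n_train:n_train + n_val],
             "test": keys[n_train + n_val:]}
    # This guarantees waveform-level separation only; two different recordings
    # of the same patient can still land in different partitions.
    return {name: sorted(i for k in ks for i in groups[k]) for name, ks in parts.items()}
```

The final comment restates the limitation in the caption: hashing removes exact duplicates, but without patient identifiers it cannot prevent patient-level leakage.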
### IV-B. Training and Validation Protocol
All formal results use seeds 42, 43, and 44. Domain experts are trained for 30 epochs with AdamW, cosine learning-rate decay, learning rate $5\times10^{-4}$, and batch size 32. MLP and linear routers use learning rate $10^{-3}$ for at most 50 epochs.
For each expert, checkpoints are selected by validation Macro-F1 at threshold 0.5. The selected checkpoint is then restored, and a domain-specific decision threshold is calibrated on validation data. Router checkpoints are selected by validation route accuracy on seen domains. All model selection, threshold calibration, and hyperparameter choices are made without test access, and the held-out test set is evaluated once per selected model. Reported uncertainty is the sample standard deviation across the three seeds.
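The text does not state which objective the threshold calibration optimizes. The sketch below assumes a grid search that maximizes validation Macro-F1, the primary metric, and breaks ties toward the smallest threshold. The grid resolution and the function names are hypothetical.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the F1 scores of class 0 (normal) and class 1 (abnormal)."""
    scores = []
    for c in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / 2


def calibrate_threshold(y_val: Sequence[int], p_val: Sequence[float], grid_size: int = 99) -> float:
    """Choose tau on validation data only; ties keep the smallest tau on the grid."""
    best_tau, best_score = 0.5, -1.0
    for i in range(1, grid_size + 1):
        tau = i / (grid_size + 1)  # 0.01, 0.02, ..., 0.99 for the default grid
        score = macro_f1(y_val, [int(p >= tau) for p in p_val])
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau
```

Once chosen, `tau` is frozen and reused unchanged on the test split, consistent with the no-test-access protocol above.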
### IV-C. Metrics and Comparisons
The primary metric is final Macro-F1 averaged across the four domains. Route accuracy is the fraction of test samples assigned to the expert of their source domain. We also compute balanced accuracy, class-specific F1, AUROC, and AUPRC. For shared-parameter baselines, BWT measures the final change in performance on earlier domains; for routed experts, BWT is not interpreted as parameter forgetting.
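The headline metric is read here as an equal-weight mean of per-domain Macro-F1 scores, not Macro-F1 on the pooled test set. That reading is an interpretation of "averaged across the four domains". A minimal sketch with hypothetical function names:

```python
from typing import Dict, Sequence, Tuple


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over classes 0 and 1."""
    scores = []
    for c in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / 2


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean per-class recall over the classes present in y_true."""
    recalls = []
    for c in (0, 1):
        support = sum(1 for t in y_true if t == c)
        if support:
            hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
            recalls.append(hits / support)
    return sum(recalls) / len(recalls)


def domain_averaged_macro_f1(per_domain: Dict[str, Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Macro-F1 within each domain, then an equal-weight mean over domains."""
    return sum(macro_f1(t, p) for t, p in per_domain.values()) / len(per_domain)
```

Equal weighting means a small domain influences the headline number as much as a large one, which matters because the four sources differ in size and class balance.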
Comparisons include a pooled frozen-feature head; centroid, kNN, shrinkage-LDA, linear, and MLP routers; a source-aware oracle; and a matched offline independent-head reference. Shared-parameter baselines include a sequential shared head, full fine-tuning, EWC, LwF, SI, and a replay buffer of 256 examples per domain. All baselines use the same domain sequence and, where applicable, validation-only threshold calibration.
## V. Results
### V-A. Autonomous Source Inference Leaves the Main Gap
Table [II](https://arxiv.org/html/2607.01674#S5.T2) reports final four-domain Macro-F1 and route accuracy after the last incremental step. The pooled head and centroid router are nearly identical, at $0.7551$ and $0.7557$ Macro-F1. Nonlinear MLP routing reaches $0.7756\pm0.0027$, above centroid, kNN, LDA, and a matched linear router. Top-2 margin fusion is numerically the highest among autonomous variants at $0.7782\pm0.0022$.
TABLE II: Final performance over three seeds. Source ID is unavailable for all autonomous methods and is provided only to the oracle reference. Bold indicates the best autonomous selection rule. Top-2 variants share the MLP route accuracy. A dash indicates that route accuracy is not applicable.

The source-aware oracle remains $0.0133$ Macro-F1 above MLP top-2. The top-2 gain over hard MLP routing is $0.0026$; a paired bootstrap gives a 95% confidence interval of $[-0.0010, 0.0061]$ ($p=0.144$). We therefore interpret top-2 as a modest operational refinement rather than a statistically significant improvement over hard MLP routing. The equal-probability top-2 control performs substantially worse, indicating that generic averaging does not explain the margin-fusion result. No autonomous router uses future-domain features or test labels.
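The text does not specify the resampling unit of the paired bootstrap or how the p-value was computed. The sketch below assumes that test records are resampled with replacement, using the same indices for both methods, and reports a percentile interval for the metric difference. It does not reproduce the reported p-value, and all names are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Metric = Callable[[Sequence[int], Sequence[int]], float]


def paired_bootstrap_ci(y_true: Sequence[int], pred_a: Sequence[int], pred_b: Sequence[int],
                        metric: Metric, n_boot: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Observed metric(a) - metric(b) and a percentile (1 - alpha) interval."""
    rng = random.Random(seed)
    n = len(y_true)
    observed = metric(y_true, pred_a) - metric(y_true, pred_b)
    diffs: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # shared indices make the test "paired"
        yt = [y_true[i] for i in idx]
        diffs.append(metric(yt, [pred_a[i] for i in idx]) - metric(yt, [pred_b[i] for i in idx]))
    diffs.sort()
    lo = diffs[round(alpha / 2 * (n_boot - 1))]
    hi = diffs[round((1 - alpha / 2) * (n_boot - 1))]
    return observed, lo, hi
```

Pairing matters because top-2 and hard MLP routing share the same router and mostly agree. Resampling records jointly cancels shared per-record difficulty, and the interval reflects only where the two rules differ. For the domain-averaged Macro-F1, one variant would resample within each domain before averaging.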
### V-B. Domain Boundaries and Routing Confusion
Figure [2](https://arxiv.org/html/2607.01674#S5.F2) indicates that the improvement over centroid routing is largely associated with better domain-discriminative boundaries. MLP routing raises correct source selection for CPSC from 51.7% under centroid routing to 75.8%, and for Georgia from 47.9% to 76.6%. LDA recovers much of this gap, while the nonlinear MLP is strongest on both domains. Residual CPSC–Chapman and Georgia–Chapman confusion helps explain why autonomous performance remains below the oracle.
Figure 2: Final-step route confusion matrices averaged over seeds 42–44. Rows are true sources and columns are selected experts; entries are percentages.
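The row-normalized matrices in Figure 2 and the route-accuracy figures can be computed as sketched below. The sketch assumes that per-seed percentage matrices are averaged element-wise, which is one plausible reading of "averaged over seeds". The function names are hypothetical.

```python
from typing import List, Sequence


def route_confusion_percent(true_domains: Sequence[int], routed: Sequence[int],
                            n_domains: int) -> List[List[float]]:
    """Rows = true source, columns = selected expert; each non-empty row sums to 100."""
    counts = [[0] * n_domains for _ in range(n_domains)]
    for t, r in zip(true_domains, routed):
        counts[t][r] += 1
    return [[100.0 * c / sum(row) if sum(row) else 0.0 for c in row] for row in counts]


def route_accuracy(true_domains: Sequence[int], routed: Sequence[int]) -> float:
    """Fraction of test samples sent to the expert of their own source domain."""
    return sum(t == r for t, r in zip(true_domains, routed)) / len(true_domains)


def average_matrices(mats: Sequence[List[List[float]]]) -> List[List[float]]:
    """Element-wise mean, e.g. over the per-seed matrices for seeds 42-44."""
    return [[sum(m[i][j] for m in mats) / len(mats) for j in range(len(mats[0][0]))]
            for i in range(len(mats[0]))]
```

Pooled route accuracy weights domains by test-set size, so it generally differs from the mean of the diagonal percentages. With three of four samples from domain 0 routed correctly and one of two from domain 1, route accuracy is about 66.7%, while the diagonal mean is 62.5%.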
### V-C. Domain-Wise Effects and the Oracle Gap
Table [III](https://arxiv.org/html/2607.01674#S5.T3) shows that the modest top-2 gain is concentrated in Chapman, where Macro-F1 rises from 0.8922 to 0.9032. Changes on CPSC and PTB-XL are small, while Georgia is slightly lower than under hard MLP routing. Figure [3](https://arxiv.org/html/2607.01674#S5.F3) visualizes the domain-wise gap between source-aware oracle selection and autonomous MLP top-2 routing. Although CPSC remains the lowest-performing autonomous domain, the largest top-2-to-oracle gap occurs on Chapman, where top-2 reduces the hard-routing gap from 0.0319 to 0.0209 but does not close it.
Figure 3: Domain-wise Macro-F1 gap between source-aware oracle selection and autonomous MLP top-2 routing. Positive values indicate the remaining autonomous source-inference gap.

TABLE III: Final Macro-F1 by domain, averaged over seeds 42–44.
### V-D. Source-Aware and Offline References
The matched offline independent-head reference reaches $0.7885\pm0.0009$ Macro-F1, while source-aware sequential expert selection reaches $0.7915\pm0.0036$. Each binary linear expert contains 2,050 trainable parameters, so the four-domain expert bank adds only 8,200 trainable expert parameters beyond the frozen backbone. The near equality of the offline and source-aware sequential references is an important control: when source identity is supplied, sequentially adding isolated heads performs on par with training the same heads offline. The source-aware row should therefore be interpreted as an oracle reference for routing rather than as an autonomous deployment result. Likewise, its zero BWT follows from never updating the backbone or earlier heads.
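The parameter count follows directly from Eq. (3). Each expert has a $2\times1024$ weight matrix and, as implied by the reported 2,050 figure, a bias vector $b_t\in\mathbb{R}^2$:

```latex
\underbrace{2 \times 1024}_{W_t} + \underbrace{2}_{b_t} = 2050
\qquad\Longrightarrow\qquad
4 \times 2050 = 8200
```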
Table [IV](https://arxiv.org/html/2607.01674#S5.T4) reports additional metrics for autonomous MLP top-2 routing. The high Chapman Macro-F1 is accompanied by high balanced accuracy and strong F1 scores for both classes, so it is not driven solely by the majority class. CPSC and Georgia retain lower negative-class F1, consistent with the remaining ambiguity observed in the routing analysis.
TABLE IV: Additional classification metrics for autonomous MLP top-2 routing, averaged over three seeds. $F1_{-}$ and $F1_{+}$ are the negative- and positive-class F1 scores.
### V-E. Order Robustness and Feature-Memory Budget
Figure [4](https://arxiv.org/html/2607.01674#S5.F4) compares top-2 with the oracle trained under the same domain order. The gap is 0.0133 for the main order, 0.0111 for the reverse order, and 0.0132 for a fixed random order. We emphasize within-order gaps because changing the order also changes the random training trajectory of the domain experts. The stable gap suggests that the remaining bottleneck is source inference rather than a peculiarity of one domain order.
Figure 4: Within-order Macro-F1 gap between source-aware oracle selection and autonomous MLP top-2 routing across three domain orders. Error bars denote sample standard deviations over three seeds.

Figure [5](https://arxiv.org/html/2607.01674#S5.F5) summarizes the feature-memory trade-off. Retaining 10% of train features places top-2 within 0.0063 Macro-F1 of the full-memory result while reducing feature storage from 139.82 MiB to 13.98 MiB. The 1% setting remains above the pooled-head control but loses route accuracy. Retained features are the frozen train-feature rows available to the router. The MiB values include only stored 1024-dimensional float32 feature vectors; labels, validation features used for benchmark model selection, and container overhead are excluded. The protocol thus avoids raw-ECG replay but remains feature-memory based rather than memory-free.
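Because each stored row is a 1024-dimensional float32 vector, that is 4 KiB, the storage figures convert directly into row counts. The helper below assumes that MiB means $2^{20}$ bytes. Under that assumption, the reported 139.82 MiB corresponds to roughly 35,800 retained train-feature rows, and 13.98 MiB to roughly 3,580 rows. These counts are inferred here and are not stated in the text.

```python
BYTES_PER_FLOAT32 = 4
FEATURE_DIM = 1024
MIB = 2 ** 20  # assumption: "MiB" means 2**20 bytes


def feature_mib(n_rows: int, dim: int = FEATURE_DIM) -> float:
    """Storage for n_rows float32 feature vectors, excluding labels and container overhead."""
    return n_rows * dim * BYTES_PER_FLOAT32 / MIB


def rows_for_mib(mib: float, dim: int = FEATURE_DIM) -> float:
    """Inverse: how many feature rows a given MiB figure corresponds to."""
    return mib * MIB / (dim * BYTES_PER_FLOAT32)
```

For example, `rows_for_mib(139.82)` is about 35,794, and `feature_mib(3579)` rounds to 13.98. This is consistent with the 10% budget keeping about one tenth of the rows.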
Figure 5: Feature-memory trade-off for autonomous MLP top-2 routing. (a) Final Macro-F1 under different retained train-feature budgets. (b) Test route accuracy under the same budgets. The dashed line marks the pooled single-head baseline. Error bars denote sample standard deviations over three seeds. The 10% setting approaches the full-memory result while using substantially less feature storage.
### V-F. Shared-Parameter Continual-Learning Baselines
Table [V](https://arxiv.org/html/2607.01674#S5.T5) places autonomous routing beside shared-parameter baselines. These methods address a different failure mechanism: they use one shared predictor and can therefore suffer weight interference, whereas IRFE-ECG grows isolated experts and must infer which one applies. The small-replay baseline stores raw training tensors from previous domains, whereas IRFE-ECG stores only frozen feature vectors. The comparison is intended to contrast deployment failure modes rather than to match memory or parameter budgets exactly.
TABLE V: Shared-parameter baselines over three seeds. Macro-F1 is reported as mean ± standard deviation; BWT is the mean over the three runs. A dash indicates that BWT is not interpreted for the routed expert bank.

The shared linear head is the strongest shared-parameter baseline. Small replay improves over EWC and LwF in final Macro-F1 but still exhibits negative BWT. SI has the least negative mean BWT among the regularization rows. In contrast, IRFE-ECG avoids expert-parameter interference by construction, and its remaining error is dominated by routing. This difference in failure mode is why routed BWT is omitted from the main comparison.
### V-G. Expert and Loss Ablations
TABLE VI: Oracle-selected expert and loss ablations over three seeds. $\Delta$ is relative to the linear Balanced-Softmax main expert bank. Domain columns report Macro-F1 averaged over seeds 42–44. CE denotes cross entropy.

Table [VI](https://arxiv.org/html/2607.01674#S5.T6) reports oracle-selected expert and loss ablations. The simple linear expert trained with Balanced Softmax remains the strongest source-aware design, reaching $0.7915\pm0.0036$ Macro-F1. Replacing Balanced Softmax with weighted cross entropy lowers performance to $0.7862\pm0.0016$, while adding residual feature adapters reduces Macro-F1 further, to $0.7760\pm0.0015$.
The validation-selected bank also fails to improve over the simple linear Balanced-Softmax bank, reaching $0.7833\pm0.0057$ Macro-F1. This bank chose the expert type for each domain by validation performance, selecting among linear Balanced-Softmax, weighted cross-entropy, and residual-adapter experts. Its higher variance and seed-dependent expert choices suggest sensitivity to validation noise rather than a robust advantage of heterogeneous experts. We therefore retain the simple Balanced-Softmax linear expert bank as the central expert design; the source-aware result is not produced by expert-type search, residual adapters, or loss-function tuning.
## VI. Discussion
### VI-A. Retention and Routing Are Different Problems
The experiments separate two quantities that are often conflated in expert systems. Frozen source-specific heads preserve their parameters exactly after training, so source-aware performance reflects retained expert decision boundaries rather than autonomous deployment ability. Autonomous performance is lower because an input may be assigned to an unsuitable expert. The matched offline reference further shows that, once independent heads and a frozen backbone are used, sequential presentation does not by itself degrade source-aware performance. The relevant deployment question is therefore how closely an autonomous router can approach the source-aware reference without access to source metadata.
### VI-B. Interpretation of Top-2 Fusion
The domain MLP learns nonlinear decision boundaries that a single domain mean cannot represent, which is especially useful for overlapping source distributions such as CPSC and Georgia. Top-2 fusion does not change route accuracy because it uses the same router; instead, it can lower the cost of an incorrect top-1 decision. Validation thresholds place each expert's margin on a comparable decision scale, and router probabilities determine how those margins are combined. The average gain is only $0.0026$ and is not significant under the paired bootstrap, so we treat fusion as a modest practical option rather than the main contribution. The remaining oracle gap shows that fusion cannot compensate when neither selected expert matches the input well.
### VI-C. Scope and Limitations
Several limitations define the scope of this study. First, the current benchmark does not have a verified patient-level split. Waveform-hash grouping removes exact duplicate waveforms across partitions, but it cannot determine whether different recordings from the same patient appear in different partitions. Because the processed splits and frozen-feature caches used in this study do not retain reliable patient identifiers, patient-level leakage can be neither excluded nor quantified. Consequently, the results should be interpreted as record-level benchmark evidence for autonomous source inference, not as patient-independent clinical validation. Future work should reconstruct patient-linked metadata and repeat the evaluation with patient-independent partitions before making clinical validation claims.
Second, the benchmark contains four binary normal/abnormal domains; multi-label diagnosis may require different experts and routing objectives. Third, the router stores frozen train features from previous domains. The protocol avoids raw ECG replay but is not memory-free and makes no privacy guarantee; ECG embeddings may still encode patient-specific information and should be governed accordingly in deployment. Fourth, isolated expert banks grow linearly with the number of domains, although each linear head is small. Finally, the top-2 gain is modest, and autonomous routing still trails same-order oracle selection by approximately 0.011–0.013 Macro-F1.
The comparison also does not establish a universal ECG state of the art. Existing ECG studies use different label spaces, leads, splits, and metrics, so their reported scores are not directly comparable with this four-domain binary Macro-F1 protocol. Our claim is restricted to internally matched continual deployment experiments under the stated data and validation procedure.
## VII Conclusion
We separated expert retention from autonomous source inference in a raw-ECG-replay-free continual ECG setting. Frozen source-specific experts reach 0.7915 ± 0.0036 Macro-F1 under source-aware oracle selection and remain close to a matched offline independent-head reference, whereas autonomous MLP top-2 routing reaches 0.7782 ± 0.0022. The small, non-significant top-2 gain over hard MLP routing does not alter the main finding: source inference, rather than forgetting of expert parameters, is the limiting factor across domain orders. The system retains frozen features and is therefore not memory-free. Future work should validate patient-level splits, reduce feature storage, and extend the analysis to multi-label ECG diagnosis before making stronger clinical deployment claims.
## Code Availability
The code and configuration files are available at https://github.com/yufanlu221/IRFE-ECG, and the repository includes scripts for expert training, autonomous routing, feature-memory experiments, order-robustness evaluation, expert/loss ablations, and figure generation. Raw ECG waveforms and large derived feature caches are not redistributed; users should obtain the public ECG datasets according to their original licenses and generate the required caches locally.
## Acknowledgment
The authors thank all contributors and collaborators who provided feedback on this work\.
## References
- [1] R. Aljundi, P. Chakravarty, and T. Tuytelaars (2017) Expert gate: lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3366–3375.
- [2] H. Gao, X. Wang, Z. Chen, M. Wu, J. Li, and C. Liu (2023) ECG-CL: a comprehensive electrocardiogram interpretation method based on continual learning. IEEE Journal of Biomedical and Health Informatics 27(11), pp. 5225–5236. doi: 10.1109/JBHI.2023.3315715.
- [3] G. Kim, S. Esmaeilpour, C. Xiao, and B. Liu (2022) Continual learning based on OOD detection and task masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 3856–3866.
- [4] J. Kim, M. H. Lim, K. Kim, and H. Yoon (2024) Continual learning framework for a multicenter study with an application to electrocardiogram. BMC Medical Informatics and Decision Making 24(1), pp. 67. doi: 10.1186/s12911-024-02464-9.
- [5] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13), pp. 3521–3526.
- [6] D. Kiyasseh, T. Zhu, and D. Clifton (2021) A clinical deep learning framework for continually learning from cardiac signals across diseases, time, modalities, and institutions. Nature Communications 12, pp. 4221. doi: 10.1038/s41467-021-24483-0.
- [7] A. Li, H. Li, and G. Yuan (2024) Continual learning with deep neural networks in physiological signal data: a survey. Healthcare 12(2), pp. 155. doi: 10.3390/healthcare12020155.
- [8] J. Li, A. D. Aguirre, V. Moura Junior, J. Jin, C. Liu, L. Zhong, C. Sun, G. Clifford, M. B. Westover, and S. Hong (2025) An electrocardiogram foundation model built on over 10 million recordings. NEJM AI 2(7). doi: 10.1056/AIoa2401033.
- [9] Z. Li and D. Hoiem (2018) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12), pp. 2935–2947.
- [10] A. K. Menon, S. Jayasumana, A. S. Rawat, H. Jain, A. Veit, and S. Kumar (2021) Long-tail learning via logit adjustment. In International Conference on Learning Representations.
- [11] C. V. Nguyen, H. M. Duong, and C. D. Do (2024) MELEP: a novel predictive measure of transferability in multi-label ECG diagnosis. Journal of Healthcare Informatics Research 8(3), pp. 506–522. doi: 10.1007/s41666-024-00168-3.
- [12] S. Rahmani, R. Chatterjee, A. Etemad, and J. Hashemi (2025) Dynamic prototype rehearsal for continual ECG arrhythmia detection. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5.
- [13] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) iCaRL: incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5533–5542. doi: 10.1109/CVPR.2017.587.
- [14] J. Ren, C. Yu, S. Sheng, X. Ma, H. Zhao, S. Yi, and H. Li (2020) Balanced meta-softmax for long-tailed visual recognition. In Advances in Neural Information Processing Systems.
- [15] M. A. Reyna, N. Sadr, E. A. Perez Alday, A. Gu, A. J. Shah, C. Robichaux, A. B. Rad, A. Elola, S. Seyedi, S. Ansari, H. Ghanbari, Q. Li, A. Sharma, and G. D. Clifford (2021) Will two do? Varying dimensions in electrocardiography: the PhysioNet/Computing in Cardiology challenge 2021. In Computing in Cardiology, Vol. 48, pp. 1–4. doi: 10.23919/CinC53138.2021.9662687.
- [16] J. S. Smith, L. Karlinsky, V. Gutta, P. Cascante-Bonilla, D. Kim, A. Arbelle, R. Panda, R. Feris, and Z. Kira (2023) CODA-Prompt: continual decomposed attention-based prompting for rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [17] P. Wagner, N. Strodthoff, R. Bousseljot, D. Kreiseler, F. I. Lunze, W. Samek, and T. Schaeffter (2020) PTB-XL, a large publicly available electrocardiography dataset. Scientific Data 7(1), pp. 154.
- [18] Z. Wang, Z. Zhang, C. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister (2022) Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [19] F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 3987–3995.
- [20] J. Zheng, J. Zhang, S. Danioko, H. Yao, H. Guo, and C. Rakovski (2020) A 12-lead electrocardiogram database for arrhythmia research covering more than 10,000 patients. Scientific Data 7(1), pp. 48.