ERP-XTTN: Interpretable Prototype-Guided Cross-Attention for Cross-Subject ERP Classification

arXiv cs.LG 06/03/26, 04:00 AM Papers
eeg-classification erp brain-computer-interface cross-attention interpretable-ai deep-learning
Summary
Introduces ERP-XTTN, a cross-attention architecture for interpretable ERP classification across subjects without calibration. Evaluated on multiple datasets, it achieves competitive performance with black-box models while providing transparent routing insights.
arXiv:2606.02939v1 Announce Type: new Abstract: Interpretable brain-computer interface classifiers that generalize across subjects without calibration remain an open challenge. We test whether prototype-based cross-attention can provide competitive, interpretable event-related potential (ERP) classification under deployment-compatible conditions. We propose ERP-XTTN, a cross-attention architecture that routes input EEG patches to fixed difference-wave prototypes via query-key-only cross-attention with no value projection, so classification depends entirely on attention routing and attention faithfulness is structural rather than post-hoc. Prototypes are derived automatically from extrema in the training-fold difference wave. We evaluate across three public sources (BNCI Horizon 2020, HRI Cursor, and ERP CORE) spanning eight ERP components (ERN, LRP, ErrP, N170, P300, N2pc, MMN, N400), using leave-one-subject-out (LOSO) evaluation with causal filtering at two channel counts (3-channel and full montage), against EEGNet and xDAWN with Riemannian geometry (xDAWN+RG). The mean gap between the best baseline and ERP-XTTN was .018 AUROC at 3 channels and .034 at full montage, arising from two largely distinct sources: a temporal-flexibility cost relative to EEGNet and a spatial-exploitation cost relative to xDAWN+RG, the latter driven by signal-to-noise ratio at full montage. Beyond accuracy, the transparent routing reveals cross-subject signal structure that black-box models cannot: false positives resembled true positives more than true negatives did, indicating that classification errors are neurophysiologically explicable. ERP-XTTN generalizes across diverse ERPs under causal, calibration-free conditions with a small interpretability cost at minimal montages. To our knowledge, this is the first epoch-level LOSO benchmark on ERP CORE.
# ERP-XTTN: Interpretable Prototype-Guided Cross-Attention for Cross-Subject ERP Classification
Source: [https://arxiv.org/html/2606.02939](https://arxiv.org/html/2606.02939)
Charlotte Genevier Wyman¹,∗ \(ORCID 0009\-0000\-8927\-0485\) and Leanne Hirshfield¹ \(ORCID 0000\-0003\-0111\-6948\)

¹University of Colorado Boulder, Boulder, CO, United States of America

∗Author to whom any correspondence should be addressed\. Email: charlotte\.wyman@colorado\.edu

###### Abstract

Objective: Interpretable brain\-computer interface classifiers that generalize across subjects without calibration remain an open challenge\. We evaluated whether prototype\-based cross\-attention can provide competitive, inherently interpretable event\-related potential \(ERP\) classification across diverse paradigms under deployment\-compatible conditions\.

Approach: We propose ERP\-XTTN \(ERP Cross\-Attention\), a cross\-attention architecture that routes input electroencephalographic patches to fixed difference\-wave prototypes via query\-key\-only cross\-attention with no value projection, so classification depends entirely on attention routing\. Prototypes are derived automatically from prominent extrema in the training\-fold grand\-average difference wave\. We evaluated across three public sources \(BNCI Horizon 2020, HRI Cursor, and ERP CORE\) encompassing eight ERP components \(ERN, LRP, ErrP, N170, P300, N2pc, MMN, N400\)\. Evaluations used leave\-one\-subject\-out \(LOSO\) cross\-validation with causal filtering under two channel conditions \(3\-channel and full montage\), compared against EEGNet and xDAWN with Riemannian geometry \(xDAWN\+RG\)\.

Main results: At 3 channels, the mean performance gap between the best baseline and ERP\-XTTN was \.018 area under the receiver operating characteristic curve \(AUROC\); at full montage, \.034\. The gap was associated with two largely distinct sources: a temporal flexibility cost relative to EEGNet, associated with attention entropy, routing discriminability, and signal\-to\-noise ratio \(SNR\); and a spatial exploitation cost relative to xDAWN\+RG, driven by SNR at full montage only\. For two components \(N400, N170\), the dominant routing did not concentrate in the canonical named component’s prototype window, suggesting the named deflection does not always carry the dominant cross\-subject discriminative signal\. False positives morphologically resembled true positives more than true negatives did across most datasets, indicating classification errors are neurophysiologically explicable\.

Significance: ERP\-XTTN generalizes across diverse ERP morphologies under causal, calibration\-free conditions with a small interpretability cost at minimal montages\. The transparent routing provides structural insights into cross\-subject signal organization that black\-box models cannot offer\. To our knowledge, this is the first epoch\-level LOSO benchmark on ERP CORE\.

###### keywords:

EEG classification, brain\-computer interface, event\-related potentials, cross\-subject generalization, interpretable deep learning, cross\-attention

## 1 Introduction

Event\-related potentials \(ERPs\) are time\-locked neural responses measurable via electroencephalography \(EEG\) that index sensory, attentional, and cognitive processing\[[19](https://arxiv.org/html/2606.02939#bib.bib1)\] and underlie a range of brain\-computer interface \(BCI\) applications such as error detection and spelling systems\[[31](https://arxiv.org/html/2606.02939#bib.bib4)\]\. Deployment\-ready ERP classification requires meeting several constraints simultaneously\. For online compatibility, filtering must be causal: acausal filtering incorporates future samples, producing accuracy estimates that cannot be reproduced in any forward\-processing pipeline\. Practical deployment often limits channel count, as many commercial and mobile devices lack research\-grade electrode coverage\[[25](https://arxiv.org/html/2606.02939#bib.bib2)\]\. Classification must generalize across subjects, since per\-subject calibration delays setup and limits scalability\[[18](https://arxiv.org/html/2606.02939#bib.bib32)\], and standard calibration pipelines fail to produce a functional classifier for a non\-negligible portion of users\[[28](https://arxiv.org/html/2606.02939#bib.bib3)\]\. Finally, clinical and safety\-critical applications demand that classification decisions be interpretable, not merely accurate\.

These constraints are compounded by the diversity of ERP morphologies\. ERPs span frontocentral error responses such as the error\-related negativity \(ERN\), posterior visual components such as the N170, diffuse, low signal\-to\-noise ratio \(SNR\) responses such as N400, lateralized components such as N2pc and lateralized readiness potential \(LRP\), passive auditory responses such as mismatch negativity \(MMN\), and broader error\-related potential \(ErrP\) families\[[19](https://arxiv.org/html/2606.02939#bib.bib1)\]\. A general cross\-subject framework must handle this diversity without requiring component\-specific architecture or preprocessing choices\.

Cross\-subject ERP classification has been addressed from multiple directions\. General\-purpose deep learning architectures such as EEGNet\[[14](https://arxiv.org/html/2606.02939#bib.bib11),[27](https://arxiv.org/html/2606.02939#bib.bib37),[17](https://arxiv.org/html/2606.02939#bib.bib13)\] have been evaluated across multiple BCI paradigms, and CNN\-transformer architectures have demonstrated competitive cross\-subject ErrP performance under leave\-one\-subject\-out \(LOSO\) evaluation\[[23](https://arxiv.org/html/2606.02939#bib.bib17),[22](https://arxiv.org/html/2606.02939#bib.bib18)\], though the latter provide no mechanistic insight into individual classification decisions\. Classical pipelines such as shrinkage linear discriminant analysis \(LDA\)\[[5](https://arxiv.org/html/2606.02939#bib.bib8)\] achieve strong cross\-subject ERP classification and can reveal discriminative spatial patterns, but their interpretability rests on a fixed decision rule rather than trial\-specific evidence for each classification\. xDAWN spatial filtering\[[24](https://arxiv.org/html/2606.02939#bib.bib19)\] with Riemannian geometry \(xDAWN\+RG\) features also perform well cross\-subject\[[15](https://arxiv.org/html/2606.02939#bib.bib34)\] but lack obvious physiological interpretation\[[17](https://arxiv.org/html/2606.02939#bib.bib13)\]\. Calibration\-free cross\-subject ErrP classification has been demonstrated with generalized LDA\[[26](https://arxiv.org/html/2606.02939#bib.bib20)\], an xDAWN\+RG pipeline with a support vector machine classifier\[[12](https://arxiv.org/html/2606.02939#bib.bib36)\], ensemble classifiers\[[4](https://arxiv.org/html/2606.02939#bib.bib7)\], and online PCA\-based pipelines\[[16](https://arxiv.org/html/2606.02939#bib.bib12)\], though without inherent interpretability\.
Domain adaptation and domain generalization methods address cross\-subject distribution shift in ERP\-based BCIs\[[33](https://arxiv.org/html/2606.02939#bib.bib23),[32](https://arxiv.org/html/2606.02939#bib.bib24),[20](https://arxiv.org/html/2606.02939#bib.bib14)\], but are not designed for interpretability\.

Interpretability for deep EEG architectures has been pursued post\-hoc and under offline preprocessing, including EEGNet’s feature\-visualization and DeepLIFT relevance analyses\[[14](https://arxiv.org/html/2606.02939#bib.bib11)\] and Grad\-CAM applied to ERP classifiers\[[10](https://arxiv.org/html/2606.02939#bib.bib10)\]\. Post\-hoc importance estimation has also been used to guide channel and time\-window selection for ErrP decoding, with the estimates validated against error\-processing neurophysiology\[[6](https://arxiv.org/html/2606.02939#bib.bib15)\]\. These methods can identify features associated with a trained model’s predictions and map them to known neurophysiological phenomena, in some cases at the single\-trial level, but because they are approximations computed after training, their faithfulness to the decision process is not guaranteed; they are not evidence that is, by construction, the basis of each classification\. Prototype\-based approaches using learned embeddings achieve cross\-subject P300 decoding\[[29](https://arxiv.org/html/2606.02939#bib.bib21),[37](https://arxiv.org/html/2606.02939#bib.bib26)\], though only within a single paradigm and without inherent interpretability\. Discriminative canonical pattern matching with time\-domain ERP templates has been demonstrated across multiple ERP paradigms but only within\-subject\[[36](https://arxiv.org/html/2606.02939#bib.bib25)\]\.

Cross\-subject evaluation on ERP CORE has been reported using fold\-based splits\[[1](https://arxiv.org/html/2606.02939#bib.bib5)\] and time\-point\-wise LOSO decoding\[[21](https://arxiv.org/html/2606.02939#bib.bib16)\], but neither provides an epoch\-level LOSO classification benchmark under deployment\-compatible constraints such as causal filtering\. To our knowledge, no prior approach combines inherent interpretability, LOSO evaluation, causal filtering, and coverage across diverse ERP components\.

This work demonstrates that prototype\-based attention routing can provide competitive, inherently interpretable cross\-subject ERP classification across diverse paradigms under deployment\-compatible constraints\. Here, we generalize ERP\-XTTN \(ERP Cross\-Attention\)\[[34](https://arxiv.org/html/2606.02939#bib.bib38)\], a cross\-attention architecture in which classification depends entirely on interpretable attention routing, from its original ErrP\-specific application by replacing constrained polarity\-based prototype extraction with fully automatic peak detection, requiring no component\-specific architecture, preprocessing, or training choices\. We evaluate across three public sources \(BNCI Horizon 2020, HRI Cursor, and ERP CORE\) encompassing eight ERP components, using LOSO cross\-validation with causal infinite impulse response \(IIR\) filtering under two channel conditions, and compare against EEGNet and xDAWN\+RG\. To our knowledge, this is the first epoch\-level LOSO benchmark on ERP CORE under deployment\-compatible constraints\. Beyond classification performance, we systematically analyze the interpretability cost: what drives the performance gap, when it is small, and what the architecture’s transparent routing reveals about cross\-subject ERP structure\.

## 2 Methods

This section describes the nine datasets, the shared preprocessing pipeline, the ERP\-XTTN architecture, the two baseline methods, the training procedure, and the evaluation protocol\.

### 2\.1 Datasets

Nine datasets spanning eight ERP components, with ErrP represented by two datasets, were evaluated using both a targeted 3\-channel montage and the full available montage\. Table [1](https://arxiv.org/html/2606.02939#S2.T1) contains a high\-level summary of the datasets, with details described below\.

Table 1: List of datasets evaluated\. EEG channels only; electrooculography/reference channels were excluded where applicable\. Detection channel: the channel used to locate prototype time windows \(peak detection\); see Section [2\.3](https://arxiv.org/html/2606.02939#S2.SS3)\. Trials/Subject refers to total classification epochs across both classes and is summed across sessions where applicable \(BNCI includes two sessions per subject\)\.

#### 2\.1\.1 BNCI Horizon 2020 013\-2015 \(BNCI\) \- ErrP

BNCI \(BNCI Horizon 2020 013\-2015, publicly available at [https://bnci\-horizon\-2020\.eu/database/data\-sets](https://bnci-horizon-2020.eu/database/data-sets)\) is a 64\-channel EEG dataset designed to elicit ErrP\[[7](https://arxiv.org/html/2606.02939#bib.bib27)\]\. Six subjects were asked to monitor the performance of an agent moving a cursor toward a target; errors occurred on roughly 20% of trials\. Data were collected over two sessions, each averaging 110 error trials and 426 correct trials for a total of roughly 536 trials per session\.

#### 2\.1\.2 HRI Cursor \(HRI\) \- ErrP

HRI \(HRI Cursor, publicly available at [https://github\.com/stefan\-ehrlich/dataset\-ErrP\-HRI](https://github.com/stefan-ehrlich/dataset-ErrP-HRI)\) is an EEG dataset recorded with 32 active electrodes \(27 scalp EEG, all analyzed here, plus 3 electrooculography and 2 mastoid reference\) designed to elicit ErrP during a choice\-reaction\-time task with cursor\-movement feedback\[[8](https://arxiv.org/html/2606.02939#bib.bib28)\]\. Eleven subjects responded to one of three target stimuli via keypress, and feedback was presented as a cursor moving toward or away from the target; errors occurred on roughly 35% of trials\. Only the cursor scenario was used in this work; a companion robot\-head\-turn scenario from the same study was excluded\. Subjects had an average of 164 error trials and 319 correct trials for a total of roughly 483 trials\.

#### 2\.1\.3 ERP CORE

The remaining seven datasets are drawn from ERP CORE\[[11](https://arxiv.org/html/2606.02939#bib.bib29)\] \(available at [https://erpinfo\.org/erp\-core](https://erpinfo.org/erp-core)\), a publicly available resource providing standardized paradigms for 7 ERPs and data from 40 subjects across 30 EEG channels\. Note that the original ERP CORE publication\[[11](https://arxiv.org/html/2606.02939#bib.bib29)\] reports smaller per\-component samples \(N = 34–39\) after exclusions based on behavioral\-accuracy and artifact\-rejection criteria tied to their ERP\-averaging pipeline\. We retained all 40 participants because our preprocessing omits the baseline correction, re\-referencing, and ocular\-artifact correction those criteria presuppose, and because retaining all participants provides a more stringent test of cross\-subject generalization\.

ERN: The error\-related negativity was elicited using an Eriksen flankers task in which subjects identified the direction of a central arrowhead flanked by congruent or incongruent distractors\. Trials were response\-locked and classified as incorrect vs correct responses\. Subjects averaged approximately 45 incorrect and 356 correct responses\.

LRP: The lateralized readiness potential was elicited using the same flankers task, restricted to correctly\-responded trials and analyzed with respect to response hand\. Trials were response\-locked and classified as left\-hand vs right\-hand responses\. Subjects averaged approximately 177 left\-hand and 179 right\-hand responses\.

MMN: The mismatch negativity was elicited using a passive auditory oddball task\. Standard \(80 dB, p=0\.8\) and deviant \(70 dB, p=0\.2\) tones were presented while subjects watched a silent video\. Trials were classified as deviant vs standard\. Subjects averaged approximately 199 deviant and 782 standard trials\.

N170: The N170 was elicited using a face perception task in which subjects judged whether each stimulus was an intact object \(faces or cars\) or a texture \(scrambled faces or scrambled cars\)\. Only intact stimuli were used; trials were classified as face vs car\. Subjects averaged 80 face and 80 car trials\.

N2pc: The N2pc is a lateralized ERP component associated with covert attentional selection\. It was elicited using a visual search task in which subjects viewed arrays of colored squares in both hemifields\. On each trial, they identified whether a small notch in the color\-defined target square appeared on the top or bottom edge\. Trials were classified as left\-target vs right\-target\. Subjects averaged approximately 160 left\-target and 160 right\-target trials\.

N400: The N400 was elicited using a word pair judgment task\. A red prime word was followed by a green target word, and subjects indicated whether the pair was semantically related or unrelated\. Trials were classified as unrelated vs related\. Subjects averaged 60 unrelated and 60 related trials\.

P300: The P300 was elicited using an active visual oddball task in which letters were presented equiprobably and one was designated the target per block\. Trials were classified as target vs non\-target\. Subjects averaged 40 target and 160 non\-target trials\.

This study used only publicly available, de\-identified datasets\. No new human subjects data were collected\. Ethical approval for the original data collection was obtained by the original investigators: the ERP CORE dataset was approved by the Institutional Review Board at the University of California, Davis\[[11](https://arxiv.org/html/2606.02939#bib.bib29)\]; the HRI ErrP dataset was approved by the ethics commission of the Faculty of Medicine, Technische Universität München \(reference number 236/15s\)\[[8](https://arxiv.org/html/2606.02939#bib.bib28)\]\. The original publication describing the BNCI dataset\[[7](https://arxiv.org/html/2606.02939#bib.bib27)\] does not include a formal ethics statement; the data were collected at the Idiap Research Institute, Martigny, Switzerland\.

### 2\.2 Preprocessing

A causal IIR 4th\-order Butterworth 1–10 Hz bandpass filter was applied to all datasets\. This frequency range was selected to capture the primary spectral content of ERP components while attenuating muscle artifact and high\-frequency noise, and was fixed across all datasets prior to evaluation\. BNCI was downsampled from 512 Hz to 256 Hz after filtering\. HRI was recorded at 1000 Hz but distributed pre\-downsampled to 256 Hz, requiring no further downsampling\. All ERP CORE datasets were downsampled from 1024 Hz to 256 Hz after filtering\. Baseline correction, re\-referencing, and spatial filtering were omitted to evaluate performance under constraints compatible with real\-time, zero\-calibration deployment, and all datasets were epoched to 0–800 ms after the dataset\-specific time\-locking event\. For the response\-locked ERP CORE LRP and ERN paradigms, this window captures post\-response lateralized motor activity and error processing respectively, rather than pre\-response preparatory components\. All datasets were evaluated using two channel sets: a 3\-channel set tailored to the ERP of interest \(Table [1](https://arxiv.org/html/2606.02939#S2.T1)\), and the full montage available for the dataset\.
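The preprocessing chain described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, SciPy's `butter` with `N=4` is only one reading of "4th-order" (SciPy doubles the order for band-pass designs), and downsampling is done by plain stride decimation, which the text does not specify.

```python
import numpy as np
from scipy.signal import butter, sosfilt


def causal_bandpass(x, fs, low=1.0, high=10.0, order=4):
    """Forward-only (causal) Butterworth band-pass along the last axis."""
    # Assumption: order=4 is passed as SciPy's N. For band-pass designs SciPy
    # returns a filter of order 2N, so the paper's "4th-order" may mean N=2.
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    # sosfilt runs forward in time only; sosfiltfilt would be acausal.
    return sosfilt(sos, x, axis=-1)


def preprocess(raw, fs_in, event_samples, fs_out=256, tmax=0.8):
    """raw: (channels, samples) at fs_in; event_samples: event indices at fs_in."""
    step = fs_in // fs_out                 # assumes an integer ratio (512 or 1024 to 256)
    filtered = causal_bandpass(raw, fs_in)
    down = filtered[:, ::step]             # assumption: stride decimation after filtering
    n = int(round(tmax * fs_out)) + 1      # 0 to 800 ms, endpoints inclusive: 206 samples
    starts = [int(s) // step for s in event_samples]
    return np.stack([down[:, s:s + n] for s in starts])   # (epochs, channels, n)
```

A quick way to see the causality property is to filter a unit impulse: every output sample before the impulse is exactly zero, which is not true of zero-phase (forward-backward) filtering.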

### 2\.3 Architecture \(ERP\-XTTN\)

The proposed model, ERP\-XTTN, classifies trial epochs by routing input EEG patches to a set of fold\-specific ERP prototypes via query\-key \(QK\)\-only cross\-attention, then classifying directly from the resulting attention\-weight distribution \(Figure [1](https://arxiv.org/html/2606.02939#S2.F1)\)\. Prototypes are derived from the grand\-average training\-fold difference wave and are computed twice within each LOSO fold: first on the phase\-1 training split used for validation\-based epoch selection, and again on the full non\-test training pool for phase\-2 retraining at the selected epoch count\.

![Refer to caption](https://arxiv.org/html/2606.02939v1/figures/erp_xttn_architecture.png)

Figure 1: ERP\-XTTN architecture\. Input patches serve as queries into a QK\-only cross\-attention module against K ≤ 4 frozen ERP prototypes derived from the training set’s grand\-average difference wave via automatic peak detection on the detection channel \(Section [2\.3](https://arxiv.org/html/2606.02939#S2.SS3)\)\. No value projection is used; the attention\-weight distribution over prototypes is the sole input to the classification head, making attention faithfulness a structural property of the architecture\. Patch and positional embeddings are shared between the input and prototype paths \(dotted line\)\. Prototypes are computed twice per LOSO fold: once on the phase\-1 training split for validation\-based epoch selection, and again on the full non\-test training pool for final retraining\.

Whereas the original ERP\-XTTN\[[34](https://arxiv.org/html/2606.02939#bib.bib38)\] located prototype windows using a predefined polarity sequence, the present version derives prototype windows directly from prominent extrema in the training\-fold difference wave, enabling application across datasets with heterogeneous ERP morphologies\. On the two ErrP datasets where both methods were evaluated, automatic and constrained prototype detection achieved equivalent performance; all reported results used the automatic method\. Prototype time windows are located on a single detection channel \(Table [1](https://arxiv.org/html/2606.02939#S2.T1)\), chosen from the available montage as a site within each component’s topographic region that exhibits a clear difference\-wave deflection; the detection\-channel designation is used only for prototype\-window placement\. On this channel, the smoothed difference wave \(Gaussian smoothing, sigma = 2 samples\) is searched for positive and negative peaks occurring at least 50 ms after the time\-locking event\. Peaks are required to exceed a prominence threshold of 0\.02 and are separated by a minimum within\-polarity distance of 80 ms\. Up to four of the most prominent peaks are retained and ordered in time\. Each retained peak is expanded to a window bounded by neighboring zero\-crossings of the smoothed signal and clamped to a width of 40–200 ms\. The multichannel difference wave within each window defines one frozen prototype, yielding a prototype tensor of shape \(K, C, T\), where K is the number of detected prototypes \(capped at four, varying by fold\), C is the number of EEG channels, and T is the number of time samples\. An example set of resulting prototypes for the HRI 3\-channel configuration is shown in Figure [2](https://arxiv.org/html/2606.02939#S2.F2)\.
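The window-detection steps above can be sketched with standard SciPy routines. This is an illustrative reading of the description rather than the authors' code: `prototype_windows` is a hypothetical name, and the way a window is re-centred on its peak when the 40–200 ms clamp applies is an assumption, since the text does not say how the clamp is enforced.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks


def prototype_windows(diff_wave, fs=256, sigma=2.0, min_latency_ms=50,
                      prominence=0.02, min_sep_ms=80, max_k=4,
                      min_width_ms=40, max_width_ms=200):
    """Return up to max_k (start, stop) sample windows, ordered in time."""
    per_ms = fs / 1000.0
    smooth = gaussian_filter1d(np.asarray(diff_wave, dtype=float), sigma)
    sep = max(1, int(round(min_sep_ms * per_ms)))
    candidates = []
    for sign in (1.0, -1.0):   # positive peaks, then negative peaks (within-polarity spacing)
        idx, props = find_peaks(sign * smooth, prominence=prominence, distance=sep)
        candidates += [(float(p), int(i)) for i, p in zip(idx, props["prominences"])
                       if i >= min_latency_ms * per_ms]
    # Keep the most prominent peaks, then order them in time.
    peaks = sorted(i for _, i in sorted(candidates, reverse=True)[:max_k])
    # Index of the first sample after each sign change of the smoothed wave.
    crossings = np.flatnonzero(np.diff(np.sign(smooth)) != 0) + 1
    lo, hi = int(round(min_width_ms * per_ms)), int(round(max_width_ms * per_ms))
    windows = []
    for i in peaks:
        before, after = crossings[crossings <= i], crossings[crossings > i]
        left = int(before.max()) if before.size else 0
        right = int(after.min()) if after.size else smooth.size
        width = min(max(right - left, lo), hi)
        if width != right - left:   # assumption: re-centre on the peak when clamped
            left = min(max(i - width // 2, 0), smooth.size - width)
            right = left + width
        windows.append((left, right))
    return windows
```

Slicing the multichannel difference wave with each returned window would then give the per-prototype waveforms; how those variable-length windows are packed into the (K, C, T) tensor is not specified in the text and is left out here.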

![Refer to caption](https://arxiv.org/html/2606.02939v1/figures/fig_prototypes_hri_errp_3ch_Cz.png)

Figure 2: Difference\-wave prototypes \(Cz channel; HRI ErrP dataset; 3\-channel configuration\) derived from the grand\-average error\-minus\-correct waveform\. Each panel shows one prototype: P1\-diff \(77–200 ms\), Ne\-diff \(200–275 ms\), Pe\-diff \(275–375 ms\), and LateN\-diff \(375–515 ms\)\. Shaded regions indicate detected prototype windows\. Thin lines show individual LOSO fold prototypes \(n = 11\); thick line shows their mean\. Prototypes for all nine datasets are shown in Supplementary Figure S1\.

Input epochs of shape \(C, T\) are divided into non\-overlapping patches of width 8 samples\. At 256 Hz, the epoch window \[0, 800\] ms yields T = 206 samples \(inclusive endpoints\); 25 patches cover 200 samples and the final 6 are discarded, yielding N = 25 patches per epoch\. Each patch is flattened and linearly projected to a 64\-dimensional embedding, then summed with a learned positional embedding\. A dropout rate of 0\.3 is applied after patch embedding\. Prototype waveforms are passed through the same shared patch embedding, mean\-pooled across patch positions, and augmented with positional embeddings indexed at the temporal centers of their detected windows\.
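The patch arithmetic above (206 samples, 25 patches of 8, 6 samples dropped) can be checked with a short sketch. The helper name `patchify` and the channel-major flattening order within a patch are assumptions made for illustration; the text only states that each patch is flattened.

```python
import numpy as np


def patchify(epoch, patch_len=8):
    """Split a (C, T) epoch into non-overlapping patches, one flat vector per patch."""
    c, t = epoch.shape
    n = t // patch_len                       # 206 // 8 = 25 full patches
    used = epoch[:, :n * patch_len]          # the trailing 6 samples are discarded
    # (C, N, P) -> (N, C, P) -> (N, C*P); channel-major order is an assumption
    return used.reshape(c, n, patch_len).transpose(1, 0, 2).reshape(n, c * patch_len)


T = int(round(0.8 * 256)) + 1                # 0 to 800 ms at 256 Hz, endpoints inclusive
patches = patchify(np.zeros((3, T)))
print(T, patches.shape)                      # 206 (25, 24)
```

For the 3-channel configuration each flattened patch therefore has 3 × 8 = 24 values before the linear projection to 64 dimensions.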

Patch embeddings are first passed through a single transformer\-style self\-attention layer using pre\-normalization, four attention heads, residual connection, and dropout of 0\.3 on the attention weights\. Cross\-attention is then computed between input patch queries and prototype keys using separate layer normalizations and linear projections\. Crucially, no value projection is used; instead, the softmax attention weights over prototypes are the sole input to the classification head\. This makes attention faithfulness a structural guarantee rather than an empirical question: the model cannot classify on information that is not reflected in the attention distribution\. In standard transformer architectures, attention weights may not faithfully represent a model’s decision process\[[9](https://arxiv.org/html/2606.02939#bib.bib9),[30](https://arxiv.org/html/2606.02939#bib.bib22)\]; the QK\-only design resolves this by construction, as there is no alternative information pathway through which classification\-relevant features could bypass the attention weights\. The neurophysiological interpretability of the routing patterns is supported by the use of fixed difference\-wave prototypes and by the waveform analyses in Section [3\.3](https://arxiv.org/html/2606.02939#S3.SS3), but is not guaranteed by the architecture alone\.

Finally, attention weights are averaged across heads, producing a tensor of shape \(N, K\), flattened to a vector of length N × K, and passed through a single linear layer to produce a scalar logit\.
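To make the QK-only routing concrete, the following NumPy sketch computes the forward pass from embedded patches and prototypes to a logit. It is a simplified illustration, not the authors' implementation: layer normalization, projection biases, and dropout are omitted, the scaling by the square root of the head dimension is the usual convention and is assumed here, and all names are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def qk_only_logit(patch_emb, proto_emb, w_q, w_k, w_out, b_out, n_heads=4):
    """patch_emb: (N, d); proto_emb: (K, d). Returns (logit, attention of shape (N, K))."""
    n, d = patch_emb.shape
    k = proto_emb.shape[0]
    dh = d // n_heads
    q = (patch_emb @ w_q).reshape(n, n_heads, dh).transpose(1, 0, 2)    # (H, N, dh)
    key = (proto_emb @ w_k).reshape(k, n_heads, dh).transpose(1, 0, 2)  # (H, K, dh)
    attn = softmax(q @ key.transpose(0, 2, 1) / np.sqrt(dh))            # softmax over prototypes
    attn = attn.mean(axis=0)                                            # average heads: (N, K)
    # No value projection: the flattened attention map is the only classifier input.
    logit = attn.reshape(-1) @ w_out + b_out
    return logit, attn
```

Because the logit is a linear function of the N × K attention weights alone, each classifier weight can be read directly as the contribution of one patch-to-prototype routing entry.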

For a representative 3\-channel, four\-prototype configuration, ERP\-XTTN has 28,645 trainable parameters\.
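One layer breakdown that reproduces the reported total is sketched below. The decomposition is an inference from the description, not something the text states: it assumes biases on every linear layer, a learned positional table with one row per patch, a self-attention block with Q, K, V, and output projections but no feed-forward sublayer, and full-width (64 to 64) query and key projections in the cross-attention.

```python
C, P, d, N, K = 3, 8, 64, 25, 4   # channels, patch width, embed dim, patches, prototypes


def linear(n_in, n_out):
    """Parameter count of a dense layer with bias."""
    return n_in * n_out + n_out


counts = {
    "patch embedding":         linear(C * P, d),    # 1,600
    "positional embedding":    N * d,               # 1,600
    "self-attn layer norm":    2 * d,               # 128
    "self-attn Q, K, V, out":  4 * linear(d, d),    # 16,640
    "cross-attn layer norms":  2 * (2 * d),         # 256
    "cross-attn Q and K":      2 * linear(d, d),    # 8,320
    "classifier":              linear(N * K, 1),    # 101
}
total = sum(counts.values())
print(total)                                        # 28645
```

Under these assumptions most of the parameters sit in the two attention blocks, while the classification head itself has only 101 parameters.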

### 2\.4 Baselines

We compared ERP\-XTTN against two established baselines representing complementary classification paradigms: EEGNet, a compact convolutional neural network widely used for EEG decoding, and xDAWN\+RG with logistic regression, a classical pipeline combining supervised spatial filtering with Riemannian geometry classification\. Together these cover the dominant deep learning and feature\-engineering approaches to cross\-subject ERP classification, and have been benchmarked directly against one another across ERP paradigms\[[14](https://arxiv.org/html/2606.02939#bib.bib11)\]\.

#### 2\.4\.1 EEGNet

As a deep learning baseline, we used EEGNet with F1=8, D=2, and a kernel length of 128 samples \(half the 256 Hz sampling rate\), yielding 2,017 trainable parameters for the 3\-channel configuration\. Dropout of 0\.25 was applied in both blocks, with batch normalization after every convolution, average pooling \(kernel sizes 4 and 8 following Blocks 1 and 2 respectively\), and ELU activations, matching the cross\-subject configuration of Lawhern et al\.\[[14](https://arxiv.org/html/2606.02939#bib.bib11)\]\. Max\-norm weight constraints of 1\.0 \(depthwise convolution\) and 0\.25 \(classifier\) were enforced after each optimizer step, also per the original specification\. Training and evaluation followed the shared protocol described in Sections [2\.5](https://arxiv.org/html/2606.02939#S2.SS5) and [2\.6](https://arxiv.org/html/2606.02939#S2.SS6)\.

#### 2\.4\.2 xDAWN\+RG

As a classical baseline, we used xDAWN\+RG with logistic regression, implemented via the pyriemann\[[2](https://arxiv.org/html/2606.02939#bib.bib33)\] library\. For each LOSO fold, xDAWN spatial filters\[[24](https://arxiv.org/html/2606.02939#bib.bib19)\] \(nfilter = 4; effectively capped at 3 per class in the 3\-channel configuration\) were estimated from the training pool to enhance evoked responses\. The resulting trials were represented using xDAWN covariance matrices, projected into the Riemannian tangent space, and classified using logistic regression on the tangent\-space features\. Covariance matrices were estimated with Ledoit\-Wolf shrinkage and projected to tangent space using the affine\-invariant Riemannian metric\. Logistic regression used L2 regularization with C=1\.0 and the L\-BFGS solver\. Unlike the neural models, no explicit class re\-weighting was applied\. This baseline was selected because Riemannian geometry classifiers are recognized as state\-of\-the\-art for ERP classification\[[17](https://arxiv.org/html/2606.02939#bib.bib13),[3](https://arxiv.org/html/2606.02939#bib.bib31)\], and xDAWN\+RG specifically has demonstrated strong cross\-subject performance in prior BCI studies\[[15](https://arxiv.org/html/2606.02939#bib.bib34)\]\. Unlike ERP\-XTTN and EEGNet, xDAWN\+RG does not use iterative epoch\-based optimization, early stopping, or data augmentation; instead, it is fit directly on the full training pool within each LOSO fold\. At inference, the learned pipeline is computationally lightweight and compatible with real\-time operation\.

### 2\.5 Training

The neural models \(EEGNet and ERP\-XTTN\) were trained using AdamW \(learning rate $1\times 10^{-3}$, weight decay $1\times 10^{-4}$\), batch size 128, and class\-weighted binary cross\-entropy with logits loss, with the positive\-class weight set to $n_{\text{negative}}/n_{\text{positive}}$\. The learning rate schedule used linear warmup for 5 epochs followed by cosine annealing over 100 epochs, after which it was held at $1\times 10^{-5}$; gradient norms were clipped at 1\.0\. Data augmentation, applied on\-the\-fly during training only, consisted of temporal jitter \(uniform random shift in \[\-10, \+10\] samples with zero\-padding at exposed boundaries\) and additive Gaussian noise \($\sigma=0.1$\) in normalized units\. Early stopping used a trial\-level 15% stratified validation split from the training pool, stratified jointly by training subject and class; the held\-out test subject contributed no trials to either split\. The selection criterion was area under the receiver operating characteristic curve \(AUROC; patience 15, maximum 250 epochs\)\. The final model was then retrained from scratch on all training and validation data for that epoch count\. No hyperparameter optimization was performed; identical optimization hyperparameters and training procedures were used across all datasets and channel configurations\. Neural model experiments were conducted in PyTorch on an NVIDIA RTX PRO 6000 Blackwell Server Edition GPU using cloud compute; xDAWN\+RG used the pyriemann library on CPU\.
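The learning-rate schedule and class weighting can be written out as a small dependency-free sketch. Two details are assumptions rather than statements from the text: the warmup is taken to reach the base rate at the end of the fifth epoch, and the cosine phase is taken to decay from the base rate down to the $1\times 10^{-5}$ floor. Function names are hypothetical.

```python
import math


def learning_rate(epoch, base=1e-3, floor=1e-5, warmup=5, cosine=100):
    """Learning rate for a 0-indexed epoch: linear warmup, cosine decay, constant floor."""
    if epoch < warmup:
        return base * (epoch + 1) / warmup          # assumption: reaches base at epoch 4
    t = epoch - warmup
    if t >= cosine:
        return floor                                # held constant after the cosine phase
    # assumption: the cosine phase anneals from base down to the floor value
    return floor + 0.5 * (base - floor) * (1.0 + math.cos(math.pi * t / cosine))


def pos_weight(n_negative, n_positive):
    """Positive-class weight for class-weighted binary cross-entropy."""
    return n_negative / n_positive
```

As an illustration, with the BNCI per-session averages of 426 correct and 110 error trials, `pos_weight(426, 110)` is about 3.87; in practice the counts would come from the training pool of each fold.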

### 2\.6 Evaluation

All models were evaluated using LOSO cross\-validation, which avoids the subject\-level data leakage inherent in standard k\-fold splits\[[13](https://arxiv.org/html/2606.02939#bib.bib30)\]\. Within each dataset, each subject served as the test set exactly once, with all remaining subjects forming the training set\. To simulate zero\-calibration deployment, no subject data from the held\-out test subject was used during training\. All models received identical preprocessing, channel selections, and train/test splits for each dataset, ensuring that performance differences reflect only the classification method\. Within each LOSO fold, all channels were z\-scored using statistics computed exclusively from the training pool; held\-out test subjects were normalized with those same training\-derived statistics\. The primary evaluation metric was AUROC, which is threshold\-independent and robust to class imbalance\. Balanced accuracy was computed at a fixed decision threshold of 0\.5 on the predicted positive\-class probability, applied uniformly across all subjects and configurations\. This uncalibrated threshold simulates deployment conditions where per\-subject optimization is unavailable; threshold\-dependent metrics such as balanced accuracy may therefore understate discriminative capacity relative to AUROC\.
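The fold construction and normalization described above can be sketched as follows. The code is illustrative only: the function names are hypothetical, and computing one mean and standard deviation per channel, pooled over all training trials and time samples, is an assumption about how the training-derived statistics are defined.

```python
import numpy as np


def loso_folds(X, y, subjects):
    """Yield (test_subject, X_train, y_train, X_test, y_test) with train-only z-scoring.

    X: (trials, channels, samples); y and subjects: (trials,).
    """
    for s in np.unique(subjects):
        test = subjects == s
        # Per-channel statistics from the training pool only (assumed pooling over
        # trials and time); the held-out subject never contributes to them.
        mu = X[~test].mean(axis=(0, 2), keepdims=True)
        sd = X[~test].std(axis=(0, 2), keepdims=True)
        yield s, (X[~test] - mu) / sd, y[~test], (X[test] - mu) / sd, y[test]


def balanced_accuracy(y_true, prob, threshold=0.5):
    """Mean of sensitivity and specificity at a fixed, uncalibrated threshold."""
    pred = prob >= threshold
    sensitivity = pred[y_true == 1].mean()
    specificity = (~pred[y_true == 0]).mean()
    return 0.5 * (sensitivity + specificity)
```

Applying the training-pool statistics to the test subject, rather than re-estimating them, is what keeps the evaluation compatible with zero-calibration deployment.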

To characterize the interpretability cost $\Delta$ \(the difference between the best baseline and ERP\-XTTN AUROC within each dataset and channel condition\), we related it to four candidate predictors computed per dataset and channel condition\. The SNR proxy is the absolute grand\-average difference\-wave amplitude at each detected prototype’s peak latency, divided by the trial\-to\-trial standard deviation across all training\-pool trials \(both classes\) at that sample, averaged across prototypes and LOSO folds \(detection channel\)\. Attention entropy is the normalized per\-trial entropy of the flattened patch\-by\-prototype attention map \(normalized by log of the number of patch $\times$ prototype entries; 0 = peaked routing, 1 = diffuse\), averaged across trials and subjects\. Routing discriminability is the mean cosine distance between the class\-averaged attention vectors, averaged across subjects\. Prototype stability is the mean pairwise Pearson correlation of detected prototypes across LOSO folds at the detection channel\. Associations were summarized with Spearman $\rho$ \(n = 9 datasets per channel condition\) and are descriptive trend indicators rather than significance tests\.
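Two of the predictors defined above, attention entropy and routing discriminability, can be illustrated with a short plain\-Python sketch\. The function names are hypothetical, and renormalizing the flattened map to sum to one before taking the entropy is an assumption; the exact normalization of the attention map is not given here\.

```python
import math


def normalized_attention_entropy(attn):
    """Entropy of a flattened patch-by-prototype map, scaled to [0, 1].

    attn is a list of rows (one per patch) of non-negative weights.
    Assumption: the flattened map is renormalized to sum to 1.
    0 = peaked routing, 1 = diffuse routing.
    """
    flat = [w for row in attn for w in row]
    total = sum(flat)
    probs = [w / total for w in flat]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(flat))


def cosine_distance(a, b):
    """1 - cosine similarity between two class-averaged attention vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

A uniform map gives an entropy of 1 and a map with all weight on one entry gives 0, matching the endpoints stated in the definition\.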

## 3 Results

### 3\.1 Classification Performance

Table 2: AUROC \(mean, LOSO\) across channel configurations\. $\Delta$ = best baseline minus ERP\-XTTN \(interpretability cost\)\. Rows ordered by 3\-channel max AUROC descending\. Full montage channel counts vary by dataset \(Table [1](https://arxiv.org/html/2606.02939#S2.T1)\)\. Per\-subject AUROC values are reported in Supplementary Tables S2–S10\.

Table [2](https://arxiv.org/html/2606.02939#S3.T2) reports mean LOSO AUROC for all nine datasets under both channel conditions\. Cross\-subject classification difficulty followed a broadly consistent ordering across all three methods, from ERN \(easiest\) through MMN and N400 \(hardest\) at 3 channels\. At full montage the ordering shifts, with MMN unambiguously hardest and N400 recovering to mid\-range performance\. At 3 channels, EEGNet was the top baseline on six datasets and xDAWN\+RG on three, and the mean gap between the best baseline and ERP\-XTTN \($\Delta$; best baseline minus ERP\-XTTN\) was \.018 AUROC \(range \.008–\.029\)\. At full montage, this gap nearly doubled to a mean $\Delta$ of \.034 \(range \.014–\.060\), with EEGNet leading on seven of nine datasets and xDAWN\+RG on the remaining two \(N170, MMN\)\. $\Delta$ values are descriptive estimates of the mean interpretability cost across LOSO folds, not formal significance tests\.

Beyond classification accuracy, single\-trial inference latency is relevant for real\-time deployment\. Median batch\-1 CPU forward\-pass latency on an Apple M2 \(PyTorch 2\.10 in evaluation mode for the neural models, scikit\-learn predict\_proba on the fitted pipeline for xDAWN\+RG, 500 iterations after warm\-up, excluding preprocessing and data transfer\) was 0\.30 ms for ERP\-XTTN, 0\.32 ms for xDAWN\+RG, and 0\.58 ms for EEGNet under matched 3\-channel single\-trial settings\.
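The timing protocol above \(batch\-1 forward passes, a warm\-up phase, then the median over 500 iterations\) can be sketched generically\. The function name `median_latency_ms` is hypothetical, `predict` stands for any single\-trial inference callable, and the warm\-up count of 50 is an assumed value; the text says only “after warm\-up”\.

```python
import statistics
import time


def median_latency_ms(predict, sample, n_iter=500, n_warmup=50):
    """Median wall-clock time of predict(sample) in milliseconds.

    n_warmup is an assumption; preprocessing and data transfer are
    excluded simply by keeping them outside the timed call.
    """
    for _ in range(n_warmup):
        predict(sample)  # untimed warm-up calls
    timings = []
    for _ in range(n_iter):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1e3)
    return statistics.median(timings)
```

The median is less sensitive than the mean to occasional scheduler stalls, which matters at sub\-millisecond timescales\.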

### 3\.2 Attention Routing Patterns

On all nine datasets, the distribution of attention weights over prototypes differed between classes, though the specific patterns varied across components\. On several datasets, the dominant routing did not concentrate in the prototype window corresponding to the paradigm’s canonical named component; this pattern is examined in Section [4\.2](https://arxiv.org/html/2606.02939#S4.SS2)\. Class\-averaged routing timecourses and per\-subject routing contrasts for all nine datasets are provided in Supplementary Figures S2–S10; single\-trial routing figures for all subjects \(in the format of Figure [3](https://arxiv.org/html/2606.02939#S3.F3)\) are available in the public code repository \[[35](https://arxiv.org/html/2606.02939#bib.bib39)\]\.

![Refer to caption](https://arxiv.org/html/2606.02939v1/figures/fig_tp_tn_routing_hri_sub03_3ch.png)
Figure 3: Single\-trial attention routing for HRI ErrP sub\-03 \(AUROC = 0\.863\) in the 3\-channel configuration\. Top: difference\-wave prototypes at Cz with shaded prototype windows\. The shaded windows here are sub\-03’s fold\-specific boundaries from this LOSO split and differ by a few milliseconds from the cross\-fold mean windows shown in Figures [2](https://arxiv.org/html/2606.02939#S2.F2) and S3\. Middle: raw Cz waveforms for a high\-confidence true positive \(error trial, left\) and true negative \(correct trial, right\)\. Bottom: per\-prototype attention weights on a shared time axis\. The error trial routes through P1\-diff, Pe\-diff, and LateN\-diff; the correct trial routes predominantly through Ne\-diff\. Attention weights are read directly from the model’s forward pass without post\-hoc analysis\.

Mean routing contrasts were computed within each prototype’s detected temporal window\. Supplementary heatmaps \(Figures S2–S10\) show attention contrast across the full epoch and may include patterns outside the prototype window\. For each prototype we report the mean routing contrast \(mean attention weight for the positive class minus mean attention weight for the negative class, averaged across subjects\), the number of subjects in which the contrast favored the positive class, and the number of subjects for which that prototype received the largest absolute routing contrast \(dominant\)\.
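The three per\-prototype summaries just defined can be written out as a small plain\-Python sketch\. The function name and input layout are hypothetical: the inputs are assumed to be per\-subject class means that were already computed within each prototype’s window, and ties for the dominant prototype are assumed to resolve to the first prototype\.

```python
def routing_summary(pos_attn, neg_attn):
    """Summarize per-prototype routing contrasts across subjects.

    pos_attn[s][k] / neg_attn[s][k]: mean within-window attention weight
    for subject s and prototype k on positive / negative class trials.
    """
    n_subjects = len(pos_attn)
    n_protos = len(pos_attn[0])
    # contrast = positive-class mean minus negative-class mean
    contrasts = [[p - n for p, n in zip(pos_row, neg_row)]
                 for pos_row, neg_row in zip(pos_attn, neg_attn)]
    summary = []
    for k in range(n_protos):
        column = [contrasts[s][k] for s in range(n_subjects)]
        summary.append({
            "mean_contrast": sum(column) / n_subjects,
            "n_positive_favored": sum(1 for c in column if c > 0),
            # dominant = largest absolute contrast for that subject
            "n_dominant": sum(
                1 for s in range(n_subjects)
                if max(range(n_protos), key=lambda j: abs(contrasts[s][j])) == k
            ),
        })
    return summary
```

A prototype can be dominant while favoring the negative class, since dominance uses the absolute contrast\.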

For ERN, routing concentrated in the early ERN\-diff prototype \(mean routing contrast = \+0\.17, error\-favored in 38 of 40 subjects, dominant for 31 of 40\), with substantial secondary error\-favored routing on the Pe\-diff prototype \(mean contrast = \+0\.07, 33 of 40 subjects, dominant for 5 of 40\); the late prototype \(408–607 ms\) carried an opposite\-sign contrast \(mean = \-0\.05, correct\-favored in 27 of 40\)\. For LRP, the late prototype was left\-hand\-favored in 39 of 40 subjects \(mean routing contrast = \+0\.24\) and dominant for 39 of 40, with the early prototype carrying an opposite\-sign contrast\. On HRI ErrP \(Figure [3](https://arxiv.org/html/2606.02939#S3.F3)\), the Ne\-diff prototype was correct\-favored in all 11 subjects and dominant for 9 of 11; among error\-favored prototypes, subjects split between P1\-diff\-dominant \(5 of 11\) and LateN\-diff\-dominant \(5 of 11\)\. BNCI ErrP showed the same Ne\-diff pattern \(correct\-favored 5 of 6, dominant 4 of 6\), with positive\-routing dominance split between P1\-diff \(3 of 6\) and Pe\-diff \(3 of 6\); the LateN\-diff prototype was error\-favored in 4 of 6 subjects but was not the dominant error\-favored prototype for any subject\. For N170, routing was weak overall \(peak mean routing contrast = 0\.042\) and distributed: the LateN\-diff prototype was dominant for 21 of 40 subjects but with no consistent direction, and the N170\-window prototype also showed no consistent direction\. P300 routing was bimodal: the early P3\-diff prototype was target\-favored in 32 of 40, while the late prototype \(572–771 ms\) was non\-target\-favored in 25 of 40; prototypes were less stable across LOSO folds than for any other dataset \(mean pairwise Pearson r = 0\.645 vs $\geq$ 0\.79 for all others\)\.
N2pc routing was weak in magnitude \(maximum mean routing contrast = 0\.028\) but consistent in direction within each prototype: N2pc\-diff was target\-left\-favored in 34 of 40 subjects, while the SPCN\-diff \(sustained posterior contralateral negativity\) and the late prototype \(497–642 ms\) were target\-right\-favored in 27 and 28 of 40, respectively\. MMN routing was essentially unstructured \(maximum mean routing contrast = 0\.009\), with no consistent direction on any prototype\. For N400, the N400\-window prototype was inconsistent across the group \(mean routing contrast = \-0\.004; 22 of 40 subjects leaned related, 13 leaned unrelated, and 5 were near\-neutral\) and was dominant for only 15 of 40 subjects; the flanking P2\-diff prototype carried the largest within\-window contrast \(\+0\.014, unrelated\-favored\)\. Across components, per\-subject routing consistency visually tracked classification difficulty: ERN heatmaps showed near\-uniform contrast across subjects, while MMN showed no consistent structure \(Supplementary Figures S2, S9\)\.

### 3\.3 Predictors of the Performance Gap

Table S11 reports cross\-component analysis metrics for all 18 dataset $\times$ channel combinations\. We examined which properties of the signal and of ERP\-XTTN’s attention routing are associated with the performance gap \($\Delta$, best baseline minus ERP\-XTTN\) reported in Table [2](https://arxiv.org/html/2606.02939#S3.T2)\. With n = 9 datasets per channel condition, the correlations below are trend indicators, not significance tests\.

Figure [4](https://arxiv.org/html/2606.02939#S3.F4) shows the association between $\Delta$ and four candidate predictors, separately for each baseline\. The EEGNet gap was moderately associated with attention entropy, routing discriminability, and SNR proxy at both channel counts, with no single dominant predictor \(Figure [4](https://arxiv.org/html/2606.02939#S3.F4), left\)\. Prototype stability showed no association\. The xDAWN\+RG gap showed a different profile: SNR proxy was the strongest predictor at full montage \($\rho$ = 0\.72\), with prototype stability showing a weaker secondary association \($\rho$ = 0\.35\); other associations were weak or absent \(Figure [4](https://arxiv.org/html/2606.02939#S3.F4), right\)\. Class imbalance and dynamic K showed no visible association with either gap \(Table S11\)\.
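For reference, a rank correlation of this kind can be computed in a few lines of plain Python\. This is a generic sketch of Spearman’s coefficient \(Pearson correlation of ranks\), not the authors’ analysis code; the function names are hypothetical, and handling ties with average ranks is an assumption about the convention used\.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

With only nine points per channel condition, a single dataset changing rank can shift the coefficient noticeably, which is why the values are read as trend indicators\.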

![Refer to caption](https://arxiv.org/html/2606.02939v1/figures/fig_tax_drivers.png)
Figure 4: Spearman $\rho$ between the performance gap \($\Delta$\) and four candidate predictors \(defined in Section [2\.6](https://arxiv.org/html/2606.02939#S2.SS6)\), computed separately within each channel condition \(n = 9 datasets each\)\. Left: $\Delta$ vs EEGNet\. Right: $\Delta$ vs xDAWN\+RG\. Filled circles: full montage; open circles: 3\-channel\. Dashed line at $\rho$ = 0\.

Beyond aggregate predictors, we examined whether individual classification outcomes could be explained by trial\-level waveform morphology by comparing grand\-average waveforms conditioned on model predictions\. On HRI ErrP at Cz \(Figure [5](https://arxiv.org/html/2606.02939#S3.F5)\), true positives showed larger and sharper ERP components than false negatives, which were attenuated\. False positives closely resembled true positives in waveform shape across the full ErrP complex \(Ne\-diff, Pe\-diff, and LateN\-diff windows\) but at reduced amplitude, while true negatives showed substantially less morphological similarity to true positives\. Outcome\-conditioned waveforms for all nine datasets are shown in Supplementary Figure S11\.

We quantified this relationship by computing the cross\-subject Pearson correlation between mean true\-positive and mean false\-positive waveforms \(TP$\leftrightarrow$FP\) and between mean true\-positive and mean true\-negative waveforms \(TP$\leftrightarrow$TN\) at the detection channel \(Table S11\)\. TP$\leftrightarrow$FP correlations were positive and substantial on most datasets \(r = 0\.46–0\.85\)\. TP$\leftrightarrow$TN correlations were generally lower, with two notable exceptions: LRP showed TP$\leftrightarrow$TN higher than TP$\leftrightarrow$FP \(r $\approx$ 0\.83 vs 0\.80\), and MMN showed negative TP$\leftrightarrow$TN \(r = \-0\.19 to \-0\.39\)\. Bootstrap 95% confidence intervals overlapped for many datasets; these patterns should be interpreted as directionally consistent rather than definitive\.
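The outcome\-conditioned comparison can be sketched as follows\. This is a simplified illustration with hypothetical function names: it pools single\-channel trials into one mean waveform per outcome, whereas the exact cross\-subject aggregation is not detailed here, and it assumes a trial is predicted positive when its probability is at least 0\.5\.

```python
def outcome(label, prob, threshold=0.5):
    """Classify a trial as TP, FN, FP, or TN at a fixed threshold."""
    pred = 1 if prob >= threshold else 0
    if label == 1:
        return "TP" if pred == 1 else "FN"
    return "FP" if pred == 1 else "TN"


def mean_waveform(waveforms):
    """Sample-wise mean of equal-length waveforms (list must be non-empty)."""
    if not waveforms:
        raise ValueError("no trials for this outcome")
    n = len(waveforms)
    return [sum(samples) / n for samples in zip(*waveforms)]


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def outcome_correlations(waveforms, labels, probs):
    """Pearson r between the mean TP waveform and the mean FP / TN waveforms."""
    groups = {"TP": [], "FP": [], "TN": [], "FN": []}
    for w, y, p in zip(waveforms, labels, probs):
        groups[outcome(y, p)].append(w)
    tp = mean_waveform(groups["TP"])
    return {
        "TP-FP": pearson_r(tp, mean_waveform(groups["FP"])),
        "TP-TN": pearson_r(tp, mean_waveform(groups["TN"])),
    }
```

Pearson correlation is insensitive to amplitude scaling, so a false\-positive waveform with the same shape as the true\-positive waveform but reduced amplitude still yields a high TP\-FP value\.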

![Refer to caption](https://arxiv.org/html/2606.02939v1/figures/fig_morphology_hri_errp_3ch_Cz.png)
Figure 5: Outcome\-conditioned grand\-average waveforms at Cz for HRI ErrP \(3\-channel configuration\)\. Top: true positive \(correctly classified error trials\) and false negative \(missed error trials\) waveforms with standard\-error ribbons\. Bottom: true negative \(correctly classified correct trials\) and false positive \(falsely flagged correct trials\) waveforms with standard\-error ribbons\. Shaded regions indicate detected prototype windows\. Trial counts are shown per category\.

## 4 Discussion

ERP\-XTTN achieves competitive cross\-subject classification across diverse ERP paradigms while providing transparent attention routing\. We examine the sources and practical significance of the remaining performance gap, the structure revealed by the routing patterns, and the implications for calibration\-free BCI deployment\.

### 4\.1 Interpretability Cost

Across datasets, ERP\-XTTN’s performance gap relative to baselines was associated with two largely distinct factors\. The EEGNet gap was moderately associated with attention entropy, routing discriminability, and SNR proxy at both channel counts\. All three metrics reflect how much peaked, class\-discriminative signal is present in the data, and their shared association with the gap suggests the prototype constraint costs more when there is strong temporal structure for an unconstrained architecture to exploit, and less when the signal is diffuse, as for MMN and N400 in the present datasets\. The xDAWN\+RG gap showed a different profile: SNR proxy at full montage was the sole strong predictor, suggesting that xDAWN\+RG’s advantage grows with signal strength when spatial information is available for supervised filtering\. At 3 channels, no metric predicted the xDAWN gap\. Prototype stability showed no association with the EEGNet gap and only a moderate association with the xDAWN gap at full montage, consistent with the cost being primarily architectural, not a deficiency in the prototype representation itself\.

These two costs are architecturally distinct\. EEGNet wins on temporal flexibility, learning unconstrained temporal filters rather than routing through fixed prototypes\. xDAWN\+RG wins on spatial exploitation, extracting spatial information that ERP\-XTTN forgoes deliberately\. Importantly, the model\-derived routing metrics analyzed above \(attention entropy, routing discriminability\) directly describe the classifier’s decision process, not a post\-hoc approximation\. Because ERP\-XTTN uses QK\-only cross\-attention with no value projection, the attention weights are the sole input to the classification head\.
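The structural point made above, that with no value projection the attention weights themselves are the only features reaching the classification head, can be illustrated with a toy forward pass in plain Python\. Everything beyond that point is an assumption made for the sketch rather than a description of ERP\-XTTN’s implementation: the linear query and key projections, the scaled dot product, the softmax over prototypes for each patch, the linear head, and all names\.

```python
import math


def matvec(matrix, vec):
    """Multiply a (d x n) matrix, given as a list of rows, by a length-n vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def qk_only_logit(patches, prototypes, w_q, w_k, head_w, head_b):
    """Return (logit, attention map) for one trial.

    patches:    P input patch vectors (queries come from these)
    prototypes: K fixed prototype vectors (keys come from these)
    w_q, w_k:   hypothetical projection matrices
    head_w:     P*K weights of an assumed linear head; head_b: its bias
    """
    queries = [matvec(w_q, p) for p in patches]
    keys = [matvec(w_k, k) for k in prototypes]
    scale = math.sqrt(len(keys[0]))
    attn = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        attn.append(softmax(scores))  # each patch routes over the prototypes
    # No value vectors: the flattened attention map is the head's only input.
    flat = [w for row in attn for w in row]
    logit = sum(w * a for w, a in zip(head_w, flat)) + head_b
    return logit, attn
```

Because the logit is a function of `attn` alone, inspecting the returned map is inspecting the classifier’s input, which is the sense in which faithfulness is structural in this sketch\.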

For deployment, the practical implication is straightforward: at minimal montages, where spatial filtering has little to exploit and the prototype constraint costs little, ERP\-XTTN offers competitive accuracy with built\-in transparency\. At full montage, the cost is larger, and practitioners must weigh this against the value of interpretable routing for their application\.

A second deployment consideration is inference latency\. ERP\-XTTN, EEGNet, and xDAWN\+RG all deliver sub\-millisecond single\-trial inference on a commodity CPU \(Section [3\.1](https://arxiv.org/html/2606.02939#S3.SS1)\), so latency does not constrain the choice between them\. ERP\-XTTN’s median inference latency was roughly half EEGNet’s despite a larger parameter count, indicating that parameter count is not a useful proxy for single\-trial deployment cost\.

### 4\.2 Routing and ERP Structure

On two datasets, the dominant cross\-subject routing did not concentrate in the prototype window corresponding to the paradigm’s canonical named component\. For N400, the N400\-window prototype exhibited a near\-zero within\-window contrast while the flanking P2 prototype exhibited the largest contrast\. For N170, routing was distributed across prototypes with no consistent direction at the N170 window itself\. These observations do not indicate the prototypes have failed\. The prototypes are still derived from the grand\-average difference wave and are neurophysiologically grounded by construction\. Rather, the architecture reveals that the named deflection does not always carry the dominant cross\-subject discriminative signal within the difference wave\. Possible contributors include cross\-subject latency variability, the spatial nature of some contrasts, and differences in which subcomponents generalize across individuals, though disentangling these factors requires further investigation\. A black\-box model would achieve comparable AUROC on these datasets, but this structural insight into cross\-subject signal organization would be invisible\.

The TP$\leftrightarrow$FP analysis provides a complementary view of the architecture’s transparency\. Across most datasets, false positives morphologically resembled true positives more than true negatives did \(Supplementary Figure S11\), consistent with classification operating by waveform\-prototype similarity: the model misclassifies trials that genuinely look like the target class\. Classification errors are therefore neurophysiologically explicable\. Two exceptions are informative\. For LRP, TP$\leftrightarrow$FP and TP$\leftrightarrow$TN correlations were similar, consistent with left and right response trials having similar temporal morphology at the detection channel\. For MMN, TP$\leftrightarrow$TN was negative, reflecting that standard and deviant tones produce genuinely opposing waveforms\. These exceptions are themselves neurophysiologically expected, further validating the prototype\-similarity interpretation\.

### 4\.3 LOSO Benchmark

To the best of our knowledge, these are the first published epoch\-level LOSO results on ERP CORE under deployment\-compatible \(causal\) preprocessing constraints\. The ordering of classification difficulty across ERP components, with ERN and LRP easiest and MMN at or near floor, was broadly consistent across xDAWN\+RG, EEGNet, and ERP\-XTTN \(Tables [2](https://arxiv.org/html/2606.02939#S3.T2) and S1\)\. This consistency, which aligns with prior cross\-subject results on this dataset \[[1](https://arxiv.org/html/2606.02939#bib.bib5)\], suggests that the ordering reflects ERP signal properties rather than classifier\-specific behavior\. The main channel\-dependent deviation was N400, which dropped by approximately 13–18 AUROC points from full montage to 3 channels across all three methods\. This drop is expected given the N400’s broad centro\-parietal scalp distribution: reduced spatial coverage substantially impairs decoding of components that lack a strongly focal topography\. The cross\-component difficulty ordering is useful independently of ERP\-XTTN: it characterizes which components generalize cross\-subject under matched preprocessing and offers practitioners a guide to which paradigms are viable for calibration\-free deployment\.

On the two previously benchmarked ErrP datasets \(HRI and BNCI\), ERP\-XTTN’s LOSO performance is competitive with published results\. On HRI, Schönleitner et al\. \[[26](https://arxiv.org/html/2606.02939#bib.bib20)\] reported 72\.7% balanced accuracy with generalized LDA using 27 channels; ERP\-XTTN achieved 76\.0% with three midline channels\. On BNCI, Ren et al\. \[[22](https://arxiv.org/html/2606.02939#bib.bib18)\] reported a best per\-session LOSO AUROC of 0\.755 with two channels \(FCz, Cz\); ERP\-XTTN achieved 0\.776 with three midline channels under causal filtering\. On several datasets, ERP\-XTTN outperformed xDAWN\+RG at 3 channels \(ERN, P300, HRI ErrP; Table [2](https://arxiv.org/html/2606.02939#S3.T2)\), suggesting prototype routing can provide an advantage over spatial filtering when the temporal signal is strong but the montage limits spatial information\.

All results reported here use causal IIR preprocessing and LOSO evaluation, which directly simulates calibration\-free deployment: the held\-out subject contributes no data to training \[[13](https://arxiv.org/html/2606.02939#bib.bib30)\]\. Many published LOSO benchmarks use acausal filtering, which is not applicable to real\-time deployment \[[26](https://arxiv.org/html/2606.02939#bib.bib20), [23](https://arxiv.org/html/2606.02939#bib.bib17)\]\.

### 4\.4 Limitations

Several limitations should be noted\. First, all evaluations were offline; online or real\-time deployment may introduce additional challenges not captured here\. Second, prototypes were frozen after extraction from the training set and may not capture individual variation in ERP morphology; adaptive prototypes that update during inference could address this but would complicate the interpretability guarantee\. Third, the ERP CORE datasets each contain 40 subjects, moderate for LOSO evaluation but insufficient to characterize performance on clinical populations or underrepresented demographics\. Fourth, automatic prototype detection introduces hyperparameters \(prominence threshold 0\.02, maximum K = 4\) that were fixed across all datasets; exploring their sensitivity is left for future work\. Fifth, the LRP epoch window \(0–800 ms post\-response\) captures post\-response lateralized activity rather than the canonical pre\-response lateralized readiness potential\. Sixth, training hyperparameters including early\-stopping patience and learning rate were shared across both neural models without per\-model optimization; the reported gap may not reflect the full capacity of each architecture\. Finally, balanced accuracy lagged AUROC across all three methods \(Table S1\), indicating that the fixed, uncalibrated 0\.5 decision threshold is suboptimal under LOSO\. Post\-hoc probability calibration \(e\.g\., Platt scaling\) or threshold calibration could address this without altering any of the architectures\.
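As one hedged illustration of the threshold calibration mentioned in the final limitation, the sketch below selects a decision threshold on validation trials drawn from the training pool, so the held\-out subject remains untouched\. This shows threshold selection only \(not Platt scaling\); the function names, the grid of candidate thresholds, and the use of balanced accuracy as the selection criterion are all assumptions\.

```python
def balanced_accuracy(labels, probs, threshold=0.5):
    """Mean of sensitivity and specificity at a fixed threshold.

    Assumes both classes are present in labels.
    """
    tp = fn = tn = fp = 0
    for y, p in zip(labels, probs):
        pred = 1 if p >= threshold else 0
        if y == 1:
            tp += pred
            fn += 1 - pred
        else:
            fp += pred
            tn += 1 - pred
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)


def pick_threshold(val_labels, val_probs, candidates=None):
    """Return the candidate threshold with the best validation balanced accuracy.

    The 0.01-0.99 grid is an assumed default; ties go to the lowest threshold.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return max(candidates,
               key=lambda t: balanced_accuracy(val_labels, val_probs, t))
```

When predicted probabilities are systematically compressed below 0\.5, the fixed threshold yields chance\-level balanced accuracy even if the ranking \(and hence AUROC\) is perfect, which is the gap such a procedure targets\.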

## 5 Conclusion

ERP\-XTTN generalizes from ErrP to seven additional ERP components via automatic prototype detection, achieving competitive cross\-subject classification under causal, calibration\-free LOSO conditions with a mean interpretability cost of \.018 AUROC at 3 channels\. The performance gap is associated with two largely distinct architectural sources, both small at minimal montages\. The architecture’s transparent routing provides structural insights, including named\-component displacement and morphologically explicable classification errors, that black\-box models cannot offer\. To our knowledge, this work also provides the first epoch\-level LOSO benchmark on ERP CORE\.

## Acknowledgments

The authors declare no conflicts of interest\. The authors used Claude Opus 4\.6–4\.8 \(Anthropic\) during the preparation of this work for the following purposes: editing human\-written text for grammar, clarity, and structure; assisting with the drafting and debugging of analysis code; and supporting literature review by helping locate relevant publications\. All AI\-generated text was critically revised by the authors, all references were independently verified, and all results and figures derive from the analyses reported in the paper, conducted on publicly available datasets\. The authors reviewed and approved the final manuscript and take full responsibility for its content\.

## Funding

This research received no external funding\.

## Author Contributions

Charlotte Genevier Wyman: Conceptualization, Methodology, Software, Formal Analysis, Investigation, Writing – Original Draft, Writing – Review & Editing, Visualization\. Leanne Hirshfield: Supervision, Writing – Review & Editing\.

## Supplementary Data

Supplementary material includes: Table S1 \(balanced accuracy\), Tables S2\-S10 \(per\-subject AUROC for all nine datasets\), Table S11 \(cross\-component analysis metrics\), Figure S1 \(difference\-wave prototypes for all nine datasets\), Figures S2\-S10 \(class\-averaged attention routing and per\-subject routing contrasts for all nine datasets\), and Figure S11 \(outcome\-conditioned grand\-average waveforms for all nine datasets\)\.

## References

- \[1\] \(2023\-07\) Evaluating the structure of cognitive tasks with transfer learning\. arXiv\. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2308.02408), 2308\.02408\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p5.1), [§4\.3](https://arxiv.org/html/2606.02939#S4.SS3.p1.1)\.
- \[2\] PyRiemann\. External Links: [Document](https://dx.doi.org/10.5281/zenodo.593816), [Link](https://doi.org/10.5281/zenodo.593816)\. Cited by: [§2\.4\.2](https://arxiv.org/html/2606.02939#S2.SS4.SSS2.p1.1)\.
- \[3\] A\. Barachant and M\. Congedo \(2014\-08\) A Plug&Play P300 BCI Using Information Geometry\. arXiv\. External Links: 1409\.0107, [Document](https://dx.doi.org/10.48550/arXiv.1409.0107)\. Cited by: [§2\.4\.2](https://arxiv.org/html/2606.02939#S2.SS4.SSS2.p1.1)\.
- \[4\] S\. Bhattacharyya, A\. Konar, D\. N\. Tibarewala, and M\. Hayashibe \(2017\-05\) A Generic Transferable EEG Decoder for Online Detection of Error Potential in Target Selection\. Frontiers in Neuroscience 11, pp\. 226\. External Links: ISSN 1662\-453X, [Document](https://dx.doi.org/10.3389/fnins.2017.00226)\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p3.1)\.
- \[5\] B\. Blankertz, S\. Lemm, M\. Treder, S\. Haufe, and K\. Müller \(2011\-05\) Single\-trial analysis and classification of ERP components — A tutorial\. NeuroImage 56\(2\), pp\. 814–825\. External Links: ISSN 10538119, [Document](https://dx.doi.org/10.1016/j.neuroimage.2010.06.048)\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p3.1)\.
- \[6\] H\. T\. J\. Chan, M\. Wimmer, I\. Šimić, G\. R\. Müller\-Putz, and E\. Veas \(2025\-10\) Informing EEG\-Based Error Decoding With Explainable AI\. In 2025 IEEE International Conference on Systems, Man, and Cybernetics \(SMC\), Vienna, Austria, pp\. 4550–4557\. External Links: [Document](https://dx.doi.org/10.1109/SMC58881.2025.11343575), ISBN 979\-8\-3315\-3358\-8\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p4.1)\.
- \[7\] R\. Chavarriaga and J\. D\. R\. Millan \(2010\-08\) Learning From EEG Error\-Related Potentials in Noninvasive Brain\-Computer Interfaces\. IEEE Transactions on Neural Systems and Rehabilitation Engineering 18\(4\), pp\. 381–388\. External Links: ISSN 1534\-4320, 1558\-0210, [Document](https://dx.doi.org/10.1109/TNSRE.2010.2053387)\. Cited by: [§2\.1\.1](https://arxiv.org/html/2606.02939#S2.SS1.SSS1.p1.1), [§2\.1\.3](https://arxiv.org/html/2606.02939#S2.SS1.SSS3.p9.1)\.
- \[8\] S\. K\. Ehrlich and G\. Cheng \(2019\-04\) A Feasibility Study for Validating Robot Actions Using EEG\-Based Error\-Related Potentials\. International Journal of Social Robotics 11\(2\), pp\. 271–283\. External Links: ISSN 1875\-4791, 1875\-4805, [Document](https://dx.doi.org/10.1007/s12369-018-0501-8)\. Cited by: [§2\.1\.2](https://arxiv.org/html/2606.02939#S2.SS1.SSS2.p1.1), [§2\.1\.3](https://arxiv.org/html/2606.02939#S2.SS1.SSS3.p9.1)\.
- \[9\] S\. Jain and B\. C\. Wallace \(2019\-06\) Attention is not Explanation\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\), Minneapolis, Minnesota, pp\. 3543–3556\. Cited by: [§2\.3](https://arxiv.org/html/2606.02939#S2.SS3.p4.1)\.
- \[10\] S\. Jalilpour and G\. Müller\-Putz \(2025\-01\) A framework for Interpretable deep learning in cross\-subject detection of event\-related potentials\. Engineering Applications of Artificial Intelligence 139, pp\. 109642\. External Links: ISSN 09521976, [Document](https://dx.doi.org/10.1016/j.engappai.2024.109642)\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p4.1)\.
- \[11\] E\. S\. Kappenman, J\. L\. Farrens, W\. Zhang, A\. X\. Stewart, and S\. J\. Luck \(2021\-01\) ERP CORE: An open resource for human event\-related potential research\. NeuroImage 225, pp\. 117465\. External Links: ISSN 10538119, [Document](https://dx.doi.org/10.1016/j.neuroimage.2020.117465)\. Cited by: [§2\.1\.3](https://arxiv.org/html/2606.02939#S2.SS1.SSS3.p1.1), [§2\.1\.3](https://arxiv.org/html/2606.02939#S2.SS1.SSS3.p9.1)\.
- \[12\] A\. Kumar, E\. Pirogova, S\. S\. Mahmoud, and Q\. Fang \(2021\-10\) Classification of error\-related potentials evoked during stroke rehabilitation training\. Journal of Neural Engineering 18\(5\), pp\. 056022\. External Links: ISSN 1741\-2560, 1741\-2552, [Document](https://dx.doi.org/10.1088/1741-2552/ac1d32)\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p3.1)\.
- \[13\] S\. Kunjan, T\. S\. Grummett, K\. J\. Pope, D\. M\. W\. Powers, S\. P\. Fitzgibbon, T\. Bastiampillai, M\. Battersby, and T\. W\. Lewis \(2021\) The Necessity of Leave One Subject Out \(LOSO\) Cross Validation for EEG Disease Diagnosis\. In Brain Informatics, M\. Mahmud, M\. S\. Kaiser, S\. Vassanelli, Q\. Dai, and N\. Zhong \(Eds\.\), Vol\. 12960, pp\. 558–567\. External Links: [Document](https://dx.doi.org/10.1007/978-3-030-86993-9%5F50), ISBN 978\-3\-030\-86992\-2 978\-3\-030\-86993\-9\. Cited by: [§2\.6](https://arxiv.org/html/2606.02939#S2.SS6.p1.1), [§4\.3](https://arxiv.org/html/2606.02939#S4.SS3.p3.1)\.
- \[14\] V\. J\. Lawhern, A\. J\. Solon, N\. R\. Waytowich, S\. M\. Gordon, C\. P\. Hung, and B\. J\. Lance \(2018\-10\) EEGNet: a compact convolutional neural network for EEG\-based brain–computer interfaces\. Journal of Neural Engineering 15\(5\), pp\. 056013\. External Links: ISSN 1741\-2560, 1741\-2552, [Document](https://dx.doi.org/10.1088/1741-2552/aace8c)\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p3.1), [§1](https://arxiv.org/html/2606.02939#S1.p4.1), [§2\.4\.1](https://arxiv.org/html/2606.02939#S2.SS4.SSS1.p1.1), [§2\.4](https://arxiv.org/html/2606.02939#S2.SS4.p1.1)\.
- \[15\] F\. Li, Y\. Xia, F\. Wang, D\. Zhang, X\. Li, and F\. He \(2020\-03\) Transfer Learning Algorithm of P300\-EEG Signal Based on XDAWN Spatial Filter and Riemannian Geometry Classifier\. Applied Sciences 10\(5\), pp\. 1804\. External Links: ISSN 2076\-3417, [Document](https://dx.doi.org/10.3390/app10051804)\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p3.1), [§2\.4\.2](https://arxiv.org/html/2606.02939#S2.SS4.SSS2.p1.1)\.
- \[16\] C\. Lopes\-Dias, A\. I\. Sburlea, K\. Breitegger, D\. Wyss, H\. Drescher, R\. Wildburger, and G\. R\. Müller\-Putz \(2021\-08\) Online asynchronous detection of error\-related potentials in participants with a spinal cord injury using a generic classifier\. Journal of Neural Engineering 18\(4\), pp\. 046022\. External Links: ISSN 1741\-2560, 1741\-2552, [Document](https://dx.doi.org/10.1088/1741-2552/abd1eb)\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p3.1)\.
- \[17\] F\. Lotte, L\. Bougrain, A\. Cichocki, M\. Clerc, M\. Congedo, A\. Rakotomamonjy, and F\. Yger \(2018\-06\) A review of classification algorithms for EEG\-based brain–computer interfaces: a 10 year update\. Journal of Neural Engineering 15\(3\), pp\. 031005\. External Links: ISSN 1741\-2560, 1741\-2552, [Document](https://dx.doi.org/10.1088/1741-2552/aab2f2)\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p3.1), [§2\.4\.2](https://arxiv.org/html/2606.02939#S2.SS4.SSS2.p1.1)\.
- \[18\] F\. Lotte \(2015\-06\) Signal Processing Approaches to Minimize or Suppress Calibration Time in Oscillatory Activity\-Based Brain–Computer Interfaces\. Proceedings of the IEEE 103\(6\), pp\. 871–890\. External Links: ISSN 0018\-9219, 1558\-2256, [Document](https://dx.doi.org/10.1109/JPROC.2015.2404941)\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p1.1)\.
- \[19\] S\. J\. Luck \(2014\) An introduction to the event\-related potential technique\. 2nd edition, A Bradford Book, MIT Press, Cambridge, Mass\. External Links: ISBN 978\-0\-262\-52585\-5\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p1.1), [§1](https://arxiv.org/html/2606.02939#S1.p2.1)\.
- \[20\] T\. Luo \(2026\-04\) Domain generalized feature embedded learning for calibration\-free event\-related potentials recognition\. Cognitive Neurodynamics 20\(1\), pp\. 77\. External Links: ISSN 1871\-4080, 1871\-4099, [Document](https://dx.doi.org/10.1007/s11571-026-10450-2)\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p3.1)\.
- \[21\] Y\. Qin, Q\. Xu, T\. Kujala, X\. Wang, and F\. Cong \(2026\-05\) Evaluating spatial normalization for SVM\-based EEG decoding: A within\- and between\-subjects perspective\. Biomedical Signal Processing and Control 116, pp\. 109535\. External Links: ISSN 17468094, [Document](https://dx.doi.org/10.1016/j.bspc.2026.109535)\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p5.1)\.
- \[22\] G\. Ren, A\. Kumar, S\. S\. Mahmoud, and Q\. Fang \(2024\-06\) A deep neural network and transfer learning combined method for cross\-task classification of error\-related potentials\. Frontiers in Human Neuroscience 18, pp\. 1394107\. External Links: ISSN 1662\-5161, [Document](https://dx.doi.org/10.3389/fnhum.2024.1394107)\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p3.1), [§4\.3](https://arxiv.org/html/2606.02939#S4.SS3.p2.1)\.
- \[23\] G\. Ren, S\. S\. Mahmoud, A\. Kumar, Q\. Fang, and B\. Yu \(2023\-10\) A Transformer Encoder and Convolutional Neural Network Combined Method for Classification of Error\-related Potentials\. In 2023 IEEE Biomedical Circuits and Systems Conference \(BioCAS\), Toronto, ON, Canada, pp\. 1–5\. External Links: [Document](https://dx.doi.org/10.1109/BioCAS58349.2023.10389146), ISBN 979\-8\-3503\-0026\-0\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p3.1), [§4\.3](https://arxiv.org/html/2606.02939#S4.SS3.p3.1)\.
- \[24\] B\. Rivet, A\. Souloumiac, V\. Attina, and G\. Gibert \(2009\-08\) xDAWN Algorithm to Enhance Evoked Potentials: Application to Brain–Computer Interface\. IEEE Transactions on Biomedical Engineering 56\(8\), pp\. 2035–2043\. External Links: ISSN 0018\-9294, 1558\-2531, [Document](https://dx.doi.org/10.1109/TBME.2009.2012869)\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p3.1), [§2\.4\.2](https://arxiv.org/html/2606.02939#S2.SS4.SSS2.p1.1)\.
- \[25\] J\. Sabio, N\. S\. Williams, G\. M\. McArthur, and N\. A\. Badcock \(2024\-03\) A scoping review on the use of consumer\-grade EEG devices for research\. PLOS ONE 19\(3\), pp\. e0291186\. External Links: ISSN 1932\-6203, [Document](https://dx.doi.org/10.1371/journal.pone.0291186)\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p1.1)\.
- \[26\] F\. M\. Schönleitner, L\. Otter, S\. K\. Ehrlich, and G\. Cheng \(2020\-08\) Calibration\-Free Error\-Related Potential Decoding With Adaptive Subject\-Independent Models: a Comparative Study\. IEEE Transactions on Medical Robotics and Bionics 2\(3\), pp\. 399–409\. External Links: ISSN 2576\-3202, [Document](https://dx.doi.org/10.1109/tmrb.2020.3012436)\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p3.1), [§4\.3](https://arxiv.org/html/2606.02939#S4.SS3.p2.1), [§4\.3](https://arxiv.org/html/2606.02939#S4.SS3.p3.1)\.
- \[27\] A\. J\. Solon, S\. M\. Gordon, V\. J\. Lawhern, and B\. J\. Lance \(2017\-12\) Deep Learning Approaches for P300 Classification in Image Triage: Applications to the NAILS Task\. In Proceedings of the 13th NTCIR Conference on Evaluation of Information Access Technologies \(NTCIR\-13\), Tokyo, Japan\. Cited by: [§1](https://arxiv.org/html/2606.02939#S1.p3.1)\.
- \[28\]C\. Vidaurre and B\. Blankertz\(2010\-06\)Towards a Cure for BCI Illiteracy\.Brain Topography23\(2\),pp\. 194–198\.External Links:ISSN 0896\-0267, 1573\-6792,[Document](https://dx.doi.org/10.1007/s10548-009-0121-6)Cited by:[§1](https://arxiv.org/html/2606.02939#S1.p1.1)\.
- \[29\]W\. Wei, S\. Qiu, Y\. Zhang, J\. Mao, and H\. He\(2022\-04\)ERP prototypical matching net: a meta\-learning method for zero\-calibration RSVP\-based image retrieval\.Journal of Neural Engineering19\(2\),pp\. 026028\.External Links:ISSN 1741\-2560, 1741\-2552,[Document](https://dx.doi.org/10.1088/1741-2552/ac5eb7)Cited by:[§1](https://arxiv.org/html/2606.02939#S1.p4.1)\.
- \[30\]S\. Wiegreffe and Y\. Pinter\(2019\-11\)Attention is not not Explanation\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),Hong Kong, China,pp\. 11–20\.External Links:[Document](https://dx.doi.org/10.18653/v1/D19-1002)Cited by:[§2\.3](https://arxiv.org/html/2606.02939#S2.SS3.p4.1)\.
- \[31\]J\. R\. Wolpaw, N\. Birbaumer, D\. J\. McFarland, G\. Pfurtscheller, and T\. M\. Vaughan\(2002\-06\)Brain–computer interfaces for communication and control\.Clinical Neurophysiology113\(6\),pp\. 767–791\.External Links:ISSN 13882457,[Document](https://dx.doi.org/10.1016/S1388-2457%2802%2900057-3)Cited by:[§1](https://arxiv.org/html/2606.02939#S1.p1.1)\.
- \[32\]D\. Wu, Y\. Xu, and B\. Lu\(2022\-03\)Transfer Learning for EEG\-Based Brain–Computer Interfaces: A Review of Progress Made Since 2016\.IEEE Transactions on Cognitive and Developmental Systems14\(1\),pp\. 4–19\.External Links:ISSN 2379\-8920, 2379\-8939,[Document](https://dx.doi.org/10.1109/TCDS.2020.3007453)Cited by:[§1](https://arxiv.org/html/2606.02939#S1.p3.1)\.
- \[33\]D\. Wu\(2017\-08\)Online and Offline Domain Adaptation for Reducing BCI Calibration Effort\.IEEE Transactions on Human\-Machine Systems47\(4\),pp\. 550–563\.External Links:ISSN 2168\-2291, 2168\-2305,[Document](https://dx.doi.org/10.1109/THMS.2016.2608931)Cited by:[§1](https://arxiv.org/html/2606.02939#S1.p3.1)\.
- \[34\]C\. G\. Wyman and L\. Hirshfield\(2026\)ERP\-XTTN: Interpretable Cross\-Subject Error\-Related Potential Classification via Cross\-Attention to Data\-Driven ERP Prototypes\.InProceedings of the 10th Graz Brain\-Computer Interface Conference,Note:acceptedCited by:[§1](https://arxiv.org/html/2606.02939#S1.p6.1),[§2\.3](https://arxiv.org/html/2606.02939#S2.SS3.p2.1)\.
- \[35\]ERP\-XTTN: Interpretable Cross\-Attention ERP ClassifierExternal Links:[Document](https://dx.doi.org/10.5281/zenodo.20497891),[Link](https://doi.org/10.5281/zenodo.20497891)Cited by:[§3\.2](https://arxiv.org/html/2606.02939#S3.SS2.p1.1),[§5](https://arxiv.org/html/2606.02939#S5.p5.2)\.
- \[36\]X\. Xiao, M\. Xu, J\. Jin, Y\. Wang, T\. Jung, and D\. Ming\(2020\-08\)Discriminative Canonical Pattern Matching for Single\-Trial Classification of ERP Components\.IEEE Transactions on Biomedical Engineering67\(8\),pp\. 2266–2275\.External Links:ISSN 0018\-9294, 1558\-2531,[Document](https://dx.doi.org/10.1109/TBME.2019.2958641)Cited by:[§1](https://arxiv.org/html/2606.02939#S1.p4.1)\.
- \[37\]B\. Zhang, M\. Xu, Y\. Zhang, S\. Ye, and Y\. Chen\(2024\-04\)Attention\-ProNet: A Prototype Network with Hybrid Attention Mechanisms Applied to Zero Calibration in Rapid Serial Visual Presentation\-Based Brain–Computer Interface\.Bioengineering11\(4\),pp\. 347\.External Links:ISSN 2306\-5354,[Document](https://dx.doi.org/10.3390/bioengineering11040347)Cited by:[§1](https://arxiv.org/html/2606.02939#S1.p4.1)\.

## Supplementary Material

Table S1: Balanced accuracy (mean, LOSO) across channel configurations. Layout and row order match main text Table 2. $\Delta$ = best baseline minus ERP-XTTN (interpretability cost). Full-montage channel counts vary by dataset (Table 1). Per-subject balanced-accuracy values are not tabulated separately; per-subject AUROC values are reported in Tables [S2](https://arxiv.org/html/2606.02939#Sx2.T2)–[S10](https://arxiv.org/html/2606.02939#Sx2.T10).

Table S2: Per-subject AUROC for ERP CORE ERN (40 subjects $\times$ 3 models $\times$ 2 channel conditions). Each row is one LOSO fold with the indicated subject held out (these are not within-subject cross-validated values). Mean row at bottom matches the corresponding column of main text Table 2.

Table S3: Per-subject AUROC for HRI ErrP (11 subjects $\times$ 3 models $\times$ 2 channel conditions). Each row is one LOSO fold with the indicated subject held out (these are not within-subject cross-validated values). Mean row at bottom matches the corresponding column of main text Table 2.

Table S4: Per-subject AUROC for BNCI ErrP (6 subjects $\times$ 3 models $\times$ 2 channel conditions). Each row is one LOSO fold with the indicated subject held out (these are not within-subject cross-validated values). Mean row at bottom matches the corresponding column of main text Table 2.

Table S5: Per-subject AUROC for ERP CORE LRP (40 subjects $\times$ 3 models $\times$ 2 channel conditions). Each row is one LOSO fold with the indicated subject held out (these are not within-subject cross-validated values). Mean row at bottom matches the corresponding column of main text Table 2.

Table S6: Per-subject AUROC for ERP CORE N170 (40 subjects $\times$ 3 models $\times$ 2 channel conditions). Each row is one LOSO fold with the indicated subject held out (these are not within-subject cross-validated values). Mean row at bottom matches the corresponding column of main text Table 2.

Table S7: Per-subject AUROC for ERP CORE P300 (40 subjects $\times$ 3 models $\times$ 2 channel conditions). Each row is one LOSO fold with the indicated subject held out (these are not within-subject cross-validated values). Mean row at bottom matches the corresponding column of main text Table 2.

Table S8: Per-subject AUROC for ERP CORE N2pc (40 subjects $\times$ 3 models $\times$ 2 channel conditions). Each row is one LOSO fold with the indicated subject held out (these are not within-subject cross-validated values). Mean row at bottom matches the corresponding column of main text Table 2.

Table S9: Per-subject AUROC for ERP CORE MMN (40 subjects $\times$ 3 models $\times$ 2 channel conditions). Each row is one LOSO fold with the indicated subject held out (these are not within-subject cross-validated values). Mean row at bottom matches the corresponding column of main text Table 2.

Table S10: Per-subject AUROC for ERP CORE N400 (40 subjects $\times$ 3 models $\times$ 2 channel conditions). Each row is one LOSO fold with the indicated subject held out (these are not within-subject cross-validated values). Mean row at bottom matches the corresponding column of main text Table 2.

Table S11: Cross-component analysis metrics (18 rows: 9 datasets $\times$ 2 channel conditions). All values are LOSO-averaged for ERP-XTTN.

Column legend (full definitions in Section 2, Methods):

- Ch = channel condition (Full = full montage, 3ch = 3-channel preset).
- AUROC = mean LOSO test AUROC.
- $\Delta_{\mathrm{EEG}}$, $\Delta_{\mathrm{xDR}}$ = paired LOSO interpretability tax vs. EEGNet and vs. xDAWN+RG (mean of per-subject baseline AUROC minus ERP-XTTN AUROC; positive = baseline wins).
- Bal = minority-class proportion (0.5 = perfectly balanced).
- K = modal number of prototypes detected by the auto peak-finder.
- K_con = fraction of folds at the modal K.
- P_stab = prototype stability: per-slot mean pairwise Pearson $r$ across folds at the detection channel.
- SNR = absolute grand-average difference-wave amplitude divided by trial-to-trial standard deviation (SD) across all trials (both classes) at each prototype's peak, averaged across prototypes and folds.
- H_attn = mean normalized attention entropy across subjects (0 = peaked, 1 = diffuse).
- R_disc = routing discriminability, mean cosine distance between class-averaged attention vectors.
- Lat_SD = cross-subject SD (ms) of the dominant difference-wave peak latency.
- r_FP, r_TN = mean cross-subject Pearson $r$ between TP and FP (resp. TN) grand-mean waveforms at the detection channel.

Note on repeated values: Bal, K, K_con, P_stab, SNR, and Lat_SD are computed from the preprocessed signal at the detection channel and from auto-detected difference-wave windows; they do not depend on the model's input channel set, so values are identical across the two channel conditions of each dataset by construction. AUROC, the $\Delta$ columns, H_attn, R_disc, r_FP, and r_TN are model-dependent and differ across channel conditions.
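To make three of the Table S11 columns concrete, the following is a minimal sketch of how H_attn, R_disc, and the paired interpretability tax could be computed for a single held-out subject. It is not the authors' code: the function names, the assumed attention layout (trials, then any positional axes, with the prototype axis last), the 0/1 label coding, and the `eps` guard are all illustrative assumptions, and the authoritative definitions are those in Section 2.

```python
import numpy as np


def normalized_attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy over the prototype axis, scaled to [0, 1].

    Assumes the last axis indexes the K prototypes and sums to 1.
    0 means attention is peaked on one prototype, 1 means it is uniform.
    """
    attn = np.asarray(attn, dtype=float)
    k = attn.shape[-1]
    entropy = -np.sum(attn * np.log(attn + eps), axis=-1)
    return float(np.mean(entropy) / np.log(k))


def routing_discriminability(attn, labels):
    """Cosine distance between the two class-averaged attention vectors.

    Assumes labels are coded 0/1 and that each trial's attention map is
    flattened into one vector before class averaging.
    """
    attn = np.asarray(attn, dtype=float)
    labels = np.asarray(labels)
    flat = attn.reshape(attn.shape[0], -1)
    pos = flat[labels == 1].mean(axis=0)
    neg = flat[labels == 0].mean(axis=0)
    cos_sim = pos @ neg / (np.linalg.norm(pos) * np.linalg.norm(neg))
    return float(1.0 - cos_sim)


def interpretability_tax(baseline_auroc, model_auroc):
    """Mean of per-subject (baseline - model) AUROC; positive = baseline wins."""
    diff = np.asarray(baseline_auroc, dtype=float) - np.asarray(model_auroc, dtype=float)
    return float(np.mean(diff))
```

Under these assumptions, the per-subject values would then be averaged across LOSO folds to obtain the table entries.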
![Refer to caption](https://arxiv.org/html/2606.02939v1/figures/fig_prototypes_all_datasets.png)

Figure S1: Difference-wave prototypes (detection channel only, 3-channel condition) across all nine datasets. Each row corresponds to one dataset; each column to one prototype slot (K varies by dataset: K=2 for LRP, K=3 for ERN and MMN, K=4 for all others). Thick lines show the mean across LOSO folds; thin lines show individual fold prototypes. Prototype windows (shaded) are derived automatically from prominent extrema in the training-fold grand-average difference wave (Section 2.3). Prototype labels reflect the dominant ERP component within each window based on latency and polarity. Fold-to-fold variability is low for most datasets (mean pairwise $r > 0.89$), with P300 showing the greatest instability across folds (mean pairwise $r = 0.645$; see Table [S11](https://arxiv.org/html/2606.02939#Sx2.T11)).

![Refer to caption](https://arxiv.org/html/2606.02939v1/figures/fig_routing_erpcore_ern_3ch.png)

Figure S2: Class-averaged attention routing and per-subject routing contrasts for ERN (3-channel configuration, n = 40 subjects). Top: mean attention weight per prototype across temporal positions for error (black, solid) and correct (gray, dashed) trials, with standard-error ribbons. Shaded regions indicate detected prototype windows. Bottom: per-subject routing contrast (error − correct attention) for each prototype, with subjects sorted by AUROC (descending). Heatmap color scale clipped to the 95th percentile of absolute attention contrast per figure; values beyond this threshold are saturated.

![Refer to caption](https://arxiv.org/html/2606.02939v1/figures/fig_routing_hri_errp_cursor_3ch.png)

Figure S3: Class-averaged attention routing and per-subject routing contrasts for HRI ErrP (3-channel configuration, n = 11 subjects). Top: mean attention weight per prototype for error (black, solid) and correct (gray, dashed) trials, with standard-error ribbons. Bottom: per-subject routing contrast (error − correct attention) for each prototype, with subjects sorted by AUROC (descending). Heatmap color scale clipped to the 95th percentile of absolute attention contrast per figure; values beyond this threshold are saturated.

![Refer to caption](https://arxiv.org/html/2606.02939v1/figures/fig_routing_bnci_errp_013_2015_3ch.png)

Figure S4: Class-averaged attention routing and per-subject routing contrasts for BNCI ErrP (3-channel configuration, n = 6 subjects). Top: mean attention weight per prototype for error (black, solid) and correct (gray, dashed) trials, with standard-error ribbons. Bottom: per-subject routing contrast (error − correct attention) for each prototype, with subjects sorted by AUROC (descending). Heatmap color scale clipped to the 95th percentile of absolute attention contrast per figure; values beyond this threshold are saturated.

![Refer to caption](https://arxiv.org/html/2606.02939v1/figures/fig_routing_erpcore_lrp_3ch.png)

Figure S5: Class-averaged attention routing and per-subject routing contrasts for LRP (3-channel configuration, n = 40 subjects). Top: mean attention weight per prototype for left-hand (black, solid) and right-hand (gray, dashed) response trials, with standard-error ribbons. Bottom: per-subject routing contrast (left − right attention) for each prototype, with subjects sorted by AUROC (descending). Heatmap color scale clipped to the 95th percentile of absolute attention contrast per figure; values beyond this threshold are saturated.

![Refer to caption](https://arxiv.org/html/2606.02939v1/figures/fig_routing_erpcore_n170_3ch.png)

Figure S6: Class-averaged attention routing and per-subject routing contrasts for N170 (3-channel configuration, n = 40 subjects). Top: mean attention weight per prototype for face (black, solid) and car (gray, dashed) trials, with standard-error ribbons. Bottom: per-subject routing contrast (face − car attention) for each prototype, with subjects sorted by AUROC (descending). Heatmap color scale clipped to the 95th percentile of absolute attention contrast per figure; values beyond this threshold are saturated.

![Refer to caption](https://arxiv.org/html/2606.02939v1/figures/fig_routing_erpcore_p300_3ch.png)

Figure S7: Class-averaged attention routing and per-subject routing contrasts for P300 (3-channel configuration, n = 40 subjects). Top: mean attention weight per prototype for target (black, solid) and non-target (gray, dashed) trials, with standard-error ribbons. Bottom: per-subject routing contrast (target − non-target attention) for each prototype, with subjects sorted by AUROC (descending). Heatmap color scale clipped to the 95th percentile of absolute attention contrast per figure; values beyond this threshold are saturated.

![Refer to caption](https://arxiv.org/html/2606.02939v1/figures/fig_routing_erpcore_n2pc_3ch.png)

Figure S8: Class-averaged attention routing and per-subject routing contrasts for N2pc (3-channel configuration, n = 40 subjects). Top: mean attention weight per prototype for target-left (black, solid) and target-right (gray, dashed) trials, with standard-error ribbons. Bottom: per-subject routing contrast (target-left − target-right attention) for each prototype, with subjects sorted by AUROC (descending). Heatmap color scale clipped to the 95th percentile of absolute attention contrast per figure; values beyond this threshold are saturated.

![Refer to caption](https://arxiv.org/html/2606.02939v1/figures/fig_routing_erpcore_mmn_3ch.png)

Figure S9: Class-averaged attention routing and per-subject routing contrasts for MMN (3-channel configuration, n = 40 subjects). Top: mean attention weight per prototype for deviant (black, solid) and standard (gray, dashed) trials, with standard-error ribbons. Bottom: per-subject routing contrast (deviant − standard attention) for each prototype, with subjects sorted by AUROC (descending). Heatmap color scale clipped to the 95th percentile of absolute attention contrast per figure; values beyond this threshold are saturated.

![Refer to caption](https://arxiv.org/html/2606.02939v1/figures/fig_routing_erpcore_n400_3ch.png)

Figure S10: Class-averaged attention routing and per-subject routing contrasts for N400 (3-channel configuration, n = 40 subjects). Top: mean attention weight per prototype for unrelated (black, solid) and related (gray, dashed) trials, with standard-error ribbons. Bottom: per-subject routing contrast (unrelated − related attention) for each prototype, with subjects sorted by AUROC (descending). Heatmap color scale clipped to the 95th percentile of absolute attention contrast per figure; values beyond this threshold are saturated.

![Refer to caption](https://arxiv.org/html/2606.02939v1/figures/fig_morphology_all_datasets_3ch.png)

Figure S11: Outcome-conditioned grand-average waveforms at the detection channel for all nine datasets (3-channel configuration), arranged in Table 2 order. Each panel shows two subplots: top, true positive (TP, correctly classified positive-class trials) and false negative (FN, missed positive-class trials); bottom, true negative (TN, correctly classified negative-class trials) and false positive (FP, falsely flagged negative-class trials). Waveforms are grand-averaged across all subjects within each dataset; per-subject TP$\leftrightarrow$FP correlations reported in Section 3.3 are lower because within-subject variability that separates the two conditions averages out across subjects. Dataset name, detection channel, and trial counts are shown per panel. The main-text analysis (Section 3.3, Figure 5) is based on HRI ErrP; this figure extends the comparison to all components. On most datasets, false positives morphologically resemble true positives more than true negatives do, consistent with the TP$\leftrightarrow$FP $>$ TP$\leftrightarrow$TN pattern reported in Table [S11](https://arxiv.org/html/2606.02939#Sx2.T11). Notable exceptions include LRP, where both correlations are similar, and MMN, where true negatives show an inverted waveform relative to true positives.
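The TP$\leftrightarrow$FP versus TP$\leftrightarrow$TN comparison in the Figure S11 caption can be illustrated with a short sketch for one held-out subject. This is an assumption-laden illustration, not the authors' pipeline: the function names are hypothetical, `epochs` is assumed to be a (trials, samples) array from the detection channel, labels and thresholded predictions are assumed to be coded 0/1, and the correlation is assumed to be a plain Pearson $r$ between outcome-averaged waveforms.

```python
import numpy as np


def outcome_mean_waveforms(epochs, y_true, y_pred):
    """Average single-channel epochs (trials x samples) within each outcome.

    Outcomes with no trials are omitted from the returned dictionary.
    """
    epochs = np.asarray(epochs, dtype=float)
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    masks = {
        "TP": y_true & y_pred,
        "FN": y_true & ~y_pred,
        "TN": ~y_true & ~y_pred,
        "FP": ~y_true & y_pred,
    }
    return {name: epochs[m].mean(axis=0) for name, m in masks.items() if m.any()}


def morphology_correlations(waves):
    """Pearson r of the TP waveform with the FP and with the TN waveform."""
    r_fp = np.corrcoef(waves["TP"], waves["FP"])[0, 1]
    r_tn = np.corrcoef(waves["TP"], waves["TN"])[0, 1]
    return float(r_fp), float(r_tn)


# Toy usage with synthetic data (not from the paper).
rng = np.random.default_rng(0)
toy_epochs = rng.normal(size=(200, 128))
toy_true = rng.integers(0, 2, size=200)
toy_pred = rng.integers(0, 2, size=200)
toy_r_fp, toy_r_tn = morphology_correlations(
    outcome_mean_waveforms(toy_epochs, toy_true, toy_pred)
)
```

Averaging the per-subject `r_fp` and `r_tn` values across folds would give quantities analogous to the r_FP and r_TN columns of Table S11; a pattern of `r_fp > r_tn` corresponds to false positives resembling true positives more than true negatives do.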