Efficient Temporal Modeling for Mobile Sleep Staging via Lightweight Random Attention
Summary
Introduces Random Attention (RA), a lightweight temporal modeling module for mobile sleep staging that uses fixed random projections for similarity-based aggregation, achieving competitive performance with minimal additional parameters.
# Efficient Temporal Modeling for Mobile Sleep Staging via Lightweight Random Attention
Source: [https://arxiv.org/html/2606.13694](https://arxiv.org/html/2606.13694)
###### Abstract
Mobile sleep staging serves as a foundational infrastructure for in\-home sleep monitoring and closed\-loop modulation\. However, existing sequential models such as RNNs and Transformers are computationally expensive for mobile deployment\. In this paper, we propose Random Attention \(RA\), a lightweight temporal modeling module based on fixed random projections, which replaces learnable sequence modeling with similarity\-based aggregation\. RA introduces few additional parameters beyond the epoch encoder while enabling effective temporal smoothing\. We further provide a theoretical interpretation via the Random Attention Prior Kernel \(RAPK\), which decomposes RA into a global smoothing term and a feature similarity term, offering an interpretable view of temporal sleep structure\. Experiments on Sleep\-EDF\-20 and Sleep\-EDF\-78 show that RA consistently improves epoch\-wise baselines by 1–3% in accuracy and F1 score, while achieving competitive performance compared with LSTM, GRU, and Transformer models\. RA also demonstrates strong generalization across different backbone encoders and improved robustness over conventional temporal smoothing methods\. These results indicate that efficient sleep staging can be achieved through lightweight similarity\-based temporal aggregation, making RA suitable for real\-time wearable applications\.
## I Introduction
Automatic sleep staging is fundamental to large\-scale sleep health monitoring, digital phenotyping, and closed\-loop neuromodulation\[[1](https://arxiv.org/html/2606.13694#bib.bib1),[2](https://arxiv.org/html/2606.13694#bib.bib2),[3](https://arxiv.org/html/2606.13694#bib.bib3)\]\. Although polysomnography \(PSG\) remains the clinical gold standard, its reliance on expensive equipment and labor\-intensive expert annotation limits scalability\. Recent advances in wearable sensing, particularly portable EEG devices, have enabled mobile sleep staging in home environments\[[4](https://arxiv.org/html/2606.13694#bib.bib4),[5](https://arxiv.org/html/2606.13694#bib.bib5),[6](https://arxiv.org/html/2606.13694#bib.bib6),[7](https://arxiv.org/html/2606.13694#bib.bib7)\]\. These systems support long\-term, real\-world sleep monitoring and real\-time closed\-loop intervention\. However, their deployment on mobile devices remains constrained by computational and energy budgets, limiting high\-accuracy sleep staging in practice\[[8](https://arxiv.org/html/2606.13694#bib.bib8)\]\.
Existing deep learning approaches can be broadly categorized into epoch\-wise modeling and sequence modeling\. Epoch\-wise methods process each 30\-second EEG epoch independently\[[8](https://arxiv.org/html/2606.13694#bib.bib8),[9](https://arxiv.org/html/2606.13694#bib.bib9),[10](https://arxiv.org/html/2606.13694#bib.bib10)\]\. While computationally efficient, they ignore the strong temporal continuity of sleep architecture, often resulting in unstable predictions that violate physiological transition patterns\. To address this limitation, sequence models such as LSTM and GRU have been widely adopted to capture temporal dependencies across neighboring epochs\[[11](https://arxiv.org/html/2606.13694#bib.bib11),[12](https://arxiv.org/html/2606.13694#bib.bib12),[13](https://arxiv.org/html/2606.13694#bib.bib13)\]\. More recently, Transformer\-based models have shown improved performance by modeling long\-range dependencies via self\-attention\[[14](https://arxiv.org/html/2606.13694#bib.bib14),[15](https://arxiv.org/html/2606.13694#bib.bib15)\]\. However, these gains come at the cost of increased computational complexity, memory usage, and inference latency, making them less suitable for resource\-constrained mobile applications\. This motivates a closer look at whether such modeling complexity is fundamentally required for sleep staging\.
Empirical evidence further suggests that extending the temporal context in conventional sequence models often yields only marginal or inconsistent performance gains\[[16](https://arxiv.org/html/2606.13694#bib.bib16),[17](https://arxiv.org/html/2606.13694#bib.bib17),[18](https://arxiv.org/html/2606.13694#bib.bib18),[19](https://arxiv.org/html/2606.13694#bib.bib19)\]; if long\-range dependency modeling were the primary driver, performance would be expected to improve systematically with larger temporal windows\. This indicates that the benefits of temporal modeling may stem more from enforcing local temporal consistency than from capturing complex long\-range interactions\. This observation is consistent with the physiological characteristics of sleep, where stage transitions are typically smooth, gradual, and highly redundant, with neighboring epochs sharing similar patterns\. Therefore, a fundamental question arises: is the assumption that sleep staging requires modeling complex long\-range dependencies truly justified, or can the temporal structure be more effectively captured by simpler smoothing mechanisms grounded in the physiological continuity of sleep stage transitions?
Our previous work with stochastic transformers suggests that sleep staging can be effectively interpreted as an adaptive smoothing process, where stochastic attention suppresses local noise while preserving meaningful transitions based on feature similarity\[[20](https://arxiv.org/html/2606.13694#bib.bib20)\]\. This perspective implies that performance gains primarily arise from enforcing temporal consistency rather than learning complex dependencies\.
Motivated by this insight, we propose a lightweight Random Attention \(RA\) mechanism for mobile sleep staging\. Instead of learning parameter\-intensive temporal dependencies, RA performs content\-aware temporal aggregation using fixed random projections, achieving efficient sequence modeling with minimal computational overhead\. This design explicitly leverages the physiological characteristics of sleep stage transitions, making it well\-suited for real\-time deployment on resource\-constrained devices\.
Extensive experiments on benchmark datasets demonstrate that the proposed method achieves performance comparable to conventional sequence models while significantly reducing model size and computational cost\. It also outperforms standard post\-processing smoothing methods in both robustness and peak performance\.
The main contributions of this paper are summarized as follows:
- We revisit temporal sleep staging from the perspective of adaptive smoothing, challenging the necessity of complex dependency modeling\.
- We propose a lightweight Random Attention mechanism that enables efficient, content\-aware temporal modeling\.
- We demonstrate through extensive experiments that the proposed method achieves competitive performance with improved efficiency and robustness compared to both sequence models and conventional smoothing approaches\.
## II Method
### II\-A Problem Definition
Given a sequence of EEG epochs $X=\{x_{1},x_{2},\dots,x_{T}\}$, where each $x_{t}\in\mathbb{R}^{C\times L}$ is a 30\-second segment \($L=3000$ at 100 Hz\), the goal is to predict the sequence of sleep stage labels $Y=\{y_{1},y_{2},\dots,y_{T}\}$, $y_{t}\in\{\mathrm{W},\mathrm{N1},\mathrm{N2},\mathrm{N3},\mathrm{REM}\}$\.
Most mobile systems first extract epoch\-level representations $Z=\{z_{1},\dots,z_{T}\}$, $z_{t}\in\mathbb{R}^{d}$, via a lightweight CNN encoder, then apply temporal modeling\. We replace costly LSTM/GRU/Transformer modules with a Random Attention \(RA\) module\.
### II\-B Random Attention
RA constructs a lightweight random attention matrix $A$ and aggregates features as $O=AZ$, where $A$ is never learned\. Instead, each epoch is projected into a fixed random low\-dimensional space:
$$Q=ZW_{Q},\quad K=ZW_{K},\quad W_{Q},W_{K}\in\mathbb{R}^{d\times d_{k}},$$
with entries sampled once at initialization and kept frozen thereafter\. The attention weights are then\[[21](https://arxiv.org/html/2606.13694#bib.bib21)\]
$$A=\text{softmax}\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)\quad\text{(row-wise)}.$$
Since $W_{Q}$ and $W_{K}$ are fixed, RA introduces almost no additional trainable parameters, and only the epoch encoder and final classifier are learned\.
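The aggregation described above (project with frozen matrices, apply a row-wise softmax, then compute $O=AZ$) can be sketched in a few lines. This is a minimal illustration in dependency-free Python, not the authors' implementation (the paper's experiments use PyTorch). The helper names `matmul`, `softmax_rows`, and `random_attention`, the toy sizes, and the plain `uniform(-1, 1)` sampling are all assumptions made for the example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_attention(z, w_q, w_k):
    """Return (O, A) with A = softmax(Q K^T / sqrt(d_k)) and O = A Z."""
    d_k = len(w_q[0])
    q = matmul(z, w_q)                      # T x d_k
    k = matmul(z, w_k)                      # T x d_k
    k_t = [list(col) for col in zip(*k)]    # d_k x T
    scale = math.sqrt(d_k)
    logits = [[v / scale for v in row] for row in matmul(q, k_t)]
    a = softmax_rows(logits)                # T x T, each row sums to 1
    return matmul(a, z), a


# Toy setup (hypothetical sizes): T = 4 epochs, d = 6 features, d_k = 3.
rng = random.Random(0)
T, d, d_k = 4, 6, 3
z = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
# Frozen projections: sampled once here and never updated afterwards.
w_q = [[rng.uniform(-1.0, 1.0) for _ in range(d_k)] for _ in range(d)]
w_k = [[rng.uniform(-1.0, 1.0) for _ in range(d_k)] for _ in range(d)]
o, a = random_attention(z, w_q, w_k)
```

Because every row of `a` is a set of nonnegative weights summing to one, each output epoch in `o` is a convex combination of the input epoch features, which is what makes the module behave as a smoother.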
Based on our prior study, the projection matrices are initialized using Xavier uniform initialization\[[22](https://arxiv.org/html/2606.13694#bib.bib22)\]:
$$W\sim\mathcal{U}\left(-\sqrt{\frac{6}{d+d_{k}}},\,\sqrt{\frac{6}{d+d_{k}}}\right).$$
Empirically, uniform schemes \(e\.g\., Xavier or Kaiming\) outperform Gaussian alternatives\. From a kernel perspective, Gaussian initialization compresses feature variance, causing attention logits to collapse toward zero and degenerating the model into near\-uniform averaging\. In contrast, uniform initialization preserves feature scale, keeping attention scores in a stable regime that balances smoothing and structure preservation, which is critical for capturing temporal dependencies in sleep stage transitions\.
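As a worked check of the Xavier uniform bound, the sketch below samples one projection matrix and compares its empirical variance with the closed form: a uniform distribution on $(-b,b)$ has variance $b^{2}/3$, so with $b=\sqrt{6/(d+d_{k})}$ the entry variance is $2/(d+d_{k})$. The encoder width `d = 128` is a hypothetical value chosen for illustration (the paper does not state it here); `d_k = 128` is the default reported in the experiments. The function name `xavier_uniform` is likewise an assumption.

```python
import math
import random


def xavier_uniform(d, d_k, rng):
    """Sample a d x d_k matrix from U(-b, b) with b = sqrt(6 / (d + d_k))."""
    bound = math.sqrt(6.0 / (d + d_k))
    weights = [[rng.uniform(-bound, bound) for _ in range(d_k)] for _ in range(d)]
    return weights, bound


# Hypothetical encoder width d = 128; d_k = 128 is the paper's default.
rng = random.Random(0)
w_q, bound = xavier_uniform(128, 128, rng)

flat = [v for row in w_q for v in row]
# Second moment around zero; the distribution is symmetric, so its mean is 0.
empirical_var = sum(v * v for v in flat) / len(flat)
# Closed form: Var[U(-b, b)] = b^2 / 3 = 2 / (d + d_k).
expected_var = bound ** 2 / 3.0
```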
### II\-C Theoretical Interpretation
The effectiveness of RA follows directly from the Random Attention Prior Kernel \(RAPK\) established in our prior work\[[20](https://arxiv.org/html/2606.13694#bib.bib20)\]\. In the high\-dimensional limit \($d_{k}\rightarrow\infty$\), the expected kernel converges to
$$\mathbb{E}[K_{\text{RAP}}]\approx C_{0}\mathbf{1}\mathbf{1}^{\top}+C_{1}ZZ^{\top},$$
where $ZZ^{\top}$ is the Gram matrix of epoch representations, and $C_{0},C_{1}$ are positive, sequence\-dependent scalars whose magnitudes are determined by the initialization variances and sequence length, and scale proportionally with the global statistics of the input features\.
This decomposition reveals that RA implements content\-aware temporal smoothing:
- The global term $C_{0}\mathbf{1}\mathbf{1}^{\top}$ enforces temporal inertia and suppresses high\-frequency noise\.
- The similarity term $C_{1}ZZ^{\top}$ adaptively weights interactions according to feature similarity\.
Consequently, the RAPK kernel combines a global averaging term and a content\-adaptive smoothing term, matching the physiology of sleep staging\.
The global term $C_{0}\mathbf{1}\mathbf{1}^{\top}$ captures sleep\-state inertia: sleep changes slowly, so adjacent epochs are likely to share the same stage\. It therefore applies a uniform averaging bias within the local window, suppressing isolated fluctuations and noisy predictions\.
However, uniform averaging alone would blur true stage transitions\. The content\-adaptive term $C_{1}ZZ^{\top}$ adjusts the smoothing strength according to feature similarity: epochs with similar representations are smoothed together, whereas epochs with dissimilar representations, typically those located across stage boundaries, interact much more weakly\. As a result, RAPK enforces strong smoothing within a stage while preserving meaningful transitions between stages\.
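The two-term decomposition can be illustrated on a toy sequence. The sketch below builds $K=C_{0}\mathbf{1}\mathbf{1}^{\top}+C_{1}ZZ^{\top}$ for six epochs drawn from two artificial "stages" and normalizes each row into averaging weights. The constants `c0` and `c1`, the two-dimensional features, and the row normalization step are assumptions made purely for illustration; the paper only states the expected kernel, not this normalization.

```python
def rapk_kernel(z, c0, c1):
    """Build K = c0 * 1 1^T + c1 * Z Z^T for epoch features z (list of rows)."""
    return [[c0 + c1 * sum(a * b for a, b in zip(zi, zj)) for zj in z] for zi in z]


def row_normalize(k):
    """Scale each row to sum to 1 so it acts as a set of averaging weights."""
    return [[v / sum(row) for v in row] for row in k]


# Toy sequence: three epochs from one stage, then three from another.
# Features are nonnegative so every kernel entry stays positive.
z = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1],
     [0.1, 1.0], [0.0, 0.9], [0.1, 1.1]]
c0, c1 = 0.2, 1.0          # hypothetical constants, chosen only for illustration

weights = row_normalize(rapk_kernel(z, c0, c1))
within = sum(weights[0][:3])     # weight epoch 0 assigns to its own stage
across = sum(weights[0][3:])     # weight epoch 0 assigns across the boundary

# Setting c1 = 0 leaves only the global term, i.e. plain uniform averaging.
uniform = row_normalize(rapk_kernel(z, c0, 0.0))
```

In this toy case the first epoch places most of its weight on epochs from its own stage, whereas the `c1 = 0` variant spreads weight equally over all six epochs and would blur the boundary.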
### II\-D Computational Complexity
Standard self\-attention has quadratic complexity in sequence length, requiring $\mathcal{O}(T^{2}d+Td^{2})$ computation and $\mathcal{O}(T^{2}+Td)$ memory, which becomes prohibitive for long EEG sequences\.
In contrast, RA replaces the explicit pairwise attention computation with a fixed random projection mechanism, eliminating the need to construct the full $T\times T$ attention matrix\. This reduces the computational cost to $\mathcal{O}(TdD_{k})$ and memory cost to $\mathcal{O}(TD_{k})$\. The absence of the feed\-forward network further reduces the overall complexity while preserving temporal interaction through low\-rank token mixing\.
The resulting structured projection avoids dense attention storage and enables full parallelization across time\. Compared with LSTM, GRU, and Transformer baselines, RA achieves substantially lower latency and memory consumption, making it well\-suited for real\-time wearable EEG inference scenarios\. Table [I](https://arxiv.org/html/2606.13694#S2.T1) quantitatively compares the computational and memory complexity of different models\.
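To make the asymptotic expressions quoted above concrete, the following sketch evaluates their leading-order terms for a few sequence lengths. Constants are dropped, so these are not FLOP or byte measurements. The feature width `d = 128` is a hypothetical value; `d_k = 128` is the default projection dimension reported in the experiments, and the function names are assumptions.

```python
def self_attention_cost(t, d):
    """Leading-order operation count T^2 d + T d^2 for standard self-attention."""
    return t * t * d + t * d * d


def self_attention_memory(t, d):
    """Leading-order memory T^2 + T d for standard self-attention."""
    return t * t + t * d


def ra_cost(t, d, d_k):
    """Leading-order operation count T d D_k quoted for RA."""
    return t * d * d_k


def ra_memory(t, d_k):
    """Leading-order memory T D_k quoted for RA."""
    return t * d_k


d, d_k = 128, 128   # d is hypothetical; 128 is the paper's default projection size
for t in (10, 100, 1000):
    ratio = self_attention_cost(t, d) / ra_cost(t, d, d_k)
    print(f"T={t:5d}  self-attention / RA operation ratio: {ratio:.2f}")
```

Under these expressions the ratio equals $T/D_{k}+d/D_{k}$, so the advantage is small for short windows and grows linearly with the sequence length.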
TABLE I: Computational complexity comparison\. $T$ is sequence length, $d$ is feature dimension, and $D_{k}$ is the random projection dimension in RA\.
TABLE II: Statistics of Sleep\-EDF datasets\.
## III Experiments
### III\-A Datasets
We evaluate on Sleep\-EDF\-20 and Sleep\-EDFX \(also referred to as Sleep\-EDF\-78\)\[[23](https://arxiv.org/html/2606.13694#bib.bib23)\]\[[24](https://arxiv.org/html/2606.13694#bib.bib24)\]\. Both contain overnight EEG recordings \(Fpz\-Cz channel, 100 Hz\) annotated into five stages \(W, N1, N2, N3, REM\)\. We follow standard subject\-independent cross\-validation after excluding movement/unknown epochs and merging S3/S4 into N3 per American Academy of Sleep Medicine guidelines\. Following prior work, only 30 minutes of wakefulness before sleep onset and after sleep termination are retained\[[25](https://arxiv.org/html/2606.13694#bib.bib25)\]\. For Sleep\-EDF\-20, we adopt 20\-fold cross\-validation, while for Sleep\-EDFX, we adopt 10\-fold cross\-validation for consistency with previous studies\. Table [II](https://arxiv.org/html/2606.13694#S2.T2) summarizes the statistics of the datasets used in this study\.
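The label preparation steps described above (dropping movement/unknown epochs, merging S3/S4 into N3, and keeping 30 minutes of wake around the sleep period) might be sketched as follows. The short annotation codes in `STAGE_MAP`, the function name `clean_hypnogram`, and the choice to filter epochs before trimming are assumptions for illustration; the actual Sleep-EDF annotation strings and the authors' exact ordering are not given in the text.

```python
# Hypothetical raw annotation codes; real Sleep-EDF files use different strings.
STAGE_MAP = {"W": "W", "S1": "N1", "S2": "N2", "S3": "N3", "S4": "N3", "R": "REM"}


def clean_hypnogram(raw_labels, epoch_sec=30, wake_margin_min=30):
    """Return (kept_indices, labels) after filtering, merging, and wake trimming."""
    # Drop movement/unknown epochs (anything not in STAGE_MAP); merge S3/S4 into N3.
    kept = [(i, STAGE_MAP[lab]) for i, lab in enumerate(raw_labels) if lab in STAGE_MAP]
    sleep_pos = [p for p, (_, lab) in enumerate(kept) if lab != "W"]
    if not sleep_pos:
        return [], []                      # no sleep scored: this sketch keeps nothing
    margin = wake_margin_min * 60 // epoch_sec      # 30 min -> 60 epochs of 30 s
    start = max(0, sleep_pos[0] - margin)
    end = min(len(kept), sleep_pos[-1] + 1 + margin)
    window = kept[start:end]
    return [i for i, _ in window], [lab for _, lab in window]


# Toy recording: long wake, a short sleep bout with one movement epoch, long wake.
raw = ["W"] * 100 + ["S1", "S2", "MOVE", "S3", "S4", "R"] + ["W"] * 100 + ["?"]
idx, labels = clean_hypnogram(raw)
```

The returned indices refer to positions in the original recording, so the same selection can be applied to the EEG epochs.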
### III\-B Implementation Details
We adopt a lightweight CNN\-based epoch encoder followed by a temporal modeling module for sleep staging\. Specifically, we build upon our previous mobile sleep staging model, MicroSleepNet\[[8](https://arxiv.org/html/2606.13694#bib.bib8)\], which consists of two components: \(1\) a group\-convolution\-based feature extraction encoder and \(2\) a dilated\-convolution\-based feature fusion module\. We consider two backbone settings\. The first, denoted as MicroSleepNet\_Encoder, contains only the feature extraction encoder\. The second, denoted as MicroSleepNet, includes the complete original architecture\. For RA, the default random projection dimension is $d_{k}=128$\. Baselines include: \(i\) epoch encoder only, \(ii\) LSTM\[[26](https://arxiv.org/html/2606.13694#bib.bib26)\], \(iii\) GRU\[[27](https://arxiv.org/html/2606.13694#bib.bib27)\], and \(iv\) a trainable Transformer\[[21](https://arxiv.org/html/2606.13694#bib.bib21)\]\. Each sample consists of a sliding window of 10 consecutive epochs\.
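The sliding-window sampling mentioned above can be sketched as index ranges over one recording. The stride of 1 and the per-recording application are assumptions, since the text only specifies the window length of 10 epochs; `sliding_windows` is a hypothetical helper name.

```python
def sliding_windows(n_epochs, window=10, stride=1):
    """Return index ranges, each covering `window` consecutive epochs."""
    if n_epochs < window:
        return []                          # recording too short for a single window
    return [range(s, s + window) for s in range(0, n_epochs - window + 1, stride)]


# A toy recording of 25 epochs yields 16 overlapping windows at stride 1.
windows = sliding_windows(25)
first, last = list(windows[0]), list(windows[-1])
```

Each range can be used to slice the per-recording array of epochs, so that no window spans two different recordings.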
Training is performed for 100 epochs using AdamW with an initial learning rate of $1\times 10^{-3}$, weight decay of $1\times 10^{-4}$, and batch size 20\. For the trainable Transformer baseline, Transformer layers and learnable positional embeddings are optimized with a reduced learning rate of $1\times 10^{-4}$\. We apply early stopping with a patience of 10 epochs, a 5\-epoch warmup schedule, and gradient clipping with a maximum norm of 2\.0\. Evaluation is conducted using overall accuracy, weighted F1, Cohen’s kappa, and per\-stage F1\.
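For reference, the evaluation metrics named above follow standard definitions: weighted F1 averages per-class F1 scores with weights proportional to class support, and Cohen's kappa is $(p_{o}-p_{e})/(1-p_{e})$, where $p_{o}$ is the observed agreement and $p_{e}$ the agreement expected by chance. The sketch below is a plain Python illustration with hypothetical function names and toy labels, not the evaluation code used in the paper.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of epochs whose predicted stage matches the reference."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred, label):
    """F1 = 2 TP / (2 TP + FP + FN) for one stage."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to class support."""
    support = Counter(y_true)
    n = len(y_true)
    return sum(per_class_f1(y_true, y_pred, lab) * cnt / n for lab, cnt in support.items())


def cohen_kappa(y_true, y_pred):
    """Chance-corrected agreement (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[lab] * pred_counts[lab] for lab in true_counts) / (n * n)
    # Kappa is undefined when p_e == 1; this sketch returns 0.0 in that case.
    return (p_o - p_e) / (1.0 - p_e) if p_e < 1.0 else 0.0


# Toy labels, for illustration only.
y_true = ["W", "W", "N1", "N2", "N2", "N2", "N3", "REM"]
y_pred = ["W", "N1", "N1", "N2", "N2", "N1", "N3", "REM"]
acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
```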
All experiments are implemented in PyTorch and conducted on a single NVIDIA RTX 3090 GPU \(24GB\)\. No additional signal preprocessing, data augmentation, or class balancing strategies are applied to ensure a fair comparison across models\.
TABLE III: Main results\. Baselines are highlighted in gray, and proposed RA variants are fully bolded\. Improvements over baselines are shown in parentheses\. Computational cost \(Params and MFLOPs\) is added\.
### III\-C Main Results
Table [III](https://arxiv.org/html/2606.13694#S3.T3) summarizes the full performance comparison of the proposed Random Attention \(RA\) against the epoch\-wise baseline and three strong sequence modeling baselines on both Sleep\-EDF\-20 and Sleep\-EDFX\. Results are reported for two MicroSleepNet variants: MicroSleepNet\_Encoder and the full MicroSleepNet model\.
On Sleep\-EDF\-20, the plain MicroSleepNet\_Encoder baseline achieves 81\.79% accuracy and 81\.62% weighted F1\. Adding the RA module improves performance to 83\.35% accuracy \(\+1\.56%\) and 83\.23% weighted F1 \(\+1\.61%\)\. A similar trend is observed on the full MicroSleepNet model, where RA increases accuracy from 82\.36% to 83\.70% and weighted F1 from 82\.10% to 83\.78%\. These improvements are achieved with negligible additional trainable parameters\. Compared with other temporal modeling variants, RA achieves comparable performance, with results generally lying within the same performance range as LSTM and GRU\-based designs, while remaining slightly below the best\-performing LSTM configuration\. It also remains below the best\-performing trainable Transformer, which achieves 84\.34% accuracy and 84\.16% weighted F1\. Overall, RA provides a favorable trade\-off between performance and efficiency, delivering consistent gains over encoder\-only baselines with minimal computational overhead\.
The trend is even more pronounced on the larger and more challenging Sleep\-EDFX dataset\. The MicroSleepNet\_Encoder baseline yields only 78\.32% accuracy and 76\.99% weighted F1\. RA improves these by \+2\.43% and \+2\.94%, respectively, reaching 80\.75% accuracy and 79\.93% weighted F1\. For the full MicroSleepNet model, RA lifts accuracy from 79\.20% to 81\.19% and weighted F1 from 78\.43% to 80\.78%\. On this dataset, RA achieves performance comparable to LSTM and GRU, and approaches the trainable Transformer \(81\.83% accuracy\)\. These consistent improvements across both backbones and both datasets demonstrate that the lightweight random attention mechanism delivers highly effective temporal smoothing that rivals or exceeds far more expensive sequence models\.
### III\-D Analysis of Different Sleep Stages
The per\-stage F1 scores in Table [III](https://arxiv.org/html/2606.13694#S3.T3) reveal that RA’s benefits are selective and clinically meaningful\. The largest and most consistent improvements occur on the challenging transitional stages N1 and REM, while performance on the more stable stages \(W, N2, N3\) remains strong or shows modest gains\. This pattern holds across both datasets and both MicroSleepNet backbones\.
On both datasets, RA yields its largest improvements on N1, increasing F1 by approximately 0\.06–0\.15 depending on the backbone\. REM also benefits substantially, with gains ranging from about 0\.04 to 0\.10\. In contrast, the more stable stages \(W, N2, and N3\) exhibit only modest improvements, typically around 0\.01\. These results align directly with the Random Attention Prior Kernel \(RAPK\) analysis: N1 epochs are short, unstable, and frequently confused with neighboring stages, yet they remain close in feature space to correct neighbors\. The similarity term $C_{1}ZZ^{\top}$ selectively strengthens these interactions, correcting noisy predictions without explicit transition modeling\. REM benefits analogously due to shared transitional EEG patterns\. Meanwhile, the global averaging term $C_{0}\mathbf{1}\mathbf{1}^{\top}$ provides sufficient regularization for stable stages without over\-smoothing discriminative boundaries\. Overall, RA reinforces the natural temporal continuity of sleep physiology, delivering competitive macro\-level performance at a fraction of the computational cost of conventional sequence models\.
### III\-E Effect of Random Projection Dimension $d_{k}$
We further ablate the random projection dimension $d_{k}$ \(default 128\) on Sleep\-EDF\-20\. Results are shown in Table [IV](https://arxiv.org/html/2606.13694#S3.T4)\. Performance improves monotonically with larger $d_{k}$, reaching a peak of 84\.28% accuracy at $d_{k}=512$\. This trend is theoretically predicted by RAPK: higher dimensionality yields a more accurate low\-variance approximation of the similarity kernel $C_{1}ZZ^{\top}$, resulting in more stable content\-aware smoothing\. However, excessively strong smoothing can slightly reduce sensitivity to transient stages such as N1\. This indicates a trade\-off between prediction smoothness and sensitivity to short transitional stages, which can be controlled by adjusting $d_{k}$\. Even at the modest default $d_{k}=128$, RA already delivers strong gains, making it highly practical for mobile deployment\.
TABLE IV: Ablation on random projection dimension $d_{k}$ \(MicroSleepNet backbone, Sleep\-EDF\-20\)\.
TABLE V: Generalization across epoch encoders \(Sleep\-EDF\-20\)\.

Figure 1: Comparison with different window sizes on Sleep\-EDF\-20 \(left\) and Sleep\-EDF\-78 \(right\)\.
### III\-F Generalization Across Different Epoch Encoders
To verify that RA is not tailored to a specific backbone, we evaluate it with four additional lightweight sleep staging encoders: DeepSleepNet\[[11](https://arxiv.org/html/2606.13694#bib.bib11)\], TinySleepNet\[[12](https://arxiv.org/html/2606.13694#bib.bib12)\], ULW\-SleepNet\[[9](https://arxiv.org/html/2606.13694#bib.bib9)\], and MSA\-CNN\[[10](https://arxiv.org/html/2606.13694#bib.bib10)\]\. For DeepSleepNet, we adopted the implementation from LPSGM and further enhanced feature extraction by expanding the output channels of the final convolutional layer to 256\[[28](https://arxiv.org/html/2606.13694#bib.bib28)\]\.
As shown in Table [V](https://arxiv.org/html/2606.13694#S3.T5), RA delivers consistent 2–3% absolute improvements in accuracy and weighted F1 across all backbones, with especially large gains in N1 F1 \(e\.g\., DeepSleepNet\_Encoder: \+0\.1547 and TinySleepNet\_Encoder: \+0\.1511\)\. Notably, the lightweight RA\-enhanced variants even outperform the original DeepSleepNet and TinySleepNet architectures equipped with their native sequence modeling modules\. For example, DeepSleepNet\_Encoder\_RA surpasses the original DeepSleepNet by \+0\.86% accuracy and \+0\.88% weighted F1, while TinySleepNet\_Encoder\_RA also exceeds TinySleepNet despite using a substantially lighter temporal modeling design\. These findings suggest that RA can capture sleep transition dynamics more effectively than conventional recurrent sequence modeling, while maintaining significantly lower computational complexity\. Overall, the results establish RA as a plug\-and\-play, encoder\-agnostic temporal smoothing module that can be seamlessly integrated into existing mobile sleep staging pipelines\.
### III\-G Comparison with Simple Temporal Smoothing and Different Window Sizes
We further compare RA with several non\-learnable temporal smoothing methods on Sleep\-EDF\-20 and Sleep\-EDF\-78, including majority voting, moving average, Kalman filtering, weighted averaging, Savitzky–Golay filtering, and Gaussian smoothing\. All methods are applied to the same epoch\-level predictions while varying the temporal window from 2 to 20 epochs\. For smoothing methods involving hyperparameters, we first perform grid search under a fixed window setting to identify the optimal configuration, ensuring that each baseline is evaluated under its best achievable performance\.
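Two of the non-learnable baselines listed above, majority voting and moving average, can be sketched as follows. The centered window (with `window // 2` epochs on each side), the tie-breaking rule, and the function names are assumptions for illustration; the paper does not specify these details.

```python
from collections import Counter


def majority_vote(preds, window=5):
    """Replace each label by the most frequent label within window // 2 epochs on each side."""
    half = window // 2
    smoothed = []
    for i, current in enumerate(preds):
        counts = Counter(preds[max(0, i - half): i + half + 1])
        best = max(counts.values())
        winners = [lab for lab, c in counts.items() if c == best]
        # On ties, keep the original label if it is among the winners.
        smoothed.append(current if current in winners else winners[0])
    return smoothed


def moving_average(probs, window=5):
    """Average per-class probabilities over a centered window, then take the argmax."""
    half = window // 2
    labels = []
    for i in range(len(probs)):
        block = probs[max(0, i - half): i + half + 1]
        mean = [sum(col) / len(block) for col in zip(*block)]
        labels.append(max(range(len(mean)), key=mean.__getitem__))
    return labels


# Toy predictions with one isolated N1 epoch inside an N2 run.
preds = ["N2", "N2", "N1", "N2", "N2", "N3", "N3", "N3"]
voted = majority_vote(preds, window=3)

# Toy class probabilities (3 classes) with one isolated flip at index 2.
probs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.4, 0.6, 0.0],
         [0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
averaged = moving_average(probs, window=3)
```

Both functions apply the same fixed weighting regardless of epoch content, which is the property the comparison with RA is meant to probe.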
As shown in Figure [1](https://arxiv.org/html/2606.13694#S3.F1), introducing even simple temporal smoothing consistently improves all methods over the non\-sequential baseline, confirming that local temporal context is beneficial for sleep staging\. Across both datasets, most conventional smoothing methods achieve their best performance at moderate window sizes \(typically 5–10 epochs\), after which performance saturates or declines\. The degradation is more pronounced for majority voting and Kalman filtering, where both F1 and accuracy decrease steadily with larger windows, while Gaussian and Savitzky–Golay filters remain relatively more robust but still exhibit mild performance drops beyond approximately 10 epochs\. On Sleep\-EDF\-78, a consistent but slightly lower overall performance level is observed compared with Sleep\-EDF\-20, while the relative trends across methods remain highly consistent\. In particular, the sensitivity of traditional smoothing approaches to increasing window size is still evident, with similar saturation behavior and subsequent degradation under overly large contexts\. In contrast, RA maintains stable performance across a wide range of window sizes on both datasets\. RA\-128 already achieves competitive results, and increasing the projection dimension to 256 or 512 yields consistent improvements, with RA\-512 achieving the best overall performance\. Compared with strong conventional smoothing methods, RA not only attains higher peak performance on both Sleep\-EDF\-20 and Sleep\-EDF\-78, but also demonstrates markedly reduced sensitivity to window size selection\. This behavior aligns with the RAPK interpretation, where conventional methods rely on fixed temporal weighting schemes, whereas RA adaptively aggregates information based on representation similarity\.
## IV Discussion
RA consistently improves performance on both Sleep\-EDF\-20 and Sleep\-EDF\-78 by 1–3% over the epoch\-wise baseline while introducing almost no additional trainable parameters\. Despite its simplicity, RA remains competitive with a fully trainable Transformer, trailing it by only a modest margin, and it often matches or outperforms LSTM and GRU\. This suggests that, for sleep staging, a substantial portion of useful temporal information can be effectively captured through lightweight similarity\-based aggregation, without relying on complex sequence modeling\. This behavior agrees with the RAPK interpretation\. RA combines a global smoothing term, which suppresses isolated prediction noise, with a similarity term, which propagates information between epochs that have similar EEG representations\. This simple content\-aware smoothing is sufficient to capture the dominant temporal structure\. The largest gains occur for the difficult transitional stages N1 and REM\. These stages are often confused locally, but their neighboring epochs usually remain close in feature space\. RA therefore improves them substantially by reinforcing consistent context\. Finally, RA provides similar improvements across multiple backbones and random projection dimensions, demonstrating that it is a lightweight, plug\-and\-play temporal modeling module suitable for real\-time sleep staging on wearable devices\.
## V Conclusion
This paper demonstrates that the primary requirement for mobile sleep staging is lightweight, similarity\-based temporal smoothing rather than complex sequence reasoning\. We introduce RA, a random attention mechanism grounded in Random Transformer and RAPK theory\. By constructing a structured smoothing kernel that balances global averaging and feature similarity, RA achieves efficient temporal aggregation\. Experiments across multiple datasets and encoders confirm that RA achieves performance comparable to traditional temporal modeling baselines while delivering superior efficiency\. Compared with a series of simple smoothing baselines, RA is also more stable and achieves higher peak performance\. The practical significance of RA lies in its ability to provide accurate temporal sleep staging directly on wearable devices with negligible computational overhead, thereby making long\-term home monitoring and future closed\-loop sleep modulation systems feasible\.
## Acknowledgment
The authors gratefully acknowledge the Brain\-Computer Modulation Laboratory of Southeast University for providing the computational resources used in this study\. The authors also gratefully acknowledge the financial support provided by the China Scholarship Council\.
## References
- \[1\]G\. Liu, J\. Zhang, Y\. Luo, G\. Wei, S\. Sun, S\. Deng, P\. Wei, and N\. Chen, “Sleep modulation: The challenge of transitioning from open loop to closed loop,” 2025\. \[Online\]\. Available: https://arxiv\.org/abs/2512\.03784
- \[2\]M\. J\. Esfahani, S\. Farboud, H\.\-V\. V\. Ngo, J\. Schneider, F\. D\. Weber, L\. M\. Talamini, and M\. Dresler, “Closed\-loop auditory stimulation of sleep slow oscillations: Basic principles and best practices,”*Neuroscience & Biobehavioral Reviews*, vol\. 153, p\. 105379, 2023\.
- \[3\]W\. G\. Coon, S\. J\. Nilsson, M\. T\. Smith, and M\. J\. Reid, “Acoustic stimulation and other emerging approaches to enhance sleep: design notes for the next generation of closed\-loop neurostimulation technology,”*Frontiers in Neuroscience*, vol\. Volume 19 \- 2025, 2026\.
- \[4\]W\. G\. Coon, P\. Zerr, G\. Milsap, N\. Sikder, M\. Smith, M\. Dresler, and M\. Reid, “ezscore\-f: A set of freely available, validated sleep stage classifiers for forehead EEG\.” \[Online\]\. Available: https://www\.biorxiv\.org/content/early/2025/07/17/2025\.06\.02\.657451
- \[5\]N\. Sikder, L\. Verkaar, A\. Paltarzhytskaya, S\. Acan, L\. Bovy, T\. Almazova, E\. Krugliakova, Y\. Rosenblum, M\. Krauledat, M\. Dresler, and P\. Zerr, “Wearanize\+: a multimodal dataset for evaluating wearable technologies in sleep research,”*SLEEP Advances*, vol\. 7, no\. 1, p\. zpaf094, 01 2026\. \[Online\]\. Available: https://doi\.org/10\.1093/sleepadvances/zpaf094
- \[6\]N\. Sikder, P\. Zerr, M\. J\. Esfahani, M\. Dresler, and M\. Krauledat, “eegfloss: A python package for refining sleep eeg recordings using machine learning models,” 2025\. \[Online\]\. Available: https://arxiv\.org/abs/2507\.06433
- \[7\]M\. J\. Esfahani, F\. D\. Weber, M\. Boon, S\. Anthes, T\. Almazova, M\. v\. Hal, Y\. Keuren, C\. Heuvelmans, E\. Simo, L\. Bovy, N\. Adelhöfer, M\. M\. t\. Avest, M\. Perslev, R\. t\. Horst, C\. Harous, T\. Sundelin, J\. Axelsson, and M\. Dresler, “Validation of the sleep eeg headband zmax,”*bioRxiv*, 2023\. \[Online\]\. Available: https://www\.biorxiv\.org/content/early/2023/08/21/2023\.08\.18\.553744
- \[8\]G\. Liu, G\. Wei, S\. Sun, D\. Mao, J\. Zhang, D\. Zhao, X\. Tian, X\. Wang, and N\. Chen, “MicroSleepNet: efficient deep learning model for mobile terminal real\-time sleep staging,”*Frontiers in Neuroscience*, vol\. 17, 2023\.
- \[9\]Z\. Wang, D\. Zhou, Q\. Xu, F\. Cong, M\. Al\-Sa’d, and J\. Raitoharju, “Ulw\-sleepnet: An ultra\-lightweight network for multimodal sleep stage scoring,” 2026\. \[Online\]\. Available: https://arxiv\.org/abs/2602\.23852
- \[10\]S\. Goerttler, Y\. Wang, E\. Eldele, F\. He, and M\. Wu, “Msa\-cnn: A lightweight multi\-scale cnn with attention for sleep stage classification,”*Biomedical Signal Processing and Control*, vol\. 120, p\. 110141, 2026\. \[Online\]\. Available: https://www\.sciencedirect\.com/science/article/pii/S1746809426006956
- \[11\]A\. Supratak, H\. Dong, C\. Wu, and Y\. Guo, “Deepsleepnet: A model for automatic sleep stage scoring based on raw single\-channel eeg,”*IEEE Transactions on Neural Systems and Rehabilitation Engineering*, vol\. 25, no\. 11, pp\. 1998–2008, 2017\.
- \[12\]A\. Supratak and Y\. Guo, “Tinysleepnet: An efficient deep learning model for sleep stage scoring based on raw single\-channel eeg,” in*2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society \(EMBC\)*, 2020, pp\. 641–644\.
- \[13\]H\. Phan, F\. Andreotti, N\. Cooray, O\. Y\. Chén, and M\. De Vos, “Seqsleepnet: End\-to\-end hierarchical recurrent neural network for sequence\-to\-sequence automatic sleep staging,”*IEEE Transactions on Neural Systems and Rehabilitation Engineering*, vol\. 27, no\. 3, pp\. 400–410, 2019\.
- \[14\]H\. Phan, K\. Mikkelsen, O\. Y\. Chén, P\. Koch, A\. Mertins, and M\. De Vos, “Sleeptransformer: Automatic sleep staging with interpretability and uncertainty quantification,”*IEEE Transactions on Biomedical Engineering*, vol\. 69, no\. 8, pp\. 2456–2467, 2022\.
- \[15\]H\. Lee, Y\. R\. Choi, H\. K\. Lee, J\. Jeong, J\. Hong, H\.\-W\. Shin, and H\.\-S\. Kim, “Explainable vision transformer for automatic visual sleep staging on multimodal PSG signals,”*npj Digital Medicine*, vol\. 8, no\. 1, p\. 55, 2025\. \[Online\]\. Available: https://doi\.org/10\.1038/s41746\-024\-01378\-0
- \[16\]J\. Pradeepkumar, M\. Anandakumar, V\. Kugathasan, D\. Suntharalingham, S\. L\. Kappel, A\. C\. De Silva, and C\. U\. S\. Edussooriya, “Toward interpretable sleep stage classification using cross\-modal transformers,”*IEEE Transactions on Neural Systems and Rehabilitation Engineering*, vol\. 32, pp\. 2893–2904, 2024\.
- \[17\]H\. Phan, K\. P\. Lorenzen, E\. Heremans, O\. Y\. Chén, M\. C\. Tran, P\. Koch, A\. Mertins, M\. Baumert, K\. B\. Mikkelsen, and M\. De Vos, “L\-seqsleepnet: Whole\-cycle long sequence modeling for automatic sleep staging,”*IEEE Journal of Biomedical and Health Informatics*, vol\. 27, no\. 10, pp\. 4748–4757, 2023\.
- \[18\]W\. G\. Coon and M\. Ogg, “Laying the foundation: Modern transformers for gold\-standard sleep analysis and beyond,” in*2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society \(EMBC\)*, 2024, pp\. 1–7\.
- \[19\]J\. G\. Ciudad, M\. Mørup, B\. R\. Kornum, and A\. N\. Zahid, “Evaluating the influence of temporal context on automatic mouse sleep staging through the application of human models,” in*2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society \(EMBC\)*, 2024, pp\. 1–4\.
- \[20\]G\. Liu, X\. Gao, M\. Dresler, J\. Zhang, and P\. Wei, “Rethinking random transformers as adaptive sequence smoothers for sleep staging,” 2026\. \[Online\]\. Available: https://arxiv\.org/abs/2605\.09905
- \[21\]A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, L\. Kaiser, and I\. Polosukhin, “Attention is all you need,” 2023\. \[Online\]\. Available: https://arxiv\.org/abs/1706\.03762
- \[22\]X\. Glorot and Y\. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in*Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics*, ser\. Proceedings of Machine Learning Research, Y\. W\. Teh and M\. Titterington, Eds\., vol\. 9\. Chia Laguna Resort, Sardinia, Italy: PMLR, 13–15 May 2010, pp\. 249–256\. \[Online\]\. Available: https://proceedings\.mlr\.press/v9/glorot10a\.html
- \[23\]A\. L\. Goldberger, L\. A\. N\. Amaral, L\. Glass, J\. M\. Hausdorff, P\. C\. Ivanov, R\. G\. Mark, J\. E\. Mietus, G\. B\. Moody, C\.\-K\. Peng, and H\. E\. Stanley, “Physiobank, physiotoolkit, and physionet,”*Circulation*, vol\. 101, no\. 23, pp\. e215–e220, 2000\. \[Online\]\. Available: https://www\.ahajournals\.org/doi/abs/10\.1161/01\.CIR\.101\.23\.e215
- \[24\]B\. Kemp, A\. Zwinderman, B\. Tuk, H\. Kamphuisen, and J\. Oberye, “Analysis of a sleep\-dependent neuronal feedback loop: the slow\-wave microcontinuity of the eeg,”*IEEE Transactions on Biomedical Engineering*, vol\. 47, no\. 9, pp\. 1185–1194, 2000\.
- \[25\]E\. Eldele, Z\. Chen, C\. Liu, M\. Wu, C\.\-K\. Kwoh, X\. Li, and C\. Guan, “An attention\-based deep learning approach for sleep stage classification with single\-channel eeg,”*IEEE Transactions on Neural Systems and Rehabilitation Engineering*, vol\. 29, pp\. 809–818, 2021\.
- \[26\]S\. Hochreiter and J\. Schmidhuber, “Long short\-term memory,”*Neural Comput\.*, vol\. 9, no\. 8, pp\. 1735–1780, Nov\. 1997\. \[Online\]\. Available: https://doi\.org/10\.1162/neco\.1997\.9\.8\.1735
- \[27\]J\. Chung, C\. Gulcehre, K\. Cho, and Y\. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” 2014\. \[Online\]\. Available: https://arxiv\.org/abs/1412\.3555
- \[28\]G\. Deng, M\. Niu, S\. Rao, Y\. Luo, J\. Zhang, J\. Xie, Z\. Yu, W\. Liu, J\. Zhang, S\. Zhao, G\. Pan, X\. Li, W\. Deng, W\. Guo, Y\. Zhang, T\. Li, and H\. Jiang, “A unified flexible large psg model for sleep staging and brain disorder diagnosis,”*medRxiv*, 2024\. \[Online\]\. Available: https://www\.medrxiv\.org/content/early/2025/11/27/2024\.12\.11\.24318815