Enabling Unsupervised Training of Deep EEG Denoisers With Intelligent Partitioning
Summary
This paper proposes Intelligent Partitioning for Self-supervised Denoising (iPSD), a method enabling unsupervised training of deep EEG denoisers by partitioning noisy segments without requiring clean reference data.
# Enabling Unsupervised Training of Deep EEG Denoisers With Intelligent Partitioning
Source: [https://arxiv.org/html/2605.06724](https://arxiv.org/html/2605.06724)
Qiyu Rao¹, Haozhe Tian², Homayoun Hamedmoghadam¹, Danilo Mandic¹

¹Department of Electrical and Electronic Engineering, Imperial College London

²Dyson School of Design Engineering, Imperial College London

{kianna.rao21, haozhe.tian21, h.hamed, danilo.mandic}@imperial.ac.uk
###### Abstract
Denoising wearable electroencephalogram (EEG) is inherently challenging since neural activity is not only subtle but also inseparable from spectrally overlapping noise artifacts. Classical signal processing methods, relying on fixed or heuristic rules, cannot handle the time-varying pervasive artifacts in wearable EEGs. Deep learning methods, on the other hand, show promise in decomposition-free EEG denoising using highly expressive neural networks, but the training requires artifact-free EEG, which is inherently unobtainable. To address this, we propose Intelligent Partitioning for Self-supervised Denoising (iPSD). Our method eliminates the need for clean references by learning to partition an input EEG segment into independent noisy realizations with the same underlying signal. This enables self-supervision of deep learning denoisers, even in zero-shot settings where only a single EEG segment to be denoised is available. We validate iPSD through extensive experiments, including validations on wearable EEG from in-ear sensors. The results show that iPSD achieves state-of-the-art performance, most notably under extremely low signal-to-noise ratios (down to $-10$ dB) and challenging artifacts (e.g., EMG), with spectral fidelity orders of magnitude higher than competitive baselines.
## 1 Introduction
Wearable technologies are transforming continuous health monitoring, with significant advances in electrocardiogram (ECG) \[[1](https://arxiv.org/html/2605.06724#bib.bib1)\] and photoplethysmogram (PPG) \[[2](https://arxiv.org/html/2605.06724#bib.bib2)\] tracking. However, progress in wearable electroencephalogram (EEG) monitoring has been comparatively slow: on top of the poor electrode-skin contact, motion artifacts, and physiological interference inherent to wearable recordings, the brain signals themselves are also subtle, broadband, and non-stationary \[[3](https://arxiv.org/html/2605.06724#bib.bib3),[4](https://arxiv.org/html/2605.06724#bib.bib4),[5](https://arxiv.org/html/2605.06724#bib.bib5)\].
A range of signal processing methods has been proposed to denoise EEG through decomposition and selective reconstruction\. The discrete wavelet transform decomposes the signal using wavelet bases and reconstructs a denoised version by thresholding wavelet coefficients\[[6](https://arxiv.org/html/2605.06724#bib.bib6),[7](https://arxiv.org/html/2605.06724#bib.bib7)\]\. However, this approach struggles when the noise spectrum overlaps with that of the EEG, as is the case for electromyographic \(EMG\) artifacts from muscle contractions, a prevalent noise source in wearable recordings\. Mode decomposition methods partially alleviate this by decomposing the signal into data\-driven mode functions\[[8](https://arxiv.org/html/2605.06724#bib.bib8),[9](https://arxiv.org/html/2605.06724#bib.bib9)\]\. However, they are sensitive to hyperparameter choices and abrupt perturbations common in wearable EEG\.
Deep learning (DL) offers a compelling alternative to decomposition-based denoising, as highly expressive neural networks can directly map a noisy EEG to its clean counterpart without any fixed or heuristic basis \[[10](https://arxiv.org/html/2605.06724#bib.bib10),[11](https://arxiv.org/html/2605.06724#bib.bib11),[12](https://arxiv.org/html/2605.06724#bib.bib12),[13](https://arxiv.org/html/2605.06724#bib.bib13)\]. However, supervised training of DL denoisers requires clean reference recordings, which are unattainable since neural signals are inseparable from physiological and environmental noise at the measurement point. Self-supervision, using only independent noisy realizations, has proven effective for denoising images with the hugely successful Noise2Noise (N2N) \[[14](https://arxiv.org/html/2605.06724#bib.bib14)\] method. N2N learns to restore the clean image by training a neural network to map one noisy realization of the image to another. Mansour and Heckel \[[15](https://arxiv.org/html/2605.06724#bib.bib15)\] later extended N2N to settings without multiple measurements of the same image via interleaved downsampling, which assigns alternating pixels to two sub-images to form a pair of noisy realizations. This heuristic, however, does not straightforwardly transfer to EEG: unlike locally smooth images, EEG has a rapid oscillatory structure that causes interleaved downsampling to yield realizations with distinct underlying clean signals, thus violating N2N's core assumption.
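To make this failure mode concrete, the following minimal sketch (not taken from the paper) applies interleaved downsampling to two noise-free sinusoids and measures how far apart the two resulting sub-signals are. The 2 Hz and 40 Hz test tones, the relative-mismatch measure, and the helper names `interleave_split` and `relative_mismatch` are illustrative assumptions.

```python
import math

def interleave_split(signal):
    """Assign alternating samples to two sub-signals (even and odd indices)."""
    return signal[0::2], signal[1::2]

def relative_mismatch(a, b):
    """Squared distance between two sub-signals, relative to the energy of the first."""
    num = sum((u - v) ** 2 for u, v in zip(a, b))
    den = sum(u ** 2 for u in a)
    return num / den

fs = 256  # sampling rate in Hz (assumed here; it matches the recordings used in Section 3)
t = [n / fs for n in range(fs)]  # one second of sample times

slow = [math.sin(2 * math.pi * 2 * ti) for ti in t]   # 2 Hz: locally smooth
fast = [math.sin(2 * math.pi * 40 * ti) for ti in t]  # 40 Hz: rapid oscillation

slow_mismatch = relative_mismatch(*interleave_split(slow))
fast_mismatch = relative_mismatch(*interleave_split(fast))
print(f"2 Hz mismatch: {slow_mismatch:.4f}, 40 Hz mismatch: {fast_mismatch:.4f}")
```

Even without any noise, the two halves of the fast tone differ substantially, so they cannot be treated as two noisy views of one clean signal, whereas the halves of the slow tone are nearly identical.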
To address the challenge of self-supervised restoration of clean EEG signals, we propose Intelligent Partitioning for Self-supervised Denoising (iPSD). The core idea is to learn to assign samples of a noisy input into two sub-signals with the same underlying clean signal. Rather than assigning samples via a predetermined rule, as in interleaved downsampling, intelligent partitioning learns a flexible, signal-specific assignment. The two sub-signals then act as a pair of noisy realizations for N2N-style self-supervised denoising. The intuition is that, for two sub-signals sharing the same clean signal but carrying independent noise, the noise in one is unpredictable from the other; so a denoiser network trained to map one to the other is driven to output their shared component, i.e., the clean signal. In iPSD, the partitioning is optimized via reinforcement learning (RL), jointly with the denoiser, using the negative of the converged denoising loss as the reward signal. We show in [Section 2.2](https://arxiv.org/html/2605.06724#S2.SS2) that, under reasonable assumptions, this drives the partitioning module toward sub-signals that are maximally informative of one another's underlying signal. We further develop a zero-shot variant, named iPSD-Zero, that recovers the clean signal from a single noisy segment without any prior training.
We evaluate iPSD against multiple baselines on EEG signals corrupted by White Gaussian Noise \(WGN\) and real\-world EMG artifacts—across noise levels up to ten times the signal power—as well as real noisy wearable EEG data from custom\-built in\-ear sensors\[[16](https://arxiv.org/html/2605.06724#bib.bib16),[17](https://arxiv.org/html/2605.06724#bib.bib17)\]\. Across all conditions, iPSD consistently outperforms other methods with spectral fidelity orders of magnitude higher than established baselines\. On a downstream sleep\-stage classification task, wearable EEG denoised by iPSD achieves accuracy comparable to clinical\-grade scalp EEG, underscoring the method’s practical value for accessible health monitoring\. To the best of our knowledge, iPSD is the first self\-supervised EEG denoising method to demonstrate such effectiveness across noise types, SNR levels, and real\-world wearable recordings\.
## 2 Methodology
Let $\mathbf{x}\in\mathbb{R}^{L}$ denote a clean EEG signal of length $L$, and let $\mathbf{s}=\mathbf{x}+\mathbf{n}$ be a noise-corrupted measurement of $\mathbf{x}$, where $\mathbf{n}\in\mathbb{R}^{L}$ denotes additive noise that is independent of $\mathbf{x}$, i.e., $\mathbb{E}\left[\mathbf{n}^{\top}\mathbf{x}\right]=0$. Denoising then refers to recovering the underlying clean signal $\mathbf{x}$ from its noisy measurement $\mathbf{s}$. We focus on deep-learning denoisers, which employ expressive neural networks to directly map the noisy $\mathbf{s}$ to its clean estimate $\hat{\mathbf{x}}$, bypassing the fixed or heuristic bases of classical signal processing.
### 2.1 iPSD formulation
In most practical scenarios, the clean reference $\mathbf{x}$ is unavailable, making supervised training of the denoiser infeasible. Unlike images or audio, where clean reference signals can be synthetically generated, EEG recordings are inherently contaminated. This is because the true neural signal $\mathbf{x}$ is inseparable from physiological and environmental noise at the point of acquisition, no matter how precise the measurement device. The design of iPSD is optimized for this fundamentally constrained setting, where the true target is physically inaccessible rather than merely scarce.
Figure 1: Schematic overview of iPSD. The partitioning module $\pi_{\psi}$ splits the input signal into two sub-signals, $\mathbf{s}^{l}$ and $\mathbf{s}^{r}$, which are denoised by the denoising module $f_{\theta}$. Training of $f_{\theta}$ uses the self-supervised N2N loss, $\left\|f_{\theta}(\mathbf{s}^{l})-\mathbf{s}^{r}\right\|_{2}^{2}+\left\|f_{\theta}(\mathbf{s}^{r})-\mathbf{s}^{l}\right\|_{2}^{2}$. The negative of the converged $f_{\theta}$ loss is then used as a reward to update $\pi_{\psi}$. We alternate the updates of $\pi_{\psi}$ and $f_{\theta}$ until convergence.

The schematic in [Fig. 1](https://arxiv.org/html/2605.06724#S2.F1) depicts the architecture of iPSD, which consists of two learnable modules: the *partitioning module* takes a noisy input $\mathbf{s}$ and uses a neural network $\pi_{\psi}$ parametrized by $\psi$ to generate a pair of noisy sub-signals $(\mathbf{s}^{l},\mathbf{s}^{r})$; and the *denoising module* uses a neural network $f_{\theta}$ parametrized by $\theta$ to map a noisy signal to its clean estimate. The parameters of the two neural networks are jointly optimized via the following self-supervised objective

$$\theta^{\star},\psi^{\star}=\operatorname*{argmin}_{\theta,\psi}\mathbb{E}_{(\mathbf{s}^{l},\mathbf{s}^{r})\sim\pi_{\psi}(\cdot\mid\mathbf{s})}\left[\left\|f_{\theta}(\mathbf{s}^{l})-\mathbf{s}^{r}\right\|_{2}^{2}+\left\|f_{\theta}(\mathbf{s}^{r})-\mathbf{s}^{l}\right\|_{2}^{2}\right]. \tag{1}$$
#### 2.1.1 Self-supervised denoising
The objective in [Eq. 1](https://arxiv.org/html/2605.06724#S2.E1) trains the denoising model $f_{\theta}$ using only the noisy sub-signals $(\mathbf{s}^{l},\mathbf{s}^{r})$, with no clean reference required. This self-supervision is valid when the two sub-signals share the same underlying clean signal: the independent noise components in $\mathbf{s}^{l}$ and $\mathbf{s}^{r}$ are each unpredictable from the other, so $f_{\theta}$ trained to map one sub-signal to the other is driven to output their shared component, i.e., the clean signal. The following theorem, adapted from \[[15](https://arxiv.org/html/2605.06724#bib.bib15)\], formalizes this intuition:
###### Theorem 1\.
Suppose $(\mathbf{s}^{l},\mathbf{s}^{r})\sim\pi_{\psi}(\cdot\mid\mathbf{s})$ are independent noisy realizations of the same underlying signal $\mathbf{x}$. Then the $\theta^{\star}$ that optimizes the self-supervised loss in [Eq. 1](https://arxiv.org/html/2605.06724#S2.E1) also minimizes the expected $L_2$ distance between the network outputs and the ground-truth $\mathbf{x}$, i.e.,

$$\theta^{\star}=\operatorname*{argmin}_{\theta}\,\mathbb{E}\Big[\big\|f_{\theta}(\mathbf{s}^{l})-\mathbf{x}\big\|_{2}^{2}+\big\|f_{\theta}(\mathbf{s}^{r})-\mathbf{x}\big\|_{2}^{2}\Big].$$
The proof of [Theorem 1](https://arxiv.org/html/2605.06724#Thmtheorem1) is provided in [Section A](https://arxiv.org/html/2605.06724#A0.SS1). This theorem shows that, given the partition $(\mathbf{s}^{l},\mathbf{s}^{r})$, $f_{\theta}$ can be optimized via gradient descent on the loss

$$\mathcal{L}(\theta)=\left\|f_{\theta}(\mathbf{s}^{l})-\mathbf{s}^{r}\right\|_{2}^{2}+\left\|f_{\theta}(\mathbf{s}^{r})-\mathbf{s}^{l}\right\|_{2}^{2}. \tag{2}$$

Crucially, the validity of this self-supervised training scheme hinges on the two sub-signals $(\mathbf{s}^{l},\mathbf{s}^{r})$ sharing the same underlying clean signal. Achieving this is particularly challenging for EEG, whose non-stationarity and fast oscillatory structure demand a flexible, targeted partitioning strategy. This motivates the learnable partitioning module at the core of iPSD.
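The reason a noisy target can stand in for the clean one can be sketched in a few lines. This is only an informal outline, not the proof in Section A, and it assumes that the noise in the target sub-signal is zero-mean given the input sub-signal and the clean signal. Writing $\mathbf{s}^{r}=\mathbf{x}+\mathbf{n}^{r}$ and expanding one term of the loss gives

$$\mathbb{E}\big\|f_{\theta}(\mathbf{s}^{l})-\mathbf{s}^{r}\big\|_{2}^{2}=\mathbb{E}\big\|f_{\theta}(\mathbf{s}^{l})-\mathbf{x}\big\|_{2}^{2}-2\,\mathbb{E}\big[(f_{\theta}(\mathbf{s}^{l})-\mathbf{x})^{\top}\mathbf{n}^{r}\big]+\mathbb{E}\big\|\mathbf{n}^{r}\big\|_{2}^{2}.$$

Under the stated assumption, $\mathbb{E}[\mathbf{n}^{r}\mid\mathbf{s}^{l},\mathbf{x}]=\mathbf{0}$, so the cross term vanishes, and the last term does not depend on $\theta$. The noisy-target loss and the clean-target loss therefore differ by a constant and share the same minimizer; the same argument applies with the roles of $\mathbf{s}^{l}$ and $\mathbf{s}^{r}$ swapped.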
#### 2.1.2 Intelligent partitioning
Here, we describe the partitioning module that obtains the pair $(\mathbf{s}^{l},\mathbf{s}^{r})$ from the input noisy signal $\mathbf{s}$ (see [Fig. 2](https://arxiv.org/html/2605.06724#S2.F2)). Two ideas are combined here: (i) the input is segmented into small windows where the underlying signal is approximately stationary, and (ii) the model $\pi_{\psi}$ learns how to best partition each window to optimize the denoising performance by stochastically exploring all candidate partitions.
Formally, the input $\mathbf{s}\in\mathbb{R}^{L}$, with index set $I=\{1,\cdots,L\}$, is segmented into non-overlapping local windows of an appropriate even length $W$ (that divides $L$). The index set of the $k$-th window is $I^{(k)}=\{(k-1)W+1,\cdots,kW\}$, with $k=1,\cdots,L/W$. The partitioning model $\pi_{\psi}$ learns to extract from each $I^{(k)}$ (corresponding to window $k$) a subset $I^{(k),l}$, constrained to cardinality $W/2$ (and thus its complement $I^{(k),r}=I^{(k)}\setminus I^{(k),l}$). We use $(\mathbf{s}^{l},\mathbf{s}^{r})\sim\pi_{\psi}(\cdot\mid\mathbf{s})$ to denote the extraction of the signal pair $(\mathbf{s}^{l},\mathbf{s}^{r})$ according to the stochastic policy $\pi_{\psi}:\mathbb{R}^{L}\rightarrow\mathcal{P}(I^{l})$ by

$$\mathbf{s}^{l}=[\mathbf{s}_{i}]_{i\in I^{l}}\in\mathbb{R}^{L/2},\qquad\mathbf{s}^{r}=[\mathbf{s}_{i}]_{i\in I^{r}}\in\mathbb{R}^{L/2}, \tag{3}$$

where $I^{l}=\bigcup_{k=1}^{L/W}I^{(k),l}$ and $I^{r}=\bigcup_{k=1}^{L/W}I^{(k),r}$.
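The index bookkeeping in Eq. (3) can be illustrated with a short sketch. It is not the authors' implementation: the function name `partition_signal`, the representation of each choice as a tuple of within-window offsets, and the toy signal are assumptions made for illustration, and the code uses 0-based offsets where the text uses 1-based indices.

```python
def partition_signal(s, W, left_positions):
    """Split s into (s_l, s_r) using one within-window choice per window.

    left_positions[k] holds the W/2 within-window offsets (0..W-1) routed to
    the left sub-signal in window k; the remaining offsets go to the right.
    """
    L = len(s)
    assert W % 2 == 0 and L % W == 0, "W must be even and divide L"
    assert len(left_positions) == L // W, "one choice per window is required"
    s_l, s_r = [], []
    for k, left in enumerate(left_positions):
        left = set(left)
        assert len(left) == W // 2, "each subset must have cardinality W/2"
        for offset in range(W):
            sample = s[k * W + offset]
            # Samples keep their original temporal order inside each sub-signal.
            (s_l if offset in left else s_r).append(sample)
    return s_l, s_r

s = list(range(16))  # toy "signal" whose values equal their 0-based positions
# Interleaved choice in the first window, "first half" choice in the second.
choices = [(0, 2, 4, 6), (0, 1, 2, 3)]
s_l, s_r = partition_signal(s, W=8, left_positions=choices)
print(s_l)
print(s_r)
```

Because every window contributes exactly $W/2$ samples to each side, both sub-signals always have length $L/2$, whatever the per-window choices are.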
Figure 2: Workflow of the partitioning module. The input signal $\mathbf{s}$ of length $L$ is reshaped into $L/W$ non-overlapping windows of length $W$, each admitting $\tbinom{W}{W/2}/2$ candidate partitions into two equal-length subsets. The model $\pi_{\psi}$ selects a partition for each window, and the corresponding subsets are concatenated across windows to form the sub-signal pair. The denoising module $f_{\theta}$ then processes the sub-signals and provides feedback used to optimize $\pi_{\psi}$ (see [Fig. 1](https://arxiv.org/html/2605.06724#S2.F1)).

###### Proposition 1.
For any signal window $I^{(k)}$, the within-window partitioning strategy used in iPSD can always achieve a partition that is at least as good as the interleaved partition used in \[[15](https://arxiv.org/html/2605.06724#bib.bib15)\], in terms of the mismatch between the underlying clean components of the sub-signals $(\mathbf{s}_{I^{(k),l}},\mathbf{s}_{I^{(k),r}})$.
We provide the detailed proof of [Proposition 1](https://arxiv.org/html/2605.06724#Thmproposition1) in [Section B](https://arxiv.org/html/2605.06724#A0.SS2).
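The intuition behind Proposition 1 is that the interleaved split is itself one of the candidate within-window partitions, so the best candidate can never be worse. The brute-force check below illustrates this on a single clean window. The squared-distance mismatch, the 40 Hz test tone, and the name `clean_mismatch` are assumptions for illustration; the formal mismatch definition is the one in Section B.

```python
import math
from itertools import combinations

def clean_mismatch(x_win, left):
    """Squared distance between the clean samples routed left and right.

    Both subsets are read in increasing index order, as in Eq. (3).
    """
    left_set = set(left)
    right = [i for i in range(len(x_win)) if i not in left_set]
    return sum((x_win[i] - x_win[j]) ** 2 for i, j in zip(sorted(left), right))

W = 8     # window length (the value reported in Section 3)
fs = 256  # sampling rate in Hz
# One clean window of a 40 Hz oscillation (hypothetical test signal).
x_win = [math.sin(2 * math.pi * 40 * n / fs) for n in range(W)]

interleaved = tuple(range(0, W, 2))                 # offsets (0, 2, 4, 6)
candidates = list(combinations(range(W), W // 2))   # all balanced left subsets
best = min(candidates, key=lambda left: clean_mismatch(x_win, left))

interleaved_mismatch = clean_mismatch(x_win, interleaved)
best_mismatch = clean_mismatch(x_win, best)
print(best, best_mismatch, interleaved_mismatch)
```

Enumerating ordered subsets visits each unordered partition twice (once per mirror image), which is harmless here since the mismatch is symmetric in the two sides.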
### 2.2 Training iPSD
The combinatorial nature of the partitioning policy $\pi_{\psi}$ renders the objective in [Eq. 1](https://arxiv.org/html/2605.06724#S2.E1) non-differentiable with respect to the parameters $\psi$. To address this, iPSD uses an alternating optimization scheme, where $f_{\theta}$ is optimized via gradient descent and $\pi_{\psi}$ via RL.
Let $\mathcal{S}$ denote a set of noisy training signals. In each iPSD training iteration, we sample $\mathbf{s}\in\mathcal{S}$ and use $\pi_{\psi}$ to partition it into $(\mathbf{s}^{l},\mathbf{s}^{r})$. We then update $f_{\theta}$ via gradient descent on the loss in [Eq. 2](https://arxiv.org/html/2605.06724#S2.E2) until convergence, defined as the variance of loss values in the last $10$ steps falling below $10^{-6}$. The negative of the converged denoising loss serves as the reward signal for the RL-based optimization of $\pi_{\psi}$. To reduce reward variance, we apply an averaging scheme. Suppose $\theta$ converges in $N$ steps, and let $J^{d}_{n}$ denote the value of $\mathcal{L}(\theta)$ at step $n$. The reward $R(\mathbf{s}^{l},\mathbf{s}^{r})$ is then defined as:

$$R(\mathbf{s}^{l},\mathbf{s}^{r})\;=\;-\frac{1}{10}\sum_{n=N-9}^{N}J_{n}^{d}\,. \tag{4}$$

In [Section C](https://arxiv.org/html/2605.06724#A0.SS3), we show that under reasonable assumptions, this RL reward formulation drives $\pi_{\psi}$ toward $\mathbf{s}^{l}$ and $\mathbf{s}^{r}$ that are maximally informative of each other's underlying signal.
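The stopping rule and the averaged reward in Eq. (4) translate into a few lines of bookkeeping over the recorded loss values. The sketch below is an illustration rather than the authors' code: the names `has_converged` and `reward`, the use of the population variance, and the synthetic loss curve are assumptions.

```python
from statistics import pvariance

WINDOW = 10  # number of trailing loss values inspected
TOL = 1e-6   # variance threshold for declaring convergence

def has_converged(losses):
    """True once the variance of the last WINDOW loss values drops below TOL."""
    # Population variance is an assumption; the text does not say which estimator is used.
    return len(losses) >= WINDOW and pvariance(losses[-WINDOW:]) < TOL

def reward(losses):
    """Eq. (4): negative mean of the last WINDOW denoising-loss values."""
    tail = losses[-WINDOW:]
    return -sum(tail) / len(tail)

# Hypothetical loss curve: geometric decay toward a floor of 0.5,
# standing in for the losses J_n^d recorded while training f_theta.
losses = []
step = 0
while not has_converged(losses):
    losses.append(0.5 + 0.9 ** step)
    step += 1

R = reward(losses)
print(f"converged after {len(losses)} steps, reward = {R:.4f}")
```

Averaging over the trailing window means a single noisy final loss value does not dominate the reward passed to the policy update.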
Algorithm 1: Training iPSD

1: Initialization: Partitioning model $\pi_{\psi}$; training set $\mathcal{S}$; learning rates $\eta^{\theta}$, $\eta^{\psi}$.

2: while not converged do

3: Sample $\mathbf{s}\in\mathcal{S}$ and partition $\mathbf{s}$ into $(\mathbf{s}^{l},\mathbf{s}^{r})\sim\pi_{\psi}(\cdot\mid\mathbf{s})$

4: Initialize $f_{\theta}$; set $N=0$

5: while $\theta$ not converged do

6: $J^{d}_{n}=\left\|f_{\theta}(\mathbf{s}^{l})-\mathbf{s}^{r}\right\|_{2}^{2}+\left\|f_{\theta}(\mathbf{s}^{r})-\mathbf{s}^{l}\right\|_{2}^{2}$

7: $\theta\leftarrow\theta-\eta^{\theta}\nabla_{\theta}J^{d}_{n}$; $N=N+1$

8: end while

9: $\psi\leftarrow\psi+\eta^{\psi}\nabla_{\psi}\log\pi_{\psi}(\mathbf{s}^{l},\mathbf{s}^{r}\mid\mathbf{s})\,R(\mathbf{s}^{l},\mathbf{s}^{r})$, where $R(\mathbf{s}^{l},\mathbf{s}^{r})=-\frac{1}{10}\sum_{n=N-9}^{N}J_{n}^{d}$

10: end while

11: Note: In practice, a batch of partitions is sampled in parallel and $\psi$ is updated with the average policy gradient.
Using the scalar reward $R(\mathbf{s}^{l},\mathbf{s}^{r})$ enables gradient-based optimization of the partitioning policy $\pi_{\psi}$. To find $\pi_{\psi}$ that maximizes the reward, we apply the policy gradient theorem (see [Section D](https://arxiv.org/html/2605.06724#A0.SS4)), which shows that the gradient of the expected reward with respect to $\psi$ can be written as

$$\nabla_{\psi}\,\mathbb{E}_{(\mathbf{s}^{l},\mathbf{s}^{r})}\bigl[R(\mathbf{s}^{l},\mathbf{s}^{r})\bigr]=\mathbb{E}_{(\mathbf{s}^{l},\mathbf{s}^{r})}\!\left[\nabla_{\psi}\log\pi_{\psi}(\mathbf{s}^{l},\mathbf{s}^{r}\mid\mathbf{s})\,R(\mathbf{s}^{l},\mathbf{s}^{r})\right]. \tag{5}$$

We therefore update $\pi_{\psi}$ via gradient ascent on the right-hand side of [Eq. 5](https://arxiv.org/html/2605.06724#S2.E5). [Algorithm 1](https://arxiv.org/html/2605.06724#alg1) summarizes the full self-supervised denoising procedure. In practice, we employ Proximal Policy Optimization (PPO) \[[18](https://arxiv.org/html/2605.06724#bib.bib18)\] and sample a batch of partitions in parallel to further reduce the variance in the observed $R(\mathbf{s}^{l},\mathbf{s}^{r})$, ensuring stable and efficient policy updates.
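The identity in Eq. (5) can be checked numerically on a toy problem. The sketch below replaces the GRU policy with a plain softmax over three candidate partitions (an assumption made purely for illustration; it is neither the paper's policy network nor PPO) and compares the score-function expression against a finite-difference gradient of the expected reward. The logits and reward values are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, rewards):
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def score_function_gradient(logits, rewards):
    """Right-hand side of Eq. (5), with the expectation taken exactly.

    For a softmax policy, d log pi(a) / d logit_j = 1[a == j] - pi(j).
    """
    probs = softmax(logits)
    K = len(logits)
    grad = [0.0] * K
    for a in range(K):
        for j in range(K):
            dlogp = (1.0 if a == j else 0.0) - probs[j]
            grad[j] += probs[a] * dlogp * rewards[a]
    return grad

def finite_difference_gradient(logits, rewards, eps=1e-6):
    """Left-hand side of Eq. (5), approximated by central differences."""
    grad = []
    for j in range(len(logits)):
        up = list(logits)
        up[j] += eps
        down = list(logits)
        down[j] -= eps
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2 * eps))
    return grad

logits = [0.2, -0.4, 1.0]     # hypothetical policy parameters (playing the role of psi)
rewards = [-3.0, -1.0, -2.0]  # hypothetical rewards, one per candidate partition
lhs = finite_difference_gradient(logits, rewards)
rhs = score_function_gradient(logits, rewards)
print(lhs)
print(rhs)
```

In training, the exact expectation is replaced by an average over sampled partitions, which is why batching partitions in parallel reduces the variance of the update.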
Zero-shot setting. As an extension, we develop iPSD-Zero, a variant of iPSD designed for the challenging *zero-shot* setting, where a clean signal is recovered directly from a single noisy recording without any prior training. This capability is particularly valuable for wearable EEG, where instant deployment is desirable, yet large variabilities in individual physiology, noise characteristics, and hardware make it impractical to form a representative training set before deployment. One significant challenge when only the test signal $\mathbf{s}$ is available is to rapidly identify the optimal partition from a large search space. We address this by enforcing a shared partitioning strategy across all localized windows, reducing the search space to $\binom{W}{W/2}/2$ candidates. Selecting the optimal partition can be naturally formulated as a multi-armed bandit problem, where each candidate partition corresponds to an *arm*. Pulling each arm yields a reward, defined as in [Eq. 4](https://arxiv.org/html/2605.06724#S2.E4), which is stochastic due to the inherent randomness of training $f_{\theta}$. The objective of iPSD-Zero is to identify the best arm, measured by the expected reward, with high probability using as few trials as possible; to this end, we employ the lil' UCB algorithm \[[19](https://arxiv.org/html/2605.06724#bib.bib19)\], whose tight confidence bounds make it well-suited to the sample-limited zero-shot setting. By iteratively pulling different arms, iPSD-Zero gradually builds empirical estimates of each arm's expected reward and eventually identifies the arm whose corresponding partition of $\mathbf{s}$ best allows $f_{\theta}$ to recover the clean signal. Further details are provided in [Section E](https://arxiv.org/html/2605.06724#A0.SS5).
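The size of the zero-shot search space is easy to verify by enumeration. The sketch below lists the arms for a window of length $W=8$ (the value reported in Section 3); the function name `candidate_partitions` and the trick of pinning offset 0 to the left half are illustrative assumptions, and the bandit algorithm itself is not reproduced here.

```python
from itertools import combinations
from math import comb

def candidate_partitions(W):
    """Enumerate unordered splits of offsets 0..W-1 into two halves of size W/2.

    Pinning offset 0 to the left half removes the left/right mirror duplicates,
    which are equivalent because the loss in Eq. (2) is symmetric in the pair.
    """
    arms = []
    for others in combinations(range(1, W), W // 2 - 1):
        left = (0,) + others
        right = tuple(i for i in range(W) if i not in left)
        arms.append((left, right))
    return arms

W = 8
arms = candidate_partitions(W)
n_arms = len(arms)
print(n_arms, comb(W, W // 2) // 2)
```

Each arm is a single offset pattern reused in every window, so evaluating one arm amounts to training $f_{\theta}$ once on the resulting pair and recording the reward of Eq. (4).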
## 3 Experiments
We begin with some notes on the implementation of iPSD (the code will be released publicly upon acceptance). For the partitioning module, we use a window length of 8 samples, chosen by ablation ([Section F](https://arxiv.org/html/2605.06724#A0.SS6)) to achieve sufficient partitioning flexibility while keeping the search space tractable. The policy network $\pi_{\psi}$ uses a bidirectional GRU architecture, which outperforms alternatives including MLP, U-Net, and RNN (see [Section F](https://arxiv.org/html/2605.06724#A0.SS6) for detailed comparisons). We train $\pi_{\psi}$ via PPO using the Adam optimizer with a learning rate of $10^{-4}$, a batch size of 64, and a total of $20{,}000$ update steps; other hyperparameters follow \[[20](https://arxiv.org/html/2605.06724#bib.bib20)\].
For the denoising module, we use the loss function $\mathcal{L}(\theta)$ in [Eq. 2](https://arxiv.org/html/2605.06724#S2.E2) together with two regularization terms proposed in \[[15](https://arxiv.org/html/2605.06724#bib.bib15)\]. Given a partition $(\mathbf{s}^{l},\mathbf{s}^{r})\sim\pi_{\psi}(\cdot\mid\mathbf{s})$, we define functions $\pi^{l},\pi^{r}:\mathbb{R}^{L}\rightarrow\mathbb{R}^{L/2}$ such that $\mathbf{s}^{l}=\pi^{l}(\mathbf{s})$ and $\mathbf{s}^{r}=\pi^{r}(\mathbf{s})$. The full loss function for updating $f_{\theta}$ is

$$\mathcal{L}_{\text{full}}(\theta)=\mathcal{L}(\theta)+\left\|f_{\theta}(\pi^{l}(\mathbf{s}))-\pi^{l}(f_{\theta}(\mathbf{s}))\right\|_{2}^{2}+\left\|f_{\theta}(\pi^{r}(\mathbf{s}))-\pi^{r}(f_{\theta}(\mathbf{s}))\right\|_{2}^{2}, \tag{6}$$

where the two regularization terms encourage iPSD's output to obey a consistency property that the underlying clean signal should satisfy: partitioning and then denoising should yield the same result as denoising and then partitioning. Since $f_{\theta}$ must accept inputs of varying length in $\mathcal{L}_{\text{full}}(\theta)$ (both the full signal of length $L$ and sub-signals of length $L/2$), we implement it as a lightweight fully convolutional network with three layers and LeakyReLU activations, totaling approximately $7$k parameters. The channel dimensions progress as $1\rightarrow 48\rightarrow 48\rightarrow 1$, with all layers using a kernel size of 3 and padding of 1. The compact architecture of $f_{\theta}$ leads to fast convergence (in around 200 epochs) and permits parallel training of multiple instances, allowing the partitioning policy $\pi_{\psi}$ to quickly explore diverse strategies and identify high-quality partitions.
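The stated parameter budget follows directly from the channel progression and kernel size. The count below assumes standard 1-D convolutions with one bias term per output channel, which the text does not state explicitly; the helper name `conv1d_params` is hypothetical.

```python
def conv1d_params(in_ch, out_ch, kernel, bias=True):
    """Weights (in_ch * out_ch * kernel) plus, optionally, one bias per output channel."""
    return in_ch * out_ch * kernel + (out_ch if bias else 0)

channels = [1, 48, 48, 1]  # channel progression 1 -> 48 -> 48 -> 1
kernel = 3

per_layer = [conv1d_params(c_in, c_out, kernel)
             for c_in, c_out in zip(channels, channels[1:])]
total = sum(per_layer)
print(per_layer, total)  # [192, 6960, 145] 7297
```

Almost all of the parameters sit in the middle $48\rightarrow 48$ layer; dropping the bias terms would give 7,200 instead, so the figure is roughly 7k either way.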
### 3.1 Experiments on synthetic data
Here, we benchmark iPSD against state-of-the-art baselines spanning wavelet-based \[[7](https://arxiv.org/html/2605.06724#bib.bib7),[21](https://arxiv.org/html/2605.06724#bib.bib21)\], mode decomposition-based \[[22](https://arxiv.org/html/2605.06724#bib.bib22),[23](https://arxiv.org/html/2605.06724#bib.bib23),[24](https://arxiv.org/html/2605.06724#bib.bib24)\], hybrid \[[25](https://arxiv.org/html/2605.06724#bib.bib25)\], and self-supervised DL \[[26](https://arxiv.org/html/2605.06724#bib.bib26)\] approaches (see a summary in [Section G](https://arxiv.org/html/2605.06724#A0.SS7)). To quantitatively evaluate performance, we use three complementary metrics standard in signal denoising: signal-to-noise ratio (SNR), peak signal-to-noise ratio (PSNR), and the mean squared error of the power spectrum (Spectral MSE). SNR provides a global measure of time-domain fidelity, PSNR emphasizes robustness to large transient spikes, and Spectral MSE assesses spectral fidelity, which is critical because EEG analysis depends heavily on spectral features such as relative band power, peak frequency, and spectral entropy. Together, these three metrics provide a balanced evaluation covering both the temporal and spectral domains. The formal definitions of all metrics are provided in [Section H](https://arxiv.org/html/2605.06724#A0.SS8). Note that the clean signal $\mathbf{x}$ is used solely for metric computation and is not accessible to any denoising method.
We use EEG signals from the CHB-MIT Scalp EEG Database \[[27](https://arxiv.org/html/2605.06724#bib.bib27),[28](https://arxiv.org/html/2605.06724#bib.bib28)\], which was collected at the Children's Hospital Boston and contains over 900 h of long-term scalp EEG recordings from 22 pediatric patients (5 males aged 3–22 years; 17 females aged 1.5–19 years) with intractable seizures. All signals are sampled at $256$ Hz with 16-bit resolution, using the international 10–20 electrode placement system with a bipolar montage. To generate synthetic data, we first split all recordings into 10-second segments, which are long enough for iPSD and the baselines to capture meaningful neural rhythms \[[29](https://arxiv.org/html/2605.06724#bib.bib29)\]. To create contaminated EEG signals, we introduce two types of noise: WGN and real-world EMG artifacts. The WGN is generated by sampling random values from a Gaussian distribution, while the EMG artifacts are drawn from a real-world dataset \[[30](https://arxiv.org/html/2605.06724#bib.bib30)\] comprising EMG artifacts recorded from 15 healthy participants (8 female, 7 male; aged 26–57) performing facial movements including chewing, smiling, lip puckering, and frowning. Since the EMG artifacts were recorded at a different sampling rate, we downsample them to $256$ Hz before adding them to the EEG. The synthetic dataset was divided into training and test sets with an 80/20 split.
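Contaminating a segment at a prescribed input SNR comes down to rescaling the noise before adding it. The paper does not spell out its mixing procedure, so the sketch below shows one common way to do it, assuming SNR is defined as $10\log_{10}$ of the ratio of mean-square powers; the names `mix_at_snr` and `measure_snr_db` and the sinusoidal stand-in for an EEG segment are hypothetical.

```python
import math
import random

def power(v):
    """Mean-square power of a sequence of samples."""
    return sum(a * a for a in v) / len(v)

def mix_at_snr(clean, noise, target_db):
    """Scale `noise` so that 10*log10(P_clean / P_noise) equals target_db, then add it."""
    scale = math.sqrt(power(clean) / (power(noise) * 10 ** (target_db / 10)))
    scaled = [scale * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled)]
    return noisy, scaled

def measure_snr_db(clean, noise):
    return 10 * math.log10(power(clean) / power(noise))

rng = random.Random(0)  # fixed seed so the example is reproducible
fs, seconds = 256, 10   # 10-second segment at 256 Hz, as in the text
# A 10 Hz sinusoid stands in for a clean EEG segment (illustrative only).
clean = [math.sin(2 * math.pi * 10 * n / fs) for n in range(fs * seconds)]
wgn = [rng.gauss(0.0, 1.0) for _ in clean]

noisy, scaled_noise = mix_at_snr(clean, wgn, target_db=-5.0)
achieved_db = measure_snr_db(clean, scaled_noise)
print(f"achieved input SNR: {achieved_db:.2f} dB")
```

The same scaling applies to a recorded EMG artifact once it has been resampled to the EEG sampling rate and cropped to the segment length.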
Table 1: Comparison of denoising performance.

[Table 1](https://arxiv.org/html/2605.06724#S3.T1) compares the denoising performance of iPSD and its zero-shot variant, iPSD-Zero, against the baselines at input SNR levels of $-5$ dB and $0$ dB. (The zero-shot iPSD is optimized directly on each test signal.) The proposed iPSD consistently and significantly outperforms all baselines across all evaluation metrics, under both WGN and EMG contamination. Compared to the strongest baseline, Optimal-WT, iPSD consistently achieves better denoising performance across all noise conditions, with up to 3.3 dB higher output SNR, corresponding to more than $53\%$ lower residual noise power after denoising. We also observe that iPSD is more robust under low-SNR conditions (at $-5$ dB, signal power is only $\approx 32\%$ of noise power), where the baselines show limited denoising capability, with several outputting negative SNRs. Most notably, while baseline spectral MSEs are on the order of thousands under WGN noise and hundreds under EMG noise, the spectral MSE of iPSD remains consistently below 60. (To provide an intuitive view of the quantitative results above, we present visual comparisons of iPSD against Optimal-WT in [Section J](https://arxiv.org/html/2605.06724#A0.SS10), which highlight the strong spectral fidelity of iPSD.)
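The two percentages quoted above follow from the decibel definition, assuming SNR is $10\log_{10}$ of a power ratio and that the two methods being compared share the same reference signal power:

$$\frac{P_{\text{signal}}}{P_{\text{noise}}}\bigg|_{-5\,\text{dB}}=10^{-5/10}\approx 0.316,\qquad \frac{P_{\text{residual}}^{\text{iPSD}}}{P_{\text{residual}}^{\text{baseline}}}=10^{-3.3/10}\approx 0.468\;\Rightarrow\;1-0.468\approx 53\%.$$

In words, a 3.3 dB gain in output SNR leaves a little under half of the baseline's residual noise power, and at $-5$ dB input SNR the noise carries roughly three times the power of the signal.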
Figure 3: Performance over different noise levels. The results compare iPSD with Optimal-WT and EMD-LoG in the presence of WGN (top) and EMG (bottom) noise in terms of output SNR.

The results in [Table 1](https://arxiv.org/html/2605.06724#S3.T1) highlight the benefit of the learning-based iPSD over fixed or heuristic partitioning strategies. Performance of the baselines varies across noise types: mode decomposition-based methods (CEEMDAN and EMD-LoG) are less effective for WGN, due to its highly unstructured temporal form, whereas wavelet-based methods (Optimal-WT and FrWT) are less effective for EMG, due to its substantial spectral overlap with EEG activity. The iPSD method, in contrast, maintains robust performance under these challenging conditions. We also note that iPSD-Zero achieves competitive performance, outperforming all baseline methods and ranking second only to the original iPSD. This slight performance degradation comes with a huge saving in computational cost. Starting from scratch, iPSD-Zero denoises a 10-second EEG signal in 5 seconds, making it suitable for real-time and resource-constrained scenarios where only the incoming signal is accessible.
[Figure 3](https://arxiv.org/html/2605.06724#S3.F3) compares the output SNR of iPSD against two representative baselines across input SNR levels ranging from $-10$ dB to $10$ dB. The baselines are Optimal-WT from the wavelet-based family and EMD-LoG from the mode decomposition-based family. Across all noise levels, iPSD outperforms both baselines. We also observe that the baselines' output SNR gradually plateaus as input SNR increases, whereas iPSD maintains a steady increase throughout the range. These results demonstrate that iPSD generalizes well to both stationary and nonstationary artifacts and remains robust under critically low SNR.
### 3.2 Experiments on real-world data
Both iPSD and Optimal-WT were also evaluated on real-world EEG recordings collected at the Surrey Sleep Research Centre, using our custom-built wearable sensors. (The study was conducted in accordance with the Declaration of Helsinki and Good Clinical Practice, and received ethical approval from the University of Surrey Ethics Committee (UEC-2019-065-FHMS). All participants gave written informed consent beforehand.) The dataset comprised 684 hours of recordings from 37 participants (aged 65–83 years; 17 females, 20 males), sampled at $256$ Hz with 16-bit resolution and segmented into 10-second windows. [Figure 4](https://arxiv.org/html/2605.06724#S3.F4)a illustrates the in-ear sensor used for data collection. To maintain stable skin contact and minimise motion artifacts, the in-ear sensor uses a viscoelastic foam earbud with a stretchable, low-impedance cloth electrode on its outer surface. We also applied conductive gel to the electrode before inserting the sensor into the ear canal to further improve the skin-electrode contact.
Figure 4: Real-world in-ear EEG denoising with iPSD. (a) Wearable EEG acquisition setup using the in-ear sensor, which integrates a viscoelastic earbud with a stretchable soft-cloth electrode. (b) Noisy in-ear EEG (gray) and the denoised output from Optimal-WT (blue) and iPSD-Zero (pink). (c) PSDs of the noisy in-ear EEG (gray), the Optimal-WT output (blue), and the iPSD-Zero output (pink). The key frequency bands for EEG analysis, $\delta$ (0.5–4 Hz), $\theta$ (4–8 Hz), $\alpha$ (8–12 Hz), $\beta$ (12–30 Hz), and $\gamma$ (30–80 Hz), are color-coded in the background.

[Figure 4](https://arxiv.org/html/2605.06724#S3.F4)b and [4](https://arxiv.org/html/2605.06724#S3.F4)c show representative noisy in-ear EEG recordings, together with the denoising outputs of Optimal-WT and iPSD-Zero. While both iPSD and iPSD-Zero are applicable here, we only show iPSD-Zero in the figure: it operates under the more constrained zero-shot setting, yet produces outputs that are visually indistinguishable from those of iPSD. [Figure 4](https://arxiv.org/html/2605.06724#S3.F4)b shows time-domain waveforms. The Optimal-WT outputs, although visually smooth, substantially corrupt the original EEG signal. For example, the spindle-like waveform between 7–8.5 s in the lower panel of [Fig. 4](https://arxiv.org/html/2605.06724#S3.F4)b (In-ear EEG #2), characteristic of non-REM sleep, is corrupted by Optimal-WT but preserved by iPSD-Zero. The corresponding power spectral densities (PSDs) in [Fig. 4](https://arxiv.org/html/2605.06724#S3.F4)c further support that Optimal-WT is oversmoothing: it exhibits spurious dips at multiple frequencies, indicating that genuine spectral content is being corrupted. In contrast, iPSD-Zero effectively suppresses the spiky, fluctuating noise artifacts while preserving the underlying EEG.
Since the in-ear EEG was recorded during sleep, its spectral power is expected to concentrate below 30 Hz, with the elevated high-frequency power reflecting noise contamination. As shown, iPSD-Zero suppresses the high-frequency noise while preserving the low-frequency EEG components. This visual comparison is consistent with the quantitative results in [Table 1](https://arxiv.org/html/2605.06724#S3.T1), where iPSD substantially outperforms the baselines in spectral fidelity.
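The statement that sleep EEG power concentrates below 30 Hz can be checked on any segment by computing the fraction of spectral power in a band. The sketch below is illustrative only: it uses a plain periodogram over non-negative frequencies (the usual one-sided doubling is omitted because it cancels in the ratio here), and the toy signal, the function names, and the choice of estimator are assumptions rather than the PSD procedure used for Figure 4.

```python
import cmath
import math


def periodogram(x, fs):
    """Periodogram over non-negative frequencies via a direct DFT."""
    n = len(x)
    freqs, psd = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        freqs.append(k * fs / n)
        psd.append(abs(coeff) ** 2 / (fs * n))
    return freqs, psd


def band_fraction(freqs, psd, lo, hi):
    """Fraction of total power falling in the band [lo, hi) Hz."""
    total = sum(psd)
    in_band = sum(p for f, p in zip(freqs, psd) if lo <= f < hi)
    return in_band / total


fs = 256      # sampling rate in Hz, as reported for the in-ear recordings
n = 2 * fs    # a short 2-second window keeps the direct DFT cheap
# Toy signal: a strong 10 Hz rhythm plus weaker 60 Hz contamination.
x = [math.sin(2 * math.pi * 10 * t / fs) + 0.3 * math.sin(2 * math.pi * 60 * t / fs)
     for t in range(n)]

freqs, psd = periodogram(x, fs)
low_fraction = band_fraction(freqs, psd, 0.5, 30.0)
```

For this toy signal the two tones carry power in the ratio $1 : 0.09$, so `low_fraction` is close to $1/1.09$. A denoiser that removes the 60 Hz component would push it toward 1.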
Figure 5: Confusion matrices for sleep-stage classification using the output of the baseline Optimal-WT and the proposed iPSD. Classification results using concurrently recorded high-quality scalp EEG are also included as a reference performance target. The recall and precision (RE/PR) for each sleep stage are shown on the right of each row.

To demonstrate the practical value of iPSD, we evaluate it on sleep stage classification, a cornerstone task in automatic sleep monitoring. Classification is performed on 3,563 in-ear EEG segments, each annotated by experts into one of five sleep stages: Wake, N1, N2, N3, and REM. We train three XGBoost classifiers using a standard set of features \[[31](https://arxiv.org/html/2605.06724#bib.bib31)\] extracted from (i) in-ear EEG denoised by the baseline Optimal-WT, (ii) in-ear EEG denoised by iPSD, and (iii) concurrently recorded scalp EEG (single channel: C3-M2). The scalp EEG, recorded with the clinical-grade SomnoHD system, is of high quality and serves as a performance reference that the denoising methods aim to match. [Figure 5](https://arxiv.org/html/2605.06724#S3.F5) shows the resulting confusion matrices. We observe that classification based on iPSD substantially outperforms that based on Optimal-WT (accuracy 87.0%, Cohen’s $\kappa = 0.827$ vs. 79.8%, $\kappa = 0.733$), with the largest gains in sleep stages characterised by higher-frequency activity (Wake, N1, and REM). This pattern aligns with [Fig. 4](https://arxiv.org/html/2605.06724#S3.F4)c, where iPSD shows superior spectral fidelity above 4 Hz. N1 is consistently the most difficult stage to classify, but iPSD’s recall and precision substantially exceed those of Optimal-WT and match the scalp EEG, indicating that the difficulty is inherent to N1 classification \[[32](https://arxiv.org/html/2605.06724#bib.bib32)\] with EEG alone rather than a limitation of denoising.
Overall, the per\-class metrics of iPSD\-denoised in\-ear EEG fall within a few points of scalp EEG across all stages, demonstrating that iPSD recovers clinically meaningful information from the noisy wearable recordings at a level comparable to clinical\-grade recordings\.
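For readers reproducing this kind of comparison, the accuracy, Cohen's $\kappa$, and per-stage recall and precision reported above all follow from a confusion matrix. The sketch below uses the standard definitions. The function name `classification_metrics` is hypothetical and the 3-class counts are made up for illustration; they are not results from the paper.

```python
def classification_metrics(cm):
    """cm[i][j] = number of segments with true stage i predicted as stage j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(cm[i]) for i in range(k)]
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]

    accuracy = sum(cm[i][i] for i in range(k)) / total
    # Agreement expected by chance, from the two marginal distributions.
    expected = sum(row_sums[i] * col_sums[i] for i in range(k)) / total ** 2
    kappa = (accuracy - expected) / (1.0 - expected)

    recall = [cm[i][i] / row_sums[i] for i in range(k)]
    precision = [cm[i][i] / col_sums[i] for i in range(k)]
    return accuracy, kappa, recall, precision


# Made-up 3-class counts for illustration only (not results from the paper).
toy_cm = [
    [40, 5, 5],
    [4, 30, 6],
    [6, 5, 49],
]
accuracy, kappa, recall, precision = classification_metrics(toy_cm)
```

Because $\kappa$ discounts chance agreement, it is lower than accuracy whenever the stage distribution lets a trivial classifier score some correct predictions, which is why both are reported.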
### 3.3 Ablation study on intelligent partition
Here, the importance of the proposed learnable partitioning $\pi_\psi$ is demonstrated through an ablation study. Specifically, we introduce a baseline with $\pi_\psi$ replaced by interleaved downsampling (ID), a rule-based partitioning that assigns alternating samples to $(\mathbf{s}^l, \mathbf{s}^r)$ and has proven highly effective for self-supervised image denoising \[[15](https://arxiv.org/html/2605.06724#bib.bib15)\]. Experiments are conducted on the same synthetic EEG signals described in [Section 3.1](https://arxiv.org/html/2605.06724#S3.SS1), contaminated by WGN and EMG artifacts at input SNR levels of $-5$ dB and $0$ dB. [Table 2](https://arxiv.org/html/2605.06724#S3.T2) summarizes the denoising performance achieved by all three methods (iPSD, iPSD-Zero, and the ID baseline) across experiments.
Table 2: Ablation of the iPSD partitioning strategy: learned $\pi_\psi$ vs. interleaved downsampling (ID).

We observe that the intelligent partitioning module (both in iPSD and iPSD-Zero) consistently outperforms the ID baseline under all noise conditions. Under low-SNR conditions (at $-5$ dB), iPSD achieves 1.39 dB higher output SNR under WGN contamination and 2.01 dB higher output SNR under EMG contamination. Despite working in the constrained zero-shot setting, iPSD-Zero only slightly underperforms iPSD under EMG noise, which can be attributed to its more constrained partitioning. Nonetheless, it substantially outperforms the ID baseline, by at least 0.63 dB under WGN and 1.00 dB under EMG. The results confirm that intelligent partitioning successfully constructs sub-signal pairs that are more effective for training $f_\theta$, leading to stronger denoising performance.
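The rule-based ID baseline is simple enough to sketch directly. The snippet below splits a segment into alternating samples and evaluates a symmetric loss of the form $\|f_\theta(\mathbf{s}^l)-\mathbf{s}^r\|_2^2+\|f_\theta(\mathbf{s}^r)-\mathbf{s}^l\|_2^2$, the form analysed in Appendix A (any normalization is omitted here). The function names and the identity "denoiser" are placeholders introduced for illustration; the learned partitioning $\pi_\psi$ itself is not reproduced.

```python
def interleaved_partition(s):
    """Rule-based ID split: even-indexed samples left, odd-indexed samples right."""
    return s[0::2], s[1::2]


def n2n_pair_loss(f, s_left, s_right):
    """Symmetric self-supervised loss: each sub-signal predicts the other."""
    out_l, out_r = f(s_left), f(s_right)
    loss_lr = sum((a - b) ** 2 for a, b in zip(out_l, s_right))
    loss_rl = sum((a - b) ** 2 for a, b in zip(out_r, s_left))
    return loss_lr + loss_rl


def identity(x):
    """Placeholder denoiser; a trained network would take its place."""
    return list(x)


segment = [0.1, 0.4, 0.2, 0.5, 0.3, 0.6]
s_l, s_r = interleaved_partition(segment)
loss = n2n_pair_loss(identity, s_l, s_r)
```

On this fast-alternating toy segment the two ID sub-signals differ by 0.3 at every position, which hints at why a fixed interleaved rule can pair poorly matched sub-signals when the signal oscillates quickly.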
## 4 Related Work
Conventional scalp EEG processing widely uses band\-pass and notch filters, which have limited use for low\-SNR wearable EEG due to its substantial spectral overlap with prominent artifacts \(EMG and EOG\)\[[33](https://arxiv.org/html/2605.06724#bib.bib33)\]\. To handle the localized and nonstationary nature of these artifacts, wavelet\-based methods decompose the EEG signal to attenuate or discard components that correlate weakly with a predefined set of wavelet bases before reconstructing the denoised signal\[[6](https://arxiv.org/html/2605.06724#bib.bib6)\]\. However, performance of these methods is sensitive to signal\-specific choices of wavelet basis, thresholding strategy, and decomposition depth\[[34](https://arxiv.org/html/2605.06724#bib.bib34)\]\. Mode decomposition methods address the limitation of fixed wavelet bases by decomposing signals into data\-driven mode functions\[[8](https://arxiv.org/html/2605.06724#bib.bib8),[9](https://arxiv.org/html/2605.06724#bib.bib9)\], but they rely either on perturbation\-sensitive local extrema and signal envelopes, or on optimization problems whose solutions depend strongly on hyperparameter choices\. Despite the improvements by hybrid methods such as VMD\-wavelet and WPT\-ICA\[[35](https://arxiv.org/html/2605.06724#bib.bib35),[25](https://arxiv.org/html/2605.06724#bib.bib25)\], these approaches remain heavily reliant on handcrafted bases and heuristic decomposition strategies that do not generalize well\.
Recent advances in DL have opened a promising new avenue for EEG denoising. Highly expressive neural architectures allow direct mapping of noisy EEG signals to their clean counterparts, eliminating the need for handcrafted decomposition strategies. Zhang et al. \[[36](https://arxiv.org/html/2605.06724#bib.bib36)\] show that DL-based models can learn hierarchical representations, such as features characterizing rhythmic oscillations and waveform morphology, strengthening the denoising performance. Despite their effectiveness, however, these approaches require a large collection of paired clean-noisy EEG data for training, which is unobtainable in practice. Noise2Noise (N2N) \[[14](https://arxiv.org/html/2605.06724#bib.bib14),[37](https://arxiv.org/html/2605.06724#bib.bib37)\] offers a compelling self-supervised image denoising approach based on learning to map independent noisy realizations of a signal to one another. Yet N2N requires paired noisy measurements of the same signal, which are virtually impossible to obtain for EEG because each recording is a unique, unrepeatable observation. Subsequent works \[[38](https://arxiv.org/html/2605.06724#bib.bib38),[39](https://arxiv.org/html/2605.06724#bib.bib39),[15](https://arxiv.org/html/2605.06724#bib.bib15),[26](https://arxiv.org/html/2605.06724#bib.bib26)\] seek to learn from a single noisy observation by relying on rule-based partitioning or masking strategies, which are not suited to non-stationary, fast-oscillating EEG signals. In contrast, iPSD learns to generate paired noisy realizations sharing the same underlying clean signal via flexible partitioning of a single measurement.
## 5 Conclusion
The proposed iPSD (and its zero-shot variant, iPSD-Zero) enables training of highly expressive deep learning EEG denoisers without clean ground-truth signals. Across extensive experiments on synthetic and real EEG data, iPSD and iPSD-Zero consistently outperformed state-of-the-art baselines with markedly higher spectral fidelity. The value of denoising is in empowering tasks that depend on high-quality signals. We demonstrated this in a sleep-stage classification task, where the application of iPSD raised classification accuracy to the standard of scalp EEG. These results suggest that the methods we developed here can effectively narrow the gap between accessible wearable measurements and clinical-grade recordings. Moreover, iPSD-Zero’s ability to denoise a single signal without prior training makes it deployable in real time and on device, bringing continuous, accessible neural monitoring and diagnosis closer to reality. Our method is broadly applicable to denoising temporal signals where clean references are unavailable. A limitation is the assumption that noise components in the noisy sub-signals are mutually independent and independent of the signal. Although this independence assumption is shared by most methods that denoise without a clean reference, our experiments on real-world data demonstrated that iPSD remains effective under realistic noise conditions where the assumption may not be strictly satisfied.
## Acknowledgment
HT acknowledges financial support from IBM through an IBM PhD Fellowship\. Authors acknowledge financial support from Imperial College London through an Imperial College Research Fellowship grant awarded to HH\.
## References
- Tian et al\. \[2025\]Haozhe Tian, Qiyu Rao, Nina Moutonnet, Pietro Ferraro, and Danilo Mandic\.Machine intelligence on the edge: Interpretable cardiac pattern localisation using reinforcement learning\.*arXiv preprint arXiv:2508\.21652*, 2025\.
- Zhang et al\. \[2023\]Liping Zhang, Anzi Li, Shukai Chen, Wei Ren, and Kim\-Kwang Raymond Choo\.A secure, flexible, and PPG\-based biometric scheme for healthy IoT using homomorphic random forest\.*IEEE Internet of Things Journal*, 11\(1\):612–622, 2023\.
- Chu et al\. \[2021\]Lei Chu, Ling Pei, and Robert Qiu\.Ahed: A heterogeneous\-domain deep learning model for IoT\-enabled smart health with few\-labeled EEG data\.*IEEE Internet of Things Journal*, 8\(23\):16787–16800, 2021\.
- Huhn et al\. \[2022\]Sophie Huhn et al\.The impact of wearable technologies in health research: Scoping review\.*JMIR mHealth and uHealth*, 10\(1\):e34384, 2022\.
- Rao et al\. \[2025\]Qiyu Rao, Zdenka Babic, Scott C Douglas, and Danilo P Mandic\.Panorama: An enabling technology for Hearables\.In*Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing*, pages 1–5, 2025\.
- Krishnaveni et al\. \[2006\]V Krishnaveni, S Jayaraman, S Aravind, V Hariharasudhan, and K Ramadoss\.Automatic identification and removal of ocular artifacts from EEG using wavelet transform\.*Measurement Science Review*, 6\(4\):45–57, 2006\.
- Alyasseri et al\. \[2020\]Zaid Abdi Alkareem Alyasseri, Ahamad Tajudin Khader, Mohammed Azmi Al\-Betar, Ammar Kamal Abasi, and Sharif Naser Makhadmeh\.EEG signals denoising using optimal wavelet transform hybridized with efficient metaheuristic methods\.*IEEE access*, 8:10584–10605, 2020\.
- Huang et al\. \[1998\]Norden E Huang et al\.The empirical mode decomposition and the Hilbert spectrum for nonlinear and non\-stationary time series analysis\.In*Proceedings of the Royal Society of London\. Series A: Mathematical, Physical and Engineering Sciences*, pages 903–995, 1998\.
- Dragomiretskiy and Zosso \[2013\]Konstantin Dragomiretskiy and Dominique Zosso\.Variational mode decomposition\.*IEEE Transactions on Signal Processing*, 62\(3\):531–544, 2013\.
- Yang et al\. \[2016\]Banghua Yang, Kaiwen Duan, and Tao Zhang\.Removal of EOG artifacts from EEG using a cascade of sparse autoencoder and recursive least squares adaptive filter\.*Neurocomputing*, 214:1053–1060, 2016\.
- Sun et al\. \[2020\]Weitong Sun, Yuping Su, Xia Wu, and Xiaojun Wu\.A novel end\-to\-end 1D\-ResCNN model to remove artifact from EEG signals\.*Neurocomputing*, 404:108–121, 2020\.
- Jurczak et al\. \[2022\]Marcin Jurczak, Marcin Kołodziej, and Andrzej Majkowski\.Implementation of a convolutional neural network for eye blink artifacts removal from the electroencephalography signal\.*Frontiers in Neuroscience*, 16:782367, 2022\.
- Chuang et al\. \[2022\]Chun\-Hsiang Chuang, Kong\-Yi Chang, Chih\-Sheng Huang, and Tzyy\-Ping Jung\.IC\-U\-Net: a U\-Net\-based denoising autoencoder using mixtures of independent components for automatic EEG artifact removal\.*NeuroImage*, 263:119586, 2022\.
- Lehtinen et al\. \[2018\]Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila\.Noise2Noise: Learning image restoration without clean data\.In*Proceedings of the International Conference on Machine Learning*, pages 2965–2974, 2018\.
- Mansour and Heckel \[2023\]Youssef Mansour and Reinhard Heckel\.Zero\-shot Noise2Noise: Efficient image denoising without any data\.In*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pages 14018–14027, 2023\.
- Goverdovsky et al\. \[2017\]Valentin Goverdovsky et al\.Hearables: Multimodal physiological in\-ear sensing\.*Scientific Reports*, 7\(1\):6948, 2017\.
- Nakamura et al\. \[2017\]Takashi Nakamura, Valentin Goverdovsky, and Danilo P Mandic\.In\-ear EEG biometrics for feasible and readily collectable real\-world person authentication\.*IEEE Transactions on Information Forensics and Security*, 13\(3\):648–661, 2017\.
- Schulman et al\. \[2017\]John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov\.Proximal policy optimization algorithms\.*arXiv preprint arXiv:1707\.06347*, 2017\.
- Jamieson et al\. \[2014\]Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck\.lil’ UCB: An optimal exploration algorithm for multi\-armed bandits\.In*Proceedings of the 27th Annual Conference on Learning Theory*, volume 35, pages 423–439, 2014\.
- Huang et al\. \[2022\]Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G\.M\. Araújo\.CleanRL: High\-quality single\-file implementations of deep reinforcement learning algorithms\.*Journal of Machine Learning Research*, 23\(274\):1–18, 2022\.URL[http://jmlr\.org/papers/v23/21\-1342\.html](http://jmlr.org/papers/v23/21-1342.html)\.
- Houamed et al\. \[2020\]Ibtissem Houamed, Lamir Saidi, and Fawzi Srairi\.ECG signal denoising by fractional wavelet transform thresholding\.*Research on Biomedical Engineering*, 36\(3\):349–360, 2020\.
- Dora and Biswal \[2020\]Chinmayee Dora and Pradyut Kumar Biswal\.Correlation\-based ECG artifact correction from single channel EEG using modified variational mode decomposition\.*Computer Methods and Programs in Biomedicine*, 183:105092, 2020\.
- Colominas et al\. \[2014\]Marcelo A Colominas, Gastón Schlotthauer, and María E Torres\.Improved complete ensemble EMD: A suitable tool for biomedical signal processing\.*Biomedical Signal Processing and Control*, 14:19–29, 2014\.
- Ranjan et al\. \[2022\]Rakesh Ranjan, Bikash Chandra Sahana, and Ashish Kumar Bhandari\.Motion artifacts suppression from EEG signals using an adaptive signal denoising method\.*IEEE Transactions on Instrumentation and Measurement*, 71:1–10, 2022\.
- Kerechanin and Bobrov \[2022\]Yaroslav Kerechanin and Pavel Bobrov\.EEG denoising using wavelet packet decomposition and independent component analysis\.In*Proceedings of the IEEE Fourth International Conference Neurotechnologies and Neurointerfaces*, pages 68–70, 2022\.
- Chen et al\. \[2025\]Wensheng Chen, Cong Yu, Zhenhua Zhao, Nan Zheng, Han Li, and Yurong Li\.Self\-supervised EEG denoising via dual\-branch consistency learning with masked reconstruction\.*Knowledge\-Based Systems*, page 114703, 2025\.
- Guttag \[2010\]John Guttag\.CHB\-MIT Scalp EEG Database\.*PhysioNet*, June 2010\.doi:10\.13026/C2K01R\.URL[https://doi\.org/10\.13026/C2K01R](https://doi.org/10.13026/C2K01R)\.Version 1\.0\.0\.
- Shoeb \[2009\]Ali Hossam Shoeb\.*Application of machine learning to epileptic seizure onset detection and treatment*\.PhD thesis, Massachusetts Institute of Technology, 2009\.
- Rosiani et al\. \[2021\]Ulla Delfana Rosiani, Pramana Yoga Saputra, and Muhammad Afif Hendrawan\.The impact of segment length on EEG based biometric system\.In*Proceedings of the IEEE 7th Information Technology International Seminar*, pages 1–5, 2021\.
- Rantanen et al\. \[2016\]Ville Rantanen et al\.A survey on the feasibility of surface EMG in facial pacing\.In*Proceedings of the 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society*, pages 1688–1691, 2016\.
- Vallat and Walker \[2021\]Raphael Vallat and Matthew P Walker\.An open\-source, high\-performance tool for automated sleep staging\.*elife*, 10:e70092, 2021\.
- Lee et al\. \[2022\]Yun Ji Lee, Jae Yong Lee, Jae Hoon Cho, and Ji Ho Choi\.Interrater reliability of sleep stage scoring: a meta\-analysis\.*Journal of Clinical Sleep Medicine*, 18\(1\):193–202, 2022\.
- Urigüen and Garcia\-Zapirain \[2015\]Jose Antonio Urigüen and Begoña Garcia\-Zapirain\.EEG artifact removal—state\-of\-the\-art and guidelines\.*Journal of neural engineering*, 12\(3\):031001, 2015\.
- Grobbelaar et al\. \[2022\]Maximilian Grobbelaar et al\.A survey on denoising techniques of electroencephalogram signals using wavelet transform\.*Signals*, 3\(3\):577–586, 2022\.
- Kaur et al\. \[2021\]Chamandeep Kaur, Amandeep Bisht, Preeti Singh, and Garima Joshi\.EEG signal denoising using hybrid approach of variational mode decomposition and wavelets for depression\.*Biomedical Signal Processing and Control*, 65:102337, 2021\.
- Zhang et al\. \[2021\]Haoming Zhang, Chen Wei, Mingqi Zhao, Quanying Liu, and Haiyan Wu\.A novel convolutional neural network model to remove muscle artifacts from EEG\.In*Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing*, pages 1265–1269, 2021\.
- Kashyap et al\. \[2021\]Madhav Mahesh Kashyap, Anuj Tambwekar, Krishnamoorthy Manohara, and S\. Natarajan\.Speech Denoising Without Clean Training Data: A Noise2Noise Approach\.In*Interspeech 2021*, pages 2716–2720, 2021\.doi:10\.21437/Interspeech\.2021\-1130\.
- Krull et al\. \[2019\]Alexander Krull, Tim\-Oliver Buchholz, and Florian Jug\.Noise2Void\-learning denoising from single noisy images\.In*Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, pages 2129–2137, 2019\.
- Huang et al\. \[2021\]Tao Huang, Songjiang Li, Xu Jia, Huchuan Lu, and Jianzhuang Liu\.Neighbor2Neighbor: Self\-supervised denoising from single noisy images\.In*Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, pages 14781–14790, 2021\.
- Williams \[1992\]Ronald J Williams\.Simple statistical gradient\-following algorithms for connectionist reinforcement learning\.*Machine Learning*, 8\(3\-4\):229–256, 1992\.
### A Proof of [Theorem 1](https://arxiv.org/html/2605.06724#Thmtheorem1)
###### Theorem 1\.
Suppose $(\mathbf{s}^l, \mathbf{s}^r) \sim \pi_\psi(\cdot \mid \mathbf{s})$ are independent noisy realizations of the same underlying signal $\mathbf{x}$. Then the $\theta^\star$ that optimizes the self-supervised loss in [Eq. 1](https://arxiv.org/html/2605.06724#S2.E1) also minimizes the expected $L2$ distance between the network outputs and the ground-truth $\mathbf{x}$, i.e.,

$$\theta^\star = \operatorname*{argmin}_{\theta} \, \mathbb{E}\Big[ \big\| f_\theta(\mathbf{s}^l) - \mathbf{x} \big\|_2^2 + \big\| f_\theta(\mathbf{s}^r) - \mathbf{x} \big\|_2^2 \Big].$$
###### Proof\.
The independent noisy realization pairs $(\mathbf{s}^l, \mathbf{s}^r)$ can be written as $\mathbf{s}^l = \mathbf{x} + \mathbf{n}^l$ and $\mathbf{s}^r = \mathbf{x} + \mathbf{n}^r$, where $\mathbf{n}^l$ and $\mathbf{n}^r$ are zero-mean and mutually independent, and each is independent of $\mathbf{x}$. Expanding the first term of the self-supervised loss in [Eq. 1](https://arxiv.org/html/2605.06724#S2.E1) gives

$$\begin{aligned}
\mathbb{E}\left[ \| f_\theta(\mathbf{s}^l) - \mathbf{s}^r \|_2^2 \right]
&= \mathbb{E}\left[ \| f_\theta(\mathbf{s}^l) - \mathbf{x} - \mathbf{n}^r \|_2^2 \right] \\
&= \mathbb{E}\left[ \| f_\theta(\mathbf{s}^l) - \mathbf{x} \|_2^2 \right] - 2\, \mathbb{E}\left[ {\mathbf{n}^r}^{\top} \left( f_\theta(\mathbf{s}^l) - \mathbf{x} \right) \right] + \mathbb{E}\left[ \| \mathbf{n}^r \|_2^2 \right] \\
&= \mathbb{E}\left[ \| f_\theta(\mathbf{s}^l) - \mathbf{x} \|_2^2 \right] + \mathbb{E}\left[ \| \mathbf{n}^r \|_2^2 \right],
\end{aligned}$$

where the last equality follows from $\mathbb{E}\left[ {\mathbf{n}^r}^{\top} f_\theta(\mathbf{s}^l) \right] = 0$ and $\mathbb{E}\left[ {\mathbf{n}^r}^{\top} \mathbf{x} \right] = 0$, since $\mathbf{n}^r$ is zero-mean and independent of both $\mathbf{x}$ and $\mathbf{s}^l$. By the same argument with the roles of $\mathbf{s}^l$ and $\mathbf{s}^r$ swapped, the second term of [Eq. 1](https://arxiv.org/html/2605.06724#S2.E1) satisfies

$$\mathbb{E}\left[ \| f_\theta(\mathbf{s}^r) - \mathbf{s}^l \|_2^2 \right] = \mathbb{E}\left[ \| f_\theta(\mathbf{s}^r) - \mathbf{x} \|_2^2 \right] + \mathbb{E}\left[ \| \mathbf{n}^l \|_2^2 \right].$$

Summing the two terms and noting that $\mathbb{E}\left[ \| \mathbf{n}^l \|_2^2 \right] + \mathbb{E}\left[ \| \mathbf{n}^r \|_2^2 \right]$ does not depend on $\theta$, we obtain

$$\operatorname*{argmin}_{\theta} \mathbb{E}\left[ \| f_\theta(\mathbf{s}^l) - \mathbf{s}^r \|_2^2 + \| f_\theta(\mathbf{s}^r) - \mathbf{s}^l \|_2^2 \right] = \operatorname*{argmin}_{\theta} \mathbb{E}\left[ \| f_\theta(\mathbf{s}^l) - \mathbf{x} \|_2^2 + \| f_\theta(\mathbf{s}^r) - \mathbf{x} \|_2^2 \right],$$

which proves the theorem. ∎
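The theorem can be sanity-checked numerically in a toy setting. The sketch below assumes scalar signals, Gaussian signal and noise, and a one-parameter linear denoiser $f_\theta(s) = \theta s$; these choices, the sample size, and all variable names are illustrative assumptions and are unrelated to the network used in the paper. It compares the closed-form least-squares minimizers of the self-supervised and supervised losses, and checks that the two losses differ by roughly the total noise power at several values of $\theta$.

```python
import random

rng = random.Random(0)
N = 20000
sigma_n = 0.5 ** 0.5  # noise std, so each noise term has power 0.5

# Shared clean signal x and two independently corrupted realizations.
x = [rng.gauss(0.0, 1.0) for _ in range(N)]
s_l = [xi + rng.gauss(0.0, sigma_n) for xi in x]
s_r = [xi + rng.gauss(0.0, sigma_n) for xi in x]


def self_supervised_loss(theta):
    """Empirical version of the loss that uses only the noisy pair."""
    return sum((theta * a - b) ** 2 + (theta * b - a) ** 2
               for a, b in zip(s_l, s_r)) / N


def supervised_loss(theta):
    """Empirical version of the loss against the (normally unknown) clean x."""
    return sum((theta * a - xi) ** 2 + (theta * b - xi) ** 2
               for a, b, xi in zip(s_l, s_r, x)) / N


# Closed-form least-squares minimizers of the two quadratic losses.
denom = sum(a * a + b * b for a, b in zip(s_l, s_r))
theta_ss = 2.0 * sum(a * b for a, b in zip(s_l, s_r)) / denom
theta_sup = sum(xi * (a + b) for a, b, xi in zip(s_l, s_r, x)) / denom

# The gap should be close to E[n_l^2] + E[n_r^2] = 1.0 for any theta.
gaps = [self_supervised_loss(t) - supervised_loss(t) for t in (0.0, 0.5, 1.0)]
```

With unit signal power and noise power 0.5, both minimizers should land near $1/(1+0.5) = 2/3$, and they agree only up to sampling error because the cross terms vanish in expectation rather than exactly on a finite sample.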
### B Proof of [Proposition 1](https://arxiv.org/html/2605.06724#Thmproposition1)
###### Proposition 1\.
For any signal window $I^{(k)}$, the within-window partitioning strategy used in iPSD can always achieve a partition that is at least as good as the interleaved partition used in \[[15](https://arxiv.org/html/2605.06724#bib.bib15)\], in terms of the mismatch between the underlying clean components of the sub-signals $(\mathbf{s}_{I^{(k),l}}, \mathbf{s}_{I^{(k),r}})$.
###### Proof\.
The fixed interleaved partition used in \[[15](https://arxiv.org/html/2605.06724#bib.bib15)\] is one particular case in the iPSD search space, which contains all $\tbinom{W}{W/2}/2$ candidate within-window partitions of $I^{(k)}$. It follows immediately that iPSD can achieve a partition at least as good as the fixed interleaved partition, since the latter is always available as a candidate.
Moreover, the iPSD partition can be strictly better than interleaved downsampling\. Consider a clean signal window
$$\mathbf{x}_{I^{(k)}} = (1, -1, 1, -1).$$

Under interleaved partitioning, the resulting clean sub-signals are

$$\mathbf{x}_{I^{(k),l}} = (1, 1), \qquad \mathbf{x}_{I^{(k),r}} = (-1, -1),$$

or equivalently in reversed order. These two sub-signals clearly do not share the same underlying clean component. However, iPSD may choose the feasible within-window partition

$$I^{(k),l} = \{1, 2\}, \qquad I^{(k),r} = \{3, 4\},$$

which yields sub-signals with the same underlying signal:

$$\mathbf{x}_{I^{(k),l}} = (1, -1), \qquad \mathbf{x}_{I^{(k),r}} = (1, -1).$$

This and many other examples, such as those illustrated in [Fig. 1](https://arxiv.org/html/2605.06724#S2.F1), show that for signals with periodic structure, the flexibility of iPSD’s search space yields strictly better partitions than fixed interleaved downsampling. ∎
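The counterexample in the proof can be verified by brute force. The sketch below enumerates every balanced split of a window of length $W = 4$ and scores each by the squared distance between the two clean sub-signals. That distance is one plausible reading of "mismatch" and is an assumption made here, as are the function names. Indices are 0-based in the code, so the proof's partition $\{1,2\},\{3,4\}$ appears as `(0, 1)` and `(2, 3)`.

```python
from itertools import combinations
from math import comb


def balanced_partitions(w):
    """All unordered splits of indices 0..w-1 into two halves of size w/2."""
    indices = range(w)
    for left in combinations(indices, w // 2):
        if 0 in left:  # pin index 0 to the left half to skip mirror images
            right = tuple(i for i in indices if i not in left)
            yield left, right


def clean_mismatch(x, left, right):
    """Squared distance between the clean sub-signals of a partition."""
    return sum((x[i] - x[j]) ** 2 for i, j in zip(left, right))


x = (1, -1, 1, -1)
partitions = list(balanced_partitions(len(x)))
expected_count = comb(len(x), len(x) // 2) // 2  # the count stated in the proof

mismatch = {p: clean_mismatch(x, *p) for p in partitions}
interleaved = ((0, 2), (1, 3))
best = min(mismatch, key=mismatch.get)
```

Here the interleaved split has mismatch 8 while the contiguous split `((0, 1), (2, 3))` has mismatch 0, matching the argument above. Exhaustive enumeration is only feasible for very small windows, since the number of candidates grows combinatorially with $W$.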
### C Justification for the RL reward signal
In this section, we show that under reasonable assumptions, using the converged denoiser loss as RL reward drives $\pi_\psi$ toward $\mathbf{s}^l$ and $\mathbf{s}^r$ that are maximally informative of each other’s underlying clean signal. For conciseness, we present the $l \to r$ direction; the reverse direction follows symmetrically.

We follow the standard N2N assumption that the noise $\mathbf{n}$ is zero-mean and independent of the clean signal $\mathbf{x}$. In the iPSD setting, for the partitioned pair $(\mathbf{s}^l, \mathbf{s}^r)$ induced by $\pi_\psi$, we assume that the right-side noise $\mathbf{n}^r$ remains conditionally zero-mean given the left-side observation $\mathbf{s}^l$ and the right-side clean signal $\mathbf{x}^r$, i.e., $\mathbb{E}[\mathbf{n}^r \mid \mathbf{s}^l, \mathbf{x}^r] = \mathbf{0}$. We also assume that the noise power $\mathbb{E}[\|\mathbf{n}^r\|_2^2]$ does not depend on the partition policy $\pi_\psi$, which is reasonable since the two partitions have fixed cardinality and the noise statistics are approximately homogeneous across samples. For $(\mathbf{s}^l, \mathbf{s}^r) \sim \pi_\psi(\cdot \mid \mathbf{s})$, the N2N denoising objective can be rewritten as:
$$\begin{aligned}
\operatorname*{argmin}_{\theta} \mathbb{E}\left[ \| f_\theta(\mathbf{s}^l) - \mathbf{s}^r \|_2^2 \right]
&= \operatorname*{argmin}_{\theta} \mathbb{E}\left[ \left\| \left( f_\theta(\mathbf{s}^l) - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] \right) - \left( \mathbf{s}^r - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] \right) \right\|_2^2 \right] \\
&= \operatorname*{argmin}_{\theta} \Big\{ \mathbb{E}\left[ \left\| f_\theta(\mathbf{s}^l) - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] \right\|_2^2 \right] \\
&\qquad - 2\, \mathbb{E}\left[ \left( f_\theta(\mathbf{s}^l) - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] \right)^{\top} \left( \mathbf{s}^r - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] \right) \right] \\
&\qquad + \mathbb{E}\left[ \left\| \mathbf{s}^r - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] \right\|_2^2 \right] \Big\} \\
&\overset{(i)}{=} \operatorname*{argmin}_{\theta} \mathbb{E}\left[ \left\| f_\theta(\mathbf{s}^l) - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] \right\|_2^2 \right],
\end{aligned} \tag{7}$$

where $(i)$ follows because the third term does not depend on $\theta$ and can be dropped from the $\operatorname{argmin}$, and the cross term vanishes by applying the tower property $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid \mathbf{s}^l]]$:

$$\begin{aligned}
&\mathbb{E}\left[ \left( f_\theta(\mathbf{s}^l) - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] \right)^{\top} \left( \mathbf{s}^r - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] \right) \right] \\
&\qquad = \mathbb{E}\left[ \mathbb{E}\left[ \left( f_\theta(\mathbf{s}^l) - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] \right)^{\top} \left( \mathbf{s}^r - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] \right) \;\Big|\; \mathbf{s}^l \right] \right] \\
&\qquad \overset{(ii)}{=} \mathbb{E}\left[ \left( f_\theta(\mathbf{s}^l) - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] \right)^{\top} \mathbb{E}\left[ \mathbf{s}^r - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] \;\Big|\; \mathbf{s}^l \right] \right] \\
&\qquad = \mathbb{E}\left[ \left( f_\theta(\mathbf{s}^l) - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] \right)^{\top} \cdot \mathbf{0} \right] = 0,
\end{aligned}$$

where $(ii)$ holds because $f_\theta(\mathbf{s}^l) - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l]$ is a deterministic function of $\mathbf{s}^l$ that can be pulled out of the conditional expectation given $\mathbf{s}^l$. The remaining inner factor then vanishes:

$$\mathbb{E}\left[ \mathbf{s}^r - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] \;\Big|\; \mathbf{s}^l \right] = \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] - \mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l] = \mathbf{0},$$

where we used linearity of conditional expectation together with the fact that $\mathbb{E}[\mathbf{s}^r \mid \mathbf{s}^l]$ is itself a deterministic function of $\mathbf{s}^l$ and therefore equals its own conditional expectation given $\mathbf{s}^l$. Let $f_{\theta^\star}$ denote the optimal denoiser.
Assuming sufficient expressiveness of the denoising module $f_{\theta}$, [Eq. 7](https://arxiv.org/html/2605.06724#A0.E7) shows that

$$
f_{\theta^{\star}}(\mathbf{s}^{l})=\mathbb{E}\left[\mathbf{s}^{r}\mid\mathbf{s}^{l}\right].\tag{8}
$$
Let $\mathcal{L}^{\star}(\pi_{\psi})$ denote the converged N2N loss in the $l\to r$ direction for $(\mathbf{s}^{l},\mathbf{s}^{r})\sim\pi_{\psi}(\cdot\mid\mathbf{s})$. Following [Eq. 8](https://arxiv.org/html/2605.06724#A0.E8),

$$
\begin{aligned}
\mathcal{L}^{\star}(\pi_{\psi})&=\mathbb{E}\left[\|f_{\theta^{\star}}(\mathbf{s}^{l})-\mathbf{s}^{r}\|_{2}^{2}\right]\\
&=\mathbb{E}\left[\left\|\mathbf{s}^{r}-\mathbb{E}\left[\mathbf{s}^{r}\mid\mathbf{s}^{l}\right]\right\|_{2}^{2}\right]\\
&=\mathbb{E}\left[\left\|\mathbf{x}^{r}+\mathbf{n}^{r}-\mathbb{E}\left[\mathbf{x}^{r}\mid\mathbf{s}^{l}\right]-\mathbb{E}\left[\mathbf{n}^{r}\mid\mathbf{s}^{l}\right]\right\|_{2}^{2}\right]\\
&\stackrel{(iii)}{=}\mathbb{E}\left[\left\|\left(\mathbf{x}^{r}-\mathbb{E}\left[\mathbf{x}^{r}\mid\mathbf{s}^{l}\right]\right)+\mathbf{n}^{r}\right\|_{2}^{2}\right]\\
&\stackrel{(iv)}{=}\mathbb{E}\left[\left\|\mathbf{x}^{r}-\mathbb{E}\left[\mathbf{x}^{r}\mid\mathbf{s}^{l}\right]\right\|_{2}^{2}\right]+\mathbb{E}\left[\|\mathbf{n}^{r}\|_{2}^{2}\right],
\end{aligned}
\tag{9}
$$

where $(iii)$ uses $\mathbb{E}[\mathbf{n}^{r}\mid\mathbf{s}^{l}]=\mathbf{0}$, which follows from the tower property: $\mathbb{E}[\mathbf{n}^{r}\mid\mathbf{s}^{l}]=\mathbb{E}\big[\mathbb{E}[\mathbf{n}^{r}\mid\mathbf{s}^{l},\mathbf{x}^{r}]\mid\mathbf{s}^{l}\big]=\mathbf{0}$ by our assumption. Step $(iv)$ follows because the cross term vanishes. To see this, apply the tower property with conditioning on $(\mathbf{s}^{l},\mathbf{x}^{r})$:
$$
\begin{aligned}
\mathbb{E}\left[\left(\mathbf{x}^{r}-\mathbb{E}\left[\mathbf{x}^{r}\mid\mathbf{s}^{l}\right]\right)^{\top}\mathbf{n}^{r}\right]
&=\mathbb{E}\left[\mathbb{E}\left[\left(\mathbf{x}^{r}-\mathbb{E}\left[\mathbf{x}^{r}\mid\mathbf{s}^{l}\right]\right)^{\top}\mathbf{n}^{r}\;\Big|\;\mathbf{s}^{l},\mathbf{x}^{r}\right]\right]\\
&=\mathbb{E}\left[\left(\mathbf{x}^{r}-\mathbb{E}\left[\mathbf{x}^{r}\mid\mathbf{s}^{l}\right]\right)^{\top}\mathbb{E}\left[\mathbf{n}^{r}\mid\mathbf{s}^{l},\mathbf{x}^{r}\right]\right]\\
&=\mathbb{E}\left[\left(\mathbf{x}^{r}-\mathbb{E}\left[\mathbf{x}^{r}\mid\mathbf{s}^{l}\right]\right)^{\top}\cdot\mathbf{0}\right]=0,
\end{aligned}
$$

where the second equality follows because $\mathbf{x}^{r}-\mathbb{E}\left[\mathbf{x}^{r}\mid\mathbf{s}^{l}\right]$ is determined by $\mathbf{s}^{l}$ and $\mathbf{x}^{r}$, and therefore behaves as a constant under the conditional expectation given $(\mathbf{s}^{l},\mathbf{x}^{r})$; the third uses the assumption that $\mathbb{E}[\mathbf{n}^{r}\mid\mathbf{s}^{l},\mathbf{x}^{r}]=\mathbf{0}$.
Upon combining the two results, we obtain the following regarding the converged loss of the denoiser:
$$
\mathcal{L}^{\star}(\pi_{\psi})=\underbrace{\mathbb{E}\left[\left\|\mathbf{x}^{r}-\mathbb{E}\left[\mathbf{x}^{r}\mid\mathbf{s}^{l}\right]\right\|_{2}^{2}\right]}_{\text{depends on }\pi_{\psi}}+\underbrace{\mathbb{E}\left[\|\mathbf{n}^{r}\|_{2}^{2}\right]}_{\text{constant in }\pi_{\psi}}.\tag{10}
$$

Note that the first term depends on $\pi_{\psi}$ through the joint distribution of $(\mathbf{s}^{l},\mathbf{x}^{r})$ induced by the partition policy. By our assumption that $\mathbb{E}[\|\mathbf{n}^{r}\|_{2}^{2}]$ does not depend on $\pi_{\psi}$, the second term is constant with respect to $\psi$. Therefore, maximizing the RL reward $-\mathcal{L}^{\star}(\pi_{\psi})$ is equivalent to minimizing the first term of [Eq. 10](https://arxiv.org/html/2605.06724#A0.E10):
$$
\operatorname*{argmax}_{\psi}\left(-\mathcal{L}^{\star}(\pi_{\psi})\right)=\operatorname*{argmin}_{\psi}\mathbb{E}\left[\left\|\mathbf{x}^{r}-\mathbb{E}\left[\mathbf{x}^{r}\mid\mathbf{s}^{l}\right]\right\|_{2}^{2}\right],\tag{11}
$$

which is the minimum mean squared error (MMSE) of predicting the right-side clean signal $\mathbf{x}^{r}$ from the left-side observation $\mathbf{s}^{l}$. Hence, using the negative converged N2N loss as the RL reward drives $\pi_{\psi}$ toward partitions under which $\mathbf{s}^{l}$ is maximally predictive of $\mathbf{x}^{r}$ in the MMSE sense; by symmetry, $\mathbf{s}^{r}$ is maximally predictive of $\mathbf{x}^{l}$.
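The decomposition in Eq. 10 can be checked numerically on a toy model. The sketch below is not part of iPSD: it assumes a scalar standard-normal clean signal shared by both sides ($x^{l}=x^{r}=x$) and independent zero-mean Gaussian noise, so that the conditional mean $\mathbb{E}[s^{r}\mid s^{l}]$ has a closed form. The function name `converged_loss_terms` and all parameter values are hypothetical.

```python
import numpy as np

def converged_loss_terms(noise_var=0.5, n_samples=200_000, seed=0):
    """Monte Carlo estimates of the three quantities in Eq. 10 for a toy Gaussian model."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)                       # clean signal (toy assumption)
    n_l = np.sqrt(noise_var) * rng.standard_normal(n_samples)
    n_r = np.sqrt(noise_var) * rng.standard_normal(n_samples)
    s_l, s_r = x + n_l, x + n_r
    # For jointly Gaussian variables with Var(x) = 1:
    # E[s_r | s_l] = E[x | s_l] = s_l / (1 + noise_var)
    cond_mean = s_l / (1.0 + noise_var)
    loss = np.mean((s_r - cond_mean) ** 2)                   # converged N2N loss
    mmse = np.mean((x - cond_mean) ** 2)                     # first term of Eq. 10
    noise_power = np.mean(n_r ** 2)                          # second term of Eq. 10
    return loss, mmse, noise_power

loss, mmse, noise_power = converged_loss_terms()
print(f"loss={loss:.4f}  mmse + noise power={mmse + noise_power:.4f}")
```

Under these assumptions the two printed values should agree up to Monte Carlo error, and the MMSE term should be close to `noise_var / (1 + noise_var)`.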
### D Derivation of the policy gradient
We now derive the gradient of the expected denoising reward $\mathbb{E}_{(\mathbf{s}^{l},\mathbf{s}^{r})\sim\pi_{\psi}(\cdot\mid\mathbf{s})}\left[R(\mathbf{s}^{l},\mathbf{s}^{r})\right]$ with respect to the policy parameters $\psi$. Applying the log-derivative identity $\nabla_{x}f(x)=f(x)\,\nabla_{x}\log f(x)$ yields
$$
\begin{aligned}
\nabla_{\psi}\,\mathbb{E}_{(\mathbf{s}^{l},\mathbf{s}^{r})\sim\pi_{\psi}(\cdot\mid\mathbf{s})}\left[R(\mathbf{s}^{l},\mathbf{s}^{r})\right]
&=\nabla_{\psi}\sum_{(\mathbf{s}^{l},\mathbf{s}^{r})}\pi_{\psi}(\mathbf{s}^{l},\mathbf{s}^{r}\mid\mathbf{s})\,R(\mathbf{s}^{l},\mathbf{s}^{r})\\
&=\sum_{(\mathbf{s}^{l},\mathbf{s}^{r})}\pi_{\psi}(\mathbf{s}^{l},\mathbf{s}^{r}\mid\mathbf{s})\left[\nabla_{\psi}\log\pi_{\psi}(\mathbf{s}^{l},\mathbf{s}^{r}\mid\mathbf{s})\,R(\mathbf{s}^{l},\mathbf{s}^{r})\right]\\
&=\mathbb{E}_{(\mathbf{s}^{l},\mathbf{s}^{r})}\left[\nabla_{\psi}\log\pi_{\psi}(\mathbf{s}^{l},\mathbf{s}^{r}\mid\mathbf{s})\,R(\mathbf{s}^{l},\mathbf{s}^{r})\right].
\end{aligned}
\tag{12}
$$

[Equation 12](https://arxiv.org/html/2605.06724#A0.E12) is the standard REINFORCE estimator [[40](https://arxiv.org/html/2605.06724#bib.bib40)]: the intractable sum over all partitions is replaced by a Monte Carlo expectation, making the gradient estimable from sampled partitions $(\mathbf{s}^{l},\mathbf{s}^{r})\sim\pi_{\psi}(\cdot\mid\mathbf{s})$.
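To make Eq. 12 concrete, the following minimal sketch compares the exact gradient with its Monte Carlo (REINFORCE) estimate. It is an illustration under simplifying assumptions, not the iPSD policy: the policy is a softmax over a handful of candidate partitions with logits `psi`, and each partition has a fixed, known reward. All function names and numbers are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def exact_gradient(psi, rewards):
    """Gradient of sum_k pi_k R_k w.r.t. the logits: pi_j * (R_j - E[R])."""
    pi = softmax(psi)
    return pi * (rewards - pi @ rewards)

def reinforce_gradient(psi, rewards, n_samples, rng):
    """Monte Carlo estimate of Eq. 12 from sampled partition indices."""
    pi = softmax(psi)
    k = rng.choice(len(psi), size=n_samples, p=pi)
    # For a softmax policy, grad_psi log pi_k = onehot(k) - pi
    score = np.eye(len(psi))[k] - pi
    return (score * rewards[k, None]).mean(axis=0)

psi = np.array([0.2, -0.1, 0.5])
rewards = np.array([-1.0, -0.5, -2.0])   # toy negative losses
rng = np.random.default_rng(0)
print(exact_gradient(psi, rewards))
print(reinforce_gradient(psi, rewards, 200_000, rng))
```

With enough samples the two printed vectors should agree closely, which is the sense in which sampled partitions suffice to estimate the gradient.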
### E Detailed implementation of iPSD-Zero
Algorithm 2: iPSD-Zero

1. Noisy signal $\mathbf{s}$; hyper-parameters $(\alpha,\beta,\sigma,\epsilon,\delta)$; learning rate $\eta^{\theta}$
2. Initialization: pull each arm once to obtain $\hat{\mu}_{a}(0)$ for all $a$; set $T_{a}(0)\leftarrow 1$ for all $a$; let $t\leftarrow 0$
3. while $T_{a}(t)<1+\alpha\sum_{i\neq a}T_{i}(t)$ for all arms $a$ do
4. Compute $l_{a}(t)$ for all $a$ with [Eq. 13](https://arxiv.org/html/2605.06724#A0.E13)
5. Select arm $a_{t}=\arg\max_{a}l_{a}(t)$ and partition $\mathbf{s}$ into $(\mathbf{s}_{a_{t}}^{l},\mathbf{s}_{a_{t}}^{r})$
6. Initialize $f_{\theta}$; set $N=0$
7. while $\theta$ not converged do
8. $J_{n}^{d}=\bigl\|f_{\theta}(\mathbf{s}_{a_{t}}^{l})-\mathbf{s}_{a_{t}}^{r}\bigr\|_{2}^{2}+\bigl\|f_{\theta}(\mathbf{s}_{a_{t}}^{r})-\mathbf{s}_{a_{t}}^{l}\bigr\|_{2}^{2}$
9. $\theta\leftarrow\theta-\eta^{\theta}\nabla_{\theta}J_{n}^{d}$; $N=N+1$
10. end while
11. Reward $r_{t}=R(\mathbf{s}_{a_{t}}^{l},\mathbf{s}_{a_{t}}^{r})=-\tfrac{1}{10}\sum_{n=N-9}^{N}J_{n}^{d}$
12. $\hat{\mu}_{a}(t+1)\leftarrow\begin{cases}\dfrac{T_{a}(t)\,\hat{\mu}_{a}(t)+r_{t}}{T_{a}(t)+1},&a=a_{t}\\[6pt]\hat{\mu}_{a}(t),&a\neq a_{t}\end{cases}$
13. $T_{a}(t+1)\leftarrow\begin{cases}T_{a}(t)+1,&a=a_{t}\\T_{a}(t),&a\neq a_{t}\end{cases}$
14. $t\leftarrow t+1$
15. end while
16. $a^{\star}\leftarrow\arg\max_{a}T_{a}(t)$
17. return denoised signal $\hat{\mathbf{x}}=f_{\theta^{\star}}(\mathbf{s})$, where $\theta^{\star}$ is the network trained on partition $(\mathbf{s}_{a^{\star}}^{l},\mathbf{s}_{a^{\star}}^{r})$

In this section, we provide details of iPSD-Zero. Let arm $a$ denote the partition that corresponds to $(\mathbf{s}_{a}^{l},\mathbf{s}_{a}^{r})$. Pulling $a$ yields a reward $R(\mathbf{s}_{a}^{l},\mathbf{s}_{a}^{r})$, defined as in [Eq. 4](https://arxiv.org/html/2605.06724#S2.E4) as the negated loss of the denoiser $f_{\theta}$ at convergence. Due to the stochastic nature of training neural networks, $R(\mathbf{s}_{a}^{l},\mathbf{s}_{a}^{r})$ is also stochastic for each arm. To balance the exploration of less frequently sampled partitions with the exploitation of those that have already demonstrated strong denoising performance, the lil’ UCB index for arm $a$ at round $t$ is defined as
$$
l_{a}(t)=\hat{\mu}_{a}(t)+(1+\beta)(1+\sqrt{\epsilon})\sqrt{\frac{2\sigma^{2}(1+\epsilon)\log\left(\frac{\log\left((1+\epsilon)T_{a}(t)\right)}{\delta}\right)}{T_{a}(t)}},\tag{13}
$$

where $\hat{\mu}_{a}(t)$ denotes the empirical mean reward of arm $a$ up to round $t$, $T_{a}(t)$ is the number of times arm $a$ has been pulled up to round $t$, and $\sigma^{2}$ is the sub-Gaussian variance proxy of the reward distribution. The confidence parameter $\delta\in(0,1)$ and the exploration parameters $\beta,\epsilon>0$ are user-specified hyper-parameters that jointly control the tightness of the confidence interval. The first term in [Eq. 13](https://arxiv.org/html/2605.06724#A0.E13) promotes the exploitation of arms with higher observed rewards, while the second term promotes the exploration of other arms with decreasing intensity as $T_{a}(t)$ increases. [Algorithm 2](https://arxiv.org/html/2605.06724#alg2) summarizes our adaptation of lil’ UCB to the denoising process in the zero-shot setting.
In iPSD-Zero, hyper-parameters are set to $\alpha=[(2+\beta)/\beta]^{2}$, $\beta=1$ (so that $\alpha=9$), $\sigma=0.008$, $\epsilon=0.01$, and $\delta=0.55\times 10^{-4}$. The iPSD-Zero method is initialized by pulling each arm once. We then iteratively apply lil’ UCB to select the next arm. This results in increasingly confident estimates of partition quality for each arm, reflected by the average reward. At convergence, the algorithm identifies the optimal partition, $(\mathbf{s}_{a^{\star}}^{l},\mathbf{s}_{a^{\star}}^{r})$, which is then used to train the final denoiser $f_{\theta^{\star}}$. The design of iPSD-Zero thus transforms the zero-shot denoising problem into a structured exploration-exploitation task. By leveraging lil’ UCB, we achieve principled and sample-efficient identification of the partition that yields the most reliable self-supervision.
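The index in Eq. 13 and the outer loop of Algorithm 2 can be sketched as follows. This is a simplified illustration, not the released implementation: the callback `pull(a)` stands in for training $f_{\theta}$ on partition $a$ and returning the negated converged loss, and here it is replaced by a toy Gaussian reward. The names `lil_ucb_index`, `run_lil_ucb`, `make_toy_arm`, the `max_rounds` safeguard, and the toy arm means are all assumptions made for the example.

```python
import numpy as np

def lil_ucb_index(mu_hat, pulls, beta=1.0, sigma=0.008, eps=0.01, delta=0.55e-4):
    """Eq. 13: empirical mean reward plus the lil' UCB exploration bonus."""
    inner = np.log(np.log((1.0 + eps) * pulls) / delta)
    bonus = (1.0 + beta) * (1.0 + np.sqrt(eps)) * np.sqrt(
        2.0 * sigma**2 * (1.0 + eps) * inner / pulls)
    return mu_hat + bonus

def run_lil_ucb(pull, n_arms, beta=1.0, max_rounds=100_000, **index_kwargs):
    """Outer loop of Algorithm 2 with a generic reward callback pull(a)."""
    alpha = ((2.0 + beta) / beta) ** 2
    mu_hat = np.array([pull(a) for a in range(n_arms)], dtype=float)  # pull each arm once
    pulls = np.ones(n_arms)
    rounds = 0
    # Continue while every arm satisfies T_a < 1 + alpha * sum_{i != a} T_i
    while np.all(pulls < 1.0 + alpha * (pulls.sum() - pulls)) and rounds < max_rounds:
        a = int(np.argmax(lil_ucb_index(mu_hat, pulls, beta=beta, **index_kwargs)))
        r = pull(a)
        mu_hat[a] = (pulls[a] * mu_hat[a] + r) / (pulls[a] + 1.0)    # running mean update
        pulls[a] += 1.0
        rounds += 1
    return int(np.argmax(pulls)), pulls, rounds

def make_toy_arm(means, noise_std, seed=0):
    """Toy stand-in for denoiser training: Gaussian reward around a fixed mean."""
    rng = np.random.default_rng(seed)
    return lambda a: means[a] + noise_std * rng.standard_normal()

pull = make_toy_arm([-0.05, -0.01, -0.08], noise_std=0.008)
best, pulls, rounds = run_lil_ucb(pull, n_arms=3)
print(best, pulls, rounds)
```

In this toy setting the loop stops once one arm has been pulled far more often than all others combined, and that arm is returned as $a^{\star}$.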
Table 3: Ablation study on localized window lengths.
### F Notes on iPSD implementation
iPSD is implemented in Python 3.12.5 and tested on an Ubuntu 22.04 machine with a 13th Gen Intel Core i7-13850HX CPU, an Nvidia RTX 3500 Ada GPU (12 GB VRAM), and 32 GB of RAM. First, we perform an ablation on the length of the localized window ([Table 3](https://arxiv.org/html/2605.06724#A0.T3)). Since the window length must divide the signal length $L=2560$ and be even (so that $\mathbf{s}^{l}$ and $\mathbf{s}^{r}$ have equal length), we ablate over window lengths of 4, 8, and 10 samples. Longer windows (e.g., the next feasible length, 16 samples) produce too many candidate partitions, inflating the parameter count of $\pi_{\psi}$ to an intractable size. Results show that a window length of 8 samples is optimal, providing sufficient flexibility while keeping the search space small enough for efficient policy training.
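The feasibility constraint on the window length (even, and a divisor of $L=2560$) can be enumerated directly. The helper below is a hypothetical illustration, not part of the iPSD code.

```python
def feasible_window_lengths(signal_length, max_length):
    """Even window lengths up to max_length that divide signal_length."""
    return [w for w in range(2, max_length + 1, 2) if signal_length % w == 0]

print(feasible_window_lengths(2560, 16))  # [2, 4, 8, 10, 16]
```

This confirms that 16 is the next feasible length after 10. A length of 2 also satisfies both constraints, although it is not among the ablated values.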
Table 4: Denoising module Architectures

- FC = Fully Connected; Conv-Enc/Dec = 1D Convolutional encoder/decoder block (2 conv layers each).
- FC and Conv layers use the ReLU activation; Batch normalization is applied in all Conv layers.
- iPSD-RNN and iPSD-GRU are bidirectional.

Second, before settling on the architecture of the partitioning model $\pi_{\psi}$, we compared four candidates: (i) iPSD-MLP, a window-independent baseline without temporal context; (ii) iPSD-U-Net, a multi-scale fully convolutional architecture; (iii) iPSD-RNN, a recurrent neural network capturing sequential context across windows; and (iv) iPSD-GRU, a gated RNN variant with selective retention of long-range context. [Table 4](https://arxiv.org/html/2605.06724#A0.T4) summarizes the detailed architectures of the four networks. All four variants are trained using the same PPO hyperparameter setting.
[Table 5](https://arxiv.org/html/2605.06724#A0.T5) compares the denoising performance of all the iPSD variants. Under WGN contamination, the iPSD-GRU achieved the best overall performance, benefiting from its gating mechanisms, which effectively capture sequential context while mitigating vanishing/exploding gradients. Notably, the other sequential model, iPSD-RNN, performed worse than iPSD-MLP, which does not use any sequential context. This indicates that while sequential context is indeed important for partitioning $\mathbf{s}$, long sequences must be handled appropriately, which GRU achieves successfully. The iPSD-U-Net model performed less favourably, which may be attributed to its relatively limited receptive field when applied to long EEG signals. Based on these results, we adopt GRU as the architecture for the partitioning model.
Table 5: Denoising Performance of different denoising module architectures.

Table 6: Baselines Used in Quantitative Comparisons
### G Baselines
[Table 6](https://arxiv.org/html/2605.06724#A0.T6) provides the details of the baselines used in our quantitative evaluation. These baselines span wavelet-based methods (FrWT [[21](https://arxiv.org/html/2605.06724#bib.bib21)], Optimal-WT [[7](https://arxiv.org/html/2605.06724#bib.bib7)]), mode decomposition methods (mVMD [[22](https://arxiv.org/html/2605.06724#bib.bib22)], CEEMDAN [[23](https://arxiv.org/html/2605.06724#bib.bib23)], EMD-LoG [[24](https://arxiv.org/html/2605.06724#bib.bib24)]), hybrid methods (WPT-ICA [[25](https://arxiv.org/html/2605.06724#bib.bib25)]), and self-supervised DL methods ([Chen et al.](https://arxiv.org/html/2605.06724#bib.bib26) [[26](https://arxiv.org/html/2605.06724#bib.bib26)]), covering the dominant paradigms in EEG artifact removal. Hyperparameters follow each method’s original specification unless stated otherwise. For CEEMDAN, we use the PyEMD implementation with default parameters and discard the first 2 intrinsic mode functions. For [Chen et al.](https://arxiv.org/html/2605.06724#bib.bib26), we retrain the model on the same training set that iPSD uses, with the authors’ default hyperparameters. All other baselines are implemented using the authors’ released code where available, and are otherwise re-implemented from the published descriptions.
### H Evaluation metrics
##### Signal\-to\-Noise Ratio \(SNR\)
SNR is defined as the ratio of the clean signal power to the residual noise power after denoising\. A higher SNR indicates better noise suppression while retaining the power of the original signal\. The SNR \(in dB\) is computed as
$$
\mathrm{SNR}=10\log_{10}\left(\frac{\|\mathbf{x}\|_{2}^{2}}{\|\mathbf{x}-\hat{\mathbf{x}}\|_{2}^{2}}\right),\tag{14}
$$

where $\mathbf{x}$ denotes the clean reference signal and $\hat{\mathbf{x}}$ the denoising output. This metric directly measures reconstruction accuracy in the time domain.
##### Peak Signal\-to\-Noise Ratio \(PSNR\)
PSNR is defined as the ratio of the squared peak amplitude of the clean signal to the average power of the residual noise after denoising. A higher PSNR indicates better noise suppression while retaining extreme values of the original signal, which often carry clinically relevant information in EEG (e.g., spikes or sharp waves). The PSNR (in dB) is defined as
$$
\mathrm{PSNR}=10\log_{10}\left(\frac{x_{\max}^{2}}{\|\mathbf{x}-\hat{\mathbf{x}}\|_{2}^{2}/L}\right),\tag{15}
$$

where $L$ is the signal length and $x_{\max}$ is the maximum amplitude of the clean signal. Compared to SNR, which focuses on average power, PSNR is more sensitive to extreme values in the signal.
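Eqs. 14 and 15 translate directly into a few lines of NumPy. The sketch below is illustrative only: the function names are hypothetical, and $x_{\max}$ is taken to be the largest absolute value of the clean signal, which is an assumption about how "maximum amplitude" is measured.

```python
import numpy as np

def snr_db(clean, denoised):
    """Eq. 14: clean signal power over residual power, in dB."""
    clean = np.asarray(clean, dtype=float)
    denoised = np.asarray(denoised, dtype=float)
    return 10.0 * np.log10(np.sum(clean**2) / np.sum((clean - denoised) ** 2))

def psnr_db(clean, denoised):
    """Eq. 15: squared peak amplitude over mean squared residual, in dB."""
    clean = np.asarray(clean, dtype=float)
    denoised = np.asarray(denoised, dtype=float)
    mse = np.mean((clean - denoised) ** 2)       # ||x - x_hat||^2 / L
    peak = np.max(np.abs(clean))                 # assumption: x_max = max |x|
    return 10.0 * np.log10(peak**2 / mse)

x = np.array([1.0, -1.0, 1.0, -1.0])
x_hat = 0.9 * x                                  # 10% amplitude error
print(snr_db(x, x_hat), psnr_db(x, x_hat))       # both approximately 20 dB
```

In this example the residual is one tenth of the signal amplitude, so the power ratio is 100 and both metrics evaluate to about 20 dB.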
##### Mean\-Square Error of the Spectrum \(Spectral MSE\)
Since the clinical interpretation of EEG signals often relies on frequency-domain information (e.g., power in the $\delta$, $\theta$, $\alpha$, $\beta$, and $\gamma$ bands) [[31](https://arxiv.org/html/2605.06724#bib.bib31)], we also measure reconstruction accuracy in the frequency domain. Spectral MSE is defined as the mean squared error between the power spectral densities (PSDs) of the clean and denoised signals. The PSD of a signal $\mathbf{x}$ (in dB) is defined as
$$
\mathrm{spec}_{\mathbf{x}}(f)=10\log_{10}\left(\sum_{m=-\infty}^{\infty}R_{\mathbf{x}\mathbf{x}}[m]\,e^{-j2\pi fm}\right),\tag{16}
$$

where $R_{\mathbf{x}\mathbf{x}}[m]=\mathbb{E}\left[x_{i}\cdot x_{i+m}\right]$ denotes the autocorrelation of $\mathbf{x}$, and $f$ denotes the frequency. In practice, we estimate the PSD of finite-length signals using Welch’s method, which returns discrete frequency bins $\{f_{k}\}_{k=1}^{N_{f}}$. The Spectral MSE is then computed as
$$
\mathrm{Spectral\ MSE}=\frac{1}{N_{f}}\sum_{k=1}^{N_{f}}\bigl(\mathrm{spec}_{\mathbf{x}}(f_{k})-\mathrm{spec}_{\hat{\mathbf{x}}}(f_{k})\bigr)^{2}.\tag{17}
$$

This metric assesses how well the denoising algorithm removes noise while preserving the spectral distribution of the EEG signal, which is critical for downstream EEG analysis.
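A minimal sketch of Eq. 17 is given below. It uses a simplified NumPy-only Welch-style estimator (Hann window, 50% overlap, averaged periodograms) rather than any particular library routine, and the sampling rate `fs`, segment length `nperseg`, and the small `floor` added before the logarithm are assumptions, not values taken from the experiments.

```python
import numpy as np

def welch_psd(x, fs, nperseg=256):
    """Simplified Welch estimate: Hann-windowed, 50%-overlapping, averaged periodograms."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(nperseg)
    step = nperseg // 2
    starts = range(0, len(x) - nperseg + 1, step)
    segments = np.stack([x[i:i + nperseg] * window for i in starts])
    periodograms = np.abs(np.fft.rfft(segments, axis=1)) ** 2
    psd = periodograms.mean(axis=0) / (fs * np.sum(window**2))
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    return freqs, psd

def spectral_mse(clean, denoised, fs, nperseg=256, floor=1e-20):
    """Eq. 17: mean squared difference between the two PSDs expressed in dB."""
    _, p_clean = welch_psd(clean, fs, nperseg)
    _, p_denoised = welch_psd(denoised, fs, nperseg)
    spec_clean = 10.0 * np.log10(p_clean + floor)       # floor avoids log10(0)
    spec_denoised = 10.0 * np.log10(p_denoised + floor)
    return float(np.mean((spec_clean - spec_denoised) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal(2560)            # toy signal with the segment length L = 2560
print(spectral_mse(x, x, fs=256.0))      # identical signals give 0.0
print(spectral_mse(x, 2.0 * x, fs=256.0))
```

Because the comparison is made in dB, a pure gain error shows up as a constant offset in every bin: doubling the amplitude shifts each bin by $10\log_{10}4\approx 6.02$ dB, so the second value is close to $6.02^{2}$.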
### I Denoising performance statistics
In [Table 7](https://arxiv.org/html/2605.06724#A0.T7), we report the standard deviation across all EEG segments for the comparison in [Table 1](https://arxiv.org/html/2605.06724#S3.T1) of [Section 3.1](https://arxiv.org/html/2605.06724#S3.SS1). We observe that both iPSD and iPSD-Zero significantly outperform baselines on average, while remaining highly consistent across segments, demonstrating the robustness of the proposed framework. Similarly, [Table 8](https://arxiv.org/html/2605.06724#A0.T8) provides the standard deviation across all EEG segments for the ablation on partitioning strategy in [Table 2](https://arxiv.org/html/2605.06724#S3.T2) of [Section 3.3](https://arxiv.org/html/2605.06724#S3.SS3), which shows that the intelligent partitioning is consistently effective across EEG segments.
Table 7: Denoising Performance. Values are reported as mean $\pm$ standard deviation.

Table 8: Ablation on the iPSD partitioning strategy. Values are reported as mean $\pm$ standard deviation.
### J Visual comparisons between iPSD and Optimal-WT
[Figure 6](https://arxiv.org/html/2605.06724#A0.F6) uses example synthetic EEG segments to illustrate the denoising performance of both iPSD and iPSD-Zero. For reference, we also include the strongest baseline, Optimal-WT. The input signals are corrupted with synthetically added WGN or EMG artifacts, with input SNR levels set to either $-5$ dB or $0$ dB. For each input signal, we show the noisy input EEG (orange), the clean ground truth EEG (green), the Optimal-WT output (gray), the iPSD output (pink), and the iPSD-Zero output (blue). The lower right panel for each signal displays the PSDs of the original and denoised signals, allowing visual comparison of all the methods in the frequency domain.


Figure 6: Visualizing the denoising of example EEG segments contaminated by WGN and EMG artifacts at input SNRs of $-5$ dB and $0$ dB. For each example, the 6 panels show: the noisy input EEG (orange), the clean ground truth EEG (green), the Optimal-WT output (gray), the iPSD output (pink), the iPSD-Zero output (blue), and the corresponding PSDs (lower-right panels).

Under WGN contamination, although the Optimal-WT reconstruction appears visually smooth in the time domain, it incurs substantial spectral distortion, sharply attenuating components above 10 Hz and introducing spurious dips at a few frequencies, resulting in the loss of physiologically meaningful neural activity. In contrast, both iPSD and iPSD-Zero faithfully recover the underlying clean EEG, preserving the temporal morphology (e.g., the ridges and spikes) and the spectral characteristics. This highlights the advantage of iPSD’s expressive DL denoiser, which captures the underlying signal structure for effective denoising, over Optimal-WT’s fixed decomposition strategy, which corrupts the original EEG content.
Under EMG contamination, the advantage of iPSD becomes even more pronounced\. At both input SNR levels, the iPSD and iPSD\-Zero outputs closely track the ground truth in the time domain, with EMG\-induced artifacts substantially suppressed\. The corresponding PSDs further confirm that iPSD selectively attenuates broadband EMG components while recovering the underlying EEG spectrum\. In contrast, Optimal\-WT fails to recover the clean signal, introducing spurious spikes in the time domain that are absent from the ground truth and causing noticeable distortion in the frequency domain\. Overall, iPSD\-Zero matches the performance of iPSD, making it well\-suited for settings that demand immediate deployment without prior training\.