Building The Ph(ysical)AI Layer Of Machine Intelligence

arXiv cs.LG Papers

Summary

Researchers at MIT Lincoln Laboratory propose 'principle-driven foundation models' that encode signal-theoretic physical principles (Fourier decomposition, energy conservation, symmetry) instead of learning statistical correlations from large paired datasets. Trained exclusively on RF data, their 1.99M parameter frozen encoder achieves 77.7% average accuracy across 15 diverse tasks spanning audio, images, text, and video without any fine-tuning on target domains.

arXiv:2606.04106v1 Announce Type: new Abstract: Foundation models achieve generalization through massive-scale training on diverse data, but have limitations with transfer to truly unseen domains without paired training data. We propose principle-driven foundation models that encode signal-theoretic principles (Fourier decomposition, energy conservation, symmetry) rather than learn untethered statistical correlations. We hypothesize that domains differ not in fundamental physics, but in learnable transformations in time, frequency, magnitude, or phase. Training exclusively on radio-frequency (RF) data with co-designed architecture and losses incorporating these principles, we achieve cross-modal transfer to audio, images, text, and video using only frozen representations learned from RF data, requiring no fine-tuning of the encoder on target domains. Our 1.99M parameter frozen encoder achieves 77.7% average accuracy (91.9% top-3) across 15 diverse tasks via linear probing, with systematic variation: 84.5% on physically-grounded tasks (speaker recognition, seismology, RF fingerprinting) versus 70.0% on semantic tasks (music genre, language recognition). This reveals that principle-driven and scale-driven approaches offer complementary paths: physical principles enable efficient cross-modal transfer while naturally establishing the boundary between physical and semantic understanding.

Cached at: 06/05/26, 02:20 AM

# Building The Ph(ysical)AI Layer Of Machine Intelligence
Source: [https://arxiv.org/html/2606.04106](https://arxiv.org/html/2606.04106)
Ulbert J\. Botero \(Joey\.Botero@ll\.mit\.edu\), Liam Smith \(Liam\.Smih@ll\.mit\.edu\), Brooks Olney \(Brooks\.Olney@ll\.mit\.edu\), Pooya Khorrami \(Pooya\.Khorrami@ll\.mit\.edu\), Steven Kusiak \(Steven\.Kusiak@ll\.mit\.edu\), Watson Jia \(Watson\.Jia@ll\.mit\.edu\), Sage Trudeau \(Sage\.Trudeau@ll\.mit\.edu\), Daniel Capecci \(Daniel\.Capecci@ll\.mit\.edu\)\. All authors: MIT Lincoln Laboratory\.

###### Abstract

Foundation models achieve generalization through massive\-scale training on diverse data, but have limitations with transfer to truly unseen domains without paired training data\. We propose principle\-driven foundation models that encode signal\-theoretic principles \(Fourier decomposition, energy conservation, symmetry\) rather than learn untethered statistical correlations\. We hypothesize that domains differ not in fundamental physics, but in learnable transformations in time, frequency, magnitude, or phase\. Training exclusively on radio\-frequency \(RF\) data with co\-designed architecture and losses incorporating these principles, we achieve cross\-modal transfer to audio, images, text, and video using only frozen representations learned from RF data, requiring no fine\-tuning of the encoder on target domains\. Our 1\.99M parameter frozen encoder achieves 77\.7% average accuracy \(91\.9% top\-3\) across 15 diverse tasks via linear probing, with systematic variation: 84\.5% on physically\-grounded tasks \(speaker recognition, seismology, RF fingerprinting\) versus 70\.0% on semantic tasks \(music genre, language recognition\)\. This reveals that principle\-driven and scale\-driven approaches offer complementary paths: physical principles enable efficient cross\-modal transfer while naturally establishing the boundary between physical and semantic understanding\.

## 1 Introduction

Foundation models achieve generalization through massive\-scale self\-supervised learning on diverse multi\-modal data [radford2021learning](https://arxiv.org/html/2606.04106#bib.bib1); [girdhar2023imagebind](https://arxiv.org/html/2606.04106#bib.bib2)\. This scale\-driven paradigm assumes that generalization emerges from learning correlations across domains via large amounts of paired training data\. While effective within their training distribution, these models face challenges with cross\-modal transfer to domains absent from training data [fang2022data](https://arxiv.org/html/2606.04106#bib.bib3); [miller2021accuracy](https://arxiv.org/html/2606.04106#bib.bib4); [taori2020measuring](https://arxiv.org/html/2606.04106#bib.bib5)\.

We propose a complementary approach: principle\-driven foundation models that encode physical laws rather than learn statistical correlations\. All structured data \(temporal sequences, spatial images, graph networks\) can be viewed as signals defined over some domain\. Fourier’s theorem generalizes beyond classical time\-frequency analysis: graph Fourier transforms decompose signals on graphs [shuman2013emerging](https://arxiv.org/html/2606.04106#bib.bib6), spherical harmonics decompose signals on spheres, and wavelets decompose signals across scales\. By grounding models in these general signal\-theoretic principles, we hypothesize that cross\-modal generalization is possible\. We demonstrate a promising step in this direction by achieving cross\-modal transfer from a single training domain to diverse unseen domains without any encoder fine\-tuning\.

Our approach builds on two principles: \(1\) Fourier decomposition, in which signals are decomposed into frequency components regardless of the domain, and \(2\) symmetry learning, in which representations should transform predictably under operations \(translations, rotations, scalings\)\. We hypothesize that what distinguishes domains is not fundamentally different physics, but learnable transformations in time, frequency, magnitude, or phase\. If true, then learning transformation rules on one signal\-rich domain should enable transfer to any domain respecting the same mathematical structure\. We test this by training exclusively on radio\-frequency \(RF\) data, which is contextually distant from our evaluation targets \(images, speech, seismology, text\) yet exceptionally rich in signal diversity, and then evaluating cross\-modal transfer via linear probing on frozen representations\.

The frozen encoder probing achieves 84\.5% average accuracy on physically\-grounded tasks \(instrument classification, speaker recognition, RF fingerprinting, modulation classification, seismic event detection\) and 70\.0% on semantic tasks \(music genre, language recognition, clothing classification\) across time series \(seismology, RF, speech\), image \(MNIST, FashionMNIST\), and video\. This systematic pattern supports our hypothesis: signal\-theoretic learning captures physical structure but semantic content requires further abstraction\. Critically, this is not a limitation but a feature—it reveals the boundary of our approach and suggests a hierarchical architecture for AI: physical foundation models provide the base layer upon which semantic reasoning can be built\.

We emphasize that our goal differs from typical supervised learning benchmarks\. We do not ask ’what is the best possible performance on task X?’ but rather ’can training on RF alone enable meaningful transfer to task X?’ Our contribution is demonstrating that physics\-driven transfer is possible and competitive \(within 3\.2% of CLIP on physical tasks with 76× fewer parameters\), offering a resource\-efficient path that complements scale\-driven approaches\.

Our contributions are:

1. We demonstrate that training on a single signal\-rich domain \(RF\) enables zero\-shot cross\-modal generalization, achieving strong performance \(91\.9% top\-3 average accuracy\) without encoder fine\-tuning, establishing principle\-driven design as a viable path alongside scale\-driven approaches, particularly effective for physically\-grounded tasks\.
2. We introduce PlanFormer with co\-designed architecture \(Parseval Focus, frequency\-preserving pooling\) and symmetry losses \(IsoFICReg, LED\) that embed signal\-theoretic principles\.
3. We establish the boundary between physical and semantic understanding through systematic evaluation, showing that signal\-theoretic learning captures physical structure \(84\.5% top\-1 average accuracy\) but not semantic content \(70\.0% top\-1 average accuracy\), revealing what can and cannot be learned from signal processing principles alone\.

## 2 Related Work

Multi\-Modal Foundation Models: Multi\-modal foundation models [radford2021learning](https://arxiv.org/html/2606.04106#bib.bib1); [girdhar2023imagebind](https://arxiv.org/html/2606.04106#bib.bib2) achieve generalization through large\-scale training on diverse paired data, learning semantic correlations within their training distribution\. In contrast, we train on a single domain \(RF\) and transfer to completely unseen domains, testing whether physical principles alone suffice for cross\-modal generalization\.

Physics\-Informed ML: Physics\-Informed Neural Networks [raissi2019physicsinformed](https://arxiv.org/html/2606.04106#bib.bib7) and Fourier Neural Operators [li2021fourier](https://arxiv.org/html/2606.04106#bib.bib8) incorporate physical constraints for domain\-specific problems\. We extend this to foundation models: encoding general signal\-theoretic principles \(Fourier decomposition, energy conservation, symmetry\) that enable cross\-modal transfer rather than domain\-specific solutions\.

Symmetry and Equivariance: Group equivariant networks [cohen2016group](https://arxiv.org/html/2606.04106#bib.bib9); [weiler2019general](https://arxiv.org/html/2606.04106#bib.bib10) hard\-code symmetries for specific domains\. Others learn equivariant symmetries through targeted data augmentations and SSL criteria [yu2025selfsupervisedtransformationlearningequivariant](https://arxiv.org/html/2606.04106#bib.bib11); [garrido2023self](https://arxiv.org/html/2606.04106#bib.bib12)\. We build on the latter by learning symmetries through explicit equivariance objectives \(LED\) targeting fundamental symmetries anchored by Fourier analysis\.

Self\-Supervised Learning \(SSL\): SSL methods learn representations through masked prediction [he2022masked](https://arxiv.org/html/2606.04106#bib.bib13), contrastive learning [pmlr\-v119\-chen20j](https://arxiv.org/html/2606.04106#bib.bib14), or redundancy reduction [bardes2021vicreg](https://arxiv.org/html/2606.04106#bib.bib15)\. We extend VICReg with focal reweighting and coherent integration [richards2005fundamentals](https://arxiv.org/html/2606.04106#bib.bib16) for noise robustness, and complement invariance with explicit equivariance objectives \(LED\) that learn how representations should change predictably under transformations\.

Signal Processing in Deep Learning: Recent work integrates signal processing: FNet [lee\-thorp\-etal\-2022\-fnet](https://arxiv.org/html/2606.04106#bib.bib17) uses Fourier transforms, and time\-frequency joint embeddings [zhang2022self](https://arxiv.org/html/2606.04106#bib.bib18) process dual domains\. We enforce fundamental physical laws through co\-designed architecture and losses: Parseval’s theorem via consistency mechanisms, frequency\-preserving pooling to avoid spectral bias [pmlr\-v97\-rahaman19a](https://arxiv.org/html/2606.04106#bib.bib19), and explicit symmetry learning via equivariance objectives\.

## 3 Methodology

Physical phenomena manifest as signals with general mathematical structure\. A seismic wave, RF transmission, and visual scene differ semantically but share fundamental properties: frequency decomposition \(Fourier\), predictable transformations \(symmetries\), and causal structure\. We hypothesize that domain differences arise from transformations in time, frequency, magnitude, or phase—transformations that manifest as learnable symmetries\. A model that learns these transformation rules should generalize across domains sharing the same mathematical structure\.

RF Data as Training Domain: We select RF data for its exceptional signal diversity: frequency content \(kHz to GHz\), temporal dynamics \(modulations, transients, fading\), and transformation types \(Doppler shifts, multipath, channel effects\)\. This rich diversity forces learning of general signal properties rather than domain\-specific shortcuts\. RF naturally requires joint time\-frequency analysis, making it ideal for learning dual\-domain representations\. Critically, RF is distant from our evaluation domains \(images, seismology, speech, histopathology\), providing a stringent test of our hypothesis\. Our RF fingerprinting datasets contain fine\-grained hardware imperfections, forcing highly discriminative feature learning while retaining higher\-order structure\.

### 3\.1 Co\-Designed Learning System

Our approach prioritizes principled design, explicitly encoding established physical principles \(Fourier decomposition, energy conservation, symmetry\)\. This follows a path similar to transformers [vaswani2023attentionneed](https://arxiv.org/html/2606.04106#bib.bib20) and CNNs [lecun\-gradientbased\-learning\-applied\-1998](https://arxiv.org/html/2606.04106#bib.bib21), where architectural innovations preceded formal theoretical understanding\.

Standard architectures lack the inductive biases that symmetry losses need to generate meaningful gradients\. Learning frequency translation equivariance requires preserving high\-frequency information, which standard pooling violates by creating spectral bias [pmlr\-v97\-rahaman19a](https://arxiv.org/html/2606.04106#bib.bib19)\. As a result, the architecture and loss functions must be co\-designed\.

We design PlanFormer’s components to provide computational substrate for our learning objectives: frequency\-preserving pooling enables equivariance losses to receive gradients for high\-frequency transformations; Parseval Focus enforces energy conservation \(a physics\-informed architectural constraint\); the Noise Sink enables invariance learning in negative\-SNR regimes\. We combine complementary objectives—symmetry losses \(IsoFICReg for invariance, LED for equivariance\) that learn transformation rules, and reconstruction losses for instance\-specific details—each enabled by specific architectural mechanisms\.

### 3\.2 The Plan\(cherel\)Former Architecture

#### 3\.2\.1 Dual\-Domain Architecture:

The encoder embeds Fourier’s theorem through dual\-domain processing operating simultaneously in time and frequency\. Input signals are processed in parallel: the time\-domain branch operates on the original signal and the frequency\-domain branch on its Fourier transform\. Unlike Fourier Neural Operators [li2021fourier](https://arxiv.org/html/2606.04106#bib.bib8), which apply asymmetric operations, PlanFormer applies symmetric sliding convolutions in both domains to learn complementary local features, relying on transformer mechanisms for global dependencies\.

We represent signals as pairs of interleaved real\-imaginary in\-phase and quadrature components \(IQ\)\. Real\-valued inputs \(speech, images\) undergo a Hilbert transformation [BendatPiersol2010Hilbert](https://arxiv.org/html/2606.04106#bib.bib22) to generate analytic representations, highlighting instantaneous frequency and phase changes\. Within transformer blocks, we reshape tensors such that each token’s embedding interleaves IQ pairs from consecutive sequence indices, ensuring a sequence length that aligns with physical complex\-valued samples\.
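The real-to-IQ conversion described above can be sketched as follows. This is a minimal illustration using only the Python standard library and a naive DFT; the function names (`dft`, `analytic_signal`, `interleave_iq`) and the frequency-domain construction of the analytic signal are assumptions for illustration, not the authors' implementation.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT (unnormalized forward transform, 1/N on the inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def analytic_signal(x):
    """Analytic signal of a real sequence: zero the negative-frequency bins."""
    n = len(x)
    spectrum = dft(x)
    gain = [0.0] * n
    gain[0] = 1.0                          # keep DC once
    if n % 2 == 0:
        gain[n // 2] = 1.0                 # keep the Nyquist bin once
        for k in range(1, n // 2):
            gain[k] = 2.0                  # double positive frequencies
    else:
        for k in range(1, (n + 1) // 2):
            gain[k] = 2.0
    return dft([s * g for s, g in zip(spectrum, gain)], inverse=True)


def interleave_iq(z):
    """Flatten complex samples into [I0, Q0, I1, Q1, ...]."""
    out = []
    for sample in z:
        out.extend((sample.real, sample.imag))
    return out


n = 16
tone = [math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
z = analytic_signal(tone)      # a real cosine becomes a complex exponential
iq = interleave_iq(z)          # 2 * n real values, alternating I and Q
```

For a pure cosine, the analytic signal has unit magnitude at every sample, which is the sense in which the representation exposes instantaneous amplitude and phase.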

Frequency\-Preserving Pooling: Standard pooling creates aliasing and discards high\-frequency information, contributing to spectral bias [pmlr\-v97\-rahaman19a](https://arxiv.org/html/2606.04106#bib.bib19)\. This is catastrophic for learning frequency translation equivariance: if high\-frequency information is eliminated early, equivariance losses have no gradient signal for high\-frequency transformations\. We perform average pooling directly on complex\-valued spectra, preserving the spectral envelope rather than truncating it\. Both high\- and low\-frequency information is retained through down\-sampling, with the spectrum compressed rather than band\-limited\.
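A toy comparison may help make the pooling argument concrete. The sketch below is an assumption-laden illustration (standard library only, naive DFT, a hypothetical `avg_pool` helper), not PlanFormer's pooling layer: it pools a Nyquist-rate tone once in the time domain and once on its complex spectrum.

```python
import cmath
import math


def dft(x):
    """Naive unnormalized DFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def avg_pool(values, factor):
    """Average non-overlapping groups of `factor` consecutive entries."""
    return [sum(values[i:i + factor]) / factor
            for i in range(0, len(values) - factor + 1, factor)]


n = 16
nyquist_tone = [(-1.0) ** t for t in range(n)]   # highest representable frequency

time_pooled = avg_pool(nyquist_tone, 2)          # pooling raw samples
spec_pooled = avg_pool(dft(nyquist_tone), 2)     # pooling the complex spectrum

time_energy = sum(abs(v) ** 2 for v in time_pooled)
spec_energy = sum(abs(v) ** 2 for v in spec_pooled)
```

Time-domain averaging cancels the alternating samples entirely, so `time_energy` is zero, while the pooled spectrum still carries a peak at the compressed bin. This is only the simplest case of the "compressed rather than band-limited" behaviour described above.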

Convolutional Tokenization: The encoder partitions inputs into non\-overlapping windows, processing each in both domains to capture non\-stationary spectral behavior\. Parallel convolutional tokenizers extract local features, with cross\-domain gated fusion at three points: after tokenization, after each transformer block, and after sequence pooling\. Domain\-specific aggregation respects physical meaning: time\-domain outputs are concatenated to preserve temporal ordering; frequency\-domain outputs are averaged to retain global periodic information\.

#### 3\.2\.2 Encoder: Energy Conservation using Multi\-Head Parseval Focus

Parseval’s theorem establishes energy conservation between time and frequency domains [parseval1799](https://arxiv.org/html/2606.04106#bib.bib23)\. To learn representations respecting this principle, we introduce Multi\-Head Parseval Focus, which enforces cross\-domain consistency through bidirectional attention and consistency regularization\.
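For reference, the discrete form of Parseval's theorem can be written as below. The unnormalized forward DFT convention (and hence the $1/N$ factor) is an assumption here, since the text does not state which normalization is used.

```latex
X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N},
\qquad
\sum_{n=0}^{N-1} \lvert x[n] \rvert^{2}
  = \frac{1}{N} \sum_{k=0}^{N-1} \lvert X[k] \rvert^{2}.
```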

The mechanism operates as follows: time\-domain tokens undergo FFT while frequency\-domain tokens undergo IFFT, leveraging interleaved IQ representation for complex\-valued transformations followed by domain\-specific QKV projections\. Within each domain, we compute the Scaled Covariance Focus:

$$\mathrm{Focus}(Q,K,V)=\mathrm{softmax}\left(\frac{\mathrm{Cov}(Q,K)\cdot L_{sequence}}{\sqrt{d_{k}}\cdot K_{focus}}\right)V$$

where $\mathrm{Cov}(Q,K)=(Q-\mu_{Q})(K-\mu_{K})^{T}$ captures functional relationships rather than instantaneous similarity\. The focus factor $K_{focus}\in[1,L_{sequence}]$ is predicted from attention score statistics \(mean token distinctiveness, proportion of highly distinctive tokens\) via $f_{\theta}$, providing adaptive temperature $\tau=K_{focus}/L_{sequence}$\. Low $K_{focus}$ sharpens attention when uncertain; high $K_{focus}$ maintains broad attention when confident \(details in Appendix [A\.6\.1\.2](https://arxiv.org/html/2606.04106#A1.SS6.SSS1.P2)\)\.
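A small numerical sketch of the Scaled Covariance Focus expression follows. Two points are assumptions rather than details from the text: the means $\mu_Q$ and $\mu_K$ are taken per feature across tokens, and $K_{focus}$ is passed in as a fixed number instead of being predicted by $f_\theta$. The helper names are hypothetical.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def center(tokens):
    """Subtract the per-feature mean taken across tokens (assumed meaning of mu)."""
    n, d = len(tokens), len(tokens[0])
    mean = [sum(tok[j] for tok in tokens) / n for j in range(d)]
    return [[tok[j] - mean[j] for j in range(d)] for tok in tokens]


def scaled_covariance_focus(q, k, v, k_focus):
    """softmax(Cov(Q, K) * L_sequence / (sqrt(d_k) * K_focus)) applied to V."""
    seq_len, d_k = len(k), len(k[0])
    qc, kc = center(q), center(k)
    scale = seq_len / (math.sqrt(d_k) * k_focus)
    weights = []
    for qi in qc:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in kc]
        weights.append(softmax(scores))
    d_v = len(v[0])
    out = [[sum(w[j] * v[j][c] for j in range(seq_len)) for c in range(d_v)]
           for w in weights]
    return out, weights


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v = [[1.0], [2.0], [3.0]]
_, sharp = scaled_covariance_focus(q, k, v, k_focus=1.0)   # low K_focus
_, broad = scaled_covariance_focus(q, k, v, k_focus=3.0)   # K_focus = L_sequence
```

With `k_focus=1.0` the first query concentrates more weight on its best-matching key than with `k_focus=3.0`, matching the adaptive-temperature reading $\tau=K_{focus}/L_{sequence}$.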

Bidirectional cross\-domain focus \(time\-as\-query/frequency\-as\-key and vice versa\) captures complementary features\. For physical consistency, both permutations should produce equivalent focus distributions\. We compute the Jensen\-Shannon Distance between the two distributions, $\text{JSD}_{time}(\mathbf{P}_{tf}\,\|\,\tilde{\mathbf{P}}_{ft})$ and $\text{JSD}_{freq}(\mathbf{P}_{ft}\,\|\,\tilde{\mathbf{P}}_{tf})$, where $\tilde{\mathbf{P}}=\text{softmax}(\text{Score}^{T})$\. The JSD serves dual purposes: as a regularization loss \(minimizing JSD encourages consistent cross\-domain relationships\) and as a dynamic gating alternative to softmax \(down\-weighting inconsistent relationships, amplifying consistent ones\)\. In\-domain and cross\-domain focus representations are fused via gated mechanisms with diversity regularization ensuring complementary learning\. All multi\-head focus mechanisms employ head orthogonalization and regularization [bardes2021vicreg](https://arxiv.org/html/2606.04106#bib.bib15) to enforce specialization and reduce common\-mode noise [ye2024differential](https://arxiv.org/html/2606.04106#bib.bib24)\.
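The bidirectional consistency check can be sketched as below. The Jensen-Shannon distance itself is standard (square root of the JS divergence, here with base-2 logarithms so it lies in [0, 1]); comparing distributions row by row and averaging the result is an assumption made for illustration, as are the function names.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)
    mid = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl(p, mid) + 0.5 * kl(q, mid)
    return math.sqrt(max(divergence, 0.0))   # clamp tiny negative round-off


def bidirectional_consistency(score_tf, score_ft):
    """Mean JS distance between softmax(score_tf) and softmax(score_ft^T), row by row."""
    p_tf = [softmax(r) for r in score_tf]
    p_ft_tilde = [softmax(r) for r in transpose(score_ft)]
    dists = [js_distance(a, b) for a, b in zip(p_tf, p_ft_tilde)]
    return sum(dists) / len(dists)


score_tf = [[1.0, 2.0], [3.0, 4.0]]
consistent = bidirectional_consistency(score_tf, transpose(score_tf))
inconsistent = bidirectional_consistency(score_tf, [[4.0, 1.0], [0.0, 2.0]])
```

When the frequency-to-time scores are exactly the transpose of the time-to-frequency scores, the distance is zero; any disagreement between the two directions yields a positive value that could act as a penalty or a gate.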

![Refer to caption](https://arxiv.org/html/2606.04106v1/Figures/FullMultiHeadParsevalFocusBlockDiagramPlusParsevalScaledFocus_Symmetric.png)

Figure 1: Multi\-Head Parseval Focus architecture\. \(a\) Multi\-Head Orthogonal Parseval Focus fuses in\-domain \(time\-time, freq\-freq\) and cross\-domain \(time\-freq, freq\-time\) focus via gated linear units for comprehensive signal analysis\. \(b\) Parseval Scaled Covariance Focus: Cross\-domain transformations \(FFT/IFFT\) with covariance\-based attention\. JSD between bidirectional distributions enforces Parseval\-consistent gating\. Adaptive $K_{focus}$ enables dynamic sparsity control\.

#### 3\.2\.3 Additional Encoder Mechanisms

Causal Cross\-Window Focus computes focus between consecutive windows’ convolutional representations to capture phase relationships spanning longer timescales\. Noise Sink explicitly estimates and removes noise during tokenization, regularized via Pearson correlation minimization and power matching constraints, enabling learning in negative\-SNR regimes\. Attentional Pooling [hassani2021escaping](https://arxiv.org/html/2606.04106#bib.bib25) produces fixed\-size latent representations from variable\-length sequences, with time\- and frequency\-domain latents used independently or concatenated for downstream tasks\.

### 3\.3 Decoder Architecture

The decoder employs a dual\-domain UNet architecture [ronneberger2015u](https://arxiv.org/html/2606.04106#bib.bib26), mirroring the encoder’s time\-frequency structure, and adds three key innovations enabling effective gradient generation: \(1\) Instance\-Specific Latent Conditioning via attention\-based FiLM [perez2017filmvisualreasoninggeneral](https://arxiv.org/html/2606.04106#bib.bib27) provides global context for instance\-specific reconstruction guidance, \(2\) Parseval Focus\-Based Skip Connection Sinks use Multi\-Head Parseval Focus conditioned on the bottleneck latent to dynamically filter skip connections, creating adaptive filter banks that remove unwanted information \(noise, interfering sources\), and \(3\) Frequency\-Domain Upsampling performs upsampling in the frequency domain with spectral infilling from skip connections, preserving the spectral envelope and avoiding low\-frequency bias\. These mechanisms enable reconstruction and source separation losses to provide clean gradients, particularly for high\-frequency features and in low\-SNR regimes \(details in Appendix [B](https://arxiv.org/html/2606.04106#A2)\)\.

### 3\.4 Training Methodology: Co\-Designed Loss Functions

#### 3\.4\.1 Isotropic Focal Invariance Covariance Regularization \(IsoFICReg\)

We extend VICReg [bardes2021vicreg](https://arxiv.org/html/2606.04106#bib.bib15) to enhance invariance learning under extreme noise while maintaining fine\-grained discrimination\. We leverage weak supervision from RF training data structure: multiple samples per emitter with unique hardware fingerprints\.

Latent Coherent Integration: Within each batch, we identify all samples from the same emitter and compute pairwise N\-choose\-2 invariance losses\. This is analogous to coherent integration in classical signal processing [richards2005fundamentals](https://arxiv.org/html/2606.04106#bib.bib16): enforcing consistency across multiple noisy observations reinforces the correlated signal while the uncorrelated noise averages out, effectively improving SNR in latent space\.

Focal Reweighting: We Z\-score standardize projections, then apply regression\-focused focal reweighting \(extended from [lin2017focal](https://arxiv.org/html/2606.04106#bib.bib28)\) to emphasize low\-SNR samples over trivial high\-SNR cases\. We complement attraction\-only invariance with a repulsion loss for different\-emitter pairs, encouraging fine\-grained discrimination\. Focal reweighting is also applied to covariance regularization, emphasizing decorrelation along harder dimensions\. The Z\-score standardization paired with covariance\-based decorrelation encourages learning isotropic representations\. Unlike the original VICReg, we omit the variance term because Z\-score standardization explicitly normalizes each dimension to unit variance\.

Dual\-Domain Consistency: We apply IsoFICReg to both time and frequency domain latents independently, and compute cross\-domain invariance \(time\-frequency pairs from the same sample\), enforcing Parseval\-like consistency\.
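The pairwise same-emitter invariance term can be illustrated with a short sketch. It covers only Z-score standardization and the N-choose-2 attraction loss; focal reweighting, repulsion, and covariance regularization are deliberately left out, and the function names and mean-squared-distance form are assumptions rather than the paper's exact loss.

```python
import math
from itertools import combinations


def zscore_columns(batch, eps=1e-8):
    """Standardize each embedding dimension across the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((c - mean) ** 2 for c in col) / n)
        for i in range(n):
            out[i][j] = (col[i] - mean) / (std + eps)
    return out


def coherent_invariance_loss(embeddings, emitter_ids):
    """Mean squared distance over every same-emitter pair in the batch."""
    z = zscore_columns(embeddings)
    total, pairs = 0.0, 0
    for i, j in combinations(range(len(z)), 2):
        if emitter_ids[i] == emitter_ids[j]:
            total += sum((a - b) ** 2 for a, b in zip(z[i], z[j])) / len(z[i])
            pairs += 1
    return total / pairs if pairs else 0.0


ids = [0, 0, 1, 1]
aligned = coherent_invariance_loss([[1.0, 2.0], [1.0, 2.0], [3.0, 0.0], [3.0, 0.0]], ids)
noisy = coherent_invariance_loss([[1.0, 2.0], [1.5, 2.0], [3.0, 0.0], [3.0, 0.0]], ids)
```

Embeddings that agree within each emitter give zero loss, while a perturbed observation of the same emitter is penalized, which is the consistency pressure the coherent-integration analogy describes.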

#### 3\.4\.2 Latent Equivariant Differences \(LED\)

While IsoFICReg learns invariances \(what should remain the same\), we explicitly learn equivariances \(how representations should change predictably under transformations\)\. This enables cross\-modal generalization: by learning that physical transformations have predictable effects in latent space, the model learns transformation rules that are domain\-invariant\.

We employ a two\-branch scheme \(Figure [2](https://arxiv.org/html/2606.04106#S3.F2)\): the clean branch processes unaugmented signals \(producing $z_{clean}$\), while the augmented branch processes signals with domain\-invariant transformations: frequency shifts, time shifts, phase rotations, and AWGN \(producing $z_{aug}$\)\. For each transformation $T$ applied to the input \(e\.g\., a frequency shift by $\Delta f$\), we \(1\) apply $T$ to the input and encode it to obtain $z_{aug}$, \(2\) encode the clean input and apply the latent transformer $T_{latent}$ to obtain $z_{latent\_transform}$, and \(3\) enforce that $z_{aug}\cong z_{latent\_transform}$\. The latent transformer applies transformations directly to encoded token representations using our Parseval Transformer architecture\. Transformation parameters are linearly projected and prepended as conditioning tokens\.

To verify that latent transformations correspond to input transformations, we compute differences over the pooled encoder outputs prior to IsoFICReg’s loss over projected embeddings:

$$\Delta_{latent}=z_{latent\_transform}-z_{clean},\qquad \Delta_{input}=z_{aug}-z_{clean}$$

These differences capture "how the representation changed" rather than "what the final representation is\." We enforce dual regression:

$$L_{Equi}=\|ep_{\theta}(\Delta_{latent})-\theta_{true}\|^{2}+\|ep_{\theta}(\Delta_{input})-\theta_{true}\|^{2}+\|\Delta_{latent}-\Delta_{input}\|^{2}$$

where $ep_{\theta}$ is an MLP projecting differences to match ground\-truth transformation parameters $\theta_{true}$\. The first two terms ensure each branch learns the correct transformation; the third ensures they learn the same transformation\. We apply focal reweighting to emphasize difficult instances\.
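The three-term equivariance loss can be traced with a few vectors. In this sketch the predictor is a hypothetical stand-in for the MLP $ep_\theta$ (it simply reads the first coordinate of the difference), focal reweighting is omitted, and all names are illustrative assumptions.

```python
def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def sq_norm(v):
    return sum(x * x for x in v)


def led_loss(z_clean, z_aug, z_latent_transform, theta_true, predictor):
    """Dual regression on latent differences plus a difference-matching term."""
    delta_latent = sub(z_latent_transform, z_clean)
    delta_input = sub(z_aug, z_clean)
    term_latent = sq_norm(sub(predictor(delta_latent), theta_true))
    term_input = sq_norm(sub(predictor(delta_input), theta_true))
    term_match = sq_norm(sub(delta_latent, delta_input))
    return term_latent + term_input + term_match


def toy_predictor(delta):
    # Hypothetical stand-in for ep_theta: reads the shift from one coordinate.
    return [delta[0]]


z_clean = [0.0, 1.0]
z_aug = [0.5, 1.0]                 # encoder output for the transformed input
theta_true = [0.5]                 # ground-truth transformation parameter

perfect = led_loss(z_clean, z_aug, [0.5, 1.0], theta_true, toy_predictor)
mismatch = led_loss(z_clean, z_aug, [0.3, 1.0], theta_true, toy_predictor)
```

When the latent transformer reproduces the augmented embedding exactly, all three terms vanish. In the mismatched case the latent-branch regression term and the matching term each contribute about 0.04, while the input-branch term stays at zero.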

![Refer to caption](https://arxiv.org/html/2606.04106v1/Figures/TimeFrequencyJointEmbeddingPredictiveArchitecture_updatedIsoFICReg.png)

Figure 2: Time\-Frequency Joint Embedding Predictive Architecture for Invariant and Equivariant Symmetry Learning\. The output of the PlanFormer encoder produces time and frequency representations that both go through subsequent domain\-specific latent transformations and expander networks\. Afterwards, the time\-time, freq\-freq, and time\-freq latent losses are computed\.

#### 3\.4\.3 Complementary Objectives

We complement global latent learning with instance\-specific objectives\. Token\-Level Source Separation creates mixtures of two signals at controlled SINR \(20 to 0 dB\) and separates them at the token level using a Parseval Transformer and ICA\-inspired processing\. Separated tokens are processed through the remaining encoder and decoder, with the original unmixed signals as reconstruction/invariance/equivariance targets\. Dual\-Domain Reconstruction computes losses in both time and frequency domains with circular phase variants, using augmented signals \(minus noise\) as targets to provide explicit equivariance supervision\. We add perceptual loss using an EMA\-updated encoder\. All reconstruction losses are focally reweighted to emphasize low\-SNR instances\.
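Creating a two-signal mixture at a controlled ratio is a standard power-scaling calculation, sketched below. The sketch ignores additive noise, so the SINR reduces to a signal-to-interference ratio; that simplification and the names `power` and `mix_at_sinr` are assumptions for illustration only.

```python
import math


def power(x):
    """Average power of a real or complex sequence."""
    return sum(abs(s) ** 2 for s in x) / len(x)


def mix_at_sinr(signal, interferer, sinr_db):
    """Scale the interferer so that signal power / interferer power equals sinr_db."""
    target_ratio = 10.0 ** (sinr_db / 10.0)
    gain = math.sqrt(power(signal) / (power(interferer) * target_ratio))
    scaled = [gain * s for s in interferer]
    mixture = [a + b for a, b in zip(signal, scaled)]
    return mixture, scaled


signal = [complex(math.cos(0.3 * t), math.sin(0.3 * t)) for t in range(64)]
interferer = [complex(2.0 * math.cos(1.1 * t), 0.5 * math.sin(1.1 * t)) for t in range(64)]

mixture, scaled = mix_at_sinr(signal, interferer, sinr_db=10.0)
measured_db = 10.0 * math.log10(power(signal) / power(scaled))
```

Sweeping `sinr_db` from 20 down to 0 would produce progressively harder mixtures, with the unmixed `signal` and `scaled` sequences available as separation targets.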

#### 3\.4\.4 Overall Training Objectives

Our complete training loss combines complementary objectives:

$$L_{total}=L_{IsoFICReg}+L_{Equi}+\gamma_{sep}\cdot L_{sep}+\gamma_{recon}\cdot L_{recon}+L_{reg}$$

where $L_{reg}$ includes Parseval consistency, noise decorrelation, power matching, SNR regression, etc\. \(details in Appendix [C](https://arxiv.org/html/2606.04106#A3)\)\.

## 4 Experiments

Training Data: We train exclusively on RF fingerprinting datasets to test whether signal\-theoretic principles learned on RF transfer to diverse domains\. We combine five datasets \(ORACLE [sankhe2019no](https://arxiv.org/html/2606.04106#bib.bib29), POWDER [reus2020trust](https://arxiv.org/html/2606.04106#bib.bib30), and three internally collected RF fingerprinting datasets we intend to release upon acceptance\) providing 39 emitter classes with diversity across hardware families, modulation schemes \(11 schemes\), protocols \(Wi\-Fi, 4G, 5G\), and channel conditions \(over\-the\-air, temporal/spatial variation\)\. This forces learning of general signal properties rather than dataset\-specific patterns\. Training details are in Appendix [D](https://arxiv.org/html/2606.04106#A4) and [E](https://arxiv.org/html/2606.04106#A5)\.

Evaluation Protocol: We evaluate cross\-modal transfer via linear probing on frozen encoder representations\. The encoder remains completely frozen while lightweight linear classifiers are trained on labeled target\-domain data via 5\-fold cross\-validation\. This standard protocol [pmlr\-v119\-chen20j](https://arxiv.org/html/2606.04106#bib.bib14); [he2020momentum](https://arxiv.org/html/2606.04106#bib.bib31); [radford2021learning](https://arxiv.org/html/2606.04106#bib.bib1) measures representation quality through linear separability\. We use ’frozen\-encoder transfer’ rather than ’zero\-shot’ to accurately reflect this methodology\. Non\-linear \(RBF kernel\) results are in Appendix [E\.3](https://arxiv.org/html/2606.04106#A5.SS3)\. We evaluate across time series, images, text, and video to test whether \(1\) signal\-theoretic principles enable cross\-modal generalization, and \(2\) performance varies systematically with task type\.

Preprocessing: Real\-valued domains undergo Hilbert transformation [BendatPiersol2010Hilbert](https://arxiv.org/html/2606.04106#bib.bib22) to generate complex\-valued \(IQ\) representations\. 1D: Resample to 5120 points minimum; text converts to byte streams\. 2D: Unwrap images in snake pattern \(vertical/horizontal axes\), process color channels independently, and concatenate\. This tests whether signal\-theoretic principles enable transfer despite discarding 2D spatial structure and no prior exposure to spatial data\. 3D: Unwrap frames as in 2D and concatenate successive frames\. Full details in Appendix [E](https://arxiv.org/html/2606.04106#A5)\.
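The flattening steps above can be sketched as follows. How the horizontal and vertical snake scans are combined, and the UTF-8 encoding for text, are not specified in the text, so the two scan functions and `text_to_samples` are hypothetical illustrations rather than the authors' preprocessing code.

```python
def snake_rows(image):
    """Horizontal snake scan: left-to-right on even rows, right-to-left on odd rows."""
    out = []
    for r, row in enumerate(image):
        out.extend(row if r % 2 == 0 else reversed(row))
    return out


def snake_columns(image):
    """Vertical snake scan: top-to-bottom on even columns, bottom-to-top on odd ones."""
    columns = [list(col) for col in zip(*image)]
    return snake_rows(columns)


def text_to_samples(text):
    """Turn text into a 1D sequence of byte values (UTF-8 is an assumed encoding)."""
    return list(text.encode("utf-8"))


image = [[1, 2, 3],
         [4, 5, 6]]
horizontal = snake_rows(image)
vertical = snake_columns(image)
byte_stream = text_to_samples("RF")
```

A snake scan keeps spatially adjacent pixels adjacent in the 1D sequence at each row or column turn, which a plain raster scan does not. The resulting real-valued sequences would then go through the Hilbert step to obtain IQ samples.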

| | Domain | Task \(Dataset\) | Classes | Type | Random Init\. Top\-1 \(Lin\) | Top\-1 \(Lin\) | Top\-3 \(Lin\) | CLIP ViT\-B/32 [radford2021learning](https://arxiv.org/html/2606.04106#bib.bib1) Top\-1 \(Lin\) | DinoV3 ViT\-S [Simeoni2025DINOv3](https://arxiv.org/html/2606.04106#bib.bib32) Top\-1 \(Lin\) | Baseline |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1\-D | RF\* | Modulation Recognition | 8 | Physical | 57\.8±0\.3 | 95\.5±0\.8 | 99\.2±0\.2 | 97\.0±0\.3 | 92\.2±0\.1 | N/A |
| | RF\* | Fingerprinting \(POWDER\) [reus2020trust](https://arxiv.org/html/2606.04106#bib.bib30) | 4 | Physical | 67\.9±2\.7 | 87\.6±0\.3 | N/A | 55±0\.2 | 66\.3±0\.3 | 93\.0 \(SR\) |
| | Speech | BiLingual Speaker Recognition \(TidyVoice\) [farhadipour2026tidyvoice](https://arxiv.org/html/2606.04106#bib.bib33) | 50 | Physical | 60\.5±5\.4 | 90\.1±2\.6 | 97\.7±1\.0 | 97\.8±0\.4 | 96\.8±0\.8 | 93\.8±1\.9 \(E\) |
| | Speech | Language Recognition \(TidyVoice\) [farhadipour2026tidyvoice](https://arxiv.org/html/2606.04106#bib.bib33) | 29 | Semantic | 49\.9±4\.3 | 69\.8±4\.1 | 91\.2±2\.0 | 87\.3±2\.6 | 83\.1±2\.1 | 69\.7±3\.5 \(E\) |
| | Music | Instrument Family \(TinySOL\) [cella\_2020\_3685331](https://arxiv.org/html/2606.04106#bib.bib34) | 4 | Physical | 91\.6±1\.0 | 91\.7±1\.1 | N/A | 99\.1±0\.4 | 96\.5±0\.8 | 90\.6±1\.3 \(E\) |
| | Music | Individual Instrument \(TinySOL\) [cella\_2020\_3685331](https://arxiv.org/html/2606.04106#bib.bib34) | 14 | Phys\+Sem | 74\.8±1\.5 | 80\.5±1\.5 | 95\.5±0\.9 | 97\.5±0\.5 | 89\.9±1\.1 | 90\.0±1\.3 \(E\) |
| | Music | Genre Recognition \(GTZAN\) [tzanetakis2002musical](https://arxiv.org/html/2606.04106#bib.bib35) | 10 | Semantic | 47\.1±3\.3 | 64\.1±3\.2 | 89\.4±2\.1 | 82\.1±2\.4 | 70\.2±1\.2 | 62\.5±2\.3 \(E\) |
| | Seismology | Seismic Event Classification [SCSN1926](https://arxiv.org/html/2606.04106#bib.bib36); [SCEDC2013](https://arxiv.org/html/2606.04106#bib.bib37) | 3 | Physical | 85\.4±0\.4 | 89\.0±0\.3 | N/A | 93\.0±0\.4 | 94\.2±0\.2 | 98\.0 \(SR, a\) |
| | Text | ArXiv Paper Sub\-Discipline Classification [He2019LongDC](https://arxiv.org/html/2606.04106#bib.bib38) | 9 | Semantic | 24\.2±0\.7 | 36\.9±0\.4 | 71\.8±0\.3 | N/A | N/A | 80\.47 \(SR\) |
| | Text | ArXiv Paper Field Classification [He2019LongDC](https://arxiv.org/html/2606.04106#bib.bib38) | 2 | Structural\+Sem | 77\.0±0\.1 | 82\.7±0\.2 | N/A | N/A | N/A | N/A |
| 2\-D | Vision | Digit \(MNIST\) [lecun2010mnist](https://arxiv.org/html/2606.04106#bib.bib39) | 10 | Phys\+Sem | 91\.8±0\.1 | 79\.2±0\.1 | 94\.5±0\.05 | 98\.6±0\.1 | 93\.4±0\.1 | 99\.2 \(T\) |
| | Vision | Clothing \(FashionMNIST\) [xiao2017/online](https://arxiv.org/html/2606.04106#bib.bib40) | 10 | Phys\+Sem | 83\.4±0\.1 | 77\.0±0\.01 | 96\.1±0\.04 | 90\.4±0\.1 | 83\.3±0\.1 | 94\.9 \(S\) [fashionmnist\_leaderboard](https://arxiv.org/html/2606.04106#bib.bib41) |
| | Medical | Pathology \(PathMNIST\) [medmnistv2](https://arxiv.org/html/2606.04106#bib.bib42) | 9 | Physical | 67\.1±0\.2 | 71\.3±0\.2 | 91\.9±0\.2 | 92\.0±0\.2 | 75\.9±0\.1 | 91\.1 \(SR\) |
| | Vision | Fake Detection \(CIFAKE\) [bird2023cifakeimageclassificationexplainable](https://arxiv.org/html/2606.04106#bib.bib43) | 2 | Physical | 78\.1±0\.2 | 82\.0±0\.1 | N/A | 94\.73±0\.1 | 77\.1±0\.3 | 92\.98 \(SR\) |
| 3\-D | Video | Normal vs Abnormal Mitosis \(Full Video\) [delgado2024automatic](https://arxiv.org/html/2606.04106#bib.bib44); [Delgado\_Mitosis\_Classification\_2023](https://arxiv.org/html/2606.04106#bib.bib45) | 2 | Physical | 65\.7±4\.9 | 68\.7±2\.8 | N/A | 72\.6±3\.3 | 69\.3±2\.1 | 94 \(SR\) |
| | Video | Normal vs Abnormal Mitosis \(1st Frame\) [delgado2024automatic](https://arxiv.org/html/2606.04106#bib.bib44); [Delgado\_Mitosis\_Classification\_2023](https://arxiv.org/html/2606.04106#bib.bib45) | 2 | Physical | 67\.4±1\.7 | 64\.1±1\.1 | N/A | 65\.8±1\.7 | 63\.5±2\.1 | N/A |
| | Average \(All\) | | | | 68\.2% | 77\.7% | 91\.9% | 83\.8% | 83\.7% | 88\.4% |
| | Average \(1\-D Only\) | | | | 63\.6% | 78\.8% | 90\.8% | 88\.6% | 86\.2% | 84\.8% |
| | Average \(Physical Only\) | | | | 71\.8% | 84\.5% | 96\.3% | 87\.7% | 83\.6% | 93\.4% |
| | Average \(Semantic/Mixed\) | | | | 64\.0% | 70\.0% | 89\.8% | 91\.2% | 84% | 82\.8% |
| | Number of Parameters \(M = Million\) | | | | \- | 1\.99M | \- | 151\.28M | 21M | N/A |
| | Floating Point Operations \(FLOPs\) | | | | \- | 93\.6 MFLOPs | \- | 14,780 MFLOPs [ilharco\_gabriel\_2021\_5143773](https://arxiv.org/html/2606.04106#bib.bib46) | 12,000 MFLOPs \(EF\) [Simeoni2025DINOv3](https://arxiv.org/html/2606.04106#bib.bib32) | N/A |

Table 1: Values given as mean ± std unless noted\. Frozen\-encoder transfer via linear probing \(5\-fold CV\) on frozen representations\. Type: Physical \(structure\-based\), Semantic \(meaning\-based\), Phys\+Sem \(mixed\)\. \*Within\-domain generalization\. Baselines: E, expert features \(MFCC\) \+ linear SVM; S, supervised neural network; SR, source paper results; T, typical literature performance\. a: Meier\+ 2019 \(avg\. 99\.95% local, 95\.36% teleseismic\)\. Signal\-theoretic learning captures physical structure effectively \(84\.5%\), with graceful degradation on semantic tasks \(70\.0%\)\. CLIP/DinoV3 comparison: PlanFormer achieves competitive performance on physical tasks \(84\.5% vs 87\.7%/83\.6%\) with 76×/11× fewer parameters and 158×/128× lower FLOPs; semantic gaps \(70\.0% vs 91\.2%/84\.0%\) validate the physical/semantic boundary\. EF: FLOPs for 256×256; inputs used for evaluation were 224×224\.

### 4\.1 Main Results: Frozen\-Encoder Cross\-Modal Transfer

Table [1](https://arxiv.org/html/2606.04106#S4.T1) presents results across 15 diverse tasks spanning time series, text, images, and video\. Our learned representations, trained exclusively on RF data, perform well on physical tasks \(84\.5% top\-1, 96\.3% top\-3 average\) and match or exceed expert\-crafted features \(MFCC\) on semantic tasks: 69\.8% \(ours\) vs\. 69\.7% \(MFCC\) for language recognition and 64\.1% \(ours\) vs\. 62\.5% \(MFCC\) for music genre classification\. This demonstrates that signal\-theoretic learning captures features comparable to decades of domain\-specific signal processing research, while maintaining generality across diverse domains \(images, text, video\) where MFCC features do not apply\.
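
The evaluation protocol \(a linear probe on frozen features, 5\-fold cross\-validation, top\-1 and top\-3 accuracy\) can be sketched as below\. This is a stand\-in under stated assumptions: the probe here is a closed\-form ridge regression onto one\-hot targets, the folds are unstratified, and all function names are hypothetical\. The classifier actually used for probing is not specified in this section and may differ\.

```python
import numpy as np


def fit_linear_probe(feats, labels, n_classes, ridge=1e-3):
    """Assumed probe: ridge regression onto one-hot targets, solved in closed form."""
    X = np.hstack([feats, np.ones((len(feats), 1))])  # append a bias column
    Y = np.eye(n_classes)[labels]                      # one-hot targets
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)


def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean((topk == labels[:, None]).any(axis=1)))


def cross_validated_probe(feats, labels, n_classes, n_folds=5, seed=0):
    """Return (mean, std) of top-1 and of top-3 accuracy across the folds."""
    order = np.random.default_rng(seed).permutation(len(feats))
    folds = np.array_split(order, n_folds)
    top1, top3 = [], []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        W = fit_linear_probe(feats[train_idx], labels[train_idx], n_classes)
        X_test = np.hstack([feats[test_idx], np.ones((len(test_idx), 1))])
        scores = X_test @ W
        top1.append(topk_accuracy(scores, labels[test_idx], 1))
        top3.append(topk_accuracy(scores, labels[test_idx], min(3, n_classes)))
    return (np.mean(top1), np.std(top1)), (np.mean(top3), np.std(top3))
```

Here `feats` would hold the frozen encoder outputs for one task; the encoder itself is never updated, so only the probe weights depend on the target domain\.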

Efficiency vs Scale\-Driven Models\. We compare to CLIP ViT\-B/32 [radford2021learning](https://arxiv.org/html/2606.04106#bib.bib1) \(151M params\) and DinoV3 ViT\-S [Simeoni2025DINOv3](https://arxiv.org/html/2606.04106#bib.bib32) \(21M params\)\. PlanFormer achieves competitive performance on physical tasks \(84\.5% vs 87\.7%/83\.6%\) with 76×/11× fewer parameters and 158×/128× lower FLOPs\. Semantic gaps \(70\.0% vs 91\.2%/84\.0%\) validate our physical/semantic boundary hypothesis\. PlanFormer outperforms both on RF fingerprinting \(87\.6% vs 55\.0%/66\.3%\), demonstrating domain\-aligned pretraining advantages\.
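
The quoted efficiency ratios follow directly from the parameter and FLOP counts in Table 1\. The short check below recomputes them; the dictionary and function names are illustrative only\.

```python
# Values as reported in Table 1 (parameters in millions, compute in MFLOPs).
params_m = {"PlanFormer": 1.99, "CLIP ViT-B/32": 151.28, "DinoV3 ViT-S": 21.0}
mflops = {"PlanFormer": 93.6, "CLIP ViT-B/32": 14780.0, "DinoV3 ViT-S": 12000.0}


def ratio_vs_planformer(table):
    """Divide every other entry by the PlanFormer entry."""
    base = table["PlanFormer"]
    return {name: value / base for name, value in table.items() if name != "PlanFormer"}


param_ratios = ratio_vs_planformer(params_m)
flop_ratios = ratio_vs_planformer(mflops)
for name in param_ratios:
    print(f"{name}: {param_ratios[name]:.1f}x parameters, {flop_ratios[name]:.1f}x FLOPs")
```

Rounded to integers, these ratios give the 76×/11× and 158×/128× figures quoted above\. Note that the DinoV3 FLOP count in Table 1 is reported for 256×256 inputs while evaluation used 224×224, so that ratio is approximate\.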

Representation Quality\. High top\-3 accuracy \(91\.9% average\) suggests that top\-1 errors are largely within\-category confusions rather than fundamental misunderstanding\. Zero\-shot image reconstructions \(Figure [3](https://arxiv.org/html/2606.04106#S4.F3)\) provide visual confirmation: despite training only on RF with 1D unwrapping, reconstructions preserve structural coherence and spatial relationships\. These complementary perspectives indicate that the learned representations capture transferable physical features rather than domain\-specific patterns\.

### 4\.2 Probing the Physical\-Semantic Boundary

We design controlled experiments to test our hypothesis, each showing strong performance on physical tasks and graceful degradation on semantic tasks with interpretable failure modes\. The encoder has zero prior exposure to these domains and relies only on physically\-anchored representations of signal dynamics\. Our strict frozen\-encoder protocol \(training exclusively on RF with zero target\-domain exposure\) provides unambiguous evidence of cross\-modal transfer through signal\-theoretic principles, in contrast to approaches that require large\-scale diverse datasets\.

Instrument Classification: Human Semantics on Physical Mechanisms\. Instruments are physically manufactured to produce intentional frequency responses, but humans introduce nonlinear effects through excitation: notes, tempo, pitch, playing style\. Using TinySOL [cella\_2020\_3685331](https://arxiv.org/html/2606.04106#bib.bib34) \(14 instruments, 4 families\), we achieve 80\.5% instrument\-level and 91\.7% family\-level accuracy\. Instrument\-level top\-3 accuracy \(95\.5%\) recovers most of the loss attributed to human\-level semantics\. Confusion occurs within families or across instruments with overlapping physical features \(timbre, pitch\), demonstrating learned physical structure with predictable semantic degradation, all without prior exposure to musical data\.

Bilingual Speaker Recognition: Human as Physical Mechanism\. We view the human as the physical mechanism and language as the semantic layer\. Using TidyVoice [farhadipour2026tidyvoice](https://arxiv.org/html/2606.04106#bib.bib33); [farhadipour2026tidyvoice2026challengeevaluation](https://arxiv.org/html/2606.04106#bib.bib47) \(bilingual speakers across 29 languages\), we achieve 90\.1% speaker recognition \(physical: voice characteristics\) but 69\.8% language classification \(semantic: linguistic content\)\. Critically, speaker recognition accuracy is consistent across languages: the model recognizes the same speaker whether speaking English or Mandarin, demonstrating that it learned voice characteristics independent of linguistic content\. Top\-3 accuracy \(97\.7% for speaker, 91\.2% for language\) shows that the underlying physical encoding captures meaningful information even when the top\-1 prediction is wrong\.

Semantic Stress Tests: Music Genre & Academic Discipline Text Classification\. As the most stringent tests of our hypothesis, we evaluate two tasks driven primarily by semantics: music genre classification \(GTZAN [tzanetakis2002musical](https://arxiv.org/html/2606.04106#bib.bib35); [gtzan\_kaggle](https://arxiv.org/html/2606.04106#bib.bib48)\) and academic discipline classification from arXiv papers [He2019LongDC](https://arxiv.org/html/2606.04106#bib.bib38)\. These tasks probe different aspects of the physical\-semantic boundary:

Music Genre: Genres share substantial physical structure \(instruments, tempo, harmonic patterns\) but differ in cultural context, production style, and artistic intent\. We achieve 64\.1% top\-1 accuracy \(vs\. 10% random chance\) and 89\.4% top\-3 accuracy\.

![Refer to caption](https://arxiv.org/html/2606.04106v1/Figures/DiscriminativePlusReconstructionQualityViz_FrozenEncoder.jpg)

Figure 3: Learned representation quality\. RF modulation t\-SNE and confusion matrix show tight clustering for the physical task\. FashionMNIST shows structural clustering with within\-category confusion\. Zero\-shot image reconstruction: reconstructions \(middle\) preserve structural coherence and spatial relationships from originals \(top\)\.

Academic Discipline: Text classification provides the most extreme test: \(1\) drastic signal origin shift \(text byte streams vs\. RF\), \(2\) discrimination driven by semantic content rather than physical structure\. We achieve 36\.9% top\-1 accuracy \(vs\. 11\.1% random chance\) and 71\.8% top\-3 accuracy\. Binary classification between structurally distinct fields \(math vs\. computer science\) achieves 82\.7%, demonstrating that even raw text contains structural cues\. Together, these validate graceful degradation as tasks become more semantic\.

Image Classification: Physical Structure vs\. Semantic Context\. We evaluate on MNIST [lecun2010mnist](https://arxiv.org/html/2606.04106#bib.bib39) \(digits, structural with human handwriting variation\) and FashionMNIST [xiao2017/online](https://arxiv.org/html/2606.04106#bib.bib40) \(clothing, physical structure with semantic labels\)\. Despite two disadvantages, \(1\) unwrapping images loses spatial correlations and \(2\) the encoder has no prior exposure to spatial data, we achieve 79\.2% on MNIST and 77\.0% on FashionMNIST\. Top\-3 accuracy \(94\.5% MNIST, 96\.1% FashionMNIST\) shows structural features capture foundational information\.

Additional Image Tasks\. PathMNIST \(9\-class histopathology\): 71\.3% accuracy demonstrates that signal\-theoretic principles provide useful features for medical imaging despite our 1D unwrapping approach being suboptimal\. CIFAKE \(binary real vs\. synthetic\): 82\.0% accuracy suggests our model captures physical signatures of natural images that synthetic images approximate imperfectly\.

Video Dynamics Classification: Spatio\-Temporal Extension\. We evaluate 3D extensibility using Mitosis Classification [delgado2024automatic](https://arxiv.org/html/2606.04106#bib.bib44); [Delgado\_Mitosis\_Classification\_2023](https://arxiv.org/html/2606.04106#bib.bib45), differentiating normal vs\. abnormal cell division\. This dataset was chosen to mitigate frame\-level shortcuts: the task label is a temporal process, not a static property\. To validate spatio\-temporal learning, we compare full video vs\. first\-frame\-only processing\. The first frame alone should not discriminate between processes, requiring full video for accurate classification\.

We achieve 68\.7% \(full video\) vs\. 64\.1% \(first frame, not used in average metrics\), a 4\.6 percentage\-point difference validating that the model leverages spatio\-temporal information for encoding physical dynamics\.

Cross\-Protocol RF Fingerprinting: Within\-Domain Generalization

To validate that our model learned strong RF representations, we evaluate on POWDER transmissions [reus2020trust](https://arxiv.org/html/2606.04106#bib.bib30) with an unseen protocol \(5G\) and different channel conditions \(different collection day\)\. The 5G protocol differs from training protocols \(Wi\-Fi, 4G\) in modulation schemes, frame structures, and frequency bands\. Despite compounded distribution shifts \(protocols & channels\), linear probing achieves 87\.6% accuracy\. This complements cross\-modal results: transfer from RF to images/text/video \(cross\-modal\) and from Wi\-Fi/4G to 5G \(within\-domain\) both require learning robust physical structure\.

## 5 Conclusion

Across this series of diverse experiments, we provide strong evidence for our hypothesis: signal\-theoretic principles enable robust cross\-modal transfer\. We achieve competitive performance on physical tasks \(84\.5%, within 3\.2 percentage points of CLIP’s 87\.7%\) despite 76× fewer parameters and RF\-only pretraining, while graceful degradation on semantic tasks \(70\.0% vs CLIP’s 91\.2%\) empirically validates the physical\-semantic boundary\. This establishes a clear division of labor: physics\-driven foundation models capture causal structure and domain\-invariant properties upon which semantic reasoning can be built\.

## Acknowledgment

DISTRIBUTION STATEMENT A\. Approved for public release\. Distribution is unlimited\.

This material is based upon work supported by the Under Secretary of War for Research and Engineering under Air Force Contract No\. FA8702\-15\-D\-0001 or FA8702\-25\-D\-B002\. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author\(s\) and do not necessarily reflect the views of the Under Secretary of War for Research and Engineering\.

© 2026 Massachusetts Institute of Technology\.

Delivered to the U\.S\. Government with Unlimited Rights, as defined in DFARS Part 252\.227\-7013 or 7014 \(Feb 2014\)\. Notwithstanding any copyright notice, U\.S\. Government rights in this work are defined by DFARS 252\.227\-7013 or DFARS 252\.227\-7014 as detailed above\. Use of this work other than as specifically authorized by the U\.S\. Government may violate any copyrights that exist in this work\.

## References

- \[1\] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al\. Learning transferable visual models from natural language supervision\. In International conference on machine learning, pages 8748–8763\. PMLR, 2021\.
- \[2\] Rohit Girdhar, Alaaeldin El\-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra\. Imagebind: One embedding space to bind them all\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15180–15190, 2023\.
- \[3\] Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, and Ludwig Schmidt\. Data determines distributional robustness in contrastive language image pre\-training \(clip\)\. In International conference on machine learning, pages 6216–6234\. PMLR, 2022\.
- \[4\] John P Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt\. Accuracy on the line: on the strong correlation between out\-of\-distribution and in\-distribution generalization\. In International conference on machine learning, pages 7721–7735\. PMLR, 2021\.
- \[5\] Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt\. Measuring robustness to natural distribution shifts in image classification\. Advances in Neural Information Processing Systems, 33:18583–18599, 2020\.
- \[6\] David I\. Shuman, Sunil K\. Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst\. The emerging field of signal processing on graphs: Extending high\-dimensional data analysis to networks and other irregular domains\. IEEE Signal Processing Magazine, 30\(3\):83–98, 2013\.
- \[7\] Maziar Raissi, Paris Perdikaris, and George E\. Karniadakis\. Physics\-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations\. Journal of Computational Physics, 378:686–707, 2019\.
- \[8\] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar\. Fourier neural operator for parametric partial differential equations\. In International Conference on Learning Representations, 2021\.
- \[9\] Taco Cohen and Max Welling\. Group equivariant convolutional networks\. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2990–2999\. PMLR, 2016\.
- \[10\] Maurice Weiler and Gabriele Cesa\. General e\(2\)\-equivariant steerable cnns\. In Advances in Neural Information Processing Systems 32 \(NeurIPS 2019\), pages 14357–14368, 2019\.
- \[11\] Jaemyung Yu, Jaehyun Choi, Dong\-Jae Lee, HyeongGwon Hong, and Junmo Kim\. Self\-supervised transformation learning for equivariant representations, 2025\.
- \[12\] Quentin Garrido, Laurent Najman, and Yann Lecun\. Self\-supervised learning of split invariant equivariant representations\. arXiv preprint arXiv:2302\.10283, 2023\.
- \[13\] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick\. Masked autoencoders are scalable vision learners\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022\.
- \[14\] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton\. A simple framework for contrastive learning of visual representations\. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1597–1607\. PMLR, 2020\.
- \[15\] Adrien Bardes, Jean Ponce, and Yann LeCun\. Vicreg: Variance\-invariance\-covariance regularization for self\-supervised learning\. arXiv preprint arXiv:2105\.04906, 2021\.
- \[16\] Mark A Richards\. Fundamentals of Radar Signal Processing\. McGraw\-Hill, New York, 2005\.
- \[17\] James Lee\-Thorp, Joshua Ainslie, Ilya Eckstein, and Santiago Ontanon\. FNet: Mixing tokens with Fourier transforms\. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4296–4313, Seattle, United States, July 2022\. Association for Computational Linguistics\.
- \[18\] Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, and Marinka Zitnik\. Self\-supervised contrastive pre\-training for time series via time\-frequency consistency\. Advances in neural information processing systems, 35:3988–4003, 2022\.
- \[19\] Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville\. On the spectral bias of neural networks\. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5301–5310\. PMLR, 09–15 Jun 2019\.
- \[20\] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N\. Gomez, Lukasz Kaiser, and Illia Polosukhin\. Attention is all you need, 2023\.
- \[21\] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner\. Gradient\-based learning applied to document recognition\. Proceedings of the IEEE, 86\(11\):2278–2324, 1998\.
- \[22\] Julius S\. Bendat and Allan G\. Piersol\. The Hilbert Transform, chapter 13, pages 473–503\. Wiley, Hoboken, NJ, USA, 2010\.
- \[23\] Marc\-Antoine Parseval des Chênes\. Mémoire sur les séries et sur l’intégration complète d’une équation aux différences partielles linéaire du second ordre, à coefficients constants\. Paris, 1799\.
- \[24\] Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, and Furu Wei\. Differential transformer\. arXiv preprint arXiv:2410\.05258, 2024\.
- \[25\] Ali Hassani, Steven Walton, Nikhil Shah, Abulikemu Abuduweili, Jiachen Li, and Humphrey Shi\. Escaping the big data paradigm with compact transformers\. arXiv preprint arXiv:2104\.05704, 2021\.
- \[26\] Olaf Ronneberger, Philipp Fischer, and Thomas Brox\. U\-net: Convolutional networks for biomedical image segmentation\. In International Conference on Medical image computing and computer\-assisted intervention, pages 234–241\. Springer, 2015\.
- \[27\] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville\. Film: Visual reasoning with a general conditioning layer, 2017\.
- \[28\] Tsung\-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár\. Focal loss for dense object detection\. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017\.
- \[29\] Kunal Sankhe, Mauro Belgiovine, Fan Zhou, Luca Angioloni, Frank Restuccia, Salvatore D’Oro, Tommaso Melodia, Stratis Ioannidis, and Kaushik Chowdhury\. No radio left behind: Radio fingerprinting through deep learning of physical\-layer hardware impairments\. IEEE Transactions on Cognitive Communications and Networking, 6\(1\):165–178, 2019\.
- \[30\] Guillem Reus\-Muns, Dheryta Jaisinghani, Kunal Sankhe, and Kaushik R Chowdhury\. Trust in 5g open rans through machine learning: Rf fingerprinting on the powder pawr platform\. In GLOBECOM 2020\-2020 IEEE Global Communications Conference, pages 1–6\. IEEE, 2020\.
- \[31\] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick\. Momentum contrast for unsupervised visual representation learning\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\), pages 9729–9738, 2020\.
- \[32\] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al\. Dinov3\. arXiv preprint arXiv:2508\.10104, 2025\.
- \[33\] Aref Farhadipour, Jan Marquenie, Srikanth Madikeri, and Eleanor Chodroff\. Tidyvoice: A curated multilingual dataset for speaker verification derived from common voice\. arXiv preprint arXiv:2601\.16358, 2026\.
- \[34\] Carmine\-Emanuele Cella, Daniele Ghisi, Vincent Lostanlen, Fabien Lévy, Joshua Fineberg, and Yan Maresz\. Tinysol: an audio dataset of isolated musical notes, January 2020\.
- \[35\] George Tzanetakis and Perry Cook\. Musical genre classification of audio signals\. IEEE Transactions on Speech and Audio Processing, 10\(5\):293–302, 2002\.
- \[36\] California Institute of Technology \(Caltech\) and United States Geological Survey \(USGS\)\. Southern california seismic network, 1926\.
- \[37\] Southern California Earthquake Data Center\. Southern california earthquake data center, 2013\.
- \[38\] Jun He, Liqun Wang, Liu Liu, Jiao Feng, and Hao Wu\. Long document classification from local word glimpses via recurrent attention learning\. IEEE Access, 7:40707–40718, 2019\.
- \[39\] Yann LeCun, Corinna Cortes, Chris Burges, et al\. Mnist handwritten digit database, 2010\.
- \[40\] Han Xiao, Kashif Rasul, and Roland Vollgraf\. Fashion\-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017\.
- \[41\] Zalando Research\. Fashion\-mnist github repository and benchmark leaderboard\. [https://github\.com/zalandoresearch/fashion\-mnist](https://github.com/zalandoresearch/fashion-mnist), 2017\. Accessed: 2024\-05\-22\.
- \[42\] Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni\. Medmnist v2\-a large\-scale lightweight benchmark for 2d and 3d biomedical image classification\. Scientific Data, 10\(1\):41, 2023\.
- \[43\] Jordan J\. Bird and Ahmad Lotfi\. Cifake: Image classification and explainable identification of ai\-generated synthetic images, 2023\.
- \[44\] Pablo Delgado\-Rodriguez, Rodrigo Morales Sánchez, Elouan Rouméas\-Noël, François Paris, and Arrate Munoz\-Barrutia\. Automatic classification of normal and abnormal cell division using deep learning\. Scientific Reports, 14\(1\):14241, 2024\.
- \[45\] Pablo Delgado, Nicolas Gaggion, Lucas Mansilla, Diego H\. Milone, and Enzo Ferrante\. Mitosis Classification, June 2023\.
- \[46\] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt\. Openclip, July 2021\.
- \[47\] Aref Farhadipour, Jan Marquenie, Srikanth Madikeri, Teodora Vukovic, Volker Dellwo, Kathy Reid, Francis M\. Tyers, Ingo Siegert, and Eleanor Chodroff\. Tidyvoice 2026 challenge evaluation plan, 2026\.
- \[48\] Andrada Olteanu\. Gtzan dataset \- music genre classification, 2020\.
- \[49\] Alexander Krull, Tim\-Oliver Buchholz, and Florian Jug\. Noise2void\-learning denoising from single noisy images\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2129–2137, 2019\.
- \[50\] Claude E\. Shannon\. Communication in the presence of noise\. Proceedings of the IRE, 37\(1\):10–21, 1949\.
- \[51\] Richard Zhang\. Making convolutional networks shift\-invariant again\. In International conference on machine learning, pages 7324–7334\. PMLR, 2019\.
- \[52\] Hanqing Gu, Lisheng Su, Yuxia Wang, Weifeng Zhang, and Chuan Ran\. Efficient channel\-temporal attention for boosting rf fingerprinting\. IEEE Open Journal of Signal Processing, 5:478–492, 2024\.
- \[53\] Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, and Zhuang Liu\. Transformers without normalization\. In Proceedings of the computer vision and pattern recognition conference, pages 14901–14911, 2025\.
- \[54\] Diederik P Kingma and Jimmy Ba\. Adam: A method for stochastic optimization\. In International Conference on Learning Representations \(ICLR\), 2015\.
- \[55\] Tri Dao, Daniel Y\. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré\. FlashAttention: Fast and Memory\-Efficient Exact Attention with IO\-Awareness\. arXiv preprint arXiv:2205\.14135, 2022\.
- \[56\] F\. Pedregosa et al\. Scikit\-learn: Machine learning in Python\. Journal of Machine Learning Research, 12:2825–2830, 2011\.
- \[57\] John Platt\. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods\. In Advances in large margin classifiers, volume 10, pages 61–74\. Cambridge, MA, 1999\.
- \[58\] James Lyons et al\. jameslyons/python\_speech\_features: release v0\.6\.1, January 2020\.
- \[59\] Linus Ericsson, Henry Gouk, and Timothy M Hospedales\. How well do self\-supervised models transfer? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5414–5423, 2021\.

## Appendix A: Encoder Architecture

### A\.1 PlanFormer Encoder Architecture

Overview: The PlanFormer encoder is designed to embed signal\-theoretic principles through dual\-domain processing that operates simultaneously in time and frequency domains\. This appendix provides complete architectural specifications for reproducibility\.

### A\.2 High\-Level Architecture

The encoder consists of the following stages:

1. Complex\-valued preprocessing: Convert input to IQ representation
2. Dual\-domain tokenization: Parallel processing in time and frequency
3. Parseval Transformer blocks: Physics\-informed attention mechanisms that we term "Focus", after our dynamic attention\-sharpening machinery
4. Attentional pooling: Fixed\-size latent representations per domain via pooling over variable token sequence lengths

Key architectural parameters:

- Input length: 5120 total samples \(minimum\), 1024 samples per window, 5 windows total
- Embedding dimension: 128 after IQ interleaving for complex tokenization
- Number of transformer blocks: 1
- Number of attention heads: 8
- Time\-domain latent size: 128
- Frequency\-domain latent size: 128
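
For reference, the listed hyperparameters can be gathered in a small configuration object\. The class and field names below are hypothetical, and the derived per\-head dimension assumes the embedding is split evenly across heads, which is the usual multi\-head convention but is not stated in the text\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for the hyperparameters listed above."""
    input_length: int = 5120     # minimum total samples
    window_length: int = 1024    # samples per window
    embed_dim: int = 128         # after IQ interleaving
    num_blocks: int = 1
    num_heads: int = 8
    time_latent_dim: int = 128
    freq_latent_dim: int = 128

    @property
    def num_windows(self) -> int:
        return self.input_length // self.window_length

    @property
    def head_dim(self) -> int:
        # Assumption: embedding split evenly across attention heads.
        return self.embed_dim // self.num_heads


cfg = EncoderConfig()
```

With the defaults, `cfg.num_windows` reproduces the 5 windows quoted in the list\.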

### A\.3 Complex\-Valued Preprocessing

#### A\.3\.1 Real\-Valued Signal Conversion

For real\-valued input signals \(speech, images, seismology\), we generate analytic representations via Hilbert transform:

$$
x_{\text{analytic}}(t) = x(t) + j \cdot H[x(t)]
$$

where $H[\cdot]$ denotes the Hilbert transform\. This lets our architecture, originally developed for complex\-valued domains, extend to real\-valued domains\. It also exposes the instantaneous frequency and phase information of real\-valued signals\.
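
A minimal sketch of this conversion, using the standard FFT construction of the analytic signal \(zero the negative frequencies, double the positive ones\)\. The function names are hypothetical and the snippet is illustrative rather than the exact preprocessing code\.

```python
import numpy as np


def analytic_signal(x: np.ndarray) -> np.ndarray:
    """Return x + j*H[x] for a real 1D signal via the one-sided-spectrum method."""
    n = len(x)
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0                      # keep DC as is
    if n % 2 == 0:
        gain[n // 2] = 1.0             # keep Nyquist as is
        gain[1 : n // 2] = 2.0         # double positive frequencies
    else:
        gain[1 : (n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)  # negative frequencies are zeroed


def instantaneous_frequency(z: np.ndarray, fs: float) -> np.ndarray:
    """Instantaneous frequency in Hz from the unwrapped phase of an analytic signal."""
    return np.diff(np.unwrap(np.angle(z))) * fs / (2 * np.pi)


fs = 256.0
t = np.arange(256) / fs
z = analytic_signal(np.cos(2 * np.pi * 8 * t))   # 8 Hz cosine
inst_freq = instantaneous_frequency(z, fs)
```

For the 8 Hz cosine, the real part of `z` is the input, the imaginary part is the matching sine, and `inst_freq` sits at 8 Hz, which is the instantaneous\-frequency information the text refers to\.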

#### A\.3\.2 IQ Representation

Signals are represented as interleaved real\-imaginary \(IQ\) pairs:

$$
x_{IQ} = [\mathrm{Re}(x[0]), \mathrm{Im}(x[0]), \mathrm{Re}(x[1]), \mathrm{Im}(x[1]), \ldots, \mathrm{Re}(x[N-1]), \mathrm{Im}(x[N-1])]
$$

This preserves phase\-magnitude relationships while leveraging optimized real\-valued computation\.
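
The interleaved layout and its inverse can be written in a few lines\. The function names are hypothetical; the last axis is taken to be the sequence dimension\.

```python
import numpy as np


def interleave_iq(x: np.ndarray) -> np.ndarray:
    """Complex [..., N] -> real [..., 2N] ordered Re x[0], Im x[0], Re x[1], Im x[1], ..."""
    return np.stack([x.real, x.imag], axis=-1).reshape(*x.shape[:-1], -1)


def deinterleave_iq(x_iq: np.ndarray) -> np.ndarray:
    """Real [..., 2N] interleaved pairs -> complex [..., N]."""
    pairs = x_iq.reshape(*x_iq.shape[:-1], -1, 2)
    return pairs[..., 0] + 1j * pairs[..., 1]
```

The two functions are exact inverses, so a network can switch between the real interleaved view and the explicit complex view without losing information\.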

Rationale: Complex\-valued operations are essential for frequency\-domain processing and phase\-aware learning, but most deep learning frameworks optimize for real\-valued tensors\. The standard approach treats real and imaginary components as separate channels in the input space for convolutional processing\. However, this construction overlooks a critical issue: while I and Q components retain their physical relationship in the input space, this relationship breaks in deeper layers where the channel dimension represents learned feature maps rather than physical components\.

Consider the standard approach: in deeper layers, tensors have shape \[C, N\] \(channels × sequence length\) or \[C, H, W\] for images, where C represents learned features, not physical I/Q pairs\. Operations along the channel dimension \(e\.g\., batch normalization, channel\-wise attention\) treat all channels equivalently, destroying the I\-Q coupling that is fundamental to complex\-valued signals\.

We instead propose to interleave I and Q components along the sequence dimension: $[I_0, Q_0, I_1, Q_1, \ldots, I_{N-1}, Q_{N-1}]$\. This ensures that as we progress deeper into the network, the physical relationship between components is always present and maintained: adjacent elements in the sequence dimension are always an I\-Q pair, regardless of how many feature channels exist\. Operations that respect local structure \(e\.g\., convolutions with small kernels, local attention\) naturally preserve I\-Q coupling\.

Furthermore, this construction enables seamless transitions between real and complex\-valued representations via de\-interleaving and re\-interleaving at strategic points in the processing chain\. For example, during frequency\-domain pooling, we de\-interleave to obtain explicit complex values, apply FFT, perform pooling on the complex\-valued spectrum \(respecting its Hermitian symmetry for real signals\), and re\-interleave for subsequent processing\. This is essential for learning equivariant symmetries to transformations like frequency translation, where respecting the asymmetric structure of complex\-valued spectra is critical\.
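
The de\-interleave, FFT, pool, re\-interleave sequence can be sketched as follows\. The pooling operator is not specified in this passage, so averaging adjacent bins is an assumption, as are the function name and the use of `fftshift` to keep negative and positive frequencies in separate, ordered halves\.

```python
import numpy as np


def pool_spectrum(x_iq: np.ndarray, factor: int) -> np.ndarray:
    """Pool an interleaved 1D IQ sequence in the frequency domain.

    Assumes the number of complex samples is divisible by `factor`.
    """
    pairs = x_iq.reshape(-1, 2)
    x = pairs[:, 0] + 1j * pairs[:, 1]                    # de-interleave
    spectrum = np.fft.fftshift(np.fft.fft(x))             # two-sided complex spectrum
    pooled = spectrum.reshape(-1, factor).mean(axis=1)    # assumed: mean over adjacent bins
    return np.stack([pooled.real, pooled.imag], axis=-1).reshape(-1)  # re-interleave
```

Because the spectrum is kept two\-sided, a tone at a positive frequency and its conjugate at the matching negative frequency land in different pooled bins, which is the asymmetry that frequency\-translation equivariance relies on\.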

Computational Benefits: This approach also halves the sequence length in token space \(N/2 complex samples instead of N real samples\), reducing attention’s quadratic cost from O\(N²\) to O\(\(N/2\)²\) = O\(N²/4\), while doubling the embedding dimension\. For typical sequence lengths \(N \> 1000\) and embedding dimensions \(d < 512\), this trade\-off is favorable\.
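
A back\-of\-envelope check of this trade\-off, under a standard dense\-attention cost model that is an assumption here rather than something stated in the text\. Let $C(N, d)$ count multiply\-accumulates in one attention layer with sequence length $N$ and embedding dimension $d$, keeping only the score and weighted\-sum terms and the four linear projections:

$$
C(N, d) \approx 2N^2 d + 4N d^2
$$

Halving the sequence length and doubling the embedding dimension gives

$$
C(N/2,\, 2d) \approx 2 \left(\tfrac{N}{2}\right)^2 (2d) + 4 \left(\tfrac{N}{2}\right) (2d)^2 = N^2 d + 8 N d^2
$$

so that

$$
C(N/2,\, 2d) < C(N, d) \iff N^2 d > 4 N d^2 \iff N > 4d
$$

Under this model the quadratic term is halved while the projection term doubles, so the interleaved layout pays off for long sequences and modest embedding dimensions\. The exact break\-even point depends on the cost model and on which $d$ is counted as the baseline\.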

### A\.4 Convolutional Tokenization

Our encoder architecture draws heavily on \[[25](https://arxiv.org/html/2606.04106#bib.bib25)\]: a convolutional tokenizer learns tokens that a transformer block then processes at a compressed sequence length, followed by linear attention pooling over the sequence to produce the final latent\. We build on this starting point and combine it with ideas from \[[18](https://arxiv.org/html/2606.04106#bib.bib18)\] to unify time\-frequency symmetry in our final architecture\. We extend these works in the following ways\.

#### A\.4\.1 Windowing and Dual\-Domain Processing

Input signals are partitioned into non\-overlapping windows of size W:

$$
\text{Windows: } \big[\, x[0:W],\; x[W:2W],\; \ldots,\; x[(N_w-1)W : N_w \cdot W] \,\big]
$$

where $N_w = [N/W]$ is the number of windows\.

Each window is processed in both time and frequency domains:

- Time domain: Direct processing of windowed signal
- Frequency domain: FFT of windowed signal
- Window size: 1024 complex samples \(2048 interleaved real\-valued samples\)
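
A minimal sketch of the windowing and dual\-domain views\. The function name is hypothetical, and dropping any trailing partial window is an assumption, since the text does not say how a remainder is handled\.

```python
import numpy as np


def dual_domain_windows(x: np.ndarray, window: int = 1024):
    """Split a complex 1D signal into non-overlapping windows.

    Returns (time_view, freq_view), each of shape [num_windows, window].
    """
    n_windows = len(x) // window                 # assumption: trailing remainder dropped
    time_view = x[: n_windows * window].reshape(n_windows, window)
    freq_view = np.fft.fft(time_view, axis=-1)   # per-window spectrum
    return time_view, freq_view
```

For a 5120\-sample input this yields 5 windows of 1024 complex samples in each domain\. By Parseval's theorem the two views of a window carry the same energy, with `sum(|x|^2)` equal to `sum(|X|^2) / W` under NumPy's unnormalized FFT convention\.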

Rationale:Time\-varying phenomena are ubiquitous in time series analysis—whether intentional \(frequency hopping, chirps, transient events\) or unintentional \(time\-varying noise, non\-stationary channels\)\. Effective representation learning requires balancing local and global processing to capture both instantaneous spectral characteristics and long\-range dependencies\.

Why Windowed Processing?Computing the FFT over the entire sequence produces a global frequency representation that averages spectral content across time, obscuring time\-varying behavior\. For example, a signal that shifts from 1 kHz to 2 kHz midway through would show energy at both frequencies in the global FFT, but thewhenof this transition is lost\. Windowed processing— computing local spectra over short time segments—preserves temporal localization of spectral features, enabling the network to learn representations that respect time\-varying dynamics\.
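
The 1 kHz to 2 kHz example can be reproduced numerically\. The sample rate and lengths below are illustrative choices, picked so that both tones fall on exact FFT bins; they are not values from the paper\.

```python
import numpy as np

fs = 8192.0                 # sample rate in Hz (illustrative)
n, window = 8192, 1024      # one second of signal, eight windows
t = np.arange(n) / fs
freq = np.where(t < 0.5, 1000.0, 2000.0)   # switch from 1 kHz to 2 kHz halfway
x = np.exp(2j * np.pi * freq * t)


def peak_hz(segment: np.ndarray) -> float:
    """Frequency of the strongest FFT bin in a segment."""
    spectrum = np.abs(np.fft.fft(segment))
    return float(np.fft.fftfreq(len(segment), d=1 / fs)[np.argmax(spectrum)])


# Global FFT: the two strongest bins, with no indication of ordering in time.
global_mag = np.abs(np.fft.fft(x))
global_freqs = np.fft.fftfreq(n, d=1 / fs)
global_peaks = sorted(global_freqs[np.argsort(global_mag)[-2:]].tolist())

# Windowed FFT: one dominant frequency per window, in temporal order.
window_peaks = [peak_hz(w) for w in x.reshape(-1, window)]
```

`global_peaks` contains both tones but says nothing about when each occurred, whereas `window_peaks` lists 1 kHz for the first four windows and 2 kHz for the last four\.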

Computational and Architectural Benefits: Windowed processing provides several advantages:

1. Translational equivariance: Convolution's translational equivariance property minimizes the need for overlapping windows. A feature learned at position $t$ in one window transfers to position $t$ in other windows, reducing redundancy and parameter requirements.
2. Computational efficiency: Attention mechanisms scale quadratically with sequence length ($O(N^2)$). By processing windows of length $W$, attention cost per window is $O(W^2)$, and total cost across $N_w = N/W$ windows is $O(N_w \cdot W^2) = O(N \cdot W)$, which is linear in sequence length for fixed $W$. For $W \ll N$, this provides substantial savings.
3. Reconstruction efficiency: In the decoder, operating on windowed representations enables localized reconstruction with manageable memory footprints, particularly important for long sequences ($N > 10{,}000$).
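As a quick arithmetic check of the cost argument in item 2, the sketch below counts pairwise attention scores for global and windowed attention. The sequence length `n = 16384` is an illustrative assumption; `w = 1024` matches the window size stated above.

```python
def attention_cost(n, w=None):
    """Number of pairwise scores: N^2 globally, or N_w * W^2 = N * W when windowed."""
    if w is None:
        return n * n
    n_w = n // w              # number of non-overlapping windows
    return n_w * w * w

n, w = 16384, 1024            # n is an assumed example length
full = attention_cost(n)      # global attention over the whole sequence
windowed = attention_cost(n, w)
print(full, windowed, full // windowed)
```

The ratio between the two costs is $N/W$, so the saving grows with sequence length when the window size is held fixed.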

Addressing Long-Range Dependencies: The primary drawback of windowed processing is that convolutions operate in isolation within each window, potentially missing long-range dependencies such as phase coherence across extended temporal spans (e.g., carrier phase in RF signals, pitch contours in speech). We address this through two complementary mechanisms:

1. Causal Cross-Window Focus (Section [A.4.2.3](https://arxiv.org/html/2606.04106#A1.SS4.SSS2.P3)): Explicit attention between consecutive windows' tokenized representations models inter-window dependencies. Domain-specific positional encodings enable the network to learn causal phase relationships (time domain) and time-varying spectral evolution (frequency domain).
2. Parseval Transformer (Section [A.6](https://arxiv.org/html/2606.04106#A1.SS6)): After tokenization, all window tokens are processed jointly through transformer blocks, enabling global attention across the entire sequence. This captures long-range dependencies that span multiple windows while maintaining the benefits of localized spectral analysis.

In summary, windowed processing provides the best of both worlds: local spectral analysis for time\-varying phenomena and global dependency modeling through cross\-window attention and transformers\. This design is essential for learning representations that respect both instantaneous signal characteristics and extended temporal structure\.

Relation to Short-Time Fourier Transform (STFT): Our windowed processing is conceptually similar to STFT, which computes local spectra over sliding windows to create time-frequency representations (spectrograms). However, unlike STFT, which uses fixed, hand-crafted windows (e.g., Hamming, Hann), our approach learns window-level representations through convolutional tokenization and transformer processing. This enables the network to adapt to signal-specific time-frequency characteristics rather than relying on predetermined window shapes.

#### A.4.2 Tokenization Components

![Refer to caption](https://arxiv.org/html/2606.04106v1/x1.png)

Figure 4: Full Convolutional Tokenizer Block Diagram

The full convolutional tokenizer and its constituent blocks are shown in Figure [4](https://arxiv.org/html/2606.04106#A1.F4). We describe the unique contributions of our work below.

##### A.4.2.1 Blindspot Convolution

The first convolutional layer uses a blindspot kernel \[[49](https://arxiv.org/html/2606.04106#bib.bib49)\] (center element zeroed) for processing additive noise in the input space:

$$K_{blindspot}[i] = \begin{cases} K[i] & \text{if } i \neq \text{center} \\ 0 & \text{if } i = \text{center} \end{cases} \tag{1}$$

Input Layer Blindspot Convolution Parameters:

- Kernel size: 5
- Stride: 1
- Output channels: 16

Rationale: Noise is ubiquitous across physical signals, but additive noise sources are typically locally uncorrelated relative to the structured signal of interest. We leverage this inductive bias through a blindspot kernel: the zeroed center element forces the network to infer signal structure from context rather than learning identity mappings, creating an implicit denoising prior. This approach, originally developed for self-supervised denoising in images \[[49](https://arxiv.org/html/2606.04106#bib.bib49)\], is extensible to all signal types.
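The masking in Eq. (1) can be illustrated with a single-channel NumPy sketch. This is an assumption-laden toy, not the paper's layer: the function name `blindspot_conv1d` is hypothetical, the kernel is all ones rather than learned, and the 16 output channels are omitted. Only the kernel size of 5 and stride of 1 follow the parameters listed above.

```python
import numpy as np

def blindspot_conv1d(x, kernel):
    """'Same'-padded 1D correlation with the kernel's center tap forced to zero."""
    k = np.array(kernel, dtype=float)
    k[len(k) // 2] = 0.0                 # blindspot: mask the center element
    pad = len(k) // 2
    xp = np.pad(x, pad)                  # zero padding keeps the output length equal to len(x)
    return np.array([np.dot(xp[i:i + len(k)], k) for i in range(len(x))])

x = np.zeros(9)
x[4] = 1.0                               # unit impulse at position 4
y = blindspot_conv1d(x, np.ones(5))
print(y)
```

The output at position 4 is zero even though the input there is one, so the layer can only predict a sample from its neighbors, never from the sample itself.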

Addressing the Capacity Trade-off: A key drawback of blindspot kernels is reduced representational capacity compared to full kernels. We address this through our dual-branch training architecture (Section [3.4.1](https://arxiv.org/html/2606.04106#S3.SS4.SSS1)): the noise-augmented branch uses blindspot convolutions, the clean branch uses full-capacity kernels, and invariance losses encourage the blindspot branch to approximate full-kernel representations. This provides both denoising priors and full representational capacity.

![Refer to caption](https://arxiv.org/html/2606.04106v1/Figures/NoiseSinkConstruction.png)

Figure 5: Construction of the Noise Sink: (1) Noise Estimation, (2) Noise Subtraction, (3) Regularization via Decorrelation, (4) Dynamic Representation Power Compensation

##### A.4.2.2 Noise Sink Module

To complement the implicit denoising priors of Blindspot Convolution and Latent Coherent Integration (Section [3.4.1](https://arxiv.org/html/2606.04106#S3.SS4.SSS1)), we develop the Noise Sink Module to explicitly estimate and remove uncorrelated noise from the representation. Figure [5](https://arxiv.org/html/2606.04106#A1.F5) illustrates the four-stage construction: (1) noise estimation, (2) noise subtraction, (3) regularization via decorrelation, and (4) dynamic power compensation.

###### Architecture Overview:

The Noise Sink operates on intermediate representations $x \in \mathbb{R}^{B \times C \times L}$ (batch $\times$ channels $\times$ length) and produces cleaned representations $x_{clean} \in \mathbb{R}^{B \times C \times L}$:

Algorithm 1: Noise Sink Module

1. Input: $x \in \mathbb{R}^{B \times C \times L}$ (potentially noisy representation)
2. Noise Estimation: $n_{est} = \text{ResidualConv}(x)$
3. Noise Subtraction: $x_{sub} = x - n_{est}$
4. Decorrelation Loss: $L_{decorr} = \text{PearsonCorr}(n_{est}, x_{sub})$
5. Token Reshaping: $x_{tokens} = \text{ReshapeToToken}(x_{sub})$
6. Power Estimation: $P_{noise}, P_{signal} = \text{ConvPowerPooler}(n_{est}, x)$
7. Power Ratio: $r = P_{noise} / P_{signal}$
8. Affine Parameters: $\gamma, \beta = \text{AffineMLP}(r)$
9. Compensation: $x_{tokens,comp} = \gamma \cdot x_{tokens} + \beta$
10. Token Normalization: $x_{tokens,norm} = \text{RMSNorm}(x_{tokens,comp})$
11. Reshape Back: $x_{clean} = \text{ReshapeToChannels}(x_{tokens,norm})$
12. Output: $x_{clean} \in \mathbb{R}^{B \times C \times L}$, $n_{est}$, $L_{decorr}$, $P_{noise}$

###### Stage 1: Noise Estimation

We leverage two inductive biases: (1) additive noise is locally uncorrelated with structured signals, and (2) multi-layer networks with non-linear activations can approximate arbitrary functions. The noise estimator consists of two convolutional layers with a GELU activation:

$$n_{est} = \text{Conv}_2(\text{GELU}(\text{Conv}_1(x))) \tag{2}$$

| Layer | In Channels | Out Channels | Kernel | Stride | Padding |
|---|---|---|---|---|---|
| Conv1 | $C$ | $C/4$ | 5 | 1 | same |
| Activation | GELU | | | | |
| Conv2 | $C/4$ | $C$ | 5 | 1 | same |

Table 2: Noise Estimator Architecture. Bias is disabled for all layers.

Rationale: The bottleneck structure ($C \rightarrow C/4 \rightarrow C$) forces the network to learn a compressed representation of noise patterns, preventing it from learning identity mappings that would simply copy the input.

###### Stage 2: Noise Subtraction

The estimated noise is subtracted from the input:

$$x_{sub} = x - n_{est} \tag{3}$$

###### Stage 3: Regularization via Pearson Decorrelation

To ensure the Noise Sink removes noise rather than signal, we compute the Pearson correlation coefficient between the estimated noise and cleaned signal:

$$\begin{aligned}
\text{sunk}_{flat} &= n_{est}.\text{flatten}(1) \in \mathbb{R}^{B \times (C \cdot L)} && \text{(4)} \\
\text{post}_{flat} &= x_{sub}.\text{flatten}(1) \in \mathbb{R}^{B \times (C \cdot L)} && \text{(5)} \\
\text{sunk}_{centered} &= \text{sunk}_{flat} - \mu_{sunk} && \text{(6)} \\
\text{post}_{centered} &= \text{post}_{flat} - \mu_{post} && \text{(7)} \\
\text{sunk}_{norm} &= \text{sunk}_{centered} / \sigma_{sunk} && \text{(8)} \\
\text{post}_{norm} &= \text{post}_{centered} / \sigma_{post} && \text{(9)} \\
\rho &= \frac{1}{N} \sum_{i=1}^{N} \text{sunk}_{norm}[i] \cdot \text{post}_{norm}[i] && \text{(10)} \\
L_{decorr} &= |\rho| && \text{(11)}
\end{aligned}$$

where $\mu$ and $\sigma$ are computed per sample over the flattened dimension, and $N = C \cdot L$ is the total number of elements in the flattened representation. This normalization ensures $\rho \in [-1, 1]$, representing a valid Pearson correlation coefficient.

Rationale: If $n_{est}$ is truly noise, it should be uncorrelated with the cleaned signal $x_{sub}$ (Pearson $\rho \approx 0$). High correlation indicates the sink is removing signal content.
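Equations (4) to (11) translate almost line by line into array code. The sketch below is a NumPy illustration under two assumptions not stated in the text: a small `eps` is added to the standard deviations for numerical safety, and the loss is returned per sample because the reduction over the batch is not specified.

```python
import numpy as np

def decorrelation_loss(n_est, x_sub, eps=1e-8):
    """|Pearson rho| between estimated noise and cleaned signal, one value per sample."""
    sunk = n_est.reshape(n_est.shape[0], -1)    # (B, C*L), Eq. (4)
    post = x_sub.reshape(x_sub.shape[0], -1)    # (B, C*L), Eq. (5)
    # Center and scale per sample over the flattened dimension, Eqs. (6)-(9).
    sunk = (sunk - sunk.mean(axis=1, keepdims=True)) / (sunk.std(axis=1, keepdims=True) + eps)
    post = (post - post.mean(axis=1, keepdims=True)) / (post.std(axis=1, keepdims=True) + eps)
    rho = (sunk * post).mean(axis=1)            # Eq. (10)
    return np.abs(rho)                          # Eq. (11)

noise = np.array([[[1.0, -1.0], [1.0, -1.0]]])     # shape (B=1, C=2, L=2)
signal = np.array([[[1.0, 1.0], [-1.0, -1.0]]])    # orthogonal to `noise`
print(decorrelation_loss(noise, signal))           # close to 0: uncorrelated
print(decorrelation_loss(noise, noise))            # close to 1: the sink removed signal
```

Taking the absolute value penalizes both positive and negative correlation, so the estimator cannot satisfy the loss by subtracting an inverted copy of the signal.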

###### Stage 4: Dynamic Power Compensation

A critical challenge arises in low or negative\-SNR regimes: after noise subtraction, signal power may be severely attenuated\. We introduce dynamic power compensation based on the ratio of sunk noise power to original signal power\.

Token Reshaping:

We convert the convolutional representation to a token representation with a fixed token width of 2, where $L/2$ is the number of tokens and $C \cdot 2$ is the token dimension. This groups consecutive IQ pairs per channel into tokens, preserving the complex-valued structure (each token contains one complete complex sample across all channels).
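One way to realize this reshaping is sketched below in NumPy. The exact memory layout is an assumption: here each token concatenates the (I, Q) pair of every channel in channel order, which is consistent with the description but not confirmed by it. The function names mirror `ReshapeToToken` and `ReshapeToChannels` from Algorithm 1.

```python
import numpy as np

def reshape_to_token(x):
    """(B, C, L) -> (B, L/2, 2C): each token holds one IQ pair from every channel."""
    b, c, l = x.shape
    return x.reshape(b, c, l // 2, 2).transpose(0, 2, 1, 3).reshape(b, l // 2, 2 * c)

def reshape_to_channels(t, c):
    """Inverse of reshape_to_token: (B, L/2, 2C) -> (B, C, L)."""
    b, n, _ = t.shape
    return t.reshape(b, n, c, 2).transpose(0, 2, 1, 3).reshape(b, c, 2 * n)

x = np.arange(2 * 3 * 8).reshape(2, 3, 8)   # B=2, C=3, L=8
tokens = reshape_to_token(x)
print(tokens.shape)                          # (2, 4, 6)
print(tokens[0, 0])                          # first IQ pair of channels 0, 1, 2
```

Because the two functions are exact inverses, the round trip through token space loses no information.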

Convolutional Power Pooling:

Rather than computing power directly, we use an attention-based convolutional pooler that learns to weight different time steps:

Attention Logits:

$$a = \text{Conv}_{1 \times 1}(x) \in \mathbb{R}^{B \times 1 \times L/2}$$

Attention Weights (over sequence dimension):

$$\alpha = \text{softmax}(a, dim=-1) \in \mathbb{R}^{B \times 1 \times L/2}$$

Weighted Pooling:

$$\text{pooled} = \text{bmm}(\alpha, x^T) \in \mathbb{R}^{B \times 1 \times 2C} \rightarrow \mathbb{R}^{B \times 2C}$$

where $x^T$ denotes x.transpose(1,2) $\in \mathbb{R}^{B \times L/2 \times 2C}$, and bmm performs batch matrix multiplication: $[B \times 1 \times L/2] \, @ \, [B \times L/2 \times 2C] \rightarrow [B \times 1 \times 2C]$.

ConvPowerPooler Architecture:

- Attention Conv: 1×1 convolution, $C \rightarrow 1$
- Softmax: Over time dimension
- Weighted pooling: Batch matrix multiplication
- Power: Sum of squares over channels

Rationale: The learned attention weights allow the network to focus on informative time steps when estimating power, rather than uniform averaging. This is particularly useful for signals with time-varying SNR or transient events.

Power Ratio Computation:

We compute power for both the estimated noise and original signal:

$$\begin{aligned}
P_{noise} &= \text{ConvPowerPooler}(n_{est}) \in \mathbb{R}^{B \times L/2} && \text{(12)} \\
P_{signal} &= \text{ConvPowerPooler}(x) \in \mathbb{R}^{B \times L/2} && \text{(13)} \\
r &= \frac{P_{noise}}{P_{signal}} \in \mathbb{R}^{B \times L/2} && \text{(14)}
\end{aligned}$$

The ratio is clamped to $[0, 2]$ to handle extreme cases:

$$r = \text{clamp}(r, 0.0, 2.0) \tag{15}$$
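The pooling and ratio steps can be sketched in NumPy as follows. Several details are assumptions: the 1×1 convolution is written as a bias-free weight vector `w`, the same pooler weights are shared between the noise and signal paths, and an `eps` guards the division. The sketch follows the pooling equations literally, so it pools over the whole sequence and yields one power value per batch element; how the per-token shape stated in Eqs. (12) to (14) is obtained is not specified in the text and is not reproduced here.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv_power_pooler(x, w):
    """x: (B, D, T) features; w: (D,) weights of a bias-free 1x1 conv mapping D -> 1."""
    logits = np.einsum("d,bdt->bt", w, x)[:, None, :]          # (B, 1, T)
    alpha = softmax(logits, axis=-1)                            # attention over time
    pooled = np.matmul(alpha, x.transpose(0, 2, 1))[:, 0, :]    # (B, D), the bmm step
    return np.sum(pooled ** 2, axis=-1)                         # sum of squares over channels

def power_ratio(n_est, x, w, eps=1e-8):
    """Noise-to-signal power ratio, clamped to [0, 2] as in Eq. (15)."""
    r = conv_power_pooler(n_est, w) / (conv_power_pooler(x, w) + eps)
    return np.clip(r, 0.0, 2.0)

x = np.ones((1, 4, 8))
w = np.zeros(4)                        # zero logits give uniform attention
print(conv_power_pooler(x, w))         # pooled vector is all ones, so power is 4
print(power_ratio(0.5 * x, x, w))      # 0.25
print(power_ratio(3.0 * x, x, w))      # 9 before clamping, 2 after
```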
Affine MLP \(Hypernetwork\):

The power ratio (a scalar per token) is fed to an MLP that predicts affine parameters for all tokens:

$$[\gamma; \beta] = \text{AffineMLP}(r) \in \mathbb{R}^{B \times 2 \cdot (L/2)} \tag{16}$$

| Layer | Input Dim | Output Dim |
|---|---|---|
| Linear1 | 1 | $C \cdot 2 \cdot 2$ |
| ReLU | - | - |
| Linear2 | $C \cdot 2 \cdot 2$ | $2 \cdot (C \cdot 2 \cdot 2)$ |

Table 3: Affine MLP Architecture. Input is a scalar power ratio per token; output is concatenated $[\gamma; \beta]$, where $\gamma, \beta \in \mathbb{R}^{L/2}$ are broadcast over the token dimension to apply the same scale and shift per element/token across all channels.

The affine parameters are applied to each token individually and broadcast across all channels per token. This assumes that noise is localized at the same elements/tokens in a sequence across all feature channels. For example, noise at element 0 of feature map 1 will also appear at element 0 of feature map 2.

Token Normalization:

After compensation, tokens are normalized using RMS normalization:

$$x_{tokens,norm} = \text{RMSNorm}(x_{tokens,comp}) \tag{17}$$
Reshape Back to Convolutional:

Finally, we reverse the tokenization to convert the signals from $\mathbb{R}^{B \times L/2 \times 2C}$ back to $\mathbb{R}^{B \times C \times L}$ for subsequent convolutional processing.

Rationale: The token-level conditioning provides a localized compensation strategy based on the local SNR of the token. In low-SNR tokens/samples (high $r$), the network learns to amplify specific tokens rather than all of them (useful for time-varying noise); in high-SNR samples/tokens (low $r$), it preserves the representation with $\gamma \approx 1$. The RMS normalization after compensation ensures stable magnitudes regardless of the affine transformation.
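A minimal NumPy sketch of the compensation and normalization steps is given below. It assumes an RMSNorm without a learnable gain and takes `gamma` and `beta` as given inputs rather than predicting them with the Affine MLP; both simplifications are for illustration only.

```python
import numpy as np

def rms_norm(t, eps=1e-8):
    """Divide each token by its root-mean-square over the token dimension."""
    return t / np.sqrt(np.mean(t ** 2, axis=-1, keepdims=True) + eps)

def compensate(tokens, gamma, beta):
    """tokens: (B, T, D); gamma, beta: (B, T), one scale and shift per token."""
    scaled = gamma[..., None] * tokens + beta[..., None]   # broadcast across the token dimension
    return rms_norm(scaled)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 4, 6))
out = compensate(tokens, 2.0 * np.ones((2, 4)), np.zeros((2, 4)))
print(np.sqrt(np.mean(out ** 2, axis=-1)))   # every token has RMS close to 1
```

With `beta` at zero, the normalization cancels any pure scaling, so in this simplified setting a uniform gain alone has no effect on the output. Which components of the affine step carry the benefit in the full model is not something this sketch can show.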

###### Complete Noise Sink Summary:

The Noise Sink provides explicit, regularized denoising with token\-adaptive compensation\. Key design principles:

- Explicit estimation: Convolutional bottleneck approximates noise functions
- Pearson decorrelation: Ensures noise (not signal) is removed
- Learned power pooling: Attention-weighted power estimation
- Token-level compensation: Per-token affine transformation based on SNR ratio
- Token normalization: RMS normalization stabilizes representations
- Hierarchical: Multiple sinks throughout encoder provide progressive denoising

###### Additional Regularization Losses:

Beyond the decorrelation loss computed within the module, the Noise Sink is further regularized through hierarchical losses computed across all sinks in the encoder:

(a) Hierarchical Power Matching: Total estimated noise power across all sinks matches injected noise power:

$$L_{power} = \left| \sum_{i=1}^{N_{sinks}} P_{noise}^{(i)} - P_{injected} \right| \tag{18}$$

(b) SNR Regression: Reconstructed signal's SNR matches target SNR:

$$\text{SNR}_{est} = \log\left(\|x_{recon}\|^2\right) - \log\left(\sum_{i=1}^{N_{sinks}} P_{noise}^{(i)}\right) \tag{19}$$

$$L_{SNR} = \text{MSE}_{focal}(\text{SNR}_{est}, \text{SNR}_{GT}) \tag{20}$$
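Equations (18) and (19) can be written directly as scalar functions. The sketch below assumes natural logarithms, since the base is not stated, and omits $\text{MSE}_{focal}$ from Eq. (20) because its definition is not given in this section. The function names are hypothetical.

```python
import numpy as np

def power_matching_loss(sink_powers, p_injected):
    """Eq. (18): absolute gap between summed sink noise power and injected noise power."""
    return abs(float(np.sum(sink_powers)) - p_injected)

def estimated_snr(x_recon, sink_powers):
    """Eq. (19): log signal energy minus log total estimated noise power."""
    return float(np.log(np.sum(x_recon ** 2)) - np.log(np.sum(sink_powers)))

sink_powers = np.array([1.0, 3.0])       # noise power reported by two sinks
x_recon = np.ones(4)                     # reconstructed signal with energy 4
print(power_matching_loss(sink_powers, p_injected=4.0))
print(estimated_snr(x_recon, sink_powers))
```

In this toy case the sinks account for exactly the injected power and the signal energy equals the total noise power, so both quantities are zero.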
![Refer to caption](https://arxiv.org/html/2606.04106v1/Figures/CausalCrossWindowFocusSaveWindow.png)

Figure 6: Causal Cross Window Focus Block

##### A.4.2.3 Causal Cross-Window Focus

The Causal Cross\-Window Focus mechanism addresses a fundamental challenge in windowed sequence processing: modeling long\-range dependencies across temporal boundaries while maintaining causal constraints\. Our approach leverages domain\-specific positional encodings and cross\-attention between consecutive windows to capture phase relationships that would otherwise be lost in naive windowed processing\.

Figure [6](https://arxiv.org/html/2606.04106#A1.F6) illustrates the complete data flow through the mechanism. A key contribution lies in our adaptive treatment of time-domain versus frequency-domain representations, where positional encoding strategies are tailored to preserve the physical meaning of tokens in each domain.

###### Main Algorithm:

Algorithm [2](https://arxiv.org/html/2606.04106#alg2) presents the complete Causal Cross-Window Focus procedure. The mechanism operates on consecutive windows of a signal, using the previous window to provide contextual information for processing the current window through a cross-attention mechanism.

Algorithm 2: Causal Cross-Window Focus (AR_attention)

Input: current window $\mathbf{x} \in \mathbb{R}^{B \times C \times L}$, previous window $\mathbf{x}_{\text{prev}} \in \mathbb{R}^{B \times C \times L}$, block index $b \in \{0, 1, \ldots, N_{\text{blocks}} - 1\}$, domain flag $\text{is\_freq} \in \{\text{True}, \text{False}\}$

Output: transformed representation $\mathbf{x}_{\text{out}} \in \mathbb{R}^{B \times C \times L}$, orthogonality regularization loss $\mathcal{L}_{\text{orth}} \in \mathbb{R}$

- Store residual connection: $\mathbf{x}_{\text{res}} \leftarrow \mathbf{x}$
- Domain transformation: if $\text{is\_freq} = \text{False}$ then
  - $\mathbf{x} \leftarrow \text{FFT}(\mathbf{x})$ (transform to frequency domain)
  - $\mathbf{x}_{\text{prev}} \leftarrow \text{FFT}(\mathbf{x}_{\text{prev}})$
- Step 1: Tokenization & Learned Spectral Compression
  - $\mathbf{T}, B, C, L \leftarrow \text{ReshapeToToken}(\mathbf{x})$, with $\mathbf{T} \in \mathbb{R}^{B \times L/2 \times 2C}$
  - $\mathbf{T}_{\text{prev}}, \_, \_, \_ \leftarrow \text{ReshapeToToken}(\mathbf{x}_{\text{prev}})$
  - $s \leftarrow \begin{cases} 16 & \text{if } b = 0 \\ 4 & \text{otherwise} \end{cases}$ (adaptive downsampling stride)
  - $\mathbf{T}_{\text{down}} \leftarrow \text{Conv1D}^{(b)}_{\text{down}}(\mathbf{T})$, with $\mathbf{T}_{\text{down}} \in \mathbb{R}^{B \times (L/2)/s \times 2C}$
  - $\mathbf{T}_{\text{prev,down}} \leftarrow \text{Conv1D}^{(b)}_{\text{down}}(\mathbf{T}_{\text{prev}})$
- Step 2: Domain-Specific Positional Encoding
  - if $\text{is\_freq} = \text{False}$ then $\mathbf{T}_{\text{down}}, \mathbf{T}_{\text{prev,down}} \leftarrow \text{TimeDomainPosEnc}(\mathbf{T}_{\text{down}}, \mathbf{T}_{\text{prev,down}}, B, C, L, b)$ (Algorithm [3](https://arxiv.org/html/2606.04106#alg3))
  - else $\mathbf{T}_{\text{down}}, \mathbf{T}_{\text{prev,down}} \leftarrow \text{FreqDomainPosEnc}(\mathbf{T}_{\text{down}}, \mathbf{T}_{\text{prev,down}}, b)$ (Algorithm [4](https://arxiv.org/html/2606.04106#alg4))
- Step 3: Multi-Head Orthogonal Cross-Focus Attention
  - $\mathbf{T}_{\text{out}}, \mathcal{L}_{\text{orth}} \leftarrow \text{MultiHeadFocus}^{(b)}(\mathbf{Q} = \mathbf{T}_{\text{down}}, \mathbf{K} = \mathbf{T}_{\text{prev,down}}, \mathbf{V} = \mathbf{T}_{\text{prev,down}})$
- Step 4: Spectral Upsampling & Reconstruction
  - if $\text{is\_freq} = \text{False}$ then $\mathbf{T}_{\text{out}} \leftarrow \text{FFT}(\mathbf{T}_{\text{out}})$
  - $\mathbf{T}_{\text{up}} \leftarrow \text{Conv1D}^{(b)}_{\text{up}}(\mathbf{T}_{\text{out}})$ (upsample to original token count)
  - $\mathbf{x}_{\text{out}} \leftarrow \text{ReshapeToChannels}(\mathbf{T}_{\text{up}}, B, C, L)$
  - if $\text{is\_freq} = \text{False}$ then $\mathbf{x}_{\text{out}} \leftarrow \text{IFFT}(\mathbf{x}_{\text{out}})$ (return to time domain)
- Residual connection: $\mathbf{x}_{\text{out}} \leftarrow \mathbf{x}_{\text{out}} + \mathbf{x}_{\text{res}}$
- return $\mathbf{x}_{\text{out}}, \mathcal{L}_{\text{orth}}$

###### Domain\-Specific Positional Encoding

A critical contribution in our approach is the use of domain\-specific positional encoding strategies\. The physical interpretation of sequence position differs fundamentally between time and frequency domains, necessitating distinct encoding approaches\.

###### Time Domain Positional Encoding

For time-domain processing, temporal continuity across window boundaries is paramount. Algorithm [3](https://arxiv.org/html/2606.04106#alg3) describes our causal concatenation strategy, which preserves phase relationships by encoding windows in their natural temporal order.

Algorithm 3: Time Domain Positional Encoding

Input: downsampled tokens $\mathbf{T}_{\text{curr}} \in \mathbb{R}^{B \times N \times D}$, $\mathbf{T}_{\text{prev}} \in \mathbb{R}^{B \times N \times D}$; block index $b$

Output: position-encoded tokens $\mathbf{T}'_{\text{curr}}, \mathbf{T}'_{\text{prev}}$

- Convert to time domain for causal concatenation:
  - $\mathbf{T}_{\text{curr}} \leftarrow \text{IFFT}(\mathbf{T}_{\text{curr}})$ (transform to time domain)
  - $\mathbf{T}_{\text{prev}} \leftarrow \text{IFFT}(\mathbf{T}_{\text{prev}})$
- Causal concatenation preserves temporal ordering:
  - $\mathbf{T}_{\text{concat}} \leftarrow \text{Concat}([\mathbf{T}_{\text{prev}}, \mathbf{T}_{\text{curr}}], \text{dim} = 1)$, with $\mathbf{T}_{\text{concat}} \in \mathbb{R}^{B \times 2N \times D}$
  - $\mathbf{PE} \leftarrow \text{SinusoidalPosEnc}(2N, D)$ (standard sinusoidal encoding)
  - $\mathbf{T}_{\text{concat}} \leftarrow \mathbf{T}_{\text{concat}} + \mathbf{PE}$
- Split and normalize:
  - $\mathbf{T}'_{\text{curr}} \leftarrow \text{RMSNorm}^{(b)}(\mathbf{T}_{\text{concat}}[:, N:, :])$ (current window tokens)
  - $\mathbf{T}'_{\text{prev}} \leftarrow \text{RMSNorm}^{(b)}(\mathbf{T}_{\text{concat}}[:, :N, :])$ (previous window tokens)
- return $\mathbf{T}'_{\text{curr}}, \mathbf{T}'_{\text{prev}}$

Rationale: In the time domain, the relative temporal position between tokens in consecutive windows carries critical phase information. By transforming frequency-domain representations back to time, concatenating causally (previous before current), and then applying positional encoding to the unified sequence, we ensure that the model can learn continuous phase relationships across window boundaries. This is essential for signals where phase coherence matters, such as in communications or audio processing.

###### Frequency Domain Positional Encoding

For frequency-domain processing, spectral bin relationships within each window are more important than temporal ordering across windows. Algorithm [4](https://arxiv.org/html/2606.04106#alg4) presents our independent encoding strategy.

Algorithm 4: Frequency Domain Positional Encoding

Input: downsampled tokens $\mathbf{T}_{\text{curr}} \in \mathbb{R}^{B \times N \times D}$, $\mathbf{T}_{\text{prev}} \in \mathbb{R}^{B \times N \times D}$; block index $b$

Output: position-encoded tokens $\mathbf{T}'_{\text{curr}}, \mathbf{T}'_{\text{prev}}$

- Independent encoding for spectral bin relationships:
  - $\mathbf{PE} \leftarrow \text{SinusoidalPosEnc}(N, D)$
  - $\mathbf{T}_{\text{curr}} \leftarrow \mathbf{T}_{\text{curr}} + \mathbf{PE}$
  - $\mathbf{T}_{\text{prev}} \leftarrow \mathbf{T}_{\text{prev}} + \mathbf{PE}$
- Normalize:
  - $\mathbf{T}'_{\text{curr}} \leftarrow \text{RMSNorm}^{(b)}(\mathbf{T}_{\text{curr}})$
  - $\mathbf{T}'_{\text{prev}} \leftarrow \text{RMSNorm}^{(b)}(\mathbf{T}_{\text{prev}})$
- return $\mathbf{T}'_{\text{curr}}, \mathbf{T}'_{\text{prev}}$

Rationale: In the frequency domain, each position corresponds to a spectral bin rather than a temporal instant. The relationship between corresponding bins across windows (e.g., how the 100 Hz component evolves) is more meaningful than the temporal ordering of windows. Independent positional encoding allows the cross-attention mechanism to learn spectral covariance patterns, that is, how energy in specific frequency bands correlates across time, without imposing artificial temporal structure on what is fundamentally a spectral representation.
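The difference between Algorithms 3 and 4 comes down to which positions each window receives. The NumPy sketch below contrasts the two under simplifying assumptions: tokens are taken to be in the appropriate domain already (the IFFT step of Algorithm 3 is omitted), the RMSNorm has no learnable gain and is not block-specific, and the standard Transformer sinusoidal encoding is used with an even token dimension.

```python
import numpy as np

def rms_norm(t, eps=1e-8):
    return t / np.sqrt(np.mean(t ** 2, axis=-1, keepdims=True) + eps)

def sinusoidal_pos_enc(n, d):
    """Standard sinusoidal encoding of shape (n, d); d is assumed even."""
    pos = np.arange(n)[:, None]
    i = np.arange(0, d, 2)[None, :]
    angles = pos / (10000 ** (i / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def time_domain_pos_enc(t_curr, t_prev):
    """Causal concatenation: previous window gets positions 0..N-1, current gets N..2N-1."""
    n, d = t_curr.shape[1], t_curr.shape[2]
    concat = np.concatenate([t_prev, t_curr], axis=1) + sinusoidal_pos_enc(2 * n, d)
    return rms_norm(concat[:, n:]), rms_norm(concat[:, :n])

def freq_domain_pos_enc(t_curr, t_prev):
    """Independent encoding: both windows share positions 0..N-1 (same spectral bins)."""
    pe = sinusoidal_pos_enc(t_curr.shape[1], t_curr.shape[2])
    return rms_norm(t_curr + pe), rms_norm(t_prev + pe)

zeros = np.zeros((1, 4, 6))              # identical dummy tokens for both windows
curr_t, prev_t = time_domain_pos_enc(zeros, zeros)
curr_f, prev_f = freq_domain_pos_enc(zeros, zeros)
print(np.allclose(curr_t, prev_t))       # False: windows are distinguished by temporal order
print(np.allclose(curr_f, prev_f))       # True: matching bins receive matching positions
```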

###### Mathematical Formulation

We now provide explicit mathematical definitions for each transformation in the Causal Cross\-Window Focus mechanism\.

Tokenization. Given a convolutional feature map $\mathbf{X} \in \mathbb{R}^{B \times C \times L}$ where adjacent channel pairs represent complex-valued features (real and imaginary components), we reshape to token representation:

$$\mathbf{T} = \text{Reshape}(\mathbf{X}) \in \mathbb{R}^{B \times (L/2) \times (2C)} \tag{21}$$

This operation groups complex pairs along the sequence dimension, creating tokens that span all channels.

Learned Spectral Compression\.

To avoid aliasing artifacts common in naive downsampling, we perform learned spectral compression in the frequency domain before temporal decimation\. Specifically:

1. We reshape the frequency-domain representation $[B, F, T]$ (where $F$ represents frequency bins) into token format $[B, T/2, 2F]$, where each token encodes a complex-valued frequency representation.

2. We apply a learned $1 \times 1$ convolution with stride $s$ in token space, which effectively learns to compress the frequency spectrum by selecting which frequency bins to preserve before downsampling.

3. For time-domain processing, we convert back via IFFT, ensuring the temporal signal has reduced bandwidth appropriate for the lower sampling rate.

This approach implements a learned, adaptive low-pass filter in the frequency domain, preventing aliasing while preserving task-relevant spectral components.

$$\mathbf{T}_{\downarrow} = \text{Conv}^{(b)}_{1 \times 1, s}(\mathbf{T}) \in \mathbb{R}^{B \times N_{\text{down}} \times (2C)} \tag{22}$$

where $N_{\text{down}} = (L/2)/s$ with adaptive stride:

$$s = \begin{cases} 16 & \text{if } b = 0 \text{ (early layers, aggressive compression)} \\ 4 & \text{if } b > 0 \text{ (later layers, preserve detail)} \end{cases} \tag{23}$$
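To make the shapes in Eqs. (22) and (23) concrete, the sketch below implements a strided pointwise convolution over the token axis in NumPy. A kernel of size 1 with stride $s$ reduces to keeping every $s$-th token and applying a shared linear map, which is what the code does. The weight matrix is random rather than learned, and the sizes `L = 2048` and `C = 16` are illustrative assumptions.

```python
import numpy as np

def stride_for_block(b):
    """Adaptive stride from Eq. (23)."""
    return 16 if b == 0 else 4

def strided_pointwise_conv(tokens, weight, s):
    """tokens: (B, N, D); weight: (D, D). Kernel size 1, stride s along the token axis."""
    return tokens[:, ::s, :] @ weight

rng = np.random.default_rng(0)
L, C = 2048, 16                                    # assumed example sizes
tokens = rng.standard_normal((2, L // 2, 2 * C))   # (B, L/2, 2C)
weight = rng.standard_normal((2 * C, 2 * C))       # stand-in for learned weights

down_first = strided_pointwise_conv(tokens, weight, stride_for_block(0))
down_later = strided_pointwise_conv(tokens, weight, stride_for_block(1))
print(down_first.shape, down_later.shape)          # (2, 64, 32) (2, 256, 32)
```

The sketch only demonstrates the token-count reduction $N_{\text{down}} = (L/2)/s$; whether the learned weights behave as a low-pass filter depends on training and is not shown here.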
Cross-Window Focus. Multi-head orthogonal cross-focus with $H = 4$ heads:

$$\begin{aligned}
\text{head}_h &= \text{Focus}(\mathbf{Q}_h, \mathbf{K}_h, \mathbf{V}_h) && \text{(24)} \\
&= \text{softmax}\left(\frac{Cov(\mathbf{Q}_h, \mathbf{K}_h) \cdot L_{sequence}}{\sqrt{d_k} \cdot K_{focus}}\right) \mathbf{V}_h && \text{(25)}
\end{aligned}$$

where $Cov(Q, K) = (Q - \mu_Q)(K - \mu_K)^T$.

The focus factor $K_{focus} \in [1, L_{sequence}]$ is computed from attention score statistics. First, we compute per-token variance:

$$\sigma^2_i = Var_j(Cov(Q, K)_{:,i}), \quad i \in [1, L_{sequence}]$$

where $\sigma^2_i$ is the variance of attention scores for token $i$ across all queries. We then compute per-token SNR estimates relative to cumulative variance:

$$\text{SNR}_i = \frac{\sigma^2_i}{\sum_{j \neq i} \sigma^2_j + \epsilon}$$

From the SNR distribution, we extract summary statistics:

$$\begin{aligned}
\text{mean\_snr} &= \frac{1}{L_{sequence}} \sum_{i=1}^{L_{sequence}} \text{SNR}_i && \text{(26)} \\
\text{high\_snr\_prop} &= \frac{1}{L_{sequence}} \sum_{i=1}^{L_{sequence}} \mathbb{1}\left[\text{SNR}_i > \text{median}\left(\{\text{SNR}_j\}_{j=1}^{L_{sequence}}\right)\right] && \text{(27)}
\end{aligned}$$

These statistics are concatenated and fed to a learned network:

$$\mathbf{k}_{\text{input}} = [\text{mean\_snr}, \text{high\_snr\_prop}]$$

$$K_{focus} = 1 + (L_{sequence} - 1) \cdot \sigma(f_{\theta}(\mathbf{k}_{\text{input}}))$$

where $f_{\theta}$ is a lightweight MLP and $\sigma(\cdot)$ is the sigmoid function.

The query, key, and value projections for head $h$ are:

$$\begin{aligned}
\mathbf{Q}_h &= \mathbf{T}_{\text{curr}} \mathbf{W}^Q_h, \quad \mathbf{Q}_h \in \mathbb{R}^{B \times N \times d_k} && \text{(28)} \\
\mathbf{K}_h &= \mathbf{T}_{\text{prev}} \mathbf{W}^K_h, \quad \mathbf{K}_h \in \mathbb{R}^{B \times N \times d_k} && \text{(29)} \\
\mathbf{V}_h &= \mathbf{T}_{\text{prev}} \mathbf{W}^V_h, \quad \mathbf{V}_h \in \mathbb{R}^{B \times N \times d_v} && \text{(30)}
\end{aligned}$$

with $d_k = d_v = (2C)/H$. The multi-head outputs are concatenated and projected:

$$\mathbf{T}_{\text{out}} = \text{Concat}(\text{head}_1, \ldots, \text{head}_H) \mathbf{W}^O \tag{31}$$
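The sketch below works through Eqs. (24) to (27) and the focus-factor formula for a single head and a single batch element in NumPy. Several choices are assumptions made only for illustration: the means $\mu_Q$ and $\mu_K$ are taken over the feature axis (the text does not say which axis), $f_\theta$ is replaced by a single linear layer with given `weights` and `bias` instead of an MLP, and the projections of Eqs. (28) to (30) are skipped by passing `q`, `k`, `v` directly.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def covariance_scores(q, k):
    """(Q - mu_Q)(K - mu_K)^T, with means over the feature axis (an assumption)."""
    qc = q - q.mean(axis=-1, keepdims=True)
    kc = k - k.mean(axis=-1, keepdims=True)
    return qc @ kc.T                                    # (N, N)

def snr_statistics(cov, eps=1e-8):
    var = cov.var(axis=0)                               # score variance for key i across queries
    snr = var / (var.sum() - var + eps)                 # sigma_i^2 / sum_{j != i} sigma_j^2
    return np.array([snr.mean(), np.mean(snr > np.median(snr))])   # Eqs. (26), (27)

def focus_factor(stats, n, weights, bias):
    """K_focus in [1, n]; a linear layer stands in for the lightweight MLP f_theta."""
    return 1.0 + (n - 1) * sigmoid(stats @ weights + bias)

def focus_head(q, k, v, weights, bias):
    n, d_k = q.shape
    cov = covariance_scores(q, k)
    k_focus = focus_factor(snr_statistics(cov), n, weights, bias)
    attn = softmax(cov * n / (np.sqrt(d_k) * k_focus), axis=-1)    # Eq. (25)
    return attn @ v, attn, k_focus

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
out, attn, k_focus = focus_head(q, k, v, weights=np.zeros(2), bias=0.0)
print(out.shape, k_focus)                               # (8, 4) 4.5
```

When $K_{focus}$ equals $L_{sequence}$ the scaling reduces to the usual $1/\sqrt{d_k}$, and as $K_{focus}$ approaches 1 the logits are multiplied by $L_{sequence}$, which sharpens the attention distribution.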
Orthogonality Regularization. To encourage diverse attention patterns across heads, we apply orthogonality regularization as discussed in Section [A.6.2](https://arxiv.org/html/2606.04106#A1.SS6.SSS2).

Spectral Upsampling and Reconstruction. The attended representation is upsampled back to the original token count:

$$\mathbf{T}_{\text{up}} = \text{Conv}^{(b)}_{1\times 1,\,1/s}(\mathbf{T}_{\text{out}})\in\mathbb{R}^{B\times(L/2)\times(2C)} \tag{32}$$

followed by reshaping and optional inverse FFT:

$$\mathbf{X}_{\text{out}} = \begin{cases}\text{IFFT}(\text{ReshapeToChannels}(\mathbf{T}_{\text{up}})) & \text{if is\_freq} = \text{False}\\ \text{ReshapeToChannels}(\mathbf{T}_{\text{up}}) & \text{if is\_freq} = \text{True}\end{cases} \tag{33}$$

Residual Connection. Finally, we add the input as a residual connection:

$$\mathbf{X}_{\text{final}} = \mathbf{X}_{\text{out}} + \mathbf{X}_{\text{input}} \tag{34}$$
Tensor Shape Transformations

Table [4](https://arxiv.org/html/2606.04106#A1.T4) tracks tensor dimensions through the complete forward pass, aiding implementation and debugging.

Table 4: Tensor Shape Transformations in Causal Cross-Window Focus

| Operation | Shape | Description |
| --- | --- | --- |
| Input $\mathbf{x}$ | $\mathbb{R}^{B\times C\times L}$ | Convolutional representation |
| After FFT (if time-domain) | $\mathbb{R}^{B\times C\times L}$ | Frequency domain (complex as real pairs) |
| After ReshapeToToken | $\mathbb{R}^{B\times L/2\times 2C}$ | Token representation |
| After Downsample Conv | $\mathbb{R}^{B\times (L/2)/s\times 2C}$ | Compressed tokens ($s\in\{4,16\}$) |
| After Pos. Encoding (time) | $\mathbb{R}^{B\times (L/2)/s\times 2C}$ | Causally concatenated, then split |
| After Pos. Encoding (freq) | $\mathbb{R}^{B\times (L/2)/s\times 2C}$ | Independently encoded |
| After Cross-Attention | $\mathbb{R}^{B\times (L/2)/s\times 2C}$ | Attended representation |
| After Freq. Upsample (time) | $\mathbb{R}^{B\times (L/2)/s\times 2C}$ | Back to frequency domain |
| After Upsample Conv | $\mathbb{R}^{B\times L/2\times 2C}$ | Restored token count |
| After ReshapeToChannels | $\mathbb{R}^{B\times C\times L}$ | Back to convolutional shape |
| After IFFT (if time-domain) | $\mathbb{R}^{B\times C\times L}$ | Time domain output |
| After Residual | $\mathbb{R}^{B\times C\times L}$ | Final output |
###### Design Rationale

We now explain the key design decisions that distinguish our approach from standard attention mechanisms\.

Why domain-specific positional encoding? The fundamental insight is that position has different physical meanings in time versus frequency domains. In time-domain processing, causal concatenation of windows preserves temporal continuity, allowing the model to learn phase relationships across window boundaries. This is critical for signals where phase coherence matters: for example, in communications systems where carrier phase must be tracked across packet boundaries, or in audio where transient events may span multiple windows.

In contrast, frequency\-domain processing benefits from independent encoding per window\. Here, each token position corresponds to a spectral bin \(e\.g\., the 100 Hz component\), and the cross\-focus mechanism learns how energy in specific frequency bands evolves across time\. Concatenating windows would impose artificial temporal ordering on what is fundamentally a spectral covariance problem\. Independent encoding allows the model to learn patterns like "when the 100 Hz component is strong in the previous window, the 200 Hz component tends to be strong in the current window," without conflating this with temporal position\.

Why learned spectral compression? Traditional downsampling via strided convolution or pooling can introduce aliasing artifacts, especially when applied in the time or spatial domain. Our learned $1\times 1$ convolution (with stride) is applied explicitly in the frequency domain and compresses adaptively across all channels simultaneously, learning which spectral components to preserve at each layer depth. Early layers (block 0) use aggressive compression ($s=16$) to mitigate quadratic attention costs, while later layers use milder compression ($s=4$), preserving more detail for fine-grained pattern recognition while still reducing the computational burden of attention.

Why cross-focus instead of self-focus? Using the previous window as Keys and Values while the current window provides Queries enforces causal information flow, essential for autoregressive processing. This design has several advantages:

1. Causality: Information flows strictly from past to present, enabling online/streaming processing.
2. Efficiency: We avoid the $\mathcal{O}(L^{2})$ complexity of full-sequence self-focus, instead computing $\mathcal{O}((L/s)^{2})$ focus within compressed windows.
3. Interpretability: Focus weights reveal how the model uses past context to inform current predictions, facilitating analysis of learned temporal dependencies.
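To make the efficiency claim concrete, the saving in pairwise focus scores depends only on the stride $s$. The sequence length $L = 4096$ below is a hypothetical value chosen purely for illustration; it is not a configuration reported in the paper.

$$\frac{L^{2}}{(L/s)^{2}} = s^{2}, \qquad s = 16 \Rightarrow 256\times \text{ fewer scores}, \qquad s = 4 \Rightarrow 16\times \text{ fewer scores}$$

$$L = 4096:\quad L^{2} = 16{,}777{,}216, \qquad (L/16)^{2} = 256^{2} = 65{,}536, \qquad (L/4)^{2} = 1024^{2} = 1{,}048{,}576$$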

##### A.4.2.4 Spectral Compression Via Frequency Domain Pooling

Pooling operations are ubiquitous in modern deep learning architectures, serving to reduce computational costs while learning abstract feature representations\. The most common approaches—max pooling and average pooling—are typically applied in the time or spatial domain\. However, these conventional methods fundamentally violate the Nyquist\-Shannon sampling theorem\[[50](https://arxiv.org/html/2606.04106#bib.bib50)\], leading to aliasing artifacts where high\-frequency components fold back into lower frequency bands during downsampling\. While recent work has introduced anti\-aliasing filters prior to pooling\[[51](https://arxiv.org/html/2606.04106#bib.bib51)\], these approaches either dilute high\-frequency content through low\-pass filtering or explicitly discard it, fundamentally limiting the representational capacity of the network\.

###### Motivation: The High\-Frequency Information Problem

The discarding or corruption of high\-frequency information during pooling has two critical consequences:

(1) Low-Frequency Learning Bias. It is well-documented that deep neural networks exhibit a spectral bias toward learning low-frequency functions faster than high-frequency counterparts \[[19](https://arxiv.org/html/2606.04106#bib.bib19)\]. We argue this is not merely an empirical observation but a deterministic consequence of progressive downsampling. As one descends deeper into a convolutional architecture with repeated pooling operations, more high-frequency detail is either aliased or removed via anti-aliasing filters. This progressively leaves only low-frequency information in the learned representations. Notably, early layers in deep convolutional networks, which undergo minimal downsampling, are well-known to capture high-frequency features such as edges, corners, and fine textures. This is not coincidental: these layers retain the spectral bandwidth necessary to represent such features.

(2) Violation of Frequency Translation Equivariance. For architectures designed to learn equivariant representations with respect to frequency translations (e.g., shift-invariance in the spectral domain), it is essential to retain complete spectral fidelity across both low and high frequencies, including both magnitude and phase information. Traditional pooling methods fundamentally break this symmetry by selectively removing or corrupting portions of the spectrum. This is particularly problematic for complex-valued signals, where the asymmetric Hermitian symmetry of the spectrum must be preserved to maintain physical interpretability.

###### Proposed Method: Frequency Domain Average Pooling

We propose a simple yet effective solution based on the principle that spectral structure is more important for signal representation than absolute frequency location. Specifically, when downsampling in the PlanFormer architecture, we apply average pooling directly in the frequency domain on complex-valued spectra.

The procedure is as follows:

1. De-interleave: Convert the real-valued interleaved representation (where adjacent elements per channel represent real and imaginary components) to explicit complex format.
2. FFT (if time-domain): If the input is in the time domain, transform to frequency domain via FFT.
3. Average Pool: Apply average pooling along the frequency axis, reducing the sequence length by a factor of $r$ (typically $r\in\{2,4\}$).
4. IFFT (if time-domain): If the original input was time-domain, transform back via IFFT.
5. Re-Interleave: Convert the complex-valued sequence back to a real-valued interleaved sequence for subsequent real-valued processing blocks.

This produces a representation that retains the spectral envelope of the higher-resolution signal but at reduced sequence length. Crucially, this operation preserves both magnitude and phase information in a coarsened form, effectively creating a coarser binning of the original spectrum. The network's non-linearities and representational capacity can then learn to compensate for relative spectral envelope structure rather than requiring absolute frequency component locations. This approach:

- Explicitly retains both high and low-frequency components (in compressed form)
- Respects the asymmetric Hermitian symmetry of complex-valued signals
- Avoids aliasing by operating directly in the frequency domain
- Facilitates learning of frequency translation equivariance

Implementation Note: Frequency Preserving Pooling contrasts with the earlier mentioned Learned Spectral Compression in that it uses no learned parameters or layers during the pooling stage. It relies only on a fixed (parametric) complex-valued FFT, an IFFT to return to the time domain when the input originated there, and a non-trainable average pooling kernel. This keeps learnable parameters to only those that are absolutely necessary.
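The five-step procedure can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `freq_domain_avg_pool` is hypothetical, pooling uses non-overlapping bins in NumPy's default FFT bin order, and amplitude normalization after the shorter IFFT is left at NumPy's default because the source does not specify it.

```python
import numpy as np

def freq_domain_avg_pool(x, r=2, is_freq=False):
    """Pool a real (B, C, L) array whose last axis interleaves I/Q samples."""
    x = np.asarray(x, dtype=np.float64)
    B, C, L = x.shape
    n = L // 2  # number of complex samples
    assert L % 2 == 0 and n % r == 0, "L must be even and L/2 divisible by r"

    # 1. De-interleave: even indices are real (I), odd indices imaginary (Q)
    z = x[..., 0::2] + 1j * x[..., 1::2]              # (B, C, n), complex

    # 2. FFT only when the input is in the time domain
    spec = z if is_freq else np.fft.fft(z, axis=-1)

    # 3. Average pool along the frequency axis over non-overlapping bins of width r
    pooled = spec.reshape(B, C, n // r, r).mean(axis=-1)

    # 4. IFFT back to the time domain if that is where the input started
    out = pooled if is_freq else np.fft.ifft(pooled, axis=-1)

    # 5. Re-interleave into a real-valued sequence of length 2 * (n / r)
    y = np.empty((B, C, 2 * (n // r)), dtype=np.float64)
    y[..., 0::2] = out.real
    y[..., 1::2] = out.imag
    return y

rng = np.random.default_rng(0)
pooled_time = freq_domain_avg_pool(rng.normal(size=(2, 3, 32)), r=4)
```

Each output bin is the complex mean of $r$ neighbouring input bins, so both magnitude and phase contribute to the coarsened spectrum, in line with the "coarser binning" description above.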

###### Implications and Benefits

U-Net Style Architectures. Frequency-domain pooling is particularly powerful when combined with decoder pipelines in a U-Net fashion. Skip connections can be fused in the frequency domain, where higher-resolution spectra from earlier layers infill and sharpen the reconstruction during upsampling. This facilitates explicit high-frequency detail retention and reconstruction, enhancing the fidelity of generated outputs. For fine-grained tasks where high-frequency details provide discriminative signals (e.g., texture classification, signal modulation recognition), this retention is essential.

Extreme Compression Ratios. By leveraging frequency-domain pooling for spectral compression, we achieve downsampling from input space to token space by a factor of $64\times$ through three sequential pooling stages, each with stride $s=4$ (i.e., $4^{3}=64$). This extreme compression is essential for enabling our cross-window focus mechanism and the Parseval Focus Mechanisms within the Parseval Transformer (Section [3.2.2](https://arxiv.org/html/2606.04106#S3.SS2.SSS2)) over very long sequences, circumventing the quadratic complexity of attention mechanisms while preserving spectral fidelity. Critically, each pooling stage operates in the frequency domain, ensuring that the spectral envelope, including high-frequency components, is retained throughout the compression pipeline.

Sample Rate Equivariance. When combined with our Multi-Head Orthogonal Parseval Focus mechanism (Section [3.2.2](https://arxiv.org/html/2606.04106#S3.SS2.SSS2)) for time-frequency energy conservation and function learning, frequency-domain pooling enables learning a functional representation of time versus frequency across a structurally representative spectrum. This facilitates learning sample rate equivariance, such that the trained architecture gracefully extends to arbitrary sample rates without requiring sample-rate-specific retraining or fine-tuning during deployment. We demonstrate this capability through zero-shot deployment of the architecture across various domains with the encoder frozen, despite training only on a single domain at a single sample rate (RF signals at 7.69 megasamples per second).

Frequency Translation Equivariance. Most importantly, spectral envelope retention is essential for learning full frequency translation equivariance, which allows natural scaling across domains and frequencies. Traditional pooling methods directly handicap this capability by corrupting the spectral structure, leading to poor gradient flow and degraded performance. By preserving the complete spectral envelope, including high-frequency components, our approach enables the network to learn truly equivariant representations with respect to frequency shifts.

###### Theoretical Justification

From a signal processing perspective, frequency-domain average pooling can be understood as a form of spectral decimation that preserves the overall energy distribution while reducing resolution. Unlike time-domain downsampling, which must obey the Nyquist criterion to avoid aliasing, frequency-domain pooling operates on the already-decomposed spectral representation. The averaging operation acts as a form of spectral smoothing that preserves the envelope while discarding fine-grained frequency resolution, a fundamentally different trade-off than discarding high-frequency content entirely.

###### Comparison to Existing Methods

Table [5](https://arxiv.org/html/2606.04106#A1.T5) contrasts our frequency-domain pooling with conventional approaches.

Table 5: Comparison of Pooling Methods

| Method | Domain | Aliasing | High-Freq Retention | Hermitian Symmetry |
| --- | --- | --- | --- | --- |
| Max Pooling | Time/Spatial | Yes | No | No |
| Average Pooling | Time/Spatial | Yes | No | No |
| Anti-Aliased Pooling \[[51](https://arxiv.org/html/2606.04106#bib.bib51)\] | Time/Spatial | Reduced | Partial | No |
| Freq-Domain Pooling (Ours) | Frequency | No | Yes (compressed) | Yes |

##### A.4.2.5 Convolutional Attention

The convolution block concludes with an Efficient Channel-Spatial (Temporal for 1D) Attention (ECSA) block \[[52](https://arxiv.org/html/2606.04106#bib.bib52)\] to provide a final attention stage in convolutional space.

##### A.4.2.6 Domain-Specific Token Aggregation

A critical design choice in our tokenization strategy is the domain\-specific aggregation of these windowed representations\. Rather than applying a uniform aggregation strategy across domains, we respect the physical structure and interpretation of each domain\.

###### Time Domain Aggregation: Causal Concatenation

For the time domain, temporal ordering is paramount\. We aggregate tokenized windows by concatenating them along the sequence dimension to preserve the causal temporal structure:

$$\mathbf{x}_{\text{time}} = \text{Concat}([\mathbf{x}_{\text{win}_{1}},\mathbf{x}_{\text{win}_{2}},\ldots,\mathbf{x}_{\text{win}_{N_{w}}}],\ \text{dim}=2)\in\mathbb{R}^{B\times C\times(N_{w}\cdot L)} \tag{35}$$

where $N_{w}$ is the number of windows. This produces a unified temporal representation that maintains the sequential ordering of events across the entire signal. Each window's features are placed in their correct temporal position, enabling downstream modules to learn long-range temporal dependencies while respecting causality.

###### Frequency Domain Aggregation: Latent Spectral Averaging

For the frequency domain, the physical interpretation is fundamentally different\. Each window’s frequency representation captures the spectral content of that temporal segment\. To obtain a global spectral representation, we average the tokenized frequency representations across all windows:

$$\mathbf{x}_{\text{freq}} = \frac{1}{N_{w}}\sum_{i=1}^{N_{w}}\mathbf{x}_{\text{win}_{i}}\in\mathbb{R}^{B\times C\times L} \tag{36}$$

Critically, this averaging occurs in the learned latent space after each window has undergone local representation learning through the convolutional tokenization pipeline (convolution, cross-window focus, frequency-domain pooling, and ECSA blocks). This is fundamentally different from naive spectral averaging in the input space.
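The two aggregation rules of Eqs. (35) and (36) differ only in the reduction applied over windows. The sketch below uses random arrays as stand-ins for the per-window latent features; the shapes and variable names are illustrative assumptions.

```python
import numpy as np

B, C, L, Nw = 2, 4, 8, 3  # illustrative sizes, not values from the paper
rng = np.random.default_rng(0)

# Stand-ins for per-window latent features from each branch's tokenizer
time_windows = [rng.normal(size=(B, C, L)) for _ in range(Nw)]
freq_windows = [rng.normal(size=(B, C, L)) for _ in range(Nw)]

# Eq. (35): causal concatenation along the sequence axis keeps temporal order
x_time = np.concatenate(time_windows, axis=2)          # (B, C, Nw * L)

# Eq. (36): averaging across windows in latent space keeps the sequence length
x_freq = np.mean(np.stack(freq_windows, axis=0), axis=0)  # (B, C, L)
```

Concatenation grows the sequence axis linearly with $N_w$, while averaging keeps it fixed at $L$, which is why the two branches later need shape alignment before fusion.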

###### Rationale: Physics\-Informed Design

These domain\-specific aggregation strategies are anchored in the physical meaning of each representation rather than heuristic choices:

Time Domain. In the temporal domain, position has absolute meaning: an event at time $t_{1}$ is fundamentally different from the same event at time $t_{2}$. Concatenation preserves this temporal ordering, enabling the model to learn causal relationships, temporal dynamics, and sequential patterns. Averaging would destroy this critical temporal structure.

Frequency Domain. In the spectral domain, each frequency bin represents a specific oscillatory component (e.g., the 100 Hz component). However, unlike the time domain where position is absolute, the presence and strength of spectral components may vary across time.

Latent Space vs. Input Space Averaging. This distinction is critical: averaging in latent space is fundamentally different from naive spectral averaging in the input space. Averaging raw frequency-domain inputs (e.g., windowed FFTs) produces a time-averaged spectrum that obscures time-varying spectral phenomena. In contrast, our latent space averaging operates on learned feature representations where each window's tokenizer has already encoded local spectral patterns, transient events, and time-varying structure into the feature channels $C$. The averaging operation then aggregates these learned representations, preserving information about time-varying phenomena in the channel dimension while producing a compact global spectral summary in the sequence dimension.

For example, consider a frequency\-hopping signal where the carrier frequency shifts across windows\. Input\-space spectral averaging would produce a blurred spectrum showing energy across all hopped frequencies with no temporal structure\. In contrast, our approach allows the per\-window tokenizers to learn representations that encode "frequency hop at 100 Hz in this window" and "frequency hop at 200 Hz in that window" in distinct feature channels\. The subsequent averaging produces a latent representation where different channels capture different hopping patterns, preserving the time\-varying spectral structure that would otherwise be lost\.

This design enables the frequency\-domain encoder to learn both stationary spectral structure \(consistent patterns across windows\) and time\-varying spectral dynamics \(patterns that evolve across windows\), without requiring an artificially long sequence that would conflate temporal evolution with spectral content\. For signals with time\-varying spectral characteristics—such as frequency hopping in communications, chirps in radar, or vibrato in audio—this latent space averaging is essential for preserving discriminative information while maintaining computational efficiency\.

###### Output Shapes and Downstream Processing

The domain\-specific aggregation produces tokenized representations with distinct shapes:

$$\text{Time Domain:}\quad \mathbf{x}_{\text{time}}\in\mathbb{R}^{B\times C\times(N_{w}\cdot L)} \tag{37}$$

$$\text{Frequency Domain:}\quad \mathbf{x}_{\text{freq}}\in\mathbb{R}^{B\times C\times L} \tag{38}$$

### A.5 Learned Domain Transformation & Cross-Domain Fusion

Upon completion of the convolutional tokenization pipeline, we prepare the domain\-specific representations for processing by the Parseval Transformer blocks\. This preparation involves two critical steps: \(1\) transforming convolutional feature maps into token representations suitable for transformer processing\[[25](https://arxiv.org/html/2606.04106#bib.bib25)\], and \(2\) enabling cross\-domain information flow through learned transformations and fusion mechanisms\. The latter builds on the fundamental signal processing principle that comprehensive signal analysis requires joint time\-frequency analysis rather than isolated domain\-specific processing\.

#### A.5.1 Complex-Aware Tokenization

We transform convolutional feature maps into token representations in a manner that respects the complex-valued structure of the input space. Recall that the convolutional tokenizer outputs feature maps $\mathbf{x}\in\mathbb{R}^{B\times C\times L}$, where $L$ is the length of an interleaved real-valued sequence of in-phase (I) and quadrature (Q) components.

To preserve the complex-valued structure during tokenization, we reshape such that the sequence length $N=L/2$ corresponds to the number of complex samples, while the embedding dimension doubles to $2C$ to accommodate both I and Q components per channel:

$$\mathbf{x}\in\mathbb{R}^{B\times C\times L}\rightarrow\mathbf{T}\in\mathbb{R}^{B\times N\times 2C},\quad\text{where } N=L/2 \tag{39}$$
This tokenization strategy provides dual benefits:

1. Complex-valued structure preservation: Each token position corresponds to a complex sample (I/Q pair), enabling the transformer to model relationships between complex samples rather than treating I and Q components as independent entities.
2. Computational efficiency: Halving the sequence length reduces the quadratic term of self-attention from $\mathcal{O}(L^{2})$ to $\mathcal{O}((L/2)^{2})$, i.e., $L^{2}/4$ pairwise scores, providing a $4\times$ reduction in attention computation cost.
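The reshape of Eq. (39) can be sketched as below. The source fixes only the input and output shapes; the layout of the $2C$ embedding axis (here $[I_{c_0}, Q_{c_0}, I_{c_1}, Q_{c_1}, \ldots]$) and the helper names `to_tokens` and `to_channels` are assumptions made for illustration.

```python
import numpy as np

def to_tokens(x):
    """(B, C, L) with interleaved I/Q -> (B, N, 2C) with N = L / 2."""
    B, C, L = x.shape
    N = L // 2
    pairs = x.reshape(B, C, N, 2)                 # last axis holds the (I, Q) pair
    return pairs.transpose(0, 2, 1, 3).reshape(B, N, 2 * C)

def to_channels(t):
    """Inverse of to_tokens: (B, N, 2C) -> (B, C, 2N)."""
    B, N, D = t.shape
    C = D // 2
    return t.reshape(B, N, C, 2).transpose(0, 2, 1, 3).reshape(B, C, 2 * N)

x = np.arange(2 * 3 * 8, dtype=np.float64).reshape(2, 3, 8)
tokens = to_tokens(x)
```

Token $n$ carries the I and Q values of complex sample $n$ for every channel, so each attention position corresponds to one complex sample rather than one real scalar.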

#### A.5.2 Cross-Domain Information Fusion

Once tokenized, we leverage the complementary nature of time and frequency domain representations\. Comprehensive signal analysis requires joint time\-frequency processing—a principle long established in signal processing through techniques such as the Short\-Time Fourier Transform \(STFT\) and wavelet analysis\. These classical methods apply fixed, parametric transformations to obtain time\-frequency representations\.

We propose a learned alternative: our architecture functions as an end-to-end learnable and differentiable time-frequency transform in latent feature space. Rather than fixed transformations, we employ learned domain transformations and gated fusion layers to facilitate adaptive cross-domain information flow. This enables the network to learn domain-specific features while ensuring that information learned in one domain informs processing in the other.

#### A.5.3 Learned Domain Transformation

At the conclusion of tokenization, we have two sets of domain-specific tokens with different shapes due to the aggregation strategies described in Section [A.4.2.6](https://arxiv.org/html/2606.04106#A1.SS4.SSS2.P6):

$$\text{Time Domain:}\quad \mathbf{T}_{t}\in\mathbb{R}^{B\times N_{t}\times 2C} \tag{40}$$

$$\text{Frequency Domain:}\quad \mathbf{T}_{f}\in\mathbb{R}^{B\times N_{f}\times 2C} \tag{41}$$

where $N_{t}=N_{w}\cdot(L/2)$ due to concatenation across windows, and $N_{f}=L/2$ due to averaging across windows.

Furthermore, domain-specific processing results in latent features that occupy different subspaces of the representation space. To enable effective cross-domain fusion, we must first align representations to compatible shapes and map them to a common latent subspace. We achieve this through learned $1\times 1$ convolutions that simultaneously transform the latent subspace and adjust sequence lengths.

##### A.5.3.1 Time-to-Frequency Transformation.

To inject time-domain context into frequency processing:

$$\mathbf{T}_{t\rightarrow f} = \text{Conv}_{1\times 1}(\mathbf{T}_{t})\in\mathbb{R}^{B\times N_{f}\times 2C} \tag{42}$$

The $1\times 1$ convolution operates along the sequence dimension in token space (which plays the role of the channel dimension of a regular convolution), learning to compress the longer time-domain sequence ($N_{t}$ tokens) to match the frequency-domain length ($N_{f}$ tokens) while transforming the latent representation to the frequency-domain subspace.

##### A.5.3.2 Frequency-to-Time Transformation.

Conversely, to inject frequency-domain context into time processing:

$$\mathbf{T}_{f\rightarrow t} = \text{Conv}_{1\times 1}(\mathbf{T}_{f})\in\mathbb{R}^{B\times N_{t}\times 2C} \tag{43}$$

Here, the convolution expands the compact frequency representation to match the time-domain sequence length while transforming to the time-domain latent subspace.

These learned transformations are fundamentally different from fixed domain transforms \(e\.g\., FFT/IFFT\)\. Rather than converting between time and frequency representations of the signal itself, they learn to map between latent feature subspaces, identifying which frequency\-domain features are most relevant for time\-domain processing and vice versa\.
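A $1\times 1$ convolution whose channel axis is the token sequence reduces to a learned linear map over token positions, applied identically at every embedding coordinate. The sketch below illustrates that reading of Eqs. (42) and (43); the function name `seq_pointwise_conv` and the random weights are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def seq_pointwise_conv(tokens, weight, bias):
    """Kernel-size-1 convolution treating the sequence axis as conv channels.

    tokens: (B, N_in, D), weight: (N_out, N_in), bias: (N_out,)
    returns: (B, N_out, D)
    """
    return np.einsum("mn,bnd->bmd", weight, tokens) + bias[None, :, None]

B, C, L, Nw = 2, 4, 8, 3        # illustrative sizes only
N_f, N_t = L // 2, Nw * (L // 2)
rng = np.random.default_rng(0)
T_t = rng.normal(size=(B, N_t, 2 * C))
T_f = rng.normal(size=(B, N_f, 2 * C))

# Time -> frequency compresses N_t tokens to N_f; frequency -> time expands back
T_t2f = seq_pointwise_conv(T_t, rng.normal(size=(N_f, N_t)), np.zeros(N_f))
T_f2t = seq_pointwise_conv(T_f, rng.normal(size=(N_t, N_f)), np.zeros(N_t))
```

The embedding width $2C$ is untouched by this map; only the number of tokens changes, which is what makes the two branches shape-compatible for the gated fusion that follows.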

#### A.5.4 Gated Linear Unit Fusion

Once domain transformations align the shapes, we employ Gated Linear Units \(GLUs\) to perform adaptive feature fusion\. The gating mechanism enables learned, dynamic fusion during inference, allowing the network to selectively emphasize or suppress cross\-domain information based on the input signal characteristics\.

The fusion operates at the token level across the embedding dimension, providing fine\-grained control over which features from each domain are combined:

##### A.5.4.1 Time-Domain Fusion.

$$\mathbf{T}_{\text{fused-t}} = \text{GLU}(\text{Concat}([\mathbf{T}_{t},\mathbf{T}_{f\rightarrow t}],\ \text{dim}=2))\in\mathbb{R}^{B\times N_{t}\times 2C} \tag{44}$$

where the concatenation produces $\mathbb{R}^{B\times N_{t}\times 4C}$ and the GLU projects back to $\mathbb{R}^{B\times N_{t}\times 2C}$.

##### A.5.4.2 Frequency-Domain Fusion.

$$\mathbf{T}_{\text{fused-f}} = \text{GLU}(\text{Concat}([\mathbf{T}_{f},\mathbf{T}_{t\rightarrow f}],\ \text{dim}=2))\in\mathbb{R}^{B\times N_{f}\times 2C} \tag{45}$$

Algorithm [5](https://arxiv.org/html/2606.04106#alg5) details the GLU fusion procedure.

Algorithm 5: Gated Linear Unit Fusion

- Input: Concatenated features $\mathbf{X}\in\mathbb{R}^{B\times N\times D_{\text{in}}}$, where $D_{\text{in}}=4C$
- Input: Target dimension $D_{\text{out}}=2C$
- Output: Fused representation $\mathbf{X}_{\text{fused}}\in\mathbb{R}^{B\times N\times D_{\text{out}}}$

1. Normalize input features: $\mathbf{X}_{\text{norm}}\leftarrow\text{RMSNorm}(\mathbf{X})$
2. Compute the value projection: $\mathbf{V}\leftarrow\mathbf{X}_{\text{norm}}\mathbf{W}_{v}+\mathbf{b}_{v}$, with $\mathbf{V}\in\mathbb{R}^{B\times N\times D_{\text{out}}}$
3. Compute the gate projection: $\mathbf{G}\leftarrow\mathbf{X}_{\text{norm}}\mathbf{W}_{g}+\mathbf{b}_{g}$, with $\mathbf{G}\in\mathbb{R}^{B\times N\times D_{\text{out}}}$
4. Apply sigmoid activation to the gate: $\mathbf{G}_{\text{act}}\leftarrow\sigma(\mathbf{G})$ (element-wise sigmoid)
5. Gated fusion: $\mathbf{X}_{\text{fused}}\leftarrow\mathbf{V}\odot\mathbf{G}_{\text{act}}$ (element-wise multiplication)
6. Return $\mathbf{X}_{\text{fused}}$

The GLU mechanism can be expressed mathematically as:

$$\text{GLU}(\mathbf{X}) = (\mathbf{X}\mathbf{W}_{v}+\mathbf{b}_{v})\odot\sigma(\mathbf{X}\mathbf{W}_{g}+\mathbf{b}_{g}) \tag{46}$$

where $\mathbf{W}_{v},\mathbf{W}_{g}\in\mathbb{R}^{D_{\text{in}}\times D_{\text{out}}}$ are learnable projection matrices, $\mathbf{b}_{v},\mathbf{b}_{g}\in\mathbb{R}^{D_{\text{out}}}$ are bias terms, $\sigma(\cdot)$ is the sigmoid activation, and $\odot$ denotes element-wise multiplication.
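A compact NumPy sketch of the fusion step, following Algorithm 5 (normalize, project to value and gate, apply the sigmoid gate), is given below. It is illustrative only: the weights are random stand-ins for learned parameters, `rms_norm` omits the learned gain that RMSNorm layers commonly carry, and the function names are hypothetical.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Root-mean-square normalization over the feature axis (no learned gain)."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def glu_fusion(t_a, t_b, Wv, bv, Wg, bg):
    """Fuse two (B, N, 2C) token sets into one (B, N, 2C) representation."""
    x = np.concatenate([t_a, t_b], axis=2)        # (B, N, 4C)
    x_norm = rms_norm(x)
    value = x_norm @ Wv + bv                      # (B, N, 2C)
    gate = x_norm @ Wg + bg                       # (B, N, 2C)
    gate_act = 1.0 / (1.0 + np.exp(-gate))        # element-wise sigmoid
    return value * gate_act                       # element-wise gating

B, N, C = 2, 5, 3                                 # illustrative sizes only
rng = np.random.default_rng(0)
t_t = rng.normal(size=(B, N, 2 * C))              # e.g. time-domain tokens
t_f2t = rng.normal(size=(B, N, 2 * C))            # e.g. transformed frequency tokens
Wv, Wg = rng.normal(size=(4 * C, 2 * C)), rng.normal(size=(4 * C, 2 * C))
bv, bg = np.zeros(2 * C), np.zeros(2 * C)
fused = glu_fusion(t_t, t_f2t, Wv, bv, Wg, bg)
```

With a zero gate projection the sigmoid outputs 0.5 everywhere, so the fusion degenerates to a scaled linear projection; the learned gate is what makes the mixing input-dependent.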

The gating mechanism provides several advantages:

- Adaptive fusion: The sigmoid gate learns to dynamically weight the contribution of each feature dimension, enabling signal-dependent fusion strategies.
- Gradient flow: The multiplicative gating provides direct gradient paths, facilitating stable training.
- Interpretability: Gate activations can be analyzed to understand which cross-domain features are emphasized for different signal types.

#### A.5.5 Multi-Stage Fusion Architecture

We apply this learned transformation and fusion process strategically at three key points in the architecture:

1. Post-tokenization: After convolutional tokenization, before transformer processing
2. Inter-block: Between each Parseval Transformer block (currently one block in our architecture)
3. Post-pooling: After attentional sequence pooling, before final domain-specific embeddings

This multi\-stage fusion ensures continuous cross\-domain information flow throughout the network hierarchy, enabling the model to refine its joint time\-frequency representation at multiple levels of abstraction\.

#### A.5.6 Interpretation as Learned Time-Frequency Analysis

At its conclusion, this process can be viewed as a learned latent spectrogram or wavelet transform. Unlike classical STFT or wavelet analysis with fixed basis functions, our approach learns:

- Adaptive basis functions: The convolutional tokenizers learn domain-specific feature extractors tailored to the data distribution.
- Cross-domain mappings: The $1\times 1$ convolutions learn which time-domain features correspond to which frequency-domain features.
- Fusion strategies: The GLU gates learn how to combine time and frequency information for optimal task performance.

This end\-to\-end differentiable time\-frequency analysis adapts to the specific characteristics of the signal domain and task, providing a more flexible and powerful alternative to fixed parametric transforms\.

### A.6 The Parseval Transformer

Once the tokens have been prepared, we pass them to our custom transformer architecture, which we term the Parseval Transformer. Standard transformer block construction is well refined in the literature, and we leave it largely unmodified. Our main contributions are to the attention mechanism within the standard transformer block. First, we discuss our proposed evolution of Scaled Dot-Product Attention into the new Scaled Covariance Focus. Then we highlight how this is used both across and within the time and frequency domains for a comprehensive learned focus mechanism that we call Parseval Focus. With these contributions, motivated by Parseval's theorem, we sought to anchor Transformer design in known physical symmetries and a more functional understanding of the time and frequency domains.

Implementation Note: Each domain-specific branch of the PlanFormer has an independent Parseval Transformer. The Multi-Head Parseval Focus mechanism is computed within each domain-specific Parseval Transformer on that branch's tokens together with their complementary parametric transform (e.g., Time-Domain Parseval Transformer $\rightarrow$ FFT(time-domain tokens) and Frequency-Domain Parseval Transformer $\rightarrow$ IFFT(frequency-domain tokens)). This produces cross-domain token sequences that match exactly in length and dimensionality, as required for valid downstream cross-domain probability distribution comparisons. Only afterwards is each domain-specific Parseval Transformer output fused with the other through the learned domain transforms and cross-domain gated fusions described earlier (Appendix [A.5.3](https://arxiv.org/html/2606.04106#A1.SS5.SSS3)).

#### A.6.1 Scaled Covariance Focus:

##### A.6.1.1 From Dot-Product to Covariance.

The standard attention mechanism is implemented via a learned dot product across the projections of the sequence tokens\. This simple but effective computation allows a model to learn the similarity of one token to all other tokens in a sequence and weight them appropriately as a probability distribution via the softmax operation\. While historically very effective, this computation is very discrete in nature\. We argue this discrete computation and analysis is limiting, especially in relation to time\-series information which naturally has higher\-order information encoded within it\. An appropriate example is how trends within time series are a function of past behavior as opposed to only how similar a present segment of time is relative to a past segment\.

For time-series data, we argue that functional relationships (how tokens co-vary) are more informative than discrete similarity. We replace dot-product attention with covariance-based attention:

$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{\text{Cov}(Q,K)}{\sqrt{d_k}}\right)V$$

where

$$\text{Cov}(Q,K)=(Q-\mu_Q)(K-\mu_K)^T$$
This only requires centering the query and key matrices before computing their inner product, yet the result now captures whether tokens exhibit positive, negative, or neutral co-variation, encoding higher-order temporal dynamics (trends, correlations) rather than instantaneous similarity. This higher-order information is useful for generalizing time-series analysis and learning fundamental symmetries of the data.
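As an illustration, the covariance-based attention above can be sketched in NumPy for a single head. This is a minimal sketch under stated assumptions, not the authors' implementation: the text does not say along which axis $\mu_Q$ and $\mu_K$ are taken, so the sketch assumes a per-token mean over the feature axis (which makes each score a covariance between two tokens, up to a constant factor). The function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_covariance_attention(Q, K, V):
    """Single-head sketch. Q: (L_q, d_k), K: (L_k, d_k), V: (L_k, d_v)."""
    d_k = Q.shape[-1]
    # Assumption: center each token over its own features (axis=-1).
    Qc = Q - Q.mean(axis=-1, keepdims=True)
    Kc = K - K.mean(axis=-1, keepdims=True)
    scores = Qc @ Kc.T / np.sqrt(d_k)        # Cov(Q, K) / sqrt(d_k)
    weights = softmax(scores, axis=-1)       # one distribution per query
    return weights @ V, weights


rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out, weights = scaled_covariance_attention(Q, K, V)
# A constant offset on Q or K leaves the weights unchanged after centering.
_, shifted = scaled_covariance_attention(Q + 3.0, K - 2.0, V)
print(np.allclose(weights, shifted))
```

Under this centering choice the scores are invariant to a constant offset added to the queries or keys, which is one concrete way in which they differ from plain dot-product scores.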

##### A.6.1.2 Dynamic Focus Mechanism.

We introduce a Focus mechanism inspired by neurocognitive load in low\-SNR environments\. In low\-SNR scenarios, multiple stimuli may compete for attention, but many are distractions rather than the signal of interest\. Think of a static television or a needle in a haystack\. In these situations, effective attention requires sharpening the region one attends to dynamically\.

The focus factor $K_{focus}$ is computed from attention score statistics through the following process:

Step 1: Per-Token Variance. Compute variance of attention scores for each token across all queries:

$$\sigma^2_i=\text{Var}_j\left(\text{Cov}(Q,K)_{:,i}\right),\quad i\in[1,L_{sequence}]$$

Step 2: Cumulative Variance. Sum variances across all tokens:

$$\sigma^2_{\text{cum}}=\sum_{i=1}^{L_{sequence}}\sigma^2_i$$

Step 3: Per-Token SNR Estimation. Compute each token’s variance relative to the total:

$$\text{SNR}_i=\frac{\sigma^2_i}{\sigma^2_{\text{cum}}-\sigma^2_i+\epsilon}$$

where $\epsilon=10^{-8}$ prevents division by zero. High $\text{SNR}_i$ indicates token $i$ has high variance relative to others (distinctive), while low $\text{SNR}_i$ indicates low relative variance (not distinctive).

Step 4: Summary Statistics. Extract two features from the SNR distribution:

$$\text{mean\_snr}=\frac{1}{L_{sequence}}\sum_{i=1}^{L_{sequence}}\text{SNR}_i\tag{47}$$

$$\text{high\_snr\_prop}=\frac{1}{L_{sequence}}\sum_{i=1}^{L_{sequence}}\mathbb{1}\left[\text{SNR}_i>\text{median}\left(\{\text{SNR}_j\}_{j=1}^{L_{sequence}}\right)\right]\tag{48}$$

where mean\_snr captures the average distinctiveness across tokens, and high\_snr\_prop captures the proportion of tokens with above-median distinctiveness.

Step 5: Learned Focus Prediction. Concatenate statistics and predict focus factor:

$$\mathbf{k}_{\text{input}}=[\text{mean\_snr},\text{high\_snr\_prop}]\in\mathbb{R}^2$$

$$K_{focus}=1+(L_{sequence}-1)\cdot\sigma(f_\theta(\mathbf{k}_{\text{input}}))$$

where $f_\theta:\mathbb{R}^2\rightarrow\mathbb{R}$ is a lightweight MLP (2 hidden layers, 64 units each, ReLU in between) and $\sigma(\cdot)$ is the sigmoid function.

The effective temperature is $\tau=K_{focus}/L_{sequence}\in[1/L_{sequence},1]$. The network learns to map attention score distributions to appropriate temperature scaling: when attention is uncertain (high mean\_snr, many tokens competing), $K_{focus}$ is low ($\tau$ small) to sharpen the distribution; when attention is confident (low mean\_snr, few clear winners), $K_{focus}$ is high ($\tau\approx 1$) to maintain standard attention. This adaptive mechanism enables dynamic sparsity control based on the confidence of the attention distribution.

The complete Focus mechanism is:

$$\text{Focus}(Q,K,V)=\text{softmax}\left(\frac{\text{Cov}(Q,K)\cdot L_{sequence}}{\sqrt{d_k}\cdot K_{focus}}\right)V$$
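Steps 1 through 5 and the complete Focus formula can be traced end to end in a short NumPy sketch for one head. Several pieces are assumptions made only for illustration: `make_toy_mlp` is a hypothetical, randomly initialized stand-in for the learned $f_\theta$ (biases omitted), the centering axis inside the covariance is assumed to be the feature axis, and the function names are not from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def snr_statistics(scores, eps=1e-8):
    """Steps 1-4: summarize an (L, L) score matrix as [mean_snr, high_snr_prop]."""
    var = scores.var(axis=0)                   # Step 1: variance of column i over queries
    cum = var.sum()                            # Step 2: cumulative variance
    snr = var / (cum - var + eps)              # Step 3: per-token SNR
    mean_snr = snr.mean()                      # Step 4, Eq. 47
    high_snr_prop = np.mean(snr > np.median(snr))  # Step 4, Eq. 48
    return np.array([mean_snr, high_snr_prop])


def make_toy_mlp(rng, hidden=64):
    """Hypothetical stand-in for f_theta: 2 -> 64 -> 64 -> 1 with ReLU."""
    W1 = rng.normal(scale=0.1, size=(2, hidden))
    W2 = rng.normal(scale=0.1, size=(hidden, hidden))
    w3 = rng.normal(scale=0.1, size=hidden)

    def f_theta(k_input):
        h = np.maximum(k_input @ W1, 0.0)
        h = np.maximum(h @ W2, 0.0)
        return h @ w3                          # scalar logit

    return f_theta


def focus(Q, K, V, f_theta):
    """Focus(Q, K, V) for one head. Q, K: (L, d_k), V: (L, d_v)."""
    L, d_k = Q.shape
    Qc = Q - Q.mean(axis=-1, keepdims=True)    # assumed centering axis
    Kc = K - K.mean(axis=-1, keepdims=True)
    cov = Qc @ Kc.T
    logit = f_theta(snr_statistics(cov))
    k_focus = 1.0 + (L - 1) / (1.0 + np.exp(-logit))   # Step 5
    tau = k_focus / L                          # effective temperature in [1/L, 1]
    # Dividing by tau equals multiplying by L / K_focus, as in the Focus formula.
    weights = softmax(cov / (np.sqrt(d_k) * tau), axis=-1)
    return weights @ V, k_focus, tau


rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))
out, k_focus, tau = focus(Q, K, V, make_toy_mlp(rng))
print(out.shape, 1.0 <= k_focus <= 6.0, 1.0 / 6 <= tau <= 1.0)
```

One thing the sketch makes visible: read literally, Eq. 48 counts tokens strictly above the median, so `high_snr_prop` can never exceed 1/2.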

#### A.6.2 Head Orthogonalization Regularization

##### A.6.2.1 Motivation and Approach

The multi-head attention mechanism is designed to allow different heads to attend to distinct aspects of the input representations \[[20](https://arxiv.org/html/2606.04106#bib.bib20)\]. However, without explicit enforcement, heads often learn redundant attention patterns. Recent work on Differential Transformers \[[24](https://arxiv.org/html/2606.04106#bib.bib24)\] has demonstrated that common-mode noise is prevalent across attention heads, reducing the effective capacity of the multi-head mechanism.

To address this, we introduce a head orthogonalization regularization loss applied during training of every Multi\-Head Scaled Covariance Self\-Focus layer\. Our loss explicitly encourages head specialization by minimizing the covariance between attention distributions of different heads while maintaining sufficient variance within each head\.

##### A.6.2.2 Mathematical Formulation

Given attention weights $\mathbf{A}\in\mathbb{R}^{B\times H\times L\times L}$, where $B$ is batch size, $H$ is the number of heads, and $L$ is sequence length, we compute the head orthogonalization loss as follows.

First, we reshape and transpose the attention weights:

$$\tilde{\mathbf{A}}=\text{reshape}(\text{permute}(\mathbf{A},[0,2,1,3]),[BL,H,L])\tag{49}$$

where $\text{permute}(\mathbf{A},[0,2,1,3])$ reorders dimensions from $(B,H,L_q,L_k)$ to $(B,L_q,H,L_k)$, grouping batch and query-sequence dimensions before reshaping to $\mathbb{R}^{BL\times H\times L}$. We then center each attention head by subtracting its mean across the sequence dimension:

$$\bar{\mathbf{A}}=\tilde{\mathbf{A}}-\mathbb{E}_l[\tilde{\mathbf{A}}]\tag{50}$$

where $\mathbb{E}_l[\cdot]$ denotes the mean over the sequence length dimension. This centering operation transforms the subsequent inner product into a covariance computation, allowing us to measure statistical dependence between attention heads.

The similarity matrix between heads is computed as:

$$\mathbf{S}=|\bar{\mathbf{A}}\bar{\mathbf{A}}^T|\in\mathbb{R}^{BL\times H\times H}\tag{51}$$

We decompose $\mathbf{S}$ into diagonal and off-diagonal components:

$$\mathbf{D}=\text{diag}(\mathbf{S})\tag{52}$$

$$\mathbf{S}_{\text{off}}=\mathbf{S}-\mathbf{D}\tag{53}$$
The head orthogonalization loss consists of two terms:

$$\mathcal{L}_{\text{orth}}=\mathbb{E}[|\mathbf{S}_{\text{off}}|]+\mathbb{E}[\text{ReLU}(1-\sqrt{\mathbf{D}+\epsilon})]\tag{54}$$

where $\epsilon=10^{-4}$ is a small constant for numerical stability. The first term minimizes covariance between different heads, encouraging orthogonalization. The second term penalizes heads with low variance (standard deviation below 1) to prevent degeneracy, ensuring each head maintains meaningful attention patterns rather than collapsing to uniform distributions.
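Equations 49 through 54 translate almost line by line into array operations. The following NumPy sketch is illustrative only: the function name is hypothetical, and it assumes that both expectations in Eq. 54 are plain means over all entries (including the zeroed diagonal of $\mathbf{S}_{\text{off}}$), which the text does not spell out.

```python
import numpy as np


def head_orthogonalization_loss(A, eps=1e-4):
    """A: attention weights of shape (B, H, L_q, L_k). Returns a scalar loss."""
    B, H, Lq, Lk = A.shape
    # Eq. 49: (B, H, L_q, L_k) -> (B, L_q, H, L_k) -> (B * L_q, H, L_k)
    A_t = A.transpose(0, 2, 1, 3).reshape(B * Lq, H, Lk)
    # Eq. 50: center each head's distribution over the key axis
    A_bar = A_t - A_t.mean(axis=-1, keepdims=True)
    # Eq. 51: absolute head-by-head similarity, shape (B * L_q, H, H)
    S = np.abs(A_bar @ A_bar.transpose(0, 2, 1))
    # Eq. 52-53: split into diagonal and off-diagonal parts
    D = np.diagonal(S, axis1=-2, axis2=-1)          # (B * L_q, H)
    S_off = S * (1.0 - np.eye(H))
    # Eq. 54: redundancy term plus low-variance penalty (assumed plain means)
    redundancy = np.abs(S_off).mean()
    low_variance = np.maximum(1.0 - np.sqrt(D + eps), 0.0).mean()
    return redundancy + low_variance


# Two heads attending to the same key versus two heads attending to different keys.
same = np.zeros((1, 2, 4, 4)); same[:, :, :, 0] = 1.0
diff = np.zeros((1, 2, 4, 4)); diff[:, 0, :, 0] = 1.0; diff[:, 1, :, 1] = 1.0
print(head_orthogonalization_loss(same) > head_orthogonalization_loss(diff))
```

In this toy case the two configurations have identical diagonal terms, so the difference in loss comes entirely from the off-diagonal redundancy term.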

##### A.6.2.3 Preventing Projection Degeneration with SoftAbsFloor

A critical challenge with orthogonalization-based regularization is the risk of over-regularization, which can cause query and key projections to collapse toward zero, resulting in uniform attention distributions. To mitigate this, we apply a SoftAbsFloor activation function to the outputs of the $\mathbf{Q}$ and $\mathbf{K}$ projection layers:

$$\text{SoftAbsFloor}(x;\epsilon)=x+\text{sign}(x)\cdot\epsilon\cdot\sigma\left(-\frac{|x|}{\epsilon}\right)\tag{55}$$
where $\sigma(\cdot)$ is the sigmoid function and $\epsilon=10^{-4}$ controls the floor magnitude. This function behaves approximately as the identity for $|x|\gg\epsilon$ and adds a small sign-preserving offset for $|x|\approx 0$. The smooth transition preserves gradient flow while preventing complete collapse of the projections, maintaining the benefits of orthogonalization without the pathological degeneration that can occur with hard constraints.

Behavior Analysis. For large magnitudes, $\sigma(-|x|/\epsilon)\approx 0$, so $\text{SoftAbsFloor}(x)\approx x$. For small magnitudes, $\sigma(-|x|/\epsilon)\to\sigma(0)=1/2$, so Eq. 55 as written pushes small nonzero inputs to a magnitude of about $\epsilon/2$; a floor at exactly $\pm\epsilon$ would require the gate $2\sigma(-|x|/\epsilon)$. The transition sharpness is controlled by $1/\epsilon$, creating a nearly hard floor while remaining differentiable for $x\neq 0$.
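A direct numerical evaluation of Eq. 55 helps to check the limiting behavior. The sketch below is a minimal NumPy version with a hypothetical function name; it rewrites the sigmoid in an overflow-free form but otherwise follows the equation as written. Because $\sigma(0)=1/2$, the offset added near the origin comes out at about $\epsilon/2$.

```python
import numpy as np


def soft_abs_floor(x, eps=1e-4):
    """Eq. 55: x + sign(x) * eps * sigmoid(-|x| / eps), element-wise."""
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x) / eps)       # lies in (0, 1], so it cannot overflow
    gate = e / (1.0 + e)               # equals sigmoid(-|x| / eps)
    return x + np.sign(x) * eps * gate


x = np.array([-1.0, -1e-9, 0.0, 1e-9, 1.0])
y = soft_abs_floor(x)
# Large inputs pass through; tiny nonzero inputs are pushed to about eps / 2.
print(y)
```

Note that `np.sign(0.0)` is zero, so an input that is exactly zero stays at zero under this literal reading.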

##### A.6.2.4 Implementation Details

The head orthogonalization loss (Eq. [54](https://arxiv.org/html/2606.04106#A1.E54)) is computed at each attention layer and aggregated across layers. The SoftAbsFloor activation (Eq. [55](https://arxiv.org/html/2606.04106#A1.E55)) is applied element-wise to query and key projections before the attention computation:

$$\mathbf{Q}'=\text{SoftAbsFloor}(\mathbf{W}_Q\mathbf{X})\tag{56}$$

$$\mathbf{K}'=\text{SoftAbsFloor}(\mathbf{W}_K\mathbf{X})\tag{57}$$

where $\mathbf{W}_Q,\mathbf{W}_K$ are the learned projection matrices and $\mathbf{X}$ is the input. The value projection $\mathbf{V}$ does not require this activation, as it is not involved in the attention score computation and thus not subject to the same collapse dynamics.

### A.7 Multi-Head Parseval Focus Mechanism

#### A.7.1 Theoretical Foundation

We now extend the Scaled Covariance Focus mechanism to leverage a fundamental principle from signal processing: Parseval’s theorem \[[23](https://arxiv.org/html/2606.04106#bib.bib23)\]. Parseval’s theorem establishes that energy is conserved between time and frequency domains:

$$\sum_n|x[n]|^2=\frac{1}{N}\sum_k|X[k]|^2\tag{58}$$

where $X[k]$ is the $N$-point DFT of $x[n]$. This energy conservation represents a fundamental invariance symmetry: the total energy of a signal is preserved across domain transformations. We extend this principle to learned representations: physically consistent features should exhibit predictable, symmetric relationships across time and frequency domains.
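Eq. 58 is easy to verify numerically. The short NumPy check below is only an illustration (the helper name is hypothetical); it relies on the fact that `np.fft.fft` applies no normalization in the forward direction by default, which is why the $1/N$ factor appears on the frequency side.

```python
import numpy as np


def parseval_energies(x):
    """Return (time-domain energy, frequency-domain energy / N) for a 1-D signal."""
    X = np.fft.fft(x)                              # unnormalized forward DFT
    time_energy = np.sum(np.abs(x) ** 2)
    freq_energy = np.sum(np.abs(X) ** 2) / x.size  # the 1/N factor from Eq. 58
    return time_energy, freq_energy


rng = np.random.default_rng(0)
iq = rng.normal(size=256) + 1j * rng.normal(size=256)   # synthetic complex samples
t_energy, f_energy = parseval_energies(iq)
print(np.isclose(t_energy, f_energy))
```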

The Multi\-Head Parseval Focus mechanism enforces this consistency through a novel cross\-domain attention architecture with Parseval\-based regularization\. Rather than treating time and frequency representations independently, we explicitly model their bidirectional relationships and penalize physically inconsistent attention patterns\.

#### A.7.2 Architecture Overview

The Multi-Head Parseval Focus mechanism integrates three complementary attention operations:

1. In-Domain Self-Focus: Captures domain-specific patterns (temporal ordering in time, spectral periodicity in frequency)
2. Cross-Domain Parseval Focus: Models bidirectional time-frequency relationships with energy conservation constraints
3. Strategic Fusion: Combines in-domain and cross-domain representations via gated mechanisms

##### A.7.2.1 Domain Preparation

Given time-domain tokens $\mathbf{T}_t\in\mathbb{R}^{B\times N_t\times d}$ and frequency-domain tokens $\mathbf{T}_f\in\mathbb{R}^{B\times N_f\times d}$ from the dual-domain tokenization pipeline, we first prepare both representations in both domains:

$$\mathbf{T}_t^{(freq)}=\text{FFT}(\mathbf{T}_t)\quad\text{(time tokens in frequency domain)}\tag{59}$$

$$\mathbf{T}_f^{(time)}=\text{IFFT}(\mathbf{T}_f)\quad\text{(frequency tokens in time domain)}\tag{60}$$
These transformations leverage our interleaved IQ representation: we de\-interleave the embedding dimension to obtain complex\-valued tokens, apply FFT/IFFT, and re\-interleave for subsequent real\-valued processing\.
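The de-interleave, transform, re-interleave pipeline can be sketched as follows. Two details are assumptions made for illustration, since the text does not fix them here: the interleaved layout is taken to be `[I0, Q0, I1, Q1, ...]` along the embedding dimension, and the FFT is taken along the token axis. All function names are hypothetical.

```python
import numpy as np


def deinterleave(tokens):
    """(..., N, d) real with even d -> (..., N, d // 2) complex (assumed I/Q layout)."""
    return tokens[..., 0::2] + 1j * tokens[..., 1::2]


def interleave(z):
    """(..., N, d // 2) complex -> (..., N, d) real, inverse of `deinterleave`."""
    out = np.empty(z.shape[:-1] + (2 * z.shape[-1],), dtype=z.real.dtype)
    out[..., 0::2] = z.real
    out[..., 1::2] = z.imag
    return out


def to_frequency(tokens, axis=-2):
    """Eq. 59 sketch: FFT over the token axis (assumed), back to real layout."""
    return interleave(np.fft.fft(deinterleave(tokens), axis=axis))


def to_time(tokens, axis=-2):
    """Eq. 60 sketch: IFFT over the token axis (assumed), back to real layout."""
    return interleave(np.fft.ifft(deinterleave(tokens), axis=axis))


rng = np.random.default_rng(0)
T_t = rng.normal(size=(2, 16, 8))          # (B, N_t, d)
T_t_freq = to_frequency(T_t)
# Shapes are preserved and the two transforms invert each other.
print(T_t_freq.shape, np.allclose(to_time(T_t_freq), T_t))
```

Keeping the output real-valued and the same shape as the input is what allows the subsequent projection layers to stay real-valued.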

#### A.7.3 In-Domain Self-Focus

Before computing cross-domain relationships, we capture domain-specific patterns through Scaled Covariance Self-Focus (Section [A.6.1](https://arxiv.org/html/2606.04106#A1.SS6.SSS1)) within each domain:

Time-Domain Self-Focus:

$$\mathbf{O}_{\text{self}}^{(t)}=\text{Focus}(\mathbf{Q}_t^{(self)},\mathbf{K}_t^{(self)},\mathbf{V}_t^{(self)})\tag{61}$$

where $\mathbf{Q}_t^{(self)},\mathbf{K}_t^{(self)},\mathbf{V}_t^{(self)}$ are projections of $\mathbf{T}_f^{(time)}$ (frequency tokens transformed to time domain).

Frequency-Domain Self-Focus:

$$\mathbf{O}_{\text{self}}^{(f)}=\text{Focus}(\mathbf{Q}_f^{(self)},\mathbf{K}_f^{(self)},\mathbf{V}_f^{(self)})\tag{62}$$

where $\mathbf{Q}_f^{(self)},\mathbf{K}_f^{(self)},\mathbf{V}_f^{(self)}$ are projections of $\mathbf{T}_t^{(freq)}$ (time tokens transformed to frequency domain).

Rationale: In-domain self-focus captures patterns that are naturally expressed within a single domain: temporal ordering, causal dependencies, and sequential structure in time; spectral periodicity, harmonic relationships, and frequency correlations in frequency. These domain-specific patterns are complementary to cross-domain relationships and essential for comprehensive signal analysis.

#### A.7.4 Cross-Domain Parseval Focus

The core innovation of Multi\-Head Parseval Focus is the bidirectional cross\-domain attention mechanism with energy conservation constraints\.

##### A.7.4.1 Bidirectional Cross-Domain Attention

We compute attention in both directions:

Time-Query, Frequency-Key (T $\rightarrow$ F):

$$\mathbf{Q}_{tf}=\text{SoftAbsFloor}(\mathbf{W}_Q^{(tf)}\mathbf{T}_t)\tag{63}$$

$$\mathbf{K}_{tf}=\text{SoftAbsFloor}(\mathbf{W}_K^{(tf)}\mathbf{T}_f)\tag{64}$$

$$\mathbf{V}_{tf}=\mathbf{W}_V^{(tf)}\mathbf{T}_f\tag{65}$$

$$\mathbf{S}_{tf}=\frac{\text{Cov}(\mathbf{Q}_{tf},\mathbf{K}_{tf})}{\sqrt{d_k}\cdot K_{\text{focus}}^{(tf)}}\tag{66}$$

Frequency-Query, Time-Key (F $\rightarrow$ T):

$$\mathbf{Q}_{ft}=\text{SoftAbsFloor}(\mathbf{W}_Q^{(ft)}\mathbf{T}_f)\tag{67}$$

$$\mathbf{K}_{ft}=\text{SoftAbsFloor}(\mathbf{W}_K^{(ft)}\mathbf{T}_t)\tag{68}$$

$$\mathbf{V}_{ft}=\mathbf{W}_V^{(ft)}\mathbf{T}_t\tag{69}$$

$$\mathbf{S}_{ft}=\frac{\text{Cov}(\mathbf{Q}_{ft},\mathbf{K}_{ft})}{\sqrt{d_k}\cdot K_{\text{focus}}^{(ft)}}\tag{70}$$

where $K_{\text{focus}}$ is computed per Section [A.6.1.2](https://arxiv.org/html/2606.04106#A1.SS6.SSS1.P2).

##### A.7.4.2 Parseval Consistency via Jensen-Shannon Distance

For physically consistent representations, both cross-domain permutations should produce equivalent attention distributions. We enforce this through Jensen-Shannon Distance (JSD), a symmetric, bounded divergence measure.

We compute attention distributions with consistent normalization:

$$\mathbf{P}_{tf}=\text{softmax}_{axis=-1}(\mathbf{S}_{tf})\in\mathbb{R}^{B\times H\times N_t\times N_f}\tag{71}$$

$$\mathbf{P}_{ft}=\text{softmax}_{axis=-1}(\mathbf{S}_{ft})\in\mathbb{R}^{B\times H\times N_f\times N_t}\tag{72}$$

To align dimensions for comparison, we transpose the score matrices *before* applying softmax:

$$\tilde{\mathbf{P}}_{ft}=\text{softmax}_{axis=-1}(\mathbf{S}_{ft}^T)\in\mathbb{R}^{B\times H\times N_t\times N_f}\tag{74}$$

$$\tilde{\mathbf{P}}_{tf}=\text{softmax}_{axis=-1}(\mathbf{S}_{tf}^T)\in\mathbb{R}^{B\times H\times N_f\times N_t}\tag{75}$$

Now both $\mathbf{P}_{tf}$ and $\tilde{\mathbf{P}}_{ft}$ are valid probability distributions over the same event space (time tokens attending to frequency tokens), and similarly for the reverse direction. We compute bidirectional Jensen-Shannon Distance:

Time-to-Frequency Direction:

$$\mathbf{M}_{tf}=\frac{1}{2}(\mathbf{P}_{tf}+\tilde{\mathbf{P}}_{ft})\tag{77}$$

$$\text{KL}_{tf\to M}=\sum_{i,j}\mathbf{P}_{tf}[i,j]\log\frac{\mathbf{P}_{tf}[i,j]+\epsilon}{\mathbf{M}_{tf}[i,j]+\epsilon}\tag{78}$$

$$\text{KL}_{\tilde{ft}\to M}=\sum_{i,j}\tilde{\mathbf{P}}_{ft}[i,j]\log\frac{\tilde{\mathbf{P}}_{ft}[i,j]+\epsilon}{\mathbf{M}_{tf}[i,j]+\epsilon}\tag{79}$$

$$\text{JSD}_{time}=\frac{1}{2}(\text{KL}_{tf\to M}+\text{KL}_{\tilde{ft}\to M})\tag{80}$$

Frequency-to-Time Direction:

$$\mathbf{M}_{ft}=\frac{1}{2}(\mathbf{P}_{ft}+\tilde{\mathbf{P}}_{tf})\tag{82}$$

$$\text{JSD}_{freq}=\frac{1}{2}(\text{KL}_{ft\to M}+\text{KL}_{\tilde{tf}\to M})\tag{83}$$

The JSD reweighting factors are computed as:

$$w_{time}=(1-\text{JSD}_{time})\tag{85}$$

$$w_{freq}=(1-\text{JSD}_{freq})\tag{86}$$

Rationale: By transposing before softmax normalization, we ensure both distributions represent the same conditional probability structure (e.g., $P(freq|time)$), making the JSD mathematically well-defined. High JSD indicates inconsistent cross-domain relationships, which we down-weight; low JSD indicates consistent relationships, which we amplify.

Dual Role of JSD: The JSD serves two complementary purposes:

(1) Regularization Loss:

$$\mathcal{L}_{\text{Parseval}}=\mathbb{E}[\text{JSD}_{time}(\mathbf{P}_{tf}\,\|\,\tilde{\mathbf{P}}_{ft})]+\mathbb{E}[\text{JSD}_{freq}(\mathbf{P}_{ft}\,\|\,\tilde{\mathbf{P}}_{tf})]\tag{88}$$

Minimizing this bidirectional loss encourages the model to learn cross-domain relationships that are consistent regardless of which domain serves as query or key, a direct analog of Parseval’s energy conservation. The tilde notation ($\tilde{\mathbf{P}}$) indicates distributions computed by transposing scores before softmax normalization, ensuring both distributions share the same probability space for valid divergence computation.

(2) Dynamic Reweighting:

$$w_{\text{Parseval}}^{time}=1-\text{JSD}_{time}\in[0,1]\tag{89}$$

$$w_{\text{Parseval}}^{freq}=1-\text{JSD}_{freq}\in[0,1]\tag{90}$$

$$\mathbf{S}_{tf}^{\text{reweighted}}=\mathbf{S}_{tf}\cdot w_{\text{Parseval}}^{time}\tag{91}$$

$$\mathbf{S}_{ft}^{\text{reweighted}}=\mathbf{S}_{ft}\cdot w_{\text{Parseval}}^{freq}\tag{92}$$
This reweighting amplifies attention patterns that exhibit Parseval consistency \(low JSD\) and suppresses physically inconsistent patterns \(high JSD\)\. Each direction \(time\-to\-frequency and frequency\-to\-time\) is weighted independently based on its respective consistency\. The mechanism is fully differentiable and learned end\-to\-end\.
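The transpose-before-softmax comparison and the resulting reweighting can be sketched for a single head in NumPy. This is a hedged illustration rather than the authors' code. In particular, it assumes base-2 logarithms and averages the divergence over rows (instead of summing over all $i,j$) so that each JSD value, and therefore each weight, stays within $[0,1]$; the function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def row_jsd(P, Q, eps=1e-12):
    """Row-wise Jensen-Shannon divergence with base-2 logs (each value in [0, 1])."""
    M = 0.5 * (P + Q)
    kl_pm = np.sum(P * np.log2((P + eps) / (M + eps)), axis=-1)
    kl_qm = np.sum(Q * np.log2((Q + eps) / (M + eps)), axis=-1)
    return 0.5 * (kl_pm + kl_qm)


def parseval_reweight(S_tf, S_ft):
    """S_tf: (N_t, N_f) and S_ft: (N_f, N_t) score matrices for one head."""
    P_tf = softmax(S_tf)                  # Eq. 71
    P_ft = softmax(S_ft)                  # Eq. 72
    P_ft_tilde = softmax(S_ft.T)          # Eq. 74: transpose first, shape (N_t, N_f)
    P_tf_tilde = softmax(S_tf.T)          # Eq. 75: shape (N_f, N_t)
    # Assumption: mean over rows keeps each JSD in [0, 1]; clip guards round-off.
    jsd_time = float(np.clip(row_jsd(P_tf, P_ft_tilde).mean(), 0.0, 1.0))
    jsd_freq = float(np.clip(row_jsd(P_ft, P_tf_tilde).mean(), 0.0, 1.0))
    loss = jsd_time + jsd_freq            # single-sample analog of Eq. 88
    w_time, w_freq = 1.0 - jsd_time, 1.0 - jsd_freq     # Eq. 89-90
    return S_tf * w_time, S_ft * w_freq, loss           # Eq. 91-92


rng = np.random.default_rng(0)
S = rng.normal(size=(4, 5))
# Perfectly consistent directions (S_ft = S_tf^T) give zero loss and unit weights.
_, _, loss_consistent = parseval_reweight(S, S.T)
_, _, loss_random = parseval_reweight(S, rng.normal(size=(5, 4)))
print(loss_consistent < 1e-9, loss_random > 0.0)
```

When the two score matrices are exact transposes of each other the loss vanishes and the scores pass through unchanged, which matches the intended notion of cross-domain consistency.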

Physical Interpretation: From a signal processing perspective, Parseval consistency ensures that the attention mechanism respects the fundamental duality between time and frequency domains. If a time-domain feature strongly attends to a frequency-domain feature, the reverse relationship should hold with equivalent strength, just as energy in a time-domain signal component corresponds to energy in its frequency-domain counterpart.

##### A.7.4.3 Cross-Domain Focus Output

After JSD reweighting, we compute the final cross-domain attended representations:

$$\mathbf{O}_{\text{cross}}^{(t)}=\mathbf{S}_{tf}^{\text{reweighted}}\cdot\mathbf{V}_{tf}\in\mathbb{R}^{B\times H\times N_t\times d_k}\tag{93}$$

$$\mathbf{O}_{\text{cross}}^{(f)}=\mathbf{S}_{ft}^{\text{reweighted}}\cdot\mathbf{V}_{ft}\in\mathbb{R}^{B\times H\times N_f\times d_k}\tag{94}$$

Notably, we use the domain-specific Parseval-reweighted score matrices directly rather than wrapping them in a softmax normalization. Relying on Parseval reweighting alone mitigates the over-smoothing potential of softmax normalization and instead applies a physically meaningful gate to the attention score matrix before it is multiplied against $\mathbf{V}$.

#### A.7.5 Strategic Fusion and Diversity Regularization

To produce comprehensive signal representations, we fuse in-domain self-focus with cross-domain Parseval focus.

##### A.7.5.1 Domain-Specific Fusion

For each domain, we concatenate self-focus and cross-focus outputs and apply gated fusion:

Time Domain:

$$\mathbf{O}_{\text{concat}}^{(t)}=\text{Concat}([\mathbf{O}_{\text{self}}^{(t)},\mathbf{O}_{\text{cross}}^{(t)}],\text{dim}=-1)\in\mathbb{R}^{B\times N_t\times 2d}\tag{95}$$

$$\mathbf{O}_{\text{fused}}^{(t)}=\text{GLU}(\mathbf{O}_{\text{concat}}^{(t)})\in\mathbb{R}^{B\times N_t\times d}\tag{96}$$

Frequency Domain:

$$\mathbf{O}_{\text{concat}}^{(f)}=\text{Concat}([\mathbf{O}_{\text{self}}^{(f)},\mathbf{O}_{\text{cross}}^{(f)}],\text{dim}=-1)\in\mathbb{R}^{B\times N_f\times 2d}\tag{97}$$

$$\mathbf{O}_{\text{fused}}^{(f)}=\text{GLU}(\mathbf{O}_{\text{concat}}^{(f)})\in\mathbb{R}^{B\times N_f\times d}\tag{98}$$
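The shape bookkeeping of the gated fusion ($2d\rightarrow d$) can be illustrated with the standard gated linear unit, which splits its input in half along the last axis and gates the first half with the sigmoid of the second. Whether the authors place a learned linear layer before the split is not stated, so the sketch below assumes the plain split form; the function names are hypothetical.

```python
import numpy as np


def glu(x):
    """Gated linear unit: split the last axis into (a, b) and return a * sigmoid(b)."""
    a, b = np.split(x, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))


def fuse(O_self, O_cross):
    """Eq. 95-96 sketch: concatenate on the feature axis, then gate back down to d."""
    O_concat = np.concatenate([O_self, O_cross], axis=-1)   # (B, N, 2d)
    return glu(O_concat)                                     # (B, N, d)


O_self = np.ones((2, 3, 4))
O_cross = np.zeros((2, 3, 4))
fused = fuse(O_self, O_cross)
# With a zero gate input, sigmoid(0) = 0.5 scales the self-focus output by one half.
print(fused.shape, np.allclose(fused, 0.5))
```

Under this reading the cross-focus output acts as the gate on the self-focus output; a learned projection ahead of the split would let both halves mix before gating.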

##### A.7.5.2 Diversity Regularization

A critical risk in this architecture is that self-focus and cross-focus mechanisms could learn redundant representations, wasting capacity. We enforce diversity through an orthogonality regularization loss:

$$\bar{\mathbf{O}}_{\text{self}}^{(t)}=\frac{\mathbf{O}_{\text{self}}^{(t)}}{\|\mathbf{O}_{\text{self}}^{(t)}\|_2+\epsilon}\quad\text{(normalize)}\tag{99}$$

$$\bar{\mathbf{O}}_{\text{cross}}^{(t)}=\frac{\mathbf{O}_{\text{cross}}^{(t)}}{\|\mathbf{O}_{\text{cross}}^{(t)}\|_2+\epsilon}\tag{100}$$

$$\mathcal{L}_{\text{diversity}}^{(t)}=\left(\sum_i\bar{\mathbf{O}}_{\text{self}}^{(t)}[i]\cdot\bar{\mathbf{O}}_{\text{cross}}^{(t)}[i]\right)^2\tag{101}$$

Similarly for the frequency domain. The total diversity loss is:

$$\mathcal{L}_{\text{diversity}}=\frac{1}{2}\left(\mathcal{L}_{\text{diversity}}^{(t)}+\mathcal{L}_{\text{diversity}}^{(f)}\right)+\lambda_{\text{norm}}\mathcal{L}_{\text{norm}}\tag{102}$$

where $\mathcal{L}_{\text{norm}}$ penalizes representations with insufficient magnitude:

$$\mathcal{L}_{\text{norm}}=\sum_{X\in\{\mathbf{O}_{\text{self}}^{(t)},\mathbf{O}_{\text{cross}}^{(t)},\mathbf{O}_{\text{self}}^{(f)},\mathbf{O}_{\text{cross}}^{(f)}\}}\text{ReLU}(\sqrt{d}-\|X\|_2)\tag{103}$$
This prevents the trivial solution where representations collapse to zero to satisfy orthogonality\.
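A small NumPy sketch shows how the two parts of the diversity loss interact. The reduction axes are not specified in the text, so the sketch assumes that norms and inner products are taken per token over the feature axis and then averaged over tokens and batch. The value of $\lambda_{\text{norm}}$ is not given either, so it is left as a required argument; the function names are hypothetical.

```python
import numpy as np


def squared_cosine(a, b, eps=1e-8):
    """Eq. 99-101 sketch: mean squared cosine similarity per token (assumed axes)."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return np.mean(np.sum(a_n * b_n, axis=-1) ** 2)


def norm_penalty(x):
    """One summand of Eq. 103: ReLU(sqrt(d) - ||x||_2), averaged over tokens."""
    d = x.shape[-1]
    return np.mean(np.maximum(np.sqrt(d) - np.linalg.norm(x, axis=-1), 0.0))


def diversity_loss(O_self_t, O_cross_t, O_self_f, O_cross_f, lambda_norm):
    """Eq. 102: averaged overlap plus a weighted magnitude penalty."""
    overlap = 0.5 * (squared_cosine(O_self_t, O_cross_t)
                     + squared_cosine(O_self_f, O_cross_f))
    magnitude = sum(norm_penalty(x)
                    for x in (O_self_t, O_cross_t, O_self_f, O_cross_f))
    return overlap + lambda_norm * magnitude


a = np.zeros((1, 3, 4)); a[..., 0] = 2.0     # norm 2 = sqrt(d) for d = 4
b = np.zeros((1, 3, 4)); b[..., 1] = 2.0     # orthogonal to a
zero = np.zeros((1, 3, 4))
print(diversity_loss(a, b, a, b, lambda_norm=0.1))            # orthogonal, large enough
print(diversity_loss(a, a, a, a, lambda_norm=0.1))            # fully redundant
print(diversity_loss(zero, zero, zero, zero, lambda_norm=0.1))  # collapsed
```

The collapsed case has zero overlap but a nonzero total loss, which is the role of the magnitude penalty.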

##### A.7.5.3 Final Cross-Domain Fusion

Finally, we fuse the time and frequency domain representations:

$$\mathbf{O}_{\text{final}}=\text{GLU}(\text{Concat}([\mathbf{O}_{\text{fused}}^{(t)},\mathbf{O}_{\text{fused}}^{(f)}],\text{dim}=-1))\tag{104}$$

#### A.7.6 Complete Multi-Head Parseval Focus Algorithm

Algorithm [6](https://arxiv.org/html/2606.04106#alg6) summarizes the complete forward pass.

Algorithm 6: Multi-Head Parseval Focus

Input: Time tokens $\mathbf{T}_t$, frequency tokens $\mathbf{T}_f$

Output: Fused output $\mathbf{O}_{\text{final}}$, losses $\{\mathcal{L}_{\text{orth}},\mathcal{L}_{\text{Parseval}},\mathcal{L}_{\text{diversity}}\}$

1. Domain Preparation:
   - $\mathbf{T}_t^{(freq)}\leftarrow\text{FFT}(\mathbf{T}_t)$
   - $\mathbf{T}_f^{(time)}\leftarrow\text{IFFT}(\mathbf{T}_f)$
2. In-Domain Self-Focus:
   - $\mathbf{O}_{\text{self}}^{(t)},\mathcal{L}_{\text{orth}}^{(t)}\leftarrow\text{ScaledCovarianceSelfFocus}(\mathbf{T}_f^{(time)})$
   - $\mathbf{O}_{\text{self}}^{(f)},\mathcal{L}_{\text{orth}}^{(f)}\leftarrow\text{ScaledCovarianceSelfFocus}(\mathbf{T}_t^{(freq)})$
3. Cross-Domain Parseval Focus:
   - $\mathbf{O}_{\text{cross}}^{(t)},\mathbf{O}_{\text{cross}}^{(f)},\mathcal{L}_{\text{orth}}^{(cross)},\mathcal{L}_{\text{Parseval}}\leftarrow\text{ScaledCovarianceParsevalAttention}(\mathbf{T}_t,\mathbf{T}_f)$
4. Domain-Specific Fusion:
   - $\mathbf{O}_{\text{fused}}^{(t)}\leftarrow\text{GLU}(\text{Concat}([\mathbf{O}_{\text{self}}^{(t)},\mathbf{O}_{\text{cross}}^{(t)}]))$
   - $\mathbf{O}_{\text{fused}}^{(f)}\leftarrow\text{GLU}(\text{Concat}([\mathbf{O}_{\text{self}}^{(f)},\mathbf{O}_{\text{cross}}^{(f)}]))$
5. Diversity Regularization:
   - $\mathcal{L}_{\text{diversity}}\leftarrow\text{ComputeDiversityLoss}(\mathbf{O}_{\text{self}}^{(t)},\mathbf{O}_{\text{cross}}^{(t)},\mathbf{O}_{\text{self}}^{(f)},\mathbf{O}_{\text{cross}}^{(f)})$
6. Final Cross-Domain Fusion:
   - $\mathbf{O}_{\text{final}}\leftarrow\text{GLU}(\text{Concat}([\mathbf{O}_{\text{fused}}^{(t)},\mathbf{O}_{\text{fused}}^{(f)}]))$
7. Loss aggregation:
   - $\mathcal{L}_{\text{orth}}\leftarrow\mathcal{L}_{\text{orth}}^{(t)}+\mathcal{L}_{\text{orth}}^{(f)}+\mathcal{L}_{\text{orth}}^{(cross)}$
8. Return $\mathbf{O}_{\text{final}},\{\mathcal{L}_{\text{orth}},\mathcal{L}_{\text{Parseval}},\mathcal{L}_{\text{diversity}}\}$

#### A.7.7 Summary of Regularization Losses

Table [6](https://arxiv.org/html/2606.04106#A1.T6) summarizes all regularization losses in the Multi-Head Parseval Focus mechanism.

Table 6: Multi-Head Parseval Focus Regularization Losses

| Loss | Purpose | Weight |
| --- | --- | --- |
| $\mathcal{L}_{\text{orth}}$ | Encourage diverse attention heads (Section [A.6.2](https://arxiv.org/html/2606.04106#A1.SS6.SSS2)) | $\lambda_{\text{orth}}$ |
| $\mathcal{L}_{\text{Parseval}}$ | Enforce cross-domain consistency via JSD | $\lambda_{\text{Parseval}}$ |
| $\mathcal{L}_{\text{diversity}}$ | Prevent redundancy between self-focus and cross-focus | $\lambda_{\text{diversity}}$ |

#### A.7.8 Design Rationale and Physical Grounding

##### A.7.8.1 Why Bidirectional Cross-Domain Attention?

Parseval’s theorem establishes a symmetric relationship between time and frequency domains. Our bidirectional attention architecture directly models this symmetry: if time-domain feature $i$ relates to frequency-domain feature $j$, the reverse relationship should hold with equivalent strength. The JSD regularization enforces this consistency.

##### A\.7\.8\.2Why JSD Instead of KL Divergence?

KL divergence is asymmetric: $\text{KL}(P\|Q) \neq \text{KL}(Q\|P)$. For Parseval consistency, we require a symmetric measure. JSD is the symmetrized, bounded variant of KL divergence:

$$\text{JSD}(P\|Q) = \frac{1}{2}\text{KL}(P\|M) + \frac{1}{2}\text{KL}(Q\|M), \quad M = \frac{1}{2}(P+Q) \tag{105}$$

satisfying $\text{JSD}(P\|Q) = \text{JSD}(Q\|P)$ and $0 \leq \text{JSD} \leq 1$ when logarithms are taken base 2 (with natural logarithms the upper bound is $\ln 2$).
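The symmetry and boundedness of JSD can be checked numerically. The following is a minimal sketch for discrete distributions using base-2 logarithms (for which the upper bound is 1); the function names and the example distributions are illustrative and not taken from the paper's implementation.

```python
import math

def kl(p, q, base=2.0):
    """KL(P||Q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q, base=2.0):
    """Jensen-Shannon divergence: average KL of P and Q to their mixture M."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m, base) + 0.5 * kl(q, m, base)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

kl_gap = abs(kl(p, q) - kl(q, p))      # nonzero: KL is asymmetric
jsd_gap = abs(jsd(p, q) - jsd(q, p))   # zero up to rounding: JSD is symmetric
jsd_max = jsd([1.0, 0.0], [0.0, 1.0])  # disjoint supports reach the bound of 1
```

Because the mixture $M$ is positive wherever $P$ or $Q$ is, the KL terms inside `jsd` never divide by zero, which is one practical reason to prefer JSD over raw KL for attention distributions that may contain zeros.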

##### A\.7\.8\.3Why In\-Domain Self\-Focus \+ Cross\-Domain Focus?

Comprehensive signal analysis requires both:

- Domain-specific patterns: Temporal ordering, causality (time); spectral periodicity, harmonics (frequency)
- Cross-domain relationships: How time-domain events manifest in frequency, and vice versa

Self\-focus captures the former, cross\-focus the latter\. The diversity regularization ensures they learn complementary \(not redundant\) representations\.

##### A\.7\.8\.4Connection to Classical Signal Processing

Multi\-Head Parseval Focus can be viewed as a learned, adaptive generalization of classical time\-frequency analysis:

- STFT/Spectrogram: Fixed windows, fixed basis functions
- Wavelet Transform: Fixed mother wavelet, fixed time-frequency resolution trade-off
- Parseval Focus: Learned windows (tokenization), learned basis functions (projections), learned time-frequency relationships (cross-domain attention), adaptive resolution (dynamic focus)

The Parseval consistency constraint ensures the learned transform respects fundamental physical principles while adapting to task\-specific requirements\.

![Refer to caption](https://arxiv.org/html/2606.04106v1/x2.png) Figure 7: High-level block diagram of the PlanFormer encoder. Domain-specific transformations are tokenized via learned convolutions → the resulting domain-specific tokens receive cross-domain context injection and are then processed by the Parseval Transformer → post-Parseval-Transformer tokens receive cross-domain context injection prior to sequence pooling → one last round of cross-domain information injection is applied before the final encodings are produced per domain.

### A\.8Attentional Sequence Pooling

Following the Parseval Transformer blocks and final cross\-domain fusion, we must aggregate the variable\-length token sequences into fixed\-size latent representations suitable for downstream tasks\. We employ attentional sequence pooling\[[25](https://arxiv.org/html/2606.04106#bib.bib25)\], which learns to weight tokens by their global relevance rather than applying uniform pooling\.

#### A\.8\.1Linear Attention Pooling Mechanism

Given token sequences from each domain after the final cross\-domain fusion:

$$\mathbf{T}_{t}^{\text{final}} \in \mathbb{R}^{B \times N_{t} \times d} \quad \text{(time domain)} \tag{106}$$

$$\mathbf{T}_{f}^{\text{final}} \in \mathbb{R}^{B \times N_{f} \times d} \quad \text{(frequency domain)} \tag{107}$$

we compute domain-specific pooled representations through learned attention weights.

##### A\.8\.1\.1Pooling Procedure \(per domain\):

Algorithm 7 Attentional Sequence Pooling

Input: Token sequence $\mathbf{T} \in \mathbb{R}^{B \times N \times d}$

Output: Pooled representation $\mathbf{z} \in \mathbb{R}^{B \times d}$

1: {Step 1: Normalize}

2: $\mathbf{T}_{\text{norm}} \leftarrow \text{RMSNorm}(\mathbf{T})$

3: {Step 2: Compute attention logits}

4: $\mathbf{a}_{\text{logits}} \leftarrow \mathbf{W}_{a}\mathbf{T}_{\text{norm}} + \mathbf{b}_{a} \quad \{\in \mathbb{R}^{B \times N \times 1}\}$

5: {Step 3: Softmax over sequence dimension}

6: $\mathbf{a} \leftarrow \text{softmax}(\mathbf{a}_{\text{logits}}, \text{dim}=1) \quad \{\in \mathbb{R}^{B \times N \times 1}\}$

7: {Step 4: Weighted aggregation}

8: $\mathbf{z} \leftarrow \sum_{i=1}^{N} \mathbf{a}[i] \cdot \mathbf{T}[i] \quad \{= \mathbf{a}^{T}\mathbf{T} \in \mathbb{R}^{B \times d}\}$

9: return $\mathbf{z}$

Mathematically, for each domain:

$$\mathbf{T}_{\text{norm}} = \text{RMSNorm}(\mathbf{T}) \tag{108}$$

$$\mathbf{a}_{\text{logits}} = \mathbf{W}_{a}\mathbf{T}_{\text{norm}} + \mathbf{b}_{a} \in \mathbb{R}^{B \times N \times 1} \tag{109}$$

$$\mathbf{a} = \text{softmax}(\mathbf{a}_{\text{logits}}, \text{dim}=1) \tag{110}$$

$$\mathbf{z} = \sum_{i=1}^{N} a_{i}\mathbf{T}_{i} = \mathbf{a}^{T}\mathbf{T} \in \mathbb{R}^{B \times d} \tag{111}$$

where $\mathbf{W}_{a} \in \mathbb{R}^{1 \times d}$ and $\mathbf{b}_{a} \in \mathbb{R}$ are learned parameters.

Computational Efficiency: This linear attention mechanism scales as $O(Nd)$ compared to $O(N^{2}d)$ for standard self-attention, making it suitable for long sequences. The pooler learns a global token importance distribution over the $N$ tokens, which is applied uniformly across all $d$ features. Specifically, $\mathbf{W}_{a} \in \mathbb{R}^{1 \times d}$ projects each token to a scalar attention logit, producing a single attention distribution $\mathbf{a} \in \mathbb{R}^{N \times 1}$ that determines which tokens are most relevant for the overall representation. The resulting latent $\mathbf{z} \in \mathbb{R}^{d}$ is computed as $\mathbf{z} = \sum_{i=1}^{N} a_{i}\mathbf{T}_{i}$, where each dimension of $\mathbf{z}$ is a weighted combination of the corresponding feature across all tokens.
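The pooling procedure in Equations (108) to (111) can be sketched in NumPy as follows. This is an illustrative reading rather than the authors' code: it assumes RMSNorm is applied over the feature axis without a learned gain, and the sizes ($B=2$, $N=80$, $d=128$) and random weights are placeholders.

```python
import numpy as np

def rms_norm(t, eps=1e-8):
    """RMSNorm over the feature axis, without a learned gain (assumption)."""
    rms = np.sqrt(np.mean(t * t, axis=-1, keepdims=True) + eps)
    return t / rms

def softmax(x, axis):
    x = x - np.max(x, axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attentional_pool(t, w_a, b_a):
    """t: (B, N, d), w_a: (1, d), b_a: scalar. Returns z: (B, d), a: (B, N, 1)."""
    t_norm = rms_norm(t)
    logits = t_norm @ w_a.T + b_a   # one scalar logit per token: (B, N, 1)
    a = softmax(logits, axis=1)     # normalize over the sequence dimension
    z = np.sum(a * t, axis=1)       # weighted sum of the original tokens
    return z, a

rng = np.random.default_rng(0)
B, N, d = 2, 80, 128
tokens = rng.standard_normal((B, N, d))
w_a = 0.1 * rng.standard_normal((1, d))
z, a = attentional_pool(tokens, w_a, 0.0)
```

With `w_a` set to zero the logits are constant, the weights become uniform, and the pooler reduces to mean pooling, which is a convenient sanity check on the $O(Nd)$ path.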

#### A\.8\.2Domain\-Specific Pooled Representations

Applying this procedure to both domains yields:

$$\mathbf{z}_{t} = \text{AttentionalPool}(\mathbf{T}_{t}^{\text{final}}) \in \mathbb{R}^{B \times d} \tag{112}$$

$$\mathbf{z}_{f} = \text{AttentionalPool}(\mathbf{T}_{f}^{\text{final}}) \in \mathbb{R}^{B \times d} \tag{113}$$

These fixed-size representations encode:

- $\mathbf{z}_{t}$: Global temporal structure, causal dependencies, time-domain events
- $\mathbf{z}_{f}$: Global spectral structure, frequency content, harmonic relationships

#### A\.8\.3Semantic Interpretation: Relative Globality

A key insight is that "global" is relative to the input sequence length:

Short Sequences ($N \sim 100$): The pooled representation captures local, fine-grained descriptions. For example, in speech processing, this might encode phoneme-level information.

Long Sequences ($N \sim 1000$): The pooled representation summarizes global structure. In the same speech example, this would encode sentence-level semantics or speaker characteristics.

This scale\-adaptive behavior emerges naturally from the windowed tokenization and attentional pooling: the network learns to aggregate information at the appropriate level of abstraction for the given sequence length\.

### A\.9Dual\-Domain Latent Representation

Figure [7](https://arxiv.org/html/2606.04106#A1.F7) illustrates the complete encoder pipeline, from dual-domain preprocessing through tokenization, Parseval Transformer processing, and final sequence pooling.

At the conclusion of encoder processing, we obtain a pair of complementary fixed\-size latent representations:

$$\mathbf{z} = \{\mathbf{z}_{t}, \mathbf{z}_{f}\} \quad \text{where } \mathbf{z}_{t}, \mathbf{z}_{f} \in \mathbb{R}^{d} \tag{114}$$

#### A\.9\.1Flexible Downstream Usage

These dual\-domain representations provide a general signal embedding with flexible usage patterns:

\(1\) Independent Usage:

- Use $\mathbf{z}_{t}$ alone for time-domain-specific tasks (e.g., temporal event detection)
- Use $\mathbf{z}_{f}$ alone for frequency-domain-specific tasks (e.g., spectral classification)

(2) Concatenated Usage:

$$\mathbf{z}_{\text{concat}} = [\mathbf{z}_{t}; \mathbf{z}_{f}] \in \mathbb{R}^{2d} \tag{115}$$

This provides a comprehensive time-frequency representation for tasks requiring both perspectives.

(3) Task-Specific Fusion: Downstream tasks can apply learned fusion (e.g., attention-weighted combination, gated fusion) tailored to task requirements.

#### A\.9\.2General Embedding Properties

The dual\-domain latent representation exhibits several desirable properties:

Task Agnostic: No task-specific architectural modifications required. The same encoder produces embeddings suitable for:

- Classification (e.g., modulation recognition, speaker identification)
- Regression (e.g., SNR estimation, parameter prediction)
- Retrieval (e.g., signal similarity search)
- Generation (e.g., signal reconstruction, denoising)

Domain Agnostic: Through frequency-domain pooling (Section [A.4](https://arxiv.org/html/2606.04106#A1.SS4)) and Parseval Focus (Section [A.7](https://arxiv.org/html/2606.04106#A1.SS7)), the encoder learns representations that generalize across signal types (RF, audio, seismic) without domain-specific retraining.

Scale Adaptive: The relative globality property enables the same architecture to process signals at multiple scales, from fine-grained local analysis to coarse-grained global summarization.

#### A\.9\.3Extension to Variable\-Length Sequences

While the encoder architecture assumes fixed\-size input windows \(5120 samples = 5 windows × 1024 samples/window\) during training, deployment often requires processing signals of arbitrary length\. We extend to variable\-length processing through a simple but effective strategy that leverages the encoder’s natural structure\.

##### A\.9\.3\.1Variable\-Length Processing Strategy

For an input signal of arbitrary length $L_{\text{input}}$:

Step 1: Segment into Fixed-Size Windows. Partition the input into non-overlapping segments of the training window size (5120 samples):

$$\mathbf{x}_{\text{input}} \rightarrow \{\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{M}\} \quad \text{where } M = \lceil L_{\text{input}}/5120 \rceil \tag{116}$$

If the final segment is shorter than 5120 samples, it can be resampled to 5120 depending on the application.

Step 2: Process Segments Through Encoder. Each segment is processed independently through the complete encoder pipeline (tokenization, Parseval Transformer, cross-domain fusion), producing token-level representations:

$$\mathbf{x}_{i} \rightarrow \{\mathbf{T}_{t}^{(i)}, \mathbf{T}_{f}^{(i)}\} \quad \text{for } i = 1, \ldots, M \tag{117}$$

where $\mathbf{T}_{t}^{(i)} \in \mathbb{R}^{N_{t} \times d}$ and $\mathbf{T}_{f}^{(i)} \in \mathbb{R}^{N_{f} \times d}$.

Step 3: Domain-Specific Aggregation. Aggregate tokens across segments using domain-appropriate strategies:

Time Domain (Concatenation):

$$\mathbf{T}_{t}^{\text{concat}} = \text{Concat}([\mathbf{T}_{t}^{(1)}, \mathbf{T}_{t}^{(2)}, \ldots, \mathbf{T}_{t}^{(M)}], \text{dim}=0) \in \mathbb{R}^{(M \cdot N_{t}) \times d} \tag{118}$$

This preserves temporal ordering across the entire input signal, creating a unified sequence representation.

Frequency Domain (Averaging):

$$\mathbf{T}_{f}^{\text{avg}} = \frac{1}{M}\sum_{i=1}^{M} \mathbf{T}_{f}^{(i)} \in \mathbb{R}^{N_{f} \times d} \tag{119}$$

This produces a time-averaged spectral representation, consistent with the latent space averaging strategy described in Section [A.4.2.6](https://arxiv.org/html/2606.04106#A1.SS4.SSS2.P6).

Step 4: Attentional Pooling. Apply the sequence pooler to the aggregated token representations:

$$\mathbf{z}_{t} = \text{AttentionalPool}(\mathbf{T}_{t}^{\text{concat}}) \in \mathbb{R}^{d} \tag{120}$$

$$\mathbf{z}_{f} = \text{AttentionalPool}(\mathbf{T}_{f}^{\text{avg}}) \in \mathbb{R}^{d} \tag{121}$$

Step 5: Final Cross-Domain Fusion. Apply the post-pooling cross-domain fusion (Section [A.9](https://arxiv.org/html/2606.04106#A1.SS9)) to produce the final dual-domain representation.

##### A\.9\.3\.2Rationale and Benefits

This strategy provides several advantages:

(1) Consistency with Training: Each segment is processed through the encoder with the same input size used during training, ensuring consistent feature extraction without distribution shift.

(2) Leveraging Translational Equivariance: The use of non-overlapping windows exploits the translational equivariance properties learned during training. Our architecture is explicitly designed to learn shift-invariant representations through:

- Convolutional tokenization: Convolution's inherent translational equivariance ensures features detected at position $t$ in one window transfer to position $t$ in other windows
- Co-designed training losses: Our training objectives (Section [3.4.2](https://arxiv.org/html/2606.04106#S3.SS4.SSS2)) explicitly encourage translational equivariance in time, ensuring features shift predictably regardless of their position in the sequence

This means a feature learned at the beginning of segment $i$ should produce equivalent representations when it appears at the end of segment $i-1$ or the beginning of segment $i+1$. Non-overlapping windows leverage this learned symmetry: if the model has successfully learned translational equivariance, segment boundaries become arbitrary reference points rather than discontinuities in the representation space.

(3) Domain-Appropriate Aggregation: Time-domain concatenation preserves causal structure across the full signal, while frequency-domain averaging produces a global spectral summary, consistent with the physical interpretation of each domain.

(4) Scalability: The approach scales to arbitrarily long signals without architectural modifications or retraining. The attentional pooler naturally adapts to the increased sequence length in the time domain ($M \cdot N_{t}$ tokens) while the frequency domain remains fixed ($N_{f}$ tokens).

(5) Memory Efficiency: By processing segments independently through the encoder, peak memory usage is bounded by the single-segment processing cost, avoiding the quadratic memory growth of processing the entire signal as one sequence.
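The translational-equivariance argument in point (2) rests on a basic property of convolution: shifting the input and then filtering gives the same result as filtering and then shifting. The sketch below checks this for a circular 1D cross-correlation with a hypothetical three-tap kernel. It is an idealized illustration; practical convolutional tokenizers with zero padding, stride, or pooling satisfy the property only approximately, or only for shifts that are multiples of the stride.

```python
import numpy as np

def circular_conv1d(x, kernel):
    """Circular cross-correlation: y[n] = sum_k kernel[k] * x[(n + k) mod L]."""
    return sum(w * np.roll(x, -k) for k, w in enumerate(kernel))

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
kernel = np.array([0.25, 0.5, 0.25])  # hypothetical smoothing filter
shift = 5

shift_then_conv = circular_conv1d(np.roll(x, shift), kernel)
conv_then_shift = np.roll(circular_conv1d(x, kernel), shift)

equivariant = np.allclose(shift_then_conv, conv_then_shift)
```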

##### A\.9\.3\.3Implementation Note

Algorithm [8](https://arxiv.org/html/2606.04106#alg8) summarizes the complete procedure.

Algorithm 8 Variable-Length Signal Processing

Input: Input signal $\mathbf{x}_{\text{input}}$ of arbitrary length $L_{\text{input}}$

Input: Segment size $L_{\text{seg}} = 5120$ (training window size)

Output: Dual-domain latent representation $\{\mathbf{z}_{t}, \mathbf{z}_{f}\}$

1: {Segment input signal}

2: $M \leftarrow \lceil L_{\text{input}}/L_{\text{seg}} \rceil$

3: $\{\mathbf{x}_{1}, \ldots, \mathbf{x}_{M}\} \leftarrow \text{Segment}(\mathbf{x}_{\text{input}}, L_{\text{seg}})$

4: {Process each segment through encoder}

5: for $i = 1$ to $M$ do

6: $\quad \mathbf{T}_{t}^{(i)}, \mathbf{T}_{f}^{(i)} \leftarrow \text{Encoder}(\mathbf{x}_{i})$ {Token-level representations}

7: end for

8: {Domain-specific aggregation}

9: $\mathbf{T}_{t}^{\text{concat}} \leftarrow \text{Concat}([\mathbf{T}_{t}^{(1)}, \ldots, \mathbf{T}_{t}^{(M)}], \text{dim}=0)$ {Time: concatenate}

10: $\mathbf{T}_{f}^{\text{avg}} \leftarrow \frac{1}{M}\sum_{i=1}^{M} \mathbf{T}_{f}^{(i)}$ {Frequency: average}

11: {Attentional pooling}

12: $\mathbf{z}_{t}^{\text{pre}} \leftarrow \text{AttentionalPool}(\mathbf{T}_{t}^{\text{concat}})$

13: $\mathbf{z}_{f}^{\text{pre}} \leftarrow \text{AttentionalPool}(\mathbf{T}_{f}^{\text{avg}})$

14: {Post-pooling cross-domain fusion}

15: $\mathbf{z}_{t} \leftarrow \text{GLU}([\mathbf{z}_{t}^{\text{pre}}; \text{Transform}_{f \rightarrow t}(\mathbf{z}_{f}^{\text{pre}})])$

16: $\mathbf{z}_{f} \leftarrow \text{GLU}([\mathbf{z}_{f}^{\text{pre}}; \text{Transform}_{t \rightarrow f}(\mathbf{z}_{t}^{\text{pre}})])$

17: return $\{\mathbf{z}_{t}, \mathbf{z}_{f}\}$
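The segmentation and aggregation steps of the variable-length procedure can be sketched in NumPy. The encoder here is a hypothetical stand-in (`toy_encoder`) that only produces tokens of the right shape, the final segment is zero-padded rather than resampled, the input is real-valued rather than complex IQ, and the token counts $N_t = N_f = 80$ and width $d = 128$ are assumptions for illustration. Pooling and post-pooling fusion are omitted.

```python
import math
import numpy as np

L_SEG = 5120  # training window size stated in the text

def segment(x, l_seg=L_SEG):
    """Split x into ceil(len(x) / l_seg) segments, zero-padding the last one
    (assumption: the text suggests resampling, depending on the application)."""
    m = math.ceil(len(x) / l_seg)
    padded = np.zeros(m * l_seg, dtype=x.dtype)
    padded[: len(x)] = x
    return padded.reshape(m, l_seg)

def toy_encoder(seg, n_t=80, n_f=80, d=128):
    """Hypothetical stand-in for the encoder: returns (N_t, d) time tokens
    and (N_f, d) frequency tokens with placeholder content."""
    frames = seg.reshape(n_t, -1)                        # 80 frames of 64 samples
    t_tok = np.tile(frames.mean(axis=1, keepdims=True), (1, d))
    spectrum = np.abs(np.fft.fft(seg))[:n_f]
    f_tok = np.tile(spectrum[:, None], (1, d))
    return t_tok, f_tok

def aggregate(x):
    tokens = [toy_encoder(s) for s in segment(x)]
    t_concat = np.concatenate([t for t, _ in tokens], axis=0)  # (M * N_t, d)
    f_avg = np.mean([f for _, f in tokens], axis=0)            # (N_f, d)
    return t_concat, f_avg

x = np.random.default_rng(0).standard_normal(12000)  # M = ceil(12000 / 5120) = 3
t_concat, f_avg = aggregate(x)
```

The time-domain token count grows with $M$ while the frequency-domain count stays at $N_f$, so only the time-domain pooler sees a longer sequence as the input grows.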

#### A\.9\.4Encoder Output Summary

Table [7](https://arxiv.org/html/2606.04106#A1.T7) summarizes the encoder's output representations and their typical usage.

Table 7: PlanFormer Encoder Output Representations

| Representation | Shape | Typical Usage |
|---|---|---|
| Time-domain tokens $\mathbf{T}_{t}^{\text{final}}$ | $\mathbb{R}^{B \times N_{t} \times d}$ | Sequence-to-sequence tasks (e.g., denoising, forecasting) |
| Frequency-domain tokens $\mathbf{T}_{f}^{\text{final}}$ | $\mathbb{R}^{B \times N_{f} \times d}$ | Spectral analysis, frequency-domain generation |
| Time-domain latent $\mathbf{z}_{t}$ | $\mathbb{R}^{B \times d}$ | Time-domain classification, temporal retrieval |
| Frequency-domain latent $\mathbf{z}_{f}$ | $\mathbb{R}^{B \times d}$ | Spectral classification, frequency-based retrieval |
| Dual-domain latent $[\mathbf{z}_{t}; \mathbf{z}_{f}]$ | $\mathbb{R}^{B \times 2d}$ | Comprehensive signal understanding, multi-task learning |

Design Philosophy: By providing both token-level and pooled representations in both domains, the encoder maximizes flexibility for downstream applications. Token-level representations preserve spatial/temporal structure for generation tasks, while pooled representations provide compact embeddings for discriminative tasks. The dual-domain structure ensures that both time-domain and frequency-domain perspectives are available, enabling the downstream task to leverage whichever view (or combination) is most appropriate.

## Appendix B: Decoder Architecture

### B\.1Theoretical Foundation and Design Philosophy

Design Philosophy: Architecture as Gradient Substrate. The decoder architecture serves a specific purpose: providing a computational substrate that enables symmetry losses to generate meaningful gradients throughout the encoder. Unlike the encoder, whose design directly embodies signal-theoretic principles, the decoder's design is pragmatic: each component addresses specific gradient flow challenges that arise when training with our co-designed loss functions (IsoFICReg, LED, reconstruction, source separation).

We do not claim these decoder components are optimal or represent the best possible design\. Rather, they are sufficient solutions that fill critical gaps: enabling learning in negative\-SNR regimes \(skip sinks\), preserving high\-frequency gradients \(frequency\-domain upsampling\), and providing instance\-specific guidance \(dynamic FiLM\)\. The decoder’s value lies in its functional role—supporting the encoder’s learning of physical principles—rather than as an architectural contribution in its own right\.

Encoder-Decoder Role Distinction. The encoder must remain task-agnostic, avoiding excessive a priori guidance that could cause information loss. It encodes the complete physical structure ("what it is and how it is"), producing information-maximal bottleneck representations without knowledge of downstream tasks.

The decoder operates with explicit targets derived from the tokens and bottleneck latent\. It interprets the encoder’s information\-rich representation to reconstruct specific targets, using the latent as guidance for "what to pull out and reconstruct\." This asymmetry is critical: if the encoder produces well\-structured representations capturing physical structure, the decoder can reconstruct any target sharing that structure—enabling zero\-shot generalization to unseen domains\.

This role distinction justifies design choices that would be inappropriate during encoding\. For example, Skip Connection Sinks \(Section[B\.4\.3](https://arxiv.org/html/2606.04106#A2.SS4.SSS3)\) use three\-way constraint regularization against explicit targets—acceptable during decoding where targets are known, but inappropriate during encoding where task\-agnostic information preservation is paramount\.

Key Capability: Zero-Shot Reconstruction Generalization. This encoder-decoder separation enables a critical capability: zero-shot reconstruction of signals from previously unseen domains. Because the encoder captures domain-invariant physical structure rather than domain-specific semantics, and the decoder interprets this structure rather than memorizing reconstruction patterns, the system generalizes to new domains without retraining.

This is particularly valuable for inverse problems (denoising, source separation, inpainting) where mixed information is encoded but specific targets must be reconstructed. The bottleneck latent encodes what physical structure exists, not semantic categories; the decoder then extracts the relevant structure for reconstruction. This explains how RF-trained models can denoise audio or reconstruct image sources: the physical transformations (noise, mixing, occlusion) are domain-invariant, even when the signal content differs.

Decoder Components Overview. The decoder introduces three primary innovations beyond the encoder's mechanisms:

1. Attention-Based Dynamic FiLM (Section [B.4.2](https://arxiv.org/html/2606.04106#A2.SS4.SSS2)): Bidirectional latent conditioning via cross-focus for instance-specific guidance
2. Parseval Focus-Based Skip Sinks (Section [B.4.3](https://arxiv.org/html/2606.04106#A2.SS4.SSS3)): Target-guided dynamic filtering of skip connections
3. Sliding Window Self-Focus Activation (Section [B.5.3](https://arxiv.org/html/2606.04106#A2.SS5.SSS3)): Complex-valued adaptive modulation for final reconstruction

The remaining components \(frequency\-domain upsampling, causal cross\-window focus, transformer refinement\) adapt encoder mechanisms for the decoder’s reconstruction objectives\.

### B\.2Design Rationale: Balancing Compression, Attention Cost, and Reconstruction

The decoder architecture employs a U-Net structure with carefully designed skip connections that balance three competing constraints: (1) attention computational cost, which drives aggressive sequence compression, (2) reconstruction fidelity, which requires high-resolution information, and (3) preventing trivial information leakage, which ensures the bottleneck learns meaningful representations.

#### B\.2\.1Compression Driven by Attention Cost

The encoder achieves 64x sequence compression from input to bottleneck:

- Input: 5120 samples (complex-valued IQ)
- After 3 pooling stages within convolutional tokenization (4x, 4x, 4x): 80 tokens
- Final latent: 128-dimensional vector after attentional pooling

This aggressive compression is driven by the quadratic cost of self-attention: $\mathcal{O}(N^{2} \cdot d)$, where $N$ is sequence length. Processing 5120 tokens with multiple attention/focus computations per Parseval Transformer block would be computationally prohibitive. By compressing to 80 tokens, we achieve $(80/5120)^{2} \approx 0.02\%$ of the attention cost, enabling more efficient and real-time processing with Multi-Head Parseval Focus (Section [A.7](https://arxiv.org/html/2606.04106#A1.SS7)).

The compressed bottleneck latent dimensionality \(128\) matches the embedding dimension per token due to attentional pooling across all 80 tokens—each dimension aggregates information from the full sequence via learned attention weights \(Section[A\.8](https://arxiv.org/html/2606.04106#A1.SS8)\)\.
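The compression figures quoted in this subsection follow from simple arithmetic, reproduced in the short check below. The variable names are illustrative; only the input length (5120) and the three 4x pooling stages come from the text.

```python
import math

input_len = 5120          # samples per training window
pool_factors = (4, 4, 4)  # three pooling stages in convolutional tokenization

compression = math.prod(pool_factors)   # 64x sequence compression
tokens = input_len // compression       # 80 tokens reach the transformer
cost_ratio = (tokens / input_len) ** 2  # relative quadratic attention cost

print(compression, tokens, f"{cost_ratio:.4%}")  # prints: 64 80 0.0244%
```

The ratio is exactly $1/64^{2} = 1/4096$, which the text rounds to 0.02%.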

#### B\.2\.2Skip Connection Design: Preventing Trivial Leakage

Reconstructing from 64x compression without skip connections is architecturally infeasible—the bottleneck would need to encode every high\-resolution detail, or we would need to process much longer sequences through the transformer \(defeating the compression goal\)\. However, naively including all skip connections would enable trivial information leakage, allowing the decoder to bypass the bottleneck entirely\.

Our solution: selective skip connections from compressed intermediate features only.

##### B\.2\.2\.1Skip Connection Hierarchy

We include only two skip connections, both from already\-compressed encoder features:

Table 8: Skip Connection Design

| Encoder Stage | Compression | Skip Connection | Rationale |
|---|---|---|---|
| Input | 1x | Omitted | Raw samples (trivial leakage) |
| Conv Tokenization | 1x | Omitted | Linear features (trivial leakage) |
| After 1st Pooling | 4x | Included | Compressed local patterns |
| After 2nd Pooling | 16x | Included | Compressed mid-level features |
| After 3rd Pooling | 64x | Omitted | Redundant with token representation |
| Bottleneck Tokens | 64x | N/A | Global guidance (via FiLM conditioning) |

Key Design Choices:

1. Omit top-level skips: No connection from input or first convolutional layer, preventing the decoder from simply copying input details
2. First skip at 4x compression: Already learned features from frequency-preserving pooling (Section [A.4.2.4](https://arxiv.org/html/2606.04106#A1.SS4.SSS2.P4)), not raw samples
3. Omit pre-transformer skip: The 64x compressed tokens undergo Parseval Transformer processing, and their pooled representation is used for bottleneck latent guidance via FiLM conditioning (Section [B.4.2](https://arxiv.org/html/2606.04106#A2.SS4.SSS2)). A skip connection here would be redundant with the token-based upsampling path and would bypass the learned transformer representations.
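The skip-connection hierarchy can be written down as a small configuration, which makes the sequence length carried by each included skip explicit. This is an illustrative sketch: the `Stage` class and stage names are hypothetical, and the lengths assume the stated compression factors apply purely along the sequence axis of a 5120-sample window.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    compression: int  # sequence compression relative to the input
    skip: bool        # whether a skip connection is taken at this stage

INPUT_LEN = 5120

STAGES = [
    Stage("input", 1, False),              # raw samples: trivial leakage
    Stage("conv_tokenization", 1, False),  # linear features: trivial leakage
    Stage("after_pool_1", 4, True),        # compressed local patterns
    Stage("after_pool_2", 16, True),       # compressed mid-level features
    Stage("after_pool_3", 64, False),      # redundant with token path
]

# Sequence length seen by each skip connection that is actually included.
skip_lengths = {s.name: INPUT_LEN // s.compression for s in STAGES if s.skip}
```

Under these assumptions the two included skips carry 1280 and 320 positions, so neither exposes the full-resolution input to the decoder.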

#### B\.2\.3Role Division: Bottleneck Latent vs\. Skip Connections

The decoder architecture creates a deliberate division of labor:

Bottleneck Latent (via Dynamic FiLM):

- Global structural guidance: Targeted content, object identity, overall frequency distribution
- Task-specific modulation: Conditions all decoder layers via attention-based FiLM (Section [B.4.2](https://arxiv.org/html/2606.04106#A2.SS4.SSS2))
- Cross-modal transfer: Learned from RF, transfers to images/audio/text (validated by linear probing using bottleneck only)
- Transformer-processed representation: Captures long-range dependencies and cross-domain relationships via Parseval Focus

Skip Connections (from 4x and 16x compressed features):

- High-resolution details: Local textures, sharp edges, fine-grained frequency components
- Gradient flow: Provides dense gradients to early encoder layers during training
- Frequency infilling: Parseval Focus-based skip sinks (Section [B.4.3](https://arxiv.org/html/2606.04106#A2.SS4.SSS3)) dynamically filter skip connections, removing noise and interference while preserving signal structure
- Compressed representations: Already 4x and 16x downsampled, forcing encoder to learn meaningful features rather than pass through raw inputs

#### B\.2\.4Empirical Validation: Random Weight Baseline

To validate that learned representations (not architectural inductive biases) drive reconstruction quality, we evaluate the same architecture with random weight initialization. Figure [8](https://arxiv.org/html/2606.04106#A2.F8) shows that random weights produce near-complete reconstruction failure: uniform gray fields with no discernible structure. This demonstrates:

1. Learned weights are essential: The architecture alone (skip connections + U-Net structure) cannot reconstruct without learned representations
2. Bottleneck encodes meaningful structure: Random bottleneck latents provide no guidance, resulting in collapsed reconstructions
3. Skip connections are not shortcuts: Even with skip connections, random weights fail to reconstruct, proving skips don't bypass learning

![Refer to caption](https://arxiv.org/html/2606.04106v1/Figures/CheetahReconstructionRandomWeights.jpg) Figure 8: Random weight baseline for reconstruction. Same architecture as trained model, but with random weight initialization. Left: Original image. Right: Reconstruction with random weights, a complete failure producing a uniform gray field. This validates that learned representations (not architectural biases or skip connections) drive reconstruction quality. Compare to Figure [11](https://arxiv.org/html/2606.04106#A5.F11) showing coherent reconstructions with trained weights.

#### B\.2\.5Why This Design Validates Transfer Learning

Critically, our transfer learning evaluation (Table [1](https://arxiv.org/html/2606.04106#S4.T1)) uses only the bottleneck latent for linear probing; skip connections are not involved. The strong classification performance (77.7% average, 91.9% top-3) validates that the bottleneck latent alone captures transferable physical structure. Skip connections contribute to reconstruction quality during training but do not confound our transfer learning claims.

Reconstruction as Training Signal: The reconstruction objective (with skip connections) serves as a dense training signal that encourages the encoder to learn physically meaningful features. However, the learned representations must also satisfy discriminative objectives (IsoFICReg, LED) that operate solely on the bottleneck latent. The combination ensures the bottleneck learns compressed, transferable representations rather than relying on skip connection shortcuts. The random weight baseline (Figure [8](https://arxiv.org/html/2606.04106#A2.F8)) confirms that skip connections alone cannot reconstruct: learned bottleneck representations are essential.

#### B\.2\.6Interpretation of Reconstruction Results

The zero-shot reconstruction quality (Section E.5, Figure [11](https://arxiv.org/html/2606.04106#A5.F11)) should be interpreted as evidence that:

1. The encoder-decoder system (bottleneck + skip connections + learned weights) captures transferable physical structure across modalities
2. The bottleneck latent provides global structural guidance (validated by linear probing on bottleneck alone, and by random weight failure)
3. Skip connections preserve high-resolution details from compressed features (4x, 16x) necessary for pixel-level reconstruction fidelity
4. Learned weights are essential: Random initialization fails completely (Figure [8](https://arxiv.org/html/2606.04106#A2.F8)), proving architectural biases alone are insufficient

The reconstruction quality is not solely attributable to the bottleneck latent, nor solely to skip connections, nor to architectural inductive biases—all three components \(learned bottleneck, compressed skip connections, learned decoder weights\) are necessary and complementary\. However, the transfer learning results \(linear probing\) validate that the bottleneck latent alone captures sufficient physical structure for cross\-modal classification tasks\.

### B\.3 Decoder Components

### B\.4 Core Decoder Components

Having established the architectural rationale, we now detail the specific decoder components\. The decoder maintains the encoder’s dual\-domain architecture with parallel time and frequency branches\. Like the encoder, the decoder processes reconstruction in non\-overlapping windows to address time\-varying phenomena \(noise fluctuations, local frequency trends\)\. Since frequency\-domain tokens represent global spectral aggregation, the same frequency tokens serve as context for each time\-domain window, with temporally local frequency content injected through skip connections\.

Each decoder layer consists of the following components, applied symmetrically to both branches:

#### B\.4\.1 Convolutional Processing and Frequency\-Domain Upsampling

Convolutional Processing: 1D convolutions with progressively decreasing channel dimensions \(reversing the encoder’s expansion from Section [A\.4](https://arxiv.org/html/2606.04106#A1.SS4)\) refine local features\. As the decoder progresses from abstract latent representations toward concrete signal reconstructions, channel dimensionality decreases while sequence length increases\.

Frequency\-Domain Upsampling: Transposed convolution applied in the frequency domain progressively increases sequence length while preserving spectral structure, consistent with the encoder’s frequency\-preserving pooling strategy \(Section [A\.4](https://arxiv.org/html/2606.04106#A1.SS4)\)\.

Synergy with Skip Connections: Skip connection spectra fill in high\-fidelity frequency components during upsampling, enabling finer\-grained reconstruction than interpolating discrete time\-domain samples\.
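To illustrate why upsampling in the frequency domain preserves spectral structure, the sketch below uses spectral zero-padding as a fixed, non-learned stand-in for the learned transposed convolution described above. The function name `fft_upsample` and the test tone are illustrative assumptions, not part of the described architecture.

```python
import numpy as np

def fft_upsample(x: np.ndarray, factor: int) -> np.ndarray:
    """Upsample a real signal along the last axis by zero-padding its spectrum.

    Assumption: this fixed interpolation stands in for the learned
    frequency-domain transposed convolution; it is not the paper's layer.
    """
    n = x.shape[-1]
    spectrum = np.fft.rfft(x, axis=-1)
    # irfft zero-pads the spectrum up to the new length; rescale by `factor`
    # because irfft normalizes by the output length.
    return np.fft.irfft(spectrum, n=n * factor, axis=-1) * factor

# A tone at bin 3 of a 32-sample window, upsampled 2x.
n, factor, k = 32, 2, 3
x = np.cos(2 * np.pi * k * np.arange(n) / n)
y = fft_upsample(x, factor)
reference = np.cos(2 * np.pi * k * np.arange(n * factor) / (n * factor))
```

Because the existing bins are left untouched and only empty high-frequency bins are appended, the upsampled tone matches the analytically resampled tone; a learned layer would additionally be free to populate the new bins, for example from skip connection spectra.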

#### B\.4\.2 Attention\-Based Dynamic FiLM Conditioning

Motivation: Traditional Feature\-wise Linear Modulation \(FiLM\) \[[27](https://arxiv.org/html/2606.04106#bib.bib27)\] applies channel\-wise affine transformations conditioned on an external signal\. However, standard FiLM has critical limitations:

1. One\-way conditioning: The conditioning signal generates modulation parameters without considering the features being modulated
2. Global rigidity: Channel\-wise modulation is spatially/temporally global, failing to respect equivariant symmetries learned during training

Innovation: We extend FiLM by combining it with the Dynamic TanH activation \[[53](https://arxiv.org/html/2606.04106#bib.bib53)\] construction, generating the modulation parameters via Scaled Covariance Cross\-Focus \(Section [A\.6\.1](https://arxiv.org/html/2606.04106#A1.SS6.SSS1)\) between the bottleneck latent and the sequence of features being modulated\. This creates a bidirectional relationship: the latent guides modulation based on the reconstruction target, but the features’ characteristics influence how that guidance is applied\.

##### B\.4\.2\.1 Modulation Parameter Generation

The cross\-focus mechanism computes attention between the bottleneck latent \(as query\) and the decoder features \(as key/value\), producing modulation parameters that are adaptive to both the reconstruction target and the current feature state\.

Gamma \($\gamma$\): Generated per element in the sequence, applied inside the TanH nonlinearity for maximal expressivity\. This pointwise modulation respects local equivariant symmetries learned during training\.

Alpha \($\alpha$\) and Beta \($\beta$\): Generated channel\-wise for parameter efficiency, applied as standard affine parameters outside the nonlinearity for stability\.

##### B\.4\.2\.2 Residual Formulation

To ensure stability and preserve information flow, modulation is applied as a series of residual connections:

$$\mathbf{x}_{\text{modulated}} = \alpha \cdot \tanh(\gamma \cdot \mathbf{x} + \mathbf{x}) + \beta + \mathbf{x} \tag{122}$$
This nested residual structure ensures that even if modulation parameters collapse, the identity pathway preserves gradient flow\.
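A minimal NumPy sketch of Eq. (122) makes the identity pathway concrete. The function name `dynamic_film`, the tensor shapes, and the randomly drawn parameters are assumptions for illustration; in the described decoder the parameters would come from the cross-focus mechanism rather than a random generator.

```python
import numpy as np

def dynamic_film(x, gamma, alpha, beta):
    """Eq. (122): alpha * tanh(gamma * x + x) + beta + x.

    Shapes (assumed for illustration): x and gamma are (C, L) because gamma
    is generated per element; alpha and beta are (C, 1), i.e. channel-wise.
    """
    return alpha * np.tanh(gamma * x + x) + beta + x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))

# Collapsed modulation parameters: the output reduces to the identity path.
zeros_pointwise = np.zeros_like(x)
zeros_channel = np.zeros((4, 1))
collapsed = dynamic_film(x, zeros_pointwise, zeros_channel, zeros_channel)

# Non-trivial parameters: gamma rescales the tanh input per element.
gamma = 0.5 * rng.standard_normal((4, 16))
alpha = np.ones((4, 1))
beta = np.zeros((4, 1))
modulated = dynamic_film(x, gamma, alpha, beta)
```

With all parameters at zero the output equals the input exactly, which is the behavior the nested residual structure is meant to guarantee.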

Rationale: By conditioning modulation on the bottleneck latent via cross\-focus, the decoder can adaptively emphasize or suppress features based on the reconstruction objective\. For denoising, the modulation might amplify signal\-like features and suppress noise\-like features; for source separation, it might emphasize features corresponding to the target source\. Critically, this adaptation occurs dynamically based on the latent encoding of the desired output, enabling the same decoder architecture to handle diverse reconstruction tasks\.

#### B\.4\.3 Parseval Focus\-Based Skip Connection Sinks

Motivation: In UNet architectures, skip connections carry high\-resolution details but present a vulnerability: they hold intermediately processed representations before full encoding is complete\. This is problematic for tasks where multiple signal components are present during encoding but only specific components are desired at reconstruction output \(e\.g\., denoising: signal \+ noise $\rightarrow$ signal only; source separation: speaker A \+ speaker B $\rightarrow$ speaker A only\)\.

Challenge: A naive approach would learn fixed transformations after skip fusion to attenuate unwanted components\. However, this rigid prior does not scale to arbitrary additive noise sources or mixture scenarios where the unwanted component varies dynamically\.

Solution: We introduce a unified mechanism combining: \(1\) Noise Sink mechanism \(Section [A\.4](https://arxiv.org/html/2606.04106#A1.SS4)\) to clean skip connections, extended with conditional guidance toward the reconstruction target, \(2\) Parseval\-consistent cross\-domain focus \(Section [A\.7](https://arxiv.org/html/2606.04106#A1.SS7)\) to ensure physical consistency, and \(3\) Dynamic power compensation for SNR\-adaptive fusion\.

Key Insight: The bottleneck latent encodes which physical structure to reconstruct\. By conditioning the skip sink on this latent, we enable dynamic filtering: the same skip connection can be processed differently depending on the reconstruction objective\. For denoising, the sink removes noise; for source separation, it removes interfering sources; for signal restoration, it removes artifacts\. All of these use the same architecture, guided by the latent encoding of the desired output structure\.

##### B\.4\.3\.1 Theoretical Foundation: Dynamic Filter Banks

For dynamic noise sources that vary in time and frequency, it is essential to learn dynamic filter behaviors \(low\-pass, high\-pass, bandpass, notch\)\. To do so effectively, we need access to the full context of signals in both time and frequency domains\.

##### B\.4\.3\.2 Implementation: Parseval Focus with Dynamic Filtering

The skip sink implementation varies by domain but follows a unified procedure:

Domain\-Specific Preprocessing:

- Time\-domain skips: Processed via non\-overlapping sliding windows, leveraging the inductive bias that additive noise is locally uncorrelated with signal
- Frequency\-domain skips: Processed with full spectral context \(no windowing\), using learned convolutional downsampling to produce manageable spectral bins

Parseval Focus for Dynamic Noise Removal:

Both time and frequency representations undergo complementary parametric domain transforms \(FFT/IFFT\) to produce dual\-domain token sequences\. The bottleneck latent is prepended to each domain’s sequence as a conditioning token, and Multi\-Head Parseval Focus \(Section [A\.7](https://arxiv.org/html/2606.04106#A1.SS7)\) is computed over these augmented sequences\.

This joint time\-frequency processing enables learning dynamic filters \(low\-pass, high\-pass, bandpass, notch\) conditioned on three critical factors:

1. What to remove: Encoded in bottleneck latent \(e\.g\., "remove noise, keep signal"\)
2. Where to remove: Time localization via windowed processing
3. At what frequencies: Spectral localization via frequency\-domain processing

Parseval Focus is ideal because it jointly considers time and frequency representations while enforcing physical consistency \(via JSD regularization\)\. Cross\-domain attention allows the model to learn that noise removal in time corresponds to predictable changes in frequency, and vice versa\.

Sink Computation and Subtraction:

The output of Parseval Focus \(with bottleneck conditioning token removed\) seeds a hypernetwork generating $\gamma$, $\alpha$, and $\beta$ pointwise modulation parameters\. These are applied in the frequency domain, functioning as spectral convolutions whose cumulative output serves as the sink \(noise estimate\):

$$\mathbf{x}_{\text{skip}}^{\text{clean}} = \mathbf{x}_{\text{skip}} - \mathbf{x}_{\text{sink}} \tag{123}$$
This subtractive residual formulation enables learning both direct filters \(what to keep\) and complementary filters \(what to remove\), providing maximal expressivity for all types of skip connection pollution\.
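The complementary-filter reading of Eq. (123) can be sketched with a fixed spectral gain. Here `spectral_gain` is a hypothetical per-bin mask standing in for the hypernetwork-generated modulation, and `subtract_sink` is an illustrative name; the actual sink is learned and conditioned on the bottleneck latent.

```python
import numpy as np

def subtract_sink(x_skip, spectral_gain):
    """Return (x_clean, x_sink) with x_clean = x_skip - x_sink (Eq. 123).

    `spectral_gain` is a hypothetical per-bin gain; the sink is the part of
    the skip signal selected by that gain in the frequency domain.
    """
    n = x_skip.shape[-1]
    x_sink = np.fft.irfft(spectral_gain * np.fft.rfft(x_skip), n=n)
    return x_skip - x_sink, x_sink

n = 64
t = np.arange(n)
signal = np.sin(2 * np.pi * 2 * t / n)                 # wanted low-frequency tone
interference = 0.5 * np.sin(2 * np.pi * 20 * t / n)    # unwanted high-frequency tone
x_skip = signal + interference

# A gain that selects bins >= 10 makes the sink a high-pass estimate, so the
# cleaned skip is the complementary low-pass output.
gain = (np.arange(n // 2 + 1) >= 10).astype(float)
x_clean, x_sink = subtract_sink(x_skip, gain)
```

In this toy case the sink recovers the interference and the subtraction leaves the wanted tone, showing how learning what to remove implicitly defines what to keep.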

##### B\.4\.3\.3 Regularization for Target\-Guided Conditioning

To ensure the conditioning works as expected, the layer is heavily regularized:

\(1\) Decorrelation Loss: The sink is decorrelated from the post\-sink output \(similar to encoder noise sinks\):

$$\mathcal{L}_{\text{decorr}}^{\text{skip}} = \left|\text{PearsonCorr}(\mathbf{x}_{\text{sink}}, \mathbf{x}_{\text{skip}}^{\text{clean}})\right| \tag{124}$$

This ensures the sink removes uncorrelated components rather than signal content\.

\(2\) Sink\-Latent Decorrelation: The sink is decorrelated from the bottleneck latent guidance:

$$\mathcal{L}_{\text{sink-latent}} = \left|\text{PearsonCorr}(\mathbf{x}_{\text{sink}}, \mathbf{z})\right| \tag{125}$$

This ensures the removed content is not aligned with the reconstruction target encoded in the latent\.

\(3\) Post\-Sink Target Alignment Loss: The post\-sink output is encouraged to correlate with the bottleneck latent guidance:

$$\mathcal{L}_{\text{target-align}} = 1 - \text{PearsonCorr}(\mathbf{x}_{\text{skip}}^{\text{clean}}, \mathbf{z}) \tag{126}$$

This ensures the cleaned signal aligns with the reconstruction target\.

\(4\) Dynamic Power Scaling: Handles cases where the sink’s information content is more dominant than the signal of interest, using the same power compensation mechanism as the encoder’s noise sink \(Section [A\.4](https://arxiv.org/html/2606.04106#A1.SS4)\)\.

Rationale: These regularization losses create a three\-way constraint: \(1\) sink should be uncorrelated with cleaned output \(removing noise, not signal\), \(2\) sink should be uncorrelated with reconstruction target \(not removing what we want\), and \(3\) cleaned output should be correlated with reconstruction target \(keeping what we want\)\. Together, these ensure the skip sink performs target\-guided filtering rather than arbitrary transformation\.
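The three correlation-based losses in Eqs. (124)-(126) can be sketched as follows. How the latent $\mathbf{z}$ is brought to a shape comparable with the skip features is not specified here, so the sketch assumes it has already been projected to the same shape; the helper names are illustrative.

```python
import numpy as np

def pearson_corr(a, b):
    """Pearson correlation between two arrays, flattened to vectors."""
    a = np.ravel(a) - np.mean(a)
    b = np.ravel(b) - np.mean(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def skip_sink_losses(x_sink, x_clean, z):
    """Eqs. (124)-(126). Assumes z has been projected to the shape of x."""
    return {
        "decorr_skip": abs(pearson_corr(x_sink, x_clean)),
        "sink_latent": abs(pearson_corr(x_sink, z)),
        "target_align": 1.0 - pearson_corr(x_clean, z),
    }

t = np.arange(64)
target = np.sin(2 * np.pi * 2 * t / 64)    # structure the latent asks for
noise = np.sin(2 * np.pi * 20 * t / 64)    # orthogonal to the target

# Desired behavior: the sink holds the noise, the cleaned skip holds the target.
good = skip_sink_losses(x_sink=noise, x_clean=target, z=target)
# Failure mode: the sink removes the target and leaves the noise behind.
bad = skip_sink_losses(x_sink=target, x_clean=noise, z=target)
```

All three terms vanish in the desired case, while the failure case is penalized by the sink-latent and target-alignment terms even though the decorrelation term alone cannot distinguish the two.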

#### B\.4\.4 Causal Cross\-Window Focus with Latent Conditioning

The Causal Cross\-Window Focus mechanism \(Section [A\.4\.1](https://arxiv.org/html/2606.04106#A1.SS4.SSS1)\) is employed at each decoder layer to maintain phase coherence across window boundaries\.

##### B\.4\.4\.1 Per\-Layer Previous Window Storage

At each decoder layer $i$, we maintain a buffer storing the previous window’s output:

$$\mathbf{x}_{\text{prev}}^{(i)} \in \mathbb{R}^{B \times C_i \times W} \tag{127}$$

For the first window \($\text{window\_idx} = 0$\), no previous context exists\. For subsequent windows \($\text{window\_idx} > 0$\), we compute cross\-window attention\.

##### B\.4\.4\.2 Two\-Stage Focus Process

The decoder applies focus hierarchically at each layer, with domain\-specific preprocessing to ensure physically meaningful positional encodings:

Stage 1: Cross\-Window Focus \(windows 2\+\)

For all windows except the first \($\text{window\_idx} > 0$\), we compute causal cross\-window focus to maintain phase coherence\. The preprocessing differs by domain to respect physical interpretation:

Time\-Domain Processing:

1. Transform current and previous windows to frequency domain \(mitigates aliasing during learned downsampling\)
2. Reshape to token representation and downsample \(4×\) via learned 1×1 convolution
3. Transform back to time domain before concatenation
4. Concatenate previous and current windows in time domain: $\mathbf{T}_{\text{concat}} = [\mathbf{T}_{\text{prev}}^{(\text{time})}; \mathbf{T}_{\text{curr}}^{(\text{time})}]$
5. Apply sinusoidal positional encoding to concatenated time\-domain sequence
6. Apply RMS normalization to split sequences independently
7. Compute cross\-focus: current window \(query\) attends to previous window \(key/value\)
8. Upsample in frequency domain, reshape, transform to time domain, and add as residual

Frequency\-Domain Processing:

1. Reshape to token representation and downsample \(4×\) via learned 1×1 convolution
2. Remain in frequency domain \(no time\-domain transformation\)
3. Apply sinusoidal positional encoding to previous and current windows independently \(no concatenation\)
4. Apply RMS normalization
5. Compute cross\-focus: current window \(query\) attends to previous window \(key/value\)
6. Upsample, reshape, and add as residual

Rationale for Domain\-Specific Treatment:

Time Domain: Causal concatenation in the time domain preserves temporal ordering and phase relationships across window boundaries\. The frequency\-domain transformation is used only for anti\-aliasing during downsampling; the actual positional encoding must occur in the time domain where position has absolute temporal meaning\. This ensures the model learns continuous phase relationships: a feature at the end of window $i-1$ should smoothly connect to the beginning of window $i$\.

Frequency Domain: Each position corresponds to a spectral bin \(e\.g\., the 100 Hz component\) rather than a temporal instant\. Concatenation would impose artificial temporal ordering on what is fundamentally a spectral covariance problem\. Independent positional encoding allows the cross\-attention mechanism to learn how corresponding frequency bins evolve across windows \(e\.g\., "when the 100 Hz component is strong in the previous window, how does it affect the current window?"\) without conflating this with temporal position\.
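The difference between the two positional-encoding schemes can be sketched in a few lines. This is a simplified illustration under stated assumptions: standard softmax dot-product attention replaces the cross-focus mechanism, the FFT-based downsampling and upsampling steps are omitted, and all function names are hypothetical. Encoding the concatenated sequence and then splitting it is equivalent to giving the current window a position offset equal to the window length.

```python
import numpy as np

def sinusoidal_pe(length, dim, offset=0):
    """Sinusoidal positional encoding for positions offset..offset+length-1."""
    pos = np.arange(offset, offset + length)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

def rms_norm(x, eps=1e-8):
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def cross_attend(query, context):
    """Softmax dot-product cross-attention (assumed stand-in for cross-focus)."""
    scores = query @ context.T / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context

def cross_window(prev, curr, domain):
    n, d = curr.shape
    if domain == "time":
        # Positions run continuously across the window boundary.
        pe_prev, pe_curr = sinusoidal_pe(n, d), sinusoidal_pe(n, d, offset=n)
    else:
        # Frequency tokens index spectral bins, so both windows share positions.
        pe_prev = pe_curr = sinusoidal_pe(n, d)
    q, kv = rms_norm(curr + pe_curr), rms_norm(prev + pe_prev)
    return curr + cross_attend(q, kv)  # added as a residual

rng = np.random.default_rng(0)
prev, curr = rng.standard_normal((2, 16, 8))
out_time = cross_window(prev, curr, "time")
out_freq = cross_window(prev, curr, "freq")
```

In the time branch, token 0 of the current window sits at position 16, directly after the last token of the previous window; in the frequency branch, token 0 of both windows refers to the same spectral bin.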

Stage 2: Latent\-Conditioned Self\-Focus \(all windows\)

After cross\-window focus \(or for the first window where Stage 1 is skipped\), we apply self\-focus conditioned on the bottleneck latent\. Again, domain\-specific preprocessing ensures physical consistency:

Time\-Domain Processing:

1. Downsample tokens \(4×\) via learned 1×1 convolution
2. Transform to time domain before latent conditioning
3. Project bottleneck latent to token dimension and prepend: $\mathbf{T}_{\text{aug}} = [\mathbf{W}_z \mathbf{z}_t; \mathbf{T}_{\text{curr}}^{(\text{time})}]$
4. Apply sinusoidal positional encoding and RMS normalization
5. Compute self\-focus over augmented sequence
6. Remove latent token from output
7. Transform to frequency domain before upsampling
8. Upsample in frequency domain, reshape, transform to time domain, and add as residual

Frequency\-Domain Processing:

1. Downsample tokens \(4×\) via learned 1×1 convolution
2. Remain in frequency domain
3. Project bottleneck latent to token dimension and prepend: $\mathbf{T}_{\text{aug}} = [\mathbf{W}_z \mathbf{z}_f; \mathbf{T}_{\text{curr}}^{(\text{freq})}]$
4. Apply sinusoidal positional encoding and RMS normalization
5. Compute self\-focus over augmented sequence
6. Remove latent token from output
7. Upsample, reshape, and add as residual
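The core of Stage 2 (prepend the projected latent, attend, drop the latent token) can be sketched as below. Standard softmax self-attention is used as an assumed stand-in for self-focus; positional encoding, RMS normalization, and the resampling steps are left out, and `w_z` is a hypothetical, randomly initialized projection.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def latent_conditioned_self_attention(tokens, z, w_z):
    """Prepend the projected latent, self-attend, then drop the latent token.

    tokens: (N, D) window tokens; z: (Dz,) bottleneck latent;
    w_z: (D, Dz) projection (hypothetical, randomly initialized here).
    """
    latent_token = (w_z @ z)[None, :]                            # (1, D)
    augmented = np.concatenate([latent_token, tokens], axis=0)   # (N + 1, D)
    scores = augmented @ augmented.T / np.sqrt(augmented.shape[-1])
    attended = softmax(scores) @ augmented
    return tokens + attended[1:]   # remove latent token, add as residual

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
w_z = rng.standard_normal((8, 4))
out_a = latent_conditioned_self_attention(tokens, rng.standard_normal(4), w_z)
out_b = latent_conditioned_self_attention(tokens, rng.standard_normal(4), w_z)
```

The output keeps the shape of the window tokens, yet two different latents yield different refinements of the same tokens, which is the sense in which the latent conditions the sequence.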

##### B\.4\.4\.3 Bidirectional Conditioning

This two\-stage process provides:

- Temporal coherence: Cross\-window focus in time domain ensures phase continuity across boundaries
- Spectral coherence: Cross\-window focus in frequency domain captures time\-varying spectral evolution
- Task\-specific guidance: Latent conditioning guides refinement toward reconstruction target in both domains
- Bidirectional influence: Latent influences sequence \(forward\), sequence influences latent via gradients \(backward\)

Anti\-Aliasing Strategy: The time\-domain branch performs frequency\-domain transformations \(FFT/IFFT\) around the downsampling/upsampling operations, following the same reasoning as frequency\-preserving pooling \(Section [A\.4](https://arxiv.org/html/2606.04106#A1.SS4)\)\. This prevents aliasing artifacts that would corrupt high\-frequency information essential for learning frequency translation equivariance\. The frequency\-domain branch operates natively in frequency, so no additional transformations are needed\.

### B\.5 Global Sequence Refinement

After window\-based processing, the decoder applies global refinement mechanisms to address sequence\-level phenomena and ensure high\-fidelity reconstruction\.

#### B\.5\.1 Window Aggregation and Smoothing

Time Domain: Window outputs are concatenated along the sequence dimension:

$$\mathbf{x}_{\text{concat}} = \text{Concat}([\mathbf{x}_{\text{win}_1}, \ldots, \mathbf{x}_{\text{win}_{N_w}}], \text{dim} = -1) \in \mathbb{R}^{B \times C \times L_{\text{total}}} \tag{128}$$

Domain\-Specific Final Convolutions and Fusion:

$$\mathbf{x}_t^{\text{final}} = \text{Conv}_{\text{final}}^{(t)}(\mathbf{x}_{\text{concat}}) \tag{129}$$

$$\mathbf{x}_f^{\text{final}} = \text{Conv}_{\text{final}}^{(f)}(\mathbf{x}_{\text{freq}}) \tag{130}$$

$$\mathbf{x}_f^{(t)} = \text{IFFT}(\mathbf{x}_f^{\text{final}}) \quad \text{(frequency to time)} \tag{131}$$

$$\mathbf{x}_{\text{fused}} = \text{GLU}([\mathbf{x}_t^{\text{final}}; \mathbf{x}_f^{(t)}]) \tag{132}$$

Smoothing Convolution: A convolution with kernel size 7 is applied across the full sequence, blending features across window boundaries while preserving overall signal structure:

$$\mathbf{x}_{\text{smooth}} = \text{Conv}_{\text{smooth}}(\mathbf{x}_{\text{fused}}) \tag{133}$$
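The data flow of Eqs. (128), (132), and (133) can be traced with a shape-level sketch. Several simplifications are assumed: the batch dimension is dropped, the final convolutions of Eqs. (129)-(130) are treated as identities, the frequency branch output is replaced by a random placeholder, and a moving average stands in for the learned kernel-size-7 convolution.

```python
import numpy as np

def glu(x, axis=0):
    """Gated linear unit: split in half along `axis`, gate one half by the other."""
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))

def smooth(x, kernel):
    """Same-length 1D convolution applied to each channel independently."""
    return np.stack([np.convolve(ch, kernel, mode="same") for ch in x])

num_windows, channels, window_len = 4, 2, 8
rng = np.random.default_rng(0)
windows = [rng.standard_normal((channels, window_len)) for _ in range(num_windows)]

x_concat = np.concatenate(windows, axis=-1)             # Eq. (128): (C, L_total)
x_freq_as_time = rng.standard_normal(x_concat.shape)    # placeholder for IFFT(x_f^final)
x_fused = glu(np.concatenate([x_concat, x_freq_as_time], axis=0))  # Eq. (132)
kernel = np.ones(7) / 7.0                               # stand-in for the learned kernel
x_smooth = smooth(x_fused, kernel)                      # Eq. (133)
```

With a kernel of size 7, each output sample mixes three neighbors on either side, so samples within three positions of a window boundary draw on both adjacent windows.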

#### B\.5\.2 Transformer\-Based Frequency Refinement

After window\-based processing and concatenation, local convolutional operations have captured fine\-grained details but may miss long\-range spectral dependencies\. We apply transformer\-based attention in the frequency domain to capture global spectral relationships\.

Procedure:

1. Transform smoothed time\-domain representation to frequency via FFT
2. Bin spectrum into manageable tokens: $5120$ frequency bins $\rightarrow$ $80$ tokens \(64 bins/token\)
3. Apply single\-layer transformer \(8 heads, Scaled Covariance Focus\) to refine frequency components
4. Transform back to time domain via IFFT and remove DC bias

Rationale: Spectral relationships \(harmonics, spectral envelope\) manifest as local patterns in frequency but long\-range dependencies in time\. For example, a fundamental frequency and its harmonics \(separated by thousands of samples in time\) appear as a local pattern in frequency\. The transformer enforces consistency across harmonically\-related components, improving reconstruction fidelity for signals with rich harmonic structure\.
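The four-step procedure above can be sketched with the transformer replaced by a placeholder. The token layout (real and imaginary parts of 64 consecutive bins per token) and the treatment of the input as a length-5120 complex signal are assumptions made for illustration; `refine` defaults to the identity so the round trip can be checked.

```python
import numpy as np

NUM_BINS, NUM_TOKENS = 5120, 80
BINS_PER_TOKEN = NUM_BINS // NUM_TOKENS  # 64 bins per token

def refine_in_frequency(x, refine=lambda tokens: tokens):
    """FFT -> bin into tokens -> refine -> unbin -> IFFT -> remove DC.

    x: complex signal of length 5120. `refine` is a placeholder for the
    single-layer transformer; the default is the identity.
    """
    spectrum = np.fft.fft(x)
    binned = spectrum.reshape(NUM_TOKENS, BINS_PER_TOKEN)
    # Assumed token layout: real parts followed by imaginary parts, (80, 128).
    tokens = np.concatenate([binned.real, binned.imag], axis=-1)
    tokens = refine(tokens)
    refined = tokens[:, :BINS_PER_TOKEN] + 1j * tokens[:, BINS_PER_TOKEN:]
    x_refined = np.fft.ifft(refined.reshape(NUM_BINS))
    return x_refined - x_refined.mean()   # DC removal

rng = np.random.default_rng(0)
x = rng.standard_normal(NUM_BINS) + 1j * rng.standard_normal(NUM_BINS)
y = refine_in_frequency(x)
```

Attention over 80 tokens involves an 80 × 80 score matrix, compared with 5120 × 5120 if every frequency bin were its own token.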

#### B\.5\.3 Sliding Window Self\-Focus Activation

Purpose: Final layer of adaptive, equivariant refinement\.

Motivation: The final activation function must address several dynamic symmetries essential for effective signal reconstruction: \(1\) translational equivariance, \(2\) variable output dynamic ranges, \(3\) relative long\-range phase relationships, and \(4\) final refinement that earlier layers could not address\.

Challenge: Signal reconstructions are anchored to unit power during training, but this does not produce a fixed dynamic range\. Output values can vary widely depending on signal characteristics\. It is unreasonable to learn fixed activation parameters that generalize to any reconstruction range; the dynamic range itself must be learnable and adaptive\.

##### B\.5\.3\.1 Solution: Attention\-Based Hypernetwork for Dynamic Modulation

We extend the Dynamic FiLM mechanism \(Section [B\.4\.2](https://arxiv.org/html/2606.04106#A2.SS4.SSS2)\) to the final activation layer, introducing an attention\-based hypernetwork to dynamically modulate both inside and outside the nonlinearity\. This creates a bidirectional relationship between the conditioning signal \(bottleneck latent\) and the features being modulated \(reconstruction sequence\)\.

Computational Challenge: Since this operates over the final reconstruction sequence, each element serves as a "token\." Computing full inner\-product attention across all tokens becomes computationally prohibitive for long sequences \(e\.g\., 5120 samples = 5120 tokens\)\.

Solution: We utilize our sliding window focus formulation \(Section [A\.4\.1](https://arxiv.org/html/2606.04106#A1.SS4.SSS1)\) across the sequence with 20 windows\. To ensure proper continuity for phase relationship learning, we leverage the causal window focus formulation\. The key difference: we do not upsample back to the original signal resolution after computing causal cross\-window focus\. Instead, the output of cross\-window focus \(at reduced resolution\) serves as conditioning for modulation parameter generation, which is then broadcast to the full\-resolution reconstruction\.
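As a rough illustration of the windowing arithmetic, 5120 samples split into 20 windows gives 256 samples per window, so per-window full-resolution attention would need 20 × 256² = 1,310,720 score entries instead of 5120² = 26,214,400. The sketch below shows one way reduced-resolution parameters could be broadcast back to full resolution; the 4× reduction factor and the nearest-neighbor repeat are assumptions, since the broadcast rule is not spelled out here.

```python
import numpy as np

SEQ_LEN, NUM_WINDOWS = 5120, 20
WINDOW_LEN = SEQ_LEN // NUM_WINDOWS   # 256 samples per window
DOWNSAMPLE = 4                        # assumed reduction factor

def broadcast_conditioning(reduced):
    """Repeat per-window, reduced-resolution parameters up to full resolution.

    reduced: (NUM_WINDOWS, WINDOW_LEN // DOWNSAMPLE) values, e.g. one
    modulation parameter per downsampled position (assumed layout).
    """
    full = np.repeat(reduced, DOWNSAMPLE, axis=-1)   # (20, 256)
    return full.reshape(SEQ_LEN)

reduced = np.arange(NUM_WINDOWS * WINDOW_LEN // DOWNSAMPLE, dtype=float)
reduced = reduced.reshape(NUM_WINDOWS, WINDOW_LEN // DOWNSAMPLE)
full = broadcast_conditioning(reduced)
```

Each reduced-resolution value then governs a run of four consecutive output samples, which keeps parameter generation cheap while the modulation itself is applied at full resolution.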

##### B\.5\.3\.2 Modulation Parameter Generation

Two\-Stage Process:

1. Cross\-Window Focus: Contextualizes phase compensation by attending to the previous window, ensuring continuity across window boundaries
2. Bottleneck Latent\-Conditioned Self\-Focus: Refines via bidirectional influence between the bottleneck latent and all tokens in the current window, providing task\-specific guidance

This ensures both local coherence \(within windows\) and global coherence \(across windows\) while maintaining computational efficiency\.

##### B\.5\.3\.3 Complex\-Valued Modulation

To avoid introducing DC bias, we only compute alpha \($\alpha$\) and gamma \($\gamma$\) for scaling modulations inside and outside the TanH nonlinearity \(no additive beta\)\. Since elements represent real and imaginary components of a complex\-valued reconstruction \(in interleaved IQ format\), gamma and alpha are generated as pairs of element\-wise magnitude and phase compensations\.

For a complex\-valued sample $z = I + jQ$ \(represented as consecutive elements $[I, Q]$ in the interleaved format\):

$$|z| = \sqrt{I^2 + Q^2} \quad \text{(magnitude)} \tag{134}$$

$$\phi = \text{atan2}(Q, I) \quad \text{(phase)} \tag{135}$$

$$z_{\text{mod}} = \alpha_{\text{mag}} \cdot \tanh(\gamma_{\text{mag}} \cdot |z| + |z|) \cdot e^{j(\phi + \gamma_{\text{phase}} + \alpha_{\text{phase}})} \tag{136}$$

This is then converted back to interleaved IQ format, where $\phi_{\text{mod}} = \phi + \gamma_{\text{phase}} + \alpha_{\text{phase}}$ denotes the modulated phase from Eq\. \(136\):

$$I_{\text{mod}} = |z_{\text{mod}}| \cos(\phi_{\text{mod}}) \tag{137}$$

$$Q_{\text{mod}} = |z_{\text{mod}}| \sin(\phi_{\text{mod}}) \tag{138}$$

Rationale: Complex\-valued modulation provides fine\-grained control over both amplitude and phase, essential for high\-fidelity signal reconstruction\. The magnitude modulation can adaptively adjust dynamic range, while phase modulation can correct phase discontinuities at window boundaries or compensate for phase distortions introduced by earlier processing stages\.
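Eqs. (134)-(138) translate directly into a few array operations on the interleaved IQ representation. The function name and the scalar parameter values below are illustrative assumptions; in the described activation the four parameters would be generated per element by the hypernetwork.

```python
import numpy as np

def complex_modulate(iq, gamma_mag, gamma_phase, alpha_mag, alpha_phase):
    """Apply Eqs. (134)-(138) to an interleaved [I0, Q0, I1, Q1, ...] array."""
    i, q = iq[0::2], iq[1::2]
    mag = np.sqrt(i ** 2 + q ** 2)                        # Eq. (134)
    phase = np.arctan2(q, i)                              # Eq. (135)
    mag_mod = alpha_mag * np.tanh(gamma_mag * mag + mag)  # magnitude factor of Eq. (136)
    phase_mod = phase + gamma_phase + alpha_phase         # phase term of Eq. (136)
    out = np.empty_like(iq)
    out[0::2] = mag_mod * np.cos(phase_mod)               # Eq. (137)
    out[1::2] = mag_mod * np.sin(phase_mod)               # Eq. (138)
    return out

iq = np.array([3.0, 4.0, 1.0, 0.0])   # z0 = 3 + 4j, z1 = 1 + 0j
neutral = complex_modulate(iq, 0.0, 0.0, 1.0, 0.0)
rotated = complex_modulate(iq, 0.0, np.pi / 2, 1.0, 0.0)
```

A phase parameter of π/2 rotates every sample by a quarter turn without changing its magnitude, while the magnitude path compresses |z| through the tanh even when both magnitude parameters are neutral.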

### B\.6 Complete Decoder Pipeline

Algorithm [9](https://arxiv.org/html/2606.04106#alg9) summarizes the complete decoder forward pass\.

Algorithm 9: Complete Decoder Forward Pass

Input: Time tokens $\mathbf{T}_t$, frequency tokens $\mathbf{T}_f$

Input: Skip connections $\{\mathbf{S}_t^{(i)}, \mathbf{S}_f^{(i)}\}$, bottleneck latents $\mathbf{z}_t, \mathbf{z}_f$

Output: Final reconstruction $\mathbf{x}_{\text{out}}$

- Reshape tokens to convolutional format:
  - $\mathbf{x}_t, \mathbf{x}_f \leftarrow \text{ReshapeToChannels}(\mathbf{T}_t, \mathbf{T}_f)$
- For $\text{window\_idx} = 1$ to $N_w$:
  - Extract time window, use full frequency:
    - $\mathbf{x}_t^{\text{win}} \leftarrow \mathbf{x}_t[:, :, (\text{window\_idx} - 1) \cdot W : \text{window\_idx} \cdot W]$
  - For $i = 1$ to $N_{\text{layers}}$:
    - Convolutional processing:
      - $\mathbf{x}_t^{\text{win}} \leftarrow \text{Conv}^{(i)}(\mathbf{x}_t^{\text{win}})$
      - $\mathbf{x}_f \leftarrow \text{Conv}^{(i)}(\mathbf{x}_f)$
    - Dynamic FiLM conditioning:
      - $\mathbf{x}_t^{\text{win}} \leftarrow \text{DynamicFiLM}^{(i)}(\mathbf{x}_t^{\text{win}}, \mathbf{z}_t)$
      - $\mathbf{x}_f \leftarrow \text{DynamicFiLM}^{(i)}(\mathbf{x}_f, \mathbf{z}_f)$
    - Frequency\-domain upsampling:
      - $\mathbf{x}_t^{\text{win}} \leftarrow \text{IFFT}(\text{Upsample}(\text{FFT}(\mathbf{x}_t^{\text{win}})))$
      - $\mathbf{x}_f \leftarrow \text{Upsample}(\mathbf{x}_f)$
    - If $i \neq N_{\text{layers}}$:
      - Skip connection fusion:
        - $\mathbf{S}_t^{\text{clean}} \leftarrow \text{SkipSink}^{(i)}(\mathbf{S}_t^{(i)}, \mathbf{z}_t)$
        - $\mathbf{S}_f^{\text{clean}} \leftarrow \text{SkipSink}^{(i)}(\mathbf{S}_f^{(i)}, \mathbf{z}_f)$
        - $\mathbf{x}_t^{\text{win}} \leftarrow \text{GLU}([\mathbf{x}_t^{\text{win}}; \text{FFT}(\mathbf{S}_t^{\text{clean}})])$
        - $\mathbf{x}_f \leftarrow \text{GLU}([\mathbf{x}_f; \mathbf{S}_f^{\text{clean}}])$
      - Causal cross\-window focus:
        - $\mathbf{x}_t^{\text{win}} \leftarrow \text{CausalCrossWindowFocus}(\mathbf{x}_t^{\text{win}}, \mathbf{x}_t^{\text{prev}}, \mathbf{z}_t)$
        - $\mathbf{x}_f \leftarrow \text{CausalCrossWindowFocus}(\mathbf{x}_f, \mathbf{x}_f^{\text{prev}}, \mathbf{z}_f)$
  - Store for next window:
    - $\text{time\_outputs}.\text{append}(\mathbf{x}_t^{\text{win}})$
- Global refinement:
  - $\mathbf{x}_{\text{concat}} \leftarrow \text{Concat}(\text{time\_outputs})$
  - $\mathbf{x}_t^{\text{final}} \leftarrow \text{Conv}_{\text{final}}^{(t)}(\mathbf{x}_{\text{concat}})$
  - $\mathbf{x}_f^{\text{final}} \leftarrow \text{Conv}_{\text{final}}^{(f)}(\mathbf{x}_f)$
  - $\mathbf{x}_{\text{fused}} \leftarrow \text{GLU}([\mathbf{x}_t^{\text{final}}; \text{IFFT}(\mathbf{x}_f^{\text{final}})])$
  - $\mathbf{x}_{\text{smooth}} \leftarrow \text{Conv}_{\text{smooth}}(\mathbf{x}_{\text{fused}})$
  - $\mathbf{x}_{\text{refined}} \leftarrow \text{IFFT}(\text{TransformerRefinement}(\text{FFT}(\mathbf{x}_{\text{smooth}})))$
  - $\mathbf{x}_{\text{refined}} \leftarrow \mathbf{x}_{\text{refined}} - \mathbb{E}[\mathbf{x}_{\text{refined}}]$ \(DC removal\)
- Final activation:
  - $\mathbf{x}_{\text{out}} \leftarrow \text{SlidingSelfFocusActivation}(\mathbf{x}_{\text{refined}}, [\mathbf{z}_t; \mathbf{z}_f])$
- Return $\mathbf{x}_{\text{out}}$

### B\.7 Decoder Summary

Table [9](https://arxiv.org/html/2606.04106#A2.T9) summarizes the key innovations in the PlanFormer Decoder\.

Table 9: PlanFormer Decoder Key Innovations

| Component | Innovation |
| --- | --- |
| Latent\-Conditioned Reconstruction | Bottleneck latent injected at multiple decoder stages for instance\-wise, task\-specific reconstruction guidance based on physical structure |
| Attention\-Based Dynamic FiLM | Scaled Covariance Cross\-Focus for bidirectional, adaptive feature modulation with pointwise inner modulation and hierarchical residual connections |
| Frequency\-Domain Upsampling | Skip connection spectra fill in high\-fidelity frequency components rather than discrete time\-domain samples, preserving spectral envelope |
| Parseval Focus\-Based Skip Sinks | Unified mechanism combining noise removal, cross\-domain physical consistency, and conditional dynamic filter banks guided by reconstruction target |
| Causal Cross\-Window Focus | Bidirectional conditioning between bottleneck latent and window sequences for phase\-coherent reconstruction |
| Frequency\-Domain Transformer | Global spectral refinement with adaptive frequency binning for long\-range harmonic relationships |
| Sliding Window Self\-Focus Activation | Attention\-based hypernetwork generating per\-element complex\-valued modulation parameters with causal coherence and task\-specific guidance |

#### B\.7\.1 Synergy with Encoder

The decoder’s design is deeply integrated with the encoder architecture:

\(1\) Symmetric Dual\-Domain Processing: Both maintain parallel time/frequency branches with strategic fusion, ensuring consistent treatment of time\-frequency duality throughout the entire architecture\.

\(2\) Consistent Tokenization: Decoder respects encoder’s IQ\-grouped tokenization \(Section [A\.4](https://arxiv.org/html/2606.04106#A1.SS4)\), preserving complex\-valued relationships and enabling seamless FFT/IFFT operations\.

\(3\) Latent Conditioning: Encoder’s dual\-domain latent representations \(Section [A\.9](https://arxiv.org/html/2606.04106#A1.SS9)\) condition every decoder layer, providing continuous task\-specific guidance throughout reconstruction\.

\(4\) Skip Connection Compatibility: Decoder’s skip fusion mechanisms handle encoder’s noise\-sink\-processed skip connections, with additional target\-guided filtering to remove unwanted components based on the reconstruction objective\.

\(5\) Frequency Pooling/Upsampling Consistency: Both use frequency\-domain pooling \(encoder\) and upsampling \(decoder\) to preserve spectral envelopes, avoiding the spectral bias that would prevent learning high\-frequency equivariances\.

\(6\) Shared Focus Mechanisms: Both employ Scaled Covariance Focus and Multi\-Head Parseval Focus, ensuring consistent treatment of functional relationships and cross\-domain consistency throughout encoding and decoding\.

#### B\.7\.2 Design Philosophy: Physical Structure, Not Semantic Content

Critically, the decoder operates entirely within the physical domain\. The "target" encoded in the bottleneck latent is a specification of which physical structure to reconstruct, not a semantic category:

- Denoising: Latent encodes "reconstruct the signal component, not the noise component"
- Source Separation: Latent encodes "reconstruct speaker A’s voice characteristics, not speaker B’s"
- Signal Restoration: Latent encodes "reconstruct the clean signal structure, not the artifacts"

In all cases, the decoder is selecting and reconstructing physical features \(spectral content, temporal dynamics, phase relationships\), not learning semantic categories like "jazz" or "classical\."

#### B\.7\.3 Enabling Zero\-Shot Transfer

The decoder’s design directly enables the zero\-shot cross\-modal transfer demonstrated in the main paper \(Table [1](https://arxiv.org/html/2606.04106#S4.T1)\)\. By conditioning all reconstruction operations on the encoder’s domain\-invariant physical representations rather than domain\-specific patterns, the decoder can reconstruct signals from unseen domains:

- RF → Audio: Encoder captures temporal dynamics and spectral structure; decoder reconstructs audio waveforms respecting these physical constraints
- RF → Images: Encoder captures spatial frequency content \(via 1D unwrapping\); decoder reconstructs 2D structure from 1D representations
- Denoising/Separation: Skip sinks dynamically filter based on latent\-encoded targets, removing noise or interfering sources without domain\-specific training

This zero\-shot capability supports our central hypothesis: by learning physical structure rather than semantic content, both encoder and decoder generalize to unseen domains sharing similar physical transformations\.

#### B\.7\.4 Training\-Specific Outputs

During training, the decoder returns additional outputs for regularization:

Head Orthogonalization Loss: Aggregated across all multi\-head focus mechanisms:

$$\mathcal{L}_{\text{orth}}^{\text{decoder}}=\sum_{i}\mathcal{L}_{\text{orth}}^{(\text{FiLM},i)}+\sum_{i}\mathcal{L}_{\text{orth}}^{(\text{skip},i)}+\sum_{i}\mathcal{L}_{\text{orth}}^{(\text{transformer},i)}+\mathcal{L}_{\text{orth}}^{(\text{refinement})}+\mathcal{L}_{\text{orth}}^{(\text{activation})} \tag{139}$$

Skip Sink Outputs: For each decoder layer with skip connections, we return $\mathbf{x}_{\text{sink}}^{(i)}$ \(estimated noise/interference\) and $\mathbf{x}_{\text{post-sink}}^{(i)}$ \(cleaned skip connection\)\. These enable the skip sink regularization losses described in Section [B\.4\.3](https://arxiv.org/html/2606.04106#A2.SS4.SSS3)\.

## Appendix C: Auxiliary Networks and Training Objectives

The encoder and decoder architectures \(Appendices[A](https://arxiv.org/html/2606.04106#A1)and[B](https://arxiv.org/html/2606.04106#A2)\) provide the computational substrate for learning signal\-theoretic principles\. However, these architectures alone are insufficient—they must be trained with co\-designed loss functions and auxiliary networks that generate meaningful gradients for physical representation learning\. This appendix describes the auxiliary networks and training objectives that enable the encoder to learn domain\-invariant physical structure\.

### C\.1 Design Philosophy: Co\-Designed Auxiliary Networks

Our training methodology combines three complementary objectives, each requiring dedicated auxiliary networks:

1. Global Invariance Learning \(IsoFICReg\): Learn what should remain the same across transformations \(e\.g\., emitter identity despite noise\)
2. Equivariance Learning \(LED\): Learn how representations should change predictably under transformations \(e\.g\., frequency shifts\)
3. Instance\-Specific Learning \(Reconstruction\): Learn fine\-grained physical details through reconstruction fidelity

Each objective addresses different aspects of physical representation learning:

- IsoFICReg captures global semantic structure \(what the signal is\)
- LED captures transformation rules \(how the signal changes\)
- Reconstruction captures instance\-specific details \(how to reconstruct this particular signal\)

Together, these objectives ensure the encoder learns representations that are both globally informative and locally discriminative, enabling frozen encoder transfer to unseen domains\.

### C\.2 Augmentations and Transformation Encoding

We apply domain\-invariant transformations during training to learn equivariant and invariant symmetries\. Each transformation is encoded as a bounded code in $[0,1]$ to facilitate sigmoid\-activated regression heads\. These codes serve dual purposes: \(1\) conditioning tokens in the Latent Equivariant Transformer \(Section [C\.3\.2](https://arxiv.org/html/2606.04106#A3.SS3.SSS2)\), and \(2\) regression targets for equivariance losses \(Section [C\.3\.4](https://arxiv.org/html/2606.04106#A3.SS3.SSS4)\)\.

#### C\.2\.1 Augmentation Pipeline

Augmentations are applied in the following sequence:

1. Frequency Shift: Random $f_{o}\in[-0.33\cdot f_{s}/2,\,0.33\cdot f_{s}/2]$
2. Phase Rotation: Random $\phi\in[0,2\pi)$
3. IQ Flip: Random I/Q component flipping
4. Time Shift: Random $\tau\in[-0.25L,\,0.25L]$ samples
5. AWGN: Random SNR $\in[-10,100]$ dB
6. Unit Power Normalization: Scale to unit power
7. IQ Interleaving: Convert complex to real\-valued interleaved sequence

The transformation codes $\boldsymbol{\theta}=[\theta_{\text{freq}},\theta_{\text{phase}},\theta_{\text{h-flip}},\theta_{\text{v-flip}},\theta_{\text{time}}]$ are concatenated and used for latent conditioning and regression targets\. AWGN parameters \(noise power, signal power, SNR\) are returned for regularization losses but not used for equivariance learning\.

#### C\.2\.2 Frequency Shifting

Transformation: Apply frequency offset $f_{o}\in[-f_{\text{max}},f_{\text{max}}]$, where $f_{\text{max}}=0.33\cdot f_{s}/2$, via complex exponential multiplication followed by asymmetric anti\-aliasing filtering:

$$\mathbf{x}_{\text{shifted}}[n]=\text{AAF}\left(\mathbf{x}[n]\cdot e^{j2\pi f_{o}n/f_{s}},f_{o},f_{s}\right) \tag{140}$$

where the anti\-aliasing filter \(AAF\) applies asymmetric frequency\-domain masking based on the shift direction\. For complex baseband signals with asymmetric spectra:

- Positive shifts \($f_{o}>0$\): Suppress high positive frequencies beyond $(f_{s}/2-|f_{o}|)$ to prevent aliasing above Nyquist
- Negative shifts \($f_{o}<0$\): Suppress low negative frequencies beyond $-(f_{s}/2-|f_{o}|)$ to prevent aliasing below $-f_{s}/2$

This asymmetric filtering preserves the maximum signal bandwidth while preventing spectral wraparound after the frequency shift operation\. The implementation uses FFT\-based filtering with per\-sample adaptive cutoff frequencies computed as:

$$f_{\text{cutoff}}^{+}=\begin{cases}f_{s}/2-|f_{o}|&\text{if }f_{o}\geq 0\\ f_{s}/2&\text{if }f_{o}<0\end{cases},\quad f_{\text{cutoff}}^{-}=\begin{cases}f_{s}/2&\text{if }f_{o}\geq 0\\ f_{s}/2-|f_{o}|&\text{if }f_{o}<0\end{cases} \tag{141}$$

where $f_{\text{cutoff}}^{+}$ and $f_{\text{cutoff}}^{-}$ are the cutoff frequencies for positive and negative frequency components, respectively\.

Purpose: Learn frequency translation equivariance, i\.e\., representations should transform predictably when signal frequency content shifts\.

Transformation Code: Shift amount normalized to $[0,1]$:

$$\theta_{\text{freq}}=\frac{f_{o}+f_{s}/2}{f_{s}}\in[0,1] \tag{142}$$

The offset by $f_{s}/2$ ensures positive values for easier sigmoid regression\.

Learned Symmetry: Frequency translation equivariance enables the model to recognize that a signal shifted by $\Delta f$ in input space should produce a predictable transformation in latent space, independent of the signal’s original frequency content\. This generalizes to arbitrary sample rates and frequency ranges\.
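As a rough illustration of the quantities above, the following NumPy sketch applies the complex exponential multiplication inside Eq\. (140), evaluates the cutoff frequencies of Eq\. (141), and computes the code of Eq\. (142)\. The function names (`frequency_shift`, `aaf_cutoffs`, `freq_code`) and the toy tone are assumptions made for illustration; the FFT-based masking step itself is deliberately left out, since the source does not give its implementation details\.

```python
import numpy as np

def frequency_shift(x, f_o, f_s):
    """Multiply complex baseband samples by exp(j*2*pi*f_o*n/f_s) (inner term of Eq. 140)."""
    n = np.arange(len(x))
    return x * np.exp(1j * 2 * np.pi * f_o * n / f_s)

def aaf_cutoffs(f_o, f_s):
    """Return (f_cutoff_plus, f_cutoff_minus) following Eq. 141."""
    reduced = f_s / 2 - abs(f_o)
    return (reduced, f_s / 2) if f_o >= 0 else (f_s / 2, reduced)

def freq_code(f_o, f_s):
    """Transformation code of Eq. 142, bounded in [0, 1]."""
    return (f_o + f_s / 2) / f_s

# Toy check (hypothetical values): with f_s = 64 Hz and 64 samples, bins are 1 Hz apart.
f_s, N = 64.0, 64
tone = np.exp(1j * 2 * np.pi * 4.0 * np.arange(N) / f_s)  # tone in FFT bin 4
shifted = frequency_shift(tone, 8.0, f_s)                  # expected to land in bin 12
peak_bin = int(np.argmax(np.abs(np.fft.fft(shifted))))
```

With the stated shift range $|f_{o}|\leq 0.33\cdot f_{s}/2$, `freq_code` stays well inside $[0,1]$, which is consistent with using a sigmoid regression head\.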

#### C\.2\.3 Phase Rotation

Transformation: Apply random phase rotation $\phi\in[0,2\pi)$:

$$x_{\text{rotated}}[n]=x[n]\cdot e^{j\phi} \tag{143}$$

Purpose: Learn phase rotation equivariance, i\.e\., representations should transform predictably under global phase rotations\. Unlike phase invariance, we learn how representations change with phase, enabling the model to track phase relationships when needed while remaining robust to arbitrary phase offsets\.

Transformation Code: Phase normalized to $[0,1]$:

$$\theta_{\text{phase}}=\frac{\phi}{2\pi}\in[0,1] \tag{144}$$

For regression, we use the circular variants of the regression losses \(Section [C\.3\.4\.2](https://arxiv.org/html/2606.04106#A3.SS3.SSS4.P2)\)\.

Cross\-Domain Generalization: Phase rotation equivariance enables the model to handle signals with arbitrary time alignment \(phase offset in time domain $\approx$ phase rotation in frequency domain\), generalizing to speech processing where utterances have arbitrary start times\.

#### C\.2\.4 IQ Flipping \(Spectral Inversion\)

Transformation: Randomly flip real \(I\) and/or imaginary \(Q\) components:

$$x_{\text{flipped}}[n]=\begin{cases}x[n]&\text{no flip (25\% probability)}\\ -\text{Re}(x[n])+j\,\text{Im}(x[n])&\text{horizontal flip (25\%)}\\ \text{Re}(x[n])-j\,\text{Im}(x[n])&\text{vertical flip (25\%)}\\ -\text{Re}(x[n])-j\,\text{Im}(x[n])&\text{both flips (25\%)}\end{cases} \tag{145}$$

Purpose: Learn equivariance to spectral inversion and conjugation\. In RF systems, I/Q swapping can occur due to hardware configurations or sideband selection\. This augmentation ensures representations are robust to such inversions\.

Transformation Codes: Two binary indicators:

$$\theta_{\text{h-flip}}\in\{0,1\}\quad\text{(horizontal flip applied)} \tag{146}$$

$$\theta_{\text{v-flip}}\in\{0,1\}\quad\text{(vertical flip applied)} \tag{147}$$

Cross\-Domain Generalization: Spectral inversion equivariance generalizes to domains where sign conventions or orientation may vary \(e\.g\., image flipping, audio polarity inversion\)\.
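The four cases of Eq\. (145) can be sketched with two independent sign flips\. This is only an illustrative sketch: the helper names `iq_flip` and `random_iq_flip` are hypothetical, and drawing the two indicators as independent fair bits is an assumption that happens to reproduce the stated 25% probability per case\.

```python
import numpy as np

def iq_flip(x, h_flip, v_flip):
    """Apply Eq. 145: h_flip negates the real (I) part, v_flip negates the imaginary (Q) part."""
    re = -x.real if h_flip else x.real
    im = -x.imag if v_flip else x.imag
    return re + 1j * im

def random_iq_flip(x, rng):
    """Draw two fair bits (assumed independent) and return the flipped signal and its codes."""
    h_flip, v_flip = rng.integers(0, 2, size=2)
    codes = np.array([h_flip, v_flip], dtype=float)  # [theta_h-flip, theta_v-flip]
    return iq_flip(x, h_flip, v_flip), codes

x = np.array([1 + 2j, -3 + 0.5j])
flipped, codes = random_iq_flip(x, np.random.default_rng(0))
```

Note that the vertical flip alone equals complex conjugation, and applying both flips equals negating the signal\.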

#### C\.2\.5 Time Shifting

Transformation: Apply random time shift $\tau\in[-\tau_{\max},\tau_{\max}]$, where $\tau_{\max}=0.25\cdot L_{\text{seq}}$ \(25% of sequence length\)\. Shifted regions are padded with complex Gaussian noise at SNR = 70 dB:

$$x_{\text{shifted}}[n]=\begin{cases}x[n-\tau]&\text{if }0\leq n-\tau<L_{\text{seq}}\\ \mathcal{CN}(0,\sigma_{\text{noise}}^{2})&\text{otherwise (padding)}\end{cases} \tag{148}$$

Purpose: Learn time translation equivariance, i\.e\., representations should transform predictably when signals are temporally shifted\. This is fundamental for processing signals with arbitrary time alignment and for learning causal structure independent of absolute time reference\.

Transformation Code: Shift amount normalized to $[0,1]$:

$$\theta_{\text{time}}=\frac{\tau/\tau_{\max}+1}{2}\in[0,1] \tag{149}$$

This maps $[-\tau_{\max},\tau_{\max}]$ to $[0,1]$ for sigmoid regression\.

Learned Symmetry: Time translation equivariance enables the model to recognize temporal patterns regardless of their position in the sequence\. This generalizes to variable\-length signals and enables processing signals with arbitrary time alignment without retraining\.
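A minimal sketch of Eq\. (148) and Eq\. (149) follows\. The names `time_shift` and `time_code` are hypothetical, and the padding noise variance is assumed to be set relative to the mean power of the input (the text states only that the padding SNR is 70 dB)\.

```python
import numpy as np

def time_shift(x, tau, rng, pad_snr_db=70.0):
    """Shift x by an integer tau (Eq. 148); vacated samples get low-power complex noise."""
    L = len(x)
    p_signal = np.mean(np.abs(x) ** 2)
    sigma2 = p_signal / 10 ** (pad_snr_db / 10)  # assumption: noise power relative to signal power
    # CN(0, sigma2): real and imaginary parts each carry half of the variance
    out = np.sqrt(sigma2 / 2) * (rng.standard_normal(L) + 1j * rng.standard_normal(L))
    src = np.arange(L) - tau                     # source index n - tau for every output index n
    valid = (src >= 0) & (src < L)
    out[valid] = x[src[valid]]
    return out

def time_code(tau, tau_max):
    """Transformation code of Eq. 149, mapping [-tau_max, tau_max] to [0, 1]."""
    return (tau / tau_max + 1) / 2

x = np.arange(1, 9).astype(complex)
delayed = time_shift(x, 2, np.random.default_rng(0))    # first two samples become padding
advanced = time_shift(x, -2, np.random.default_rng(0))  # last two samples become padding
```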

#### C\.2\.6 Additive White Gaussian Noise \(AWGN\)

Transformation: Add complex Gaussian noise at target SNR $\in[-10,100]$ dB:

$$P_{\text{signal}}=\mathbb{E}[|x[n]|^{2}] \tag{150}$$

$$P_{\text{noise}}=P_{\text{signal}}/10^{\text{SNR}/10} \tag{151}$$

$$x_{\text{noisy}}[n]=x[n]+\sqrt{P_{\text{noise}}}\cdot\mathcal{CN}(0,1) \tag{152}$$

where $\mathcal{CN}(0,1)$ is complex Gaussian noise with unit variance\.

Purpose: Learn noise invariance, i\.e\., representations should remain unchanged despite additive noise\. Unlike the equivariant transformations, AWGN is a destructive transformation: noise does not preserve signal structure and should be removed rather than tracked\.

No Transformation Code: AWGN parameters \(noise power, signal power, SNR\) are returned for use in regularization losses \(e\.g\., Noise Sink power matching, Section [A\.4](https://arxiv.org/html/2606.04106#A1.SS4)\) but are not used for equivariance learning or latent conditioning\. The model learns noise invariance through IsoFICReg \(Section [C\.5](https://arxiv.org/html/2606.04106#A3.SS5)\): noisy versions of the same signal should produce identical representations\.

Negative\-SNR Regime: The SNR range extends to \-10 dB, enabling learning in regimes where noise power exceeds signal power\. Combined with the Noise Sink mechanism \(Section [A\.4](https://arxiv.org/html/2606.04106#A1.SS4)\) and Latent Coherent Integration \(Section [C\.5](https://arxiv.org/html/2606.04106#A3.SS5)\), this enables robust representation learning even when signal is deeply buried in noise\.

Cross\-Domain Generalization: Noise invariance learned on RF generalizes to any domain with additive noise: sensor noise in images, background noise in audio, measurement noise in seismology\. The model learns to extract signal structure independent of noise level\.
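Eqs\. (150) to (152) translate almost line by line into NumPy\. The sketch below is illustrative only; the function name `add_awgn` and the returned tuple layout are assumptions, chosen to mirror the statement that signal and noise power are returned for regularization losses\.

```python
import numpy as np

def add_awgn(x, snr_db, rng):
    """Add complex white Gaussian noise at a target SNR in dB (Eqs. 150-152)."""
    p_signal = np.mean(np.abs(x) ** 2)                 # Eq. 150
    p_noise = p_signal / 10 ** (snr_db / 10)           # Eq. 151
    # Unit-variance complex Gaussian: real and imaginary parts have variance 1/2 each
    unit_noise = (rng.standard_normal(x.shape) + 1j * rng.standard_normal(x.shape)) / np.sqrt(2)
    x_noisy = x + np.sqrt(p_noise) * unit_noise        # Eq. 152
    return x_noisy, p_signal, p_noise

rng = np.random.default_rng(0)
x = np.exp(1j * 2 * np.pi * 0.01 * np.arange(200_000))  # unit-power complex tone
x_noisy, p_signal, p_noise = add_awgn(x, 10.0, rng)
measured_snr_db = 10 * np.log10(p_signal / np.mean(np.abs(x_noisy - x) ** 2))
```

At the lower end of the stated range, \-10 dB, the same formula gives a noise power ten times the signal power, which is the negative\-SNR regime described above\.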

#### C\.2\.7 Unit Power Normalization

After all augmentations, signals are normalized to unit power:

$$x_{\text{norm}}[n]=\frac{x[n]}{\sqrt{\mathbb{E}[|x[n]|^{2}]}} \tag{153}$$

Purpose: Ensure consistent signal power across batches, preventing the model from learning spurious correlations with absolute power levels\. This normalization is applied after AWGN injection, so the model learns representations at a fixed signal power but varying noise levels \(controlled by SNR\)\.

Not an Augmentation: This is a preprocessing step rather than an augmentation: it does not introduce variability but standardizes the input distribution\.

#### C\.2\.8 IQ Interleaving \(Sample Format\)

Complex\-valued signals are converted to real\-valued interleaved sequences:

$$x_{\text{interleaved}}=[\text{Re}(x[0]),\text{Im}(x[0]),\text{Re}(x[1]),\text{Im}(x[1]),\ldots] \tag{154}$$

Purpose: Enable processing with real\-valued neural network operations while preserving complex\-valued structure\. The interleaving ensures that I and Q components remain paired throughout processing, facilitating seamless FFT/IFFT operations \(Section [A\.4](https://arxiv.org/html/2606.04106#A1.SS4)\)\.

Not an Augmentation: This is a format conversion rather than an augmentation: it preserves all information while changing representation\.
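The last two pipeline steps (unit power normalization, Eq\. (153), and IQ interleaving, Eq\. (154)) can be sketched as follows\. The helper names are hypothetical, and `deinterleave_iq` is added here only to show that the format conversion is lossless\.

```python
import numpy as np

def unit_power(x):
    """Scale a complex signal to unit mean power (Eq. 153)."""
    return x / np.sqrt(np.mean(np.abs(x) ** 2))

def interleave_iq(x):
    """Convert complex samples to [Re x[0], Im x[0], Re x[1], Im x[1], ...] (Eq. 154)."""
    out = np.empty(2 * len(x), dtype=float)
    out[0::2] = x.real
    out[1::2] = x.imag
    return out

def deinterleave_iq(v):
    """Inverse of interleave_iq: rebuild the complex samples."""
    return v[0::2] + 1j * v[1::2]

x = np.array([1 + 2j, 3 - 4j])
x_norm = unit_power(x)
x_real = interleave_iq(x)   # array([ 1.,  2.,  3., -4.])
```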

#### C\.2\.9 Symmetry Learning: Equivariance vs\. Invariance

The augmentation pipeline encodes two types of symmetries:

Equivariant Symmetries \(Structure\-Preserving Transformations\):

- Frequency translation: Shift by $\Delta f$ → predictable latent transformation
- Phase rotation: Rotate by $\phi$ → predictable latent transformation
- Spectral inversion: Flip I/Q → predictable latent transformation
- Time translation: Shift by $\tau$ → predictable latent transformation

These are learned through LED \(Section [C\.3](https://arxiv.org/html/2606.04106#A3.SS3)\): the model learns transformation rules by predicting how representations change under these operations\.

Invariant Symmetry \(Destructive Transformation\):

- Noise: Add AWGN → representation unchanged

This is learned through IsoFICReg \(Section [C\.5](https://arxiv.org/html/2606.04106#A3.SS5)\): noisy versions of the same signal should produce identical representations\.

Cross\-Domain Transfer Mechanism: By learning these domain\-invariant symmetries on RF data, the model acquires transformation rules that generalize to any domain respecting the same mathematical structure:

- Frequency shift in RF $\approx$ frequency shift in audio
- Time shift in RF $\approx$ time shift in seismology
- Phase rotation in RF $\approx$ time alignment in speech
- Noise in RF $\approx$ sensor noise in images

This explains the cross\-modal transfer demonstrated in Table [1](https://arxiv.org/html/2606.04106#S4.T1): the model learned physical transformation rules on RF that apply generally to structured signals, enabling strong frozen\-encoder transfer on unseen domains without fine\-tuning the encoder on target domains\.

### C\.3 Latent Equivariant Differences \(LED\)

#### C\.3\.1 Theoretical Foundation

Equivariant symmetry learning aims to ensure representations transform predictably under domain\-invariant operations\. Formally, for a transformation $g$ and representation function $f$, equivariance requires:

$$g(f(x))=f(g(x)) \tag{155}$$

In contrast, invariant symmetry \(for destructive distortions like noise\) requires that transforming the input leaves the representation unchanged:

$$f(g(x))=f(x) \tag{156}$$

We learn both symmetries through a dual\-branch Joint Embedding Predictive Architecture \(JEPA\):

Clean Branch: Processes raw signals with no transformations, producing $\mathbf{z}_{\text{clean}}$

Augmented Branch: Processes signals with applied transformations \(frequency shifts, time shifts, phase rotations, AWGN\), producing $\mathbf{z}_{\text{aug}}$

The key insight: transformations for which we want equivariance are explicitly modeled in latent space via a Latent Equivariant Transformer, while transformations for which we want invariance \(noise\) are handled implicitly through the invariance criterion \(Section [C\.5](https://arxiv.org/html/2606.04106#A3.SS5)\)\.

#### C\.3\.2 Latent Equivariant Transformer

The Latent Equivariant Transformer applies transformations directly to encoded token representations, enabling the model to learn how latent features should change under input\-space transformations\.

##### C\.3\.2\.1 Architecture

Given clean branch tokens $\mathbf{T}_{\text{clean}}\in\mathbb{R}^{B\times N\times d}$ and transformation parameters $\boldsymbol{\theta}\in\mathbb{R}^{B\times d_{\theta}}$ \(e\.g\., frequency shift amount, time shift amount, phase rotation angle\), the Latent Equivariant Transformer produces transformed tokens $\mathbf{T}_{\text{latent\_transform}}\in\mathbb{R}^{B\times N\times d}$\.

Step 1: Parameter Projection and Prepending

Transformation parameters are linearly projected to token dimension and prepended as conditioning tokens:

$$\boldsymbol{\theta}_{\text{token}}=\mathbf{W}_{\theta}\boldsymbol{\theta}+\mathbf{b}_{\theta}\in\mathbb{R}^{B\times d} \tag{157}$$

$$\mathbf{T}_{\text{aug}}=[\boldsymbol{\theta}_{\text{token}};\mathbf{T}_{\text{clean}}]\in\mathbb{R}^{B\times(N+1)\times d} \tag{158}$$

Rationale: Prepending the transformation token \(rather than cross\-attention\) enables bidirectional communication between the transformation and all latent tokens\. Each token’s transformation is informed by neighboring context: the self\-attention mechanism allows tokens to understand how their transformation affects and is affected by other tokens in the sequence\. This is critical for transformations like frequency shifts where the entire spectrum shifts coherently: each frequency bin’s transformation must be coordinated with all other bins\. Cross\-attention would create one\-way communication \(transformation → tokens\) without allowing tokens to coordinate their transformations with each other\.

Step 2: Parseval Transformer Processing

The augmented sequence is processed through a Parseval Transformer block \(Section [A\.7](https://arxiv.org/html/2606.04106#A1.SS7)\) to apply the latent transformation:

$$\mathbf{T}_{\text{transformed}},\mathcal{L}_{\text{orth}},\mathcal{L}_{\text{Parseval}},\mathcal{L}_{\text{diversity}}=\text{ParsevalTransformer}(\mathbf{T}_{\text{aug}}) \tag{159}$$

The conditioning token is removed after transformation:

$$\mathbf{T}_{\text{latent\_transform}}=\mathbf{T}_{\text{transformed}}[:,1:,:]\in\mathbb{R}^{B\times N\times d} \tag{160}$$

Step 3: Sequence Pooling and Cross\-Domain Fusion

Transformed tokens undergo the same processing as encoder tokens \(Section [A\.8](https://arxiv.org/html/2606.04106#A1.SS8)\):

$$\mathbf{z}_{\text{latent\_transform}}^{(t)}=\text{AttentionalPool}(\mathbf{T}_{\text{latent\_transform}}^{(t)}) \tag{161}$$

$$\mathbf{z}_{\text{latent\_transform}}^{(f)}=\text{AttentionalPool}(\mathbf{T}_{\text{latent\_transform}}^{(f)}) \tag{162}$$

$$\mathbf{z}_{\text{latent\_transform}}=\text{CrossDomainFusion}(\mathbf{z}_{\text{latent\_transform}}^{(t)},\mathbf{z}_{\text{latent\_transform}}^{(f)}) \tag{163}$$

Domain\-Specific Transformers: Separate Latent Equivariant Transformers are used for time and frequency domains to enable domain\-specific transformation learning if necessary\.
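The tensor shapes in Steps 1 and 2 can be traced with a short NumPy sketch\. This is a shape\-level illustration only: `latent_transform` is a hypothetical helper, the weight matrix is stored as $(d_{\theta}, d)$ so that the projection is written `theta @ W`, and `block` is a stand\-in callable for the Parseval Transformer, whose internals and auxiliary losses are not modeled here\.

```python
import numpy as np

def latent_transform(tokens, theta, W, b, block):
    """Prepend a projected conditioning token, apply a sequence block, then drop the token.

    tokens: (B, N, d) clean-branch tokens
    theta:  (B, d_theta) transformation codes
    W, b:   projection weights (d_theta, d) and bias (d,)   -- Eq. 157
    block:  placeholder callable mapping (B, N+1, d) -> (B, N+1, d)
    """
    theta_token = theta @ W + b                                   # (B, d)
    seq = np.concatenate([theta_token[:, None, :], tokens], 1)    # (B, N+1, d), Eq. 158
    out = block(seq)                                              # stand-in for Eq. 159
    return out[:, 1:, :]                                          # (B, N, d), Eq. 160

rng = np.random.default_rng(0)
B, N, d, d_theta = 2, 3, 4, 5
tokens = rng.standard_normal((B, N, d))
theta = rng.uniform(0.0, 1.0, size=(B, d_theta))
W, b = rng.standard_normal((d_theta, d)), np.zeros(d)
transformed = latent_transform(tokens, theta, W, b, block=lambda s: s)  # identity placeholder
```

With the identity placeholder the tokens come back unchanged; in the described architecture the block's self\-attention would mix the conditioning token with every latent token before the token is stripped\.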

#### C\.3\.3 Equivariance Learning through Differences

After processing through the encoder and Latent Equivariant Transformer, we have three representations:

$$\mathbf{z}_{\text{clean}}\in\mathbb{R}^{B\times d}\quad\text{(clean branch, no transformations)} \tag{164}$$

$$\mathbf{z}_{\text{latent\_transform}}\in\mathbb{R}^{B\times d}\quad\text{(clean branch, latent-space transformation)} \tag{165}$$

$$\mathbf{z}_{\text{input\_transform}}\in\mathbb{R}^{B\times d}\quad\text{(augmented branch, input-space transformation)} \tag{166}$$

To learn equivariance, we measure the transformation process rather than just the output\. We compute differences after the sequence pooling but before the expander/projection networks used for IsoFICReg\.

The differences capture "how the representation changed" \(here $\mathbf{z}_{\text{latent}}$ and $\mathbf{z}_{\text{input}}$ abbreviate $\mathbf{z}_{\text{latent\_transform}}$ and $\mathbf{z}_{\text{input\_transform}}$\):

$$\Delta_{\text{latent}}=\mathbf{z}_{\text{latent}}-\mathbf{z}_{\text{clean}}\in\mathbb{R}^{B\times d} \tag{167}$$

$$\Delta_{\text{input}}=\mathbf{z}_{\text{input}}-\mathbf{z}_{\text{clean}}\in\mathbb{R}^{B\times d} \tag{168}$$

Comprehensive Time\-Frequency Equivariance: To learn equivariances across both domains, we concatenate time and frequency differences:

$$\Delta_{\text{latent}}^{\text{full}}=[\Delta_{\text{latent}}^{(t)};\Delta_{\text{latent}}^{(f)}]\in\mathbb{R}^{B\times 2d} \tag{169}$$

$$\Delta_{\text{input}}^{\text{full}}=[\Delta_{\text{input}}^{(t)};\Delta_{\text{input}}^{(f)}]\in\mathbb{R}^{B\times 2d} \tag{170}$$
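The difference construction of Eqs\. (167) to (170) reduces to a subtraction and a concatenation along the feature axis\. The sketch below uses a hypothetical helper `led_differences` and random arrays in place of real pooled representations\.

```python
import numpy as np

def led_differences(z_clean_t, z_clean_f, z_latent_t, z_latent_f, z_input_t, z_input_f):
    """Return (delta_latent_full, delta_input_full), each of shape (B, 2d)."""
    delta_latent = np.concatenate([z_latent_t - z_clean_t, z_latent_f - z_clean_f], axis=-1)
    delta_input = np.concatenate([z_input_t - z_clean_t, z_input_f - z_clean_f], axis=-1)
    return delta_latent, delta_input

rng = np.random.default_rng(0)
B, d = 4, 8
z = [rng.standard_normal((B, d)) for _ in range(6)]  # placeholder pooled representations
delta_latent_full, delta_input_full = led_differences(*z)
```

Both differences share the same clean reference, so a latent\-space transformation that exactly mimics the input\-space one would make the two outputs identical\.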

#### C\.3\.4 Equivariance Loss: Focal Huber Regression

The equivariance loss enforces three constraints using focal\-weighted Huber regression:

\(1\) Latent transformation matches ground truth:

$$\mathcal{L}_{\text{latent}}=\text{FocalHuber}(f_{ep}(\Delta_{\text{latent}}^{\text{full}}),\boldsymbol{\theta}_{\text{true}}) \tag{171}$$

\(2\) Input transformation matches ground truth:

$$\mathcal{L}_{\text{input}}=\text{FocalHuber}(f_{ep}(\Delta_{\text{input}}^{\text{full}}),\boldsymbol{\theta}_{\text{true}}) \tag{172}$$

\(3\) Both transformations are consistent:

$$\mathcal{L}_{\text{consistency}}=\text{FocalHuber}(\Delta_{\text{latent}}^{\text{full}},\Delta_{\text{input}}^{\text{full}}) \tag{173}$$

The total equivariance loss is:

$$\mathcal{L}_{\text{Equi}}=\mathcal{L}_{\text{latent}}+\mathcal{L}_{\text{input}}+\mathcal{L}_{\text{consistency}} \tag{174}$$
##### C\.3\.4\.1 Focal\-Weighted Huber Loss

To emphasize difficult instances \(small shifts, low SNR\) while maintaining robustness to outliers, we combine Huber loss with focal reweighting:

$$\text{Huber}_{\delta}(y,\hat{y})=\begin{cases}\frac{1}{2}(y-\hat{y})^{2}&\text{if }|y-\hat{y}|\leq\delta\\ \delta\cdot(|y-\hat{y}|-\frac{1}{2}\delta)&\text{otherwise}\end{cases} \tag{175}$$

$$d_{\text{local}}=\text{mean}(\text{Huber}_{\delta}(y,\hat{y}),\text{dim}=-1) \tag{176}$$

$$d_{\text{norm}}=\frac{d_{\text{local}}}{\text{mean}(d_{\text{local}})+\epsilon} \tag{177}$$

$$\sigma_{\text{global}}=\text{std}(d_{\text{local}}) \tag{178}$$

$$\gamma_{\text{adaptive}}=\gamma\cdot(1+\log(1+\sigma_{\text{global}})) \tag{179}$$

$$w_{\text{focal}}=\max(0.1,d_{\text{norm}}^{\gamma_{\text{adaptive}}}) \tag{180}$$

$$\text{FocalHuber}(y,\hat{y})=\text{mean}(w_{\text{focal}}\cdot d_{\text{local}}) \tag{181}$$

where $\delta=0.05$ \(Huber threshold\), $\gamma=2.0$ \(focal exponent\), and $\epsilon=10^{-8}$ \(numerical stability\)\. The focal weight is clamped to $[0.1,\infty)$ to prevent complete down\-weighting of easy examples while emphasizing hard cases\.

Rationale: Huber loss provides robustness to outliers \(e\.g\., occasional large prediction errors during early training\), while focal reweighting prevents overfitting to trivial transformations \(e\.g\., 0 Hz shift, high SNR\)\. The adaptive $\gamma$ adjusts emphasis based on batch difficulty\.

Note: Empirically, this batch\-normalized variant of focal reweighting worked best for regression tasks relative to the focal reweighting used for IsoFICReg\. The two are similar in nature but differ in minor implementation details, which are mainly motivated by the scale differences of the respective errors\.
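Eqs\. (175) to (181) can be followed step by step in NumPy\. The sketch below is a forward\-pass illustration under stated assumptions: inputs are shaped $(B, D)$, the standard deviation is the population standard deviation (NumPy's default), and the function name `focal_huber` is hypothetical\. Whether gradients flow through the focal weights during training is not specified in the text and is not modeled here\.

```python
import numpy as np

def focal_huber(y, y_hat, delta=0.05, gamma=2.0, eps=1e-8):
    """Focal-weighted Huber loss for arrays of shape (B, D), following Eqs. 175-181."""
    err = np.abs(y - y_hat)
    huber = np.where(err <= delta, 0.5 * err ** 2, delta * (err - 0.5 * delta))  # Eq. 175
    d_local = huber.mean(axis=-1)                             # Eq. 176: one value per sample
    d_norm = d_local / (d_local.mean() + eps)                 # Eq. 177: batch-normalized error
    sigma_global = d_local.std()                              # Eq. 178
    gamma_adaptive = gamma * (1 + np.log1p(sigma_global))     # Eq. 179
    w_focal = np.maximum(0.1, d_norm ** gamma_adaptive)       # Eq. 180: clamp to [0.1, inf)
    return float(np.mean(w_focal * d_local))                  # Eq. 181

y_true = np.zeros((2, 1))
y_pred = np.array([[0.01], [0.2]])   # one easy sample, one hard sample
loss = focal_huber(y_true, y_pred)
```

In this toy batch the hard sample has a normalized error near 2, so its weight is roughly 4, while the easy sample is clamped to the 0\.1 floor; the resulting loss is therefore larger than the plain mean Huber error\.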

##### C\.3\.4\.2 Circular Phase Regression

For phase rotation parameters $\phi\in[0,2\pi)$, we account for circular periodicity using modulo wrapping rather than sin/cos decomposition\. The prediction head outputs $\hat{\phi}\in[0,1]$ \(via sigmoid activation\), which is scaled back to $[0,2\pi)$:

$$\hat{\phi}_{\text{scaled}}=2\pi\cdot\hat{\phi} \tag{182}$$

The circular\-aware loss wraps the error to $[-\pi,\pi)$:

$$\Delta_{\phi}=\left((\phi_{\text{true}}-\hat{\phi}_{\text{scaled}}+\pi)\bmod 2\pi\right)-\pi \tag{183}$$

$$\mathcal{L}_{\text{phase}}=\text{FocalHuber}(\Delta_{\phi},0) \tag{184}$$

This ensures $\phi=0$ and $\phi=2\pi$ are treated as equivalent, avoiding discontinuities at the circular boundary\.
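The wrapping in Eqs\. (182) and (183) can be checked numerically with a few lines of NumPy\. The helper name `wrapped_phase_error` is hypothetical; the example picks a true phase just above 0 and a prediction just below $2\pi$, which are close on the circle even though their raw difference is almost $2\pi$\.

```python
import numpy as np

def wrapped_phase_error(phi_true, phi_hat_unit):
    """Scale a [0, 1] prediction to radians (Eq. 182) and wrap the error (Eq. 183)."""
    phi_hat_scaled = 2 * np.pi * phi_hat_unit
    return np.mod(phi_true - phi_hat_scaled + np.pi, 2 * np.pi) - np.pi

phi_true = 0.1                                   # radians, just past 0
phi_hat_unit = (2 * np.pi - 0.1) / (2 * np.pi)   # prediction just below 2*pi
raw_error = phi_true - 2 * np.pi * phi_hat_unit  # about 0.2 - 2*pi
wrapped = wrapped_phase_error(phi_true, phi_hat_unit)  # about 0.2
```

The wrapped error is what would then be passed to the focal Huber loss in Eq\. (184), so predictions on either side of the boundary are penalized by their angular distance rather than by their raw numeric gap\.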

#### C\.3\.5Summary: LED Pipeline

Algorithm[10](https://arxiv.org/html/2606.04106#alg10)summarizes the complete LED forward pass\.

Algorithm 10: Latent Equivariant Differences (LED)

Input: Clean signal $\mathbf{x}_{\text{clean}}$, augmented signal $\mathbf{x}_{\text{aug}}$

Input: Transformation parameters $\boldsymbol{\theta}_{\text{true}}$

Output: Equivariance loss $\mathcal{L}_{\text{Equi}}$

2: {Encode clean and augmented branches}

3: $\mathbf{z}_{\text{clean}}^{(t)}, \mathbf{z}_{\text{clean}}^{(f)} \leftarrow \text{Encoder}(\mathbf{x}_{\text{clean}})$

4: $\mathbf{z}_{\text{input}}^{(t)}, \mathbf{z}_{\text{input}}^{(f)} \leftarrow \text{Encoder}(\mathbf{x}_{\text{aug}})$

6: {Apply latent transformation}

7: $\mathbf{z}_{\text{latent}}^{(t)} \leftarrow \text{LatentEquivariantTransformer}^{(t)}(\mathbf{z}_{\text{clean}}^{(t)}, \boldsymbol{\theta}_{\text{true}})$

8: $\mathbf{z}_{\text{latent}}^{(f)} \leftarrow \text{LatentEquivariantTransformer}^{(f)}(\mathbf{z}_{\text{clean}}^{(f)}, \boldsymbol{\theta}_{\text{true}})$

10: {Compute differences}

11: $\Delta_{\text{latent}}^{\text{full}} \leftarrow [(\mathbf{z}_{\text{latent}}^{(t)} - \mathbf{z}_{\text{clean}}^{(t)}); (\mathbf{z}_{\text{latent}}^{(f)} - \mathbf{z}_{\text{clean}}^{(f)})]$

12: $\Delta_{\text{input}}^{\text{full}} \leftarrow [(\mathbf{z}_{\text{input}}^{(t)} - \mathbf{z}_{\text{clean}}^{(t)}); (\mathbf{z}_{\text{input}}^{(f)} - \mathbf{z}_{\text{clean}}^{(f)})]$

14: {Predict transformation parameters}

15: $\hat{\boldsymbol{\theta}}_{\text{latent}} \leftarrow f_{ep}(\Delta_{\text{latent}}^{\text{full}})$

16: $\hat{\boldsymbol{\theta}}_{\text{input}} \leftarrow f_{ep}(\Delta_{\text{input}}^{\text{full}})$

18: {Compute equivariance loss}

19: $\mathcal{L}_{\text{latent}} \leftarrow \text{FocalHuber}(\hat{\boldsymbol{\theta}}_{\text{latent}}, \boldsymbol{\theta}_{\text{true}})$

20: $\mathcal{L}_{\text{input}} \leftarrow \text{FocalHuber}(\hat{\boldsymbol{\theta}}_{\text{input}}, \boldsymbol{\theta}_{\text{true}})$

21: $\mathcal{L}_{\text{consistency}} \leftarrow \text{FocalHuber}(\Delta_{\text{latent}}^{\text{full}}, \Delta_{\text{input}}^{\text{full}})$

22: $\mathcal{L}_{\text{Equi}} \leftarrow \mathcal{L}_{\text{latent}} + \mathcal{L}_{\text{input}} + \mathcal{L}_{\text{consistency}}$

24: return $\mathcal{L}_{\text{Equi}}$
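The data flow of Algorithm 10 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the encoder, the latent transformers, and the predictor $f_{ep}$ are passed in as callables, and the toy stand-ins below (`toy_encoder`, `toy_shift`, `toy_predictor`) are hypothetical, not the paper's modules. A plain Huber loss stands in for FocalHuber, whose definition lies outside this section.

```python
import numpy as np


def huber(pred, target, delta=1.0):
    """Plain Huber loss; a stand-in for FocalHuber, which is defined elsewhere."""
    err = np.abs(np.asarray(pred) - np.asarray(target))
    quad = np.minimum(err, delta)
    return float(np.mean(0.5 * quad**2 + delta * (err - quad)))


def led_loss(x_clean, x_aug, theta_true, encoder, latent_t, latent_f,
             predictor, loss_fn=huber):
    # Steps 3-4: encode both branches into time (t) and frequency (f) latents.
    zc_t, zc_f = encoder(x_clean)
    zi_t, zi_f = encoder(x_aug)
    # Steps 7-8: apply the known transformation in latent space.
    zl_t = latent_t(zc_t, theta_true)
    zl_f = latent_f(zc_f, theta_true)
    # Steps 11-12: differences from the clean latents, concatenated over domains.
    d_latent = np.concatenate([zl_t - zc_t, zl_f - zc_f], axis=-1)
    d_input = np.concatenate([zi_t - zc_t, zi_f - zc_f], axis=-1)
    # Steps 15-16: predict the transformation parameters from each difference.
    th_latent = predictor(d_latent)
    th_input = predictor(d_input)
    # Steps 19-22: two parameter-regression terms plus a consistency term.
    return (loss_fn(th_latent, theta_true)
            + loss_fn(th_input, theta_true)
            + loss_fn(d_latent, d_input))


# Hypothetical toy components: the "transformation" is an additive offset.
def toy_encoder(x):
    return x, x  # the same features reused as time and frequency latents


def toy_shift(z, theta):
    return z + theta


def toy_predictor(delta):
    return delta.mean(axis=-1, keepdims=True)


x_clean = np.arange(12.0).reshape(3, 4)
theta_true = np.array([[1.0], [2.0], [-1.0]])
x_aug = x_clean + theta_true

loss_matched = led_loss(x_clean, x_aug, theta_true,
                        toy_encoder, toy_shift, toy_shift, toy_predictor)
loss_mismatched = led_loss(x_clean, x_aug, theta_true, toy_encoder,
                           lambda z, th: z + 2.0 * th, toy_shift, toy_predictor)
```

In this toy setting the loss vanishes when the latent transformation reproduces the effect of the input-space augmentation, and becomes positive when it does not.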

### C.4 Token-Level Source Separation

#### C.4.1 Motivation and Placement

Token-level source separation provides local discriminative learning to complement global objectives (IsoFICReg, LED). The Latent Source Separation Transformer operates after the main encoder's Parseval Transformer blocks, processing tokens that have already undergone dual-domain focus and cross-domain fusion.

Processing Flow:

1. Encoder: Convolutional tokenization $\rightarrow$ Parseval Transformer $\rightarrow$ tokens $\mathbf{T}_{\text{mixture}}$
2. Source Separation: $\mathbf{T}_{\text{mixture}} \rightarrow$ ICA Transformer $\rightarrow \mathbf{T}_A, \mathbf{T}_B$
3. Remaining Encoder: Cross-domain fusion $\rightarrow$ sequence pooling $\rightarrow$ final cross-domain fusion $\rightarrow \mathbf{z}_A, \mathbf{z}_B$
4. Decoder: Reconstruct $\mathbf{x}_A, \mathbf{x}_B$ from separated representations

This placement ensures separation operates on physically\-rich tokens rather than raw convolutional features, improving separation quality\.

#### C.4.2 Mixture Creation and SINR Control

To avoid trivializing source separation to denoising, we create controlled mixtures with Signal-to-Interference-plus-Noise Ratio (SINR) ranging from 20 dB to 0 dB:

$$\mathbf{x}_{\text{mixture}} = \mathbf{x}_A + \alpha \cdot \mathbf{x}_B \tag{185}$$

where $\alpha$ is computed to achieve target SINR:

$$\alpha = \sqrt{\frac{P_A}{P_B \cdot 10^{\text{SINR}/10}}} \tag{186}$$

Source Ambiguity Resolution: Source A is always the dominant source (higher power), while Source B is the interfering source. This removes the permutation ambiguity inherent in blind source separation. We explicitly separate and compute losses on both sources to ensure the task requires true separation rather than simple attenuation.
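As a quick numerical check of Eq. 186, the sketch below mixes two synthetic signals at a target SINR and measures the ratio actually achieved. It assumes that $P_A$ and $P_B$ are mean squared magnitudes and that the signals are complex baseband samples; the function name `mix_at_sinr` and the test signals are illustrative only. As in Eq. 186, no separate noise term is included.

```python
import numpy as np


def mix_at_sinr(x_a, x_b, sinr_db):
    """Return x_a + alpha * x_b with alpha chosen per Eq. 186."""
    p_a = np.mean(np.abs(x_a) ** 2)  # power of the dominant source (assumed definition)
    p_b = np.mean(np.abs(x_b) ** 2)  # power of the interfering source
    alpha = np.sqrt(p_a / (p_b * 10.0 ** (sinr_db / 10.0)))
    return x_a + alpha * x_b, alpha


rng = np.random.default_rng(0)
x_a = rng.standard_normal(4096) + 1j * rng.standard_normal(4096)
x_b = 3.0 * (rng.standard_normal(4096) + 1j * rng.standard_normal(4096))

mixture, alpha = mix_at_sinr(x_a, x_b, sinr_db=10.0)

# Ratio of source A power to scaled source B power, in dB.
achieved_db = 10.0 * np.log10(
    np.mean(np.abs(x_a) ** 2) / np.mean(np.abs(alpha * x_b) ** 2)
)
```

Because $\alpha^2 P_B = P_A / 10^{\text{SINR}/10}$ by construction, the achieved ratio matches the target regardless of the original power of source B. At the 0 dB end of the range, the two sources have equal power in the mixture.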

#### C.4.3 Latent Source Separation Transformer

The Latent Source Separation Transformer combines Parseval Focus with ICA-inspired iterative refinement for source separation.

##### C.4.3.1 Architecture

Step 1: High-Dimensional Projection

$$\mathbf{T}_{\text{proj}} = \text{GELU}(\text{BatchNorm}(\text{Linear}_1(\mathbf{T}_{\text{mixture}}))) \in \mathbb{R}^{B \times N \times d_{\text{proj}}} \tag{187}$$

$$\mathbf{T}_{\text{proj}} = \text{Dropout}_{0.25}(\mathbf{T}_{\text{proj}}) \tag{188}$$

$$\mathbf{T}_{\text{proj}} = \text{GELU}(\text{BatchNorm}(\text{Linear}_2(\mathbf{T}_{\text{proj}}))) \in \mathbb{R}^{B \times N \times d_{\text{proj}}} \tag{189}$$

$$\mathbf{T}_{\text{proj}} = \text{Dropout}_{0.25}(\mathbf{T}_{\text{proj}}) \tag{190}$$

where $d_{\text{proj}} = 4 \cdot d_{\text{input}}$.

Step 2: Parseval Transformer for Joint Time-Frequency Analysis

$$\mathbf{T}_{\text{focused}}, \mathcal{L}_{\text{orth}}, \mathcal{L}_{\text{Parseval}}, \mathcal{L}_{\text{diversity}} = \text{ParsevalTransformer}(\mathbf{T}_{\text{proj}}) \tag{191}$$

Step 3: ICA-Inspired Iterative Refinement

We apply $K=3$ iterations of ICA-inspired processing with orthogonal linear layers:

$$\mathbf{T}^{(k)} = \mathbf{W}_{\text{ICA}}^{(k)} \mathbf{T}^{(k-1)} \quad \text{where } \mathbf{W}_{\text{ICA}}^{(k)} \text{ is orthogonal} \tag{192}$$

$$\mathbf{T}^{(k)} = \mathbf{T}^{(k)} - \mathbb{E}[\mathbf{T}^{(k)}] \quad \text{(centering)} \tag{193}$$

$$\mathbf{T}^{(k)} = \log(\cosh(\mathbf{T}^{(k)})) \quad \text{(log-cosh nonlinearity)} \tag{194}$$

$$\mathbf{T}^{(k)} = \frac{\mathbf{T}^{(k)}}{\|\mathbf{T}^{(k)}\|_2 + \epsilon} \quad \text{(L2 normalization)} \tag{195}$$

where $\mathbf{T}^{(0)} = \mathbf{T}_{\text{focused}}$ and $k \in \{1,2,3\}$.

The log-cosh nonlinearity approximates negentropy (non-Gaussianity), encouraging statistical independence between separated sources. Because log-cosh is an even function, it discards sign information; nevertheless, in practice we observed better training dynamics with it than with sign-preserving alternatives such as tanh or asinh.

We attribute this to its smoothness: its derivative is tanh, which is bounded in [-1, 1] and therefore yields stable gradients. Additionally, the surrounding joint-embedding framework and the subsequent high-dimensional projections give the model a rich surface on which to exploit this smoothness.
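The four operations of one refinement iteration (Eqs. 192 to 195) can be sketched as follows. This is a minimal illustration, not the trained module: the orthogonal matrices are random (drawn via QR) rather than learned, tokens are assumed to have shape $(B, N, d)$, centering is assumed to run over the token axis, and L2 normalization is assumed to be per token over the feature axis. The helper names are hypothetical.

```python
import numpy as np


def random_orthogonal(dim, rng):
    """Random orthogonal matrix from a QR decomposition (stand-in for a learned layer)."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q


def log_cosh(x):
    # Numerically stable log(cosh(x)) = logaddexp(x, -x) - log(2).
    return np.logaddexp(x, -x) - np.log(2.0)


def ica_refine(tokens, weights, eps=1e-8):
    t = tokens
    for w in weights:
        t = t @ w.T                                   # Eq. 192: orthogonal mixing
        t = t - t.mean(axis=1, keepdims=True)         # Eq. 193: centering (token axis assumed)
        t = log_cosh(t)                               # Eq. 194: even nonlinearity
        norm = np.linalg.norm(t, axis=-1, keepdims=True)
        t = t / (norm + eps)                          # Eq. 195: L2 normalization (per token assumed)
    return t


rng = np.random.default_rng(0)
batch, num_tokens, dim = 2, 5, 8
tokens = rng.standard_normal((batch, num_tokens, dim))
weights = [random_orthogonal(dim, rng) for _ in range(3)]  # K = 3 iterations

refined = ica_refine(tokens, weights)
```

Since $\log\cosh(x) \geq 0$ for all real $x$, every entry of `refined` is non-negative, which makes the loss of sign information discussed above directly visible.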

Step 4: Final Unmixing

$$\mathbf{T}_{\text{unmixed}} = \mathbf{W}_{\text{unmix}} \mathbf{T}^{(3)} \in \mathbb{R}^{B \times N \times 2d_{\text{input}}} \tag{196}$$

where $\mathbf{W}_{\text{unmix}}$ is initialized as orthogonal. The output is split into two sources:

$$\mathbf{T}_A = \mathbf{T}_{\text{unmixed}}[:, :, :d_{\text{input}}] \tag{197}$$

$$\mathbf{T}_B = \mathbf{T}_{\text{unmixed}}[:, :, d_{\text{input}}:] \tag{198}$$

Rationale: The iterative ICA refinement with orthogonal constraints encourages learning a rotation in latent space that maximizes statistical independence between sources. The Parseval Transformer provides joint time-frequency context essential for effective separation.

##### C.4.3.2 Downstream Processing

After separation, each source is processed independently through the remaining encoder and decoder:

- Encoder: Parseval Transformer, sequence pooling, cross-domain fusion
- Decoder: Reconstruction targeting the original unmixed sources
- Losses: IsoFICReg, LED, and reconstruction losses computed against clean single-source counterparts

This ensures separated tokens produce representations and reconstructions consistent with the original unmixed signals\.

#### C.4.4 Source Separation Losses

(1) Reconstruction Loss: Each separated source is reconstructed and compared to its original unmixed target:

$$\mathcal{L}_{\text{recon}}^{(A)} = \text{FocalHuber}(\mathbf{x}_{\text{recon}}^{(A)}, \mathbf{x}_{A\text{-NoNoise}}) \tag{199}$$

$$\mathcal{L}_{\text{recon}}^{(B)} = \text{FocalHuber}(\mathbf{x}_{\text{recon}}^{(B)}, \mathbf{x}_{B\text{-NoNoise}}) \tag{200}$$

(2) Skip Sink Complementarity Loss: The decoder's skip sinks (Section [B.4.3](https://arxiv.org/html/2606.04106#A2.SS4.SSS3)) should produce complementary estimates for mixed sources: what is sunk from source A should equal what is kept for source B, and vice versa.

For each decoder layer $i$ with skip connections:

$$\mathcal{L}_{\text{sink\_comp}}^{(i)} = \text{FocalHuber}(\mathbf{x}_{\text{sink}}^{(A,i)}, \mathbf{x}_{\text{post-sink}}^{(B,i)}) \tag{201}$$

$$\quad + \text{FocalHuber}(\mathbf{x}_{\text{sink}}^{(B,i)}, \mathbf{x}_{\text{post-sink}}^{(A,i)}) \tag{202}$$

Applied to both time and frequency domains:

$$\mathcal{L}_{\text{sink\_comp}} = \sum_{i} \left( \mathcal{L}_{\text{sink\_comp}}^{(t,i)} + \mathcal{L}_{\text{sink\_comp}}^{(f,i)} \right) \tag{203}$$

Rationale: This complementarity constraint ensures the skip sinks learn to separate sources rather than arbitrarily attenuate. If the sink correctly removes source B from the mixture (leaving source A), then by definition it has extracted source B. This provides a strong regularization signal for learning effective source-specific filtering.

Note on Scale Consistency: The complementarity constraint is scale-consistent because both decoders process the same mixture $\mathbf{x}_{\text{mixture}} = \mathbf{x}_A + \alpha \cdot \mathbf{x}_B$. When decoder $A$ removes the interfering source ($\mathbf{x}_{\text{sink}}^{(A)} \approx \alpha \cdot \mathbf{x}_B$), this matches what decoder $B$ retains ($\mathbf{x}_{\text{post-sink}}^{(B)} \approx \alpha \cdot \mathbf{x}_B$). Similarly, what decoder $B$ removes ($\mathbf{x}_{\text{sink}}^{(B)} \approx \mathbf{x}_A$) matches what decoder $A$ retains ($\mathbf{x}_{\text{post-sink}}^{(A)} \approx \mathbf{x}_A$). The $\alpha$ scaling factor naturally appears on both sides of each comparison, ensuring scale consistency without explicit normalization.

(3) Latent Space Losses: IsoFICReg and LED losses are computed for each separated source against their clean single-source counterparts, providing implicit latent denoising and ensuring separated representations match the structure of clean signals.

(4) Token-Level Separation Loss: To enforce fine-grained separation at the token level, we apply a regression loss directly to the separated token representations before sequence pooling.

After the Latent Source Separation Transformer produces separated tokens $\mathbf{T}_A, \mathbf{T}_B$ (Section [C.4.3](https://arxiv.org/html/2606.04106#A3.SS4.SSS3)), we compute focal Huber regression against the original unmixed token representations:

$$\mathcal{L}_{\text{token}}^{(t)} = \text{FocalHuber}(\mathbf{T}_A, \mathbf{T}_A^{\text{target}}) + \text{FocalHuber}(\mathbf{T}_B, \mathbf{T}_B^{\text{target}}) \tag{204}$$

$$\mathcal{L}_{\text{token}}^{(f)} = \text{FocalHuber}(\mathbf{T}_A^{(f)}, \mathbf{T}_A^{\text{target},(f)}) + \text{FocalHuber}(\mathbf{T}_B^{(f)}, \mathbf{T}_B^{\text{target},(f)}) \tag{205}$$

where $\mathbf{T}_A^{\text{target}}, \mathbf{T}_B^{\text{target}}$ are the token representations from encoding the original unmixed sources.

Rationale: This token-level loss enforces separation precision at the finest granularity, enabling the model to learn localized separation strategies. For mixtures where sources occupy different time segments (e.g., 50% duty cycle mixing) or different frequency bands (e.g., non-overlapping spectral content), the token-level loss provides dense gradients that guide the model to learn time-localized or frequency-selective filtering. This complements the global reconstruction losses (which operate on full sequences) by ensuring separation quality at every token position, which is particularly valuable for learning bandpass filter-like behavior when sources occupy distinct spectral regions.

#### C.4.5 Limitation: Two-Source Separation

Currently, we limit mixtures to two sources for computational efficiency and training stability. Extension to $N>2$ sources is straightforward but requires additional computational resources and careful handling of permutation ambiguity.

### C.5 Isotropic Focal Invariance Covariance Regularization (IsoFICReg)

IsoFICReg extends VICReg \[[15](https://arxiv.org/html/2606.04106#bib.bib15)\] with five key contributions that enable learning in extreme noise while maintaining fine-grained discrimination: (1) latent coherent integration leveraging weak supervision from dataset structure, (2) focal reweighting to emphasize hard examples, (3) repulsion loss for fine-grained discrimination, (4) isotropic representations via Z-score standardization, and (5) dual-domain consistency with latent denoising.

#### C.5.1 Theoretical Foundation: VICReg

VICReg learns representations through three complementary objectives operating on projection head outputs $\mathbf{P}_a, \mathbf{P}_b \in \mathbb{R}^{B \times d}$:

(1) Invariance: Representations of augmented views should be similar:

$$\mathcal{L}_{\text{inv}} = \mathbb{E}[\|\mathbf{P}_a - \mathbf{P}_b\|^2] \tag{206}$$

(2) Variance: Prevent collapse by maintaining variance along each dimension:

$$\mathcal{L}_{\text{var}} = \mathbb{E}[\text{ReLU}(1 - \sqrt{\text{Var}(\mathbf{P}_a) + \epsilon})] + \mathbb{E}[\text{ReLU}(1 - \sqrt{\text{Var}(\mathbf{P}_b) + \epsilon})] \tag{207}$$

where $\epsilon = 10^{-4}$ for numerical stability.

Note: In our IsoFICReg formulation, we omit the variance regularization term. The Z-score standardization (described in Section [C.5.5](https://arxiv.org/html/2606.04106#A3.SS5.SSS5)) applied before computing invariance and covariance losses inherently prevents dimensional collapse by normalizing each dimension to unit variance, making an explicit variance loss redundant.

(3) Covariance: Decorrelate dimensions to maximize information:

$$\mathcal{L}_{\text{cov}} = \frac{1}{d} \sum_{i \neq j} \text{Cov}(\mathbf{P}_a)_{ij}^2 + \frac{1}{d} \sum_{i \neq j} \text{Cov}(\mathbf{P}_b)_{ij}^2 \tag{208}$$

The total VICReg loss is:

$$\mathcal{L}_{\text{VICReg}} = \lambda_{\text{inv}} \mathcal{L}_{\text{inv}} + \lambda_{\text{var}} \mathcal{L}_{\text{var}} + \lambda_{\text{cov}} \mathcal{L}_{\text{cov}} \tag{209}$$

where $\lambda_{\text{inv}} = 25$, $\lambda_{\text{var}} = 25$, $\lambda_{\text{cov}} = 1$.
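The three VICReg terms in Eqs. 206 to 209 can be written compactly in NumPy. The sketch below follows the equations as printed in this section; it reads Eq. 206 as the squared Euclidean norm per sample averaged over the batch and uses the unbiased (ddof = 1) variance and covariance, both of which are assumptions. The function name `vicreg_loss` is illustrative.

```python
import numpy as np


def vicreg_loss(p_a, p_b, lam_inv=25.0, lam_var=25.0, lam_cov=1.0, eps=1e-4):
    """Return (total, (inv, var, cov)) following Eqs. 206-209 as printed."""
    _, d = p_a.shape

    # Eq. 206: squared distance between paired views, averaged over the batch.
    inv = float(np.mean(np.sum((p_a - p_b) ** 2, axis=1)))

    def var_term(p):
        # Eq. 207: hinge on the per-dimension standard deviation.
        std = np.sqrt(p.var(axis=0, ddof=1) + eps)
        return float(np.mean(np.maximum(0.0, 1.0 - std)))

    def cov_term(p):
        # Eq. 208: squared off-diagonal covariance entries, scaled by 1/d.
        c = np.cov(p, rowvar=False)
        off_diag = c - np.diag(np.diag(c))
        return float(np.sum(off_diag ** 2) / d)

    var = var_term(p_a) + var_term(p_b)
    cov = cov_term(p_a) + cov_term(p_b)
    total = lam_inv * inv + lam_var * var + lam_cov * cov
    return total, (inv, var, cov)


rng = np.random.default_rng(0)
p = rng.standard_normal((32, 8))
_, (inv_same, var_same, cov_same) = vicreg_loss(p, p)

# A fully collapsed batch: every sample maps to the same vector.
collapsed = np.ones((32, 8))
_, (inv_c, var_c, cov_c) = vicreg_loss(collapsed, collapsed)
```

The collapsed batch shows why the variance term matters in plain VICReg: invariance and covariance are both zero there, and only the variance hinge (close to 1 per view) penalizes the collapse.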

##### C.5.1.1 Projection Head Architecture (Expander Networks)

Following the precedent set by the original VICReg and Zhang et al. \[[18](https://arxiv.org/html/2606.04106#bib.bib18)\], we use dedicated projection heads per domain that expand latent representations to a higher-dimensional space. Each is constructed as follows:

$$\text{ProjectionHead}: \mathbb{R}^{d} \rightarrow \mathbb{R}^{d_{\text{proj}}} \tag{210}$$

where $d = 128$ (latent dimension) and $d_{\text{proj}} = 1024$ (projection dimension).

Table 10: Projection Head Architecture (used for IsoFICReg)

| Layer | Input Dim | Output Dim |
|---|---|---|
| Linear1 | $d$ | $d_{\text{proj}}$ |
| BatchNorm1 | $d_{\text{proj}}$ | $d_{\text{proj}}$ |
| GELU | - | - |
| Dropout (0.25) | - | - |
| Linear2 | $d_{\text{proj}}$ | $d_{\text{proj}}$ |
| BatchNorm2 | $d_{\text{proj}}$ | $d_{\text{proj}}$ |
| GELU | - | - |
| Dropout (0.25) | - | - |
| Linear3 (no bias) | $d_{\text{proj}}$ | $d_{\text{proj}}$ |

Purpose: These projection heads serve IsoFICReg (Section [C.5](https://arxiv.org/html/2606.04106#A3.SS5)) and LED. By computing invariance on projections whose equivariant differences have already been enforced, we force the representations to respect both symmetries simultaneously: the embeddings must remain invariant to noise while transforming predictably under frequency/time shifts. This coupling ensures learned representations encode physical transformation rules rather than arbitrary features.

#### C.5.2 Contribution 1: Latent Coherent Integration

Standard VICReg computes invariance loss between a single pair of augmented views per sample. We extend this by leveraging weak supervision from dataset structure: RF fingerprinting datasets contain multiple samples per emitter, each with unique hardware fingerprints but shared emitter identity.

N-Choose-2 Pairwise Invariance: For each emitter class $c$ with $N_c$ samples in the batch, we compute invariance loss for all $\binom{N_c}{2}$ pairs:

$$\mathcal{L}_{\text{inv}}^{(c)} = \frac{1}{\binom{N_c}{2}} \sum_{i<j} \|\mathbf{P}_a^{(i)} - \mathbf{P}_b^{(j)}\|^2 \tag{211}$$

where $i, j$ index samples from emitter $c$ in the augmented and latent-transformed branches respectively.

Analogy to Coherent Integration: This is analogous to coherent integration in classical signal processing \[[16](https://arxiv.org/html/2606.04106#bib.bib16)\]: by enforcing consistency across multiple noisy observations of the same emitter, noise (uncorrelated across samples) becomes inconsistent while signal (correlated emitter characteristics) becomes consistent. This effectively improves SNR in latent space, enabling learning with extreme AWGN injection in negative-SNR regimes.

Distributed Training: We train on 12 H100 GPUs with distributed data parallelism. Before computing invariance loss, we gather representations across all GPUs using `all_gather`, increasing the effective batch size for N-choose-2 pairing from local batch size $B_{\text{local}}$ to global batch size $B_{\text{global}} = 12 \times B_{\text{local}}$. This dramatically increases the number of pairs per emitter, strengthening the coherent integration effect.
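A small worked example makes the growth in pair count concrete. It assumes, purely for illustration, that each of the 12 GPUs holds the same number of samples from a given emitter (4 here, a hypothetical value); real batches will be less uniform.

```python
from math import comb


def pairs_per_emitter(samples_per_gpu, num_gpus=12):
    """Pairs available for one emitter without and with a global gather."""
    # Without gathering, each GPU can only pair its own local samples.
    local_only = num_gpus * comb(samples_per_gpu, 2)
    # After gathering, all samples of the emitter can be paired with each other.
    gathered = comb(num_gpus * samples_per_gpu, 2)
    return local_only, gathered


local_only, gathered = pairs_per_emitter(samples_per_gpu=4)
# local_only is 12 * C(4, 2) = 72, gathered is C(48, 2) = 1128
```

Under these assumptions the number of same-emitter pairs grows roughly quadratically in the number of GPUs, whereas the batch size itself grows only linearly.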

#### C.5.3 Contribution 2: Focal Reweighting

To prevent overfitting to trivial high-SNR samples and emphasize learning from difficult low-SNR samples, we apply focal reweighting to both invariance and covariance losses.

##### C.5.3.1 Focal Invariance Loss

We extend the focal loss concept \[[28](https://arxiv.org/html/2606.04106#bib.bib28)\] from classification to regression:

$$d_{ij} = \|\mathbf{P}_a^{(i)} - \mathbf{P}_b^{(j)}\|^2 \quad \text{(pairwise distance)} \tag{212}$$

$$w_{\text{focal}} = (1 - e^{-d_{ij}})^{\gamma} \quad \text{(focal weight)} \tag{213}$$

$$\mathcal{L}_{\text{inv}}^{\text{focal}} = \frac{1}{\binom{N_c}{2}} \sum_{i<j} w_{\text{focal}} \cdot d_{ij} \tag{214}$$

where $\gamma = 2.0$. The focal weight emphasizes hard pairs (large $d_{ij}$, likely low-SNR) while down-weighting easy pairs (small $d_{ij}$, likely high-SNR).

Rationale: In negative-SNR regimes, most pairs have large distances due to noise. Without focal reweighting, the model would learn to minimize average distance without discriminating signal structure. Focal reweighting ensures the model focuses on the hardest pairs, which contain the most informative gradients for learning robust representations.
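The focal invariance loss over all same-emitter pairs can be sketched as below. Averaging the per-class losses into a single scalar is an assumption made here for illustration (Eq. 214 is stated per class), and the function name `focal_invariance` is hypothetical.

```python
import numpy as np


def focal_invariance(p_a, p_b, labels, gamma=2.0):
    """Focal invariance over all i<j pairs within each emitter class."""
    per_class = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        if idx.size < 2:
            continue  # a class with one sample contributes no pairs
        i, j = np.triu_indices(idx.size, k=1)  # all i < j index pairs
        d = np.sum((p_a[idx[i]] - p_b[idx[j]]) ** 2, axis=1)  # Eq. 212
        w = (1.0 - np.exp(-d)) ** gamma                        # Eq. 213
        per_class.append(np.mean(w * d))                       # Eq. 214
    # Averaging over classes is an assumption of this sketch.
    return float(np.mean(per_class)) if per_class else 0.0


p_a = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 1.0]])
p_b = np.array([[9.0, 9.0], [1.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 0, 1])

# Class 0 has one pair (i=0, j=1) with squared distance 1; class 1 has no pairs.
loss = focal_invariance(p_a, p_b, labels)
```

For a pair at squared distance 1 the focal weight is $(1 - e^{-1})^2 \approx 0.40$, so the pair contributes well under half of its unweighted distance; pairs with very small distances are suppressed even more strongly.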

##### C.5.3.2 Focal Covariance Loss

Similarly, we apply focal reweighting to covariance regularization:

$$\mathbf{C}_a = \frac{1}{B-1} \mathbf{P}_a^{T} \mathbf{P}_a \quad \text{(covariance matrix)} \tag{215}$$

$$c_{ij} = \mathbf{C}_a[i,j]^2 \quad \text{for } i \neq j \text{ (off-diagonal elements)} \tag{216}$$

$$w_{\text{focal}} = (1 - e^{-c_{ij}})^{\gamma} \quad \text{(focal weight)} \tag{217}$$

$$\mathcal{L}_{\text{cov}}^{\text{focal}} = \frac{1}{d(d-1)} \sum_{i<j} w_{\text{focal}} \cdot c_{ij} \tag{218}$$

Rationale: Focal reweighting emphasizes decorrelating dimensions with high covariance (harder to decorrelate) while down-weighting dimensions already well-decorrelated. This prevents the model from over-optimizing easy dimensions at the expense of hard dimensions, encouraging more isotropic representations.
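A minimal NumPy sketch of the focal covariance loss follows. It assumes the projections are already zero-mean per dimension (as they are after Z-score standardization), so that $\mathbf{P}^T\mathbf{P}/(B-1)$ is a covariance matrix, and it applies the focal weight element-wise to each squared off-diagonal entry. The function name is illustrative.

```python
import numpy as np


def focal_covariance(p, gamma=2.0):
    """Focal covariance loss for one view; p must be zero-mean per dimension."""
    b, d = p.shape
    c_mat = (p.T @ p) / (b - 1)          # Eq. 215
    i, j = np.triu_indices(d, k=1)       # off-diagonal entries with i < j
    c = c_mat[i, j] ** 2                 # Eq. 216
    w = (1.0 - np.exp(-c)) ** gamma      # Eq. 217, applied per entry
    return float(np.sum(w * c) / (d * (d - 1)))  # Eq. 218


# Two uncorrelated dimensions: the off-diagonal covariance is exactly zero.
uncorrelated = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
loss_uncorrelated = focal_covariance(uncorrelated)

# Two perfectly correlated dimensions: off-diagonal covariance is 2, squared is 4.
correlated = np.array([[1.0, 1.0], [-1.0, -1.0]])
loss_correlated = focal_covariance(correlated)
```

In the correlated case the single off-diagonal term is $c = 4$ with weight $(1 - e^{-4})^2 \approx 0.96$, so strongly correlated dimensions are penalized at nearly full strength.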

#### C.5.4 Contribution 3: Focal Repulsion Loss for Fine-Grained Discrimination

To encourage fine-grained discrimination between different emitters, we complement the attraction-based invariance loss with a repulsion objective that pushes representations of different classes apart:

$$d_{ij} = \frac{\|\mathbf{P}_{\text{aug}}^{(i)} - \mathbf{P}_{\text{latent}}^{(j)}\|^2}{d_{\text{proj}}} \quad \text{(normalized squared distance)} \tag{219}$$

$$v_{ij} = \text{ReLU}(\text{margin} - d_{ij}) \quad \text{(margin violation)} \tag{220}$$

$$w_{\text{focal}}^{\text{repel}} = (1 - e^{-v_{ij}})^{\gamma} \quad \text{(focal weight on violations)} \tag{221}$$

$$\mathcal{L}_{\text{repel}} = \frac{1}{N_{\text{neg}}} \sum_{i,j \in \mathcal{N}} w_{\text{focal}}^{\text{repel}} \cdot v_{ij} \tag{222}$$

where $\mathcal{N}$ denotes pairs from different emitter classes, $\text{margin} = 1024$, and $\gamma = 2.0$. We apply log1p scaling to the final loss: $\mathcal{L}_{\text{repel}}^{\text{final}} = \log(1 + \mathcal{L}_{\text{repel}})$.

Rationale: The ReLU in Eq. [220](https://arxiv.org/html/2606.04106#A3.E220) focuses computation only on pairs violating the margin (different classes that are too close). Among these violations, focal reweighting (Eq. [221](https://arxiv.org/html/2606.04106#A3.E221)) emphasizes the most severe cases: pairs with large violations $v_{ij}$ (very close negatives) receive weight $\approx 1$, while pairs with small violations (barely inside the margin) receive weight $\approx 0$. This concentrates gradients on the hardest negatives (different-class pairs that are erroneously similar), forcing the model to learn discriminative features that separate even subtle hardware differences.

Note: The large margin value (1024) is intentionally set well above the typical normalized distance range (approximately 0 to 2) to create a constant repulsion force rather than a traditional margin-based dead zone. This design ensures continuous gradient flow that prevents trivial collapse to over-invariance, particularly for emitters from the same family with high redundancy. The focal weighting then modulates this base repulsion to emphasize hard negatives.
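The repulsion loss, including the final log1p scaling, can be sketched as follows. The random projections and labels are synthetic, and the function name `focal_repulsion` is hypothetical; the margin and $\gamma$ are the values stated above.

```python
import numpy as np


def focal_repulsion(p_aug, p_latent, labels, margin=1024.0, gamma=2.0):
    """Focal repulsion over all different-class pairs, with log1p scaling."""
    d_proj = p_aug.shape[1]
    diff = p_aug[:, None, :] - p_latent[None, :, :]
    dist = np.sum(diff ** 2, axis=-1) / d_proj      # Eq. 219
    negatives = labels[:, None] != labels[None, :]  # the pair set N
    if not negatives.any():
        return 0.0
    v = np.maximum(0.0, margin - dist[negatives])   # Eq. 220
    w = (1.0 - np.exp(-v)) ** gamma                 # Eq. 221
    raw = np.mean(w * v)                            # Eq. 222
    return float(np.log1p(raw))                     # final log1p scaling


rng = np.random.default_rng(0)
p_aug = rng.standard_normal((8, 16))
p_latent = rng.standard_normal((8, 16))
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])

loss = focal_repulsion(p_aug, p_latent, labels)
```

With unit-variance embeddings the normalized distances sit near 2, so every violation is close to 1022 and every focal weight is numerically 1. The loss is then roughly $\log(1 + 1022) \approx 6.93$, illustrating the near-constant repulsion described in the note above rather than a margin with a dead zone.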

Distinction from Invariance Focal Weighting: While both use the same functional form $(1 - e^{-x})^{\gamma}$, they operate on different quantities:

- Invariance (Eq. [214](https://arxiv.org/html/2606.04106#A3.E214)): Applied to distances $d_{ij}$ between same-class pairs. Emphasizes large distances (hard positives that should be close but aren't).
- Repulsion (Eq. [221](https://arxiv.org/html/2606.04106#A3.E221)): Applied to margin violations $v_{ij}$ for different-class pairs. Emphasizes large violations (hard negatives that are too close).

Both mechanisms prioritize difficult cases while down-weighting trivial ones, each adapted to its respective objective.

Integration with Invariance: The repulsion loss is added to the invariance loss with logarithmic scaling to prevent domination:

$$\mathcal{L}_{\text{inv}}^{\text{total}} = \mathcal{L}_{\text{inv}} + \log(1 + \mathcal{L}_{\text{repel}}) \tag{223}$$

The logarithmic scaling ensures repulsion provides a regularization signal without overwhelming the primary invariance objective.

#### C.5.5 Contribution 4: Isotropic Representations via Z-Score Standardization

To encourage isotropic (spherical) representations, we apply Z-score standardization before computing all losses:

$$\mathbf{P}_a \leftarrow \frac{\mathbf{P}_a - \mathbb{E}[\mathbf{P}_a]}{\text{std}(\mathbf{P}_a)} \tag{224}$$

$$\mathbf{P}_b \leftarrow \frac{\mathbf{P}_b - \mathbb{E}[\mathbf{P}_b]}{\text{std}(\mathbf{P}_b)} \tag{225}$$

where mean and standard deviation are computed per dimension across the batch.

Unlike the original VICReg formulation, we omit the variance regularization term\. The Z\-score standardization applied before computing invariance and covariance losses inherently prevents dimensional collapse by normalizing each dimension to unit variance, making an explicit variance loss redundant\.
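The redundancy of the variance term after standardization can be checked numerically. In the sketch below, the small `eps` in the denominator and the unbiased (ddof = 1) standard deviation are assumptions added for numerical safety and are not part of Eqs. 224 and 225; the hinge uses $\epsilon = 10^{-4}$ from Eq. 207.

```python
import numpy as np


def zscore(p, eps=1e-8):
    """Per-dimension standardization across the batch (Eqs. 224-225)."""
    mean = p.mean(axis=0, keepdims=True)
    std = p.std(axis=0, ddof=1, keepdims=True)
    return (p - mean) / (std + eps)  # eps is an added safeguard, not in the paper


rng = np.random.default_rng(0)
p = 5.0 + 3.0 * rng.standard_normal((64, 4))  # arbitrary offset and scale
z = zscore(p)

# VICReg variance hinge (Eq. 207) evaluated on the standardized batch.
hinge = float(np.mean(np.maximum(0.0, 1.0 - np.sqrt(z.var(axis=0, ddof=1) + 1e-4))))
```

After standardization every dimension has variance very close to 1, so the hinge in Eq. 207 is inactive and contributes no gradient.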

The complete IsoFICReg objective combines:

$$\mathcal{L}_{\text{IsoFICReg}} = \lambda_{\text{inv}} \mathcal{L}_{\text{inv}}^{\text{focal}} + \lambda_{\text{cov}} \mathcal{L}_{\text{cov}}^{\text{focal}} \tag{226}$$

where $\lambda_{\text{inv}} = 25.0$ and $\lambda_{\text{cov}} = 1.0$.

Rationale: Z-score standardization ensures all dimensions have unit variance and zero mean, creating a natural isotropic prior. Combined with covariance-based decorrelation, this encourages representations to uniformly fill the unit hypersphere, maximizing information content and preventing dimensional collapse.

Distributed Batch Statistics: When training distributed, we gather representations across all GPUs before computing mean and standard deviation, ensuring batch statistics reflect the global batch rather than local mini-batches. This stabilizes training and improves representation quality.

#### C.5.6 Contribution 5: Dual-Domain Consistency and Latent Denoising

Following precedent established by Zhang et al. \[[18](https://arxiv.org/html/2606.04106#bib.bib18)\], we apply IsoFICReg to both time-domain and frequency-domain representations, with a critical design choice: we compute invariance losses between augmented embeddings (from input-space transformations) and latent-transformed embeddings (from latent-space transformations). This provides implicit latent denoising and equivariance enforcement.

Augmented embeddings ($\mathbf{P}_{\text{aug}}$) are derived from the augmented branch where transformations are applied in input space:

$$\mathbf{x}_{\text{aug}} \rightarrow \text{Encoder} \rightarrow \text{Projector} \rightarrow \mathbf{P}_{\text{aug}} \tag{228}$$

Latent-transformed embeddings ($\mathbf{P}_{\text{latent}}$) are derived from the clean branch where transformations are applied in latent space via the Latent Equivariant Transformer (Section [C.3.2](https://arxiv.org/html/2606.04106#A3.SS3.SSS2)):

$$\mathbf{x}_{\text{clean}} \rightarrow \text{Encoder} \rightarrow \text{LatentEquivariantTransformer} \rightarrow \text{Projector} \rightarrow \mathbf{P}_{\text{latent}} \tag{229}$$

The key distinction: augmented embeddings include AWGN (applied in input space), while latent-transformed embeddings do not (equivariant transformations applied in latent space after encoding clean signals).

##### C.5.6.1 In-Domain Losses

For each domain, we compute IsoFICReg between augmented and latent-transformed embeddings:

$$\mathcal{L}_{\text{IsoFICReg}}^{(t)} = \mathcal{L}_{\text{inv}}(\mathbf{P}_{\text{aug}}^{(t)}, \mathbf{P}_{\text{latent}}^{(t)}) + \mathcal{L}_{\text{cov}}(\mathbf{P}_{\text{aug}}^{(t)}, \mathbf{P}_{\text{latent}}^{(t)}) \tag{230}$$

$$\mathcal{L}_{\text{IsoFICReg}}^{(f)} = \mathcal{L}_{\text{inv}}(\mathbf{P}_{\text{aug}}^{(f)}, \mathbf{P}_{\text{latent}}^{(f)}) + \mathcal{L}_{\text{cov}}(\mathbf{P}_{\text{aug}}^{(f)}, \mathbf{P}_{\text{latent}}^{(f)}) \tag{231}$$

Critically, we do not compute augmented-augmented or latent-latent invariance within a domain (time-time, frequency-frequency), but we do compute them across domains. This preserves a cross-domain gradient while preventing noise-versus-noise comparisons within a domain from polluting the gradients through trivial similarity.

##### C\.5\.6\.2Cross\-Domain Consistency

We additionally compute four cross\-domain IsoFICReg losses to enforce Parseval\-like consistency between time and frequency representations:

$$\mathcal{L}_{\text{cross}}^{(1)} = \mathcal{L}_{\text{IsoFICReg}}(\mathbf{P}_{\text{aug}}^{(t)}, \mathbf{P}_{\text{aug}}^{(f)}) \tag{232}$$

$$\mathcal{L}_{\text{cross}}^{(2)} = \mathcal{L}_{\text{IsoFICReg}}(\mathbf{P}_{\text{latent}}^{(t)}, \mathbf{P}_{\text{latent}}^{(f)}) \tag{233}$$

$$\mathcal{L}_{\text{cross}}^{(3)} = \mathcal{L}_{\text{IsoFICReg}}(\mathbf{P}_{\text{aug}}^{(t)}, \mathbf{P}_{\text{latent}}^{(f)}) \tag{234}$$

$$\mathcal{L}_{\text{cross}}^{(4)} = \mathcal{L}_{\text{IsoFICReg}}(\mathbf{P}_{\text{latent}}^{(t)}, \mathbf{P}_{\text{aug}}^{(f)}) \tag{235}$$

Clarification: Cross-domain losses $\mathcal{L}_{\text{cross}}^{(1)}$ and $\mathcal{L}_{\text{cross}}^{(2)}$ pair augmented-augmented and latent-latent embeddings respectively across domains (time-frequency), while $\mathcal{L}_{\text{cross}}^{(3)}$ and $\mathcal{L}_{\text{cross}}^{(4)}$ pair augmented-latent embeddings across domains. This comprehensive cross-domain pairing enforces that time and frequency representations respect Parseval's theorem: energy conservation between domains should hold regardless of whether transformations are applied in input space (augmented) or latent space (latent-transformed).

The total IsoFICReg loss is:

$$\mathcal{L}_{\text{IsoFICReg}} = \mathcal{L}_{\text{IsoFICReg}}^{(t)} + \mathcal{L}_{\text{IsoFICReg}}^{(f)} + \sum_{i=1}^{4} \mathcal{L}_{\text{cross}}^{(i)} \tag{236}$$
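The pairing structure behind Eq. (236) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `isoficreg_pairs`, `total_isoficreg`, and the `pair_loss` callable are hypothetical names, and a single placeholder `pair_loss` stands in for both the within-domain terms (invariance plus covariance) and the cross-domain IsoFICReg terms.

```python
from itertools import product

def isoficreg_pairs():
    """Enumerate the six (view, domain) pairs summed in the total loss.

    Two within-domain pairs are augmented-vs-latent only; the four
    cross-domain pairs cover every view combination between time and frequency.
    """
    within = [(("aug", d), ("latent", d)) for d in ("t", "f")]
    cross = [((a, "t"), (b, "f")) for a, b in product(("aug", "latent"), repeat=2)]
    return within + cross

def total_isoficreg(P, pair_loss):
    """Sum a placeholder pair loss over all six pairs.

    P maps (view, domain) -> embedding; pair_loss is an assumed stand-in
    for the per-pair objective.
    """
    return sum(pair_loss(P[a], P[b]) for a, b in isoficreg_pairs())
```

Note that no pair shares both view and domain, which mirrors the statement that augmented-augmented and latent-latent comparisons are never made within a single domain.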

##### C\.5\.6\.3Latent Denoising via Clean Targets

Augmented embeddings are derived from noisy input-space transformations ($\mathbf{x} + \text{AWGN} + \text{equivariant transforms}$), while latent-transformed embeddings are derived from clean latent-space transformations ($\mathbf{x} + \text{equivariant transforms only}$). By enforcing invariance between augmented and latent-transformed embeddings, we provide clean targets for the noisy augmented embeddings. The model learns: "noisy input-space transformed representation should match clean latent-space transformed representation."

This is particularly powerful in negative\-SNR regimes: even when noise dominates the input signal, the latent\-transformed embeddings \(from clean signals\) provide a denoising target\. The invariance loss encourages the encoder to extract signal structure from noisy inputs that matches the structure extracted from clean inputs\.

##### C\.5\.6\.4Implicit Equivariance Enforcement

Both augmented and latent\-transformed embeddings undergo identical equivariant transformations \(frequency shift, time shift, phase rotation, IQ flip\)—the only difference is the presence of AWGN in augmented embeddings\. By minimizing the distance between augmented and latent\-transformed embeddings, we encourage the model to learn representations where equivariant transformations produce consistent effects regardless of noise level\.

For example, if a frequency shift by $\Delta f$ is applied:

- Augmented: $\mathbf{x} + \text{AWGN} \rightarrow \text{shift by } \Delta f \rightarrow \text{Encoder} \rightarrow \text{Projector} \rightarrow \mathbf{P}_{\text{aug}}$
- Latent-transformed: $\mathbf{x} \rightarrow \text{Encoder} \rightarrow \text{LatentEquivariantTransformer (shift by } \Delta f) \rightarrow \text{Projector} \rightarrow \mathbf{P}_{\text{latent}}$

The invariance loss $\mathcal{L}_{\text{inv}}(\mathbf{P}_{\text{aug}}, \mathbf{P}_{\text{latent}})$ encourages these to be similar, implicitly enforcing that the frequency shift produces consistent latent transformations whether applied in input space (with noise) or latent space (without noise).

##### C\.5\.6\.5Critical Interaction with LED

Without LED's explicit equivariance regression (Section [C.3](https://arxiv.org/html/2606.04106#A3.SS3)), the N-choose-2 pairwise invariance losses would drive the model toward invariance to all transformations, including frequency/time shifts, phase rotations, etc. The augmented-latent pairing would collapse to: "all transformed versions of the same signal should be identical, regardless of transformation type."

LED prevents this collapse by explicitly enforcing that equivariant transformations produce predictable, learnable changes in the embeddings. The difference projections $\Delta_{\text{aug}} = \mathbf{z}_{\text{aug}} - \mathbf{z}_{\text{clean}}$ and $\Delta_{\text{latent}} = \mathbf{z}_{\text{latent}} - \mathbf{z}_{\text{clean}}$ must regress to the same transformation parameters (Section [C.3.4](https://arxiv.org/html/2606.04106#A3.SS3.SSS4)), forcing the model to learn that the targeted symmetry-focused equivariant transformations and augmentations produce specific, predictable changes rather than being treated as purely invariant.

The combination creates a balanced objective:

- IsoFICReg (augmented-latent invariance): Encourages similarity across noise levels and domains, providing denoising and cross-domain consistency
- LED (difference regression): Enforces predictable differences under equivariant transformations, preventing collapse to full invariance

This dual mechanism—latent denoising via clean latent\-transformed targets and implicit equivariance enforcement via augmented\-latent pairing—enables learning in negative\-SNR regimes while maintaining fine\-grained discrimination\. The augmented\-latent pairing provides a curriculum: the model learns to extract signal structure from noisy inputs \(augmented\) that matches the structure from clean inputs \(latent\-transformed\), while LED ensures the extracted structure respects equivariant transformation rules\.

### C\.6Reconstruction Losses

Reconstruction losses provide dense, instance\-specific gradients throughout the encoder and decoder, complementing the global structure learned by IsoFICReg and LED\. We apply reconstruction in both time and frequency domains with phase\-aware variants to ensure high\-fidelity signal recovery\.

#### C\.6\.1Dual\-Domain Reconstruction

The decoder reconstructs signals in the time domain, and we additionally compute reconstruction loss in the frequency domain by transforming both reconstruction and target via FFT\.

Time\-Domain Reconstruction:

$$\mathcal{L}_{\text{recon}}^{(t)} = \text{FocalHuber}(\mathbf{x}_{\text{recon}}, \mathbf{x}_{\text{target}}) \tag{237}$$

where $\mathbf{x}_{\text{recon}}$ is the decoder output and $\mathbf{x}_{\text{target}}$ is the clean target signal (before AWGN injection).

Frequency-Domain Reconstruction:

$$\mathbf{X}_{\text{recon}} = \text{FFT}(\mathbf{x}_{\text{recon}}) \tag{238}$$

$$\mathbf{X}_{\text{target}} = \text{FFT}(\mathbf{x}_{\text{target}}) \tag{239}$$

$$\mathcal{L}_{\text{recon}}^{(f)} = \text{FocalHuber}(\mathbf{X}_{\text{recon}}, \mathbf{X}_{\text{target}}) \tag{240}$$

Rationale: Dual-domain reconstruction ensures the model learns to preserve both temporal dynamics (time domain) and spectral structure (frequency domain). Errors in one domain may not be apparent in the other: for example, phase errors are more visible in the time domain, while spectral envelope distortions are more visible in the frequency domain. By enforcing reconstruction fidelity in both domains, we ensure comprehensive signal recovery.
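As a rough illustration of Eqs. (237) to (240), the sketch below computes a time-domain and a frequency-domain loss for the same reconstruction. It rests on several assumptions: a plain Huber loss stands in for the focal Huber regression defined in Section C.3.4, the loss is applied to the magnitude of each (possibly complex) error, and a naive DFT replaces a fast FFT to keep the example dependency-free. The function names are hypothetical.

```python
import cmath

def huber(err, delta=1.0):
    """Plain Huber loss on the magnitude of a (possibly complex) error."""
    a = abs(err)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
            for m in range(n)]

def mean_huber(a, b, delta=1.0):
    """Average Huber loss between two equal-length sequences."""
    return sum(huber(p - q, delta) for p, q in zip(a, b)) / len(a)

def dual_domain_recon_loss(x_recon, x_target, delta=1.0):
    """Return (time-domain loss, frequency-domain loss) for one signal."""
    loss_t = mean_huber(x_recon, x_target, delta)
    loss_f = mean_huber(dft(x_recon), dft(x_target), delta)
    return loss_t, loss_f
```

A constant offset of 1 over four samples gives a modest per-sample time-domain error, while the same error concentrates in the DC bin in the frequency domain, which is one way the two views weight the same mistake differently.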

#### C\.6\.2Phase\-Aware Reconstruction

To encourage learning of phase relationships beyond magnitude reconstruction, we add explicit phase losses using circular\-aware regression \(Section[C\.3\.4](https://arxiv.org/html/2606.04106#A3.SS3.SSS4)\)\.

Time\-Domain Phase Loss:

$$\phi_{\text{recon}} = \angle(\mathbf{x}_{\text{recon}}) \quad \text{(extract phase)} \tag{241}$$

$$\phi_{\text{target}} = \angle(\mathbf{x}_{\text{target}}) \tag{242}$$

$$\mathcal{L}_{\text{phase}}^{(t)} = \text{PhaseFocalHuber}(\phi_{\text{recon}}, \phi_{\text{target}}) \tag{243}$$

Frequency-Domain Phase Loss:

$$\Phi_{\text{recon}} = \angle(\mathbf{X}_{\text{recon}}) \tag{244}$$

$$\Phi_{\text{target}} = \angle(\mathbf{X}_{\text{target}}) \tag{245}$$

$$\mathcal{L}_{\text{phase}}^{(f)} = \text{PhaseFocalHuber}(\Phi_{\text{recon}}, \Phi_{\text{target}}) \tag{246}$$

The PhaseFocalHuber loss wraps phase errors to $[-\pi, \pi]$ before applying focal Huber regression:

$$\Delta_{\phi} = (\phi_{\text{target}} - \phi_{\text{recon}} + \pi) \bmod 2\pi - \pi \tag{247}$$

$$\mathcal{L}_{\text{phase}} = \text{FocalHuber}(\Delta_{\phi}, 0) \tag{248}$$

Rationale: Phase information is critical for signal reconstruction but often neglected by magnitude-only losses. Explicit phase losses ensure the model learns to preserve phase relationships, particularly important for coherent signals (communications, audio) where phase distortions cause audible artifacts or demodulation failures. The circular-aware formulation handles phase wrapping at $\pm\pi$.
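The wrapping step in Eq. (247) can be checked with a few lines of Python. This is a minimal sketch: `wrap_phase_error` and `phase_errors` are hypothetical helper names, and the focal Huber regression that follows the wrap is omitted.

```python
import cmath
import math

def wrap_phase_error(phi_target, phi_recon):
    """Wrap a phase difference into [-pi, pi) as in Eq. (247)."""
    return (phi_target - phi_recon + math.pi) % (2 * math.pi) - math.pi

def phase_errors(x_recon, x_target):
    """Per-sample wrapped phase error between two complex sequences."""
    return [wrap_phase_error(cmath.phase(t), cmath.phase(r))
            for r, t in zip(x_recon, x_target)]
```

For example, a target phase of 3.0 rad and a reconstructed phase of -3.0 rad differ by 6.0 rad numerically, but the wrapped error is 6.0 - 2π, roughly -0.28 rad, which reflects that the two angles sit close together on the circle.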

#### C\.6\.3Total Reconstruction Loss

The complete reconstruction loss combines all components:

$$\mathcal{L}_{\text{recon}} = \mathcal{L}_{\text{recon}}^{(t)} + \mathcal{L}_{\text{recon}}^{(f)} + \mathcal{L}_{\text{phase}}^{(t)} + \mathcal{L}_{\text{phase}}^{(f)} \tag{249}$$

where $\lambda_{\text{percept}}$ controls the perceptual loss weight.

Application: Reconstruction losses are applied to:

- Single-source reconstruction: Clean signals after encoding and decoding
- Source-separated reconstruction: Each separated source reconstructed to its original unmixed target
- Denoised reconstruction: Noisy inputs reconstructed to clean targets (implicit denoising)

All reconstruction losses use focal Huber regression \(Section[C\.3\.4](https://arxiv.org/html/2606.04106#A3.SS3.SSS4)\) to emphasize hard examples \(low\-SNR, high\-error regions\) while maintaining robustness to outliers\.

### C\.7Perceptual Loss with EMA Encoder

To encourage reconstructions that preserve high\-level signal structure, we include a perceptual loss using an Exponential Moving Average \(EMA\) copy of the encoder\.

#### C\.7\.1EMA Encoder

The EMA encoder $f_{\text{EMA}}$ is updated as a moving average of the training encoder $f_{\theta}$:

$$\theta_{\text{EMA}} \leftarrow \alpha \cdot \theta_{\text{EMA}} + (1 - \alpha) \cdot \theta \tag{250}$$

where $\alpha = 0.999$ (momentum). The EMA encoder is not trained via backpropagation but provides stable feature targets for perceptual loss.
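The update in Eq. (250) is a per-parameter linear blend. The sketch below applies it to a flat list of floats; the function name `ema_update` and the flat-list parameter representation are assumptions made for illustration, not the paper's code.

```python
def ema_update(theta_ema, theta, alpha=0.999):
    """One EMA step: blend current EMA weights with the training weights."""
    return [alpha * e + (1.0 - alpha) * p for e, p in zip(theta_ema, theta)]
```

With $\alpha = 0.999$, an EMA weight starting at 0 and tracking a constant training weight of 1 reaches $1 - 0.999^{n}$ after $n$ steps, so it takes on the order of a thousand updates to close most of the gap. This slow drift is what makes the EMA features a stable target.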

#### C\.7\.2Perceptual Loss Computation

For a reconstructed signal $\mathbf{x}_{\text{recon}}$ and its clean target $\mathbf{x}_{\text{target}}$, we extract features from both using the EMA encoder:

$$\mathbf{z}_{\text{recon}}^{(t)}, \mathbf{z}_{\text{recon}}^{(f)} = f_{\text{EMA}}(\mathbf{x}_{\text{recon}}) \tag{251}$$

$$\mathbf{z}_{\text{target}}^{(t)}, \mathbf{z}_{\text{target}}^{(f)} = f_{\text{EMA}}(\mathbf{x}_{\text{target}}) \tag{252}$$

The perceptual loss encourages feature-space similarity:

$$\mathcal{L}_{\text{percept}} = \text{FocalHuber}([\mathbf{z}_{\text{recon}}^{(t)}; \mathbf{z}_{\text{recon}}^{(f)}], [\mathbf{z}_{\text{target}}^{(t)}; \mathbf{z}_{\text{target}}^{(f)}]) \tag{253}$$

Rationale: Perceptual loss encourages reconstructions that preserve high-level signal structure (e.g., spectral envelope, temporal dynamics) rather than just minimizing point-wise error. The EMA encoder provides stable feature targets that evolve slowly as the encoder improves, preventing instability from rapidly changing feature spaces. Furthermore, this can be viewed as yet another form of consistency regularization and implicit latent denoising.

Application: Perceptual loss is applied to all reconstruction tasks: single-source denoising reconstructions and source-separated reconstructions.

### C\.8Complete Loss Formulation

We combine all training objectives into a unified loss function\. Critically, we do not apply manual weighting between loss components—each loss operates at its natural scale\. The only scaling applied is a factor of 100 to all reconstruction losses to balance their contribution relative to latent\-space losses\.

#### C\.8\.1Loss Components

The complete training loss consists of six primary components:

\(1\) Isotropic Focal Invariance Covariance Regularization \(IsoFICReg\):

$$\mathcal{L}_{\text{IsoFICReg}} = \mathcal{L}_{\text{IsoFICReg}}^{(t,t)} + \mathcal{L}_{\text{IsoFICReg}}^{(f,f)} + \mathcal{L}_{\text{IsoFICReg}}^{(t,f)} + \mathcal{L}_{\text{IsoFICReg}}^{(f,t)} \tag{255}$$

where each domain loss includes invariance ($\lambda_{\text{inv}} = 25$), covariance ($\lambda_{\text{cov}} = 1$), and repulsion (Section [C.5](https://arxiv.org/html/2606.04106#A3.SS5)).

\(2\) Latent Equivariant Differences \(LED\):

$$\mathcal{L}_{\text{LED}} = \mathcal{L}_{\text{latent}} + \mathcal{L}_{\text{input}} + \mathcal{L}_{\text{consistency}} \tag{256}$$

where each term uses focal Huber regression on transformation parameter predictions (Section [C.3](https://arxiv.org/html/2606.04106#A3.SS3)).

\(3\) Reconstruction Losses:

$$\mathcal{L}_{\text{recon}} = \mathcal{L}_{\text{recon}}^{(t)} + \mathcal{L}_{\text{recon}}^{(f)} + \mathcal{L}_{\text{phase}}^{(t)} + \mathcal{L}_{\text{phase}}^{(f)} \tag{257}$$

Applied per source with 100× scaling (Section [C.6](https://arxiv.org/html/2606.04106#A3.SS6)).

\(4\) Source Separation Losses:

$$\mathcal{L}_{\text{sep}} = \mathcal{L}_{\text{recon}}^{(A)} + \mathcal{L}_{\text{recon}}^{(B)} + \mathcal{L}_{\text{sink\_comp}} + \mathcal{L}_{\text{token}} \tag{258}$$

where $\mathcal{L}_{\text{recon}}^{(A)}, \mathcal{L}_{\text{recon}}^{(B)}$ are per-source reconstruction losses (each scaled by 100×) and $\mathcal{L}_{\text{token}}$ is the token-level separation loss (Section [C.4](https://arxiv.org/html/2606.04106#A3.SS4)).

\(5\) Perception Losses:

$$\mathcal{L}_{\text{percept}} = \mathcal{L}_{\text{percept}}^{\text{input}} \tag{259}$$

\(6\) Regularization Losses:

$$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{orth}} + \mathcal{L}_{\text{Parseval}} + \mathcal{L}_{\text{diversity}} + \mathcal{L}_{\text{noise\_decorr}} \tag{260}$$

$$\quad + \mathcal{L}_{\text{power\_match}} + \mathcal{L}_{\text{SNR\_reg}} + \mathcal{L}_{\text{skip\_decorr}} \tag{261}$$

where:

- $\mathcal{L}_{\text{orth}}$: Head orthogonalization across all multi-head focus mechanisms (Section [A.6.2](https://arxiv.org/html/2606.04106#A1.SS6.SSS2))
- $\mathcal{L}_{\text{Parseval}}$: JSD-based cross-domain consistency in Parseval Focus (Section [A.7](https://arxiv.org/html/2606.04106#A1.SS7))
- $\mathcal{L}_{\text{diversity}}$: In-domain vs. cross-domain focus diversity (Section [A.7](https://arxiv.org/html/2606.04106#A1.SS7))
- $\mathcal{L}_{\text{noise\_decorr}}$: Pearson decorrelation between noise sinks and cleaned signals (Section [A.4](https://arxiv.org/html/2606.04106#A1.SS4))
- $\mathcal{L}_{\text{power\_match}}$: Total sunk noise power matches injected noise power (Section [A.4](https://arxiv.org/html/2606.04106#A1.SS4))
- $\mathcal{L}_{\text{SNR\_reg}}$: Reconstructed SNR matches target SNR (Section [A.4](https://arxiv.org/html/2606.04106#A1.SS4))
- $\mathcal{L}_{\text{skip\_decorr}}$: Skip sink decorrelation and target alignment (Section [B.4.3](https://arxiv.org/html/2606.04106#A2.SS4.SSS3))

#### C\.8\.2Total Loss: Single\-Source Training

For single\-source training \(denoising\), the total loss is:

$$\mathcal{L}_{\text{total}}^{\text{single}} = \mathcal{L}_{\text{IsoFICReg}} + \mathcal{L}_{\text{LED}} + 100 \cdot \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{percept}} + \mathcal{L}_{\text{reg}} \tag{262}$$

The reconstruction loss is computed once per clean source and scaled by 100×.

#### C\.8\.3Total Loss: Source Separation Training

For source separation training, the total loss is:

$$\mathcal{L}_{\text{total}}^{\text{sep}} = \mathcal{L}_{\text{IsoFICReg}} + \mathcal{L}_{\text{LED}} + 100 \cdot (\mathcal{L}_{\text{recon}}^{(A)} + \mathcal{L}_{\text{recon}}^{(B)}) + \mathcal{L}_{\text{sink\_comp}} + \mathcal{L}_{\text{token}} + \mathcal{L}_{\text{percept}}^{(A)} + \mathcal{L}_{\text{percept}}^{(B)} + \mathcal{L}_{\text{reg}} \tag{263}$$

where:

- $\mathcal{L}_{\text{IsoFICReg}}$: Applied to separated source representations against their clean single-source counterparts
- $\mathcal{L}_{\text{LED}}$: Applied to separated source representations against their clean single-source counterparts
- $100 \cdot (\mathcal{L}_{\text{recon}}^{(A)} + \mathcal{L}_{\text{recon}}^{(B)})$: Per-source reconstruction losses, each scaled by 100×
- $(\mathcal{L}_{\text{percept}}^{(A)} + \mathcal{L}_{\text{percept}}^{(B)})$: Per-source perception losses
- $\mathcal{L}_{\text{sink\_comp}}$: Skip sink complementarity loss (Section [B.4.3](https://arxiv.org/html/2606.04106#A2.SS4.SSS3))
- $\mathcal{L}_{\text{token}}$: Token-level separation loss (Section [C.4](https://arxiv.org/html/2606.04106#A3.SS4))
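Eqs. (262) and (263) amount to unweighted sums with a single 100× factor on reconstruction. A minimal sketch, assuming each component has already been reduced to a scalar and using hypothetical function and argument names:

```python
RECON_SCALE = 100.0  # the only explicit scaling factor between components

def total_single_source_loss(l_isoficreg, l_led, l_recon, l_percept, l_reg):
    """Single-source (denoising) objective, following Eq. (262)."""
    return l_isoficreg + l_led + RECON_SCALE * l_recon + l_percept + l_reg

def total_separation_loss(l_isoficreg, l_led, l_recon_a, l_recon_b,
                          l_sink_comp, l_token, l_percept_a, l_percept_b, l_reg):
    """Two-source separation objective, following Eq. (263)."""
    return (l_isoficreg + l_led
            + RECON_SCALE * (l_recon_a + l_recon_b)
            + l_sink_comp + l_token
            + l_percept_a + l_percept_b
            + l_reg)
```

For instance, a reconstruction loss of 0.01 contributes 1.0 to the total after scaling, which puts it on a similar footing to the other unit-weighted terms.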

#### C\.8\.4Loss Scaling Rationale

Why 100× for Reconstruction? Latent-space losses (IsoFICReg) operate on high-dimensional projections ($d_{\text{proj}} = 1024$) with Z-score standardization, producing loss magnitudes on the order of $10^{1}$ to $10^{3}$. Reconstruction losses operate on raw signal space with focal Huber regression, producing loss magnitudes on the order of $10^{-3}$ to $10^{-1}$. The 100× scaling brings reconstruction losses to a more comparable magnitude (roughly $10^{-1}$ to $10^{1}$), ensuring balanced gradient contributions without manual tuning.

Why No Other Weighting? Each loss component is designed with internal scaling mechanisms:

- IsoFICReg: Built-in coefficients ($\lambda_{\text{inv}} = 25$, $\lambda_{\text{var}} = 25$, $\lambda_{\text{cov}} = 1$) from VICReg [[15](https://arxiv.org/html/2606.04106#bib.bib15)]
- LED: Focal Huber regression with adaptive $\gamma$ based on batch difficulty (Section [C.3.4](https://arxiv.org/html/2606.04106#A3.SS3.SSS4))
- Regularization: Each term operates at natural scale (correlations, power ratios, SNR values)

This design avoids the fragility of manual loss weighting, where small changes can destabilize training\. By operating at natural scales with a single reconstruction scaling factor, the loss landscape remains stable across different batch compositions and training stages\.

#### C\.8\.5Training Procedure

Training alternates between single\-source and source separation batches:

Single-Source Batches: Clean signals with AWGN injection for denoising. Compute $\mathcal{L}_{\text{total}}^{\text{single}}$.

Source Separation Batches: Mixtures of two signals at controlled SINR (20 to 0 dB). Compute $\mathcal{L}_{\text{total}}^{\text{sep}}$.

Both batch types contribute to encoder learning \(IsoFICReg, LED\), but source separation batches additionally train the source separation transformer and skip sinks for mixture handling\.

#### C\.8\.6Summary Table

Table [11](https://arxiv.org/html/2606.04106#A3.T11) summarizes all loss components and their scaling.

Table 11: Complete Loss Formulation Summary

| Loss Component | Scaling | Applied To | Section |
|---|---|---|---|
| Latent-Space Losses | | | |
| IsoFICReg (time) | $\lambda_{\text{inv}} = 25, \lambda_{\text{var}} = 25, \lambda_{\text{cov}} = 1$ | All batches | [C.5](https://arxiv.org/html/2606.04106#A3.SS5) |
| IsoFICReg (freq) | $\lambda_{\text{inv}} = 25, \lambda_{\text{var}} = 25, \lambda_{\text{cov}} = 1$ | All batches | [C.5](https://arxiv.org/html/2606.04106#A3.SS5) |
| IsoFICReg (cross) | 1× | All batches | [C.5](https://arxiv.org/html/2606.04106#A3.SS5) |
| Repulsion | $\log(1 + \cdot)$ | All batches | [C.5](https://arxiv.org/html/2606.04106#A3.SS5) |
| LED | 1× | All batches | [C.3](https://arxiv.org/html/2606.04106#A3.SS3) |
| Reconstruction Losses | | | |
| Time-domain recon | 100× | Per source | [C.6](https://arxiv.org/html/2606.04106#A3.SS6) |
| Freq-domain recon | 100× | Per source | [C.6](https://arxiv.org/html/2606.04106#A3.SS6) |
| Phase (time) | 100× | Per source | [C.6](https://arxiv.org/html/2606.04106#A3.SS6) |
| Phase (freq) | 100× | Per source | [C.6](https://arxiv.org/html/2606.04106#A3.SS6) |
| Perception Losses | | | |
| Perceptual | 1× | Per source | [C.6](https://arxiv.org/html/2606.04106#A3.SS6) |
| Source Separation Losses | | | |
| Skip sink complementarity | 1× | Sep batches | [B.4.3](https://arxiv.org/html/2606.04106#A2.SS4.SSS3) |
| Token-level separation | 1× | Sep batches | [C.4](https://arxiv.org/html/2606.04106#A3.SS4) |
| Regularization Losses | | | |
| Head orthogonalization | 1× | All batches | [A.6.2](https://arxiv.org/html/2606.04106#A1.SS6.SSS2) |
| Parseval consistency | 1× | All batches | [A.7](https://arxiv.org/html/2606.04106#A1.SS7) |
| Focus diversity | 1× | All batches | [A.7](https://arxiv.org/html/2606.04106#A1.SS7) |
| Noise decorrelation | 1× | All batches | [A.4](https://arxiv.org/html/2606.04106#A1.SS4) |
| Power matching | 1× | All batches | [A.4](https://arxiv.org/html/2606.04106#A1.SS4) |
| SNR regression | 1× | All batches | [A.4](https://arxiv.org/html/2606.04106#A1.SS4) |
| Skip decorrelation | 1× | All batches | [B.4.3](https://arxiv.org/html/2606.04106#A2.SS4.SSS3) |

## Appendix D: Training Details

### D\.1Model Parameter Counts

Table [12](https://arxiv.org/html/2606.04106#A4.T12) summarizes the parameter counts for all network components in the PlanFormer system.

Table 12: PlanFormer System Parameter Counts

| Component | Total Params | Trainable | Non-Trainable |
|---|---|---|---|
| PlanFormer Encoder | 1,990,478 | 1,978,190 | 12,288 |
| Training-Only Components: | | | |
| Projection Head/Expander | 2,234,368 | 2,234,368 | 0 |
| Latent Equivariant Transformer | 480,771 | 480,771 | 0 |
| Equivariant Prediction Head | 10,245 | 10,245 | 0 |
| Optional Inference Component: | | | |
| Latent Source Separation Transformer | 8,603,907 | 8,603,907 | 0 |
| PlanFormer Decoder | 95,163,122 | 95,152,882 | 10,240 |
| Inference (Discriminative Tasks) | 1,990,478 | 1,978,190 | 12,288 |
| Inference (w/ Source Separation) | 10,594,385 | 10,582,097 | 12,288 |
| Total (Full Training System) | 108,482,891 | 108,462,363 | 22,528 |

Key Observations:

(1) Encoder Efficiency: The encoder contains only 1.99M parameters, achieving the cross-modal transfer results reported in the main paper. This demonstrates that principled architectural design enables efficient cross-modal transfer with compact models, offering a complementary approach to large-scale architectures.

(2) Decoder Parameter Distribution: The decoder contains 95.2M parameters, with the majority allocated to hypernetwork-generated pointwise modulation parameters (Dynamic FiLM, Skip Sinks, Sliding Window Activation). Critically, these hypernetwork-generated parameters are not stored as weight matrices but computed dynamically during forward passes, resulting in lower computational burden than equivalent matrix multiplication operations. The decoder serves primarily as a training-time gradient substrate (Section [B.1](https://arxiv.org/html/2606.04106#A2.SS1)) and can be discarded for inference on discriminative tasks.

(3) Auxiliary Networks: The equivariance learning components (Latent Equivariant Transformer, Equivariant Prediction Head) contain only 491K parameters combined, demonstrating that explicit symmetry learning does not require large capacity. The Latent Source Separation Transformer (8.6M parameters) operates on compressed token representations, enabling efficient token-level separation before reconstruction.

(4) Deployment Flexibility: For classification/retrieval tasks, only the encoder (1.99M parameters) is required at inference. For latent-space source separation, the encoder plus Latent Source Separation Transformer (10.6M parameters combined) are needed. The Projection Head, Equivariant Transformer, and Equivariant Prediction Head are training-only components supporting loss computation. For reconstruction tasks, the full encoder-decoder system (108.5M parameters) is used, with the understanding that decoder parameters primarily support training dynamics rather than representing stored knowledge.

### D\.2Hyperparameters

We use a simple, fixed hyperparameter configuration throughout training, avoiding manual tuning by designing loss functions and architectures with internal adaptation mechanisms\.

Optimizer: Adam [[54](https://arxiv.org/html/2606.04106#bib.bib54)] with learning rate $\eta = 10^{-4}$

Batch Size: 128 per GPU (1536 global batch size across 12 GPUs)

Training Schedule: No learning rate warmup or decay. The learning rate remains constant at $10^{-4}$ throughout training.

Rationale: By embedding adaptation mechanisms directly into the loss functions (focal reweighting, adaptive $\gamma$ in focal Huber, dynamic $K_{\text{focus}}$ in attention), we avoid the fragility of manual hyperparameter schedules. The model self-adapts to batch difficulty, eliminating the need for learning rate annealing or warmup.

### D\.3Cosine Annealing SNR Curriculum

While we avoid manual hyperparameter schedules, we employ a cosine annealing curriculum for AWGN augmentation to progressively challenge the model\.

SNR Range: Target SNR $\in [\text{SNR}_{\text{lower}}(t), 100]$ dB, where $\text{SNR}_{\text{lower}}(t)$ varies according to a cosine schedule and the upper bound of 100 dB serves as a proxy for native (clean) SNR.

Cosine Annealing Schedule:

$$\text{SNR}_{\text{lower}}(t) = \text{SNR}_{\min} + \frac{1}{2}(\text{SNR}_{\max} - \text{SNR}_{\min})\left(1 + \cos\left(\frac{t}{T_{\text{period}}}\pi\right)\right) \tag{264}$$

where:

- $\text{SNR}_{\min} = -10$ dB (most challenging)
- $\text{SNR}_{\max} = 10$ dB (least challenging)
- $t$ is the current milestone (288 batch forward passes)
- $T_{\text{period}} = \frac{T_{\text{total}}}{80}$ (80 cycles over full training)

Rationale: The cosine schedule alternates between easier (high SNR floor) and harder (low SNR floor) training regimes. Early in each cycle, the model learns features present at moderate-to-high SNR. As the cycle progresses and SNR floor decreases, the model is challenged to maintain those features in increasingly noisy conditions. This curriculum provides:

1. Progressive difficulty: Prevents overwhelming the model with negative-SNR samples before it has learned basic signal structure
2. Regularization: Cycling through difficulty levels acts as a regularizer, preventing overfitting to any single SNR regime
3. Feature persistence: Forces learning of features that persist across SNR degradation rather than SNR-specific shortcuts
4. Whitening effect: AWGN injection provides a whitening effect on the input distribution, improving conditioning of the optimization landscape

The upper bound of 100 dB ensures the model always sees clean samples, learning features naturally present at native SNR while simultaneously learning robustness to noise degradation\.
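The schedule in Eq. (264) is straightforward to evaluate directly. In the sketch below, `snr_lower_db` follows the formula as written, while `sample_target_snr_db` assumes the target SNR is drawn uniformly (in dB) between the floor and the 100 dB ceiling; the sampling distribution is not specified in this section, so that choice is purely illustrative, as are the function names.

```python
import math
import random

SNR_MIN_DB = -10.0    # most challenging floor
SNR_MAX_DB = 10.0     # least challenging floor
SNR_UPPER_DB = 100.0  # proxy for native (clean) SNR

def snr_lower_db(t, t_period):
    """Cosine-annealed SNR floor at milestone t, per Eq. (264)."""
    span = SNR_MAX_DB - SNR_MIN_DB
    return SNR_MIN_DB + 0.5 * span * (1.0 + math.cos(math.pi * t / t_period))

def sample_target_snr_db(t, t_period, rng=random):
    """Draw a target SNR in [floor(t), 100] dB (uniform draw is an assumption)."""
    return rng.uniform(snr_lower_db(t, t_period), SNR_UPPER_DB)
```

Evaluated as written, the floor starts at 10 dB when $t = 0$, reaches -10 dB at $t = T_{\text{period}}$, and returns to 10 dB at $t = 2T_{\text{period}}$.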

### D\.4Data Resampling for Sample Rate Consistency

To prevent the model from learning sample\-rate\-specific discriminative shortcuts, we resample all training data to a unified sample rate of 7\.69 MHz \(the native rate for POWDER 4G/5G data\)\.

Original Sample Rates:

- ORACLE [[29](https://arxiv.org/html/2606.04106#bib.bib29)]: 5 MHz $\rightarrow$ upsample to 7.69 MHz
- POWDER [[30](https://arxiv.org/html/2606.04106#bib.bib30)]: 7.69 MHz (native, no resampling)
- Modulation Recognition (internal): 30.72 MHz (8 samples/symbol) $\rightarrow$ downsample to 7.69 MHz
- Commodity SDR (internal): 10 MHz (10-15 kHz bandwidth, heavily oversampled) $\rightarrow$ downsample to 7.69 MHz
- S4_FM Radio (internal): 200 kHz $\rightarrow$ upsample to 7.69 MHz

Rationale: Different sample rates create different time-domain resolutions and frequency-domain spans. Without resampling, the model could learn to discriminate emitters based on sample rate rather than signal structure. Unified resampling ensures the model learns sample-rate-consistent representations, critical for transfer to domains with arbitrary sample rates.
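One practical consequence of a unified rate is that each dataset needs its own rational resampling ratio. The sketch below derives the up/down integer factors that a polyphase resampler would need, assuming the quoted rates are exact (for instance, that 7.69 MHz means 7,690,000 Hz); the resampling method actually used is not stated in this section, and `resample_ratio` is a hypothetical helper.

```python
from fractions import Fraction

TARGET_HZ = 7_690_000  # assumed exact value of the 7.69 MHz target rate

SOURCE_RATES_HZ = {
    "ORACLE": 5_000_000,
    "POWDER": 7_690_000,
    "Modulation Recognition": 30_720_000,
    "Commodity SDR": 10_000_000,
    "S4_FM Radio": 200_000,
}

def resample_ratio(source_hz, target_hz=TARGET_HZ):
    """Return (up, down) integers such that source_hz * up / down == target_hz."""
    ratio = Fraction(target_hz, source_hz)
    return ratio.numerator, ratio.denominator

if __name__ == "__main__":
    for name, rate in SOURCE_RATES_HZ.items():
        up, down = resample_ratio(rate)
        print(f"{name}: up={up}, down={down}")
```

Under that assumption, ORACLE would be resampled by 769/500 and the FM radio data by 769/20, so the upsampled sets gain no new spectral content; only their sample grid changes to match the others.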

### D\.5Stratified Batch Sampling

To support N\-choose\-2 pairwise losses in IsoFICReg \(Section[C\.5\.2](https://arxiv.org/html/2606.04106#A3.SS5.SSS2)\), we employ stratified sampling to ensure balanced class representation in each batch\.

Stratified Distributed Sampler:

Algorithm [11](https://arxiv.org/html/2606.04106#alg11) describes the stratified sampling procedure for distributed training.

Algorithm 11: Stratified Distributed Sampler

Require: Concatenated dataset with $D$ sub-datasets

Require: Number of classes per dataset: $[C_{1}, C_{2}, \ldots, C_{D}]$

Require: Number of samples per class: $[N_{1}, N_{2}, \ldots, N_{\sum C_{d}}]$

Require: Number of GPUs: $G$, GPU rank: $r$

Ensure: Stratified indices for GPU $r$

2: {Compute global parameters}

3: $N_{\max} \leftarrow \max(N_{1}, \ldots, N_{\sum C_{d}})$ {Max samples per class}

4: $N_{\text{shard}} \leftarrow N_{\max} - (N_{\max} \bmod G)$ {Samples per class per GPU}

5: $C_{\text{total}} \leftarrow \sum_{d=1}^{D} C_{d}$ {Total classes}

7: {Generate per-class shuffled indices}

8: for each class $c \in [1, C_{\text{total}}]$ do

9: $N_{c} \leftarrow$ number of samples in class $c$

10: $\text{indices}_{c} \leftarrow$ shuffle samples in class $c$ (with seed)

11: {Repeat indices to reach $N_{\max}$ samples}

12: $\text{indices}_{c} \leftarrow \text{repeat}(\text{indices}_{c}, \lceil N_{\max}/N_{c} \rceil)[:N_{\max}]$

13: {Shard across GPUs}

14: $\text{indices}_{c} \leftarrow \text{indices}_{c}[r : N_{\text{shard}} : G]$

15: end for

17: {Stratified batch construction}

18: $j \leftarrow 0$ {Sample index within each class}

19: for $i = 0$ to $(N_{\text{shard}}/G) \cdot C_{\text{total}} - 1$ do

20: if $i > 0$ and $i \bmod C_{\text{total}} = 0$ then

21: $j \leftarrow j + 1$ {Move to next sample in each class}

22: end if

23: $\text{class\_order} \leftarrow \text{random\_permutation}([1, \ldots, C_{\text{total}}])$

24: $c \leftarrow \text{class\_order}[i \bmod C_{\text{total}}]$

25: yield $\text{indices}_{c}[j]$

26: end for

Key Properties:

1. Balanced representation: Each batch contains exactly one sample from each class \(for batch size = number of classes\) or balanced representation \(for larger batches\)
2. Class shuffling: Within each batch, class order is randomized to prevent positional biases
3. Sample repetition: Classes with fewer samples are repeated \(with shuffling\) to match the largest class, ensuring no class is underrepresented
4. Distributed consistency: Each GPU receives a disjoint shard of samples while maintaining stratification
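The sampling procedure can be sketched in plain Python as follows. This is an illustrative reading of Algorithm 11, not the authors' implementation: the function name `stratified_indices` and its arguments are hypothetical, the class permutation is redrawn once per pass over the classes (an assumption chosen so that every run of `C_total` consecutive indices contains each class exactly once), and each rank is assumed to receive `N_shard / G` positions per class through the strided slice.

```python
import random
from typing import Dict, Iterator, List


def stratified_indices(
    class_to_indices: Dict[int, List[int]],
    num_gpus: int,
    rank: int,
    seed: int = 0,
) -> Iterator[int]:
    """Yield dataset indices for one GPU, one sample per class per round."""
    rng = random.Random(seed)  # same seed on every rank gives identical shuffles
    n_max = max(len(v) for v in class_to_indices.values())
    n_shard = n_max - (n_max % num_gpus)  # largest multiple of G not above N_max

    per_class: Dict[int, List[int]] = {}
    for c in sorted(class_to_indices):
        idx = list(class_to_indices[c])
        rng.shuffle(idx)
        reps = -(-n_max // len(idx))  # ceil(N_max / N_c)
        idx = (idx * reps)[:n_max]  # repeat small classes up to N_max
        per_class[c] = idx[rank:n_shard:num_gpus]  # strided shard for this rank

    classes = sorted(per_class)
    for j in range(n_shard // num_gpus):
        order = classes[:]
        rng.shuffle(order)  # assumption: fresh class order once per round
        for c in order:
            yield per_class[c][j]
```

With three classes of sizes 10, 4, and 7 and two GPUs, each rank would receive 15 indices arranged as five rounds of three classes. Positions are disjoint across ranks, although repeated samples from small classes can appear on more than one rank.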

Rationale: Stratified sampling maximizes the number of valid N\-choose\-2 pairs per batch\. With balanced class representation, each class has multiple samples in the batch, enabling pairwise invariance losses\. Random sampling would produce batches where some classes have only one sample \(or none\), wasting the N\-choose\-2 mechanism for those classes\. This is particularly important for datasets with many classes of unequal size \(39 emitters in our training data\), where random sampling gives no guarantee of multiple samples per class in a batch of 128\.
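As a rough sanity check on this rationale, the probability that a given class contributes at most one sample to a randomly drawn batch can be computed from the binomial distribution. The sketch below assumes the simplest case of 39 equally likely classes and independent draws, which is an assumption made here and not a statement about the actual class frequencies; the function name is hypothetical.

```python
def prob_at_most_one(num_classes: int, batch_size: int) -> float:
    """P(a given class appears 0 or 1 times) under uniform random sampling."""
    p = 1.0 / num_classes
    p_zero = (1.0 - p) ** batch_size
    p_one = batch_size * p * (1.0 - p) ** (batch_size - 1)
    return p_zero + p_one


# 39 classes, per-GPU batch of 128
unpaired_fraction = prob_at_most_one(39, 128)
```

Under these assumptions roughly 16% of classes would have no valid within-class pair in a given batch. Rare classes in an imbalanced dataset would fare worse, whereas the stratified sampler removes the problem by construction.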

### D\.6 Training Scale and Milestones

Due to the high sample rate of RF data \(7\.69 MHz\) and long sequence lengths \(5120 samples\), our training datasets contain an exceptionally large number of sequences\.

Milestone Definition: We measure training progress in milestones, where 1 milestone = 288 batch forward passes\. This provides a more granular progress metric than epochs\.

Training Scale:

- Estimated milestones per epoch: 12,933
- Estimated batches per epoch per GPU: 3,724,780
- Total GPUs: 12 H100s \(80GB each\)
- Global batches per epoch: $3{,}724{,}780 \times 12 = 44{,}697{,}360$

Results Reported: All results in the main paper \(Table [1](https://arxiv.org/html/2606.04106#S4.T1)\) are from the same single model trained for 400 milestones, which represents $400/12{,}933 \approx 3.1\%$ of a single epoch\.

Model Selection: We select the checkpoint that achieves the highest validation accuracy on a linear probe trained online with frozen encoder representations\. This online validation probe is trained on a held\-out subset of the RF training data and evaluated every milestone, ensuring we select weights that produce maximally discriminative representations rather than simply training for a fixed duration\.

Rationale: The exceptional scale of RF data \(high sample rate, long sequences, multiple collections per dataset\) creates a training regime where traditional epoch\-based metrics are impractical\. A single epoch would require $\sim$13,000 milestones or $\sim$3\.7M batches per GPU\. By reporting results at 400 milestones \(3\.1% of one epoch\), we demonstrate that the co\-designed architecture and losses enable learning of physical principles from limited data exposure, a property that may be valuable for domains where exhaustive data coverage is impractical\. The online validation probe ensures we select the checkpoint with the best representation quality rather than relying on training loss alone\.
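The milestone figures in this section can be reproduced with a few lines of arithmetic. The constant names below are hypothetical, and the sketch assumes that the 288 batches in a milestone are counted per GPU, which is the reading under which 12,933 milestones match the quoted per-GPU batch count.

```python
BATCHES_PER_MILESTONE = 288
BATCHES_PER_EPOCH_PER_GPU = 3_724_780
NUM_GPUS = 12
MILESTONES_TRAINED = 400

# Milestones needed to see every per-GPU batch once
milestones_per_epoch = BATCHES_PER_EPOCH_PER_GPU / BATCHES_PER_MILESTONE
# Batches processed across all GPUs in one epoch
global_batches_per_epoch = BATCHES_PER_EPOCH_PER_GPU * NUM_GPUS
# Fraction of one epoch covered by the reported checkpoint
epoch_fraction = MILESTONES_TRAINED / milestones_per_epoch
```

Rounded, these give 12,933 milestones per epoch, 44,697,360 global batches per epoch, and an epoch fraction of about 3.1%.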

### D\.7 Hardware and Training Time

Hardware: 12× NVIDIA H100 GPUs \(80GB VRAM each\) with NVLink interconnect

Distributed Training: PyTorch Distributed Data Parallel \(DDP\) with all\_gather for batch statistics in IsoFICReg

Training Time: Approximately 1,860 seconds per milestone \(288 batch forward passes\)\. For 400 milestones, total training time is approximately 8\.6 days \(206\.4 hours\)\.

Total Compute: Approximately 2,477 H100\-hours \(12 GPUs × 206\.4 hours\) for the reported results, equivalent to 103 H100\-days\.
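The compute budget follows from the per-milestone time. The sketch below recomputes it from the quoted 1,860 seconds per milestone; the constant names are hypothetical.

```python
SECONDS_PER_MILESTONE = 1_860
MILESTONES = 400
NUM_GPUS = 12

wall_hours = SECONDS_PER_MILESTONE * MILESTONES / 3600  # wall-clock hours
wall_days = wall_hours / 24
gpu_hours = wall_hours * NUM_GPUS  # H100-hours across all GPUs
gpu_days = gpu_hours / 24
```

Computed this way the totals are about 206.7 wall-clock hours and 2,480 H100-hours, slightly above the quoted 206.4 hours and 2,477 H100-hours. The small gap is consistent with rounding the duration to 8.6 days before converting back to hours.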

Time Breakdown: The per\-milestone time includes:

- Forward passes: Encoder \(dual\-domain\), decoder \(dual\-domain\), projection heads, auxiliary networks \(LED transformers, source separation transformers\)
- Loss computation: IsoFICReg \(N\-choose\-2 pairwise across distributed GPUs\), LED \(differences and regression\), reconstruction \(long sequences in memory\), regularization losses
- Backward passes: Gradients through all networks
- Monitoring: t\-SNE visualizations \(every 6 milestones\), reconstruction visualizations, loss logging, online validation probe evaluation

The training cost is dominated by:

1. Multiple encoder/decoder passes: Clean branch, augmented branch, latent\-transformed branch, source separation \(when applicable\)
2. Long sequence reconstruction: 5120\-sample sequences stored in memory for dual\-domain reconstruction losses
3. N\-choose\-2 pairwise losses: Computing $\binom{N_c}{2}$ pairs per class across global batch \(1536 samples\)
4. Distributed communication: all\_gather operations for batch statistics and representations
5. Online validation probe: Linear classifier trained every milestone for model selection

Memory Usage: Approximately 168GB / 187GB per node \(2 GPUs per node\)\. Memory is primarily consumed by:

- Model parameters \(encoder, decoder, auxiliary networks\)
- Optimizer states \(Adam maintains first and second moments\)
- Activations and gradients for long sequences \(5120 samples\)
- Intermediate representations for multiple branches \(clean, augmented, latent\-transformed\)
- Monitoring buffers \(t\-SNE embeddings, reconstruction samples, validation probe\)

GPU Scaling: The 12\-GPU configuration was selected to complete training within our project timeline\. However, the framework scales linearly with GPU count: we have successfully trained comparable models using 4 and 8 H100 GPUs with identical hyperparameters and data\. The 12\-GPU specification ensures exact reproducibility of the reported checkpoint, but researchers with smaller computational budgets can achieve similar frozen\-representation transfer performance with fewer GPUs and proportionally longer training times\. This linear scaling demonstrates that our approach is accessible to institutions with varying resource constraints\.

Deployment Efficiency: Critically, these training costs do not reflect deployment requirements\. At inference, only the lightweight encoder \(1\.99M parameters\) is needed for classification tasks, or encoder \+ decoder for reconstruction tasks \(denoising, source separation\)\. The auxiliary networks \(projection heads, LED transformers\) are discarded after training\. Inference on a single sample requires only one forward pass through the encoder, with no distributed communication or pairwise loss computation\.

Optimization Opportunities: The current implementation prioritizes research flexibility over computational efficiency\. Substantial speedups are likely achievable through:

- Gradient checkpointing for long sequences
- Optimized attention kernels \(FlashAttention \[[55](https://arxiv.org/html/2606.04106#bib.bib55)\]\)
- Reduced monitoring frequency \(t\-SNE, visualizations\)
- Batch size tuning and gradient accumulation

We have not pursued these optimizations as training time was not a bottleneck for our research objectives\. The 8\.6\-day training time for 400 milestones demonstrates that effective cross\-modal transfer is achievable with reasonable computational budgets, particularly given that this represents only 3\.1% of a single epoch \(Section [D\.6](https://arxiv.org/html/2606.04106#A4.SS6)\)\.

## Appendix E: Dataset & Evaluation Details

### E\.1 Training Datasets

We train exclusively on RF fingerprinting datasets, which provide weak supervision critical for our N\-choose\-2 pairwise learning \(Section [C\.5\.2](https://arxiv.org/html/2606.04106#A3.SS5.SSS2)\)\. RF fingerprinting datasets are collected such that each file contains transmissions from a single known emitter, providing weak labels for latent coherent integration without requiring sample\-level annotations\.

Critical Challenge: A key risk in RF fingerprinting is learning spurious correlations from time\-varying channels rather than hardware characteristics\. Each file experiences unique channel conditions \(multipath, fading, interference\), creating a potential shortcut where the model discriminates files by channel rather than emitter\. This motivates our physically\-anchored regularizations: reconstruction losses enforce instance\-specific fidelity \(preventing over\-discrimination\), LED learns channel\-invariant equivariances \(frequency/time shifts, phase rotations\), and IsoFICReg with augmented\-latent pairing provides implicit denoising\. Together, these mechanisms encourage the model to learn hardware fingerprints rather than channel artifacts\.

#### E\.1\.1 ORACLE

- License: CC BY 4\.0 \(Inferred/Standard for GENESYS Lab public datasets\)\.
- Version: v1\.0 \(Initial Release, April 2019\)\.

Description: 16 Ettus X310 software\-defined radios transmitting identical data \(same MAC address, protocol, modulation\) at multiple distances and across two temporal collections\.

Training Subset: We use distances of 14, 26, 38, and 50 feet from the first collection\.

Challenge: ORACLE presents extreme fine\-grained discrimination: all emitters share identical transmitted data, MAC addresses, protocols, and locations\. The only distinguishing features are subtle hardware imperfections \(I/Q imbalance, carrier frequency offset, phase noise\)\. This forces learning of highly discriminative representations that capture manufacturing variations rather than protocol or content differences\.

Weak Label: All samples from the same emitter across all distances and windows receive the same label, enabling N\-choose\-2 pairing across diverse channel conditions\.

Native Sample Rate: 5 MHz → resampled to 7\.69 MHz

#### E\.1\.2 POWDER

- License: CC BY 4\.0 \(Inferred/Standard for GENESYS Lab public datasets\)\.
- Version: v1\.0 \(Initial Release, December 2020\)\.

Description: 4 Ettus X310 base stations in an outdoor setting, each transmitting WiFi, 4G, and 5G protocols across 5 sequential collections over two days\.

Training Subset: WiFi and 4G transmissions from all 4 base stations on Day 1 only\. Day 2 and 5G transmissions are reserved for cross\-protocol and temporal generalization testing \(reported in Table [1](https://arxiv.org/html/2606.04106#S4.T1)\)\.

Protocol\-Invariant Learning: Critically, we assign the same weak label to both WiFi and 4G transmissions from each base station\. This forces the model to learn emitter\-specific hardware characteristics that persist across protocols, rather than protocol\-specific features\. The reconstruction losses and LED ensure protocol\-specific structure is retained in the representations \(enabling protocol classification if needed downstream\), while IsoFICReg encourages protocol\-invariant emitter fingerprints\.

This design balances global \(emitter identity\) versus local \(protocol details\) and persistent \(hardware fingerprints\) versus transient \(protocol structure\) representations, enabling the model to support diverse downstream tasks without committing to a single level of abstraction\.
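One way to picture the weak-label scheme is as a lookup keyed only by dataset and emitter, so that protocol or modulation cannot influence the label. The sketch below is illustrative only: the dictionary keys, function names, and zero-based emitter ids are hypothetical, and the per-dataset emitter counts are the ones listed in Table 13.

```python
from typing import Dict, Tuple

# Emitter counts per training dataset (from Table 13); key names are hypothetical.
EMITTERS_PER_DATASET: Dict[str, int] = {
    "ORACLE": 16,
    "POWDER": 4,
    "ModulationRecognition": 1,
    "CommoditySDR": 12,
    "S4_FM": 6,
}


def build_label_map(emitters: Dict[str, int]) -> Dict[Tuple[str, int], int]:
    """Assign one integer label per (dataset, emitter) pair."""
    label_map: Dict[Tuple[str, int], int] = {}
    for dataset, count in emitters.items():
        for emitter_id in range(count):
            label_map[(dataset, emitter_id)] = len(label_map)
    return label_map


LABEL_MAP = build_label_map(EMITTERS_PER_DATASET)


def weak_label(dataset: str, emitter_id: int, protocol: str) -> int:
    """Return the weak label; `protocol` is accepted but deliberately ignored."""
    return LABEL_MAP[(dataset, emitter_id)]
```

Under this scheme a WiFi capture and a 4G capture from the same POWDER base station map to the same label, and the five datasets together yield 39 classes.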

Native Sample Rate:

- WiFi: 5 MHz → resampled to 7\.69 MHz
- 4G/5G: 7\.69 MHz \(no resampling required\)

#### E\.1\.3 Modulation Recognition

Source: Internal dataset, to be released publicly with forthcoming publication

Description: 11 digital and analog modulations \(’BPSK’, ’QPSK’, ’8PSK’, ’CPFSK’, ’GFSK’, ’PAM4’, ’16QAM’, ’64QAM’, ’SSB\-AM’, ’DSB\-AM’, ’B\-FM’\) transmitted from a single Ettus X310 over a wired connection\.

Training Subset: All 11 modulations from the training emitter\.

Modulation\-Invariant Learning: Similar to POWDER, we assign the same weak label to all modulations, forcing the model to learn modulation\-invariant features \(e\.g\., hardware characteristics\) while reconstruction and LED preserve modulation\-specific structure \(magnitude/phase patterns\)\. This tests whether the model can learn fine\-grained discriminative features \(different modulations have distinct magnitude and phase structures\) while maintaining instance\-level fidelity\.

Evaluation Subset: 8 modulations \(’BPSK’, ’QPSK’, ’8PSK’, ’CPFSK’, ’GFSK’, ’PAM4’, ’16QAM’, ’64QAM’\) from a different radio and day, used for the modulation classification result in Table [1](https://arxiv.org/html/2606.04106#S4.T1)\.

Native Sample Rate: 30\.72 MHz \(8 samples/symbol\) → downsampled to 7\.69 MHz

#### E\.1\.4 Commodity SDR

Source: Internal dataset, to be released upon acceptance

Description: 12 commodity software\-defined radios from 4 families: 3× Ettus B210, 3× HackRF, 3× Ettus UBX160, 3× Analog Devices PlutoSDR\. All transmit continuous phase shift keying \(CPSK\) with 10\-15 kHz bandwidth \(heavily oversampled at 10 MHz\)\.

Training Data: Collections from two days with varying spatial configurations \(5 ft and 10 ft transmitter\-receiver separation, cluttered indoor environment\)\. Using data across multiple days encourages learning of hardware fingerprints that persist despite temporal and spatial channel variations\.

Challenge: The narrow bandwidth \(10\-15 kHz\) relative to sample rate \(10 MHz\) creates a challenging scenario where most spectral content is empty\. This tests whether the model learns to focus on signal\-bearing regions rather than relying on full\-bandwidth features\.

Native Sample Rate: 10 MHz → downsampled to 7\.69 MHz

#### E\.1\.5 S4\_FM Radio

Source: Internal dataset, to be released upon acceptance

Description: 6 real\-world FM radio stations collected over\-the\-air in outdoor settings\.

Training Subset: All 6 stations\.

Challenge: Real\-world over\-the\-air collection introduces uncontrolled channel conditions, interference, and time\-varying propagation\. This tests robustness to realistic deployment scenarios\.

Native Sample Rate: 200 kHz → upsampled to 7\.69 MHz

#### E\.1\.6 Training Dataset Summary

Train/Validation Split: For all datasets, we use an 80/20 split per file for training and validation, ensuring both splits experience the same channel conditions while providing held\-out samples for monitoring training progress\.
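A per-file split of this kind might look like the sketch below. The text does not say whether windows are split contiguously or at random within a file, so the seeded random shuffle here is an assumption, as are the function name and the `(file name, window index)` representation.

```python
import random
from typing import Dict, List, Tuple

Window = Tuple[str, int]  # (file name, window index within that file)


def split_per_file(
    windows_per_file: Dict[str, int], train_frac: float = 0.8, seed: int = 0
) -> Tuple[List[Window], List[Window]]:
    """Split the windows of every file 80/20 so both splits cover every file."""
    train: List[Window] = []
    val: List[Window] = []
    for name in sorted(windows_per_file):
        order = list(range(windows_per_file[name]))
        # Per-file seed keeps the split reproducible and independent of file order
        random.Random(f"{seed}:{name}").shuffle(order)
        cut = int(round(train_frac * len(order)))
        train += [(name, w) for w in order[:cut]]
        val += [(name, w) for w in order[cut:]]
    return train, val
```

Because every file contributes to both splits, the validation windows share channel conditions with the training windows, which matches the stated goal of monitoring training progress as opposed to measuring generalization to new channels.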

Table [13](https://arxiv.org/html/2606.04106#A5.T13) summarizes the training datasets\.

Table 13: Training Dataset Summary

| Dataset | Emitters | Protocols/Mods | Environment | Native Rate | Resampled |
|---|---|---|---|---|---|
| ORACLE | 16 | WiFi | Indoor \(4 dist\.\) | 5 MHz | 7\.69 MHz |
| POWDER | 4 | WiFi, 4G | Outdoor | 7\.69 MHz | \- |
| Modulation Recognition | 1 | 11 modulations | Wired | 30\.72 MHz | 7\.69 MHz |
| Commodity SDR | 12 \(4 families\) | CPSK | Indoor \(cluttered\) | 10 MHz | 7\.69 MHz |
| S4\_FM Radio | 6 | FM broadcast | Outdoor \(OTA\) | 200 kHz | 7\.69 MHz |
| Total | 39 classes | Diverse | Varied | \- | 7\.69 MHz |

Diversity Rationale: The training datasets span:

- Hardware: Ettus X310, B210, UBX160, HackRF, PlutoSDR, commercial FM transmitters
- Protocols: WiFi, 4G, 11 digital/analog modulations, FM broadcast
- Environments: Indoor \(controlled, cluttered\), outdoor \(OTA\), wired
- Sample rates: 200 kHz to 30\.72 MHz \(154× range\)
- Bandwidths: 10 kHz \(Commodity SDR\) to multi\-MHz \(POWDER\)

This diversity forces learning of general signal\-theoretic principles rather than dataset\-specific shortcuts, critical for frozen\-encoder representations to transfer to unseen domains \(audio, images, seismology, text\)\.

### E\.2 Evaluation Datasets

We evaluate cross\-modal frozen\-encoder representation transfer across 15 tasks spanning RF, audio, seismology, text, images, and video \(Table [14](https://arxiv.org/html/2606.04106#A5.T14)\)\. All evaluations use frozen encoder representations with no fine\-tuning\.

#### E\.2\.1 Evaluation Protocol

Linear Classifier: Support Vector Machine \(SVM\) with linear kernel \(reported in main paper\) and RBF kernel \(reported in Table [14](https://arxiv.org/html/2606.04106#A5.T14)\)\.

Train/Test Split: For datasets without explicit test sets, we perform 5\-fold cross\-validation with stratified 80/20 splits per class\. For datasets with explicit test sets, we perform 5\-fold cross\-validation on the training data while using the same test set for all folds\.

Variable\-Length Processing: For sequences longer than the training window \(5120 samples\), we apply the variable\-length processing strategy \(Section [A\.9\.3](https://arxiv.org/html/2606.04106#A1.SS9.SSS3)\): segment into fixed\-size windows, process through encoder, aggregate tokens \(time: concatenate, frequency: average\), and apply sequence pooling\.

Preprocessing: Real\-valued signals undergo Hilbert transformation to generate complex\-valued \(IQ\) representations\. Images are unwrapped in snake pattern \(vertical then horizontal\), with each color channel processed independently and representations concatenated\. Video frames are unwrapped as images and concatenated to maintain causal structure\. Data is either resampled to a minimum of 5120 samples \(images, video\) or processed at native length with variable\-length extension \(audio, text, seismology\)\.
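The snake unwrapping step can be illustrated with a short sketch. The exact traversal is not specified beyond "vertical then horizontal", so the code assumes a column-by-column scan that runs down even columns and up odd columns; the function name is hypothetical, and the subsequent resampling to 5120 samples and the Hilbert transform are not shown.

```python
from typing import List, Sequence


def snake_unwrap(image: Sequence[Sequence[float]]) -> List[float]:
    """Flatten one image channel column by column, reversing every other column.

    Assumed traversal: down column 0, up column 1, down column 2, and so on,
    so that consecutive samples are always spatially adjacent pixels.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    signal: List[float] = []
    for col in range(width):
        rows = range(height) if col % 2 == 0 else range(height - 1, -1, -1)
        signal.extend(image[row][col] for row in rows)
    return signal


# A 2x3 single-channel example
example = snake_unwrap([[1, 2, 3], [4, 5, 6]])  # → [1, 4, 5, 2, 3, 6]
```

For an RGB image the same function would be applied to each color channel separately, with the three resulting representations concatenated after encoding, as described above.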

#### E\.2\.2 Linear Probing

Classifier: Linear SVM \(LinearSVC from scikit\-learn \[[56](https://arxiv.org/html/2606.04106#bib.bib56)\]\) with balanced class weights\.

Hyperparameter Search: 5\-fold cross\-validation grid search over regularization parameter $C \in \{10^{-4}, 10^{-3}, \ldots, 10^{5}\}$\.

Configuration:

- Regularization: $C$ selected via grid search
- Class weighting: Balanced \(inversely proportional to class frequencies\)
- Maximum iterations: $10^{7}$
- Solver: Default \(dual optimization for $n_{\text{samples}} < n_{\text{features}}$, else primal\)

Top\-k Accuracy: For top\-3 accuracy, we use the decision function scores \(distance to separating hyperplane\) as confidence estimates, ranking predictions accordingly\.
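Ranking by decision scores can be sketched without any library dependency. The function below is a minimal illustration with a hypothetical name; it assumes a score matrix with one row per sample and one column per class, such as the array a multiclass `LinearSVC.decision_function` call returns.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 3
) -> float:
    """Fraction of samples whose true class is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest decision score
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


scores = [
    [0.1, 2.0, -1.0, 0.5],
    [1.5, -0.2, 0.3, 0.9],
    [-2.0, -1.0, -0.5, -3.0],
]
labels = [3, 0, 3]
top1 = top_k_accuracy(scores, labels, k=1)
top3 = top_k_accuracy(scores, labels, k=3)
```

In this toy example only the second sample is correct at top-1, while the first sample is recovered at top-3 because its true class has the second-highest score.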

#### E\.2\.3 Non\-Linear Probing \(RBF Kernel\)

Classifier: SVM with Radial Basis Function \(RBF\) kernel\.

Hyperparameter Search: 5\-fold cross\-validation grid search over:

- Regularization: $C \in \{10^{-4}, 10^{-3}, \ldots, 10^{5}\}$
- Kernel coefficient: $\gamma = \text{scale} = \frac{1}{n_{\text{features}} \cdot \text{Var}(\mathbf{X})}$

Configuration:

- Kernel: RBF, $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$
- Class weighting: Balanced
- Maximum iterations: $10^{7}$
- Decision function: One\-vs\-rest \(OVR\)
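The kernel coefficient and kernel value defined above can be computed directly. The sketch below uses hypothetical function names and takes the variance over all entries of the feature matrix (population variance), which is how scikit-learn's `gamma="scale"` setting is defined.

```python
from math import exp
from typing import Sequence


def gamma_scale(X: Sequence[Sequence[float]]) -> float:
    """gamma = 1 / (n_features * Var(X)), with Var taken over all entries of X."""
    n_features = len(X[0])
    flat = [v for row in X for v in row]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)  # population variance
    return 1.0 / (n_features * var)


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * sq_dist)


X = [[0.0, 2.0], [2.0, 0.0]]
gamma = gamma_scale(X)              # variance is 1.0, so gamma = 0.5
k01 = rbf_kernel(X[0], X[1], gamma)  # squared distance 8, so exp(-4)
```

Because the coefficient is normalized by both the feature count and the data variance, the kernel width adapts to the scale of the frozen representations without a separate search over gamma.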

Top\-k Accuracy: For top\-3 accuracy, we enable probability estimates \(Platt scaling \[[57](https://arxiv.org/html/2606.04106#bib.bib57)\]\) and use predicted probabilities for ranking\.

#### E\.2\.4 MFCC Baseline Features

For audio tasks \(speaker recognition, language recognition, music classification\), we compare against expert\-crafted Mel\-Frequency Cepstral Coefficients \(MFCCs\) to validate that learned representations capture comparable or superior information to decades of domain\-specific signal processing research\.

Feature Extraction: We use the python\_speech\_features library \[[58](https://arxiv.org/html/2606.04106#bib.bib58)\] with default parameters:

- Window length: 25ms
- Window step: 10ms
- Number of cepstral coefficients: 13
- Number of mel\-filterbank channels: 26
- FFT size: 512
- Pre\-emphasis coefficient: 0\.97

Aggregation: For variable\-length audio signals, we compute frame\-level MFCCs and aggregate via mean and standard deviation across the temporal dimension:

$$\mathbf{f}_{\text{MFCC}} = [\mu(\text{MFCC}); \sigma(\text{MFCC})] \in \mathbb{R}^{26} \tag{265}$$

where $\mu(\cdot)$ and $\sigma(\cdot)$ compute mean and standard deviation over frames, producing a fixed\-size feature vector of dimension 26 \(13 coefficients × 2 statistics\)\.

Classification: MFCC features are classified using the same linear and non\-linear SVM protocols described above, ensuring fair comparison\.
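The aggregation in Eq. (265) reduces a variable number of frames to a fixed 26-dimensional vector. The sketch below shows that step in isolation; the function name is hypothetical, the frame matrix is assumed to have one row per frame and 13 columns (the layout returned by `python_speech_features.mfcc`), and the population standard deviation is an assumption since the text does not state which estimator is used.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def aggregate_frames(frames: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-coefficient mean and standard deviation over frames."""
    columns = list(zip(*frames))  # one tuple per cepstral coefficient
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # assumption: population std
    return means + stds


# Four synthetic frames of 13 coefficients each
frames = [[float(i + j) for j in range(13)] for i in range(4)]
features = aggregate_frames(frames)
```

The output length depends only on the number of coefficients, so clips of 2 seconds and 30 seconds both map to 26 features.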

#### E\.2\.5 RF Tasks

Modulation Recognition: 8 modulations \(’BPSK’, ’QPSK’, ’8PSK’, ’CPFSK’, ’GFSK’, ’PAM4’, ’16QAM’, ’64QAM’\) from a held\-out radio \(different from training\) and channel \(different day\)\. Explicit test set\. 5\-fold CV on training data, fixed test set\.

RF Fingerprinting \(POWDER\) \[[30](https://arxiv.org/html/2606.04106#bib.bib30)\]: 4 base stations transmitting WiFi, 4G, and 5G \(unseen protocol\) on Day 2 \(unseen temporal collection\)\. Explicit test set\. 5\-fold CV on training data, fixed test set\.

#### E\.2\.6 Audio Tasks

Bilingual Speaker Recognition \(TidyVoiceX\_Dev\) \[[33](https://arxiv.org/html/2606.04106#bib.bib33),[47](https://arxiv.org/html/2606.04106#bib.bib47)\]: 808 speakers across 40 languages\. Due to scale, we randomly sample 50 speakers for evaluation\. 5\-fold CV with stratified 80/20 splits\. Each fold produces $\sim$29 languages on average\. Audio files: $\sim$5 seconds at 16 kHz\. Variable\-length processing applied\. Each file is one sample\.

- License: CC BY 4\.0\.
- Version: v2\.0 \(Official Evaluation Set, April 2, 2026\)\.

Language Recognition \(TidyVoiceX\_Dev\) \[[33](https://arxiv.org/html/2606.04106#bib.bib33),[47](https://arxiv.org/html/2606.04106#bib.bib47)\]: Same data as speaker recognition\. Each fold produces $\sim$29 languages from the 50 sampled speakers\. Same processing as speaker recognition\.

- License: CC BY 4\.0\.
- Version: v2\.0 \(Official Evaluation Set, April 2, 2026\)\.

Instrument Classification \(TinySOL\) \[[34](https://arxiv.org/html/2606.04106#bib.bib34)\]: 14 instruments across 4 families\. All data used\. 5\-fold CV with stratified 80/20 splits\. Audio files: 2\-10 seconds at 44\.1 kHz\. Variable\-length processing applied\. Each file is one sample\.

- License: CC BY 4\.0\.
- Version: v1\.0\.

Music Genre Recognition \(GTZAN\) \[[35](https://arxiv.org/html/2606.04106#bib.bib35),[48](https://arxiv.org/html/2606.04106#bib.bib48)\]: 10 genres\. All data used\. 5\-fold CV with stratified 80/20 splits\. Audio files: 30 seconds at 22\.05 kHz\. Variable\-length processing applied\. Each file is one sample\.

- License: Available for research and educational purposes\. The dataset is intended strictly for non\-commercial academic use, as it was originally collected for research without explicit copyright permissions for all tracks\.
- Version: Original release \(2002\)\. This foundational version includes 1,000 audio tracks \(30 seconds each\) across 10 distinct musical genres\.

#### E\.2\.7 Seismology Task

Seismic Event Classification \(SCSN/SCEDC\) \[[37](https://arxiv.org/html/2606.04106#bib.bib37)\]: 3 classes \(local quakes, noise, teleseismic events\)\. Explicit test set with class imbalance\.

- License: Open access for research use per SCEDC data policy\.
- Version: Data accessed \[03/2026\]\.

Data Structure: Each sample consists of 3 channels \(X, Y, Z seismometer measurements\)\. We process each channel independently through the encoder and concatenate the resulting representations before classification\.

Sampling Strategy: The test set has imbalanced classes\. To create balanced evaluation, we:

1. Identify the smallest test class
2. Randomly sample from larger test classes \(local quakes, noise\) to match
3. Use all available training data with 5\-fold CV

This ensures the classifier is evaluated on balanced test data while using all available training samples\.
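The balancing steps above amount to downsampling every test class to the size of the smallest one. A minimal sketch follows, with a hypothetical function name and an assumed fixed seed for reproducibility; the class names and counts in the example are made up for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def balance_test_set(labels: Sequence[str], seed: int = 0) -> List[int]:
    """Return test indices with every class downsampled to the smallest class."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_min = min(len(v) for v in by_class.values())  # size of smallest class
    rng = random.Random(seed)
    keep: List[int] = []
    for label in sorted(by_class):
        keep.extend(rng.sample(by_class[label], n_min))  # without replacement
    return sorted(keep)


test_labels = ["quake"] * 5 + ["noise"] * 7 + ["teleseismic"] * 2
kept = balance_test_set(test_labels)
```

Only the test set is subsampled in this sketch; the training data stays untouched, which matches the third step in the list.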

#### E\.2\.8 Text Tasks

ArXiv Paper Classification \[[38](https://arxiv.org/html/2606.04106#bib.bib38)\]: 9 sub\-disciplines \(2 fields: math, computer science\)\. Explicit test set: 28,000 training files, 2,500 test files\. 5\-fold CV on training data, fixed test set\.

- License: \[Mixed\] arXiv Non\-exclusive License / CC BY \(per original authors\)\.
- Version: v1\.0 \(March 2019/2021\)\.

Processing: Papers converted to byte streams \(UTF\-8 encoding\), processed as 1D sequences\. Average length: $\gg$4,000 characters\. Variable\-length processing applied\.
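Converting text to a 1D sequence is a one-line operation in most languages. The sketch below shows one possible form, with a hypothetical function name; any scaling or centering applied before the Hilbert transform is not described in this section and is therefore omitted.

```python
from typing import List


def text_to_signal(text: str) -> List[float]:
    """Encode text as UTF-8 and return the byte values as a real-valued sequence."""
    return [float(b) for b in text.encode("utf-8")]


signal = text_to_signal("a\u00e9")  # → [97.0, 195.0, 169.0]
```

Non-ASCII characters expand to several bytes, so the sequence length is the byte count and can exceed the character count.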

Binary Field Classification: Same data, binary classification \(math vs\. computer science\)\.

#### E\.2\.9 Image Tasks

All images resampled to 5120 samples after unwrapping \(no variable\-length processing needed\)\.

MNIST \[[39](https://arxiv.org/html/2606.04106#bib.bib39)\]: 10 digit classes\. Explicit test set \(60k train, 10k test\)\. 5\-fold CV on training data, fixed test set\. Images: 28×28 grayscale\.

- License: Public domain \(CC0 1\.0\)\.
- Version: Original release\.

FashionMNIST \[[40](https://arxiv.org/html/2606.04106#bib.bib40)\]: 10 clothing classes\. Explicit test set \(60k train, 10k test\)\. 5\-fold CV on training data, fixed test set\. Images: 28×28 grayscale\.

- License: MIT License\.
- Version: 1\.0 \(August 2017\)\.

PathMNIST \[[42](https://arxiv.org/html/2606.04106#bib.bib42)\]: 9 tissue pathology classes\. Explicit test set\. 5\-fold CV on training data, fixed test set\. Images: 64×64 RGB\.

- License: CC BY 4\.0\.
- Version: v2\.0 \(April 2022\)\.

CIFAKE \[[43](https://arxiv.org/html/2606.04106#bib.bib43)\]: 2 classes \(real vs\. AI\-generated images\)\. Explicit test set\. 5\-fold CV on training data, fixed test set\. Images: 32×32 RGB\.

- License: CC BY 4\.0\.
- Version: v1\.0 \(March 2023\)\.

#### E\.2\.10 Video Task

Mitosis Classification \[[44](https://arxiv.org/html/2606.04106#bib.bib44),[45](https://arxiv.org/html/2606.04106#bib.bib45)\]: 2 classes \(normal vs\. abnormal cell division\)\. Explicit test set: 317 normal \+ 146 abnormal \(train\), 47 normal \+ 47 abnormal \(test\)\. 5\-fold CV on training data, fixed test set\.

- License: CC BY 4\.0\.
- Version: v1\.0 \(April 2023\)\.

Processing: Each frame unwrapped as image, resampled to 5120 samples, successive frames concatenated to maintain causal structure\.

### E\.3 Evaluation Results with RBF Kernel

Table [14](https://arxiv.org/html/2606.04106#A5.T14) presents results for both linear and RBF kernel SVMs\. The RBF kernel provides non\-linear decision boundaries, testing whether frozen representations benefit from non\-linear classification\. Results show a modest average improvement \(77\.7% linear vs\. 78\.2% RBF\), with physical tasks maintaining strong performance under both kernels \(84\.5% linear vs\. 83\.6% RBF\) and semantic tasks showing slight benefit from non\-linearity \(70\.0% linear vs\. 72\.0% RBF\)\. This demonstrates that learned representations are largely linearly separable, validating representation quality\.
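The averages quoted in this paragraph can be recomputed from the per-task Top-1 values in Table 14. In the sketch below the variable and function names are hypothetical, and the grouping treats every task whose type is not purely "Physical" as part of the "Semantic/Mixed" average, which is the split that reproduces the reported numbers.

```python
from statistics import fmean

# task: (type, Top-1 linear, Top-1 RBF), copied from Table 14
TOP1 = {
    "Modulation Rec.": ("Physical", 95.5, 88.1),
    "Fingerprinting (POWDER)": ("Physical", 87.6, 85.4),
    "Speaker (TidyVoice)": ("Physical", 90.1, 86.1),
    "Language (TidyVoice)": ("Semantic", 69.8, 72.9),
    "Instrument Family (TinySOL)": ("Physical", 91.7, 91.1),
    "Instrument (TinySOL)": ("Phys+Sem", 80.5, 79.9),
    "Genre (GTZAN)": ("Semantic", 64.1, 62.4),
    "Event Classification": ("Physical", 89.0, 90.2),
    "ArXiv Sub-Discipline": ("Semantic", 36.9, 37.9),
    "ArXiv Field": ("Struct+Sem", 82.7, 84.0),
    "MNIST": ("Phys+Sem", 79.2, 85.5),
    "FashionMNIST": ("Phys+Sem", 77.0, 81.6),
    "PathMNIST": ("Physical", 71.3, 73.2),
    "CIFAKE": ("Physical", 82.0, 87.2),
    "Mitosis (Full Video)": ("Physical", 68.7, 67.4),
}


def group_means(physical: bool):
    """Mean (linear, RBF) accuracy over physical or non-physical tasks."""
    rows = [v for v in TOP1.values() if (v[0] == "Physical") == physical]
    return fmean(r[1] for r in rows), fmean(r[2] for r in rows)


all_linear = fmean(v[1] for v in TOP1.values())
all_rbf = fmean(v[2] for v in TOP1.values())
phys_linear, phys_rbf = group_means(physical=True)
sem_linear, sem_rbf = group_means(physical=False)
```

Rounded to one decimal place these reproduce 77.7/78.2 overall, 84.5/83.6 for the eight physical tasks, and 70.0/72.0 for the seven semantic or mixed tasks.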

Table 14: Frozen\-encoder representation transfer results with linear and RBF kernel SVMs \(5\-fold CV\)\. Physical tasks show strong performance with both kernels, while semantic tasks show graceful degradation in linear setting and slight benefit from nonlinearity\.

| Domain | Task | Type | Top\-1 \(Linear\) | Top\-1 \(RBF\) | Top\-3 \(Linear\) |
|---|---|---|---|---|---|
| RF | Modulation Rec\. | Physical | 95\.5 ± 0\.8 | 88\.1 ± 3\.4 | 99\.2 ± 0\.2 |
| RF | Fingerprinting \(POWDER\) | Physical | 87\.6 ± 0\.3 | 85\.4 ± 1\.4 | N/A |
| Audio | Speaker \(TidyVoice\) | Physical | 90\.1 ± 2\.6 | 86\.1 ± 2\.6 | 97\.7 ± 1\.0 |
| Audio | Language \(TidyVoice\) | Semantic | 69\.8 ± 4\.1 | 72\.9 ± 3\.1 | 91\.2 ± 2\.0 |
| Audio | Instrument Family \(TinySOL\) | Physical | 91\.7 ± 1\.1 | 91\.1 ± 0\.9 | N/A |
| Audio | Instrument \(TinySOL\) | Phys\+Sem | 80\.5 ± 1\.5 | 79\.9 ± 1\.5 | 95\.5 ± 0\.9 |
| Audio | Genre \(GTZAN\) | Semantic | 64\.1 ± 3\.2 | 62\.4 ± 2\.5 | 89\.4 ± 2\.1 |
| Seismology | Event Classification | Physical | 89\.0 ± 0\.3 | 90\.2 ± 0\.1 | N/A |
| Text | ArXiv Sub\-Discipline | Semantic | 36\.9 ± 0\.4 | 37\.9 ± 0\.3 | 71\.8 ± 0\.3 |
| Text | ArXiv Field | Struct\+Sem | 82\.7 ± 0\.2 | 84 ± 0\.4 | N/A |
| Images | MNIST | Phys\+Sem | 79\.2 ± 0\.1 | 85\.5 ± 0\.1 | 94\.5 ± 0\.05 |
| Images | FashionMNIST | Phys\+Sem | 77\.0 ± 0\.01 | 81\.6 ± 0\.2 | 96\.1 ± 0\.04 |
| Images | PathMNIST | Physical | 71\.3 ± 0\.2 | 73\.2 ± 0\.4 | 91\.9 ± 0\.2 |
| Images | CIFAKE | Physical | 82\.0 ± 0\.1 | 87\.2 ± 0\.1 | N/A |
| Video | Mitosis \(Full Video\) | Physical | 68\.7 ± 2\.8 | 67\.4 ± 1\.0 | N/A |
| | Average \(All\) | | 77\.7% | 78\.2% | 91\.9% |
| | Average \(Physical\) | | 84\.5% | 83\.6% | 96\.3% |
| | Average \(Semantic/Mixed\) | | 70\.0% | 72\.0% | 89\.8% |

![Refer to caption](https://arxiv.org/html/2606.04106v1/Figures/PhysicalVsSemanticTSNEVisualization_labels.png)

Figure 9: Learned representation structure: Physical vs\. Semantic tasks\. t\-SNE visualizations of frozen encoder representations across eight representative tasks\. Left Column \(Physical tasks\): \(a\) RF Fingerprinting shows well structured emitter representations\. \(b\) Bilingual speaker recognition demonstrates clear speaker\-specific clusters consistent across languages\. \(c\) Instrument family classification shows distinct family groupings based on physical sound production mechanisms\. \(d\) The academic fields present within the ArXiv corpus show a clear structural distinction between fields\. Right Column \(Semantic tasks\): \(e\) Protocol recognition shows the representations respect protocol uniqueness but retain emitter level structure\. \(f\) Language recognition shows more overlap but maintains structure, with some clustering by linguistic families\. \(g\) We see structural overlap across instruments that exhibit similar human level excitation patterns\. \(h\) ArXiv sub\-discipline classification shows the most diffuse structure, yet remains non\-random, demonstrating that even raw text contains structural cues\. This systematic progression from tight physical clusters to structured semantic overlap helps support our hypothesis that signal\-theoretic learning captures physical structure but semantic content requires further capacity\. See Section [E\.2](https://arxiv.org/html/2606.04106#A5.SS2) for quantitative results\.

### E\.4 Representation Structure Visualization

We visualize frozen encoder representations using t\-SNE and confusion matrices to qualitatively assess clustering quality and error patterns across the physical\-semantic spectrum\.

#### E\.4\.1 Clustering Quality: Physical vs\. Semantic Tasks

Figure [9](https://arxiv.org/html/2606.04106#A5.F9) shows t\-SNE projections across eight tasks\. Physical tasks \(left column\) exhibit tight, well\-separated clusters: RF fingerprinting shows distinct emitter representations, speaker recognition demonstrates clear speaker\-specific clusters consistent across languages, instrument families separate by sound production mechanisms, and ArXiv fields show structural distinction\. Semantic tasks \(right column\) show more diffuse but structured representations: protocol recognition retains emitter\-level structure, language recognition shows overlap with linguistic family clustering, individual instruments overlap based on similar excitation patterns, and ArXiv sub\-disciplines remain non\-random despite being the most diffuse\. This progression from tight physical clusters to structured semantic overlap supports our hypothesis that signal\-theoretic learning captures physical structure but semantic content needs further abstraction\.

![Refer to caption](https://arxiv.org/html/2606.04106v1/Figures/1dConfusionPatternsVisualized_compressed.jpg)Figure 10:Interpretable confusion patterns across semantic spectrum\.Confusion matrices for three representative 1D tasks\.Top:Music genre classification shows confusion between physically similar genres \(rock/metal, classical/jazz\) that share instrumentation but differ in cultural context\.Middle:Individual instrument classification shows within\-family confusion \(brass instruments cluster, strings cluster\), demonstrating learned physical sound production mechanisms\.Bottom:ArXiv sub\-discipline classification shows structured confusion within fields \(math sub\-disciplines confuse with each other, CS sub\-disciplines confuse with each other\) despite being the most semantic task\. Errors reflect genuine structural similarity rather than random misclassification, validating that learned representations capture meaningful physical features\.
#### E\.4\.2Error Patterns: Interpretable Confusion

Figure [10](https://arxiv.org/html/2606.04106#A5.F10) shows confusion matrices for three representative 1D tasks spanning the semantic spectrum\. Music genre classification \(top\) shows interpretable confusion between related genres: rock/metal share instrumentation and tempo, classical/jazz share harmonic complexity\. Errors reflect shared physical structure despite semantic differences\. Individual instrument classification \(middle\) shows within\-family confusion: brass instruments \(French Horn, Trombone, Trumpet\) cluster together, as do strings \(Cello, Double Bass\)\. This demonstrates the model learned physical sound production mechanisms, with confusion arising from similar timbral characteristics\. ArXiv sub\-discipline classification \(bottom\) shows the most diffuse confusion pattern, yet errors remain structured: math sub\-disciplines \(Differential Geometry, Algebraic Geometry\) confuse with each other, as do CS sub\-disciplines \(Machine Learning, Computational Complexity\)\. Even in this extreme semantic test, confusion reflects genuine structural similarity in notation and equation density rather than random misclassification\.

#### E\.4\.3Interpretation

These visualizations provide qualitative evidence that: \(1\) physical tasks yield tight clusters with discriminative features, \(2\) semantic tasks show structured overlap reflecting shared physical foundations, \(3\) errors are interpretable: confusion arises from genuine similarity rather than arbitrary misclassification, and \(4\) representations remain meaningful across all tasks, even the most semantic\. Together with quantitative results \(Table [1](https://arxiv.org/html/2606.04106#S4.T1)\), this demonstrates that the physical\-semantic performance gap reflects fundamental differences in task nature rather than representation quality\.

### E\.5Qualitative Reconstruction Analysis

Following the quantitative evaluation via linear and non\-linear probing, we assess representation quality through a complementary qualitative lens: reconstruction fidelity\. Our decoder combines bottleneck latent guidance with two skip connections from compressed encoder features \(4x and 16x downsampled, Section [B\.2](https://arxiv.org/html/2606.04106#A2.SS2)\): the bottleneck latent provides global structural information learned through discriminative and equivariant objectives, while skip connections preserve high\-resolution details from already\-compressed representations\. Critically, we omit top\-level skip connections \(input and first conv layer\) to prevent trivial information leakage, forcing the bottleneck to learn meaningful compressed representations\.

To validate that learned representations \(not architectural biases\) drive reconstruction quality, we evaluate the same architecture with random weight initialization \(Figure [8](https://arxiv.org/html/2606.04106#A2.F8)\)\. Random weights produce near\-complete reconstruction failure \(uniform gray fields with no structure\), demonstrating that skip connections alone cannot reconstruct without learned bottleneck representations\. This design enables zero\-shot reconstruction while ensuring the encoder learns transferable representations validated by linear probing \(Table [1](https://arxiv.org/html/2606.04106#S4.T1)\), where only the bottleneck latent is used; skip connections are not involved in transfer learning evaluation\.

##### E\.5\.0\.1Reconstruction as Representation Quality Metric

Recent work has demonstrated that reconstruction quality from SSL bottleneck latents serves as a reliable proxy for learned feature representation quality \[[59](https://arxiv.org/html/2606.04106#bib.bib59)\]\. The intuition is straightforward: if learned features are meaningful, they should enable reconstruction that preserves the input’s structural properties\. Random or poorly learned features would produce incoherent reconstructions\.

##### E\.5\.0\.2Zero\-Shot Image Reconstruction

Figure [11](https://arxiv.org/html/2606.04106#A5.F11) shows reconstructions of natural images from three categories: wildlife \(cheetah\), natural scenes \(forest\), and landscapes\. Critically, these are zero\-shot reconstructions: the model was trained exclusively on RF data and has never encountered natural images during training\. Moreover, our 1D unwrapping strategy \(snake pattern\) discards 2D spatial correlations, placing the model at a significant disadvantage for image reconstruction\.
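To make the unwrapping step concrete, the following is a minimal sketch of a snake scan. The appendix does not spell out the exact scan order, so the sketch assumes the common boustrophedon convention (even rows left to right, odd rows right to left); the function names `snake_unwrap` and `snake_wrap` are hypothetical.

```python
from typing import List, Sequence


def snake_unwrap(image: Sequence[Sequence[float]]) -> List[float]:
    """Flatten a 2D grid row by row, reversing every other row (assumed scan order)."""
    flat: List[float] = []
    for row_index, row in enumerate(image):
        # Even rows run left to right, odd rows right to left, so consecutive
        # samples of the 1D signal are always spatially adjacent pixels.
        ordered = row if row_index % 2 == 0 else reversed(row)
        flat.extend(ordered)
    return flat


def snake_wrap(signal: Sequence[float], width: int) -> List[List[float]]:
    """Inverse of snake_unwrap for a signal whose length is a multiple of width."""
    if width <= 0 or len(signal) % width != 0:
        raise ValueError("signal length must be a positive multiple of width")
    rows = [list(signal[i:i + width]) for i in range(0, len(signal), width)]
    return [row if i % 2 == 0 else row[::-1] for i, row in enumerate(rows)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
signal = snake_unwrap(image)
print(signal)  # [1, 2, 3, 6, 5, 4, 7, 8, 9]
```

In this toy grid, pixels 1 and 4 are vertical neighbours but end up five positions apart in the 1D signal, which illustrates how vertical correlations are stretched out by the scan.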

![Refer to caption](https://arxiv.org/html/2606.04106v1/Figures/ZeroShotReconstructionRepresentationQualityVisual_compressed.jpg) Figure 11: Zero\-shot image reconstructions from encoder\-decoder system trained exclusively on RF data\. Top row: Original images\. Middle row: Reconstructions combining bottleneck latent guidance \(global structure from transformer\-processed tokens\) with two skip connections from compressed encoder features \(4x and 16x downsampled, providing local details\)\. Bottom row: Absolute difference\. Despite 1D unwrapping and no exposure to natural images during training, reconstructions preserve structural coherence and spatial relationships\. Compare to Figure [8](https://arxiv.org/html/2606.04106#A2.F8) showing complete failure with random weights, validating that learned representations \(not architectural biases\) drive reconstruction quality\. Linear probing \(Table [1](https://arxiv.org/html/2606.04106#S4.T1)\) validates that the bottleneck latent alone \(without skip connections\) captures sufficient structure for classification tasks\.
##### E\.5\.0\.3Observations

Despite the architectural disadvantage and complete domain mismatch, reconstructions exhibit several notable properties:

Structural Preservation: Reconstructions maintain overall composition, layout, and spatial relationships\. In the cheetah image, the animal’s body shape, posture, and separation from the background are preserved\. In the forest scene, vertical tree structures, depth layers, and color gradients are maintained\. The landscape scene preserves sky\-ground boundaries, object placements, and color zones\.

Edge and Boundary Detection: Foreground\-background separation and object boundaries are clearly delineated, indicating the encoder learned edge detection as a fundamental signal processing operation transferable across domains\.

Semantic Coherence: Reconstructions are not random or noisy; they are clearly structured representations of the input images\. This demonstrates that the encoder learned meaningful physical features \(edges, textures, other frequency\-oriented details\) that transfer zero\-shot to visual data\.

Expected Limitations: Full color details are lost \(expected with 1D processing of color channels independently of one another\)\. Spatial correlations have minor artifacts due to snake unwrapping, but the preservation of global structure demonstrates that the encoder captures sufficient information for coherent reconstruction\.

##### E\.5\.0\.4Interpretation

These reconstructions provide qualitative evidence that our encoder learns transferable physical structure rather than domain\-specific features\. The ability to reconstruct recognizable images from RF\-trained representations demonstrates that:

\(1\) Signal\-theoretic principles generalize: Features learned from RF signals \(frequency content, temporal dynamics, edge detection\) apply to visual signals \(color spectra, spatial frequencies, boundaries\)\.

\(2\) Representations are information\-rich: The bottleneck latent retains sufficient information about physical structure to enable coherent reconstruction, not just classification\.

\(3\) Architectural mismatch is surmountable: Even with 1D unwrapping destroying 2D correlations, the learned features capture enough structural information for meaningful reconstruction\. Native 2D processing would likely improve fidelity substantially\.

This qualitative analysis complements our quantitative results \(Table [1](https://arxiv.org/html/2606.04106#S4.T1)\), demonstrating that strong classification accuracy is not merely due to linear separability of arbitrary features, but reflects genuine learning of physical signal structure that transfers across modalities\. These reconstructions also highlight a key advantage of principle\-driven approaches: the learned features are interpretable as physical signal properties \(edges, frequencies, textures\) rather than opaque statistical correlations\.

### E\.6Comparison to CLIP Foundation Model

We compare PlanFormer to CLIP ViT\-B/32 \[[1](https://arxiv.org/html/2606.04106#bib.bib1)\], a widely used vision\-language foundation model trained on 400 million image\-text pairs from the internet\. CLIP consists of dual encoders \(image and text\) trained via contrastive learning to align visual and linguistic representations\.

#### E\.6\.1Model Configuration and Evaluation Protocol

Model: We use the pretrained CLIP ViT\-B/32 image encoder \(151\.28M parameters, 14,780 MFLOPs\) from the official OpenAI release\. The encoder processes 224×224 RGB images and produces 512\-dimensional embeddings\.

Evaluation Protocol: Identical to PlanFormer \(Section [4](https://arxiv.org/html/2606.04106#S4)\):

- Freeze the CLIP image encoder \(no fine\-tuning\)
- Extract features from the final layer before the classification head
- Train a linear SVM classifier via 5\-fold cross\-validation
- Report mean top\-1 accuracy ± standard deviation
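The cross-validation bookkeeping in this protocol can be sketched with the standard library alone. This is an illustration under stated assumptions, not the authors' code: the fold assignment, the use of the sample standard deviation, and all function names are hypothetical, and a nearest-centroid classifier stands in for the linear SVM so that the example has no external dependencies.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) pairs for k shuffled folds."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, folds[i]))
    return splits


def nearest_centroid_score(x_train, y_train, x_test, y_test) -> float:
    """Stand-in probe (NOT a linear SVM): classify by the closest class mean."""
    sums, counts = {}, {}
    for x, y in zip(x_train, y_train):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x: Vector):
        return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

    return sum(predict(x) == y for x, y in zip(x_test, y_test)) / len(x_test)


def cross_validate(features: List[Vector], labels: List[int],
                   fit_and_score: Callable[..., float], k: int = 5) -> Tuple[float, float]:
    """Mean and sample standard deviation of top-1 accuracy over k folds."""
    scores = []
    for train, test in k_fold_indices(len(features), k):
        scores.append(fit_and_score([features[i] for i in train], [labels[i] for i in train],
                                    [features[i] for i in test], [labels[i] for i in test]))
    return statistics.mean(scores), statistics.stdev(scores)


# Toy "frozen features": two well-separated clusters standing in for encoder outputs.
features = [[float(i % 3), float(i % 2)] for i in range(10)] + \
           [[10.0 + i % 3, 10.0 + i % 2] for i in range(10)]
labels = [0] * 10 + [1] * 10
mean_acc, std_acc = cross_validate(features, labels, nearest_centroid_score)
print(f"{mean_acc:.3f} ± {std_acc:.3f}")
```

In practice `fit_and_score` would wrap a linear SVM trained on the frozen embeddings; only the splitting and the mean ± standard deviation reporting are meant to carry over.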

Computational Complexity and Model Size: To compute the number of parameters of PlanFormer and the number of FLOPs required for a forward pass, we used Meta’s fvcore \([https://github\.com/facebookresearch/fvcore](https://github.com/facebookresearch/fvcore)\)\. For CLIP, we referred to the CLIP repository \[[46](https://arxiv.org/html/2606.04106#bib.bib46)\]\.

#### E\.6\.2Domain\-Specific Preprocessing

To enable CLIP evaluation on 1\-D signals, we convert time\-series data to 2\-D spectrograms\. All spectrograms are normalized to \[0, 1\], converted to RGB via the Viridis colormap, and resized to 224×224 using bicubic interpolation before applying CLIP’s standard preprocessing\.

##### E\.6\.2\.1Audio Signals \(Speaker, Language, Instrument, Genre\)

Audio signals are processed in 5120\-sample windows with the following pipeline:

1. Mel Spectrogram: Compute mel\-scaled power spectrogram using 2048\-point FFT, 512\-sample hop length, and 128 mel bins spanning 0\-8 kHz
2. Logarithmic Scaling: Convert to decibels via amplitude\-to\-dB transformation to compress dynamic range
3. Normalization: Min\-max normalize to \[0, 1\] per spectrogram
4. Colorization: Apply Viridis colormap to convert grayscale to RGB
5. Resizing: Resize to 224×224 via bicubic interpolation
6. Standard CLIP Preprocessing Function

For audio files longer than 5120 samples, we extract non\-overlapping windows, encode each spectrogram independently, and compute the mean\-pooled embedding across all windows to produce the final sample representation\.
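The windowing and pooling step can be sketched as follows. The text does not say how a trailing partial window is handled, so dropping it is an assumption here; `encode_window` is a hypothetical placeholder for the spectrogram-plus-CLIP encoding of a single window, and the other function names are illustrative.

```python
from typing import Callable, List, Sequence

WINDOW = 5120  # window length in samples, as stated in the text


def split_windows(signal: Sequence[float], window: int = WINDOW) -> List[Sequence[float]]:
    """Non-overlapping full windows; a trailing partial window is dropped (assumption)."""
    n_full = len(signal) // window
    return [signal[i * window:(i + 1) * window] for i in range(n_full)]


def mean_pool(embeddings: List[Sequence[float]]) -> List[float]:
    """Element-wise mean over a list of equal-length embedding vectors."""
    if not embeddings:
        raise ValueError("need at least one full window to pool")
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]


def embed_file(signal: Sequence[float],
               encode_window: Callable[[Sequence[float]], Sequence[float]]) -> List[float]:
    """Encode each window independently, then mean-pool into one vector per file."""
    return mean_pool([encode_window(w) for w in split_windows(signal)])


def toy_encoder(window: Sequence[float]) -> List[float]:
    """Toy stand-in encoder: a 2-dimensional 'embedding' of (mean, max)."""
    return [sum(window) / len(window), max(window)]


audio = [0.0] * WINDOW + [2.0] * WINDOW + [9.0] * 100  # two full windows plus a tail
print(embed_file(audio, toy_encoder))  # [1.0, 1.0]
```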

##### E\.6\.2\.2RF and Seismology Signals

Seismology signals \(resampled to exactly 5120 samples, consistent with the PlanFormer evaluation\) and RF signals \(Modulation Recognition, POWDER Fingerprinting\) are processed as follows:

1. Power Spectrogram: Compute power spectrogram using 1024\-point FFT with 22\-sample hop length, producing an approximately 513×233 frequency\-time representation
2. Logarithmic Scaling: Apply $\log_{10}(\text{spec} + 10^{-6})$ to compress dynamic range
3. Normalization: Min\-max normalize to \[0, 1\]
4. Colorization: Apply Viridis colormap to convert to RGB
5. Resizing: Resize to 224×224 via bicubic interpolation
6. Standard CLIP Preprocessing Function

No mean\-pooling is required because each signal is exactly 5120 samples long\.
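The stated spectrogram size and the scaling steps can be checked with a short sketch. The text does not say whether frames are centre-padded; the frame count of about 233 is what centre-padded framing gives (`1 + 5120 // 22`), so that convention is assumed here, along with a one-sided spectrum. The function names are hypothetical.

```python
import math
from typing import List, Tuple

N_SAMPLES = 5120  # signal length stated in the text
N_FFT = 1024      # FFT size stated in the text
HOP = 22          # hop length stated in the text


def spectrogram_shape(n_samples: int, n_fft: int, hop: int, center: bool = True) -> Tuple[int, int]:
    """(frequency bins, time frames) of a one-sided STFT."""
    freq_bins = n_fft // 2 + 1
    if center:
        frames = 1 + n_samples // hop            # centre-padded framing (assumption)
    else:
        frames = 1 + (n_samples - n_fft) // hop  # no padding
    return freq_bins, frames


def log_scale(power: List[List[float]], eps: float = 1e-6) -> List[List[float]]:
    """Apply log10(spec + eps) element-wise to compress dynamic range."""
    return [[math.log10(v + eps) for v in row] for row in power]


def min_max_normalize(spec: List[List[float]]) -> List[List[float]]:
    """Rescale all values to [0, 1]; a constant input maps to all zeros."""
    lo = min(min(row) for row in spec)
    hi = max(max(row) for row in spec)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in spec]
    return [[(v - lo) / span for v in row] for row in spec]


print(spectrogram_shape(N_SAMPLES, N_FFT, HOP))                # (513, 233)
print(spectrogram_shape(N_SAMPLES, N_FFT, HOP, center=False))  # (513, 187)

toy_power = [[0.0, 1.0], [10.0, 100.0]]
normalized = min_max_normalize(log_scale(toy_power))
```

Colorization and bicubic resizing are omitted because they need an image library.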

##### E\.6\.2\.3Images \(MNIST, FashionMNIST, PathMNIST, CIFAKE\)

Images are processed using CLIP’s standard preprocessing pipeline:

1. Resizing: Resize to 224×224 via bicubic interpolation
2. Channel Conversion: Convert grayscale images to RGB by replicating the single channel three times
3. Normalization: Apply CLIP’s standard normalization \(mean=\[0\.48145466, 0\.45782750, 0\.40821073\], std=\[0\.26862954, 0\.26130258, 0\.27577711\]\)
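Steps 2 and 3 amount to a channel copy followed by a per-channel affine map. The sketch below uses the mean and standard deviation quoted above; it assumes pixel values already scaled to [0, 1], skips the bicubic resize (which needs an image library), and uses hypothetical function names.

```python
from typing import List

CLIP_MEAN = (0.48145466, 0.45782750, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

Gray = List[List[float]]       # H x W, values in [0, 1]
Rgb = List[List[List[float]]]  # 3 x H x W


def gray_to_rgb(gray: Gray) -> Rgb:
    """Replicate the single grayscale channel three times."""
    return [[list(row) for row in gray] for _ in range(3)]


def clip_normalize(rgb: Rgb) -> Rgb:
    """Subtract the per-channel mean and divide by the per-channel std."""
    return [[[(v - mean) / std for v in row] for row in channel]
            for channel, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)]


gray = [[0.0, 1.0]]  # a 1 x 2 toy image
normalized_rgb = clip_normalize(gray_to_rgb(gray))
print(len(normalized_rgb), len(normalized_rgb[0]), len(normalized_rgb[0][0]))  # 3 1 2
```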

Notably, we do not apply PlanFormer’s domain\-specific preprocessing \(Hilbert transform, unwrapping\), allowing CLIP to process images in its native format\.

##### E\.6\.2\.4Video \(Mitosis Classification\)

Videos are processed frame\-by\-frame:

1. Frame Extraction: Extract all frames from video
2. Frame Encoding: Process each frame as a standard image \(resize to 224×224, convert to RGB, apply CLIP normalization\)
3. Temporal Aggregation: Encode each frame independently with frozen CLIP encoder and compute mean\-pooled embedding across all frames
4. Classification: Train linear SVM on mean\-pooled video embeddings

##### E\.6\.2\.5Text \(ArXiv Classification\)

We do not evaluate CLIP on text classification tasks\. CLIP’s text encoder uses a separate Transformer architecture optimized for linguistic processing, making direct comparison to PlanFormer’s frozen image encoder inappropriate\. Our inclusion of text tasks for PlanFormer serves as a stress test of signal\-theoretic principles on highly semantic data, not as a claim of superiority over language\-specialized models\.

#### E\.6\.3Results Analysis

Table [1](https://arxiv.org/html/2606.04106#S4.T1) shows CLIP achieves higher overall accuracy \(83\.8% vs 77\.7%\), but performance gaps vary systematically by task type, validating our physical\-semantic boundary hypothesis\.

##### E\.6\.3\.1Physical Tasks: Competitive Performance with Massive Efficiency Gains

On physical/structural tasks \(Modulation Recognition, POWDER, Speaker ID, Instrument Family, Seismology, PathMNIST, CIFAKE, Mitosis\), PlanFormer achieves 84\.5% average accuracy compared to CLIP’s 87\.7%, a gap of only 3\.2 percentage points\. This competitive performance is achieved with:

- 76× fewer parameters \(1\.99M vs 151M\)
- 158× lower computational cost \(93\.6 vs 14,780 MFLOPs\)
- Single\-modality pretraining \(RF signals only vs 400M image\-text pairs\)
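The ratios in this list follow directly from figures quoted in this appendix (1.99M vs 151.28M parameters, 93.6 vs 14,780 MFLOPs, 84.5% vs 87.7% physical-task accuracy). A short check, with dictionary keys chosen only for illustration:

```python
# Figures quoted in this appendix; the key names are illustrative.
PLANFORMER = {"params_m": 1.99, "mflops": 93.6, "physical_acc": 84.5}
CLIP = {"params_m": 151.28, "mflops": 14780.0, "physical_acc": 87.7}

param_ratio = CLIP["params_m"] / PLANFORMER["params_m"]
flop_ratio = CLIP["mflops"] / PLANFORMER["mflops"]
acc_gap = CLIP["physical_acc"] - PLANFORMER["physical_acc"]

print(f"{param_ratio:.1f}x fewer parameters")     # 76.0x fewer parameters
print(f"{flop_ratio:.1f}x lower FLOPs")           # 157.9x lower FLOPs
print(f"{acc_gap:.1f} percentage-point gap")      # 3.2 percentage-point gap
```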

Notably, PlanFormer outperforms CLIP on RF fingerprinting \(87\.6% vs 55\.0%\), demonstrating that physics\-driven pretraining provides substantial advantages when the task domain aligns with the pretraining modality\. CLIP’s lower performance on RF tasks highlights the limitations of vision\-language pretraining for signal data that lacks natural visual or linguistic structure\.

##### E\.6\.3\.2Semantic Tasks: Language Grounding Advantage

On semantic tasks \(Language Recognition, Music Genre, Individual Instrument\) and mixed physical\-semantic tasks \(MNIST, FashionMNIST\), CLIP achieves 91\.2% compared to PlanFormer’s 70\.0%, a gap of 21\.2 percentage points\. This substantial difference supports our hypothesis that semantic understanding requires language grounding beyond physical principles\.

CLIP’s training on 400 million image\-text pairs provides explicit supervision for semantic concepts \(e\.g\., "jazz music," "English language," "digit 7"\), while PlanFormer learns only the physical structure of signals\. The graceful degradation on semantic tasks demonstrates that PlanFormer captures fundamental signal properties but lacks the linguistic grounding necessary for human\-defined semantic categories\.

##### E\.6\.3\.3Efficiency\-Performance Trade\-off

These results position PlanFormer and CLIP on the accuracy\-efficiency Pareto frontier\. PlanFormer achieves about 96% of CLIP’s accuracy on physical tasks \(84\.5% vs 87\.7%\) while requiring only about 1\.3% of the parameters \(1\.99M vs 151\.28M\) and 0\.63% of the computational cost \(93\.6 vs 14,780 MFLOPs\)\. This demonstrates that physics\-driven design enables a favorable trade\-off for applications where:

- Physical/structural pattern recognition is the primary objective
- Computational resources are constrained \(edge devices, embedded systems\)
- Training data is limited or expensive to acquire
- Deployment requires low latency and energy efficiency

##### E\.6\.3\.4Implications for Foundation Model Design

The complementary strengths of PlanFormer and CLIP suggest a hierarchical architecture for future foundation models:

1. Physical Layer \(PlanFormer\-like\): Lightweight, physics\-driven encoder captures causal structure and domain\-invariant signal properties with minimal resources
2. Semantic Layer \(CLIP\-like\): Language\-grounded model maps physical representations to human\-defined semantic concepts

This division of labor enables resource\-efficient deployment where the physical layer runs on\-device \(edge inference\) and the semantic layer is queried only when high\-level reasoning is required\. Our results demonstrate that physics\-driven pretraining is not a replacement for scale\-driven approaches, but rather a complementary paradigm that democratizes foundation model development for resource\-constrained applications\.

### E\.7Comparison to DinoV3

We additionally compare to DinoV3 ViT\-S \[[32](https://arxiv.org/html/2606.04106#bib.bib32)\], a self\-supervised vision foundation model trained on ImageNet \(21M parameters, 12,000 MFLOPs\)\. Preprocessing is identical to CLIP \(Section [E\.6](https://arxiv.org/html/2606.04106#A5.SS6)\) except using DinoV3’s standard ImageNet normalization\.

Results: Table [1](https://arxiv.org/html/2606.04106#S4.T1) shows DinoV3 achieves 83\.7% overall accuracy, nearly identical to CLIP \(83\.8%\)\. Performance stratifies by task type:

- Physical Tasks: PlanFormer is competitive with DinoV3 \(84\.5% vs 83\.6%, \+0\.9%\) with 11× fewer parameters and 128× lower FLOPs
- Semantic Tasks: DinoV3 outperforms PlanFormer \(84\.0% vs 70\.0%, \+14\.0%\), though less than CLIP’s advantage \(\+21\.2%\)
- RF Fingerprinting: PlanFormer substantially outperforms DinoV3 \(87\.6% vs 66\.3%, \+21\.3%\)

Interpretation: DinoV3’s self\-supervised visual pretraining provides advantages on tasks with natural image statistics, while PlanFormer’s physics\-driven pretraining achieves competitive efficiency on physical tasks and excels on signal processing tasks \(RF, seismology\)\. The similar overall performance \(77\.7% vs 83\.7%\) despite 11× fewer parameters demonstrates the efficiency of principle\-driven design\.

## Appendix F: Limitations

We acknowledge several limitations of our approach and provide context for interpreting our results:

### F\.1Semantic Understanding Boundary

Our most significant limitation is inherent to our approach: the model learns physical signal structure but not semantic content\. This manifests as systematically lower performance on semantic tasks \(70\.0% average top\-1 accuracy\) compared to physically\-grounded tasks \(84\.5% average\)\. Comparison to CLIP ViT\-B/32 \(151M parameters, trained on 400M image\-text pairs\) validates this boundary: PlanFormer achieves competitive performance on physical tasks \(84\.5% vs 87\.7%, within 3\.2%\) despite 76× fewer parameters, while the gap on semantic tasks \(70\.0% vs 91\.2%\) reflects CLIP’s language grounding advantage\. This is not a failure but rather the natural boundary of signal\-theoretic learning—semantic meaning requires human\-annotated supervision or large\-scale multi\-modal training that our approach explicitly avoids\.

Mitigation:This limitation suggests a hierarchical architecture for AI systems: physical foundation models \(like ours\) provide the base layer capturing causal structure and domain\-invariant properties, upon which semantic reasoning modules can be built\. Our work establishes what can be learned from signal processing principles alone, clarifying the division of labor between physical and semantic learning\.

### F\.2Suboptimal Spatial Processing

Our 1D unwrapping strategy for 2D images and 3D video discards spatial correlations, resulting in suboptimal performance on tasks requiring fine\-grained spatial reasoning\. For example, we achieve 79\.2% on MNIST and 77\.0% on FashionMNIST: reasonable, but below both CLIP \(98\.6% and 90\.4%, respectively\), which processes images in their native 2D format, and state\-of\-the\-art methods that leverage 2D convolutional structure\. This performance gap demonstrates the cost of architectural mismatch, yet our results establish that signal\-theoretic principles enable meaningful transfer even when modality structure is suboptimal\.

Mitigation:Native 2D/3D extensions of our architecture \(e\.g\., 2D Parseval Focus, 2D frequency\-preserving pooling\) would likely improve performance on spatial tasks while maintaining our physics\-informed design principles\. Our current results establish a lower bound—transfer is possible even with architectural mismatch\.

### F\.3Skip Connection Contribution to Reconstruction

Our reconstruction quality analysis \(Section E\.5, Figure [11](https://arxiv.org/html/2606.04106#A5.F11)\) demonstrates zero\-shot generalization to unseen modalities\. The decoder architecture combines bottleneck latent guidance with two skip connections from compressed encoder features \(4x and 16x downsampled, Section [B\.2](https://arxiv.org/html/2606.04106#A2.SS2)\), making it difficult to isolate the exact contribution of each component to reconstruction fidelity\.

What We Can Conclude:

- The bottleneck latent provides global structural guidance, supported by strong linear probing performance \(77\.7% average accuracy, Table [1](https://arxiv.org/html/2606.04106#S4.T1)\) using only the bottleneck latent without skip connections
- Skip connections from compressed features \(4x, 16x\) contribute high\-resolution details necessary for pointwise\-level reconstruction fidelity
- Learned weights are essential: random initialization produces complete reconstruction failure \(Figure [8](https://arxiv.org/html/2606.04106#A2.F8)\), proving architectural biases and skip connections alone are insufficient
- The encoder\-decoder system captures transferable physical structure, while linear probing validates that the bottleneck latent alone is sufficient for classification tasks

What We Cannot Conclude: We cannot definitively quantify how much reconstruction quality derives from the bottleneck latent versus skip connections without a full ablation study removing all skip connections\. However, such an ablation would require architectural redesign to handle longer transformer sequences \(defeating the compression goal driven by attention cost\) or acceptance of significantly degraded reconstructions\.

Design Justification: Our skip connection design deliberately omits top\-level connections \(input, first conv layer\) and the post\-transformer skip \(redundant with token\-based upsampling\), using only two intermediate skips from compressed features \(4x, 16x\)\. This prevents trivial information leakage while enabling high\-fidelity reconstruction\. The 64x compression is driven by attention computational cost \($\mathcal{O}(N^{2})$\), not bottleneck capacity constraints\. The random weight baseline validates that this design successfully learns compressed, transferable representations in the bottleneck rather than relying on architectural shortcuts\.
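As a rough worked example of why compression matters for attention cost, assume the 64x compression applies to the 5120-sample input length quoted elsewhere in this appendix and that attention cost scales with the square of the token count. The token count of 80 is derived under that assumption and is not stated in the text.

```latex
N_{\text{in}} = 5120, \qquad
N_{\text{tok}} = \frac{N_{\text{in}}}{64} = 80, \qquad
\frac{N_{\text{in}}^{2}}{N_{\text{tok}}^{2}} = \left(\frac{5120}{80}\right)^{2} = 64^{2} = 4096 .
```

Under these assumptions, attention over the compressed sequence is about 4096 times cheaper than attention over the raw samples.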

### F\.4Training Data Diversity Requirements

Our approach requires training on a signal\-rich domain with diverse transformations\. RF data provides exceptional diversity \(frequency content from kHz to GHz, temporal dynamics, channel effects\), but training on a less diverse domain might not enable comparable transfer\. We have not tested whether training on, e\.g\., speech alone would enable transfer to RF or images\.

Mitigation:The key requirement is transformation diversity rather than domain diversity\. Any domain exhibiting rich frequency content, temporal dynamics, and varied transformations \(Doppler shifts, multipath, etc\.\) should enable similar transfer\. Our choice of RF is strategic but not unique\.

### F\.5Computational Requirements

Training the reported model required 12× H100 GPUs for 8\.6 days \(206\.4 hours, 2,477 H100\-hours total\) to reach 400 milestones\. This resource requirement is driven by: \(1\) long sequence lengths \(5120 samples\), \(2\) multiple encoder/decoder passes per batch \(clean, augmented, latent\-transformed branches\), \(3\) N\-choose\-2 pairwise losses across distributed GPUs, \(4\) dual\-domain reconstruction, and \(5\) online validation probe for model selection\. While this is modest compared to large language models \(GPT\-3: thousands of GPU\-years\), it exceeds typical academic budgets\. We note that our computational requirements reflect the complexity of our multi\-objective training framework rather than fundamental requirements of principle\-driven approaches\.

Scalability:Importantly, the 12\-GPU configuration was chosen to expedite training within our project timeline, not as a fundamental requirement\. We have produced comparable results using as few as 4 H100 GPUs with identical framework, data, and hyperparameters—training simply takes proportionally longer\. The 12\-GPU specification ensures exact reproducibility of our reported weights, but researchers with smaller budgets can achieve similar performance with fewer GPUs and extended training time\.
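The GPU-hour figure and the "proportionally longer" remark can be checked with simple arithmetic. The sketch below assumes perfectly linear scaling with GPU count, which is an idealization; the function names are hypothetical.

```python
GPUS_REPORTED = 12   # H100 GPUs, as stated in the text
DAYS_REPORTED = 8.6  # wall-clock days, as stated in the text


def gpu_hours(n_gpus: int, days: float) -> float:
    """Total GPU-hours for a run of the given length."""
    return n_gpus * days * 24.0


def days_needed(total_gpu_hours: float, n_gpus: int) -> float:
    """Wall-clock days for the same budget, assuming linear scaling."""
    return total_gpu_hours / (n_gpus * 24.0)


total = gpu_hours(GPUS_REPORTED, DAYS_REPORTED)
print(f"wall-clock hours: {DAYS_REPORTED * 24:.1f}")    # 206.4
print(f"total H100-hours: {total:.0f}")                 # 2477
print(f"days on 4 GPUs:   {days_needed(total, 4):.1f}")  # 25.8
```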

Mitigation:These costs reflect our research implementation prioritizing flexibility over efficiency\. Substantial speedups are achievable through gradient checkpointing, optimized attention kernels \(FlashAttention\[[55](https://arxiv.org/html/2606.04106#bib.bib55)\]\), and reduced monitoring frequency\. Moreover, deployment requires only the lightweight encoder \(1\.99M parameters\), enabling edge device inference\. The training cost is a one\-time investment yielding an encoder applicable across diverse domains, and the linear scaling with GPU count makes the approach accessible to institutions with varying computational budgets\.

### F\.6Limited Ablation Studies

Due to computational constraints, we do not provide comprehensive ablation studies isolating the contribution of each architectural component and loss term\. While we demonstrate that the complete system achieves strong cross\-modal transfer, we cannot definitively attribute performance to specific design choices \(e\.g\., frequency\-preserving pooling vs\. Parseval Focus vs\. head orthogonalization\)\.

Future Work:Systematic ablations would strengthen our claims about which components are essential versus auxiliary\. If accepted, we commit to conducting ablations during the camera\-ready period or as follow\-up work\.

### F\.7Scope of Evaluation

Our evaluation spans 15 diverse tasks across time series, images, text, and video, but this represents a small fraction of possible domains and tasks\. We have not tested: \(1\) highly structured data \(graphs, point clouds\), \(2\) multi\-modal fusion tasks, or \(3\) generative tasks beyond reconstruction\.

Interpretation:Our results demonstrate that signal\-theoretic principles enable cross\-modal transfer for the tested domains, but we do not claim universal applicability\. The boundary conditions of our approach remain an open question\.

### F\.8Complementarity to Scale Driven Approaches

Our approach is designed to complement rather than replace existing foundation models\. Scale\-driven multi\-modal models like CLIP excel at semantic understanding through diverse data exposure \(91\.2% on semantic tasks vs our 70\.0%\), while our principle\-driven approach excels at efficient physical understanding through encoded general laws \(84\.5% on physical tasks, within 3\.2% of CLIP’s 87\.7%, with 76× fewer parameters and 158× lower FLOPs\)\. Future systems may benefit from combining both paradigms: physical foundations providing interpretable base representations upon which semantic layers \(potentially from large\-scale models\) can be built\. This division of labor could yield systems that are both efficient and broadly capable\.

## Appendix G: Broader Impact

### G\.1Positive Societal Impacts

Data Efficiency and Accessibility:Our approach demonstrates that principled architectural design enables cross\-modal transfer with compact models \(1\.99M parameters\), achieving competitive performance on physical tasks \(within 3\.2% of CLIP’s 151M parameter model\) while requiring 76× fewer parameters and 158× lower computational cost\. This offers a resource\-efficient path alongside large\-scale approaches, with significant implications for democratizing foundation model research: smaller models require less computational resources, reducing barriers for academic institutions and researchers in resource\-constrained settings\.

Interpretability and Scientific Understanding:By grounding our approach in established signal processing principles \(Fourier decomposition, energy conservation, symmetry\), we provide interpretable explanations for cross\-modal transfer\. The systematic performance gap between physical tasks \(84\.5%\) and semantic tasks \(70\.0%\) reveals a natural hierarchy in AI systems: physical understanding \(learnable from signal structure\) versus semantic understanding \(requiring human supervision\)\. This clarity helps researchers and practitioners understand what foundation models can and cannot learn, informing appropriate deployment decisions\.

Environmental Impact:Smaller models with lower computational requirements reduce energy consumption and carbon emissions associated with training and deployment\. Our encoder runs efficiently on edge devices \(Jetson Orin\), enabling local inference without cloud infrastructure—reducing latency, improving privacy, and minimizing environmental footprint\.

### G\.2Negative Societal Impacts and Mitigation

RF Fingerprinting and Surveillance:Our training data includes RF fingerprinting datasets designed to identify unique hardware imperfections in wireless transmitters\. While this technology has legitimate applications \(device authentication, spectrum management\), it could enable surveillance by tracking individuals via their devices’ RF signatures\. Our model’s ability to learn fine\-grained discriminative features from RF signals could lower the barrier to deploying such systems\.

Mitigation:Importantly, our model learns physical signal structure, not semantic identity\. RF fingerprinting requires labeled training data associating devices with individuals—our model does not provide this association\. Moreover, RF fingerprinting is already well\-established in the literature; our contribution is demonstrating that learned representations transfer across domains, not enabling new surveillance capabilities\. We advocate for responsible use policies and regulatory frameworks governing RF fingerprinting deployment\.

Dual\-Use Potential:Like any general\-purpose signal processing tool, our encoder could be applied to sensitive domains \(e\.g\., analyzing encrypted communications’ metadata, processing medical signals without authorization\)\. The model’s ability to extract meaningful features from diverse signal types without domain\-specific training could facilitate misuse\.

Mitigation:Our model captures physical structure, not semantic content—it cannot "understand" encrypted communications or make medical diagnoses without appropriate supervised training\. The model is a feature extractor, not a complete system for sensitive applications\. We emphasize that deployment in high\-stakes domains \(healthcare, security, finance\) requires domain\-specific validation, regulatory approval, and ethical oversight\. We will release the model with usage guidelines emphasizing these requirements\.

Fairness and Bias Considerations:Our training data consists of RF signals, which do not contain demographic information or human\-identifiable characteristics\. However, when applied to domains like speaker recognition or facial recognition \(via image processing\), the model could inherit or amplify biases present in downstream training data\. For example, if a speaker recognition system trained on our representations uses biased labeled data, the system could exhibit demographic disparities\.

Mitigation: Our encoder learns physical features \(voice characteristics, image structure\) agnostic to demographic attributes\. Bias arises from downstream labeled data and deployment context, not from our pretrained representations\. We encourage practitioners to: \(1\) audit downstream training data for demographic balance, \(2\) evaluate deployed systems for fairness across subgroups, and \(3\) implement bias mitigation techniques appropriate to their application domain\.
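As one possible illustration of recommendation (2), the sketch below computes per-subgroup accuracy and the largest accuracy gap for a downstream classifier. It is a minimal example, not part of our released tooling: the function names `subgroup_accuracy` and `accuracy_gap`, the subgroup labels, and the toy predictions are all hypothetical, and accuracy gap is only one of many fairness measures a practitioner might choose.

```python
from collections import defaultdict


def subgroup_accuracy(y_true, y_pred, groups):
    """Return a dict mapping each subgroup label to its accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


def accuracy_gap(per_group):
    """Largest difference in accuracy between any two subgroups."""
    values = list(per_group.values())
    return max(values) - min(values)


# Toy, made-up labels purely for illustration (not results from the paper).
# In practice y_pred would come from a linear probe on the frozen encoder,
# and groups from an audited demographic annotation of the evaluation set.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

per_group = subgroup_accuracy(y_true, y_pred, groups)
gap = accuracy_gap(per_group)
print(per_group, gap)
```

A large gap would suggest revisiting the balance of the downstream labeled data before deployment; what counts as "large" depends on the application and is left to the practitioner.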

### G\.3 Responsible Release and Future Work

We commit to releasing our model with comprehensive documentation including:

- Clear usage guidelines emphasizing limitations \(physical vs\. semantic understanding\)
- Warnings against deployment in high\-stakes domains without validation
- Recommendations for fairness auditing in downstream applications
- Transparency about training data sources and potential biases

We view our work as establishing scientific understanding of cross\-modal transfer via signal\-theoretic principles\. The primary impact is advancing foundation model research toward more interpretable, efficient, and principled approaches\. We encourage the community to build upon our work responsibly, with careful consideration of application\-specific ethical implications\.
