PilotWiMAE: Pilot-Native Representation Learning for Wireless Channels

arXiv cs.AI 05/25/26, 04:00 AM Papers
wireless representation-learning self-supervised channel-estimation deep-learning mimo 6g
Summary
PilotWiMAE introduces a self-supervised framework that directly ingests noisy pilot observations for wireless channel representation learning, removing the unrealistic full-CSI assumption and enabling robust cross-frequency beam selection and channel estimation that beats supervised baselines.
arXiv:2605.22856v1 Announce Type: cross Abstract: Channel foundation models assume access to fully observed channels, an assumption that fails in deployment. We introduce PilotWiMAE, a self-supervised framework whose encoder ingests noisy pilot observations directly and whose attention factorizes along the axis separating temporal from joint space-frequency processing, an inductive bias inspired by the physics of the problem. Pilot input shrinks the observation space by up to two orders of magnitude and also removes the unrealistic assumption of full-CSI availability while incurring lower latency. The factorized design generates robust representations by exploiting the separable channel structure and allows a pretraining mask ratio of $99\%$. We pair patch-normalized reconstruction, which captures small-scale fading structure, with an auxiliary scale loss that recovers the large-scale fading features, and use an AWGN curriculum to match pilot noise at pretraining and deployment. Pretrained solely on $3.5$\,GHz and evaluated at $28$\,GHz across in-distribution and out-of-distribution settings, PilotWiMAE's cross-frequency beam selection and channel characterization beat supervised baselines despite operating on a smaller observation space. To weaken the coupling between decoder capacity and representation quality, we further propose a decoder-centric pretraining stage following the encoder-decoder joint pretraining, which allows PilotWiMAE to demonstrate competitive channel estimation without sacrificing representation quality. To foster further work in this direction, we release the PilotWiMAE pretrained weights and training pipeline, together with CSIGen, our Sionna-based ray-tracing channel-generation tool, and the channel datasets used in this work.
# PilotWiMAE: Pilot-Native Representation Learning for Wireless Channels
Source: [https://arxiv.org/html/2605.22856](https://arxiv.org/html/2605.22856)
Berkay Guler, Giovanni Geraci, and Hamid Jafarkhani

B. Guler and H. Jafarkhani are with the Center for Pervasive Communications and Computing, University of California, Irvine, CA, USA. They were supported in part by the NSF Award CNS-2229467. G. Geraci is with Nokia and Universitat Pompeu Fabra, Spain. He was supported in part by grants PID2021-123999OB-I00, PID2024-156488OB-I00, CEX2021-001195-M, and CNS2023-145384. Part of the results in this paper have been submitted to the International Conference on Machine Learning (ICML) AI4NextG Workshop, which is non-archival and will not have proceedings \[[19](https://arxiv.org/html/2605.22856#bib.bib40)\].

###### Abstract

Channel foundation models assume access to fully observed channels, an assumption that fails in deployment. We introduce PilotWiMAE, a self-supervised framework whose encoder ingests noisy pilot observations directly and whose attention factorizes along the axis separating temporal from joint space-frequency processing, an inductive bias inspired by the physics of the problem. Pilot input shrinks the observation space by up to two orders of magnitude and also removes the unrealistic assumption of full-CSI availability while incurring lower latency. The factorized design generates robust representations by exploiting the separable channel structure and allows a pretraining mask ratio of $99\%$. We pair patch-normalized reconstruction, which captures small-scale fading structure, with an auxiliary scale loss that recovers the large-scale fading features, and use an AWGN curriculum to match pilot noise at pretraining and deployment. Pretrained solely on $3.5$ GHz and evaluated at $28$ GHz across in-distribution and out-of-distribution settings, PilotWiMAE's cross-frequency beam selection and channel characterization beat supervised baselines despite operating on a smaller observation space. To weaken the coupling between decoder capacity and representation quality, we further propose a decoder-centric pretraining stage following the encoder-decoder joint pretraining, which allows PilotWiMAE to demonstrate competitive channel estimation without sacrificing representation quality. To foster further work in this direction, we release the PilotWiMAE pretrained weights and training pipeline, together with CSIGen, our Sionna-based ray-tracing channel-generation tool, and the channel datasets used in this work.

## I Introduction

Recent channel foundation models have made substantial progress in learning transferable representations of wireless channels by pretraining and evaluating fully observed channels generated by stochastic or ray\-tracing simulators\[[5](https://arxiv.org/html/2605.22856#bib.bib20),[25](https://arxiv.org/html/2605.22856#bib.bib19),[24](https://arxiv.org/html/2605.22856#bib.bib14),[27](https://arxiv.org/html/2605.22856#bib.bib17),[6](https://arxiv.org/html/2605.22856#bib.bib22),[38](https://arxiv.org/html/2605.22856#bib.bib21),[29](https://arxiv.org/html/2605.22856#bib.bib18),[18](https://arxiv.org/html/2605.22856#bib.bib15),[30](https://arxiv.org/html/2605.22856#bib.bib16),[37](https://arxiv.org/html/2605.22856#bib.bib25),[34](https://arxiv.org/html/2605.22856#bib.bib13),[32](https://arxiv.org/html/2605.22856#bib.bib10),[43](https://arxiv.org/html/2605.22856#bib.bib9),[28](https://arxiv.org/html/2605.22856#bib.bib11),[4](https://arxiv.org/html/2605.22856#bib.bib7),[20](https://arxiv.org/html/2605.22856#bib.bib12),[42](https://arxiv.org/html/2605.22856#bib.bib8)\]\. Some of these works add i\.i\.d\. additive white Gaussian noise \(AWGN\) to fully observed channels as a concession to realism\[[24](https://arxiv.org/html/2605.22856#bib.bib14),[18](https://arxiv.org/html/2605.22856#bib.bib15),[30](https://arxiv.org/html/2605.22856#bib.bib16),[27](https://arxiv.org/html/2605.22856#bib.bib17),[29](https://arxiv.org/html/2605.22856#bib.bib18),[5](https://arxiv.org/html/2605.22856#bib.bib20),[38](https://arxiv.org/html/2605.22856#bib.bib21),[6](https://arxiv.org/html/2605.22856#bib.bib22),[28](https://arxiv.org/html/2605.22856#bib.bib11),[20](https://arxiv.org/html/2605.22856#bib.bib12),[42](https://arxiv.org/html/2605.22856#bib.bib8),[4](https://arxiv.org/html/2605.22856#bib.bib7)\], while others omit noise evaluation altogether\. Neither of these two cases reflects how errors in channel state information \(CSI\) arise in practice\. 
In a real receiver, the channel is estimated from pilots, and only the error at pilot resource elements is i\.i\.d\. AWGN \(when there is no interference from pilot reuse\)\[[17](https://arxiv.org/html/2605.22856#bib.bib35)\]\. The error at non\-pilot resource elements, which make up the vast majority of the grid, depends on the interpolation method, the channel’s delay\-Doppler structure, the pilot density, the SNR at the pilots, and the pilot design, and does not admit a simple i\.i\.d\. model\[[12](https://arxiv.org/html/2605.22856#bib.bib34)\]\. Evaluation under fully observed or i\.i\.d\.\-perturbed channels therefore can only characterize how well a model learns the channel structure, but leaves open how it behaves in a system where such channels are never available\. Given the known sensitivity of the learned methods to noise and distribution shift\[[22](https://arxiv.org/html/2605.22856#bib.bib23),[35](https://arxiv.org/html/2605.22856#bib.bib24)\], this gap is worth closing\.

The second gap concerns the cost of deployment. Wireless foundation models have largely inherited transformer architectures \[[36](https://arxiv.org/html/2605.22856#bib.bib26),[15](https://arxiv.org/html/2605.22856#bib.bib27)\] and training recipes \[[14](https://arxiv.org/html/2605.22856#bib.bib28),[21](https://arxiv.org/html/2605.22856#bib.bib29)\] from vision and language, where parameter count and sequence length do not face a hard runtime ceiling, while performance improves predictably with additional data and computation \[[26](https://arxiv.org/html/2605.22856#bib.bib30),[23](https://arxiv.org/html/2605.22856#bib.bib32),[40](https://arxiv.org/html/2605.22856#bib.bib31)\]. In wireless systems, tasks such as precoding, scheduling, and decoding must be completed within slot-level timing budgets of the order of a millisecond or less \[[2](https://arxiv.org/html/2605.22856#bib.bib33)\]. However, computational footprint is rarely reported in the literature on channel foundation models. When reported, the results often rely on high-end GPUs and non-uniform optimization stacks (e.g., quantization or FlashAttention \[[13](https://arxiv.org/html/2605.22856#bib.bib2)\]), making even large models appear practical and masking true deployment cost.

![Refer to caption](https://arxiv.org/html/2605.22856v1/x1.png)

Figure 1: High-level PilotWiMAE pipeline. The model consumes sparse noisy pilot observations directly. Pilot representations support direct decision-making tasks without channel estimation or decoding, while an optional decoder reconstructs the full channel for estimation and/or prediction.

We address both gaps with two co-equal design principles. First, we pursue *robustness by design* by operating directly on sparse, noisy pilot observations. Our approach removes an explicit channel estimator from the critical path to prevent error propagation at realistic low SNR, matches deployment observables, eliminates the full-CSI assumption of prior channel foundation models, and cleanly integrates with existing pilot-based protocols. Channel recovery is posed as a downstream task on the same learned representations and is naturally aligned with our reconstruction-based pretraining objective.

Second, we enforce *wireless specificity by design* by factorizing attention along temporal and space-frequency domains, an inductive bias grounded in the wide-sense stationary uncorrelated scattering (WSSUS) model \[[9](https://arxiv.org/html/2605.22856#bib.bib38)\] and its MIMO extension \[[31](https://arxiv.org/html/2605.22856#bib.bib39)\], where temporal and spectro-spatial correlations arise from distinct physical mechanisms. The same principle motivates our pretraining objective, which pairs patch-normalized reconstruction for small-scale fading with an auxiliary scale loss that recovers large-scale fading statistics. Overall, pilot input shrinks the observation space, while the factorized design exploits separable channel structure to support an aggressive 99% pretraining mask ratio. Together, they yield sub-millisecond inference latency and representations that remain reliable in the noisy, partially observed regime where decisions are actually made.

Fig. [1](https://arxiv.org/html/2605.22856#S1.F1) provides a high-level summary of this deployment-oriented pipeline. We next situate PilotWiMAE with respect to recent work.

### I-A Related work

Prior work on wireless channel foundation models broadly adopts \(i\) joint\-embedding methods, including contrastive learning,\[[25](https://arxiv.org/html/2605.22856#bib.bib19),[18](https://arxiv.org/html/2605.22856#bib.bib15),[32](https://arxiv.org/html/2605.22856#bib.bib10),[33](https://arxiv.org/html/2605.22856#bib.bib3),[11](https://arxiv.org/html/2605.22856#bib.bib4)\], \(ii\) masked reconstructive learning\[[24](https://arxiv.org/html/2605.22856#bib.bib14),[18](https://arxiv.org/html/2605.22856#bib.bib15),[5](https://arxiv.org/html/2605.22856#bib.bib20),[6](https://arxiv.org/html/2605.22856#bib.bib22),[38](https://arxiv.org/html/2605.22856#bib.bib21),[28](https://arxiv.org/html/2605.22856#bib.bib11),[32](https://arxiv.org/html/2605.22856#bib.bib10),[4](https://arxiv.org/html/2605.22856#bib.bib7),[27](https://arxiv.org/html/2605.22856#bib.bib17)\], and \(iii\) joint reconstructive\-contrastive objectives\[[32](https://arxiv.org/html/2605.22856#bib.bib10),[18](https://arxiv.org/html/2605.22856#bib.bib15)\]\. Alongside these encoder\-oriented families, a separate line adopts decoder\-only causal generation for temporal prediction and forecasting\[[34](https://arxiv.org/html/2605.22856#bib.bib13),[29](https://arxiv.org/html/2605.22856#bib.bib18),[43](https://arxiv.org/html/2605.22856#bib.bib9)\]\. Although effective for sequence generation, decoder\-only models are not representation learners in the encoder\-based sense because they do not directly expose a compact, task\-agnostic representation for downstream adaptation\. We therefore focus on encoder\-based self\-supervised pipelines\.

This focus still leaves important design choices\. Contrastive objectives can learn transferable wireless features, but they often require multiple augmented views and forward passes before each update, and their performance is highly dependent on view construction and positives/negatives\[[25](https://arxiv.org/html/2605.22856#bib.bib19),[18](https://arxiv.org/html/2605.22856#bib.bib15),[32](https://arxiv.org/html/2605.22856#bib.bib10)\]\. Non\-contrastive joint\-embedding methods, such as JEPA\-style predictors over masked latents\[[11](https://arxiv.org/html/2605.22856#bib.bib4),[33](https://arxiv.org/html/2605.22856#bib.bib3)\], avoid explicit negative pairs and typically rely on masked\-context prediction consistency rather than handcrafted augmentation pipelines, but for the same reason they do not explicitly learn reconstruction\-oriented features\. However, in wireless, channel reconstruction tasks are first\-class downstream objectives\. It is desirable that the pretraining objective shapes representations that retain the signal structure needed to recover the channel itself, not just abstract latent invariances\. Within reconstructive learning, BERT\-style\[[14](https://arxiv.org/html/2605.22856#bib.bib28)\]masked modeling feeds a dense sequence to the encoder, processes masked and visible tokens together, and uses lightweight heads to only predict the masked positions\[[5](https://arxiv.org/html/2605.22856#bib.bib20),[38](https://arxiv.org/html/2605.22856#bib.bib21),[32](https://arxiv.org/html/2605.22856#bib.bib10),[6](https://arxiv.org/html/2605.22856#bib.bib22)\]\. However, this paradigm scales poorly with mask ratio because the encoder still processes masked tokens that carry no observation content\.

MAE\-style pretraining bypasses these limitations\. The encoder processes only visible tokens, and a transformer decoder reconstructs the masked content from encoded visible tokens and mask placeholders\[[21](https://arxiv.org/html/2605.22856#bib.bib29)\]\. The encoder cost decreases with the mask ratio, single\-view single\-pass updates avoid the multi\-view overhead of contrastive pretraining, and the input\-space reconstruction objective preserves channel\-valued structure in the learned representation\. The theory further shows that masked reconstruction implicitly performs contrastive alignment, because different masked views of the same input that share a reconstruction target act as positive pairs and are pulled together in feature space\. This explains the quality of MAE’s representation without an explicit contrastive loss\[[41](https://arxiv.org/html/2605.22856#bib.bib41)\]\.

Nevertheless, two design choices shape what an MAE encoder actually learns\. The first choice is the reconstruction target\. All wireless MAE variants use raw MSE reconstruction\[[24](https://arxiv.org/html/2605.22856#bib.bib14),[18](https://arxiv.org/html/2605.22856#bib.bib15),[28](https://arxiv.org/html/2605.22856#bib.bib11),[4](https://arxiv.org/html/2605.22856#bib.bib7),[27](https://arxiv.org/html/2605.22856#bib.bib17)\], which is poorly matched to channels whose amplitudes span a very large dynamic range\. Under raw MSE, the loss is dominated by a small fraction of high\-power, often LoS\-like channels, while low\-power NLoS\-rich channels, whose small\-scale fading patterns carry the complex multipath structure the encoder should actually learn, contribute negligibly to the gradient\. The second choice is how representational work is split between encoder and decoder\. Vision MAE pairs a deep encoder with a shallow decoder and discards the decoder after pretraining, forcing representational load onto the encoder\[[21](https://arxiv.org/html/2605.22856#bib.bib29)\]\. Wireless follows this trend\. However, the decoder is also maintained and reused for channel reconstruction tasks \(channel estimation, prediction, and CSI feedback\)\[[24](https://arxiv.org/html/2605.22856#bib.bib14),[18](https://arxiv.org/html/2605.22856#bib.bib15),[28](https://arxiv.org/html/2605.22856#bib.bib11),[27](https://arxiv.org/html/2605.22856#bib.bib17)\], since both strong representation quality and accurate reconstruction are desired\. As a result, a single pretraining stage is forced to deliver both objectives at once, making the encoder\-decoder capacity split a compromise rather than a deliberate choice\. PilotWiMAE addresses both issues as explained in detail in Section[III](https://arxiv.org/html/2605.22856#S3)\.

The choice of objective and reconstruction loss is only part of what makes a pretrained channel representation useful in practice\. Two further dimensions matter just as much\. The first is*the input interface*, which determines what the encoder actually observes during pretraining and deployment\. The second,*the architectural inductive bias*, regulates which channel properties the encoder is structurally encouraged to exploit, with direct consequences for representation quality\. Both remain underexplored in wireless self\-supervised learning, and we discuss them in turn\.

Starting with the input interface, most existing protocols still assume full\-grid CSI at evaluation, and operate on full\-grid CSI tensors during pretraining, sometimes masked under a self\-supervised reconstruction objective\. Several works are entirely based on pretraining and evaluating clean full\-CSI\[[5](https://arxiv.org/html/2605.22856#bib.bib20),[25](https://arxiv.org/html/2605.22856#bib.bib19),[24](https://arxiv.org/html/2605.22856#bib.bib14),[30](https://arxiv.org/html/2605.22856#bib.bib16)\]\. Others inject i\.i\.d\. AWGN into channels during pretraining and/or evaluation\[[27](https://arxiv.org/html/2605.22856#bib.bib17),[18](https://arxiv.org/html/2605.22856#bib.bib15),[38](https://arxiv.org/html/2605.22856#bib.bib21),[29](https://arxiv.org/html/2605.22856#bib.bib18),[28](https://arxiv.org/html/2605.22856#bib.bib11)\], which improves stress testing, but still does not match pilot\-based observability in real receivers\. Even when sparse settings are considered, they are typically task\-specific \(e\.g\., localization under pilot\-position sampling\) or use clean sparse observations rather than noisy pilot measurements\[[32](https://arxiv.org/html/2605.22856#bib.bib10)\]\. Consequently, a key mismatch in the deployment remains\. In these systems, representations are largely learned and validated in regimes where dense CSI is available, whereas practical systems observe noisy CSI only at pilot resource elements, without the possibility of recovering the perfect CSI\.

A small number of studies are moving toward realistic receiver-side conditions. The approach in \[[37](https://arxiv.org/html/2605.22856#bib.bib25)\] prepends a channel estimation and refinement module to a frozen feature extraction network. Architecturally, this is no different from cascading any channel estimator with a feature extractor, and it reinstates the full-CSI assumption at the encoder input rather than establishing a pilot-native, general-purpose representation learner across tasks. The work in \[[43](https://arxiv.org/html/2605.22856#bib.bib9)\] pretrains a causal autoregressive model on historical LMMSE channel estimates rather than on clean full-CSI, and at inference uses the autoregressively predicted state as a prior for refining the channel estimate from current pilots. The focus is therefore generative forecasting and prior-based channel estimation, rather than self-supervised encoder representation learning across tasks.

Another challenge concerns architectural inductive bias. Most prior channel representation models apply dense all-to-all attention to channel tokens, even though channel statistics exhibit structure induced by the underlying physics. A recent angle-delay-time representation learner sparsifies attention to tokens falling inside an angle-delay window and a temporal cone, capturing scatterer evolution in a sparse angle-delay domain \[[6](https://arxiv.org/html/2605.22856#bib.bib22)\]. That construction does not carry over to dense space-frequency inputs, where correlations are not confined to a small neighborhood of each token, so spectro-spatial mixing must attend broadly over the space-frequency grid rather than only locally. Even so, a mild WSSUS assumption implies that temporal correlation factorizes from joint space-frequency correlation \[[9](https://arxiv.org/html/2605.22856#bib.bib38),[31](https://arxiv.org/html/2605.22856#bib.bib39)\], which motivates factorized attention. Factorized attention appears in video \[[10](https://arxiv.org/html/2605.22856#bib.bib36),[8](https://arxiv.org/html/2605.22856#bib.bib37)\] and in concurrent wireless work aimed at complexity reduction \[[39](https://arxiv.org/html/2605.22856#bib.bib5),[44](https://arxiv.org/html/2605.22856#bib.bib6),[38](https://arxiv.org/html/2605.22856#bib.bib21)\], but not as a pilot-native, physics-induced bias to separate temporal from spectro-spatial domains for self-supervised CSI representation learning. PilotWiMAE addresses this gap by jointly aligning *what is observed* (noisy pilots) and *how it is processed* (WSSUS-motivated factorized attention) in one pretraining framework.

### I-B Contribution and Summary of Results

We introduce PilotWiMAE, a self-supervised, foundation-model-style framework for wireless channel representation. (Footnote 1: PilotWiMAE code, pretrained weights, and training pipeline are available at https://github.com/BerkIGuler/PilotWiMAE. CSIGen, our Python-based ray-tracing channel-generation tool built on Sionna, and channel datasets used in this work are available at https://github.com/BerkIGuler/CSIGen.) The key contributions and findings of this work are summarized below.

- We propose a pilot-native input interface that operates directly on noisy pilot observations for decision-making tasks, bypasses channel estimation, and eliminates the need for the full-CSI assumption typically made by prior channel foundation models.
- We design a factorized space-time (FST) encoder that applies temporal attention across time slots and spectro-spatial attention within each time slot, an inductive bias grounded in WSSUS-motivated separability that yields robust, high-performing representations and enables an aggressive $99\%$ pretraining mask ratio.
- We pair patch-normalized reconstruction, which captures small-scale fading structure, with an auxiliary scale loss that recovers the large-scale fading signature (path loss and shadowing), jointly supervised on encoder- and decoder-side features.
- We introduce a decoder-centric masked reconstruction pretraining stage, following the AWGN-augmented joint encoder-decoder pretraining with masked reconstruction and scale losses, to decouple decoder capacity from encoder representation quality.
- Through extensive evaluation under cross-band ($3.5$ to $28$ GHz) transfer, including both in-distribution (ID) and out-of-distribution (OOD) settings, we demonstrate that PilotWiMAE's frozen representations transfer to beam selection and channel characterization without task-specific fine-tuning, beating supervised baselines despite operating on a smaller observation space.
- We further demonstrate noise-robust, pilot-pattern-agnostic channel estimation that remains competitive with supervised pilot-pattern-specific baselines, and report how performance scales with decoder depth when only the decoder is pretrained on top of a frozen encoder.
- We introduce a noise-robust AWGN pretraining curriculum that anneals the SNR lower bound across epochs, progressively exposing the model to more challenging SNR conditions while aligning corruption with pilot-noise statistics.
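
The curriculum in the last item can be sketched as follows. This is only an illustration of the idea of annealing the SNR lower bound: the linear schedule, the uniform sampling, and every numeric value (`start_db`, `end_db`, `max_db`) are assumptions made here and are not taken from the paper.

```python
import random


def snr_lower_bound(epoch, num_epochs, start_db, end_db):
    """Lower SNR bound (dB) for a given epoch.

    ASSUMPTION: linear interpolation from start_db (easy) to end_db (hard).
    The source only states that the lower bound is annealed across epochs.
    """
    if num_epochs <= 1:
        return end_db
    frac = min(max(epoch / (num_epochs - 1), 0.0), 1.0)
    return start_db + frac * (end_db - start_db)


def sample_snr_db(epoch, num_epochs, rng, start_db=20.0, end_db=0.0, max_db=30.0):
    """Draw one training SNR (dB) uniformly from [lower_bound(epoch), max_db].

    All default values are hypothetical placeholders.
    """
    low = snr_lower_bound(epoch, num_epochs, start_db, end_db)
    return rng.uniform(low, max_db)


rng = random.Random(0)
for epoch in (0, 5, 9):
    low = snr_lower_bound(epoch, 10, 20.0, 0.0)
    print(epoch, round(low, 2), round(sample_snr_db(epoch, 10, rng), 2))
```

Early epochs draw only from the high-SNR end of the range, and later epochs widen the range downward, so harder noise conditions appear gradually rather than from the first update.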

![Refer to caption](https://arxiv.org/html/2605.22856v1/figures/pilotwimae_diagram.png)

Figure 2: PilotWiMAE architecture. Pilot patches feed the FST encoder. The resulting representations are decoded by a JST transformer. Finally, the tokens are mapped back to patch dimension for channel reconstruction. For clarity, the diagram shows the main forward path only and omits several elements described in Section [III](https://arxiv.org/html/2605.22856#S3).

#### Paper organization

The remainder of this paper is organized as follows. Section [II](https://arxiv.org/html/2605.22856#S2) formulates the pilot-native channel representation problem. Section [III](https://arxiv.org/html/2605.22856#S3) details the PilotWiMAE framework (Fig. [2](https://arxiv.org/html/2605.22856#S1.F2)). Section [IV](https://arxiv.org/html/2605.22856#S4) describes the experimental setup and reports cross-band ID and OOD results. Section [V](https://arxiv.org/html/2605.22856#S5) reports training and inference complexity. Section [VI](https://arxiv.org/html/2605.22856#S6) concludes the paper.

## II Problem Formulation

We consider a base station operating over $T$ consecutive time slots, each with $N_{\mathrm{sc}}$ subcarriers and a uniform planar array of $N_{\mathrm{h}} \times N_{\mathrm{v}}$ antennas. The full channel tensor $\mathbf{H} \in \mathbb{C}^{T \times N_{\mathrm{h}} N_{\mathrm{v}} \times N_{\mathrm{sc}}}$ is never observed. Instead, the receiver obtains a noisy observation at a sparse set of pilot resource elements indexed by $\mathcal{P}$,

$$\hat{\mathbf{H}}_{\mathcal{P}} = \mathbf{H}_{\mathcal{P}} + \mathbf{N}_{\mathcal{P}}(\mathrm{SNR}), \tag{1}$$

where $\mathbf{N}_{\mathcal{P}}(\mathrm{SNR})$ is i.i.d. circularly-symmetric complex Gaussian noise whose variance is a known function of the pilot SNR alone \[[17](https://arxiv.org/html/2605.22856#bib.bib35)\].

More precisely, pilots are sent at selected OFDM symbol indices $\mathcal{T}_{\mathrm{p}} \subset \{0, \ldots, T-1\}$ and subcarrier indices $\mathcal{F}_{\mathrm{p}} \subset \{0, \ldots, N_{\mathrm{sc}}-1\}$, and at each $(t, f) \in \mathcal{T}_{\mathrm{p}} \times \mathcal{F}_{\mathrm{p}}$ all $N_{\mathrm{h}} N_{\mathrm{v}}$ antenna entries are observed. Our method is agnostic to the specific layout of $(\mathcal{T}_{\mathrm{p}}, \mathcal{F}_{\mathrm{p}})$.
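
The observation model in (1) can be illustrated with a small, dependency-free sketch. The function names, the toy grid sizes, and the i.i.d. Rayleigh stand-in channel are hypothetical. In particular, the mapping from SNR to noise variance (signal power divided by linear SNR) is an assumption, since the text only states that the variance is a known function of the pilot SNR.

```python
import math
import random


def noisy_pilots(H, pilot_symbols, pilot_subcarriers, snr_db, rng, signal_power=1.0):
    """Sample Eq. (1): return {(t, s, f): H[t][s][f] + noise} on the pilot set.

    H is a nested list indexed as H[t][s][f] with complex entries.
    ASSUMPTION: noise variance = signal_power / SNR_linear.
    """
    noise_var = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_var / 2.0)  # std of each real/imaginary component
    observed = {}
    for t in pilot_symbols:
        for f in pilot_subcarriers:
            for s in range(len(H[t])):  # every antenna is observed at a pilot (t, f)
                noise = complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
                observed[(t, s, f)] = H[t][s][f] + noise
    return observed


def random_channel(T, S, F, rng):
    """Toy unit-power i.i.d. Rayleigh channel (a stand-in, not ray-traced data)."""
    std = math.sqrt(0.5)
    return [[[complex(rng.gauss(0.0, std), rng.gauss(0.0, std)) for _ in range(F)]
             for _ in range(S)] for _ in range(T)]


rng = random.Random(0)
T, S, F = 4, 8, 12  # hypothetical toy sizes
H = random_channel(T, S, F, rng)
obs = noisy_pilots(H, pilot_symbols=[0, 2], pilot_subcarriers=[0, 4, 8],
                   snr_db=10.0, rng=rng)
print(f"{len(obs)} of {T * S * F} entries observed")  # 48 of 384 entries observed
```

Only the entries on the pilot set are ever returned, which mirrors the statement that the full tensor is never observed by the receiver.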

Prior channel foundation models largely adopt a dense CSI interface rather than the pilot-only observations in ([1](https://arxiv.org/html/2605.22856#S2.E1)). A generic way to express the mismatch between such a full-grid tensor and ground truth $\mathbf{H}$ is an additive estimation residual,

$$\hat{\mathbf{H}} = \mathbf{H} + \mathbf{E}(\boldsymbol{\theta}), \tag{2}$$

where $\mathbf{E}(\boldsymbol{\theta})$ is generally correlated and its statistics depend on the estimator and on channel parameters that are scenario-specific and unavailable in general. Existing pipelines either ignore $\mathbf{E}(\boldsymbol{\theta})$ or substitute it with i.i.d. circularly-symmetric complex Gaussian noise. While this substitution is mathematically convenient, the true residual is not i.i.d. circularly-symmetric complex Gaussian.

One straightforward alternative is to add a channel estimator to the encoder, recovering $\hat{\mathbf{H}}$ before producing representations. This either requires end-to-end training of an additional block or requires the encoder to be robust to the artifacts of a specific estimator, both of which complicate the design without addressing the root issue. Instead, we feed $\hat{\mathbf{H}}_{\mathcal{P}}$ directly into the encoder, i.e.,

$$\hat{\mathbf{H}}_{\mathcal{P}} \;\longrightarrow\; \text{encoder} \;\longrightarrow\; \text{representations} \;\longrightarrow\; \text{decision}. \tag{3}$$
The hypothesis is that when pretraining uses masked reconstruction, sparse pilot inputs can support the global channel structure in the encoder latent\. A deliberately shallow decoder turns the encoder latent into the representational bottleneck where the encoder must integrate information across the resource grid from the visible pilot tokens to predict the masked content\. Decision\-making tasks that do not require dense CSI, including beam selection and channel characterization, can follow a compact inference path using pilot\-only features without reconstructing the entire grid during inference\. In general, joint encoder\-decoder pretraining couples latent quality with decoder capacity, motivating us to discard that decoder, freeze the encoder, and pretrain a freshly initialized decoder with greater reconstruction capacity on patch\-normalized reconstruction alone\. The second stage yields stronger dense\-channel recovery while encoder representations stay fixed\.

## III The PilotWiMAE Framework

In this section, we introduce PilotWiMAE, a self-supervised framework that realizes the two design principles of Section [I](https://arxiv.org/html/2605.22856#S1) through six components, each developed in its own subsection below.

### III-A Pilot-native input interface

During pretraining, the encoder sees extremely sparse inputs generated by structured random masking under AWGN. During inference, it sees a sparse input $\hat{\mathbf{H}}_{\mathcal{P}}$ induced by the pilot pattern (Appendix [B](https://arxiv.org/html/2605.22856#A2)).

Before tokenization, each sample is power-normalized using a dataset-level reference power $P_{\mathrm{ref}}$ computed on the pretraining split. We use $\bar{\mathbf{H}} = \mathbf{H}/\sqrt{P_{\mathrm{ref}}}$. Defining $S = N_{\mathrm{h}} N_{\mathrm{v}}$ and $F = N_{\mathrm{sc}}$, let the complex input tensor be $\bar{\mathbf{H}} \in \mathbb{C}^{T \times S \times F}$. Let the 3D patch size be $(p_{\mathrm{t}}, p_{\mathrm{s}}, p_{\mathrm{f}})$. Defining $(n_{\mathrm{t}}, n_{\mathrm{s}}, n_{\mathrm{f}}) = (T/p_{\mathrm{t}}, S/p_{\mathrm{s}}, F/p_{\mathrm{f}})$, tokenization yields $P = n_{\mathrm{t}} n_{\mathrm{s}} n_{\mathrm{f}}$ patches, with $N_{\mathrm{sf}} = n_{\mathrm{s}} n_{\mathrm{f}}$ spectro-spatial tokens per slot. The real and imaginary parts are split and concatenated within each patch. Therefore, each raw patch vector has dimension $D_{\mathrm{p}} = 2 p_{\mathrm{t}} p_{\mathrm{s}} p_{\mathrm{f}}$ before linear projection to the model dimension $d$.

Patch offsets $(i_{\mathrm{t}}, i_{\mathrm{s}}, i_{\mathrm{f}})$ index temporal, spatial, and frequency patches with $i_{\mathrm{t}} \in \{0, \ldots, n_{\mathrm{t}}-1\}$, $i_{\mathrm{s}} \in \{0, \ldots, n_{\mathrm{s}}-1\}$, and $i_{\mathrm{f}} \in \{0, \ldots, n_{\mathrm{f}}-1\}$, respectively, and time-major unfolding maps them to $p = i_{\mathrm{t}} n_{\mathrm{s}} n_{\mathrm{f}} + i_{\mathrm{s}} n_{\mathrm{f}} + i_{\mathrm{f}} \in \{0, \ldots, P-1\}$. Each token $\mathbf{x}_p \in \mathbb{R}^{d}$ is the linear projection of the $D_{\mathrm{p}}$-dimensional real-imaginary channel patch at flat index $p$ to the model dimension.
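
The bookkeeping above (patch counts, raw patch dimension, and the time-major flat index) can be checked with a short sketch. The function names and the example grid and patch sizes are hypothetical. Raising an error when a patch size does not divide an axis is an assumption made here, since the text simply defines the counts as exact ratios.

```python
def patch_grid(T, S, F, pt, ps, pf):
    """Return patch counts (n_t, n_s, n_f), total patches P, tokens per slot
    N_sf, and raw patch dimension D_p = 2 * pt * ps * pf (real + imaginary)."""
    if T % pt or S % ps or F % pf:
        raise ValueError("each patch size must divide its axis (assumption)")
    nt, ns, nf = T // pt, S // ps, F // pf
    return {"n": (nt, ns, nf), "P": nt * ns * nf, "N_sf": ns * nf,
            "D_p": 2 * pt * ps * pf}


def flat_index(i_t, i_s, i_f, ns, nf):
    """Time-major unfolding: p = i_t * n_s * n_f + i_s * n_f + i_f."""
    return i_t * ns * nf + i_s * nf + i_f


def unflatten(p, ns, nf):
    """Inverse of flat_index: recover (i_t, i_s, i_f) from p."""
    i_t, rem = divmod(p, ns * nf)
    i_s, i_f = divmod(rem, nf)
    return i_t, i_s, i_f


# Hypothetical sizes: T=8 slots, S=32 antennas, F=64 subcarriers, patch (1, 4, 8).
grid = patch_grid(8, 32, 64, 1, 4, 8)
nt, ns, nf = grid["n"]
print(grid)                       # {'n': (8, 8, 8), 'P': 512, 'N_sf': 64, 'D_p': 64}
print(flat_index(2, 3, 5, ns, nf))  # 157
print(unflatten(157, ns, nf))       # (2, 3, 5)
```

Because the unfolding is time-major, all $N_{\mathrm{sf}}$ tokens of one temporal patch index occupy a contiguous range of flat indices, which is convenient for attention that is applied within a slot.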

We add an axial sinusoidal positional embedding to each token. To this end, we partition the model dimension into per-axis sub-widths $d=d_{\mathrm{t}}+d_{\mathrm{s}}+d_{\mathrm{f}}$ with $d_{\mathrm{t}}=d_{\mathrm{s}}=\lfloor d/3\rfloor$ and $d_{\mathrm{f}}=d-2\lfloor d/3\rfloor$, and assign one sub-width to each axis. For axis $a\in\{\mathrm{t},\mathrm{s},\mathrm{f}\}$ with patch count $n_{a}$ and sub-width $d_{a}$, the per-axis embedding $\boldsymbol{\psi}^{(a)}_{i}\in\mathbb{R}^{d_{a}}$ at index $i\in\{0,\ldots,n_{a}-1\}$ uses the standard 1D transformer sinusoidal recipe of [[36](https://arxiv.org/html/2605.22856#bib.bib26)],

$$[\boldsymbol{\psi}^{(a)}_{i}]_{2j}=\sin\bigl(i\,10\,000^{-2j/d_{a}}\bigr), \tag{4}$$

$$[\boldsymbol{\psi}^{(a)}_{i}]_{2j+1}=\cos\bigl(i\,10\,000^{-2j/d_{a}}\bigr), \tag{5}$$

with $j\in\{0,\ldots,\lfloor(d_{a}-1)/2\rfloor\}$, applying ([5](https://arxiv.org/html/2605.22856#S3.E5)) only when $2j+1<d_{a}$. The patch-level positional embedding then concatenates the three axis blocks, $\mathbf{e}^{\mathrm{pe}}_{p}=\bigl[\,(\boldsymbol{\psi}^{(\mathrm{t})}_{i_{\mathrm{t}}})^{\top},\ (\boldsymbol{\psi}^{(\mathrm{s})}_{i_{\mathrm{s}}})^{\top},\ (\boldsymbol{\psi}^{(\mathrm{f})}_{i_{\mathrm{f}}})^{\top}\,\bigr]^{\top}\in\mathbb{R}^{d}$, with $(i_{\mathrm{t}},i_{\mathrm{s}},i_{\mathrm{f}})$ decoded from $p$. Tokens receive this positional encoding via $\mathbf{x}_{p}\leftarrow\mathbf{x}_{p}+\alpha_{\mathrm{pe}}\mathbf{e}^{\mathrm{pe}}_{p}$, where $\alpha_{\mathrm{pe}}$ is a learnable scalar initialized near zero such that the fixed sinusoidal magnitudes do not dominate the linear patch embedding early in pretraining.
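As a rough sketch of (4)-(5) and the axis-wise concatenation, the following NumPy code builds the full $(P, d)$ embedding table in time-major order. The function names are hypothetical, and $\alpha_{\mathrm{pe}}$ is treated as a plain float here, whereas in the model it is a learnable scalar.

```python
import numpy as np

def axis_embedding(n, d_a):
    """1D sinusoidal table of shape (n, d_a) following (4)-(5)."""
    psi = np.zeros((n, d_a))
    i = np.arange(n)[:, None]
    j = np.arange((d_a - 1) // 2 + 1)[None, :]    # j = 0..floor((d_a-1)/2)
    angle = i * 10000.0 ** (-2.0 * j / d_a)
    psi[:, 0::2] = np.sin(angle)                   # entries 2j
    psi[:, 1::2] = np.cos(angle[:, : d_a // 2])    # entries 2j+1, only when 2j+1 < d_a
    return psi

def axial_positional_embedding(nt, ns, nf, d):
    """Concatenate time, space, and frequency blocks for every flat index p."""
    dt = ds = d // 3
    df = d - 2 * (d // 3)
    tab_t = axis_embedding(nt, dt)
    tab_s = axis_embedding(ns, ds)
    tab_f = axis_embedding(nf, df)
    # Raveling an "ij" meshgrid in C order reproduces the time-major index p.
    i_t, i_s, i_f = np.meshgrid(np.arange(nt), np.arange(ns), np.arange(nf), indexing="ij")
    return np.concatenate([tab_t[i_t.ravel()], tab_s[i_s.ravel()], tab_f[i_f.ravel()]], axis=1)

def add_positional_embedding(x, e_pe, alpha_pe=0.0):
    """x <- x + alpha_pe * e_pe; alpha_pe starts near zero (assumed 0.0 here)."""
    return x + alpha_pe * e_pe
```

With $d=128$ the split is $d_{\mathrm{t}}=d_{\mathrm{s}}=42$ and $d_{\mathrm{f}}=44$, so the three blocks occupy columns 0-41, 42-83, and 84-127 of the table.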

### III-B WSSUS-motivated factorized attention

Under the classical WSSUS model [[9](https://arxiv.org/html/2605.22856#bib.bib38)] and its MIMO extension [[31](https://arxiv.org/html/2605.22856#bib.bib39)], the temporal correlation is governed by the Doppler spectrum, a function of the environment mobility, while the spectro-spatial correlation is governed by the joint angular-delay power spectrum, a function of the scattering geometry. Writing the channel autocorrelation over a time lag $\Delta t$, a frequency lag $\Delta f$, and a spatial lag $\Delta s$, the WSSUS assumption and the standard separability of Doppler from angle-delay dispersion in MIMO-WSSUS channels [[31](https://arxiv.org/html/2605.22856#bib.bib39)] yield

$$R_{\mathrm{H}}(\Delta t,\Delta f,\Delta s)\;\approx\;R_{\mathrm{t}}(\Delta t)\,R_{\mathrm{sf}}(\Delta f,\Delta s). \tag{6}$$

According to ([6](https://arxiv.org/html/2605.22856#S3.E6)), temporal and space-frequency correlations are weakly coupled and a representation learner does not need to model cross-domain correlations. On the other hand, space and frequency remain coupled through the jointly angle- and delay-dependent scattering structure [[31](https://arxiv.org/html/2605.22856#bib.bib39)]. Therefore, a three-way separability is not desirable as it would discard the real physical structure that the encoder should learn.

This prior maps to an attention factorization. To represent the sparse subset of tokens retained by the input mask, let $n_{\mathrm{k}}\leq n_{\mathrm{t}}$ and $N'_{\mathrm{sf}}\leq N_{\mathrm{sf}}$ denote the numbers of retained temporal and spectro-spatial patch indices, respectively. The retained set is rectangular both during pretraining (structured random masking, Section [III-C](https://arxiv.org/html/2605.22856#S3.SS3)) and at inference (the fixed pilot pattern, Appendix [B](https://arxiv.org/html/2605.22856#A2)), and $(n_{\mathrm{k}},N'_{\mathrm{sf}})=(n_{\mathrm{t}},N_{\mathrm{sf}})$ recovers the unmasked case. Let $\mathbf{Z}^{(\ell)}\in\mathbb{R}^{n_{\mathrm{k}}\times N'_{\mathrm{sf}}\times d}$ be the token tensor at the input to Block $\ell\in\{0,1,\ldots,L_{\mathrm{FST}}\}$, where $L_{\mathrm{FST}}$ is the number of FST encoder blocks. Within each block, our FST encoder applies temporal attention across the retained temporal-patch indices followed by spectro-spatial attention within each retained temporal slice, i.e.,

$$\mathbf{Z}^{(\ell+\tfrac{1}{2})}_{:,s,:}=\mathrm{Attn}_{\mathrm{t}}\!\left(\mathbf{Z}^{(\ell)}_{:,s,:}\right),\quad s=1,\dots,N'_{\mathrm{sf}}, \tag{7}$$

$$\mathbf{Z}^{(\ell+1)}_{t,:,:}=\mathrm{Attn}_{\mathrm{sf}}\!\left(\mathbf{Z}^{(\ell+\tfrac{1}{2})}_{t,:,:}\right),\quad t=1,\dots,n_{\mathrm{k}}, \tag{8}$$

where $\mathbf{Z}^{(\ell+\tfrac{1}{2})}$ denotes the intermediate token tensor between the two sublayers. Residual connections and feedforward sublayers are omitted for the sake of clarity. As illustrated in Fig. [3](https://arxiv.org/html/2605.22856#S3.F3), cross-slot information is exchanged only by temporal attention at fixed spectro-spatial indices. On the other hand, spectro-spatial information is exchanged only by within-slot attention at fixed time indices, mirroring ([6](https://arxiv.org/html/2605.22856#S3.E6)). Because factorization is an inductive bias rather than a sparsification of the attention matrix, it does not degrade expressivity on structures that WSSUS captures. Our ablations against a joint space-time (JST) baseline of matched parameter count show consistent gains on both ID and OOD beam selection and channel characterization.
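The two sublayers in (7)-(8) amount to running ordinary self-attention along one axis of the token tensor while treating the other axis as a batch dimension. The sketch below illustrates this with a single-head NumPy attention. It is a simplified illustration, not the released encoder: multi-head attention, residual connections, normalization, and feedforward sublayers are left out, and all names and weight shapes are assumptions.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head attention over the second-to-last axis of X, shape (batch, n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def temporal_sublayer(Z, params_t):
    """Eq. (7): attend across retained temporal indices at each fixed index s."""
    Zs = np.swapaxes(Z, 0, 1)                     # (N_sf', n_k, d): batch over s
    return np.swapaxes(self_attention(Zs, *params_t), 0, 1)

def spectro_spatial_sublayer(Z_half, params_sf):
    """Eq. (8): attend across spectro-spatial indices within each retained slice t."""
    return self_attention(Z_half, *params_sf)     # (n_k, N_sf', d): batch over t

def fst_block(Z, params_t, params_sf):
    """One factorized block: temporal attention followed by spectro-spatial attention."""
    return spectro_spatial_sublayer(temporal_sublayer(Z, params_t), params_sf)
```

Perturbing a single token at position $(t_0, s_0)$ changes the temporal sublayer output only in column $s_0$ and the spectro-spatial sublayer output only in row $t_0$, which is the exchange pattern the text describes.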

![Refer to caption](https://arxiv.org/html/2605.22856v1/x2.png)
Figure 3: Factorized space-time attention on the patch-token grid with axes $(n_{\mathrm{t}},n_{\mathrm{s}},n_{\mathrm{f}})$. For illustration, the figure uses $n_{\mathrm{t}}=8$, $n_{\mathrm{s}}=3$, $n_{\mathrm{f}}=6$, $n_{\mathrm{k}}=3$, and $\rho_{\mathrm{k}}=\frac{3}{18}$, which leads to a mask ratio of $\approx 94\%$.

Beyond the inductive-bias argument, factorization also lowers per-layer attention cost relative to joint space-time attention. We will discuss the complexity expressions, measured FLOPs, and per-sample latency in Section [V](https://arxiv.org/html/2605.22856#S5).

### III-C Aggressive masking enabled by factorization

Factorized attention and separability permit an unusually aggressive pretraining masking regime. We apply a structured random mask that retains only $n_{\mathrm{k}}$ out of $n_{\mathrm{t}}$ temporal-patch indices. Across those retained patches, we keep a common fraction $\rho_{\mathrm{k}}\in(0,1)$ of the spectro-spatial token positions (the same positions in every retained patch rather than resampled per patch). Therefore, the overall fraction of visible tokens is $(n_{\mathrm{k}}/n_{\mathrm{t}})\,\rho_{\mathrm{k}}$ and the overall mask ratio is $1-(n_{\mathrm{k}}/n_{\mathrm{t}})\,\rho_{\mathrm{k}}$. This is equivalent to $N'_{\mathrm{sf}}=\rho_{\mathrm{k}}N_{\mathrm{sf}}$ in ([7](https://arxiv.org/html/2605.22856#S3.E7)). The mask structure matches the factorization. The temporal block mixes information across the $n_{\mathrm{k}}$ kept temporal patches at each fixed spectro-spatial position. The spectro-spatial block mixes across the visible positions within each kept patch. Therefore, the decoder receives informed tokens at every visible location. Randomizing the mask across pretraining examples keeps the encoder agnostic to any specific pilot pattern. As a result, at inference, the same pretrained model can ingest whatever fixed pilot configuration is used at the receiver.
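To make the rectangular keep set and the mask-ratio arithmetic concrete, here is a small NumPy sketch. The function name and the rounding of $\rho_{\mathrm{k}}N_{\mathrm{sf}}$ to an integer are assumptions; the example values are the illustrative ones quoted for Fig. 3.

```python
import numpy as np

def structured_mask(nt, n_sf, nk, rho_k, rng):
    """Sample a rectangular keep set: nk temporal indices times a shared
    subset of spectro-spatial positions (same positions in every kept patch)."""
    n_sf_kept = max(1, int(round(rho_k * n_sf)))   # assumed rounding rule
    t_keep = np.sort(rng.choice(nt, size=nk, replace=False))
    sf_keep = np.sort(rng.choice(n_sf, size=n_sf_kept, replace=False))
    visible = np.zeros((nt, n_sf), dtype=bool)
    visible[np.ix_(t_keep, sf_keep)] = True
    mask_ratio = 1.0 - visible.mean()              # equals 1 - (nk/nt) * (n_sf_kept/n_sf)
    return t_keep, sf_keep, visible, mask_ratio

# Fig. 3 values: n_t = 8, N_sf = 3 * 6 = 18, n_k = 3, rho_k = 3/18.
rng = np.random.default_rng(0)
t_keep, sf_keep, visible, mask_ratio = structured_mask(8, 18, 3, 3 / 18, rng)
```

With these values, 9 of the 144 tokens stay visible, so the mask ratio is $1-(3/8)(3/18)=0.9375$, matching the $\approx 94\%$ quoted in the figure caption.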

This structured keep set is well matched to factorized attention since every attended pair under FST differs on at most one axis. Under ([6](https://arxiv.org/html/2605.22856#S3.E6)), the autocorrelation along a single axis is governed by either $R_{\mathrm{t}}$ alone or $R_{\mathrm{sf}}$ alone. Therefore, each attended pair is one that the physics predicts to be well correlated. By contrast, applying the same keep set to a JST encoder spreads its quadratic attention budget over all pairs of retained tokens, including pairs that differ on both axes simultaneously. For such cross-axis pairs, the autocorrelation factorizes as $R_{\mathrm{t}}(\Delta t)\,R_{\mathrm{sf}}(\Delta f,\Delta s)$, which can decay rapidly through either factor and so carries comparatively little signal. FST therefore extracts more information per unit of token budget than JST on the same visible set, allowing aggressive masking to remain useful rather than degrading into mixing across weakly correlated lags.

### III-D Patch-normalized reconstruction with an auxiliary scale loss

The power of a wireless channel spans an enormous dynamic range across samples. Path loss and shadowing typically vary by tens of dB between channels in the same training set, while small-scale multipath fading, the relevant structure for beam selection and many other downstream tasks, occupies a much finer amplitude scale. A reconstruction loss computed in raw amplitude would therefore be dominated by high-power patches, and the network would minimize it by fitting large-scale trends while leaving the multipath geometry largely unlearned. We address this by normalizing each patch by its own mean and variance before computing the reconstruction loss. Writing $\mathbf{p}_{b,i}\in\mathbb{R}^{D_{\mathrm{p}}}$ for the $i$-th patch of sample $b$ and $\mu_{b,i}$, $\sigma^{2}_{b,i}$ for its empirical mean and variance, respectively, the reconstruction loss over the masked set $\mathcal{M}$ is

$$\mathcal{L}_{\mathrm{recon}}=\frac{1}{|\mathcal{M}|}\sum_{(b,i)\in\mathcal{M}}\left\|\hat{\mathbf{p}}_{b,i}-\frac{\mathbf{p}_{b,i}-\mu_{b,i}\mathbf{1}}{\sqrt{\sigma^{2}_{b,i}+\epsilon_{\mathrm{r}}}}\right\|_{2}^{2}, \tag{9}$$

where $\epsilon_{\mathrm{r}}$ is a small stability constant. By cancelling per-patch amplitude, this loss forces the encoder to represent the inter-patch fading structure and correlations rather than the sample-level power that dominates the raw signal. This also removes a trivial shortcut in which the network minimizes the loss by predicting patch means in raw space. However, dividing out per-patch amplitude also discards the large-scale fading signature (path loss and shadowing) that certain downstream tasks rely on. This motivates the auxiliary scale loss introduced next.
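A minimal NumPy sketch of (9) is given below. The array layout `(B, P, D_p)`, the boolean `masked` array standing in for $\mathcal{M}$, and the default value of $\epsilon_{\mathrm{r}}$ are assumptions made for illustration.

```python
import numpy as np

def patch_normalize(p, eps_r=1e-6):
    """Normalize each patch (last axis) by its own empirical mean and variance."""
    mu = p.mean(axis=-1, keepdims=True)
    var = p.var(axis=-1, keepdims=True)
    return (p - mu) / np.sqrt(var + eps_r)

def recon_loss(p_hat, p, masked, eps_r=1e-6):
    """Eq. (9): mean squared L2 error on masked patches against normalized targets.

    p_hat, p: (B, P, D_p) arrays; masked: boolean (B, P) array marking the set M.
    """
    target = patch_normalize(p, eps_r)
    sq_err = ((p_hat - target) ** 2).sum(axis=-1)   # squared L2 norm per patch
    return sq_err[masked].mean()                    # average over |M| entries
```

Scaling a patch by a large constant leaves its normalized target essentially unchanged (up to the effect of $\epsilon_{\mathrm{r}}$), which is how the loss becomes insensitive to path loss and shadowing.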

For each raw patch, we form the target

$$\mathbf{s}_{b,i}=\begin{pmatrix}\mu_{b,i}\\ \log(\sigma^{2}_{b,i}+\epsilon_{\mathrm{s}})\end{pmatrix}, \tag{10}$$

representing the variance in the log scale for numerical stability. We ask the model to predict $\mathbf{s}_{b,i}$ from both encoder and decoder token features. The encoder-side predictions $\hat{\mathbf{s}}^{\mathrm{enc}}_{b,i}$ are self-supervised on the visible set $\mathcal{K}$. Therefore, the encoder itself learns to carry large-scale statistics in its latent. The predictions at the decoder side, $\hat{\mathbf{s}}^{\mathrm{dec}}_{b,i}$, are self-supervised on the masked set $\mathcal{M}$. As a result, the reconstruction path also recovers the scale. Although the decoder-side term is supervised at the decoder, its gradient propagates back through the decoder into the encoder. Therefore, it additionally shapes encoder visible-token representations to encode scale information in a form the decoder can use to predict scales at masked positions. The two corresponding loss terms are

$$\begin{aligned}
\mathcal{L}_{\mathrm{scale,enc}}&=\frac{1}{|\mathcal{K}|}\sum_{(b,i)\in\mathcal{K}}\left\|\hat{\mathbf{s}}^{\mathrm{enc}}_{b,i}-\mathbf{s}_{b,i}\right\|_{2}^{2},\\
\mathcal{L}_{\mathrm{scale,dec}}&=\frac{1}{|\mathcal{M}|}\sum_{(b,i)\in\mathcal{M}}\left\|\hat{\mathbf{s}}^{\mathrm{dec}}_{b,i}-\mathbf{s}_{b,i}\right\|_{2}^{2},
\end{aligned} \tag{11}$$

and the full pretraining objective is

$$\mathcal{L}_{\mathrm{pretrain}}=\mathcal{L}_{\mathrm{recon}}+\lambda_{\mathrm{enc}}\mathcal{L}_{\mathrm{scale,enc}}+\lambda_{\mathrm{dec}}\mathcal{L}_{\mathrm{scale,dec}}. \tag{12}$$

Fig. [4](https://arxiv.org/html/2605.22856#S3.F4) summarizes the pretraining wiring. The pilot tensor feeds the FST encoder and JST decoder. Patch-normalized reconstruction predicts $\hat{\mathbf{p}}_{b,i}$ on $(b,i)\in\mathcal{M}$, while scale heads predict $\hat{\mathbf{s}}^{\mathrm{enc}}_{b,i}$ on $\mathcal{K}$ and $\hat{\mathbf{s}}^{\mathrm{dec}}_{b,i}$ on $\mathcal{M}$, matching the supervision sets in ([9](https://arxiv.org/html/2605.22856#S3.E9)), ([11](https://arxiv.org/html/2605.22856#S3.E11)), and ([12](https://arxiv.org/html/2605.22856#S3.E12)).
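The scale target (10), the two scale losses (11), and their combination (12) can be sketched as follows. This is an illustrative simplification: predictions are stored as full `(B, P, 2)` arrays and indexed with a boolean `visible` array, whereas a real encoder head would only produce outputs on $\mathcal{K}$. The value of $\epsilon_{\mathrm{s}}$ and any weights passed as `lam_enc` and `lam_dec` are hypothetical.

```python
import numpy as np

def scale_target(p, eps_s=1e-6):
    """Eq. (10): s = (mean, log(variance + eps_s)) per raw patch, shape (B, P, 2)."""
    mu = p.mean(axis=-1)
    log_var = np.log(p.var(axis=-1) + eps_s)
    return np.stack([mu, log_var], axis=-1)

def scale_loss(s_hat, s, subset):
    """Eq. (11): mean squared L2 error over the boolean (B, P) subset."""
    return ((s_hat - s) ** 2).sum(axis=-1)[subset].mean()

def pretrain_loss(l_recon, s_hat_enc, s_hat_dec, s, visible, lam_enc, lam_dec):
    """Eq. (12): encoder-side scale on visible patches K, decoder-side on masked M."""
    masked = ~visible
    return (l_recon
            + lam_enc * scale_loss(s_hat_enc, s, visible)
            + lam_dec * scale_loss(s_hat_dec, s, masked))
```

Because the variance enters through its logarithm, a 20 dB change in patch power shifts the second target component by a fixed additive amount rather than a factor of 100.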

![Refer to caption](https://arxiv.org/html/2605.22856v1/x3.png)
Figure 4: PilotWiMAE pretraining flow and loss groupings. Auxiliary scale heads attach to encoder and decoder features. Reconstruction and decoder-side scale use masked patches $\mathcal{M}$, while encoder-side scale uses visible patches $\mathcal{K}$.

### III-E Noise-robust pretraining curriculum

During pretraining, the sparse masked input is corrupted with AWGN while the reconstruction and scale targets are computed against the clean channel. This prepares the encoder for deployment, where it processes sparse pilot-pattern inputs under the corresponding noise regime.

More precisely, let $e\in\{0,\ldots,E-1\}$ index the pretraining epoch, let $s_{0}$ denote the initial lower bound for the SNR range in dB, and let $s_{\max}$ denote the upper bound in dB. The lower bound follows a cosine schedule that anneals from $s_{0}$ down to $0$ dB,

$$s_{\min}(e)=\frac{s_{0}}{2}\left(1+\cos\left(\frac{\pi e}{E-1}\right)\right), \tag{13}$$

so that early epochs concentrate on higher-SNR observations and later epochs progressively expose the model to lower SNRs, widening the pretraining distribution as the encoder stabilizes. Then, we sample

$$\mathrm{SNR}_{b,\mathrm{dB}}\sim\mathcal{U}[s_{\min}(e),s_{\max}]. \tag{14}$$

For each sample $b$, we measure the channel power $P_{b}$ on the kept patches that the encoder actually consumes. Then, the linear SNR sets the noise variance $\sigma^{2}_{b}=P_{b}/\mathrm{SNR}_{b,\mathrm{lin}}$. Circularly symmetric complex Gaussian noise of variance $\sigma^{2}_{b}$ is drawn independently and added to the visible elements only, while the reconstruction and scale losses are evaluated against statistics computed from the clean full-grid channel $\mathbf{H}_{b}$.
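The curriculum in (13)-(14) and the per-sample noise injection can be sketched as below. The function names, the element-level boolean `visible` array, and any numeric values for $s_{0}$ and $s_{\max}$ are assumptions for illustration and are not taken from the paper's configuration.

```python
import numpy as np

def snr_lower_bound(e, E, s0):
    """Eq. (13): cosine schedule from s0 dB at e = 0 down to 0 dB at e = E - 1."""
    return 0.5 * s0 * (1.0 + np.cos(np.pi * e / (E - 1)))

def add_awgn_to_visible(H, visible, e, E, s0, s_max, rng):
    """Corrupt only the visible elements of each clean sample with complex AWGN.

    H: complex array (B, ...); visible: boolean array of the same shape.
    Returns the noisy copy and the sampled per-sample SNRs in dB.
    """
    B = H.shape[0]
    snr_db = rng.uniform(snr_lower_bound(e, E, s0), s_max, size=B)   # Eq. (14)
    snr_lin = 10.0 ** (snr_db / 10.0)
    Y = H.copy()
    for b in range(B):
        P_b = np.mean(np.abs(H[b][visible[b]]) ** 2)   # power on kept elements only
        sigma2 = P_b / snr_lin[b]
        n = int(visible[b].sum())
        # Circularly symmetric noise: variance sigma2 / 2 per real component.
        noise = np.sqrt(sigma2 / 2.0) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
        Y[b][visible[b]] += noise
    return Y, snr_db
```

The losses would still be computed from the clean `H`, so the returned `Y` serves only as encoder input.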

### III-F Two-phase pretraining schedule

Masked reconstruction couples encoder representations to the paired decoder. When the decoder is small, gradients emphasize encoder-side integration of context from visible pilots. When it is large, the decoder can partly compensate for weaker latents, weakening pressure on the encoder to encode useful geometry.

##### Phase 1 (joint encoder and decoder) pretraining

We pretrain with a deliberately shallow JST decoder of $L_{\mathrm{JST}}$ layers and the full objective ([12](https://arxiv.org/html/2605.22856#S3.E12)). This makes encoder latents well-suited for downstream pilot-native tasks while the paired decoder stays intentionally capacity-limited.

##### Phase 2 (decoder-centric) pretraining

We discard the shallow decoder weights, freeze the encoder, attach a freshly initialized decoder with greater reconstruction capacity (for example more layers, feed-forward width, or attention heads), and train only this decoder using the reconstruction loss alone. As a result, encoder representations stay fixed while the new decoder absorbs the residual mapping from frozen tokens to dense channel patches. Procedure-wise, Phase 2 resembles downstream adaptation (fixed backbone, new head), but the objective remains masked self-supervised reconstruction under structured random masking rather than a supervised objective on a fixed pilot grid. The resulting reconstruction is therefore agnostic to the pilot pattern. This is fundamentally different from conventional channel estimation, where the decoder is supervised to learn a mapping from a fixed pilot pattern to full CSI.

The two-phase schedule produces a single pretrained encoder-decoder pair. Downstream tasks that do not require dense CSI consume only the encoder's features. On the other hand, applications that require accurate dense CSI additionally use the decoder output. Pretraining parameters appear in Section [IV-B](https://arxiv.org/html/2605.22856#S4.SS2) and Table [II](https://arxiv.org/html/2605.22856#S4.T2).

## IV Experiments

We pretrain PilotWiMAE on a ray-tracing channel dataset at $3.5$ GHz. Then, we evaluate transfer without task-specific fine-tuning on held-out test data. For cross-frequency beam selection and channel characterization, our evaluation includes in-distribution (ID) and out-of-distribution (OOD) settings. We cover frequency mismatch alone (ID, $3.5$ to $28$ GHz on pretraining cities) and combined frequency-plus-city mismatch (OOD, $3.5$ to $28$ GHz on the held-out city). For channel estimation, we report dense-channel recovery at $3.5$ GHz on the held-out city, isolating scene mismatch under matched carrier.

### IV-A Dataset

We create a ray-tracing channel dataset using our in-house generation pipeline CSIGen, built on Sionna [[7](https://arxiv.org/html/2605.22856#bib.bib42)], across urban deployment scenarios. Each scenario uses six base stations and includes both LoS and NLoS links, with channel tensors generated at $3.5$ GHz and $28$ GHz under a shared simulation protocol. The pretraining split uses Boston, New York City, San Francisco, and Chicago. For evaluation, we use unseen channels from these same cities for ID testing. We also use Los Angeles as a held-out city for OOD testing. Combined with the $3.5$ to $28$ GHz carrier shift, this setup isolates frequency-only transfer (ID) from frequency-plus-scene transfer (OOD). Per-city channel counts and scene sizes are summarized in Appendix [A](https://arxiv.org/html/2605.22856#A1). Table [I](https://arxiv.org/html/2605.22856#S4.T1) reports the Sionna-based generation parameters used across cities.

TABLE I: Sionna-based dataset generation parameters

For pilot-only inference and evaluation, we use a fixed pilot placement on the OFDM grid with time slot indices $\mathcal{T}_{\mathrm{p}}=\{2,11\}$ and subcarrier indices $\mathcal{F}_{\mathrm{p}}=\{0,1,2,3,8,9,10,11,16,17,18,19,24,25,26,27\}$, observing all $N_{\mathrm{h}}N_{\mathrm{v}}$ antennas at each pilot resource element. The time-domain placement follows 5G NR PDSCH DMRS Mapping Type A [[2](https://arxiv.org/html/2605.22856#bib.bib33)], with Symbol 2 as the front-loaded DMRS and Symbol 11 as the additional DMRS, while the frequency-domain placement uniformly tiles the band. This yields $|\mathcal{T}_{\mathrm{p}}|\times|\mathcal{F}_{\mathrm{p}}|=32$ pilots among $|\mathcal{T}|\times|\mathcal{F}|=448$ resource elements. The same mask is used across all downstream evaluations for a controlled comparison, with no pattern-specific fine-tuning. Fig. [12](https://arxiv.org/html/2605.22856#A2.F12) in Appendix [B](https://arxiv.org/html/2605.22856#A2) illustrates the pattern on the time-frequency grid.
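The pilot pattern above can be written down as a boolean mask over the time-frequency grid. The sketch assumes a grid of 14 OFDM symbols by 32 subcarriers: the text states only the product $|\mathcal{T}|\times|\mathcal{F}|=448$, so this factorization is an assumption chosen to be consistent with the listed indices.

```python
import numpy as np

T_P = [2, 11]                                                   # pilot symbol indices
F_P = [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27]  # pilot subcarriers
N_SYM, N_SC = 14, 32   # assumed grid shape; only the product 448 is given in the text

pilot_mask = np.zeros((N_SYM, N_SC), dtype=bool)
pilot_mask[np.ix_(T_P, F_P)] = True        # rectangular keep set: symbols x subcarriers

n_pilots = int(pilot_mask.sum())           # |T_p| * |F_p|
pilot_fraction = n_pilots / pilot_mask.size
```

Under this assumption the pattern observes 32 of 448 resource elements, about $7.1\%$ of the grid, and has the same rectangular structure as the pretraining keep sets.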

### IV-B Pretraining

We pretrain PilotWiMAE at $3.5$ GHz under the two-phase schedule of Section [III-F](https://arxiv.org/html/2605.22856#S3.SS6), with $500$ Phase 1 epochs followed by $200$ Phase 2 epochs. We instantiate the FST encoder with $L_{\mathrm{FST}}=3$ blocks and eight heads, the JST decoder with $L_{\mathrm{JST}}=2$ layers and four heads in Phase 1 and eight heads in Phase 2, and model dimension $d=128$ throughout. Phase 1 applies aggressive factorized masking ($n_{\mathrm{k}}=2$, $\rho_{\mathrm{k}}=0.1$, $\sim 99\%$ overall), while Phase 2 uses a milder budget ($n_{\mathrm{k}}=4$, $\rho_{\mathrm{k}}=0.75$, $\sim 79\%$ overall). The asymmetry follows from the different objectives of the two phases. Phase 1 shapes the encoder representation, where forcing prediction from very few visible tokens compels integration of long-range structure across the grid. In our experiments, this improves downstream accuracy as the mask ratio is pushed toward the aggressive regime. Phase 2, in contrast, shapes the decoder with the encoder frozen, where loosening the budget to $\sim 79\%$ supplies enough visible context for accurate dense recovery while still randomizing the kept set to preserve pilot-pattern agnosticity. Table [II](https://arxiv.org/html/2605.22856#S4.T2) lists the remaining architecture and optimization details.

TABLE II: Pretraining architecture and optimization

### IV-C Tasks

We evaluate three downstream tasks. Cross-frequency beam selection and channel characterization (LoS/NLoS classification) are evaluated under the cross-band transfer protocol ($3.5$ to $28$ GHz). Beam selection tests whether the learned representation preserves directional structure across bands, while channel characterization tests whether it preserves propagation-state semantics under frequency-dependent channel statistics. For both, we use frozen pretrained representations and k-Nearest Neighbors (kNN) evaluation without task-specific fine-tuning. For beam selection, labels are defined using a DFT codebook [[3](https://arxiv.org/html/2605.22856#bib.bib44)]. Channel estimation is evaluated at matched carrier ($3.5$ GHz) on the held-out city and reads out dense-channel reconstructions from the decoder.

#### IV-C1 Shared transfer protocol

Beam selection and channel characterization use the same frozen-feature transfer protocol:

- The representation encoder is pretrained at $3.5$ GHz and then frozen.
- Evaluation is performed at $28$ GHz without task-specific fine-tuning.
- Features are evaluated with a common kNN protocol for all compared methods (10 disjoint folds), using mean-pooled encoder representations.
- In each fold, kNN is fit on $90\%$ of samples and evaluated on the held-out $10\%$.
- We report mean and standard deviation over cross-validation folds.

We use kNN as the readout because it is non-parametric and adds no trainable parameters on top of the frozen encoder, so its accuracy reflects how well the learned feature geometry already groups channels by task-relevant similarity. Even a linear probe could raise the reported accuracy, but it adds an inductive bias of its own (a learned linear separator over the feature space) that can compensate for representations whose neighborhood structure is itself uninformative. By contrast, kNN succeeds only when nearby points in the learned representation also share the downstream label, which is a property that a good self-supervised representation is supposed to have. kNN is therefore the more direct measure of representation quality.

#### IV-C2 kNN details

We use $k=20$ with cosine-distance-based weighted voting [[16](https://arxiv.org/html/2605.22856#bib.bib45)]. Mean-pooled encoder features are passed to kNN without external feature normalization; only the L2 normalization implied by the cosine distance is applied. For all models and both full-channel and pilot-only inputs, mean pooling over encoded patches yields a $d=128$ representation.
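The readout described above can be approximated with a few lines of NumPy. This is a sketch under stated assumptions: the inverse-distance vote weights are one common choice and may differ from the weighting of the cited reference, and the function names, random fold assignment, and seed are hypothetical.

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, n_classes, k=20, eps=1e-12):
    """Cosine-distance kNN with distance-weighted voting (assumed 1/distance weights)."""
    # L2-normalize so that dot products equal cosine similarities.
    a = train_x / (np.linalg.norm(train_x, axis=1, keepdims=True) + eps)
    b = test_x / (np.linalg.norm(test_x, axis=1, keepdims=True) + eps)
    dist = np.maximum(1.0 - b @ a.T, 0.0)                  # cosine distance, clipped at 0
    nn = np.argpartition(dist, k - 1, axis=1)[:, :k]       # indices of the k nearest
    w = 1.0 / (np.take_along_axis(dist, nn, axis=1) + eps)
    votes = np.zeros((test_x.shape[0], n_classes))
    rows = np.arange(test_x.shape[0])[:, None]
    np.add.at(votes, (rows, train_y[nn]), w)               # accumulate weighted votes
    return votes.argmax(axis=1)

def cross_validate(x, y, n_classes, n_folds=10, k=20, seed=0):
    """Disjoint folds: fit on the other folds, evaluate on the held-out one."""
    idx = np.random.default_rng(seed).permutation(len(x))
    accs = []
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        pred = knn_predict(x[train], y[train], x[fold], n_classes, k)
        accs.append(np.mean(pred == y[fold]))
    return float(np.mean(accs)), float(np.std(accs))
```

Here `x` would hold the mean-pooled $d=128$ encoder features and `y` the integer task labels. A top-3 metric for beam selection would rank the `votes` array instead of taking only its argmax.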

#### IV-C3 Cross-frequency beam selection

Beam selection evaluates whether pretrained representations preserve the directional structure across frequency bands, with performance reported as top-3 beam-selection accuracy. For a uniform planar array (UPA) with $N_{\mathrm{h}}$ columns and $N_{\mathrm{v}}$ rows and half-wavelength spacing in both dimensions, let $K_{\mathrm{h}}$ and $K_{\mathrm{v}}$ denote the number of angular bins along horizontal and vertical axes. Along each axis, we use either an *oversampled* DFT grid (which subsumes the critically sampled grid as the special case $K_{\mathrm{h}}=N_{\mathrm{h}}$, $K_{\mathrm{v}}=N_{\mathrm{v}}$, yielding $M=N_{\mathrm{h}}N_{\mathrm{v}}$ codewords) or an *undersampled* grid:

$$\begin{aligned}
K_{\mathrm{h}}&=\begin{cases}O_{\mathrm{h}}N_{\mathrm{h}},&\text{oversampled }(O_{\mathrm{h}}\geq 1),\\ N_{\mathrm{h}}/U_{\mathrm{h}},&\text{undersampled }(U_{\mathrm{h}}\geq 1,\ U_{\mathrm{h}}\mid N_{\mathrm{h}}),\end{cases}\\
K_{\mathrm{v}}&=\begin{cases}O_{\mathrm{v}}N_{\mathrm{v}},&\text{oversampled }(O_{\mathrm{v}}\geq 1),\\ N_{\mathrm{v}}/U_{\mathrm{v}},&\text{undersampled }(U_{\mathrm{v}}\geq 1,\ U_{\mathrm{v}}\mid N_{\mathrm{v}}),\end{cases}
\end{aligned} \tag{15}$$

where $a\mid b$ denotes that $a$ divides $b$, and $O_{\mathrm{h}},O_{\mathrm{v}},U_{\mathrm{h}},U_{\mathrm{v}}$ are positive integers that define the over- and undersampling factors. Only one branch of ([15](https://arxiv.org/html/2605.22856#S4.E15)) is active per axis. Undersampled codebooks remain uniform in $[0,2\pi)$ on each ring. The total codebook size is $M=K_{\mathrm{h}}K_{\mathrm{v}}$, and codewords are Kronecker products of 1D steering vectors,

$$\begin{aligned}
[\mathbf{a}_{\mathrm{h}}(m_{\mathrm{h}})]_{n}&=\frac{1}{\sqrt{N_{\mathrm{h}}}}\exp\bigl(\mathrm{j}\,n\,\phi_{m_{\mathrm{h}}}\bigr),&\phi_{m_{\mathrm{h}}}&=\frac{2\pi\,m_{\mathrm{h}}}{K_{\mathrm{h}}},\quad n=0,\ldots,N_{\mathrm{h}}-1,\\
[\mathbf{a}_{\mathrm{v}}(m_{\mathrm{v}})]_{n}&=\frac{1}{\sqrt{N_{\mathrm{v}}}}\exp\bigl(\mathrm{j}\,n\,\psi_{m_{\mathrm{v}}}\bigr),&\psi_{m_{\mathrm{v}}}&=\frac{2\pi\,m_{\mathrm{v}}}{K_{\mathrm{v}}},\quad n=0,\ldots,N_{\mathrm{v}}-1,
\end{aligned} \tag{16}$$

$$\mathbf{w}_{m_{\mathrm{h}},m_{\mathrm{v}}}=\mathbf{a}_{\mathrm{v}}(m_{\mathrm{v}})\otimes\mathbf{a}_{\mathrm{h}}(m_{\mathrm{h}}), \tag{17}$$

with $m_{\mathrm{h}}\in\{0,\ldots,K_{\mathrm{h}}-1\}$, $m_{\mathrm{v}}\in\{0,\ldots,K_{\mathrm{v}}-1\}$, flat index $m=m_{\mathrm{v}}K_{\mathrm{h}}+m_{\mathrm{h}}$, $\mathbf{w}_{m}\equiv\mathbf{w}_{m_{\mathrm{h}},m_{\mathrm{v}}}$, and $\otimes$ denoting the Kronecker product. The $1/\sqrt{N_{\mathrm{h}}}$ and $1/\sqrt{N_{\mathrm{v}}}$ scaling in ([16](https://arxiv.org/html/2605.22856#S4.E16)) keeps $\|\mathbf{w}_{m}\|_{2}=1$ for any $(K_{\mathrm{h}},K_{\mathrm{v}})$. Given channel vectors $\mathbf{h}_{t,f}$, a single beam label per frame is assigned by maximizing average beam gain over subcarriers and slots:

$$m^{\star}=\arg\max_{m\in\{0,\ldots,M-1\}}\frac{1}{T\,N_{\mathrm{sc}}}\sum_{t=0}^{T-1}\sum_{f=0}^{N_{\mathrm{sc}}-1}\left|\mathbf{w}_{m}^{\mathrm{H}}\mathbf{h}_{t,f}\right|^{2}, \tag{18}$$

since the angular structure is approximately constant within an OFDM frame. Unless stated otherwise, we use $(N_{\mathrm{h}},N_{\mathrm{v}})=(8,4)$. The main beam-selection plots fix $M=128$, obtained from ([15](https://arxiv.org/html/2605.22856#S4.E15)) via $(O_{\mathrm{h}},O_{\mathrm{v}})=(2,2)$ on the oversampled branch. Fig. [5](https://arxiv.org/html/2605.22856#S4.F5) additionally sweeps other $M$ cardinalities, including the critically sampled construction $(O_{\mathrm{h}},O_{\mathrm{v}})=(1,1)$, undersampled $(U_{\mathrm{h}},U_{\mathrm{v}})$ constructions at small $M$, and larger oversampled $(O_{\mathrm{h}},O_{\mathrm{v}})$ at high $M$. The legend entries list the active factors per curve.
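The codebook construction in (16)-(17) and the labeling rule (18) can be sketched in NumPy as follows. The function names are hypothetical, and the sketch assumes that the antenna dimension of each channel vector is ordered to match the Kronecker product $\mathbf{a}_{\mathrm{v}}\otimes\mathbf{a}_{\mathrm{h}}$, with the vertical index varying slowest.

```python
import numpy as np

def steering(N, K):
    """Rows are the K unit-norm 1D steering vectors of length N from (16)."""
    n = np.arange(N)[None, :]
    m = np.arange(K)[:, None]
    return np.exp(1j * 2.0 * np.pi * m * n / K) / np.sqrt(N)   # shape (K, N)

def dft_codebook(Nh, Nv, Kh, Kv):
    """Codewords w = a_v(m_v) kron a_h(m_h), stacked at flat index m = m_v*Kh + m_h."""
    Ah, Av = steering(Nh, Kh), steering(Nv, Kv)
    W = np.einsum("vj,hi->vhji", Av, Ah)          # (Kv, Kh, Nv, Nh)
    return W.reshape(Kv * Kh, Nv * Nh)

def beam_label(W, H):
    """Eq. (18): H has shape (T, N_sc, Nv*Nh); returns the index of the best beam."""
    gains = np.abs(np.einsum("mn,tfn->mtf", W.conj(), H)) ** 2   # |w_m^H h_{t,f}|^2
    return int(gains.mean(axis=(1, 2)).argmax())

# (N_h, N_v) = (8, 4) with (O_h, O_v) = (2, 2) gives K_h = 16, K_v = 8, M = 128.
W = dft_codebook(8, 4, 16, 8)
```

A channel that is exactly aligned with one codeword at every slot and subcarrier receives that codeword's flat index as its label, since every other codeword has a strictly smaller inner product with it.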

#### IV-C4 Channel characterization

Channel characterization evaluates whether pretrained representations preserve propagation-state semantics under the same $3.5$ to $28$ GHz transfer. The binary LoS/NLoS label is the ray-tracer's geometric LoS indicator (whether the direct base-station-to-UE path is unobstructed) and is therefore carrier-independent. A given link carries the same label at $3.5$ GHz and $28$ GHz, while its small-scale and large-scale statistics differ across carriers. Performance is reported by LoS classification accuracy in both ID and OOD settings. The dataset details for different test cities are provided in Appendix [A](https://arxiv.org/html/2605.22856#A1).

#### IV-C5 Compared methods

##### Beam selection and channel characterization

For these two tasks, we compare the following methods:

- *Supervised baseline*: an FST encoder followed by a linear classification head, trained end-to-end on full-channel inputs under cross-entropy loss. For kNN evaluation, the classifier is discarded and mean-pooled encoder features are used.
- *Self-supervised JST baseline*: a JST encoder paired with a JST decoder of matching capacity, pretrained with patch-normalized reconstruction alone (no AWGN curriculum, no auxiliary scale heads).
- *PilotWiMAE ablations*: three reduced variants that share PilotWiMAE's FST encoder and Phase 1 JST decoder but toggle PilotWiMAE's two pretraining ingredients on and off. *FST* uses patch-normalized reconstruction alone. *FST+scale* adds the auxiliary encoder- and decoder-side scale heads. *FST+noise* adds the AWGN SNR curriculum.
- *PilotWiMAE* (also denoted *FST+noise+scale* in the figures): the full proposed method, combining the FST encoder, JST decoder, AWGN curriculum, and auxiliary scale heads.

All compared methods are pretrained, or trained (for the supervised baseline), on the same $3.5$ GHz pretraining split as PilotWiMAE (Section [IV-A](https://arxiv.org/html/2605.22856#S4.SS1)). To isolate the effect of the factorized inductive bias, the JST and FST encoders are matched in capacity. They share $d=128$, eight attention heads, and the same number of attention sublayers ($L_{\mathrm{FST}}=3$ FST blocks contribute six attention sublayers, equal to the six layers of the JST encoder), and consequently have the same number of trainable parameters. The self-supervised JST baseline pairs its encoder with a JST decoder of the same depth and heads as the PilotWiMAE Phase 1 decoder. Therefore, any gap between the JST baseline and the FST ablation reflects the attention factorization in the encoder rather than a difference in the choice of the decoder. The supervised baseline shares the same FST encoder configuration as PilotWiMAE and replaces the JST decoder with a task-specific linear head. Detailed configurations of the JST self-supervised baseline and the supervised baseline are provided in Appendix [D](https://arxiv.org/html/2605.22856#A4).

##### Channel estimation

For dense\-channel recovery, PilotWiMAE is compared against classical estimators and supervised encoder\-decoder baselines:

- Linear interpolation: a lightweight classical baseline\.
- *LMMSE practical*: a Kronecker\-structured LMMSE estimator whose second\-order channel statistics are estimated from the $3.5$ GHz pretraining split and evaluated on Los Angeles test channels at $3.5$ GHz, isolating scene mismatch under a matched carrier\.
- *LMMSE gold*: the same Kronecker structure with statistics computed on Los Angeles itself for both covariance estimation and evaluation, representing an oracle matched to the deployment scenario\.
- *Supervised FST*: the FST encoder configuration matched to PilotWiMAE paired with the same JST decoder used in PilotWiMAE’s Phase 2, supervised end\-to\-end against the full channel tensor with the fixed pilot pattern of Appendix [B](https://arxiv.org/html/2605.22856#A2) at the input\.
- *Supervised JST*: the same end\-to\-end setup as supervised FST but with a standard JST encoder, testing whether the factorized restriction remains expressive enough for the regression task of channel estimation\.

We adopt the Kronecker covariance model for both LMMSE references because it parallels the factorized inductive bias of the FST encoder and, equally important, makes the LMMSE computation feasible at our grid sizes\.
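To make the feasibility argument concrete, the following is a minimal sketch of an LMMSE estimator under a Kronecker covariance model. It assumes a two-axis grid (here called time and frequency) whose covariance factorizes as `kron(R_t, R_f)`, unit-gain pilots, and white noise. The function name, argument names, and the toy correlation values are illustrative assumptions and do not reproduce the exact estimator or axis grouping used in the paper. The point of the sketch is that the required covariance blocks are built from the two small factors, so the full grid-by-grid covariance matrix is never formed.

```python
import numpy as np

def kron_lmmse(y_p, pilot_t, pilot_f, R_t, R_f, noise_var):
    """LMMSE estimate of a (T, F) channel grid from P noisy pilot observations.

    Assumed model: E[H[t, f] * conj(H[t2, f2])] = R_t[t, t2] * R_f[f, f2],
    and y_p[i] = H[pilot_t[i], pilot_f[i]] + noise of variance noise_var.
    """
    # Pilot-to-pilot covariance, shape (P, P), built from the two factors.
    R_pp = R_t[np.ix_(pilot_t, pilot_t)] * R_f[np.ix_(pilot_f, pilot_f)]
    # Grid-to-pilot cross-covariance, shape (T, F, P).
    R_hp = R_t[:, pilot_t][:, None, :] * R_f[:, pilot_f][None, :, :]
    # Solve (R_pp + noise_var * I) w = y_p instead of inverting explicitly.
    w = np.linalg.solve(R_pp + noise_var * np.eye(len(y_p)), y_p)
    return R_hp @ w  # (T, F, P) @ (P,) -> (T, F)

# Toy usage with exponential correlation models (illustrative values only).
T, F = 4, 8
R_t = 0.95 ** np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
R_f = 0.80 ** np.abs(np.subtract.outer(np.arange(F), np.arange(F)))
pilot_t = np.array([0, 0, 2, 2])
pilot_f = np.array([1, 5, 3, 7])
rng = np.random.default_rng(0)
y_p = rng.standard_normal(4) + 1j * rng.standard_normal(4)
H_hat = kron_lmmse(y_p, pilot_t, pilot_f, R_t, R_f, noise_var=0.1)
```

In this sketch the "practical" and "gold" variants would differ only in which channels are used to estimate `R_t` and `R_f`; the estimator itself is unchanged.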

### IV\-D Results

Across beam selection and channel characterization, the factorized encoder family is consistently more robust than JST under both ID and OOD transfers, especially in lower\-SNR regimes\. Noise\-robust pretraining improves stability across SNR, and the auxiliary scale objective is most beneficial for channel characterization, where large\-scale fading cues are directly informative\. Its effect on beam selection is smaller\. Overall, PilotWiMAE, shown as FST\+noise\+scale in the figures, provides the best performance in Figs\. [6](https://arxiv.org/html/2605.22856#S4.F6)\-[9](https://arxiv.org/html/2605.22856#S4.F9), while pilot\-only inference remains competitive with full\-channel inputs despite using a substantially smaller observation space\. In channel estimation, the normalized mean squared error \(NMSE\) of dense\-channel recovery improves as the decoder depth increases\. PilotWiMAE achieves competitive performance against classical and supervised pilot\-pattern\-specific encoder\-decoder baselines despite being pretrained without commitment to a specific pilot pattern \(Figs\. [10](https://arxiv.org/html/2605.22856#S4.F10)\-[11](https://arxiv.org/html/2605.22856#S4.F11)\)\.

![Refer to caption](https://arxiv.org/html/2605.22856v1/x4.png)

Figure 5: OOD \(28 GHz, Los Angeles\): top\-3 beam\-selection accuracy versus SNR for PilotWiMAE \(FST\+noise\+scale\) using DFT codebooks and horizontal\-vertical Kronecker factorizations that match the codebook size $M$\.

#### IV\-D1 Cross\-frequency beam selection

In the Los Angeles OOD split at $28$ GHz, Fig\. [5](https://arxiv.org/html/2605.22856#S4.F5) reports the top\-3 accuracy versus SNR over a wide range of codebook sizes, including multiple horizontal\-vertical Kronecker constructions that yield the same codebook size $M$\. For each construction, we evaluate the same frozen encoder features from PilotWiMAE under pilot\-only and full\-channel inputs with an otherwise identical kNN readout\. The pilot\-versus\-full gap depends strongly on the SNR, typically widest at low SNR and tightening toward high SNR, whereas across codebook sizes and tilings we do not observe a simple monotone trend with $M$ alone\. In particular, at matched $M$, the horizontal\-vertical factorization itself shifts accuracy by several points, with constructions that allocate more bins to the horizontal axis \(e\.g\., $u_h<u_v$ on the undersampled branch or $o_h>o_v$ on the oversampled branch\) consistently outperforming their transposed counterparts\. This suggests that, given our scene setup, the representation captures horizontal directionality more sharply than vertical\. As a result, undersampling the vertical axis or oversampling the horizontal one preserves more of the discriminative angular structure at a fixed budget\. Head\-to\-head comparisons to supervised and self\-supervised baselines at a representative fine codebook \($M=128$\) are discussed next\.
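As an illustration of how two horizontal-vertical constructions can share the same codebook size, the sketch below builds Kronecker DFT codebooks for a planar array and ranks beams by gain. The 8 x 8 array, the per-axis beam counts, the antenna ordering, and all function names are hypothetical choices for illustration. The paper's array geometry and labeling procedure may differ.

```python
import numpy as np

def dft_codebook(n_ant, n_beams):
    """(n_ant, n_beams) DFT steering matrix with unit-norm columns."""
    n = np.arange(n_ant)[:, None]
    k = np.arange(n_beams)[None, :]
    return np.exp(2j * np.pi * n * k / n_beams) / np.sqrt(n_ant)

def kron_codebook(n_h, n_v, m_h, m_v):
    """Codebook of M = m_h * m_v beams for an n_h x n_v planar array."""
    return np.kron(dft_codebook(n_h, m_h), dft_codebook(n_v, m_v))

def top_k_beams(h, codebook, k=3):
    """Indices of the k beams with the largest gain |w^H h|^2, best first."""
    gains = np.abs(codebook.conj().T @ h) ** 2
    return np.argsort(gains)[::-1][:k]

# Two constructions with the same M = 128 on a hypothetical 8 x 8 array:
# more bins on the horizontal axis versus more bins on the vertical axis.
W_horizontal = kron_codebook(8, 8, m_h=16, m_v=8)
W_vertical = kron_codebook(8, 8, m_h=8, m_v=16)
```

Both codebooks have 128 columns, but they partition the angular domain differently, which is the degree of freedom the matched-$M$ comparison varies.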

![Refer to caption](https://arxiv.org/html/2605.22856v1/x5.png)

Figure 6: Cross\-frequency beam selection \(top\-3 accuracy\) in\-distribution at 28 GHz with codebook size $M=128$\.

![Refer to caption](https://arxiv.org/html/2605.22856v1/x6.png)

Figure 7: Cross\-frequency beam selection \(top\-3 accuracy\) out\-of\-distribution at 28 GHz with codebook size $M=128$\.

Figs\. [6](https://arxiv.org/html/2605.22856#S4.F6) and [7](https://arxiv.org/html/2605.22856#S4.F7) fix $M=128$ and compare PilotWiMAE with the supervised baseline, the self\-supervised JST encoder, and the intermediate FST ablations in the ID and OOD splits\. Several trends emerge\. First, low SNR is the differentiating regime\. The noise\-pretrained variants \(FST\+noise and FST\+noise\+scale\) dominate at low SNR by a wide margin\. This isolates noise\-robust pretraining as the dominant low\-SNR enabler, rather than the encoder architecture or the supervision signal alone\. Second, the supervised baseline exhibits a clear pilot\-versus\-full gap that opens at moderate SNR and persists across the sweep\. Every FST configuration shows curves that nearly coincide between full\-channel and pilot\-only readouts\. This contrast is the empirical signature of pilot\-native pretraining\. Because the supervised baseline shares the same encoder backbone as PilotWiMAE, the tight pilot\-versus\-full agreement reflects a property of the pretraining objective, not the encoder\. Third, JST trails every FST variant across the entire SNR range and on both splits, supporting the case for the factorized inductive bias under aggressive pretraining masking\. Fourth, adding the scale loss on top of noise is mildly negative for beam selection across the SNR range\. This is consistent with the joint takeaway that the scale objective benefits channel characterization more than beam selection\. The ID\-to\-OOD shift causes only a small absolute drop and preserves the relative ranking across methods, indicating that cross\-band transfer to the held\-out city does not introduce method\-dependent collapse\.

![Refer to caption](https://arxiv.org/html/2605.22856v1/x7.png)

Figure 8: Channel characterization \(LoS accuracy\) in\-distribution at 28 GHz\.

![Refer to caption](https://arxiv.org/html/2605.22856v1/x8.png)

Figure 9: Channel characterization \(LoS accuracy\) out\-of\-distribution at 28 GHz\.

#### IV\-D2 Channel characterization

Figs\. [8](https://arxiv.org/html/2605.22856#S4.F8) and [9](https://arxiv.org/html/2605.22856#S4.F9) report the LoS classification accuracy versus SNR on the ID and OOD splits, with the same set of methods as the beam\-selection plots\. First, the auxiliary scale loss is the dominant ablation, mirroring the beam\-selection picture in reverse\. The scale\-pretrained variants \(FST\+scale and FST\+noise\+scale\) lead at every SNR and on both splits, since the LoS/NLoS label depends on large\-scale fading statistics rather than on fine angular structure\. PilotWiMAE \(FST\+noise\+scale\) leads with the flattest profile, since the AWGN curriculum lifts the low\-SNR floor\. Second, the supervised baseline shows a much smaller pilot\-versus\-full gap than in beam selection\. This is because a label that depends on aggregate channel power can be recovered from sparse pilots nearly as well as from the dense grid\. Third, the ID\-to\-OOD shift is essentially invisible, i\.e\., all curves and their relative ranking are nearly unchanged on the Los Angeles split\.

#### IV\-D3 Channel estimation

We evaluate dense\-channel recovery in the Los Angeles OOD scenario at $3.5$ GHz for decoder depths $\{1,2,4,6,12\}$\. As depicted in Fig\. [10](https://arxiv.org/html/2605.22856#S4.F10), with increasing depth, NMSE consistently improves throughout the entire range of SNR\. The gains are modest at low SNR, where noise dominates, and become more pronounced at medium\-to\-high SNR\. This indicates that additional decoder capacity is most useful once pilot observations are sufficiently reliable\.
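For reference, one common way to compute NMSE in decibels is sketched below. It normalizes the squared error of each sample by that sample's channel energy and then averages over samples. This per-sample convention and the function name are assumptions; the paper may instead normalize the mean error by the mean energy.

```python
import numpy as np

def nmse_db(h_true, h_est):
    """NMSE in dB for batches of complex channels shaped (N, ...)."""
    axes = tuple(range(1, h_true.ndim))  # reduce over all non-batch axes
    err = np.sum(np.abs(h_est - h_true) ** 2, axis=axes)
    pwr = np.sum(np.abs(h_true) ** 2, axis=axes)
    return 10.0 * np.log10(np.mean(err / pwr))
```

Under this definition, an all-zero estimate yields 0 dB, and an estimate whose error has one hundredth of the channel energy yields -20 dB.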

![Refer to caption](https://arxiv.org/html/2605.22856v1/x9.png)

Figure 10: Los Angeles OOD channel estimation at $3.5$ GHz using a frozen FST\+noise\+scale encoder with decoder\-only pretraining\. Curves show NMSE versus SNR for decoder depths 1, 2, 4, 6, and 12\.

Fig\. [11](https://arxiv.org/html/2605.22856#S4.F11) compares the same dense\-channel recovery setting against the classical and supervised neural baselines introduced in Section [IV\-C](https://arxiv.org/html/2605.22856#S4.SS3)\. PilotWiMAE dominates the low\- to mid\-SNR regime by a wide margin\. At $0$ dB SNR, PilotWiMAE already reaches an NMSE that the next\-best method only attains around $5$ dB\. The lead persists up to around $20$ dB, where the supervised pilot\-pattern\-specific baselines catch up and outperform PilotWiMAE at higher SNR\. Note that PilotWiMAE’s decoder operates on a frozen, pilot\-pattern\-agnostic encoder representation\. On the other hand, the supervised baselines optimize the full encoder\-decoder end\-to\-end against a fixed pilot pattern, which provides more leverage once pilot observations are nearly noise\-free\. However, the decoder\-depth sweep in Fig\. [10](https://arxiv.org/html/2605.22856#S4.F10) shows that scaling the phase\-2 capacity to $L_{\mathrm{JST}}=12$ closes the high\-SNR gap, after which PilotWiMAE outperforms supervised baselines throughout the entire swept SNR range\. Overall, PilotWiMAE is the strongest dense\-channel recovery method in the deployment\-relevant SNR range\.

![Refer to caption](https://arxiv.org/html/2605.22856v1/x10.png)

Figure 11: Los Angeles OOD channel estimation at $3.5$ GHz: NMSE versus SNR for PilotWiMAE \(FST\), supervised encoder\-decoder baselines on the fixed pilot pattern, classical interpolation, and Kronecker LMMSE references \(*practical* versus *gold*\)\.

## V Computational Complexity

This section reports the training and inference cost of PilotWiMAE and the supervised baseline, profiled with the protocol implemented in our profiler\. All measurements use a single NVIDIA RTX A4000 GPU with PyTorch and mixed precision\. Training cost is measured at training batch size $B_{\mathrm{tr}}=256$ and is reported in Table [III](https://arxiv.org/html/2605.22856#S5.T3)\. Inference cost is measured at inference batch size $B_{\mathrm{inf}}=32$, averaged over $100$ timed repeats after warmup, and reported in Table [IV](https://arxiv.org/html/2605.22856#S5.T4) as amortized per\-sample latency; the amortization convention is described in Appendix [C](https://arxiv.org/html/2605.22856#A3)\.

TABLE III: Per\-epoch training cost on a single NVIDIA RTX A4000 at training batch size $B_{\mathrm{tr}}=256$\. FLOPs are reported in petaFLOPs \(P\)\.

##### Discussion

Several observations follow from these tables\. First, self\-supervised pretraining of either the FST or JST encoder is roughly three times cheaper per epoch than end\-to\-end supervised training of the same FST backbone\. This is because pretraining processes only the visible token subset while the supervised baseline always sees the full grid\. Second, in inference, the FST encoder benefits decisively from pilot\-only input\. Its per\-sample latency drops from $0.70$ ms in the full\-channel case to $0.15$ ms in the pilot\-only case, a $4.6\times$ reduction that is consistent with the reduced sequence length combined with the $\mathcal{O}(n_{\mathrm{t}}N_{\mathrm{sf}}(n_{\mathrm{t}}+N_{\mathrm{sf}})d)$ scaling of factorized attention\. Third, the JST encoder, whose attention scales as $\mathcal{O}((n_{\mathrm{t}}N_{\mathrm{sf}})^{2}d)$ and is therefore asymptotically more expensive than FST’s by a factor of $n_{\mathrm{t}}N_{\mathrm{sf}}/(n_{\mathrm{t}}+N_{\mathrm{sf}})$, is nevertheless faster than FST in the pilot\-only regime\. This is because after masking, its sequence length is small enough that the quadratic cost is no longer the bottleneck and the simpler block structure dominates\. However, JST is the slowest of all configurations on the full grid \($1.83$ ms per sample\), where its quadratic dependence on $n_{\mathrm{t}}N_{\mathrm{sf}}$ is exposed\.
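The asymptotic comparison above can be checked with a small worked example. The sketch counts only the leading attention term, ignoring constants, projections, and MLP layers. The token counts passed in are placeholders chosen for illustration and are not the paper's tokenization.

```python
def jst_attention_cost(n_t, n_sf, d):
    """Joint attention over all n_t * n_sf tokens: (n_t * n_sf)^2 * d."""
    n = n_t * n_sf
    return n * n * d

def fst_attention_cost(n_t, n_sf, d):
    """Factorized attention: temporal attention over n_sf sequences of
    length n_t, plus space-frequency attention over n_t sequences of
    length n_sf."""
    return n_sf * n_t**2 * d + n_t * n_sf**2 * d

# Hypothetical token counts (placeholders only).
n_t, n_sf, d = 8, 64, 128
ratio = jst_attention_cost(n_t, n_sf, d) / fst_attention_cost(n_t, n_sf, d)
# ratio equals n_t * n_sf / (n_t + n_sf), the factor quoted in the text.
```

Because the ratio shrinks with the token counts, a heavily masked input can bring the joint cost close to the factorized one, which is consistent with the pilot-only latency observation.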

TABLE IV: Inference cost on a single NVIDIA RTX A4000 at inference batch size $B_{\mathrm{inf}}=32$, averaged over $100$ timed repeats after warmup\.

## VI Conclusion

PilotWiMAE demonstrates that self\-supervised wireless channel representation learning can be both robust and deployment\-aware by design\. Pilot\-native inputs avoid unrealistic full\-CSI assumptions and factorized attention improves transfer under frequency shift, including to a held\-out city, while reducing inference burden\. With self\-supervised pretraining at 3\.5 GHz, the learned representations transfer to 28 GHz for beam selection and LoS/NLoS classification without task\-specific fine\-tuning and provide strong performance under both ID and OOD evaluations\. The ablations show that noise\-robust pretraining is key for low\-SNR stability, and that auxiliary scale supervision is particularly useful for channel\-state semantics\. These results suggest a practical recipe for future wireless representation learning through structure\-aware encoders with deployment\-matched corruption and physics\-aware pretraining objectives\. Extending this recipe to broader pilot patterns, frequencies, and system\-level latency profiling is a natural next step\. We release the PilotWiMAE pretrained weights and training pipeline, together with the CSIGen ray\-tracing channel\-generation tool and the channel datasets used in this work, to support reproducibility and future advances in self\-supervised wireless channel representation learning\.

## Appendix A Dataset details

For each city, Table [V](https://arxiv.org/html/2605.22856#A1.T5) reports the size of the scene, the number of channels in each split, and LoS prevalence of the test channels\. The pretraining cities supply the train split and serve as the ID test set, while the held\-out Los Angeles scene supplies the OOD test set\. We use $10\%$ of the training set to validate the model during pretraining\.

TABLE V: Per\-city scene size, channel counts, and LoS share\.

## Appendix B Inference pilot pattern

Fig\. [12](https://arxiv.org/html/2605.22856#A2.F12) visualizes the fixed pilot mask\.

![Refer to caption](https://arxiv.org/html/2605.22856v1/x11.png)

Figure 12: Visualization of the fixed pilot resource elements \(highlighted\) on the $14\times 32$ OFDM grid\.

## Appendix C Per\-sample latency convention

We report the inference cost per sample, obtained by dividing the per\-batch latency by $B_{\mathrm{inf}}$\. This amortized convention matches our profiling pipeline and is appropriate as a throughput\-style figure of merit in moderate\-to\-large batches, where the GPU is well\-utilized and the per\-batch latency scales approximately linearly with $B_{\mathrm{inf}}$\. At very small batch sizes, fixed kernel\-launch and memory\-traffic overhead become non\-negligible, so the latency at $B=1$ can exceed the amortized per\-sample latency\.
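The amortization convention can be sketched as a small timing helper. This is an illustrative stand-in, not the profiler used in the paper: the function name, the `run_batch` callable, and the warmup count are assumptions. When timing GPU work, `run_batch` would also need to block until the device has finished (for example by calling `torch.cuda.synchronize()` in PyTorch), otherwise the clock only measures kernel launch.

```python
import time

def amortized_latency_ms(run_batch, batch_size, repeats=100, warmup=10):
    """Per-sample latency in ms: mean per-batch time divided by batch size.

    run_batch: zero-argument callable that processes one full batch and
    returns only after the work has completed.
    """
    for _ in range(warmup):  # untimed warmup iterations
        run_batch()
    start = time.perf_counter()
    for _ in range(repeats):
        run_batch()
    per_batch_s = (time.perf_counter() - start) / repeats
    return 1e3 * per_batch_s / batch_size
```

Calling the helper with `batch_size=1` measures the unamortized case, where the fixed overheads described above are not spread across samples.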

## Appendix D Baseline training configurations

Table [VI](https://arxiv.org/html/2605.22856#A4.T6) reports the JST pretraining configuration used as the self\-supervised baseline\. Table [VII](https://arxiv.org/html/2605.22856#A4.T7) reports the supervised baseline configurations \(factorized encoder backbone with a linear classification head\)\.

TABLE VI: JST pretraining configuration\.

TABLE VII: Supervised baseline configurations\.