CSI-JEPA: Towards Foundation Representations for Ubiquitous Sensing with Minimal Supervision


Summary

CSI-JEPA is a self-supervised framework for learning reusable representations from unlabeled Wi-Fi channel state information, enabling label-efficient multi-task sensing. It achieves up to 98% label savings and outperforms supervised models.


# CSI-JEPA: Towards Foundation Representations for Ubiquitous Sensing with Minimal Supervision
Source: [https://arxiv.org/html/2605.14171](https://arxiv.org/html/2605.14171)
###### Abstract

Channel state information (CSI) provides a widely available sensing modality for human and environment perception, but existing CSI sensing models usually rely on task-specific supervised training and require substantial labeled data for each task, device, user, or environment. This limits their scalability in practical deployments where unlabeled CSI is abundant but labeled data is costly to collect. In this paper, we present CSI-JEPA, a self-supervised predictive representation learning framework for label-efficient, multi-task Wi-Fi sensing. CSI-JEPA learns reusable temporal-spectral representations from unlabeled CSI samples by predicting latent features of masked channel regions from visible context. To better match the physical structure of CSI, CSI-JEPA tokenizes channel-response amplitude windows along the time and subcarrier dimensions. It then introduces a channel variation-aware masking strategy that samples predictive targets from regions with stronger local temporal and subcarrier-domain variations. After pretraining, the encoder is frozen and used as a backbone, with lightweight task-specific adapters added for downstream sensing tasks. We evaluate CSI-JEPA on seven real-world Wi-Fi sensing tasks spanning diverse objectives and deployment settings. The results show that CSI-JEPA improves downstream sensing performance over competitive baselines, achieving up to 10.64 percentage points mean accuracy gain over a state-of-the-art supervised Transformer and matched-budget label savings of up to 98.0%.

## I. Introduction

Wi-Fi sensing has emerged as a promising paradigm for ubiquitous and device-free perception [[14](https://arxiv.org/html/2605.14171#bib.bib25), [4](https://arxiv.org/html/2605.14171#bib.bib4), [23](https://arxiv.org/html/2605.14171#bib.bib5)]. By analyzing wireless channel perturbations caused by human motion, breathing, body presence, and environmental changes, Wi-Fi systems can support a wide range of sensing applications without requiring targets to carry dedicated sensors [[33](https://arxiv.org/html/2605.14171#bib.bib3), [19](https://arxiv.org/html/2605.14171#bib.bib2)]. Channel state information (CSI), which captures fine-grained channel responses across OFDM subcarriers and packet time indices, is particularly attractive for wireless sensing because it can be readily obtained from commodity Wi-Fi devices, especially with the growing support from the emerging IEEE 802.11bf protocol [[27](https://arxiv.org/html/2605.14171#bib.bib24), [9](https://arxiv.org/html/2605.14171#bib.bib23)].

Despite this promise, practical Wi-Fi sensing systems still face a major scalability challenge. Most existing CSI-based sensing frameworks are trained in a supervised and task-specific manner [[15](https://arxiv.org/html/2605.14171#bib.bib30), [21](https://arxiv.org/html/2605.14171#bib.bib29), [6](https://arxiv.org/html/2605.14171#bib.bib27), [16](https://arxiv.org/html/2605.14171#bib.bib28), [31](https://arxiv.org/html/2605.14171#bib.bib26)]. For each sensing task, user group, device configuration, or environment, they often require a substantial amount of labeled CSI data to achieve reliable predictive performance. However, labeled CSI, i.e., CSI measurements paired with task-specific ground-truth annotations, is often costly to collect because it requires synchronized sensing activities and controlled experimental procedures [[35](https://arxiv.org/html/2605.14171#bib.bib20)]. In contrast, unlabeled CSI can be continuously and readily collected during normal Wi-Fi operation. This mismatch between abundant unlabeled CSI and scarce labeled data motivates learning a reusable CSI representation that can be adapted to downstream sensing tasks with minimal supervision.

Self-supervised representation learning provides a natural way to address this problem [[26](https://arxiv.org/html/2605.14171#bib.bib21)]. However, many existing self-supervised methods rely on consistency learning [[35](https://arxiv.org/html/2605.14171#bib.bib20)] or masked reconstruction [[39](https://arxiv.org/html/2605.14171#bib.bib32), [11](https://arxiv.org/html/2605.14171#bib.bib31)], which may emphasize input-level similarity or low-level CSI recovery rather than task-relevant channel dynamics. Recent advances in the joint-embedding predictive architecture (JEPA) learn representations by predicting target embeddings from context embeddings in latent space, rather than reconstructing raw inputs [[13](https://arxiv.org/html/2605.14171#bib.bib36)]. By shifting the pretext task from raw signal reconstruction to latent-space prediction, JEPA provides a promising way to learn reusable representations that are more aligned with downstream predictive objectives. However, directly applying JEPA to CSI-based sensing is non-trivial. CSI measurements form a structured temporal-subcarrier field, where the temporal axis reflects motion evolution and physiological dynamics, while the subcarrier axis reflects frequency-selective fading and multipath correlations. Therefore, the mask-and-predict strategy in JEPA should preserve and exploit this physical structure. A generic masking strategy may ignore informative channel variations, whereas overly aggressive time-only or subcarrier-only masking may remove too much information needed for predictive learning.

In this paper, we present CSI-JEPA, a self-supervised predictive representation learning framework tailored for label-efficient Wi-Fi sensing. CSI-JEPA tokenizes CSI amplitude windows into temporal-subcarrier patch tokens and learns reusable representations through masked latent prediction. To better match the physical structure of CSI, we introduce a channel variation-aware masking strategy that estimates local channel dynamics from temporal and subcarrier-domain variations and samples predictive target regions with stronger channel dynamics. Notably, the model is pretrained solely on unlabeled CSI samples without using task-specific labels. After pretraining, the encoder is frozen and transferred to multiple Wi-Fi sensing tasks using lightweight task-specific adapters. This design separates representation learning from task adaptation and allows the same pretrained backbone to support diverse sensing objectives with limited labeled supervision. In particular, CSI-JEPA provides a reusable predictive representation layer between PHY-layer CSI acquisition and application-layer sensing inference. By operating on CSI measurements available in existing Wi-Fi systems, CSI-JEPA can serve as a protocol-compatible sensing primitive for future integrated WLAN communication and sensing services envisioned by the IEEE 802.11bf protocol [[24](https://arxiv.org/html/2605.14171#bib.bib18)]. Its input modality is compatible with channel measurements obtained through standard Wi-Fi sounding and sensing procedures [[36](https://arxiv.org/html/2605.14171#bib.bib13), [30](https://arxiv.org/html/2605.14171#bib.bib12)], allowing routinely collected unlabeled CSI measurements to be reused for self-supervised representation learning. The main contributions of this paper are summarized as follows.

- We propose CSI-JEPA, the first joint-embedding predictive representation learning framework for label-efficient Wi-Fi sensing. CSI-JEPA learns reusable temporal-subcarrier representations from unlabeled CSI samples and transfers the frozen encoder to downstream sensing tasks with lightweight model adaptation.
- We introduce a channel variation-aware masking strategy tailored to unlabeled CSI data. Instead of uniformly sampling masked local channel segments, the proposed strategy estimates local temporal and subcarrier-domain channel variations and adaptively selects predictive target regions with potentially stronger dynamics.
- We conduct a comprehensive evaluation on seven real-world Wi-Fi sensing tasks under different labeled budgets. CSI-JEPA improves downstream sensing performance over raw-feature baselines, supervised Transformer training, and reconstruction-based self-supervised pretraining. Compared with the strongest Transformer model, CSI-JEPA achieves up to 10.64 percentage points (pp) mean accuracy gain and up to 14.38 pp mean F1 gain, while reducing the matched labeled budget by up to 98.0%.

## II. Related Works

**Supervised and multi-task wireless sensing.** Wireless channel measurements have been widely used in networking and communication systems, supporting tasks such as channel estimation [[10](https://arxiv.org/html/2605.14171#bib.bib8)], beam management [[17](https://arxiv.org/html/2605.14171#bib.bib9)], and radio map estimation [[22](https://arxiv.org/html/2605.14171#bib.bib10)]. In Wi-Fi sensing, channel-related measurements have further enabled device-free perception tasks such as localization [[15](https://arxiv.org/html/2605.14171#bib.bib30)], respiration monitoring [[21](https://arxiv.org/html/2605.14171#bib.bib29)], user identification [[6](https://arxiv.org/html/2605.14171#bib.bib27)], and proximity estimation [[31](https://arxiv.org/html/2605.14171#bib.bib26)]. Most existing CSI sensing systems rely on task-specific feature extraction or supervised deep learning models trained for a particular sensing objective, environment, device setup, or user group. Recent work has also started to explore unified multi-task sensing models. For instance, LLM4WM adapts a pretrained language model to multiple channel-associated communication tasks through multi-task adapters and MoE-LoRA [[20](https://arxiv.org/html/2605.14171#bib.bib19)]. MMSense adapts a vision-based foundation model for multi-modal and multi-task wireless sensing by integrating image, radar, LiDAR, and textual inputs for channel, human, and environment sensing tasks [[18](https://arxiv.org/html/2605.14171#bib.bib22)]. While these methods can achieve strong performance when sufficient labeled data are available, they often require costly data collection, device- and environment-specific calibration, and repeated retraining to adapt to each new deployment.

**Self-supervised learning models for CSI.** Self-supervised learning (SSL) has recently been explored as a way to reduce labeling requirements in wireless sensing and communication systems [[26](https://arxiv.org/html/2605.14171#bib.bib21)]. AutoFi learns transferable CSI representations from randomly collected unlabeled CSI samples using consistency-based SSL with mutual-information and geometric structural objectives, enabling few-shot human gesture and gait recognition [[35](https://arxiv.org/html/2605.14171#bib.bib20)]. AM-FM pretrains a Wi-Fi sensing foundation model on large-scale unlabeled CSI using a hybrid self-supervised objective that combines contrastive learning, masked reconstruction, and physics-informed autocorrelation prediction [[39](https://arxiv.org/html/2605.14171#bib.bib32)]. CSI-MAE applies masked autoencoder pretraining to complex CSI generated from 3GPP channel models, learning channel representations for channel extrapolation, channel feedback, and user positioning [[11](https://arxiv.org/html/2605.14171#bib.bib31)]. These works demonstrate the potential of self-supervised pretraining for learning transferable wireless representations. However, reconstruction-based objectives train the model to recover raw CSI values, which can make the learned representation sensitive to low-level amplitude variations or channel-specific details that may not always be the most discriminative factors for downstream sensing tasks.

**JEPA for wireless networks.** Different from traditional supervised learning methods that rely on task-specific labels and existing self-supervised CSI processing that often uses view consistency or raw reconstruction, JEPA learns representations by predicting target embeddings from context embeddings in latent space rather than reconstructing raw inputs [[13](https://arxiv.org/html/2605.14171#bib.bib36)]. It has since been extended to practical self-supervised representation learning across images [[1](https://arxiv.org/html/2605.14171#bib.bib38)], videos [[2](https://arxiv.org/html/2605.14171#bib.bib35)], and vision-language signals [[5](https://arxiv.org/html/2605.14171#bib.bib34)]. JEPA-style predictive representation learning has also started to appear in wireless network systems. WirelessJEPA [[7](https://arxiv.org/html/2605.14171#bib.bib1)] learns general-purpose representations from raw multi-antenna I/Q streams using masked latent prediction over antenna-time grids, and evaluates the learned encoder on RF-centric tasks such as modulation classification, AoA estimation, and RF fingerprinting. While raw I/Q signals provide fine-grained physical-layer information, collecting synchronized multi-antenna I/Q streams often requires dedicated radio hardware and controlled measurement setups. Recent work has also applied JEPA to CSI trajectory modeling by learning velocity-conditioned latent channel dynamics, where future channel-chart embeddings are predicted from current CSI representations and user velocity [[3](https://arxiv.org/html/2605.14171#bib.bib33)]. Follow-up work further structures these latent transitions using homomorphic world models with Lie algebra-based action operators to improve geometric consistency and compositional rollout prediction [[25](https://arxiv.org/html/2605.14171#bib.bib14)]. Beyond CSI and I/Q signals, JEPA-MSAC [[37](https://arxiv.org/html/2605.14171#bib.bib16)] applies temporal block-masked JEPA to multimodal sensing-assisted communications, using vision, radar, LiDAR, GPS, and RF beam-level power measurements to support localization, beam prediction, and RSSI prediction. Different from these prior efforts, our CSI-JEPA focuses on label-efficient Wi-Fi sensing from temporal-subcarrier CSI amplitude windows and introduces channel variation-aware target selection for JEPA pretraining, without requiring hard-to-obtain I/Q streams, specialized multimodal sensing hardware, explicit velocity measurements, trajectory-level supervision, or any position annotations.

## III. System Model and Problem Formulation

### III-A CSI-based Sensing System

We consider a general Wi-Fi sensing system consisting of wireless transmitters and receivers deployed in an indoor environment. A transmitter sends Wi-Fi packets to a receiver at regular or application-dependent intervals, and the receiver estimates CSI from received packets. CSI characterizes the complex channel response between the transmitter and receiver over multiple subcarriers and packet time indices. For systems with multiple antennas, receiver streams, or channel views, CSI can be represented as $\hat{H}_{c,k}(t)\in\mathbb{C}$, where $c$ denotes the channel or antenna stream index, $k$ denotes the subcarrier index, and $t$ denotes the packet time index.

In essence, user motion, human breathing, body presence, and environmental dynamics perturb wireless propagation paths. These perturbations change the temporal and spectral patterns of $\hat{H}_{c,k}(t)$, making CSI a useful signaling indicator for passive sensing and monitoring. Formally, the raw CSI is complex-valued:

$$\hat{H}_{c,k}(t)=A_{c,k}(t)\,e^{j\phi_{c,k}(t)},\tag{1}$$

where $A_{c,k}(t)$ is the amplitude and $\phi_{c,k}(t)$ is the phase. In practical commodity Wi-Fi systems, CSI phase can be unstable due to timing offsets, frequency synchronization errors, and hardware-dependent distortions. Therefore, it is common to use the amplitude component as the sensing model input:

$$A_{c,k}(t)=|\hat{H}_{c,k}(t)|.\tag{2}$$

This amplitude-only design avoids relying on device-specific phase calibration and provides a robust input representation for commodity Wi-Fi sensing [[34](https://arxiv.org/html/2605.14171#bib.bib11)] (when reliably calibrated phase is available, our proposed framework can be extended by treating phase or complex-valued CSI components as additional input channels). In this way, a sensing window is formed by aggregating CSI amplitudes over $T$ packet time indices and $K$ subcarriers. The resulting temporal-spectral CSI tensor is defined as

$$X\in\mathbb{R}^{C\times K\times T},\qquad X_{c,k,t}=A_{c,k}(t).\tag{3}$$

To reduce scale variations across devices and environments, each CSI window is independently standardized by subtracting its mean amplitude and dividing by its standard deviation.
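The amplitude extraction of Eq. (2), the window construction of Eq. (3), and the per-window standardization can be sketched in NumPy as follows; the helper name `csi_window` and the small epsilon guard in the denominator are our additions, not from the paper:

```python
import numpy as np

def csi_window(H: np.ndarray) -> np.ndarray:
    """Form a standardized amplitude window from complex CSI.

    H: complex array of shape (C, K, T) -- channel/antenna stream,
    subcarrier, and packet time index, matching Eq. (1).
    """
    X = np.abs(H)                          # amplitude A_{c,k}(t), Eq. (2)
    # Per-window standardization: subtract the window mean amplitude and
    # divide by its standard deviation (epsilon avoids division by zero).
    return (X - X.mean()) / (X.std() + 1e-8)

# Example: random complex CSI with C=2 streams, K=30 subcarriers, T=100 packets.
rng = np.random.default_rng(0)
H = rng.normal(size=(2, 30, 100)) + 1j * rng.normal(size=(2, 30, 100))
X = csi_window(H)                          # tensor of Eq. (3), standardized
```

Each window is standardized independently, so the same code applies per sample regardless of device or environment scale.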

This representation forms a natural temporal-spectral field. The temporal axis captures motion evolution and physiological dynamics, while the subcarrier axis captures frequency-selective fading and multipath correlation. Such structure motivates our temporal-spectral tokenization and channel variation-aware masking design as introduced in Sec. IV-B.

**Observations.** Fig. [1](https://arxiv.org/html/2605.14171#S3.F1) illustrates representative CSI amplitude examples from a wireless human activity sensing task, specifically fall and non-fall events, where the goal is to distinguish abrupt, high-dynamic body motions from normal daily activities based on channel responses. The two classes exhibit visibly different temporal-spectral patterns in the CSI heatmaps. In addition, the aggregated temporal and subcarrier profiles show distinct structures, suggesting that both dimensions provide useful and complementary sensing cues. This observation motivates modeling CSI as a structured temporal-spectral field for the masking policy in JEPA that is aware of channel variations along both time and subcarrier horizons.

![Refer to caption](https://arxiv.org/html/2605.14171v1/x1.png)

Figure 1: Illustrative CSI amplitude examples for Fall and Non-Fall samples. Left: temporal-spectral CSI heatmaps. Right: aggregated temporal and subcarrier profiles obtained by averaging over subcarriers and time, respectively. The two classes exhibit visibly different structures in both the heatmap and the aggregated one-dimensional views, suggesting that discriminative sensing cues exist along both temporal and subcarrier dimensions.

### III-B Wi-Fi Sensing Task and Adaptation Objective

We consider a set of Wi-Fi sensing tasks $\mathcal{M}$. Each task $m\in\mathcal{M}$ has a task-specific label space $\mathcal{Y}^{(m)}$. Given an input CSI tensor $X$, the objective is to predict

$$\hat{y}^{(m)}=\arg\max_{l\in\mathcal{Y}^{(m)}}P\big(y^{(m)}=l\mid X\big).\tag{4}$$
![Refer to caption](https://arxiv.org/html/2605.14171v1/x2.png)

Figure 2: Overview of CSI-JEPA. The framework performs self-supervised predictive pretraining on temporal-spectral CSI samples using channel variation-aware masking, an online encoder, a predictor, and an EMA target encoder. After pretraining, the encoder is frozen and adapted to downstream Wi-Fi sensing tasks with lightweight task-specific adapters.

In this work, we evaluate each downstream task independently by training a separate lightweight adapter on top of a frozen pretrained encoder. This protocol isolates the quality of the learned CSI representation and avoids confounding the evaluation with task-balancing choices in joint multi-task optimization.

Formally, our goal is to learn a reusable CSI encoder $f_{\theta}:X\mapsto Z$ from unlabeled samples, such that downstream tasks can be adapted with only a small labeled subset. For task $m$, let $\mathcal{S}^{l}_{m}$ denote the labeled subset used for downstream adaptation. Given a frozen pretrained encoder $f_{\theta}$, we train a lightweight adapter $a_{m}$ by solving

$$\min_{a_{m}}\;\mathbb{E}_{(X,y)\sim\mathcal{S}^{l}_{m}}\Big[\mathcal{L}_{m}\big(a_{m}(f_{\theta}(X)),y\big)\Big],\quad\text{where }\theta\text{ is frozen}.\tag{5}$$
This objective reflects the practical setting where unlabeled CSI can be obtained from normal Wi-Fi operation, while labeled CSI is expensive to collect. It also matches our frozen-encoder evaluation protocol, where downstream performance reflects the quality and reusability of the pretrained CSI representation.
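Under the protocol of Eq. (5), only the adapter is trained while the encoder parameters stay frozen. A minimal sketch, assuming features have already been extracted by the frozen encoder $f_{\theta}$; the softmax-regression layer here is an illustrative stand-in for the adapter $a_{m}$, and the embeddings are synthetic, not real CSI features:

```python
import numpy as np

# `feats` stands in for f_theta(X) from the frozen pretrained encoder.
# Two synthetic, linearly separable classes of 4-dim embeddings.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(0, 1, (20, 4)) + [ 2, 0, 0, 0],
                   rng.normal(0, 1, (20, 4)) + [-2, 0, 0, 0]])
labels = np.array([0] * 20 + [1] * 20)

def train_adapter(feats, labels, n_classes, lr=0.1, epochs=200):
    """Softmax-regression adapter trained by gradient descent on the small
    labeled subset S_m^l; the encoder parameters theta are never touched."""
    W = np.zeros((feats.shape[1], n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = feats @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * feats.T @ (p - onehot) / len(labels)  # cross-entropy gradient
    return W

W = train_adapter(feats, labels, n_classes=2)
acc = ((feats @ W).argmax(axis=1) == labels).mean()
```

The same frozen features can feed a separate adapter per task $m$, which is exactly what keeps adaptation lightweight.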

## IV. Proposed Channel Variation-Aware Predictive Representation Learning

To resolve the problem formulated in Sec. III-B, we propose CSI-JEPA, which consists of four components as shown in Fig. [2](https://arxiv.org/html/2605.14171#S3.F2). First, a temporal-spectral tokenizer converts CSI samples into patch tokens while preserving the two-dimensional structure over time and subcarriers. Second, a channel variation-aware masking module estimates local channel variation from temporal and subcarrier-domain changes and selects the most informative target regions on the patch grid. Third, an online encoder and a lightweight predictor infer the latent representations of masked target regions from visible context. Fourth, an exponential moving average (EMA) target encoder provides stable target representations for latent prediction. During the pretraining stage, task labels are ignored and the model is optimized only by a predictive latent loss. After pretraining, the online encoder is frozen and transferred to downstream sensing tasks with lightweight task-specific adapters.

### IV-A Temporal-Spectral Tokenization

CSI is not an unstructured vector. The temporal dimension reflects motion evolution and physiological dynamics, while the subcarrier dimension captures frequency-selective fading and multipath correlation. Therefore, we tokenize CSI along the temporal and subcarrier dimensions jointly.

Given an input sample $X\in\mathbb{R}^{C\times K\times T}$, we divide the subcarrier-time plane into non-overlapping patches of size $P_{K}\times P_{T}$. This produces an $N_{K}\times N_{T}$ patch grid, where $N_{K}=K/P_{K}$ and $N_{T}=T/P_{T}$.

Then, we use a convolutional patch embedder with kernel size and stride $(P_{K},P_{T})$ to extract non-overlapping temporal-spectral patch tokens:

$$Z_{0}=\mathrm{PatchEmbed}(X)\in\mathbb{R}^{N_{K}\times N_{T}\times D},\tag{6}$$

where $D$ is the token embedding dimension.

To preserve subcarrier and temporal locations, we add fixed two-dimensional sine-cosine positional embeddings:

$$\widetilde{Z}_{i,j}=Z_{0,i,j}+E^{\mathrm{pos}}_{i,j},\tag{7}$$

where $E^{\mathrm{pos}}_{i,j}$ denotes the fixed positional embedding for patch location $(i,j)$ on the subcarrier-time patch grid. The resulting patch grid is then flattened into a token sequence $Z=\{\tilde{z}_{1},\tilde{z}_{2},\ldots,\tilde{z}_{N}\}$, where $N=N_{K}N_{T}$.
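The tokenization pipeline of Eqs. (6)-(7) can be sketched as follows, with one simplifying assumption: the learned convolutional patch embedder is replaced by plain patch flattening, since both slice the plane with kernel size and stride $(P_K, P_T)$. The helper names `patchify` and `sincos_pos_embed` are ours, not from the paper:

```python
import numpy as np

def patchify(X, PK, PT):
    """Split X (C, K, T) into non-overlapping P_K x P_T patches and flatten
    each patch into a raw token (stand-in for the learned conv embedder)."""
    C, K, T = X.shape
    NK, NT = K // PK, T // PT
    patches = X.reshape(C, NK, PK, NT, PT).transpose(1, 3, 0, 2, 4)
    return patches.reshape(NK, NT, C * PK * PT)       # (N_K, N_T, D_raw)

def sincos_pos_embed(NK, NT, D):
    """Fixed 2-D sine-cosine positional embeddings (Eq. 7): half of the
    dimensions encode the subcarrier index i, half the temporal index j."""
    def enc_1d(pos, d):
        freqs = 1.0 / (10000 ** (np.arange(d // 2) / (d // 2)))
        ang = np.outer(pos, freqs)
        return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    ei = enc_1d(np.arange(NK), D // 2)                # (N_K, D/2)
    ej = enc_1d(np.arange(NT), D // 2)                # (N_T, D/2)
    return np.concatenate([np.repeat(ei[:, None], NT, 1),
                           np.repeat(ej[None, :], NK, 0)], axis=-1)

X = np.arange(2 * 8 * 12, dtype=float).reshape(2, 8, 12)
Z0 = patchify(X, PK=4, PT=3)              # grid 2 x 4, token dim 2*4*3 = 24
E = sincos_pos_embed(2, 4, Z0.shape[-1])  # E^pos_{i,j}, Eq. (7)
Z = (Z0 + E).reshape(-1, Z0.shape[-1])    # flatten to N = N_K * N_T tokens
```

In the paper's model the token dimension $D$ comes from the learned embedder rather than the raw patch size, but the grid bookkeeping is the same.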

### IV-B Channel Variation-Aware Masking

A key design question in CSI-based self-supervised learning is how to select masked target regions that are informative for representation learning while preserving sufficient global context. Uniform random block masking treats all temporal-spectral regions equally, but CSI measurements are not uniformly informative. Human motion and environmental changes often induce localized channel variations over both time and subcarrier horizons. In contrast, masking an entire temporal window or an entire subcarrier band can be overly destructive, since it may remove broad structural information that is needed for reliable learning.

As illustrated in Fig. [1](https://arxiv.org/html/2605.14171#S3.F1), informative sensing patterns exist along both temporal and subcarrier domains. This motivates a masking strategy that is explicitly aware of channel variations. We therefore introduce channel variation-aware temporal-spectral masking, which estimates a local channel-variation map from changes in CSI samples along the temporal and subcarrier dimensions and samples target blocks that cover regions with stronger channel dynamics. Specifically, the temporal variation reflects time-varying channel properties induced by motion and Doppler effects, while subcarrier-domain variation reflects frequency-selective fading, multipath propagation, and interference patterns. This makes the predictive pretext task focus on sensing-informative regions rather than arbitrary locations.

Given a normalized CSI tensor $X\in\mathbb{R}^{C\times K\times T}$, we first compute the channel variation along the temporal dimension as

$$V_{t}(t,k)=\frac{1}{C}\sum_{c=1}^{C}\left|X_{c,k,t}-X_{c,k,t-1}\right|,\tag{8}$$

and then derive the subcarrier-domain channel variation as

$$V_{k}(t,k)=\frac{1}{C}\sum_{c=1}^{C}\left|X_{c,k,t}-X_{c,k-1,t}\right|.\tag{9}$$

For those boundary positions, we set the missing differences to zero. Here, $V_{t}$ captures temporal channel dynamics, while $V_{k}$ captures local subcarrier-domain variation. The formulation supports multiple CSI channels or antenna views by averaging over the channel dimension $C$. In this way, the two variation maps are combined into a new channel-variation map:

$$M(t,k)=\lambda V_{t}(t,k)+(1-\lambda)V_{k}(t,k),\tag{10}$$

where $\lambda\in[0,1]$ balances temporal and subcarrier-domain contributions. Unless otherwise specified, we set $\lambda=0.5$.
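Eqs. (8)-(10) translate directly into NumPy; a minimal sketch (the helper name `variation_map` is ours, and boundary differences are zeroed as stated above):

```python
import numpy as np

def variation_map(X, lam=0.5):
    """Channel-variation map M(t,k) of Eqs. (8)-(10) for X of shape (C, K, T).
    Missing differences at the boundaries (t=0, k=0) are set to zero."""
    C, K, T = X.shape
    Vt = np.zeros((T, K))
    Vt[1:, :] = np.abs(np.diff(X, axis=2)).mean(axis=0).T   # Eq. (8)
    Vk = np.zeros((T, K))
    Vk[:, 1:] = np.abs(np.diff(X, axis=1)).mean(axis=0).T   # Eq. (9)
    return lam * Vt + (1 - lam) * Vk                        # Eq. (10)

# Toy example: one stream, K=4 subcarriers, T=6 packets, with a step
# change on subcarrier 2 starting at t=3.
X = np.zeros((1, 4, 6))
X[0, 2, 3:] = 1.0
M = variation_map(X)   # shape (T, K) = (6, 4)
```

The step produces a temporal spike at the transition packet and a persistent subcarrier-domain edge afterwards, so $M$ is largest exactly where both variations coincide.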

Since CSI-JEPA operates on temporal-spectral patches, we aggregate the raw channel-variation map onto the patch grid. Let the patchized CSI representation form an $N_{K}\times N_{T}$ grid, and let $\mathcal{P}_{i,j}$ denote the set of CSI entries covered by patch $(i,j)$, where $i$ and $j$ index the subcarrier and temporal patch locations, respectively. The patch-level variation score is then computed as

$$\widetilde{M}(i,j)=\frac{1}{|\mathcal{P}_{i,j}|}\sum_{(t,k)\in\mathcal{P}_{i,j}}M(t,k).\tag{11}$$
We determine the target block size $b_{K}\times b_{T}$ on the patch grid according to the predefined masking configuration. Given this block size, we enumerate all feasible rectangular blocks on the $N_{K}\times N_{T}$ patch grid. Let $\mathcal{B}_{a,b}$ denote the candidate block whose upper-left corner is $(a,b)$, and let $\Omega$ denote the set of all feasible block locations. We assign each candidate block a window-level variation score by averaging the patch-level scores inside the block:

$$R(a,b)=\frac{1}{|\mathcal{B}_{a,b}|}\sum_{(i,j)\in\mathcal{B}_{a,b}}\widetilde{M}(i,j).\tag{12}$$

The target block location is then sampled according to the normalized window-level variation score:

$$p(a,b)=\frac{R(a,b)+\epsilon}{\sum_{(u,v)\in\Omega}\left(R(u,v)+\epsilon\right)},\tag{13}$$

where $\epsilon$ is a small constant for numerical stability.

To avoid making the masking policy overly deterministic, we mix the variation-guided distribution with uniform block sampling:

$$p_{\mathrm{mask}}(a,b)=(1-\eta)\,p(a,b)+\eta\,\frac{1}{|\Omega|},\tag{14}$$

where $\eta$ is the exploration probability. After sampling a block location $(a,b)$, the target region is set to $\mathcal{T}=\mathcal{B}_{a,b}$, and the visible context becomes

$$\mathcal{C}=\mathcal{G}\setminus\mathcal{T},\tag{15}$$

where $\mathcal{G}$ is the full patch grid. This design preserves the advantages of block-based latent prediction while making the mask placement CSI-aware. It avoids the excessive information removal of full time-axis or full subcarrier-axis masking, and instead focuses the prediction task on locally informative windows where channel dynamics are strongest. As a result, the encoder is encouraged to model motion-sensitive temporal transitions jointly with correlated subcarrier-domain variations, which better matches the physical structure of CSI-based sensing than purely random or single-dimension masking.
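The block scoring and sampling of Eqs. (12)-(14) can be sketched as follows, assuming the patch-level scores $\widetilde{M}(i,j)$ of Eq. (11) are already given; the function name and the toy grid are illustrative, not from the paper:

```python
import numpy as np

def sample_target_block(M_patch, bK, bT, eta=0.1, eps=1e-6, rng=None):
    """Sample a target-block corner (a, b) on the N_K x N_T patch grid.

    M_patch: patch-level variation scores (Eq. 11), shape (N_K, N_T).
    Returns the sampled corner and the mixed distribution of Eq. (14);
    the target region T is then the bK x bT block at that corner, and the
    visible context is every other patch (Eq. 15)."""
    rng = rng or np.random.default_rng()
    NK, NT = M_patch.shape
    # Window-level score R(a,b): mean patch score inside each block (Eq. 12).
    R = np.zeros((NK - bK + 1, NT - bT + 1))
    for a in range(R.shape[0]):
        for b in range(R.shape[1]):
            R[a, b] = M_patch[a:a + bK, b:b + bT].mean()
    p = (R + eps) / (R + eps).sum()          # normalized score, Eq. (13)
    p_mask = (1 - eta) * p + eta / p.size    # mix with uniform, Eq. (14)
    flat = rng.choice(p_mask.size, p=p_mask.ravel())
    return divmod(int(flat), R.shape[1]), p_mask

# Example: a 4 x 6 patch grid with elevated variation in a 2 x 2 region.
M_patch = np.zeros((4, 6))
M_patch[1:3, 2:4] = 1.0
corner, p_mask = sample_target_block(M_patch, bK=2, bT=2, eta=0.1,
                                     rng=np.random.default_rng(0))
```

With $\eta>0$ every feasible block keeps nonzero probability, so the high-variation region is favored but never the only possible target.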

### IV-C Predictive Latent Pretraining

Given the tokenized CSI sample $Z$ and the target region $\mathcal{T}$ sampled by channel variation-aware masking (Eq. (14)), CSI-JEPA learns by predicting the latent representations of masked target patches from visible context patches. The visible context is denoted by $Z_{\mathcal{C}}=\{\tilde{z}_{i}\mid i\in\mathcal{C}\}$, where $\mathcal{C}$ denotes the visible context patch set.

**Online and target encoders.** The online encoder maps visible context tokens into latent representations:

$$\mathbf{h}_{\mathcal{C}}=f_{\theta}(Z_{\mathcal{C}}).\tag{16}$$

This encoder can be instantiated by any temporal-spectral backbone. In this work, we use a lightweight Vision Transformer (ViT)-style encoder [[8](https://arxiv.org/html/2605.14171#bib.bib17)] to model long-range dependencies among visible temporal-spectral tokens through multi-head self-attention.

On the other hand, the target encoder has the same architecture as the online encoder but uses parameters $\xi$ updated by EMA:

$$\xi\leftarrow\mu\,\xi+(1-\mu)\,\theta,\tag{17}$$

where $\mu$ is the momentum coefficient. Given the full token sequence $Z$, the target encoder produces stop-gradient latent targets for the masked region:

$$\mathbf{h}_{\mathcal{T}}=\mathrm{sg}\big(f_{\xi}(Z)_{\mathcal{T}}\big),\qquad(18)$$

where $f_{\xi}(Z)_{\mathcal{T}}$ denotes the target encoder outputs indexed by the masked target set $\mathcal{T}$, and $\mathrm{sg}(\cdot)$ denotes the stop-gradient operation.
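The EMA update of Eq. (17) is a plain parameter-wise interpolation; a minimal sketch, with parameters represented as a list of NumPy arrays for illustration:

```python
import numpy as np

def ema_update(xi, theta, mu=0.996):
    """Eq. (17): xi <- mu * xi + (1 - mu) * theta, applied to each
    parameter tensor. The target encoder receives no gradients and only
    tracks the online parameters theta through this moving average."""
    return [mu * x + (1.0 - mu) * t for x, t in zip(xi, theta)]
```

In a deep-learning framework this runs under a no-gradient context after each optimizer step, which is what makes the targets in Eq. (18) stop-gradient by construction.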

**Predictor.** The predictor maps the context representation into the target latent space. Since target tokens are not visible to the online encoder, the predictor receives the context representation together with the positional information of the target patches:

$$\hat{\mathbf{h}}_{\mathcal{T}}=q_{\phi}\big(\mathbf{h}_{\mathcal{C}},E^{\mathrm{pos}}_{\mathcal{T}}\big),\qquad(19)$$

where $E^{\mathrm{pos}}_{\mathcal{T}}$ denotes the positional embeddings of the target patches. The predictor is intentionally lightweight compared with the encoder, so that the encoder is encouraged to learn reusable CSI representations instead of relying on a high-capacity prediction head.
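The interface of Eq. (19) is the key point: the predictor sees only the context representations and the target positions, never the target tokens themselves. A toy stand-in illustrating that interface (the actual $q_\phi$ is a lightweight Transformer-style module; the pooling and all weight matrices here are hypothetical):

```python
import numpy as np

def predict_targets(h_context, pos_target, W_ctx, W_pos, W_out):
    """Toy stand-in for q_phi in Eq. (19): each target query combines its
    positional embedding with (here, mean-pooled) visible context, then a
    linear head maps it into the target latent space. Illustrative only."""
    ctx = h_context.mean(axis=0)                # (D,) pooled visible context
    queries = pos_target @ W_pos + ctx @ W_ctx  # (num_targets, D) queries
    return np.tanh(queries) @ W_out             # predicted latents for T
```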

**Learning objective.** CSI-JEPA minimizes the discrepancy between the predicted target representations and the EMA target representations:

$$\mathcal{L}_{\mathrm{JEPA}}=\frac{1}{|\mathcal{T}|}\sum_{i\in\mathcal{T}}\ell\big(\hat{h}_i,\mathrm{sg}(h_i)\big),\qquad(20)$$

where $\hat{h}_i$ is the predicted target embedding, $h_i$ is the corresponding EMA target embedding, $\mathrm{sg}(\cdot)$ denotes stop-gradient, and $\ell(\cdot,\cdot)$ is the smooth $L_1$ loss. For a predicted target embedding $\hat{h}_i$ and its stop-gradient target embedding $\mathrm{sg}(h_i)$, the loss is defined element-wise as

$$\ell\big(\hat{h}_i,\mathrm{sg}(h_i)\big)=\frac{1}{D}\sum_{j=1}^{D}\begin{cases}\frac{1}{2}r_{i,j}^{2},&|r_{i,j}|<1,\\ |r_{i,j}|-\frac{1}{2},&\text{otherwise},\end{cases}\qquad(21)$$

where $D$ is the embedding dimension and

$$r_{i,j}=\hat{h}_{i,j}-\mathrm{sg}(h_{i,j}).\qquad(22)$$

The online encoder parameters $\theta$ and predictor parameters $\phi$ are updated by backpropagation, while the target encoder parameters $\xi$ are updated only through EMA.
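Eqs. (20)-(22) together amount to a standard smooth $L_1$ penalty on the residuals, averaged over embedding dimensions and target patches; a direct NumPy transcription:

```python
import numpy as np

def jepa_loss(pred, target):
    """Eqs. (20)-(22): smooth L1 between predicted and (stop-gradient)
    target embeddings, each of shape (num_targets, D), averaged over the
    D dimensions and the |T| targets. In a real framework, stop-gradient
    would be applied to `target` before calling this."""
    r = pred - target                                              # Eq. (22)
    elem = np.where(np.abs(r) < 1, 0.5 * r ** 2, np.abs(r) - 0.5)  # Eq. (21)
    return elem.mean()                                             # Eq. (20)
```

The quadratic branch keeps gradients small near zero residual, while the linear branch bounds the influence of large residuals, which makes latent prediction less sensitive to outlier targets than a pure squared loss.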

We pretrain CSI\-JEPA on heterogeneous unlabeled CSI samples from multiple sensing tasks:

$$\mathcal{U}=\bigcup_{m\in\mathcal{M}}\mathcal{U}^{(m)}.\qquad(23)$$

No task labels are needed during pretraining, and the same JEPA objective is applied to all unlabeled CSI samples. This design encourages the encoder to learn task-agnostic CSI primitives, including temporal continuity, activity periodicity, cross-subcarrier consistency, and other reusable patterns shared across sensing objectives.

### IV-D Downstream Adaptation

After the self-supervised pretraining stage, we transfer the online encoder $f_{\theta}$ to downstream Wi-Fi sensing tasks. To directly evaluate the quality of the learned representation and reduce the cost of task adaptation, we freeze the pretrained encoder for all downstream evaluations. Only a lightweight task-specific adapter is trained, using a limited number of labeled samples for each task.

For a labeled CSI sample $X$, the frozen encoder produces token representations $\mathbf{h}=f_{\theta}(X)=\{h_1,h_2,\ldots,h_N\}$. Since each downstream label is assigned to the entire CSI window, we aggregate token representations into a sample-level embedding by average pooling:

$$r=\frac{1}{N}\sum_{i=1}^{N}h_i.\qquad(24)$$
For each downstream task $m$, we train a separate adapter $a_m$ on top of the frozen representation using the labeled subset $\mathcal{S}_m^{l}$:

$$\min_{a_m}\ \mathbb{E}_{(X,y)\sim\mathcal{S}_m^{l}}\big[\mathcal{L}_{\mathrm{CE}}\big(a_m(r),y\big)\big],\qquad\theta\ \text{frozen}.\qquad(25)$$

This independent adaptation protocol is used for all downstream tasks to isolate the quality of the pretrained CSI representation.

We evaluate two frozen\-encoder adaptation settings\. The first setting uses a linear classifier on top of the mean\-pooled representation, which measures whether the pretrained features are linearly separable for downstream sensing tasks\. The second setting uses a lightweight two\-layer MLP adapter, which evaluates whether task\-relevant sensing information can be extracted by a small nonlinear head without updating the encoder\. For both settings, we vary the number of labeled samples used to train the adapter\. This allows us to evaluate label efficiency under limited supervision while keeping the pretrained encoder fixed\. Compared with full fine\-tuning, this frozen\-encoder protocol is computationally efficient and more directly reflects the reusability of the learned CSI representation as reported in Sec\. V\.
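In the first setting, the adapter of Eq. (25) is just a linear classifier over the mean-pooled features of Eq. (24). A minimal NumPy sketch of that linear-probe case (softmax regression trained by gradient descent; the optimizer and hyperparameters here are illustrative, not the paper's AdamW setup):

```python
import numpy as np

def linear_probe_fit(features, labels, n_classes, lr=0.5, epochs=200):
    """Train a linear adapter on frozen, mean-pooled embeddings (Eqs. 24-25).
    Only the adapter weights W, b are updated; the encoder never changes.

    features: (B, D) pooled embeddings r; labels: (B,) integer class ids.
    """
    B, D = features.shape
    W = np.zeros((D, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / B                       # d(CE)/d(logits)
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b
```

The MLP-probe setting replaces this linear map with a small two-layer network, but the protocol is identical: the frozen encoder supplies `features`, and only the head is optimized.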

## V Experiments and Evaluation

### V-A Experimental Setup

TABLE I: Summary of the seven CSI-Bench sensing tasks.

#### Dataset

We evaluate CSI-JEPA on seven Wi-Fi sensing tasks from CSI-Bench [[38]](https://arxiv.org/html/2605.14171#bib.bib37). The benchmark includes four individually defined tasks and three shared-data tasks, as shown in Table [I](https://arxiv.org/html/2605.14171#S5.T1). For self-supervised pretraining, we use only valid unlabeled CSI samples from the training splits of the seven tasks, resulting in approximately 151K CSI windows. No validation or test samples are used during pretraining, and all task labels are ignored, ensuring strict separation between self-supervised pretraining and downstream evaluation. To evaluate label efficiency, we train downstream adapters using label budgets $\{10, 100, 500, 1000, B_{\max}\}$, where $B_{\max}$ denotes the task-specific maximum number of labeled training samples. Because CSI-Bench has imbalanced task labels and different training-set sizes, we use the full training split when it contains fewer than 10k samples and otherwise cap the maximum budget at 10k labels. For each budget, labeled samples are selected only from the corresponding training split, while the validation and test sets remain fixed according to the CSI-Bench splits. The validation split is used for model selection, and accuracy and weighted F1-score are reported on the held-out test split.

#### Implementation Details

All methods use amplitude-only CSI inputs. Each CSI sample is independently normalized and standardized to a fixed input size. The model architecture is summarized in Table [II](https://arxiv.org/html/2605.14171#S5.T2). We pretrain CSI-JEPA for 20 epochs using AdamW with learning rate $10^{-4}$ and weight decay $10^{-5}$. The target encoder is updated by exponential moving average, with the momentum linearly increased from 0.996 to 1.0 during pretraining. The exploration probability in channel variation-aware masking is set to $\eta=0.3$. We save checkpoints during pretraining and use the 15-epoch checkpoint in the main experiments, since downstream performance becomes stable after a few pretraining epochs in our validation study. For downstream adaptation, all adapters are trained with AdamW and batch size 32 for at most 20 epochs, with early stopping based on validation accuracy.
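The momentum schedule described above is a plain linear ramp; a sketch (whether the step granularity is per epoch or per iteration is an assumption here):

```python
def ema_momentum(step, total_steps, mu_start=0.996, mu_end=1.0):
    """Linearly increase the EMA momentum from mu_start to mu_end over
    pretraining, as described in Sec. V-A (granularity is illustrative)."""
    frac = step / max(total_steps - 1, 1)
    return mu_start + frac * (mu_end - mu_start)
```

Annealing the momentum toward 1.0 slows the target encoder's drift over time, which is a common stabilizer for EMA-target self-supervised methods.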

TABLE II: Implementation details of CSI-JEPA.
#### Baselines and Adaptation Protocol

We evaluate CSI\-JEPA from three perspectives: supervised baselines, an alternative SSL baseline, and CSI\-JEPA design variants for adaptation, backbone, and masking ablations\.

**Supervised baselines.** We compare CSI-JEPA with three models trained directly on labeled CSI data:

- **Raw linear** [[28]](https://arxiv.org/html/2605.14171#bib.bib41): a linear classifier trained on flattened CSI amplitude features.
- **Raw MLP** [[29]](https://arxiv.org/html/2605.14171#bib.bib39): a two-layer MLP trained from scratch on flattened CSI amplitude features.
- **Sup. Transformer** [[32]](https://arxiv.org/html/2605.14171#bib.bib40): a supervised Transformer encoder trained end-to-end on labeled CSI samples.

These baselines evaluate whether self\-supervised pretraining can improve downstream sensing performance compared with direct supervised training, especially under limited label budgets\.

**SSL baseline based on CSI-MAE [[11]](https://arxiv.org/html/2605.14171#bib.bib31).** We implement a CSI-MAE-style masked reconstruction baseline to compare CSI-JEPA with reconstruction-based SSL. For a fair comparison, this baseline uses the same temporal-subcarrier tokenizer, patch size, ViT encoder scale, unlabeled pretraining corpus, and pretraining budget as CSI-JEPA, but replaces latent target prediction with masked CSI amplitude reconstruction. Detailed adaptation of CSI-MAE to our Wi-Fi sensing task setting is described in Sec. [V-C](https://arxiv.org/html/2605.14171#S5.SS3).

![Refer to caption](https://arxiv.org/html/2605.14171v1/x3.png)
![Refer to caption](https://arxiv.org/html/2605.14171v1/x4.png)

Figure 3: Accuracy and weighted F1-score under different label budgets on four individually defined CSI sensing tasks.

**CSI-JEPA adaptation variants.** After pretraining, the CSI-JEPA encoder is frozen for all downstream tasks. We evaluate two lightweight task-specific adapters:

- **CSI-JEPA linear probe:** a linear classifier trained on the mean-pooled frozen representation.
- **CSI-JEPA MLP probe:** a lightweight MLP adapter trained on the mean-pooled frozen representation.

**CSI-JEPA backbone variants.** To study the effect of encoder architecture, we compare three backbone variants under the same JEPA pretraining protocol:

- **ViT backbone** [[8]](https://arxiv.org/html/2605.14171#bib.bib17): directly tokenizes CSI amplitude patches and processes them with a Transformer encoder.
- **CNN backbone** [[12]](https://arxiv.org/html/2605.14171#bib.bib15): uses convolutional layers to extract local temporal-subcarrier features from CSI heatmaps.
- **CNN-ViT backbone:** first applies a CNN front-end for local feature extraction, followed by a Transformer encoder for global temporal-subcarrier modeling.

This comparison evaluates whether local convolutional inductive bias, global self\-attention, or their combination is more effective for JEPA\-based CSI representation learning\.

**Masking variants.** To study the effect of target selection during pretraining, we compare four masking strategies:

- **Time-based masking:** masks a contiguous temporal region across all subcarriers.
- **Subcarrier-based masking:** masks a contiguous subcarrier band across all time steps.
- **Temporal-spectral masking:** uniformly samples a contiguous rectangular block on the temporal-subcarrier patch grid.
- **Channel variation-aware masking:** samples a rectangular target block according to channel variation scores.

### V-B Label-Efficient Downstream Adaptation

Fig\.[3](https://arxiv.org/html/2605.14171#S5.F3)reports the accuracy and weighted F1\-score on four CSI\-based sensing tasks under different label budgets\. Across all tasks, CSI\-JEPA consistently improves over the raw\-feature baselines, showing that predictive pretraining learns reusable CSI representations beyond directly fitting classifiers on flattened CSI amplitudes\. The advantage is especially clear underlimitedsupervision\. With only 100 or 500 labeled samples, CSI\-JEPA with an MLP probe already achieves strong performance on all four sensing tasks\.

Compared with the supervised Transformer trained from scratch, CSI\-JEPA shows more stable performance across label budgets\. For example, on Breathing Detection and Motion Source Recognition, the supervised Transformer exhibits noticeable fluctuations when the labeled budget is small, whereas CSI\-JEPA improves more smoothly as the number of labels increases\. This suggests that self\-supervised predictive pretraining provides a stronger representation for downstream sensing than directly training a Transformer from limited labeled CSI data\. Among the two CSI\-JEPA adaptation variants, the MLP probe generally performs best, indicating that the frozen representation contains useful sensing information that can be effectively extracted by a lightweight adapter\. We also observe that CSI\-JEPA Linear is competitive, which further suggests that the learned representation is largely separable for downstream sensing tasks even without updating the backbone\.

To further quantify the trends in Fig. [3](https://arxiv.org/html/2605.14171#S5.F3), Table [III](https://arxiv.org/html/2605.14171#S5.T3) summarizes the label-efficiency gains of CSI-JEPA MLP over the supervised Transformer. We use the supervised Transformer as the reference baseline because it is the strongest supervised baseline, as validated in Fig. [3](https://arxiv.org/html/2605.14171#S5.F3). Accuracy and F1 gains ($\Delta$) are computed at the same labeling budget and reported in percentage points (pp). To provide a conservative summary of label efficiency, we report a sliding-reference matched budget pair. For each Transformer reference budget $b'$, we identify the smallest CSI-JEPA budget $b_J$ and the smallest Transformer budget $b_T$ such that both reach within $\epsilon=5$ pp of the Transformer performance at $b'$. Formally, for a performance metric $P$,

$$P_J(b_J)\geq P_T(b')-\epsilon,\qquad P_T(b_T)\geq P_T(b')-\epsilon.$$

The corresponding labeled data saving rate is defined as

$$\mathrm{SavingRate}(b')=1-\frac{b_J}{b_T}.$$

For each sensing task, we report the matched budget pair with the largest positive label saving rate.
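The matched-budget criterion can be computed directly from the per-budget metric curves; a sketch, where the budget-to-metric dictionaries are hypothetical inputs with metrics in percentage points:

```python
def matched_budget_saving(perf_jepa, perf_trans, ref_budget, eps=5.0):
    """Matched-budget label saving rate (Sec. V-B).

    perf_jepa / perf_trans: dicts mapping label budget -> metric (pp) for
    CSI-JEPA and the supervised Transformer. Finds the smallest budgets
    b_J, b_T whose performance reaches within eps of the Transformer level
    at the reference budget b', then returns 1 - b_J / b_T.
    Returns None if either side never reaches the matched level.
    """
    level = perf_trans[ref_budget] - eps
    b_j = min((b for b, p in perf_jepa.items() if p >= level), default=None)
    b_t = min((b for b, p in perf_trans.items() if p >= level), default=None)
    if b_j is None or b_t is None:
        return None
    return 1.0 - b_j / b_t
```

For example, if CSI-JEPA reaches the matched level at 500 labels while the Transformer needs 10,000, the saving rate is 0.95, matching the Breathing Detection case reported below.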

As shown in Table [III](https://arxiv.org/html/2605.14171#S5.T3), our CSI-JEPA achieves positive mean gains on most tasks, with particularly large improvements on Motion Source Recognition and Breathing Detection. Compared with the supervised Transformer at the same label budgets, CSI-JEPA improves sensing accuracy by 10.64 pp on Motion Source Recognition and 9.18 pp on Breathing Detection, and improves mean F1 by 14.38 pp and 12.55 pp, respectively. The maximum gains are even larger, reaching 45.13 pp in F1 on Motion Source Recognition.

TABLE III: Label-efficiency summary of CSI-JEPA MLP over the supervised Transformer on four single-task settings.

The conservative matched budget analysis further shows that CSI-JEPA can reach comparable Transformer performance with far fewer labeled samples on several sensing tasks. On Breathing Detection, CSI-JEPA MLP with 500 labels matches the Transformer performance level associated with 10,000 labels, corresponding to a saving rate of 95.0%. On Motion Source Recognition, CSI-JEPA MLP with 500 labels matches the Transformer level associated with 1,000 labels, corresponding to a saving rate of 50.0%. On Fall Detection, CSI-JEPA MLP reaches the matched Transformer performance level with only 10 labels compared with 500 labels for the supervised Transformer, corresponding to a saving rate of 98.0%. For Localization, no positive matched-budget saving is observed under this criterion, although CSI-JEPA still provides positive mean and maximum F1 gains. This may be because accurate localization relies more heavily on fine-grained spatial calibration and absolute channel characteristics, which are less transferable from learned representations and require more task-specific labeled data. Overall, these results indicate that CSI-JEPA can improve downstream sensing performance while reducing task-specific label requirements, especially on tasks where supervised learning remains far from saturation under limited labels.

TABLE IV: Comparison of different masking strategies across seven tasks. Results are reported as Acc/F1 (%) using the MLP evaluation head. Bold marks a best result with a >1 pp margin, and underline marks a best result with a ≤1 pp margin.

Fig. [4](https://arxiv.org/html/2605.14171#S5.F4) further evaluates CSI-JEPA on three shared-data CSI sensing tasks. The results show that CSI-JEPA MLP consistently achieves the best or near-best performance across different label budgets. On Proximity Recognition, CSI-JEPA provides clear gains in the low- and medium-label regimes compared with raw-feature baselines, showing that the pretrained representation remains effective across diverse downstream sensing objectives.

![Refer to caption](https://arxiv.org/html/2605.14171v1/x5.png)
![Refer to caption](https://arxiv.org/html/2605.14171v1/x6.png)

Figure 4: Accuracy and weighted F1-score under different label budgets on three shared-data tasks.
### V-C Comparison with Reconstruction-Based SSL

To compare CSI-JEPA with masked reconstruction-based self-supervised learning, we implement a CSI-MAE-style baseline from [[11]](https://arxiv.org/html/2605.14171#bib.bib31) within our benchmark setting. Specifically, we preserve its MAE-style masked reconstruction objective, use uniform random masking with a 75% masking ratio, and include a learnable [CLS] token in both the encoder and decoder. For downstream evaluation, we also follow the original CSI-MAE adaptation protocol by using the encoder [CLS] representation with a linear prediction head. Since the original CSI-MAE evaluates positioning as a regression task, it uses a linear regression head. In our Wi-Fi sensing benchmark, the downstream tasks are classification tasks, so we replace the linear regression head with a linear classification head while keeping the same linear-probe protocol.

Fig\.[5](https://arxiv.org/html/2605.14171#S5.F5)compares CSI\-JEPA with the CSI\-MAE\-style baseline on individually defined tasks and shared\-data tasks\. CSI\-JEPA with a linear probe outperforms CSI\-MAE on most tasks, with especially clear gains on Breathing Detection, Localization, Proximity Recognition, and User Identification\. This indicates that latent\-space predictive learning provides more linearly accessible representations than raw masked reconstruction for many CSI\-based sensing tasks\. CSI\-MAE slightly outperforms CSI\-JEPA Linear on Motion Source Recognition, suggesting that reconstruction\-based pretraining can still learn useful channel structure for some cases\. However, CSI\-JEPA with the MLP adapter achieves the best overall performance across all seven sensing tasks, showing that a lightweight nonlinear adapter can further extract task\-relevant information from the frozen predictive representation\.

![Refer to caption](https://arxiv.org/html/2605.14171v1/x7.png) (a) Individually defined sensing tasks.
![Refer to caption](https://arxiv.org/html/2605.14171v1/x8.png) (b) Shared-data sensing tasks.

Figure 5: Comparison with CSI-MAE under the 1000-label setting. CSI-MAE uses masked CSI reconstruction pretraining, while CSI-JEPA predicts latent target representations.

TABLE V: CSI-JEPA backbone ablation across seven tasks. All backbone variants use the same random temporal-spectral masking strategy. Results are reported as Acc/F1 (%).
### V-D Ablation Study on Masking and Backbone Design

Next, we investigate how the proposed channel variation-aware masking method and the encoder backbone affect representation quality and downstream sensing performance. Table [IV](https://arxiv.org/html/2605.14171#S5.T4) first compares four masking strategies under the same ViT backbone and MLP probing protocol. The proposed channel-aware masking achieves the best performance on Fall, MSR, HAR, and PR. Compared with random temporal-spectral masking, channel-aware masking achieves clear improvements on the three tasks where it obtains a >1 pp margin, with relative gains ranging from 1.5% to 6.7%. These results indicate that selecting target regions according to local channel variations provides a more informative predictive task than uniformly sampling target blocks.

On Localization and Breathing Detection, temporal-spectral masking already reaches near-saturated performance, leaving limited room for further improvement. This is consistent with the original CSI-Bench results in [[38]](https://arxiv.org/html/2605.14171#bib.bib37), where several supervised models also achieve near-perfect or perfect performance on Room-Level Localization under the official split. Overall, these results show that selecting prediction targets from high-variation CSI regions can substantially improve representation quality, especially when discriminative sensing cues are localized and heterogeneous across the temporal-subcarrier field.

![Refer to caption](https://arxiv.org/html/2605.14171v1/x9.png)

Figure 6: Effect of pretraining epochs on downstream adaptation. Results are averaged over the Fall, MSR, and Breath tasks.

Second, Table [V](https://arxiv.org/html/2605.14171#S5.T5) compares different encoder backbones under the same random temporal-spectral masking protocol. The ViT backbone achieves the best performance on most tasks, suggesting that direct self-attention over temporal-subcarrier patch tokens is more effective than CNN-based feature extraction for the pretraining stage. Although the CNN-only model uses multiple convolutional stages and produces the same $20\times 29$ token grid with 256-dimensional tokens, its convolutional inductive bias mainly emphasizes local patterns before global pooling and projection. This can be limiting for CSI-based sensing, where useful cues may involve long-range dependencies across time and subcarriers, such as motion-induced temporal evolution and frequency-selective multipath structures.

Interestingly, the CNN+ViT hybrid model improves over the CNN-only backbone on several tasks, but it still does not outperform the pure ViT backbone. This suggests that adding a shallow convolutional front-end does not necessarily improve JEPA-based CSI representation learning, and may distort fine-grained temporal-subcarrier information before the Transformer encoder models global dependencies. In contrast, the ViT backbone operates directly on temporal-subcarrier patch tokens and uses self-attention to model global relationships across the entire CSI window. Together with the masking ablation in Table [IV](https://arxiv.org/html/2605.14171#S5.T4), these results support the two main design choices of CSI-JEPA: 1) adopting a ViT-style encoder to capture global temporal-subcarrier dependencies, and 2) employing channel variation-aware masking to focus predictive learning on sensing-informative channel dynamics.

### V-E Pretraining Sensitivity and Computational Cost

#### Effect of pretraining epochs

We further evaluate CSI-JEPA checkpoints pretrained for different numbers of epochs using the MLP adapter with 1,000 labeled samples. Fig. [6](https://arxiv.org/html/2605.14171#S5.F6) shows that downstream performance improves rapidly during the first few pretraining epochs and stabilizes after about 5–7 epochs. Increasing pretraining from 7 to 20 epochs leads to only marginal changes, suggesting that CSI-JEPA can learn useful CSI representations with moderate self-supervised pretraining at the current dataset scale.

#### Computational Cost

Table[VI](https://arxiv.org/html/2605.14171#S5.T6)reports the computational cost on Fall Detection with 1,000 labeled samples, measured on a network server with an NVIDIA RTX A6000 GPU\. Training time is measured for downstream adapter training, and inference latency is measured per CSI window with batch size 1\. Compared with the supervised Transformer, CSI\-JEPA MLP updates only 0\.067M task\-specific adapter parameters, which is about 50×\\timesfewer trainable parameters\. Although the frozen encoder introduces an additional forward\-pass cost during adapter training and inference, the end\-to\-end latency remains at the millisecond level\. These results indicate that CSI\-JEPA does not trade label efficiency for prohibitive computational overhead, and remains practical for online sensing\.

TABLE VI: Computational cost on the Fall Detection task.

## VI Conclusion

This paper presented CSI\-JEPA as a step toward reusable foundation representations for Wi\-Fi sensing with minimal supervision\. Instead of learning a separate supervised model for each sensing task, CSI\-JEPA uses masked latent prediction to pretrain a frozen CSI encoder from unlabeled temporal\-spectral windows and adapts it to downstream tasks through lightweight heads\. The evaluation on seven real\-world CSI\-Bench tasks shows that this design improves low\-label adaptation over state\-of\-the\-art baselines, while achieving matched\-budget label savings of up to 98\.0% and maintaining ms\-level latency\. Future work will extend CSI\-JEPA toward more challenging cross\-user, cross\-device, and cross\-environment adaptation, as well as richer CSI modalities such as calibrated phase or multiple access point measurements\.

## References

- [1] M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas (2023) Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15619–15629.
- [2] A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y. LeCun, M. Assran, and N. Ballas (2023) V-JEPA: Latent video prediction for visual representation learning.
- [3] C. B. Chaaya, A. M. Girgis, and M. Bennis (2024) Learning latent wireless dynamics from channel state information. IEEE Wireless Communications Letters 14(2), pp. 489–493.
- [4] C. Chen, G. Zhou, and Y. Lin (2023) Cross-domain WiFi sensing with channel state information: A survey. ACM Computing Surveys 55(11), pp. 1–37.
- [5] D. Chen, M. Shukor, T. Moutakanni, W. Chung, J. Yu, T. Kasarla, Y. Bang, A. Bolourchi, Y. LeCun, and P. Fung (2025) VL-JEPA: Joint embedding predictive architecture for vision-language. arXiv preprint arXiv:2512.10942.
- [6] L. Cheng and J. Wang (2019) Walls have no ears: A non-intrusive WiFi-based user identification system for mobile devices. IEEE/ACM Transactions on Networking 27(1), pp. 245–257.
- [7] V. Chu, O. Mashaal, and H. Abou-Zeid (2026) WirelessJEPA: A multi-antenna foundation model using spatio-temporal wireless latent predictions. arXiv preprint arXiv:2601.20190.
- [8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- [9] R. Du, H. Hua, H. Xie, X. Song, Z. Lyu, M. Hu, Y. Xin, S. McCann, M. Montemurro, T. X. Han, et al. (2024) An overview on IEEE 802.11bf: WLAN sensing. IEEE Communications Surveys & Tutorials 27(1), pp. 184–217.
- [10] H. Feng, Y. Xu, and Y. Zhao (2024) Deep learning-based joint channel estimation and CSI feedback for RIS-assisted communications. IEEE Communications Letters 28(8), pp. 1860–1864.
- [11] J. Jiang, X. Ruan, and S. Xu (2026) CSI-MAE: A masked autoencoder-based channel foundation model. arXiv preprint arXiv:2601.03789.
- [12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), pp. 2278–2324. [Document](https://dx.doi.org/10.1109/5.726791)
- [13] Y. LeCun et al. (2022) A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review 62(1), pp. 1–62.
- [14] C. Li, Z. Cao, and Y. Liu (2021) Deep AI enabled ubiquitous wireless sensing: A survey. ACM Computing Surveys (CSUR) 54(2), pp. 1–35.
- [15] Z. Li, X. Luo, M. Chen, G. Li, and Y. Liu (2026) Beamforming feedback-driven wireless positioning: A transferable vision transformer approach. IEEE Transactions on Mobile Computing.
- [16] Z. Li, X. Luo, M. Chen, C. Xu, and Y. Liu (2025) BFMLoc: Transformer-based indoor positioning leveraging beamforming feedback matrices. In ICC 2025 - IEEE International Conference on Communications, pp. 6699–6704.
- [17] Z. Li, X. Luo, M. Chen, C. Xu, S. Mao, and Y. Liu (2025) Contextual combinatorial beam management via online probing for multiple access mmWave wireless networks. IEEE Journal on Selected Areas in Communications 43(3), pp. 959–972.
- [18] Z. Li, X. Luo, X. Ge, L. Zhou, X. Lin, and Y. Liu (2025) MMSense: Adapting vision-based foundation model for multi-task multi-modal wireless sensing. arXiv preprint arXiv:2511.12305.
- [19] J. Liu, H. Liu, Y. Chen, Y. Wang, and C. Wang (2019) Wireless sensing for human activity: A survey. IEEE Communications Surveys & Tutorials 22(3), pp. 1629–1645.
- [20] X. Liu, S. Gao, B. Liu, X. Cheng, and L. Yang (2025) LLM4WM: Adapting LLM for wireless multi-tasking. IEEE Transactions on Machine Learning in Communications and Networking.
- [21] X. Liu, J. Cao, S. Tang, J. Wen, and P. Guo (2015) Contactless respiration monitoring via off-the-shelf WiFi devices. IEEE Transactions on Mobile Computing 15(10), pp. 2466–2479.
- [22] X. Luo, Z. Li, Z. Peng, M. Chen, and Y. Liu (2025) Denoising diffusion probabilistic model for radio map estimation in generative wireless networks. IEEE Transactions on Cognitive Communications and Networking 11(2), pp. 751–763.
- [23] Y. Ma, G. Zhou, and S. Wang (2019) WiFi sensing with channel state information: A survey. ACM Computing Surveys (CSUR) 52(3), pp. 1–36.
- [24] F. Meneghello, C. Chen, C. Cordeiro, and F. Restuccia (2023) Toward integrated sensing and communications in IEEE 802.11bf Wi-Fi networks. IEEE Communications Magazine 61(7), pp. 128–133.
- [25] S. Naoumi, M. Bennis, and M. Chafii (2026) Structured latent dynamics in wireless CSI via homomorphic world models. arXiv preprint arXiv:2603.20048.
- [26] A. Y. Radwan, M. Yildirim, N. Hasanzadeh, H. Tabassum, and S. Valaee (2025) A tutorial-cum-survey on self-supervised learning for Wi-Fi sensing: Trends, challenges, and outlook. IEEE Communications Surveys & Tutorials.
- [27] T. Ropitault, C. R. da Silva, S. Blandino, A. Sahoo, N. Golmie, K. Yoon, C. Aldana, and C. Hu (2024) IEEE 802.11bf WLAN sensing procedure: Enabling the widespread adoption of WiFi sensing. IEEE Communications Standards Magazine 8(1), pp. 58–64.
- [28] F. Rosenblatt (1958) The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65(6), pp. 386.
- [29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986) Learning representations by back-propagating errors. Nature 323(6088), pp. 533–536.
- [30] A. Sahoo, T. Ropitault, S. Blandino, and N. Golmie (2024) Sensing performance of the IEEE 802.11bf protocol and its impact on data communication. In 2024 IEEE 100th Vehicular Technology Conference (VTC2024-Fall), pp. 1–7.
- [31] P. Sapiezynski, A. Stopczynski, D. K. Wind, J. Leskovec, and S. Lehmann (2017) Inferring person-to-person proximity using WiFi signals. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1(2), pp. 1–20.
- \[32\]A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin\(2017\)Attention is all you need\.Advances in neural information processing systems30\.Cited by:[3rd item](https://arxiv.org/html/2605.14171#S5.I1.i3.p1.1.1)\.
- \[33\]C\. Wu, B\. Wang, O\. C\. Au, and K\. R\. Liu\(2022\)Wi\-Fi can do more: Toward ubiquitous wireless sensing\.IEEE Communications Standards Magazine6\(2\),pp\. 42–49\.Cited by:[§I](https://arxiv.org/html/2605.14171#S1.p1.1)\.
- \[34\]J\. Yang, X\. Chen, H\. Zou, C\. X\. Lu, D\. Wang, S\. Sun, and L\. Xie\(2023\)SenseFi: A library and benchmark on deep\-learning\-empowered WiFi human sensing\.Patterns4\(3\)\.Cited by:[§III\-A](https://arxiv.org/html/2605.14171#S3.SS1.p2.5)\.
- \[35\]J\. Yang, X\. Chen, H\. Zou, D\. Wang, and L\. Xie\(2022\)AutoFi: Toward automatic Wi\-Fi human sensing via geometric self\-supervised learning\.IEEE Internet of Things Journal10\(8\),pp\. 7416–7425\.Cited by:[§I](https://arxiv.org/html/2605.14171#S1.p2.1),[§I](https://arxiv.org/html/2605.14171#S1.p3.1),[§II](https://arxiv.org/html/2605.14171#S2.p2.1)\.
- \[36\]E\. Yi, D\. Wu, J\. Xiong, F\. Zhang, K\. Niu, W\. Li, and D\. Zhang\(2024\)BFMSense: WiFi sensing using beamforming feedback matrix\.In21st USENIX Symposium on Networked Systems Design and Implementation \(NSDI 24\),pp\. 1697–1712\.Cited by:[§I](https://arxiv.org/html/2605.14171#S1.p4.1)\.
- \[37\]C\. Zheng, J\. He, G\. Cai, N\. Li, M\. Bennis, H\. Wymeersch, and M\. Debbah\(2026\)JEPA\-MSAC: A Joint\-Embedding Predictive Architecture for Multimodal Sensing\-Assisted Communications\.arXiv preprint arXiv:2603\.29796\.Cited by:[§II](https://arxiv.org/html/2605.14171#S2.p3.1)\.
- \[38\]G\. Zhu, Y\. Hu, W\. Gao, W\. Wang, B\. Wang, and K\. Liu\(2025\)CSI\-Bench: A Large\-Scale In\-the\-Wild Dataset for Multi\-task WiFi Sensing\.arXiv preprint arXiv:2505\.21866\.Cited by:[§V\-A](https://arxiv.org/html/2605.14171#S5.SS1.SSS0.Px1.p1.2),[§V\-D](https://arxiv.org/html/2605.14171#S5.SS4.p2.1)\.
- \[39\]G\. Zhu, Y\. Hu, S\. Jayaweera, W\. Gao, W\. Wang, J\. Zhang, B\. Wang, C\. Wu, and K\. Liu\(2026\)AM\-FM: A Foundation Model for Ambient Intelligence Through WiFi\.arXiv preprint arXiv:2602\.11200\.Cited by:[§I](https://arxiv.org/html/2605.14171#S1.p3.1),[§II](https://arxiv.org/html/2605.14171#S2.p2.1)\.
