Dywave: Event-Aligned Dynamic Tokenization for Heterogeneous IoT Sensing Signal

arXiv cs.LG Papers

Summary

Dywave is a dynamic tokenization framework for IoT sensing signals that uses wavelet-based hierarchical decomposition to align tokens with semantic events, achieving up to 12% higher accuracy and 75% reduction in input token length on five real-world datasets.

arXiv:2605.14014v1 Announce Type: new

Cached at: 05/15/26, 06:26 AM

# Event-Aligned Dynamic Tokenization for Heterogeneous IoT Sensing Signals
Source: [https://arxiv.org/html/2605.14014](https://arxiv.org/html/2605.14014)
Denizhan Kara, Jinyang Li, Hongjue Zhao, Yigong Hu, Yizhuo Chen, Xiaomin Ouyang, Shengzhong Liu, Tarek Abdelzaher

###### Abstract

Internet of Things \(IoT\) systems continuously collect heterogeneous sensing signals from ubiquitous sensors to support intelligent applications such as human activity analysis, emotion monitoring, and environmental perception\. These signals are inherently non\-stationary and multi\-scale, posing unique challenges for standard tokenization techniques\. This paper proposes Dywave, a dynamic tokenization framework for IoT sensing signals that constructs compact input representations aligned with intrinsic temporal structures and underlying physical events\. Dywave leverages wavelet\-based hierarchical decomposition, identifies meaningful temporal boundaries corresponding to underlying semantic events, and adaptively compresses redundant intervals while preserving temporal coherence\. Extensive evaluations on five real\-world IoT sensing datasets across activity recognition, stress assessment, and nearby object detection demonstrate that Dywave outperforms state\-of\-the\-art methods by up to 12% in accuracy, while improving computational efficiency by reducing input token lengths by up to 75% across mainstream sequence models\. Moreover, Dywave exhibits improved robustness to domain shifts and varying sequence lengths\.

Machine Learning, ICML

## 1 Introduction

Internet of Things \(IoT\) systems increasingly rely on continuous streams of sensing modalities, such as inertial measurement unit \(IMU\) for human activity recognition \(HAR\)\(Korany et al\.,[2019](https://arxiv.org/html/2605.14014#bib.bib31); Kawano et al\.,[2023](https://arxiv.org/html/2605.14014#bib.bib26); Wang et al\.,[2022](https://arxiv.org/html/2605.14014#bib.bib61)\), electrocardiograms \(ECG\) for healthcare\(Nath et al\.,[2023](https://arxiv.org/html/2605.14014#bib.bib39); Alharbi et al\.,[2023](https://arxiv.org/html/2605.14014#bib.bib2); Zakaria et al\.,[2023](https://arxiv.org/html/2605.14014#bib.bib70)\), and acoustic signals that increase environmental awareness to enhance human safety\(Wang et al\.,[2023b](https://arxiv.org/html/2605.14014#bib.bib65); Kim et al\.,[2023](https://arxiv.org/html/2605.14014#bib.bib27); Kimura et al\.,[2024](https://arxiv.org/html/2605.14014#bib.bib28)\)\. Learning from these diverse signals enables intelligent sensing applications that can perceive, understand, and respond to the physical world\(Baris et al\.,[2025](https://arxiv.org/html/2605.14014#bib.bib4)\)\.

Recent advances in language and vision highlight the central role of data tokenization in enabling large-scale generalizable models (Petrov et al., [2023](https://arxiv.org/html/2605.14014#bib.bib44); Bommasani et al., [2021](https://arxiv.org/html/2605.14014#bib.bib5)). In NLP, words and subwords form linguistically and statistically grounded discrete tokens (Sennrich et al., [2016](https://arxiv.org/html/2605.14014#bib.bib54); Kudo & Richardson, [2018](https://arxiv.org/html/2605.14014#bib.bib32)), whereas in vision, spatial patches serve as localized tokens aligned with model inductive biases (He et al., [2016](https://arxiv.org/html/2605.14014#bib.bib21); Dosovitskiy et al., [2020](https://arxiv.org/html/2605.14014#bib.bib14)). These tokenization schemes establish a shared representation interface for large-scale training, zero-shot transfer, and task generalization ([Ahia et al.,](https://arxiv.org/html/2605.14014#bib.bib1); Tao et al., [2024](https://arxiv.org/html/2605.14014#bib.bib57)). In contrast, raw signals in IoT sensing applications are often non-intuitive for humans and lack a similar natural notion of semantic units. Unlike words or images, signals appear as continuous waveforms that are temporally heterogeneous, with information encoded in the transitions between underlying physical events and in multiscale interactions, such as fast transients overlapping with slow contextual trends. The absence of appropriate atomic units for sensing signals creates a *tokenization gap*, forcing existing approaches to rely on uniform windows that partition signals into *patches* as input tokens for downstream backbones (Nie et al., [2023](https://arxiv.org/html/2605.14014#bib.bib41); Naghashi et al., [2025](https://arxiv.org/html/2605.14014#bib.bib38); Ekambaram et al., [2023](https://arxiv.org/html/2605.14014#bib.bib15)).
Here, we consider *uniform patching* as a method that partitions signals into fixed-size patches, either at a single temporal scale (Nie et al., [2023](https://arxiv.org/html/2605.14014#bib.bib41)) or across multiple resolutions (Naghashi et al., [2025](https://arxiv.org/html/2605.14014#bib.bib38); Wang et al., [2024](https://arxiv.org/html/2605.14014#bib.bib63)). Since these windows are defined a priori, their boundaries are inherently content-agnostic and often misaligned with the underlying signal dynamics.

![Refer to caption](https://arxiv.org/html/2605.14014v1/x1.png)
Figure 1: Ego4D (HAR) raw signal examples. Signal events are manually annotated with red bounding boxes.

Limitations: Although tokens corresponding to uniform windows offer a simple heuristic, they remain misaligned with the intrinsic dynamic structures of physical events, which rarely conform to uniform timescales. This misalignment fragments events and obscures their underlying semantics. For example, in human activity recognition using IMU signals, brief motion gestures (*e.g.*, waving) may occur within a second, while complex activities (*e.g.*, walking) can span tens of seconds and vary in intensity.

Moreover, real\-world signals exhibit highly irregular information density, with quiescent intervals alternating with short bursts of salient activity\. Uniform patching can generate abundant patches over long periods of redundancy that contribute little information, resulting in equal computation being allocated to both informative and noninformative regions, dramatic inflation of input length, and limited representativeness of critical transitions\.

Lastly, optimal hyperparameters \(*e\.g\.*, patch size and stride\) are application\-specific\. Smaller strides preserve fine\-grained details but explode sequence length and computation cost, whereas larger non\-overlapping patches can compress computation at the expense of semantic precision\. Performance fluctuates irregularly across these settings, revealing a lack of universally effective configurations and necessitating time\-consuming per\-domain tuning\.
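To make the patch-size and stride trade-off concrete, the following minimal sketch counts the tokens that uniform patching produces for a single channel. The function name `uniform_patches`, the signal length of 512, and the chosen settings are illustrative assumptions, not values from the paper.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def uniform_patches(x, patch_size, stride):
    """Split a 1-D signal into fixed-size patches taken every `stride` samples."""
    windows = sliding_window_view(x, patch_size)  # shape: (L - patch_size + 1, patch_size)
    return windows[::stride]                      # keep every `stride`-th window

signal = np.random.default_rng(0).standard_normal(512)  # synthetic stand-in signal
token_counts = {
    (p, s): uniform_patches(signal, p, s).shape[0]
    for p, s in [(16, 16), (16, 8), (16, 4), (64, 64)]
}
# The token count follows floor((L - patch_size) / stride) + 1.
for (p, s), n in token_counts.items():
    print(f"patch={p:3d} stride={s:3d} -> {n} tokens")
```

In this toy setting, shrinking the stride from 16 to 4 roughly quadruples the sequence length regardless of what the signal contains, which is the content-agnostic behaviour described above.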

Methodology. We argue that effective modeling of sensing signals requires rethinking *input tokenization* as a dynamic process rather than a preset, fixed heuristic. Instead of uniformly segmenting time-series and modifying the backbone architecture, we propose Dywave, a *dynamic, event-aligned tokenization* scheme that adapts to the intrinsic temporal structure of physical signals and is compatible with mainstream backbone encoders.

To achieve this, Dywave first extracts hierarchical embeddings by explicitly leveraging the scale-separated structure of physical events. These structures enable Dywave to exploit multi-resolution temporal patterns, rather than treating the signal as a homogeneous sequence.

Building on these embeddings, Dywave performs temporal anchor formation by estimating which timesteps are most salient and selecting anchors at semantic transitions that correspond to meaningful events.

Finally, Dywave applies dynamic temporal fusion, aggregating neighboring timesteps into anchor-aligned tokens through saliency-weighted pooling. This produces compact representations whose length adapts to semantic complexity rather than raw signal duration.

Evaluation: We evaluate Dywave on five real-world sensing datasets spanning diverse sampling rates, sequence lengths, and temporal dynamics. Dywave mitigates fragmentation and truncation by focusing on semantically meaningful intervals. Furthermore, we demonstrate how Dywave decomposes complex human activities into fine-grained micro-activity segments, revealing the underlying temporal structures of human-centric continuous sensor signals.

The contributions are summarized as follows:

- We identify the *tokenization gap* in physical sensing, highlighting the lack of human-intuitive semantic units.
- We propose Dywave, a dynamic tokenization module that converts raw signals into compact, event-aligned tokens.
- Extensive evaluation on real-world applications with a case study demonstrates Dywave’s superior downstream performance and fine-grained event decomposition.

## 2 Motivation and Design Principles

This section outlines challenges in sensing signal tokenization and introduces the design principles behind Dywave\.

### 2.1 Challenges and Motivations

Unlike words or objects in language and vision, sensing signals appear as continuous waveforms where event semantics are dispersed across time and encoded in transitions\. Existing methods segment these streams into predefined windows uniformly across time\. However, these segments rarely correspond to coherent physical events or human actions, as temporal dynamics are often irregular and context\-dependent, introducing unique challenges\.

![Refer to caption](https://arxiv.org/html/2605.14014v1/x2.png)
Figure 2: Overview of Dywave.

Signal Heterogeneity and Complexity. Real-world sensing data are extremely diverse across users, contexts, and modalities. Even when performing the same activity, different users produce signals that vary greatly in temporal structure and intensity. To illustrate this variability, Figure [1](https://arxiv.org/html/2605.14014#S1.F1) visualizes 30-second accelerometer samples from the Ego4D human activity recognition dataset (Grauman et al., [2022](https://arxiv.org/html/2605.14014#bib.bib20)), comparing signals of *cleaning* activity across users and time periods, as well as the *reading* activity. Even within the same activity, signal patterns differ significantly, both across users and across sessions of the same user. For instance, User 1’s two cleaning segments show distinct motion rhythms, while User 2’s cleaning activity exhibits sharper, more intense movements. Conversely, User 3’s reading activity is dominated by long stationary intervals with only brief bursts of motion at the beginning. The variations highlight the non-stationary, user-dependent nature of sensing data, where activity semantics are tightly coupled with individual context. Applying uniform patching under such diversity neglects signal-specific semantics, producing arbitrary segmentations that fail to align with true event boundaries or preserve coherence across similar activities.

Computation Efficiency\. Beyond representational granularity, a uniform window for patching also limits computational efficiency\. In ubiquitous computing systems with strict latency and energy constraints, uniform patching treats all regions equally, allocating identical computation to both dynamic and redundant segments\. As illustrated in Figure[1](https://arxiv.org/html/2605.14014#S1.F1), long and flat intervals of User 3’s reading activity are dominant\. Yet, uniform tokenization can expand the interval with minimal semantic content, unnecessarily inflating the input length and the computational cost\.

### 2.2 Design Principles for Sensing Signal Tokenization

The challenges above reveal the limitations of uniform patching for heterogeneous, non\-stationary sensing signals and underscore the need for a dynamic tokenization that moves beyond static, uniform windows\. From these observations, we derive two design principles for effectively tokenizing time\-series sensing signals\.

Principle 1: Physical grounding. Sensing signals originate from continuous physical processes, where semantics emerge from transitions between underlying physical states. Therefore, tokenization must preserve physical coherence to ensure each token corresponds to a distinct, meaningful physical event rather than an arbitrary temporal slice. Physically grounded representations should maintain the temporal continuity of event relationships to infer contextually consistent semantics for dynamic sensing signals.

Principle 2: Adaptivity across scales and domains. Sensing signals are multiscale and exhibit strong temporal heterogeneity, with rapid transient spikes coexisting with long, gradual dynamics. Hence, an effective tokenization strategy should balance temporal and semantic resolution by allocating finer granularity to information-dense events and coarser granularity to stable segments. Moreover, the tokenization strategy should adapt on a per-sample basis, rather than assuming homogeneous temporal structure across segments.

By adhering to these principles, a dynamic tokenization for sensing signals can transform raw sensor streams into structured, semantically aligned representations, enabling more robust and efficient downstream learning\.

## 3 Dywave Design

This section presents Dywave, a dynamic tokenization module that adapts to the signal’s underlying temporal structure\. We provide a detailed overview of Dywave in Figure[2](https://arxiv.org/html/2605.14014#S2.F2)\.

Problem Formulation: Let $X \in \mathbb{R}^{C \times L}$ denote a raw time-series segment, where $C$ is the number of channels and $L$ is the sequence length. The goal of Dywave is to transform $X$ into a compact sequence of patched tokens $E \in \mathbb{R}^{C \times L' \times d}$, where $L'$ depends on the sample semantics.

### 3.1 Hierarchical Embedding

Real-world time-series signals exhibit multi-granular structure across temporal and frequency scales. To capture this, we apply the *Maximal Overlap Discrete Wavelet Transform* (MODWT) (Percival & Walden, [2000](https://arxiv.org/html/2605.14014#bib.bib43)) to decompose raw inputs into multi-resolution time-frequency representations. For an input $X \in \mathbb{R}^{C \times L}$, MODWT yields:

$$\{dX_1, \ldots, dX_J, A\} = \text{MODWT}(X), \tag{1}$$

where $A \in \mathbb{R}^{C \times L}$ encodes long-term global trends, $dX_1 \in \mathbb{R}^{C \times L}$ captures the highest-frequency variations, and $\{dX_j\}_{j \geq 2}$ represent progressively slower oscillations. Since MODWT is undecimated, all components preserve the original sequence length $L$.
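As an illustration of the decomposition in Eq. (1), the sketch below implements an undecimated transform with the Haar filter pair and circular boundary handling. The choice of Haar, the function name `haar_modwt`, and the synthetic input are assumptions made for brevity; the text above does not state which wavelet family Dywave uses.

```python
import numpy as np

def haar_modwt(x, levels):
    """Undecimated Haar wavelet transform with circular boundary handling.

    x: array of shape (C, L). Returns (details, approx), where details is the
    list [dX_1, ..., dX_J] and every array keeps the shape (C, L).
    """
    approx = np.asarray(x, dtype=float)
    details = []
    for j in range(1, levels + 1):
        shift = 2 ** (j - 1)                        # filter taps spread apart at each level
        delayed = np.roll(approx, shift, axis=-1)   # delayed[t] = approx[t - shift] (circular)
        details.append(0.5 * (approx - delayed))    # band-pass detail dX_j
        approx = 0.5 * (approx + delayed)           # low-pass smooth passed to the next level
    return details, approx

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 128))    # C = 3 channels, L = 128 samples (synthetic)
details, A = haar_modwt(X, levels=4)
```

Every component keeps the length $L$, matching the undecimated property noted above. For the Haar filters specifically, the details and the final smooth also sum back to the input, which gives a quick sanity check.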

Next, we partition the components into detail and context streams and extract hierarchical embeddings that capture the fine\-grained transients and long\-range temporal structure\.

Detail Embedding. The detail stream captures localized, high-frequency variations. We project it using lightweight convolutional layers, which model short-range dependencies while preserving temporal alignment.

$$E^{U} = \text{DetailEncoder}(\{X, dX_1, \dots, dX_K\}). \tag{2}$$

Context Embedding. The context embedding encodes slow-varying, long-range patterns through a lightweight hourglass transformer as the context encoder, which performs downscaling, self-attention, and upscaling. The downsampling degree is adaptively selected to balance model capacity and computational cost.

$$E^{V} = \text{ContextEncoder}(\{dX_{K+1}, \dots, dX_J, A\}). \tag{3}$$

Embedding Fusion. We fuse detail and context embeddings along the feature dimension to form a unified hierarchical embedding $E^{F} = \text{Concat}(E^{U}, E^{V})$ that is temporally aligned and semantically enriched for downstream tasks.
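The downscale, attend, upscale pattern of the context encoder and the feature-wise fusion can be sketched as follows. This is a deliberately simplified stand-in: it uses average pooling, a single unparameterized attention step, and nearest-neighbour upsampling, and the random arrays replace real encoder inputs. None of these choices are taken from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hourglass_context(h, factor):
    """h: (L, d) features. Average-pool by `factor`, self-attend, upsample back to L."""
    L, d = h.shape
    pad = (-L) % factor                                        # pad so L divides evenly
    padded = np.pad(h, ((0, pad), (0, 0)), mode="edge")
    coarse = padded.reshape(-1, factor, d).mean(axis=1)        # (ceil(L / factor), d)
    attn = softmax(coarse @ coarse.T / np.sqrt(d), axis=-1)    # single head, no learned weights
    mixed = attn @ coarse
    return np.repeat(mixed, factor, axis=0)[:L]                # nearest-neighbour upsampling

rng = np.random.default_rng(0)
L, d = 100, 8
E_U = rng.standard_normal((L, d))       # stand-in for the detail embedding
ctx_in = rng.standard_normal((L, d))    # stand-in for projected context coefficients
E_V = hourglass_context(ctx_in, factor=8)
E_F = np.concatenate([E_U, E_V], axis=-1)   # fuse along the feature dimension
```

Attention runs on only `ceil(L / factor)` positions, so its quadratic cost is paid on the coarse sequence while the output stays aligned with the original timesteps.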

### 3.2 Temporal Anchor Formation

Rather than treating segmentation as an explicit objective, Dywave adaptively allocates temporal resolution based on intrinsic signal dynamics\. Given hierarchical embeddings that jointly encode fine\-grained transients and long\-range context, Dywave identifies regions where representational detail should be preserved at higher density, while compressing temporally redundant intervals\. As a result, patch boundaries emerge implicitly from variations in underlying physical dynamics, instead of being imposed by uniform temporal intervals or predefined heuristics\.

Temporal Event Saliency Estimation. To quantify where finer temporal resolution is warranted, Dywave estimates event saliency by measuring representational change between adjacent timesteps.

$$P_t = 1 - \text{sim}\big(F_k(E^{F}_{t-1}),\, F_q(E^{F}_{t})\big), \qquad t \in [2, L], \tag{4}$$

where $\text{sim}(\cdot)$ denotes cosine similarity and $F_k, F_q$ are linear layers that map $E^{F}$ to key and query.

Intuitively, continuous physical processes tend to produce slowly varying embeddings, resulting in low saliency scores, whereas genuine event transitions induce abrupt representational shifts that yield high saliency\.
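A small numerical sketch of Eq. (4) illustrates this intuition. The projection matrices `W_k` and `W_q` stand in for the learned linear layers $F_k$ and $F_q$; here they are set to the identity, which is an assumption made only so the toy example is easy to read.

```python
import numpy as np

def event_saliency(E_F, W_k, W_q, eps=1e-8):
    """P_t = 1 - cos(F_k(E_{t-1}), F_q(E_t)) for t = 2..L; returns shape (L - 1,)."""
    keys = E_F[:-1] @ W_k       # F_k applied to timesteps 1..L-1
    queries = E_F[1:] @ W_q     # F_q applied to timesteps 2..L
    dot = np.sum(keys * queries, axis=-1)
    norms = np.linalg.norm(keys, axis=-1) * np.linalg.norm(queries, axis=-1)
    return 1.0 - dot / (norms + eps)

# Toy embedding: 50 steps pointing along one axis, then 50 steps along another.
u = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0, 0.0])
E_F = np.vstack([np.tile(u, (50, 1)), np.tile(v, (50, 1))])
P = event_saliency(E_F, np.eye(4), np.eye(4))
peak = int(np.argmax(P))        # 0-based index of the most salient transition
```

Saliency stays near zero inside each constant segment and peaks at the single transition between them.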

Anchor Allocation. $P_t$ acts as a continuous indicator of local information density that guides subsequent resolution allocation. However, $P_t$ may exhibit local fluctuations due to noise or minor signal variations. To regulate density and avoid over-allocation, Dywave applies temporal non-maximum suppression and top-$k$ selection to extract a compact set of anchors.

$$\mathcal{A} = \text{TopK}\big(\text{NMS}(P, w_{\text{nms}}),\, \lceil \tau \cdot L \rceil\big), \tag{5}$$

where $w_{\text{nms}} = \lfloor L / (2 \lceil \tau L \rceil) \rfloor$ is the window size for NMS, and $\lceil \tau \cdot L \rceil$ is the maximum number of anchors to select. The resulting anchors $\mathcal{A}$ indicate where representational capacity should be concentrated, enabling adaptive compression that reflects semantic complexity instead of raw signal duration.
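One way to realize Eq. (5) is a greedy pass that combines suppression and the top-$k$ cap, sketched below. Treating $w_{\text{nms}}$ as a half-width (suppressing neighbours within $w_{\text{nms}}$ samples on each side), holding one saliency score per timestep, and the name `allocate_anchors` are all assumptions; the paper's exact NMS variant may differ.

```python
import math
import numpy as np

def allocate_anchors(P, tau):
    """Greedy temporal NMS with a cap of ceil(tau * L) anchors (0-based indices)."""
    L = len(P)
    k = math.ceil(tau * L)           # maximum number of anchors
    w = L // (2 * k)                 # w_nms = floor(L / (2 * ceil(tau * L)))
    suppressed = np.zeros(L, dtype=bool)
    anchors = []
    for t in np.argsort(-P, kind="stable"):   # visit timesteps from most to least salient
        if suppressed[t]:
            continue
        anchors.append(int(t))
        if len(anchors) == k:
            break
        lo, hi = max(0, t - w), min(L, t + w + 1)
        suppressed[lo:hi] = True     # suppress neighbours within w samples
    return np.array(sorted(anchors)), w

rng = np.random.default_rng(0)
P = rng.random(100)                  # synthetic saliency scores
anchors, w = allocate_anchors(P, tau=0.25)
```

Because each accepted anchor suppresses its neighbourhood, the selected anchors are always more than `w` samples apart, and their number never exceeds the budget set by `tau`.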

Table 1: Short-context classification performance. Best performance is bolded and second best is underlined.

### 3.3 Dynamic Temporal Information Fusion

Following temporal anchor formation, Dywave performs resolution-aware temporal fusion to construct compact input representations that reflect semantic complexity. Fusion is guided by anchor locations $\mathcal{A}$ that indicate where higher representational fidelity is required, allowing Dywave to concentrate computation around salient physical events while smoothly compressing temporally redundant intervals commonly observed in real-world sensing signals.

Temporal Anchor Aggregation. Temporal anchors act as sparse reference points around which local temporal neighborhoods are aggregated. Rather than enforcing hard partitions, each timestep is associated with its nearest anchor based on temporal proximity. Formally, the anchor assignment for time $t$ is defined as:

$$\kappa(t) = \arg\min_{a \in \mathcal{A}} |t - a|, \qquad t \in \{1, \ldots, L\}. \tag{6}$$

Saliency-Weighted Temporal Fusion. Given anchor-centered temporal neighborhoods, Dywave aggregates embeddings using saliency-aware weighting. Timesteps exhibiting greater information variability contribute more prominently to the fused representation, while stable regions are down-weighted. For each anchor $a_k \in \mathcal{A}$, the final embedding $E = \{E_1, \ldots, E_{|\mathcal{A}|}\}$ is computed as:

$$E_k = \frac{\sum_{t:\,\kappa(t)=a_k} P_t\, E^{F}_{t}}{\sum_{t:\,\kappa(t)=a_k} P_t + \varepsilon}, \qquad k \in \{1, \ldots, |\mathcal{A}|\}, \tag{7}$$

where $\varepsilon$ ensures numerical stability, and $E \in \mathbb{R}^{C \times |\mathcal{A}| \times d}$ is the final embedding for the backbone encoder.

The saliency\-weighted fusion prioritizes temporally localized event dynamics while suppressing redundant background activity, enabling adaptive compression that preserves semantic integrity without explicit segmentation\.
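The assignment in Eq. (6) and the pooling in Eq. (7) can be sketched for a single channel as follows. The function name `fuse_tokens`, the anchor positions, and the random inputs are hypothetical. The sketch also assumes one saliency value per timestep, whereas Eq. (4) defines $P_t$ only for $t \geq 2$, so some padding convention for the first timestep would be needed in practice.

```python
import numpy as np

def fuse_tokens(E_F, P, anchors, eps=1e-8):
    """E_F: (L, d) embeddings, P: (L,) saliency, anchors: sorted 0-based indices."""
    L = E_F.shape[0]
    t = np.arange(L)
    # kappa(t): position (within `anchors`) of the anchor nearest to each timestep;
    # ties go to the earlier anchor because argmin returns the first minimum.
    nearest = np.argmin(np.abs(t[:, None] - anchors[None, :]), axis=1)
    tokens = np.zeros((len(anchors), E_F.shape[1]))
    for k in range(len(anchors)):
        members = nearest == k
        weights = P[members]
        tokens[k] = (weights[:, None] * E_F[members]).sum(axis=0) / (weights.sum() + eps)
    return tokens, nearest

rng = np.random.default_rng(0)
L, d = 60, 4
E_F = rng.standard_normal((L, d))
P = rng.random(L)                        # synthetic, non-negative saliency
anchors = np.array([5, 20, 21, 48])      # hypothetical anchor positions
E, nearest = fuse_tokens(E_F, P, anchors)
E_const, _ = fuse_tokens(np.ones((L, d)), P, anchors)   # sanity check on constant input
```

Sixty timesteps collapse into four tokens here, and a constant embedding is returned essentially unchanged, as expected from a normalized weighted average.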

### 3.4 Training Objective

To ensure $E$ preserves multi-scale information, Dywave leverages a lightweight decoder to reconstruct the wavelet coefficients from the fused embedding $E$:

$$\mathcal{L}_{\text{rec}} = \text{MSE}\big(\{dX_1, \ldots, dX_J, A\},\, \text{Decode}(E)\big), \tag{8}$$

where $\text{Decode}$ comprises a linear layer, transposed convolution, and adaptive pooling to match input length $L$. This enforces that compressed tokens retain both low-frequency trends and high-frequency transients. The full objective combines training supervision with reconstruction:

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda_{\text{rec}} \cdot \mathcal{L}_{\text{rec}}. \tag{9}$$

Due to space limitations, we provide additional details on Dywave’s components in Appendix [C](https://arxiv.org/html/2605.14014#A3).
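As a worked instance of Eqs. (8) and (9), the snippet below combines a task loss with a reconstruction term. Averaging the MSE uniformly over wavelet components, the value `lambda_rec = 0.1`, and the toy targets are assumptions for illustration only; the decoder itself is not modeled.

```python
import numpy as np

def mse(targets, preds):
    """Mean squared error averaged uniformly over all wavelet components."""
    return float(np.mean([np.mean((t - p) ** 2) for t, p in zip(targets, preds)]))

def total_loss(task_loss, targets, preds, lambda_rec):
    """L = L_task + lambda_rec * L_rec."""
    return task_loss + lambda_rec * mse(targets, preds)

# Two toy components of shape (C, L) = (2, 8): one reconstructed badly, one perfectly.
targets = [np.ones((2, 8)), np.zeros((2, 8))]
preds = [np.zeros((2, 8)), np.zeros((2, 8))]
rec = mse(targets, preds)                                        # (1.0 + 0.0) / 2
loss = total_loss(0.7, targets, preds, lambda_rec=0.1)           # 0.7 + 0.1 * rec
```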

## 4 Evaluation

We evaluate Dywave on five sensing datasets, assessing short/long\-context, multimodal, and generalization performance, as well as token efficiency\. We also conduct ablations and qualitative analyses of Dywave’s tokenization\.

### 4.1 Evaluation Setup

Datasets. We evaluate on 5 datasets across three sensing applications: Human Activity Recognition (Grauman et al., [2022](https://arxiv.org/html/2605.14014#bib.bib20); Reiss & Stricker, [2012](https://arxiv.org/html/2605.14014#bib.bib48); Sztyler & Stuckenschmidt, [2016](https://arxiv.org/html/2605.14014#bib.bib56)), Moving Object Detection (Liu et al., [2023](https://arxiv.org/html/2605.14014#bib.bib36)), and Stress Assessment (Schmidt et al., [2018b](https://arxiv.org/html/2605.14014#bib.bib53)). Each sample is a fixed-length time-series segment used for short- and long-context classification.

Baselines. We consider 5 baselines compatible with various backbones: PatchTST (Nie et al., [2023](https://arxiv.org/html/2605.14014#bib.bib41)), DropPatch (Qiu et al., [2025](https://arxiv.org/html/2605.14014#bib.bib46)), MedFormer (Wang et al., [2024](https://arxiv.org/html/2605.14014#bib.bib63)), WaveToken (Masserano et al., [2025](https://arxiv.org/html/2605.14014#bib.bib37)), and MultiPatch (Naghashi et al., [2025](https://arxiv.org/html/2605.14014#bib.bib38)). We evaluate them using two sequence encoders, Transformer (Vaswani et al., [2017](https://arxiv.org/html/2605.14014#bib.bib60)) and Mamba2 (Dao & Gu, [2024](https://arxiv.org/html/2605.14014#bib.bib11)), with the same parameter settings.

We provide more details on the datasets, baselines, encoders, and additional configurations in Appendix[A](https://arxiv.org/html/2605.14014#A1),[B](https://arxiv.org/html/2605.14014#A2), and[D](https://arxiv.org/html/2605.14014#A4)\.

![Refer to caption](https://arxiv.org/html/2605.14014v1/x3.png)
Figure 3: Short-context performance vs. different parameters.

![Refer to caption](https://arxiv.org/html/2605.14014v1/x4.png) (a) MOD
![Refer to caption](https://arxiv.org/html/2605.14014v1/x5.png) (b) PAMAP2
![Refer to caption](https://arxiv.org/html/2605.14014v1/x6.png) (c) RWHAR

Figure 4: Short-context Accuracy vs. Token with the Transformer encoder.

### 4.2 Short-context Performance

#### 4.2.1 Short-context Classification

We evaluate Dywave on short\-context signals \(2\-5 seconds\) across five sensing configurations: MOD seismic, PAMAP2 accelerometer & gyroscope, and RWHAR accelerometer & gyroscope\. Table[1](https://arxiv.org/html/2605.14014#S3.T1)shows classification performance using Transformer and Mamba2 backbone encoders\. Dywave consistently achieves the best performance across all datasets and architectures, with gains up to 12% in accuracy on HAR tasks using Mamba2\. Unlike fixed\-size tokenization methods, Dywave eliminates the need for exhaustive hyperparameter tuning of patch size and stride\. As shown in Figure[3](https://arxiv.org/html/2605.14014#S4.F3), PatchTST is highly sensitive to these parameters, requiring extensive grid search, while Dywave uses learnable, instance\-specific segmentation\.

![Refer to caption](https://arxiv.org/html/2605.14014v1/x7.png)
Figure 5: Multimodal classification accuracy.

Moreover, WaveToken performs notably worse than other baselines. Discretizing the input into quantized token IDs appears ill-suited for high-frequency sensing data with rich dynamics, as it disrupts the fine-grained amplitude and temporal coherence essential for signal characterization.

#### 4.2.2 Multimodal Classification

In addition to unimodal classification, we further validate Dywave’s capability in handling multimodal sensing inputs by jointly using accelerometer and gyroscope signals\. Each modality is processed independently by Dywave, producing modality\-specific token sequences that are fed into separate backbone encoders\. We then perform late fusion of intermediate representations before the final classifier layers\. Figure[5](https://arxiv.org/html/2605.14014#S4.F5)presents the classification results on PAMAP2 and RWHAR using the Mamba2 backbone encoder\. Although Dywave’s adaptive tokenization is not specifically designed to handle multimodal inputs, it still maintains strong performance across all datasets\. Additionally, fixed\-size tokenization requires extensive hyperparameter tuning to ensure consistent temporal resolution and alignment between modalities\. The results demonstrate that Dywave generalizes well to multimodal input and remains robust and effective on adaptive segmentation\.

#### 4.2.3 Performance vs. Token Distribution

Figure [4](https://arxiv.org/html/2605.14014#S4.F4) compares classification accuracy against average token count for Dywave and PatchTST across MOD, PAMAP2, and RWHAR. PatchTST shows no clear correlation between token count and accuracy, with more tokens often performing worse due to fixed tokenization’s sensitivity to hyperparameters. Dywave consistently achieves comparable or superior accuracy with far fewer tokens (upper-left region), demonstrating that adaptive tokenization concentrates its token budget on informative segments.

#### 4.2.4 Dynamic Patching Baseline Comparison

Table 2: Comparing Dywave with dynamic patching baselines. T stands for Transformer and M stands for Mamba2.

We compare Dywave against two dynamic patching baselines, LightGTS (Wang et al., [2025](https://arxiv.org/html/2605.14014#bib.bib64)) and Ruptures (Truong et al., [2020](https://arxiv.org/html/2605.14014#bib.bib58)). Table [2](https://arxiv.org/html/2605.14014#S4.T2) reports results on RWHAR and PAMAP2 with both backbone encoders. Dywave consistently outperforms both baselines, confirming that representation-driven, event-aligned tokenization is more accurate and efficient than sequence-level adaptive patching or statistical signal segmentation. LightGTS applies uniform windows within each sample, which overlooks intra-sequence heterogeneity, where a single signal may contain both rapid transient events and stationary intervals. Ruptures performs change-point detection directly on the raw signal without task supervision, yielding boundaries that are statistically salient but not semantically meaningful. This degrades performance while producing more tokens.

![Refer to caption](https://arxiv.org/html/2605.14014v1/x8.png)
Figure 6: Long-context classification performance with the Transformer encoder.

![Refer to caption](https://arxiv.org/html/2605.14014v1/x9.png) (a) MOD - Audio
![Refer to caption](https://arxiv.org/html/2605.14014v1/x10.png) (b) Ego4D - Accelerometer

Figure 7: Long-context token distribution with the Transformer encoder.

### 4.3 Long-context Performance

#### 4.3.1 Long-context Classification

We evaluate Dywave on long\-context signals: MOD audio, Ego4D accelerometer, and WESAD ECG & EMG\. Figure[6](https://arxiv.org/html/2605.14014#S4.F6)shows results on both backbone encoders\. Dywave consistently achieves the highest accuracy and F1\-score across all datasets\. The advantage is most significant on Ego4D, where 30\-second sequences contain multiple heterogeneous sub\-events \(posture transitions, hand\-object interactions, environmental perturbations\)\. Fixed\-size tokenization mixes unrelated actions and obscures fine\-grained transitions, while Dywave dynamically identifies semantic boundaries that align with activity transitions, enabling the backbone to focus on informative temporal regions\.

#### 4.3.2 Long-context Token Distribution

Figure[7](https://arxiv.org/html/2605.14014#S4.F7)shows the distribution of input token lengths for Dywave compared to other methods\. Dywave achieves higher accuracy and F1 scores while using significantly fewer tokens\. On MOD audio \(16,000 Hz\), Dywave produces 55 tokens on average, nearly four times fewer than PatchTST, while maintaining superior accuracy\. On Ego4D, Dywave delivers better performance with fewer tokens despite overlapping motion phases and ambient noise\. These results demonstrate Dywave’s ability to dynamically adjust tokenization to signal semantics, selectively retaining informative segments while compressing stationary intervals\.

### 4.4 Generalizability Evaluation

We evaluate the generalization ability of Dywave under both *domain shifts* and *sequence-length variations*. Two finetuning strategies are considered: (1) *full backbone finetuning*, where the tokenization module is frozen and both the encoder and classification head are updated; and (2) *head-only finetuning*, where only the classification head is trained. We study (1) cross-domain generalization on out-of-domain MOD data, and (2) sequence-length generalization by transferring between Ego4D-S (5s) and Ego4D (30s).

Table 3: Generalization with Full Backbone Finetuning.

Table 4: Generalization with Classification Head Finetuning.

#### 4.4.1 Cross-Domain Generalization

As shown in Table[3](https://arxiv.org/html/2605.14014#S4.T3), under full backbone finetuning, Dywave consistently improves cross\-domain accuracy across both modalities while operating with substantially fewer tokens\. In contrast, PatchTST shows limited robustness when the patching module is fixed\.

This gap becomes more pronounced under head\-only finetuning\. In Table [4](https://arxiv.org/html/2605.14014#S4.T4), PatchTST suffers severe performance degradation \(*e\.g\.*, 0\.81 → 0\.34 on seismic\), indicating strong reliance on source\-domain embeddings\. By comparison, Dywave maintains substantially higher accuracy, suggesting that its event\-aligned tokenization encourages the backbone to learn more transferable representations during training rather than overfitting to domain\-specific inputs\.

#### 4\.4\.2 Sequence\-Length Generalization

On sequence\-length generalization, Dywave exhibits stable performance when transferring between short and long temporal contexts, consistently outperforming PatchTST in both transfer directions with far fewer tokens\. This indicates that Dywave can maintain performance while representing longer temporal spans more efficiently\. The advantage is larger under head\-only finetuning, where PatchTST degrades significantly in accuracy while Dywave preserves meaningful predictive performance\. The token count gap further reveals Dywave’s superior scalability\. From 5 s to 30 s, PatchTST increases its token count by over 6× \(59→372\), while Dywave grows modestly \(15\.5→35\)\. This suggests that token density is governed primarily by semantic structure rather than signal length, and that Dywave can produce compact, low\-latency input token representations with minimal redundancy\.
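The scaling argument can be checked with a few lines of arithmetic\. The sketch below uses only the token counts quoted above; the quadratic cost model for self\-attention is a common approximation and an assumption here, not a measurement reported in the paper, and the helper names are illustrative\.

```python
# Token counts quoted in the text for Ego4D-S (5 s) and Ego4D (30 s).
TOKENS = {
    "PatchTST": (59, 372),
    "Dywave": (15.5, 35),
}


def growth(short_tokens, long_tokens):
    """Factor by which the token count grows from the short to the long window."""
    return long_tokens / short_tokens


def relative_attention_cost(tokens, reference_tokens):
    """Self-attention cost relative to a reference, ASSUMING cost ~ tokens**2."""
    return (tokens / reference_tokens) ** 2


for name, (short, long_) in TOKENS.items():
    print(f"{name}: {growth(short, long_):.2f}x more tokens from 5 s to 30 s")

# Assumed quadratic model: attention cost of PatchTST relative to Dywave at 30 s.
ratio_30s = relative_attention_cost(TOKENS["PatchTST"][1], TOKENS["Dywave"][1])
print(f"PatchTST / Dywave attention cost at 30 s: {ratio_30s:.0f}x")
```

The window grows by a factor of 6, PatchTST’s token count grows by roughly the same factor \(about 6\.3×\), and Dywave’s grows by about 2\.3×\. Under the assumed quadratic model, the gap in attention cost widens much faster than the gap in token count\.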

Table 5: Dywave variants with Transformer backbone encoder\.

### 4\.5 Ablation Studies

We analyze the contributions of different modules using five variants that remove individual components or replace designs\. Table [5](https://arxiv.org/html/2605.14014#S4.T5) shows results on the MOD dataset for short\- and long\-context settings\. Due to space limitations, we defer the detailed descriptions of the variants to Appendix [E\.1](https://arxiv.org/html/2605.14014#A5.SS1)\.

Variants w/oWave, FixedDWT, and w/oRecon evaluate the hierarchical embedding module\. Removing or simplifying this module consistently degrades performance\. Compared to Dywave\-FixedDWT, which uses standard DWT representations only, Dywave reduces the number of tokens and achieves superior accuracy\. Dywave\-w/oWave, which lacks hierarchical embedding, fails to effectively reduce input tokens and can incur additional overhead\. Removing the reconstruction objective \(Dywave\-w/oRecon\) results in a noticeable performance drop, because the objective encourages semantically coherent segmentation and retains fine\-grained information\. These results demonstrate that hierarchical embedding and the reconstruction objective are critical for both efficacy and efficiency\.
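For readers unfamiliar with the standard DWT that the FixedDWT variant relies on, the following is a minimal sketch of a multi\-level Haar decomposition in plain Python\. The Haar wavelet, the function names, and the toy signal are illustrative assumptions; this section does not state which wavelet family Dywave uses, and the sketch is not the authors’ hierarchical embedding module\.

```python
import math


def haar_step(signal):
    """One DWT level: split an even-length signal into approximation and detail."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_dwt(signal, levels):
    """Multi-level decomposition: returns [approx_L, detail_L, ..., detail_1]."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        if len(approx) < 2 or len(approx) % 2:
            raise ValueError("signal length must be divisible by 2**levels")
        approx, detail = haar_step(approx)
        details.append(detail)
    return [approx] + details[::-1]


# Toy signal (assumed): flat, then a step "event" after the third sample.
x = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0, 5.0]
coeffs = haar_dwt(x, levels=2)
for name, band in zip(["approx_2", "detail_2", "detail_1"], coeffs):
    print(name, [round(c, 3) for c in band])
```

In this toy case the flat regions yield zero detail coefficients, while the transition produces nonzero detail coefficients at both scales\. This is the kind of multi\-scale cue that a boundary detector can exploit\.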

Variants w/oFusion, CNNBound, and SpecBound analyze dynamic anchor selection and fusion\. Replacing saliency estimation with CNN\-based prediction leads to a clear performance decline, indicating that local convolutional filters are insufficient for capturing temporal dependencies\. The event saliency module leverages contextual relations between neighboring segments, enabling semantically consistent segmentation\. Dropping non\-anchor segments \(Dywave\-w/oFusion\) achieves comparable performance in short\-context settings but sacrifices efficiency with a much higher token count\. In long\-context settings \(MOD audio\), it reduces input length but suffers greater accuracy degradation\. This implies non\-anchor segments contain meaningful cues and should not be discarded\. Dynamic fusion is crucial for achieving compact representations without sacrificing accuracy\. Using spectral energy as the saliency criterion captures low\-level frequency changes in the signal but may not align with meaningful semantic transitions\. Dywave\-SpecBound occasionally produces lower token counts, but at the cost of reduced accuracy, further confirming that semantic\-space saliency is more informative for event\-aligned tokenization than signal\-level spectral measures\.
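The difference between fusing and dropping non\-anchor segments can be illustrated with a small sketch\. It is a simplified stand\-in, not the authors’ module: the thresholded cosine dissimilarity between neighboring segments, the mean\-pooling fusion rule, the function names, and the toy embeddings are all assumptions made for illustration, whereas the paper’s event saliency module is learned and contextual\.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_anchors(embeddings, threshold):
    """Mark segment i as an anchor when it differs enough from segment i - 1."""
    anchors = [0]  # the first segment always opens a token
    for i in range(1, len(embeddings)):
        saliency = 1.0 - cosine(embeddings[i - 1], embeddings[i])
        if saliency > threshold:
            anchors.append(i)
    return anchors


def fuse(embeddings, anchors):
    """Average each anchor with the non-anchor segments that follow it."""
    tokens = []
    for start, end in zip(anchors, anchors[1:] + [len(embeddings)]):
        group = embeddings[start:end]
        tokens.append([sum(col) / len(group) for col in zip(*group)])
    return tokens


# Hypothetical 2-D segment embeddings: three similar, a change, then two similar.
segments = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9]]
anchors = select_anchors(segments, threshold=0.5)
tokens = fuse(segments, anchors)
print(anchors, len(tokens))
```

Here five segments collapse into two tokens, and every segment still contributes to one of them\. A drop\-based variant would instead keep only the anchor embeddings and discard the information carried by the other three segments\.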

![Refer to caption](https://arxiv.org/html/2605.14014v1/x11.png) \(a\) MOD \(2 seconds\)
![Refer to caption](https://arxiv.org/html/2605.14014v1/x12.png) \(b\) RWHAR \(4\.5 seconds\)
![Refer to caption](https://arxiv.org/html/2605.14014v1/x13.png) \(c\) PAMAP2 \(5 seconds\)
![Refer to caption](https://arxiv.org/html/2605.14014v1/x14.png) \(d\) Ego4D \(30 seconds\)

Figure 8: On\-device \(Raspberry Pi 4\) Profiling\.

### 4\.6 Computational Overhead

Figure [8](https://arxiv.org/html/2605.14014#S4.F8) shows end\-to\-end runtime profiling on a Raspberry Pi 4 device\. Dywave introduces preprocessing overhead relative to simple patching, but this cost is *amortized* by substantial reductions in backbone computation\. For seismic, the signal is short and information\-dense, leaving limited redundancy to compress\. However, as the context length grows \(*e\.g\.*, RWHAR, PAMAP2, Ego4D\), compression reduces both latency and memory usage\. For longer sequences with heterogeneous dynamics, Dywave’s token compression substantially reduces backbone computation, and the advantage grows with longer context windows or larger encoder models\. This makes Dywave particularly well\-suited for real\-world deployments where signals are long and heterogeneous and resource constraints are tight\.

### 4\.7 Inference Noise Robustness

Figure [9](https://arxiv.org/html/2605.14014#S4.F9) evaluates robustness to Gaussian noise injected at test time on RWHAR with the Mamba2 backbone\. PatchTST degrades sharply with increasing noise, whereas Dywave degrades more gradually over the same range\. This robustness stems from two properties: saliency estimation operates on hierarchical embeddings that integrate multi\-scale context, and cosine similarity is invariant to magnitude scaling, making boundary detection less sensitive to additive noise\. Notably, Dywave’s token count increases adaptively under noise, reflecting dynamic allocation of representational capacity as signal uncertainty grows\. This behavior contrasts with fixed\-patch methods, whose token count is fixed regardless of signal quality\.
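The scale\-invariance property cited above is easy to verify numerically\. The sketch below checks only that property, using arbitrary example vectors and scale factors chosen for illustration; it does not by itself establish robustness to additive noise, which can change the direction of an embedding as well as its magnitude\.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Arbitrary example embeddings (assumed values, not from the paper).
u = [0.2, -1.3, 0.7]
v = [0.5, -0.9, 1.1]

base = cosine(u, v)
# Rescale each vector by a different positive factor.
scaled = cosine([10.0 * a for a in u], [0.01 * b for b in v])
print(f"{base:.6f} {scaled:.6f}")
```

The two printed values agree up to floating\-point error, since positive rescaling changes the numerator and the denominator of the cosine by the same factor\.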

![Refer to caption](https://arxiv.org/html/2605.14014v1/x15.png)

Figure 9: Inference robustness with random noise injection\.

## 5 Related Work

Human\-Centric Sensing Signal Modeling\. Deep learning has driven substantial progress in human\-centric sensing applications, including human activity recognition \(Chen et al\., [2023](https://arxiv.org/html/2605.14014#bib.bib10); Zhang et al\., [2025](https://arxiv.org/html/2605.14014#bib.bib72)\), healthcare monitoring \(Ren et al\., [2025](https://arxiv.org/html/2605.14014#bib.bib49); Saha et al\., [2025](https://arxiv.org/html/2605.14014#bib.bib51); Chatterjee et al\., [2020](https://arxiv.org/html/2605.14014#bib.bib9); Ullah et al\., [2022](https://arxiv.org/html/2605.14014#bib.bib59); Englhardt et al\., [2024](https://arxiv.org/html/2605.14014#bib.bib16)\), and stress or affect recognition \(Yu & Sano, [2023](https://arxiv.org/html/2605.14014#bib.bib69); Nath et al\., [2023](https://arxiv.org/html/2605.14014#bib.bib39); Wang et al\., [2023a](https://arxiv.org/html/2605.14014#bib.bib62)\)\. The sensing signals underlying these tasks are highly heterogeneous, non\-stationary, and context\-dependent, motivating extensive research on specialized model architectures \(Yao et al\., [2017](https://arxiv.org/html/2605.14014#bib.bib66); Ekambaram et al\., [2023](https://arxiv.org/html/2605.14014#bib.bib15); Kara et al\., [2024b](https://arxiv.org/html/2605.14014#bib.bib25); Shams et al\., [2024](https://arxiv.org/html/2605.14014#bib.bib55)\) and learning frameworks \(Deldari et al\., [2022](https://arxiv.org/html/2605.14014#bib.bib13); Kimura et al\., [2025](https://arxiv.org/html/2605.14014#bib.bib29); Ouyang et al\., [2022](https://arxiv.org/html/2605.14014#bib.bib42); Zhang et al\., [2022](https://arxiv.org/html/2605.14014#bib.bib71)\)\. Despite these advances, effectively transforming raw sensing streams into input representations that respect the intrinsic temporal structure of physical events remains a core challenge\. Most existing approaches rely on predefined segmentation rules or external supervision, which limits their ability to adapt to irregular and event\-driven temporal dynamics commonly observed in real\-world sensing scenarios \(Zheng et al\., [2025](https://arxiv.org/html/2605.14014#bib.bib73); Gao et al\., [2023](https://arxiv.org/html/2605.14014#bib.bib17)\)\.

Time\-Series Tokenization and Representation\. Tokenization serves as a key interface between sensing signals and sequential models\. A widely used strategy applies fixed\-length windowing to partition time\-series into uniform tokens, analogous to patching in vision transformers \(Dosovitskiy et al\., [2020](https://arxiv.org/html/2605.14014#bib.bib14); Nie et al\., [2023](https://arxiv.org/html/2605.14014#bib.bib41); [Cao et al\.,](https://arxiv.org/html/2605.14014#bib.bib6); Chang et al\., [2025](https://arxiv.org/html/2605.14014#bib.bib8); Das et al\., [2024](https://arxiv.org/html/2605.14014#bib.bib12); [Jin et al\.,](https://arxiv.org/html/2605.14014#bib.bib23); Zhou et al\., [2023](https://arxiv.org/html/2605.14014#bib.bib75)\)\. While simple and effective, fixed patching requires careful tuning of window size and stride and is inherently insensitive to variations in signal dynamics\. Recent work explores multi\-scale patching to capture temporal patterns at multiple resolutions \(Cao et al\., [2025](https://arxiv.org/html/2605.14014#bib.bib7); Zou et al\., [2024](https://arxiv.org/html/2605.14014#bib.bib76); Zhong et al\., [2024](https://arxiv.org/html/2605.14014#bib.bib74); Wang et al\., [2024](https://arxiv.org/html/2605.14014#bib.bib63)\), though such methods are often tailored to forecasting and assume relatively homogeneous temporal structures\. Other approaches discretize continuous signals into symbolic tokens inspired by language modeling \(Sennrich et al\., [2016](https://arxiv.org/html/2605.14014#bib.bib54); [Ansari et al\.,](https://arxiv.org/html/2605.14014#bib.bib3); Masserano et al\., [2025](https://arxiv.org/html/2605.14014#bib.bib37); Götz et al\., [2025](https://arxiv.org/html/2605.14014#bib.bib18)\), but quantization can disrupt fine\-grained temporal coherence critical for sensing data\. Frequency\-based representations further enrich modeling via time–frequency transforms \(Yao et al\., [2019](https://arxiv.org/html/2605.14014#bib.bib67); Kara et al\., [2024a](https://arxiv.org/html/2605.14014#bib.bib24); Hu et al\., [2025](https://arxiv.org/html/2605.14014#bib.bib22); Piao et al\., [2024](https://arxiv.org/html/2605.14014#bib.bib45); Yi et al\., [2023](https://arxiv.org/html/2605.14014#bib.bib68)\), yet their reliance on fixed windowing constrains temporal adaptivity\. A key challenge is adapting tokenization granularity to match signal variability\. LightGTS \(Wang et al\., [2025](https://arxiv.org/html/2605.14014#bib.bib64)\) learns per\-sequence patching, enabling inter\-sequence adaptation, but still overlooks intra\-sequence heterogeneity\. In vision, TokenLearner \(Ryoo et al\., [2021](https://arxiv.org/html/2605.14014#bib.bib50)\) and DynamicViT \(Rao et al\., [2021](https://arxiv.org/html/2605.14014#bib.bib47)\) perform adaptive token selection on tokenized patches and require modifications to the backbone\. In contrast, sensing signals lack semantically meaningful token units, and tokenization must be constructed from continuous waveforms\. Moreover, Dywave is decoupled from the backbone as a modular front\-end that can be paired with arbitrary sequence models without architectural changes\.

## 6 Conclusion

We introduced Dywave, a dynamic wavelet\-based framework for adaptive time\-series tokenization in sensing applications\. By aligning token boundaries with intrinsic event transitions rather than fixed temporal windows, Dywave constructs representations that better reflect the underlying physical structure of sensing signals\. Our results demonstrate that adaptive, event\-aligned tokenization improves robustness and generalization across diverse sensing tasks and temporal conditions\. This work highlights the importance of physically grounded, adaptive input representations as a foundation for scalable and semantically meaningful learning in human\-centric sensing systems\. Due to space limitations, we provide additional evaluation results and discussions in the Appendix\.

## Acknowledgements

Research reported in this paper was sponsored in part by the Army Research Laboratory under Cooperative Agreement W911NF\-17\-20196, NSF CNS 20\-38817, and the Boeing Company\. The views and conclusions contained in this document are those of the author\(s\) and should not be interpreted as representing the official policies of the CCDC Army Research Laboratory or the US government\. The US government is authorized to reproduce and distribute reprints for government purposes, notwithstanding any copyright notation hereon\.

## Impact Statement

This work addresses the tokenization gap between continuous sensing signals and discrete input representations for deep learning and Edge AI\. We discuss below both the potential benefits and societal considerations\.

Positive Impacts\. We hope this framing brings attention to an important yet underexplored challenge at the intersection of sensing and machine learning\. As AI systems increasingly operate in and interact with the physical world, bridging this gap becomes essential for developing models that can perceive and reason about continuous sensor streams\. By highlighting this problem and proposing a principled solution, we aim to advance interdisciplinary research across the sensing and AI communities\.

Additionally, Dywave’s adaptive tokenization reduces the need for extensive hyperparameter tuning required by uniform patching methods, improving accessibility for practitioners\. The event\-aligned boundaries further contribute to automatic micro\-activity segmentation, substantially reducing the annotation burden associated with fine\-grained labeling in sensing applications, where raw signals are difficult to interpret and require manual effort\. This automation also carries a direct privacy benefit\. Since segmentation no longer requires human annotators to inspect raw sensor streams, sensitive physiological or behavioral recordings need not be exposed during the labeling process\.

Societal Considerations\. The sensing applications enabled by this work raise considerations common to ubiquitous computing\. For human activity recognition, improved models could enhance assistive technologies for elderly care and rehabilitation monitoring, but similar capabilities could be misused for surveillance without consent\. For healthcare applications using ECG or physiological signals, more accurate monitoring benefits patient care while requiring careful attention to data privacy\. Dywave’s event\-aligned tokenization produces compact representations that selectively retain semantically salient segments and discard stationary or uninformative intervals, which reduces the fidelity of stored data and can limit re\-identification from retained representations\.

We would like to note that Dywave is a general\-purpose tokenization framework and does not itself collect or store personal data\. All datasets used in this work are publicly available and no additional IRB approval was required for our experiments\. As with many sensing\-based systems, real\-world deployment requires appropriate consent and adherence to privacy regulations, which are beyond the scope of this work\.

## References

- \(1\)Ahia, O\., Kumar, S\., Gonen, H\., Kasai, J\., Mortensen, D\. R\., Smith, N\. A\., and Tsvetkov, Y\.Do all languages cost the same? tokenization in the era of commercial language models\.
- Alharbi et al\. \(2023\)Alharbi, R\., Shahi, S\., Cruz, S\., Li, L\., Sen, S\., Pedram, M\., Romano, C\., Hester, J\., Katsaggelos, A\. K\., and Alshurafa, N\.Smokemon: unobtrusive extraction of smoking topography using wearable energy\-efficient thermal\.*Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies*, 6\(4\):1–25, 2023\.
- \(3\)Ansari, A\. F\., Stella, L\., Turkmen, A\. C\., Zhang, X\., Mercado, P\., Shen, H\., Shchur, O\., Rangapuram, S\. S\., Arango, S\. P\., Kapoor, S\., et al\.Chronos: Learning the language of time series\.*Transactions on Machine Learning Research*\.
- Baris et al\. \(2025\)Baris, O\., Chen, Y\., Dong, G\., Han, L\., Kimura, T\., Quan, P\., Wang, R\., Wang, T\., Abdelzaher, T\., Bergés, M\., et al\.Foundation models for cps\-iot: Opportunities and challenges\.*arXiv preprint arXiv:2501\.16368*, 2025\.
- Bommasani et al\. \(2021\)Bommasani, R\., Hudson, D\. A\., Adeli, E\., Altman, R\., Arora, S\., von Arx, S\., Bernstein, M\. S\., Bohg, J\., Bosselut, A\., Brunskill, E\., et al\.On the opportunities and risks of foundation models\.*arXiv e\-prints*, pp\. arXiv–2108, 2021\.
- \(6\)Cao, D\., Jia, F\., Arik, S\. O\., Pfister, T\., Zheng, Y\., Ye, W\., and Liu, Y\.Tempo: Prompt\-based generative pre\-trained transformer for time series forecasting\.In*The Twelfth International Conference on Learning Representations*\.
- Cao et al\. \(2025\)Cao, Y\., Tian, Z\., Guo, W\., and Liu, X\.Mspatch: A multi\-scale patch mixing framework for multivariate time series forecasting\.*Expert Systems with Applications*, 273:126849, 2025\.
- Chang et al\. \(2025\)Chang, C\., Wang, W\.\-Y\., Peng, W\.\-C\., and Chen, T\.\-F\.Llm4ts: Aligning pre\-trained llms as data\-efficient time\-series forecasters\.*ACM Transactions on Intelligent Systems and Technology*, 16\(3\):1–20, 2025\.
- Chatterjee et al\. \(2020\)Chatterjee, S\., Moreno, A\., Lizotte, S\. L\., Akther, S\., Ertin, E\., Fagundes, C\. P\., Lam, C\., Rehg, J\. M\., Wan, N\., Wetter, D\. W\., et al\.Smokingopp: Detecting the smoking ‘opportunity’ context using mobile sensors\.*Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies*, 4\(1\):1–26, 2020\.
- Chen et al\. \(2023\)Chen, L\., Hu, R\., Wu, M\., and Zhou, X\.Hmgan: A hierarchical multi\-modal generative adversarial network model for wearable human activity recognition\.*Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies*, 7\(3\):1–27, 2023\.
- Dao & Gu \(2024\)Dao, T\. and Gu, A\.Transformers are ssms: Generalized models and efficient algorithms through structured state space duality\.In*International Conference on Machine Learning*, pp\. 10041–10071\. PMLR, 2024\.
- Das et al\. \(2024\)Das, A\., Kong, W\., Sen, R\., and Zhou, Y\.A decoder\-only foundation model for time\-series forecasting\.In*Forty\-first International Conference on Machine Learning*, 2024\.
- Deldari et al\. \(2022\)Deldari, S\., Xue, H\., Saeed, A\., Smith, D\. V\., and Salim, F\. D\.Cocoa: Cross modality contrastive learning for sensor data\.*Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies*, 6\(3\):1–28, 2022\.
- Dosovitskiy et al\. \(2020\)Dosovitskiy, A\., Beyer, L\., Kolesnikov, A\., Weissenborn, D\., Zhai, X\., Unterthiner, T\., Dehghani, M\., Minderer, M\., Heigold, G\., Gelly, S\., et al\.An image is worth 16x16 words: Transformers for image recognition at scale\.In*International Conference on Learning Representations*, 2020\.
- Ekambaram et al\. \(2023\)Ekambaram, V\., Jati, A\., Nguyen, N\., Sinthong, P\., and Kalagnanam, J\.Tsmixer: Lightweight mlp\-mixer model for multivariate time series forecasting\.In*Proceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining*, pp\. 459–469, 2023\.
- Englhardt et al\. \(2024\)Englhardt, Z\., Ma, C\., Morris, M\. E\., Chang, C\.\-C\., Xu, X\. O\., Qin, L\., McDuff, D\., Liu, X\., Patel, S\., and Iyer, V\.From classification to clinical insights: Towards analyzing and reasoning about mobile and behavioral health data with large language models\.*Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies*, 8\(2\):1–25, 2024\.
- Gao et al\. \(2023\)Gao, Z\., Wang, Y\., Chen, J\., Xing, J\., Patel, S\., Liu, X\., and Shi, Y\.Mmtsa: Multi\-modal temporal segment attention network for efficient human activity recognition\.*Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies*, 7\(3\):1–26, 2023\.
- Götz et al\. \(2025\)Götz, L\., Kollovieh, M\., Günnemann, S\., and Schwinn, L\.Byte pair encoding for efficient time series forecasting\.*arXiv preprint arXiv:2505\.14411*, 2025\.
- Graps \(1995\)Graps, A\.An introduction to wavelets\.*IEEE computational science and engineering*, 2\(2\):50–61, 1995\.
- Grauman et al\. \(2022\)Grauman, K\., Westbury, A\., Byrne, E\., Chavis, Z\., Furnari, A\., Girdhar, R\., Hamburger, J\., Jiang, H\., Liu, M\., Liu, X\., et al\.Ego4d: Around the world in 3,000 hours of egocentric video\.In*Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, pp\. 18995–19012, 2022\.
- He et al\. \(2016\)He, K\., Zhang, X\., Ren, S\., and Sun, J\.Deep residual learning for image recognition\.In*Proceedings of the IEEE conference on Computer Vision and Pattern Recognition*, pp\. 770–778, 2016\.
- Hu et al\. \(2025\)Hu, C\., Chen, Y\., Kara, D\., Liu, S\., Abdelzaher, T\., Wu, F\., and Chen, G\.Openmae: efficient masked autoencoder for vibration sensing with open\-domain data enrichment\.*Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies*, 9\(2\):1–29, 2025\.
- \(23\)Jin, M\., Wang, S\., Ma, L\., Chu, Z\., Zhang, J\. Y\., Shi, X\., Chen, P\.\-Y\., Liang, Y\., Li, Y\.\-F\., Pan, S\., et al\.Time\-llm: Time series forecasting by reprogramming large language models\.In*The Twelfth International Conference on Learning Representations*\.
- Kara et al\. \(2024a\)Kara, D\., Kimura, T\., Chen, Y\., Li, J\., Wang, R\., Chen, Y\., Wang, T\., Liu, S\., and Abdelzaher, T\.Phymask: An adaptive masking paradigm for efficient self\-supervised learning in iot\.In*Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems*, pp\. 97–111, 2024a\.
- Kara et al\. \(2024b\)Kara, D\., Kimura, T\., Liu, S\., Li, J\., Liu, D\., Wang, T\., Wang, R\., Chen, Y\., Hu, Y\., and Abdelzaher, T\.Freqmae: Frequency\-aware masked autoencoder for multi\-modal iot sensing\.In*The World Wide Web Conference*, 2024b\.
- Kawano et al\. \(2023\)Kawano, H\., Okamoto, M\., and Murao, K\.Estimating sampling rate of human activity data from accelerometer using transformer\-based regression model\.In*Adjunct Proceedings of the 2023 ACM International Joint Conference on Pervasive and Ubiquitous Computing & the 2023 ACM International Symposium on Wearable Computing*, pp\. 200–201, 2023\.
- Kim et al\. \(2023\)Kim, G\., Yeo, D\., Jo, T\., Rus, D\., and Kim, S\.What and when to explain? on\-road evaluation of explanations in highly automated vehicles\.*Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies*, 7\(3\):1–26, 2023\.
- Kimura et al\. \(2024\)Kimura, T\., Li, J\., Wang, T\., Chen, Y\., Wang, R\., Kara, D\., Wigness, M\., Bhattacharyya, J\., Srivatsa, M\., Liu, S\., et al\.Vibrofm: Towards micro foundation models for robust multimodal iot sensing\.In*2024 IEEE 21st International Conference on Mobile Ad\-Hoc and Smart Systems \(MASS\)*, pp\. 10–18\. IEEE, 2024\.
- Kimura et al\. \(2025\)Kimura, T\., Li, X\., Hanna, O\., Chen, Y\., Chen, Y\., Kara, D\., Wang, T\., Li, J\., Ouyang, X\., Liu, S\., et al\.Infomae: Pair\-efficient cross\-modal alignment for multimodal time\-series sensing signals\.In*Proceedings of the ACM on Web Conference 2025*, pp\. 3084–3095, 2025\.
- Kingma & Ba \(2014\)Kingma, D\. P\. and Ba, J\.Adam: A method for stochastic optimization\.*arXiv preprint arXiv:1412\.6980*, 2014\.
- Korany et al\. \(2019\)Korany, B\., Karanam, C\. R\., Cai, H\., and Mostofi, Y\.Xmodal\-id: Using wifi for through\-wall person identification from candidate video footage\.In*The 25th Annual International Conference on Mobile Computing and Networking*, MobiCom ’19, New York, NY, USA, 2019\. Association for Computing Machinery\.ISBN 9781450361699\.doi:10\.1145/3300061\.3345437\.URL[https://doi\.org/10\.1145/3300061\.3345437](https://doi.org/10.1145/3300061.3345437)\.
- Kudo & Richardson \(2018\)Kudo, T\. and Richardson, J\.Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing\.*EMNLP 2018*, pp\. 66, 2018\.
- Larrubia et al\. \(2025\)Larrubia, L\. F\., Morettin, P\. A\., and Chiann, C\.The maximal overlap discrete wavelet scattering transform and its application in classification tasks\.*arXiv preprint arXiv:2506\.12039*, 2025\.
- Lee et al\. \(2019\)Lee, G\., Gommers, R\., Waselewski, F\., Wohlfahrt, K\., and O’Leary, A\.Pywavelets: A python package for wavelet analysis\.*Journal of Open Source Software*, 4\(36\):1237, 2019\.
- Lina & Mayrand \(1995\)Lina, J\.\-M\. and Mayrand, M\.Complex daubechies wavelets\.*Applied and Computational Harmonic Analysis*, 2\(3\):219–229, 1995\.
- Liu et al\. \(2023\)Liu, S\., Kimura, T\., Liu, D\., Wang, R\., Li, J\., Diggavi, S\., Srivastava, M\., and Abdelzaher, T\.Focal: Contrastive learning for multimodal time\-series sensing signals in factorized orthogonal latent space\.*Advances in Neural Information Processing Systems*, 36, 2023\.
- Masserano et al\. \(2025\)Masserano, L\., Ansari, A\. F\., Han, B\., Zhang, X\., Faloutsos, C\., Mahoney, M\. W\., Wilson, A\. G\., Park, Y\., Rangapuram, S\. S\., Maddix, D\. C\., et al\.Enhancing foundation models for time series forecasting via wavelet\-based tokenization\.In*Forty\-second International Conference on Machine Learning*, 2025\.
- Naghashi et al\. \(2025\)Naghashi, V\., Boukadoum, M\., and Diallo, A\. B\.A multiscale model for multivariate time series forecasting\.*Scientific Reports*, 15\(1\):1565, 2025\.
- Nath et al\. \(2023\)Nath, R\. K\., Tervonen, J\., Närväinen, J\., Pettersson, K\., and Mäntyjärvi, J\.Towards self\-supervised learning of ecg signal representation for the classification of acute stress types\.In*Proceedings of the Great Lakes Symposium on VLSI 2023*, pp\. 85–90, 2023\.
- \(40\)Nawrot, P\., Tworkowski, S\., Tyrolski, M\., Kaiser, Ł\., Wu, Y\., Szegedy, C\., and Michalewski, H\.Hierarchical transformers are more efficient language models\.
- Nie et al\. \(2023\)Nie, Y\., Nguyen, N\. H\., Sinthong, P\., and Kalagnanam, J\.A time series is worth 64 words: Long\-term forecasting with transformers\.In*The Eleventh International Conference on Learning Representations*, 2023\.
- Ouyang et al\. \(2022\)Ouyang, X\., Shuai, X\., Zhou, J\., Shi, I\. W\., Xie, Z\., Xing, G\., and Huang, J\.Cosmo: Contrastive fusion learning with small data for multimodal human activity recognition\.In*International Conference on Mobile Computing And Networking \(MobiCom\)*, 2022\.
- Percival & Walden \(2000\)Percival, D\. B\. and Walden, A\. T\.*Wavelet methods for time series analysis*, volume 4\.Cambridge university press, 2000\.
- Petrov et al\. \(2023\)Petrov, A\., La Malfa, E\., Torr, P\., and Bibi, A\.Language model tokenizers introduce unfairness between languages\.*Advances in neural information processing systems*, 36:36963–36990, 2023\.
- Piao et al\. \(2024\)Piao, X\., Chen, Z\., Murayama, T\., Matsubara, Y\., and Sakurai, Y\.Fredformer: Frequency debiased transformer for time series forecasting\.In*Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining*, pp\. 2400–2410, 2024\.
- Qiu et al\. \(2025\)Qiu, T\., Xie, Y\., Niu, H\., Xiong, Y\., and Gao, X\.Enhancing masked time\-series modeling via dropping patches\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, volume 39, pp\. 20077–20085, 2025\.
- Rao et al\. \(2021\)Rao, Y\., Zhao, W\., Liu, B\., Lu, J\., Zhou, J\., and Hsieh, C\.\-J\.Dynamicvit: Efficient vision transformers with dynamic token sparsification\.*Advances in neural information processing systems*, 34:13937–13949, 2021\.
- Reiss & Stricker \(2012\)Reiss, A\. and Stricker, D\.Introducing a new benchmarked dataset for activity monitoring\.In*International Symposium on Wearable Computers \(ISWC\)*, 2012\.
- Ren et al\. \(2025\)Ren, J\., Zheng, R\., Zhang, W\., She, D\., Bai, Y\., Jin, Z\., and Gao, Y\.Motion2press: Cross model learning from imu to plantar pressure for gait analysis\.*Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies*, 9\(3\):1–33, 2025\.
- Ryoo et al\. \(2021\)Ryoo, M\., Piergiovanni, A\., Arnab, A\., Dehghani, M\., and Angelova, A\.Tokenlearner: Adaptive space\-time tokenization for videos\.*Advances in neural information processing systems*, 34:12786–12797, 2021\.
- Saha et al\. \(2025\)Saha, M\., Xu, M\. A\., Mao, W\., Neupane, S\., Rehg, J\. M\., and Kumar, S\.Pulse\-ppg: An open\-source field\-trained ppg foundation model for wearable applications across lab and field settings\.*Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies*, 9\(3\):1–35, 2025\.
- Schmidt et al\. \(2018a\)Schmidt, P\., Reiss, A\., Duerichen, R\., Marberger, C\., and Van Laerhoven, K\.Introducing wesad, a multimodal dataset for wearable stress and affect detection\.In*Proceedings of the 20th ACM international conference on multimodal interaction*, pp\. 400–408, 2018a\.
- Schmidt et al\. \(2018b\)Schmidt, P\., Reiss, A\., Dürichen, R\., Marberger, C\., and Laerhoven, K\. V\.Introducing wesad, a multimodal dataset for wearable stress and affect detection\.In*ICMI 2018*, pp\. 400–408\. ACM, 2018b\.doi:10\.1145/3242969\.3242985\.
- Sennrich et al\. \(2016\)Sennrich, R\., Haddow, B\., and Birch, A\.Neural machine translation of rare words with subword units\.In*Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 1715–1725, 2016\.
- Shams et al\. \(2024\)Shams, S\., Dindar, S\. S\., Jiang, X\., and Mesgarani, N\.Ssamba: Self\-supervised audio representation learning with mamba state space model\.In*2024 IEEE Spoken Language Technology Workshop \(SLT\)*, pp\. 1053–1059\. IEEE, 2024\.
- Sztyler & Stuckenschmidt \(2016\)Sztyler, T\. and Stuckenschmidt, H\.On\-body localization of wearable devices: An investigation of position\-aware activity recognition\.In*IEEE International Conference on Pervasive Computing and Communications \(PerCom\)*, 2016\.
- Tao et al\. \(2024\)Tao, C\., Liu, Q\., Dou, L\., Muennighoff, N\., Wan, Z\., Luo, P\., Lin, M\., and Wong, N\.Scaling laws with vocabulary: Larger models deserve larger vocabularies\.*Advances in Neural Information Processing Systems*, 37:114147–114179, 2024\.
- Truong et al\. \(2020\)Truong, C\., Oudre, L\., and Vayatis, N\.Selective review of offline change point detection methods\.*Signal processing*, 167:107299, 2020\.
- Ullah et al\. \(2022\)Ullah, M\. A\., Chatterjee, S\., Fagundes, C\. P\., Lam, C\., Nahum\-Shani, I\., Rehg, J\. M\., Wetter, D\. W\., and Kumar, S\.mrisk: continuous risk estimation for smoking lapse from noisy sensor data with incomplete and positive\-only labels\.*Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies*, 6\(3\):1–29, 2022\.
- Vaswani et al\. \(2017\)Vaswani, A\., Shazeer, N\., Parmar, N\., Uszkoreit, J\., Jones, L\., Gomez, A\. N\., Kaiser, Ł\., and Polosukhin, I\.Attention is all you need\.In*Advances in neural information processing systems*, pp\. 5998–6008, 2017\.
- Wang et al\. \(2022\)Wang, L\., Li, W\., Sun, K\., Zhang, F\., Gu, T\., Xu, C\., and Zhang, D\.Loear: Push the range limit of acoustic sensing for vital sign monitoring\.*Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies*, 6\(3\):1–24, 2022\.
- Wang et al\. \(2023a\)Wang, X\., Zhang, H\., Cao, L\., Zeng, K\., Li, Q\., Li, N\., and Feng, L\.Contrastive learning of stress\-specific word embedding for social media based stress detection\.In*Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining*, pp\. 5137–5149, 2023a\.
- Wang et al\. \(2024\)Wang, Y\., Huang, N\., Li, T\., Yan, Y\., and Zhang, X\.Medformer: A multi\-granularity patching transformer for medical time\-series classification\.*Advances in Neural Information Processing Systems*, 37:36314–36341, 2024\.
- Wang et al\. \(2025\)Wang, Y\., Qiu, Y\., Chen, P\., Shu, Y\., Rao, Z\., Pan, L\., Yang, B\., and Guo, C\.Lightgts: A lightweight general time series forecasting model\.In*International Conference on Machine Learning*, pp\. 64109–64126\. PMLR, 2025\.
- Wang et al\. \(2023b\)Wang, Z\., Wang, Y\., Tian, M\., and Shen, J\.Hearfire: Indoor fire detection via inaudible acoustic sensing\.*Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies*, 6\(4\):1–25, 2023b\.
- Yao et al\. \(2017\)Yao, S\., Hu, S\., Zhao, Y\., Zhang, A\., and Abdelzaher, T\.Deepsense: A unified deep learning framework for time\-series mobile sensing data processing\.In*International Conference on World Wide Web \(WWW\)*, 2017\.
- Yao et al\. \(2019\)Yao, S\., Piao, A\., Jiang, W\., Zhao, Y\., Shao, H\., Liu, S\., Liu, D\., Li, J\., Wang, T\., Hu, S\., et al\.Stfnets: Learning sensing signals from the time\-frequency perspective with short\-time fourier neural networks\.In*The World Wide Web Conference*, pp\. 2192–2202, 2019\.
- Yi et al\. \(2023\)Yi, K\., Zhang, Q\., Fan, W\., Wang, S\., Wang, P\., He, H\., An, N\., Lian, D\., Cao, L\., and Niu, Z\.Frequency\-domain mlps are more effective learners in time series forecasting\.*Advances in Neural Information Processing Systems*, 36:76656–76679, 2023\.
- Yu & Sano \(2023\)Yu, H\. and Sano, A\.Semi\-supervised learning for wearable\-based momentary stress detection in the wild\.*Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies*, 7\(2\):1–23, 2023\.
- Zakaria et al\. \(2023\)Zakaria, C\., Yilmaz, G\., Mammen, P\. M\., Chee, M\., Shenoy, P\., and Balan, R\.Sleepmore: Inferring sleep duration at scale via multi\-device wifi sensing\.*Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies*, 6\(4\):1–32, 2023\.
- Zhang et al\. \(2022\)Zhang, X\., Zhao, Z\., Tsiligkaridis, T\., and Zitnik, M\.Self\-supervised contrastive pre\-training for time series via time\-frequency consistency\.In*Neural Information Processing Systems \(NeurIPS\)*, 2022\.
- Zhang et al\. \(2025\)Zhang, Y\., Ayush, K\., Qiao, S\., Heydari, A\. A\., Narayanswamy, G\., Xu, M\. A\., Metwally, A\., Xu, J\., Garrison, J\., Xu, X\., Althoff, T\., Liu, Y\., Kohli, P\., Zhan, J\., Malhotra, M\., Patel, S\., Mascolo, C\., Liu, X\., McDuff, D\., and Yang, Y\.SensorLM: Learning the language of wearable sensors\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025\.URL[https://openreview\.net/forum?id=TrHeq0yFhv](https://openreview.net/forum?id=TrHeq0yFhv)\.
- Zheng et al\. \(2025\)Zheng, N\., Liu, R\., Fan, X\., Zhang, C\., Zhang, L\., and Yin, Z\.Segall: A unified active learning framework for wireless sensing data segmentation\.*Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies*, 9\(3\):1–27, 2025\.
- Zhong et al\. \(2024\)Zhong, S\., Song, S\., Zhuo, W\., Li, G\., Liu, Y\., and Chan, S\.\-H\. G\.A multi\-scale decomposition mlp\-mixer for time series analysis\.*Proceedings of the VLDB Endowment*, 17\(7\):1723–1736, 2024\.
- Zhou et al\. \(2023\)Zhou, T\., Niu, P\., Sun, L\., Jin, R\., et al\.One fits all: Power general time series analysis by pretrained lm\.*Advances in neural information processing systems*, 36:43322–43355, 2023\.
- Zou et al\. \(2024\)Zou, X\., You, C\., Zhao, R\., Yang, H\., and Cheng, X\.Scalemixer: A multi\-scale mlp\-mixer model for long\-term time series forecasting\.In*International Conference on Neural Information Processing*, pp\. 44–58\. Springer, 2024\.

## Appendix

The appendix of this paper is structured as follows\.

- Section [A](https://arxiv.org/html/2605.14014#A1) details the evaluated datasets and statistics.
- Section [B](https://arxiv.org/html/2605.14014#A2) describes the baselines and the backbone encoders.
- Section [C](https://arxiv.org/html/2605.14014#A3) elaborates on the methodology.
- Section [D](https://arxiv.org/html/2605.14014#A4) specifies the implementation and training details.
- Section [E](https://arxiv.org/html/2605.14014#A5) provides additional evaluation results.
- Section [F](https://arxiv.org/html/2605.14014#A6) presents an additional case study.
- Section [G](https://arxiv.org/html/2605.14014#A7) discusses the limitations and future work.

Table 6: Dataset Statistics. Acc stands for Accelerometer, ECG for Electrocardiogram, and EMG for Electromyogram.

| Dataset Name | Classes | Modalities (Frequency) | Channel Size | Sample Length | # Samples |
| --- | --- | --- | --- | --- | --- |
| Ego4D (Grauman et al., [2022](https://arxiv.org/html/2605.14014#bib.bib20)) | 10 | Acc (100 Hz) | 1 | 3000 (30 seconds) | 98315 |
| MOD (Liu et al., [2023](https://arxiv.org/html/2605.14014#bib.bib36)) | 7 | Audio (8000 Hz), Seismic (100 Hz) | 1 | Audio: 16000, Seismic: 200 (2 seconds) | 8828 |
| PAMAP2 (Reiss & Stricker, [2012](https://arxiv.org/html/2605.14014#bib.bib48)) | 18 | Acc (100 Hz), Gyroscope (100 Hz) | 3 | 360 (2 seconds) | 7075 |
| RWHAR (Sztyler & Stuckenschmidt, [2016](https://arxiv.org/html/2605.14014#bib.bib56)) | 8 | Acc (100 Hz), Gyroscope (100 Hz) | 3 | 450 (5 seconds) | 12888 |
| WESAD (Schmidt et al., [2018a](https://arxiv.org/html/2605.14014#bib.bib52)) | 8 | ECG (700 Hz), EMG (700 Hz) | 1 | 2100 (5 seconds) | 11229 |
| Ego4D-S (Grauman et al., [2022](https://arxiv.org/html/2605.14014#bib.bib20)) | 10 | Acc (100 Hz) | 3 | 500 (5 seconds) | 59390 |
| MOD-OOD (Liu et al., [2023](https://arxiv.org/html/2605.14014#bib.bib36)) | 4 | Audio (8000 Hz), Seismic (100 Hz) | 1 | Audio: 16000, Seismic: 200 (2 seconds) | 35162 |

## Appendix A Dataset and Preprocessing

This section provides details on the datasets and preprocessing used in the experiments. Table [6](https://arxiv.org/html/2605.14014#Ax1.T6) shows the dataset statistics.

### A.1 Datasets

- Ego4D (Grauman et al., [2022](https://arxiv.org/html/2605.14014#bib.bib20)) is a large-scale egocentric dataset containing 836 hours of IMU recordings collected across 74 locations in 9 countries. We use the 100 Hz accelerometer signals covering 10 complex activities (*e.g.*, cleaning, crafting, cooking) and segment each sequence into 5-second and 30-second clips for long-context evaluation and cross-sequence generalization. Data are randomly split by sequence into training, validation, and test sets following a 70:15:15 ratio.
- Moving Object Detection (MOD) (Liu et al., [2023](https://arxiv.org/html/2605.14014#bib.bib36)) is a public dataset of seismic and acoustic signals collected from diverse moving objects across multiple environments, designed for nearby object classification that can enhance human safety and situational awareness. For example, roadside sensing nodes can detect approaching vehicles or objects near intersections and pedestrian zones, offering early alerts to drivers and pedestrians. We segment each signal into 2-second samples, using 100 Hz seismic data for short-context evaluation and 8 kHz acoustic data for long-context evaluation. The dataset also provides an out-of-distribution split with different object types and environments, which we use to assess cross-domain generalization.
- Physical Activity Monitoring (PAMAP2) (Reiss & Stricker, [2012](https://arxiv.org/html/2605.14014#bib.bib48)) is an open dataset for human activity recognition, collected from 9 subjects wearing three IMUs on the chest, wrist, and ankle. It includes 18 physical activities, such as walking, running, and cycling. For our experiments, we use the 100 Hz accelerometer and gyroscope signals, segmented into 2-second samples for short-context evaluation across both modalities.
- Real World Human Activity Recognition (RWHAR) (Sztyler & Stuckenschmidt, [2016](https://arxiv.org/html/2605.14014#bib.bib56)) is a public IMU dataset collected from 15 subjects performing 8 everyday activities, such as stair climbing, jumping, and walking. Sensors were placed on seven body locations, and we use the accelerometer and magnetometer signals from the waist, segmented into 9-second samples. Ten subjects are used for training, two for validation, and the remainder for testing. Both modalities are evaluated under short-context settings.
- Wearable Stress and Affect Detection (WESAD) (Schmidt et al., [2018a](https://arxiv.org/html/2605.14014#bib.bib52)) is a multimodal physiological dataset for stress and affect recognition, collected from 15 subjects using a chest-worn RespiBAN and a wrist-worn Empatica E4. It includes ECG, EDA, EMG, respiration, BVP, body temperature, and IMU signals. For our experiments, we use the 700 Hz ECG and EMG data from the RespiBAN, segmented into 3-second samples labeled as *stress* or *amusement*. Subjects are randomly divided into training (11), validation (2), and test (2) groups.

### A.2 Preprocessing

Each sample is a time-series segment of shape $C \times L$, where $C$ is the number of channels and $L$ is the sequence length. During training, one augmentation is randomly selected from a pool of time-domain transformations: permutation, scaling, negation, horizontal flip, time warping, and magnitude warping. The selected augmentation is then applied with a probability of 0.5 to enhance variability and generalization.
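The select-then-apply logic described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names, the scaling range of 0.8 to 1.2, and the list-of-lists signal layout are assumptions, and only three of the six listed transformations are implemented.

```python
import random

def scale(x, rng):
    """Multiply all channels by one random gain (the 0.8-1.2 range is assumed)."""
    gain = rng.uniform(0.8, 1.2)
    return [[gain * v for v in channel] for channel in x]

def negate(x, rng):
    """Flip the sign of every value."""
    return [[-v for v in channel] for channel in x]

def horizontal_flip(x, rng):
    """Reverse each channel along the time axis."""
    return [channel[::-1] for channel in x]

# Permutation, time warping, and magnitude warping are omitted for brevity.
AUGMENTATIONS = [scale, negate, horizontal_flip]

def augment(x, rng, p=0.5):
    """Pick one augmentation uniformly at random, then apply it with probability p."""
    chosen = rng.choice(AUGMENTATIONS)
    if rng.random() < p:
        return chosen(x, rng)
    return [list(channel) for channel in x]  # unchanged copy of the C x L sample
```

Each call draws a single transformation, so at most one augmentation touches a given sample per training step.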

## Appendix B Baselines

### B.1 Tokenization Baselines

To isolate the effect of tokenization, we restrict our comparisons to baselines that do not modify the underlying backbone architectures\. All methods operate on the same sequence encoders with identical model configurations, differing only in how input signals are tokenized or segmented\. This ensures a fair and controlled evaluation, where performance differences can be attributed to tokenization strategies rather than architectural changes or additional model capacity\.

- •PatchTST\(Nie et al\.,[2023](https://arxiv.org/html/2605.14014#bib.bib41)\)tokenizes time\-series inputs by segmenting them into fixed\-length subseries, or*patches*, which serve as input tokens to the backbone encoder\. Each univariate channel is patched and processed independently, with shared backbone weights across channels\. The model relies on a fixed patch size and stride to control granularity and overlap between patches\. We extensively tune these parameters and report the best\-performing configuration\.
- •DropPatch\(Qiu et al\.,[2025](https://arxiv.org/html/2605.14014#bib.bib46)\)follows the same fixed\-size patching strategy as PatchTST, segmenting signals into subseries defined by preset patch size and stride\. It randomly drops a portion of tokens during training to improve robustness and reduce overfitting\. For fair comparison, we adopt the same optimal patch size and stride as PatchTST and fix the token drop rate at 0\.5 across all experiments\.
- •MedFormer\(Wang et al\.,[2024](https://arxiv.org/html/2605.14014#bib.bib63)\)adopts a multi\-granularity patching strategy to model temporal dependencies at multiple resolutions\. It builds parallel token streams with varying patch sizes and strides to model fine\- and coarse\-grained temporal dynamics\. For fair comparison, we use the optimal patch size and stride from PatchTST as the base configuration and set additional granularities to 0\.5×, 1×, 2×, and 4× of the base values\.
- •WaveToken\(Masserano et al\.,[2025](https://arxiv.org/html/2605.14014#bib.bib37)\)converts quantized wavelet coefficients into discrete token IDs as opposed to projecting patches into a continuous embedding space\. By decomposing inputs via the discrete wavelet transform, it represents time\-localized frequencies as quantized tokens, similar to language model tokens, for learning in a discrete, frequency\-aware space\. Due to the large vocabulary size introduced by wavelet quantization, we evaluate WaveToken only on short\-context signals to mitigate memory overhead\.
- •MultiPatchFormer\(Naghashi et al\.,[2025](https://arxiv.org/html/2605.14014#bib.bib38)\)introduces a multi\-scale patch embedding strategy to capture temporal dependencies at different resolutions and cross\-channel correlations across signal dimensions\. Each time series is divided into multiple patch streams with distinct sizes and strides, enabling joint modeling of fine\- and coarse\-grained dynamics\. Streams are embedded independently and projected into a shared feature space for cross\-scale interaction\. For consistency, we use the optimal patch size and stride from PatchTST as the base configuration and extend additional granularities to 0\.5×, 1×, 2×, and 4× of the base values\.
- •LightGTS (Wang et al., [2025](https://arxiv.org/html/2605.14014#bib.bib64)) is a general time series forecasting architecture that handles diverse sampling scales and intrinsic frequencies during multi-source pre-training. It employs periodical tokenization, which adaptively divides each sample into patches based on cycle length.
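Several of the baselines above share the fixed-size patching scheme, so a small sketch may help make the resulting token counts concrete. The helper names are hypothetical, the patch size of 16 and stride of 8 are illustrative values rather than the tuned configuration, and any trailing remainder is simply dropped here without padding.

```python
def patchify(channel, patch_len, stride):
    """Split one univariate channel into fixed-length, possibly overlapping patches."""
    last_start = len(channel) - patch_len
    return [channel[i:i + patch_len] for i in range(0, last_start + 1, stride)]

def num_patches(length, patch_len, stride):
    """Closed-form token count for a sequence of the given length."""
    return max(0, (length - patch_len) // stride + 1)

# A 2-second, 100 Hz seismic sample (200 points) versus an 8 kHz acoustic one (16000 points).
seismic_tokens = num_patches(200, 16, 8)
acoustic_tokens = num_patches(16000, 16, 8)
```

The token count grows linearly with the sample length regardless of signal content, which is the behavior the dynamic tokenization in Appendix C is designed to avoid.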

### B.2 Backbone Encoders

- •Transformer\(Vaswani et al\.,[2017](https://arxiv.org/html/2605.14014#bib.bib60)\)has been widely used for time\-series modeling\. It processes sequential inputs through multi\-head self\-attention layers, capturing long\-range dependencies and contextual relationships between tokens\. For time\-series data, each patch or segment is treated as an input token, and positional encoding is added to preserve temporal ordering\.
- •Mamba2\(Dao & Gu,[2024](https://arxiv.org/html/2605.14014#bib.bib11)\)is a recently developed state\-space model designed for efficient long\-sequence modeling\. Unlike attention\-based architectures with complexity growing quadratically with the sequence length, Mamba2 employs a recurrent state\-space formulation with linear computational complexity for scalable inference over extended time horizons\. It models temporal dependencies via continuous\-time dynamics and selective state updates, efficiently capturing both local and long\-range signal patterns\.

## Appendix C Methodology

**Algorithm 1** Dywave: Dynamic Tokenization Pipeline

**Input:** Raw signal $X \in \mathbb{R}^{C \times L}$, compression ratio $\tau$, partition index $K$, downsampling factor $s$

1. **Step 1: Wavelet Decomposition**
2. $\{dX_1, \ldots, dX_J, A\} \leftarrow \text{MODWT}(X)$
3. **Step 2: Hierarchical Embedding**
4. $X^{U} \leftarrow \text{Concat}(X, dX_1, \ldots, dX_K)$ {Detail stream}
5. $X^{V} \leftarrow \text{Concat}(dX_{K+1}, \ldots, dX_J, A)$ {Context stream}
6. $E^{U} \leftarrow \text{Conv1D}(X^{U})$
7. $E^{V} \leftarrow \text{Interpolate}(\text{TransformerBlock}(\text{AvgPool1D}(\text{Linear}(X^{V}), s)), L)$
8. $E^{F} \leftarrow \text{Concat}(E^{U}, E^{V})$
9. **Step 3: Temporal Anchor Formation**
10. $P_t \leftarrow 1 - \text{CosineSim}(F_k(E^{F}_{t-1}), F_q(E^{F}_{t}))$ for $t \in [2, L]$
11. $\mathcal{A} \leftarrow \text{TopK}(\text{NMS}(P, w_{\text{nms}}), \lceil \tau \cdot L \rceil)$ {$w_{\text{nms}} = \lfloor L / (2 \lceil \tau L \rceil) \rfloor$}
12. **Step 4: Dynamic Temporal Fusion**
13. $\kappa(t) \leftarrow \arg\min_{a \in \mathcal{A}} |t - a|$ for $t \in [1, L]$
14. $E^{A}_{k} \leftarrow \sum_{t:\kappa(t)=a_k} P_t \cdot E^{F}_{t} \,/\, (\sum_{t:\kappa(t)=a_k} P_t + \varepsilon)$ for $k \in [1, |\mathcal{A}|]$
15. $E \leftarrow \text{MLP}(E^{A})$
16. **Step 5: Reconstruction Loss (Training Only)**
17. $\hat{W} \leftarrow \text{AdaptivePool}(\text{ConvTranspose1D}(\text{Linear}(E)), L)$
18. $\mathcal{L}_{\text{rec}} \leftarrow \text{MSE}(\{dX_1, \ldots, dX_J, A\}, \hat{W})$
19. **return** $E$, $\mathcal{L}_{\text{rec}}$

Dywave transforms raw sensing signals into compact, event-aligned token embeddings through a five-stage pipeline. Given an input $X \in \mathbb{R}^{C \times L}$, the framework first applies wavelet decomposition to extract multi-resolution coefficients capturing both high-frequency transients and low-frequency trends. These coefficients are then partitioned into detail and context streams, which are encoded via separate pathways and fused into per-timestep embeddings. A learned saliency function identifies temporal anchors at semantic transitions, and neighboring timesteps are aggregated into anchor-aligned tokens through saliency-weighted fusion. An auxiliary reconstruction objective ensures the compressed tokens preserve multi-scale signal structure. Algorithm [1](https://arxiv.org/html/2605.14014#alg1) summarizes the complete pipeline.

### C.1 Signal Decomposition through Maximal Overlap DWT

To capture the temporal structure of time-series signals, we apply the *Maximal Overlap Discrete Wavelet Transform* (MODWT) (Percival & Walden, [2000](https://arxiv.org/html/2605.14014#bib.bib43)) to decompose raw inputs into multi-resolution *context* and *detail* components. Unlike the standard DWT, which downsamples the signal and imposes constraints on sequence length (Larrubia et al., [2025](https://arxiv.org/html/2605.14014#bib.bib33)), MODWT is *undecimated*, preserving the original sequence length across all scales. This ensures temporal alignment between coefficients and the raw signal, producing a coherent multi-frequency *snapshot* at each timestamp. Formally, for an input $X \in \mathbb{R}^{C \times L}$ with $C$ channels and length $L$, MODWT computes wavelet and scaling coefficients at each scale $j \in [1, J]$ and timestep $t \in [1, L]$ as:

$$dX_{j,t} = \widetilde{W}_{j,t} = \sum_{l=0}^{L_j - 1} \widetilde{h}_{j,l}\, A_{j-1,\,(t-l) \bmod L}, \qquad A_{j,t} = \widetilde{V}_{j,t} = \sum_{l=0}^{L_j - 1} \widetilde{g}_{j,l}\, A_{j-1,\,(t-l) \bmod L}, \qquad A_{0,t} = X_t. \tag{10}$$

Here, $\widetilde{W}_j$ denotes the *discrete* wavelet coefficients at level $j \in [1, J]$, $A_j$ the corresponding *approximations*, $\widetilde{h}_j$ and $\widetilde{g}_j$ the rescaled wavelet and scaling filters, and $L_j$ the effective filter length. Since MODWT is undecimated, both $dX_j$ and $A_j$ preserve the full temporal resolution of the original length-$L$ sequence.

The recursive formulation produces a hierarchy of approximations $\{A_j\}_{j \in \{1, \dots, J\}}$ and details $\{dX_j\}_{j \in \{1, \dots, J\}}$. We retain the discrete coefficients at every level and the coarsest approximation, yielding the MODWT output:

$$\{dX_1, dX_2, \ldots, dX_J, A\} = \text{MODWT}(X) \in \mathbb{R}^{(J+1) \times C \times L}. \tag{11}$$

MODWT disentangles fine transients from coarse contextual structure while maintaining precise temporal alignment across all scales, producing an interpretable, time-consistent hierarchy of signal representations for arbitrary sequence lengths $L$.

We leverage PyWavelets\(Lee et al\.,[2019](https://arxiv.org/html/2605.14014#bib.bib34)\)as the MODWT implementation\.
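To make the recursion in Eq. 10 concrete, the following is a minimal pure-Python sketch for a single channel. It assumes the Haar wavelet, whose rescaled filters are $\widetilde{h} = (1/2, -1/2)$ and $\widetilde{g} = (1/2, 1/2)$; the wavelet choice and the function name `haar_modwt` are assumptions for illustration, and the sketch does not call PyWavelets.

```python
def haar_modwt(x, levels):
    """Undecimated Haar decomposition of one channel with circular boundaries.

    Returns ([dX_1, ..., dX_J], A), where every output has the same length as x.
    """
    n = len(x)
    approx = list(x)  # A_0 = X
    details = []
    for j in range(1, levels + 1):
        lag = 2 ** (j - 1)  # the circular lag doubles at each level
        prev = approx
        details.append([0.5 * (prev[t] - prev[(t - lag) % n]) for t in range(n)])
        approx = [0.5 * (prev[t] + prev[(t - lag) % n]) for t in range(n)]
    return details, approx

signal = [0.0, 0.0, 0.0, 0.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0]
details, approx = haar_modwt(signal, levels=3)
```

Because nothing is downsampled, the length-10 input works without padding to a power of two. For Haar specifically, the details and the final approximation sum back to the input at every timestep, which gives a quick sanity check on the time alignment.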

### C.2 Hierarchical Embedding

![Refer to caption](https://arxiv.org/html/2605.14014v1/x16.png)

Figure 10: Physics-Informed Hierarchical Embedding Module.

Figure [10](https://arxiv.org/html/2605.14014#A3.F10) illustrates the hierarchical embedding module. Given the MODWT decomposition, we partition coefficients into detail and context streams based on frequency characteristics. The detail stream $X^{U} = \{X, dX_1, \ldots, dX_K\}$ captures high-frequency transients, while the context stream $X^{V} = \{dX_{K+1}, \ldots, dX_J, A\}$ encodes low-frequency trends. We set $K = 1$ by default, assigning the most fine-grained signal component to the detail stream.

Detail Embedding: To capture the detail embedding, we leverage convolution layers to extract fine\-grained representations from the detail streams:

$$E^{U} = \text{Conv}(X^{U}), \quad E^{U} \in \mathbb{R}^{C \times d_U \times L}, \ X^{U} \in \mathbb{R}^{C \times (K+1) \times L}, \tag{12}$$

where $d_U$ is the detail embedding dimension projected from the detail stream.

Context Embedding: Unlike the detail stream, which benefits from localized convolution, modeling the context stream requires capturing long-range relationships across time and scale. To achieve this efficiently, we adopt self-attention within an *hourglass transformer* architecture ([Nawrot et al.,](https://arxiv.org/html/2605.14014#bib.bib40)) that first compresses and then expands the temporal resolution. The context stream is projected into a latent space and adaptively downsampled according to the available computation budget:

$$X^{V}_{\text{down}} = \text{Conv1D}(\text{Linear}(X^{V}),\ \text{stride} = s), \quad X^{V} \in \mathbb{R}^{C \times (J+1-K) \times L},\ X^{V}_{\text{down}} \in \mathbb{R}^{C \times L^{\text{context}} \times d_V}, \tag{13}$$

where $d_V$ is the embedding dimension, and the stride $s$ is dynamically selected such that the downsampled length $L^{\text{context}} = L/s$ fits within the computation budget. A lightweight transformer encoder then models global dependencies over the shortened sequence, and the resulting context embedding is interpolated back to the original sequence length $L$ for alignment with the detail stream:

$$E^{V} = \text{Interpolation}(\text{Transformer}(X^{V}_{\text{down}}),\ \text{size} = L), \quad E^{V} \in \mathbb{R}^{C \times L \times d_V}. \tag{14}$$
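The compress-then-expand shape handling of the context pathway can be illustrated on a single scalar sequence. This sketch follows the average-pooling variant listed in Algorithm 1, replaces the transformer with the identity, and uses endpoint-aligned linear interpolation; all three choices, as well as the function names, are assumptions made for illustration.

```python
def avg_pool(seq, s):
    """Non-overlapping average pooling with window and stride s (remainder dropped)."""
    return [sum(seq[i:i + s]) / s for i in range(0, len(seq) - s + 1, s)]

def interpolate_linear(seq, size):
    """Linearly resample seq to the requested length, aligning both endpoints."""
    m = len(seq)
    if m == 1 or size == 1:
        return [seq[0]] * size
    out = []
    for i in range(size):
        pos = i * (m - 1) / (size - 1)  # fractional index into seq
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        out.append((1 - frac) * seq[lo] + frac * seq[hi])
    return out

context = [float(t) for t in range(12)]         # length L = 12
compressed = avg_pool(context, 4)               # length L / s = 3
restored = interpolate_linear(compressed, 12)   # back to length L
```

Any sequence model placed between the two calls only ever sees $L/s$ positions, which is where the savings of the hourglass design come from.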
Fusion of Embeddings: Decomposing signals into detail and context streams provides a physics-informed representation that captures both instantaneous cues and global structure. For example, in IMU signals, the detail embedding $E^{U}$ can capture abrupt, high-frequency transients such as wrist flicks or foot impacts, while the context embedding $E^{V}$ interprets these as transitions between broader activity phases, such as moving from walking to standing or from wiping to resting. To integrate these complementary views, we fuse the two embeddings into a unified hierarchical embedding:

$$E^{F} = E^{U} \,\|\, E^{V}, \quad E^{F} \in \mathbb{R}^{C \times L \times d};\ d = d_V + d_U. \tag{15}$$

### C.3 Temporal Anchor Formation

![Refer to caption](https://arxiv.org/html/2605.14014v1/x17.png)

Figure 11: Temporal Anchor Formation Module.

The hierarchical embeddings encode both fine-grained transients and coarse-grained contexts, providing multi-scale information for detecting event anchors. Rather than relying on fixed intervals, Dywave identifies anchors at semantic transitions in the embedding space. Figure [11](https://arxiv.org/html/2605.14014#A3.F11) illustrates this module.

Saliency Estimation via Key-Query Projections. To detect transitions between events, each fused embedding is projected into key and query spaces through learned linear mappings:

$$k_t = F_k(E^{F}_{t}), \quad q_t = F_q(E^{F}_{t}), \quad F_k, F_q: \mathbb{R}^{d} \rightarrow \mathbb{R}^{d/4}. \tag{16}$$

The saliency at each timestep measures representational dissimilarity between adjacent frames:

$$P_t = 1 - \text{CosineSim}(k_{t-1}, q_t), \qquad t \in [2, L]. \tag{17}$$

Cosine similarity provides scale-invariant transition detection: continuous physical processes produce slowly varying embeddings with high similarity (low $P_t$), whereas genuine event transitions induce abrupt representational shifts that yield high saliency.
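A small sketch of the saliency computation in Eq. 17 follows. The learned projections $F_k$ and $F_q$ are replaced by identity maps, and the first timestep, for which Eq. 17 is undefined, is given a saliency of 0; both are assumptions of this sketch, as are the function names.

```python
import math

def cosine_sim(a, b, eps=1e-12):
    """Cosine similarity of two vectors; eps guards against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def saliency(embeddings, f_k=lambda e: e, f_q=lambda e: e):
    """P_t = 1 - cos(F_k(E_{t-1}), F_q(E_t)) for each adjacent pair of timesteps."""
    scores = [0.0]  # placeholder for the first timestep (assumed value)
    for t in range(1, len(embeddings)):
        scores.append(1.0 - cosine_sim(f_k(embeddings[t - 1]), f_q(embeddings[t])))
    return scores

# Two stable regimes separated by one abrupt change in direction.
E = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
P = saliency(E)
```

In this toy input the magnitude change from `[1, 0]` to `[2, 0]` produces almost no saliency, while the change of direction at the third timestep produces a score close to 1, matching the scale-invariance argument above.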

Anchor Selection via NMS and TopK. The raw saliency sequence may contain noisy or redundant peaks. To extract a compact set of anchors, we apply temporal non-maximum suppression followed by top-$k$ selection:

$$\mathcal{A} = \text{TopK}(\text{NMS}(P, w_{\text{nms}}), \lceil \tau \cdot L \rceil), \tag{18}$$

where $\tau \in (0, 1)$ is the target compression ratio controlling the maximum number of output tokens. NMS with window size $w_{\text{nms}} = \lfloor L / (2 \lceil \tau L \rceil) \rfloor$ retains only the local maximum within each window, enforcing minimum separation between anchors.

In practice, NMS often produces fewer anchors than the budget $\lceil \tau \cdot L \rceil$, particularly for signals with sparse event transitions or long stationary intervals. The subsequent TopK operation serves as a safeguard against over-tokenization by capping the maximum token count to prevent computational explosion for signals with unusually dense saliency peaks. When NMS already yields fewer than the budget, TopK passes through all anchors.
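The interaction between NMS and the TopK cap can be sketched as follows. The exact NMS windowing is not spelled out in the text, so this sketch assumes a symmetric window of $w_{\text{nms}}$ timesteps on each side, breaks ties in favor of the earliest index, and ignores zero-saliency timesteps; the function name `select_anchors` is hypothetical.

```python
import math

def select_anchors(saliency, tau):
    """Temporal NMS followed by a TopK cap of ceil(tau * L) anchors."""
    L = len(saliency)
    budget = math.ceil(tau * L)
    w = max(1, L // (2 * budget))  # w_nms from Eq. 18, floored at 1
    survivors = []
    for t, p in enumerate(saliency):
        lo, hi = max(0, t - w), min(L, t + w + 1)
        is_peak = p == max(saliency[lo:hi]) and saliency.index(p, lo, hi) == t
        if p > 0 and is_peak:
            survivors.append(t)
    # Keep the most salient survivors, then restore temporal order.
    strongest = sorted(survivors, key=lambda t: saliency[t], reverse=True)[:budget]
    return sorted(strongest)

P = [0.0, 0.1, 0.9, 0.2, 0.0, 0.0, 0.05, 0.8, 0.1, 0.0, 0.0, 0.3]
anchors_loose = select_anchors(P, tau=0.25)  # budget of 3 tokens
anchors_tight = select_anchors(P, tau=0.10)  # budget of 2 tokens
```

With the looser budget all three surviving peaks become anchors, whereas the tighter budget drops the weakest peak, which is the capping behavior described above.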

### C.4 Dynamic Temporal Fusion

Once anchors are identified, Dywave consolidates embeddings within each segment into event-aligned representations. Real-world signals often contain long intervals of stable dynamics where consecutive timesteps convey similar semantics; uniform patching wastes computation on such redundant information. Dynamic temporal fusion addresses this by adaptively compressing coherent regions while preserving semantic integrity at event boundaries. Figure [12](https://arxiv.org/html/2605.14014#A3.F12) illustrates this module.

Anchor-Centered Cluster Formation. Each anchor corresponds to a semantically significant moment marking a shift in local dynamics. During fusion, anchors serve as cluster centers, with surrounding timesteps assigned to their nearest anchor:

$$\kappa(t) = \arg\min_{a \in \mathcal{A}} |t - a|, \qquad t \in [1, L]. \tag{19}$$

This induces a partition of the sequence into $|\mathcal{A}|$ contiguous segments with $O(L)$ complexity. Each segment groups temporally coherent timesteps that share similar dynamics, forming natural clusters aligned with signal semantics.

![Refer to caption](https://arxiv.org/html/2605.14014v1/x18.png)

Figure 12: Temporal Fusion Module.

Saliency-Weighted Aggregation. Within each cluster, embeddings are aggregated using saliency scores as weights:

$$E^{A}_{k} = \frac{\sum_{t:\kappa(t)=a_k} P_t \cdot E^{F}_{t}}{\sum_{t:\kappa(t)=a_k} P_t + \varepsilon}, \qquad k \in [1, |\mathcal{A}|], \tag{20}$$

where $\varepsilon = 10^{-6}$ ensures numerical stability. This weighting scheme emphasizes timesteps with higher saliency, which correspond to event boundaries and informative transitions, while down-weighting redundant intervals with low saliency. As a result, the fused embedding captures distinctive boundary characteristics while compactly representing stationary regions.

The aggregated embeddings are projected to the final token embedding space via a two\-layer MLP:

$$E = \text{MLP}(E^{A}), \quad E \in \mathbb{R}^{|\mathcal{A}| \times d}, \tag{21}$$

where $d$ is the final token embedding dimension, equal to the hidden dimension of the backbone encoder.
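Eqs. 19 and 20 can be sketched together for a single channel. This is an illustration under stated assumptions: timesteps are 0-indexed, ties between two equally distant anchors go to the earlier one, the final MLP of Eq. 21 is omitted, and the nearest-anchor search is written as a simple scan rather than the linear-time pass mentioned in the text. The function names are hypothetical.

```python
def assign_to_anchors(length, anchors):
    """kappa(t): index of the nearest anchor for every timestep (ties go to the earlier anchor)."""
    return [min(range(len(anchors)), key=lambda k: abs(t - anchors[k]))
            for t in range(length)]

def fuse(embeddings, saliency, anchors, eps=1e-6):
    """Saliency-weighted average of the per-timestep embeddings in each anchor's cluster."""
    kappa = assign_to_anchors(len(embeddings), anchors)
    dim = len(embeddings[0])
    numer = [[0.0] * dim for _ in anchors]
    denom = [0.0 for _ in anchors]
    for t, k in enumerate(kappa):
        denom[k] += saliency[t]
        for i in range(dim):
            numer[k][i] += saliency[t] * embeddings[t][i]
    return [[v / (denom[k] + eps) for v in numer[k]] for k in range(len(anchors))]

E_f = [[float(t)] for t in range(10)]   # L = 10 timesteps, d = 1
P = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
tokens = fuse(E_f, P, anchors=[2, 7])   # two tokens, dominated by timesteps 2 and 7
```

Because only the anchor timesteps carry saliency in this toy input, each token is essentially the embedding at its anchor; with flatter saliency the tokens move toward the plain cluster mean.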

### C.5 Reconstruction Decoder

Dynamic fusion compresses variable\-length segments into fixed\-dimensional token embeddings, which risks discarding fine\-grained signal characteristics\. To regularize this compression and ensure fused tokens preserve multi\-scale structure, Dywave employs an auxiliary reconstruction objective during training\.

Decoder Architecture. The decoder reconstructs MODWT coefficients from compressed tokens through three stages:

$$\hat{W} = \text{AdaptivePool}(\text{ConvTranspose1D}(\text{Linear}(E)), L) \in \mathbb{R}^{(J+1) \times C \times L}. \tag{22}$$

First, a linear layer projects each token embedding from dimension $d$ to an intermediate dimension matching the number of wavelet levels times channels: $\text{Linear}: \mathbb{R}^{d} \rightarrow \mathbb{R}^{(J+1) \cdot C}$. Second, a transposed 1D convolution expands the temporal dimension, producing an initial reconstruction at a length proportional to $|\mathcal{A}|$. Finally, adaptive average pooling resamples to the original sequence length $L$, ensuring temporal alignment with the MODWT coefficients regardless of compression ratio.

Wavelet Supervision. Rather than reconstructing the raw signal directly, we supervise in the wavelet coefficient domain:

$$\mathcal{L}_{\text{rec}} = \text{MSE}(\{dX_1, \ldots, dX_J, A\}, \hat{W}). \tag{23}$$

This formulation offers two advantages. First, it explicitly enforces preservation of both high-frequency transients (via detail coefficients $\{dX_j\}$) and low-frequency trends (via approximation $A$), preventing the model from collapsing to smooth reconstructions that ignore rapid dynamics. Second, supervising at multiple scales provides richer gradient signals than raw signal MSE, which can be dominated by low-frequency components.

Training and Inference. The reconstruction loss is weighted by $\lambda_{\text{rec}} = 0.1$ and combined with the task loss: $\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda_{\text{rec}} \cdot \mathcal{L}_{\text{rec}}$. At inference, the decoder is discarded entirely, adding no computational overhead to the forward pass. The reconstruction objective thus serves as a regularizer that encourages information-preserving compression.
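As a worked illustration of Eq. 23 and the combined objective, the sketch below computes the MSE over a stack of wavelet bands and adds it to a task loss with $\lambda_{\text{rec}} = 0.1$. The band values and the task loss of 1.0 are made-up numbers, and the function names are hypothetical.

```python
def wavelet_mse(target_bands, predicted_bands):
    """Mean squared error over all coefficients of all bands (details and approximation)."""
    target = [v for band in target_bands for v in band]
    predicted = [v for band in predicted_bands for v in band]
    return sum((a - b) ** 2 for a, b in zip(target, predicted)) / len(target)

def total_loss(task_loss, rec_loss, lambda_rec=0.1):
    """L = L_task + lambda_rec * L_rec."""
    return task_loss + lambda_rec * rec_loss

target = [[1.0, 2.0], [0.0, 0.0]]      # e.g. one detail band and the approximation
predicted = [[1.0, 0.0], [0.0, 2.0]]
rec = wavelet_mse(target, predicted)    # (0 + 4 + 0 + 4) / 4 = 2.0
loss = total_loss(task_loss=1.0, rec_loss=rec)
```

Every band contributes equally per coefficient here, so an error in a small-amplitude detail band is penalized on the same footing as an error in the approximation.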

## Appendix D Training and Implementation

### D.1 Training Details

All models are implemented in PyTorch 2.6.0 and trained on a single NVIDIA A5000 GPU with 24GB memory. We use the Adam optimizer (Kingma & Ba, [2014](https://arxiv.org/html/2605.14014#bib.bib30)) with an initial learning rate of $1 \times 10^{-4}$ and a cosine annealing scheduler that decays the learning rate to $1 \times 10^{-6}$ over the training period. Models are trained for 500 epochs. The batch size is set to 256 for short-context datasets and 64 for long-context datasets to accommodate memory constraints.
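For reference, the learning-rate trajectory implied by these settings can be written in closed form. The sketch assumes the standard cosine annealing formula with one scheduler step per epoch and no warm-up or restarts; these details are not stated in the text.

```python
import math

def cosine_lr(epoch, total_epochs=500, lr_max=1e-4, lr_min=1e-6):
    """Cosine-annealed learning rate after the given number of completed epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)

# Start, midpoint, and end of training.
schedule = [cosine_lr(e) for e in (0, 250, 500)]
```

Under these assumptions the rate starts at $1 \times 10^{-4}$, passes through $5.05 \times 10^{-5}$ at the midpoint, and reaches $1 \times 10^{-6}$ at epoch 500.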

### D.2 Generalization Finetuning Details

For cross-domain and sequence-length generalization experiments, we use two finetuning strategies. In *full backbone finetuning*, the tokenization module is frozen while the encoder and classification head are updated with a reduced learning rate of $1 \times 10^{-5}$. In *head-only finetuning*, both the tokenization module and encoder are frozen, and only the classification head is trained with the same learning rate. Both strategies use the Adam optimizer with a cosine scheduler.

## Appendix E Evaluation

### E.1 Ablation Studies: Variant Descriptions

We evaluate the following Dywave variants, each of which removes an individual component or replaces the original design with an alternative implementation:

- Dywave-w/oWave retains the anchor formation and dynamic fusion modules but replaces Dywave's hierarchical embedding module with PatchTST, isolating the contribution of the hierarchical decomposition.
- Dywave-FixedDWT applies the standard MODWT on PatchTST using the same optimal patch size and stride as PatchTST, dropping Dywave's hierarchical embeddings and adaptive segmentation to evaluate the effect of fixed-scale wavelet decomposition.
- Dywave-w/oRecon removes the reconstruction objective $\mathcal{L}_{\text{rec}}$ and trains the model only with the backbone task loss to examine the influence of the reconstruction branch on representation quality.
- Dywave-w/oFusion uses only the anchors from the anchor formation module without performing the subsequent fusion, allowing us to evaluate the role of the temporal fusion module.
- Dywave-CNNBound replaces the similarity-based anchor formation module with convolutional layers followed by a sigmoid probability head to analyze the contribution of the context-aware similarity module.
- Dywave-SpecBound replaces cosine similarity with spectral energy as the saliency criterion. Specifically, the saliency score at each timestep is computed as the $\ell_2$ norm of the wavelet detail coefficients in the frequency domain, rather than the cosine dissimilarity between adjacent hierarchical embeddings. This isolates the contribution of semantic-space saliency by substituting a signal-level spectral measure that does not depend on the learned representation.
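The two saliency criteria contrasted in the SpecBound variant can be sketched side by side. This is an illustrative reading, not the authors' implementation: it assumes the spectral score takes the $\ell_2$ norm across decomposition levels at each timestep (which requires equal-length coefficient sequences, as MODWT produces), and the function names and the zero score at the first timestep are hypothetical choices.

```python
import math

def cosine_dissimilarity_saliency(embeddings):
    """Saliency at step t = 1 - cos(e[t-1], e[t]); the first step is set to 0."""
    scores = [0.0]
    for prev, cur in zip(embeddings, embeddings[1:]):
        dot = sum(a * b for a, b in zip(prev, cur))
        norm = math.sqrt(sum(a * a for a in prev)) * math.sqrt(sum(b * b for b in cur))
        scores.append(1.0 - dot / norm if norm > 0 else 0.0)
    return scores

def spectral_energy_saliency(detail_coeffs):
    """Saliency at step t = l2 norm, across levels, of the detail coefficients at t.

    detail_coeffs[j][t] is the level-j coefficient at time t (assumed layout).
    """
    n_steps = len(detail_coeffs[0])
    return [math.sqrt(sum(level[t] ** 2 for level in detail_coeffs))
            for t in range(n_steps)]

# Toy inputs: the embedding direction changes only at the last step.
emb_scores = cosine_dissimilarity_saliency([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
spec_scores = spectral_energy_saliency([[3.0, 0.0], [4.0, 0.0]])
```

The first function depends on the learned embeddings, while the second depends only on the signal's wavelet coefficients, which is the distinction the ablation is designed to isolate.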

### E.2 Computation Efficiency

Table 7: Transformer backbone encoder inference latency on the Raspberry Pi 4 device.

We also evaluate how the compact input tokens generated by Dywave translate to real-world deployment efficiency. Table [7](https://arxiv.org/html/2605.14014#A5.T7) shows the inference latency breakdown on a Raspberry Pi 4 device using input tokens generated by Dywave and the fixed-size PatchTST baseline. Across all datasets, Dywave consistently achieves lower latency due to the substantial reduction in input sequence length. The gap is more significant with long-context inputs in Ego4D. In these scenarios, PatchTST produces hundreds of tokens, leading to rapidly increasing latency with sequence length. In contrast, Dywave adaptively compresses long stationary regions into a small number of semantically coherent input tokens, with up to an order-of-magnitude reduction in token count and a correspondingly lower inference delay of the backbone encoder.
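One reason latency grows quickly with token count in a Transformer encoder is that each self-attention map holds one score per token pair. The arithmetic below is a rough sketch of that single term only. The token counts are hypothetical values chosen to illustrate a 75% reduction, and measured latency also includes components that scale linearly with sequence length, so this is not a prediction of the reported numbers.

```python
def attention_score_entries(num_tokens):
    """Number of pairwise scores in one self-attention map (per head, per layer)."""
    return num_tokens * num_tokens

def relative_attention_cost(baseline_tokens, reduced_tokens):
    """Ratio of attention-map sizes after shortening the token sequence."""
    return attention_score_entries(reduced_tokens) / attention_score_entries(baseline_tokens)

# Hypothetical counts: 400 fixed-size tokens versus 100 dynamic tokens.
ratio = relative_attention_cost(baseline_tokens=400, reduced_tokens=100)
```

With these illustrative counts, a fourfold shorter sequence leaves one sixteenth of the attention scores to compute.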

### E.3 Sensitivity Analysis

![Refer to caption](https://arxiv.org/html/2605.14014v1/x19.png)
Figure 13: Sensitivity analysis on anchor budget.

![Refer to caption](https://arxiv.org/html/2605.14014v1/x20.png)
Figure 14: Sensitivity analysis on reconstruction loss weight $\lambda_{\text{rec}}$.

Figures [13](https://arxiv.org/html/2605.14014#A5.F13) and [14](https://arxiv.org/html/2605.14014#A5.F14) report sensitivity to the maximum anchor budget $\tau$ and the reconstruction loss weight $\lambda_{\text{rec}}$. Performance remains mostly stable across anchor budgets from 64 to 512. Since NMS and the learned saliency landscape together frequently yield far fewer anchors than the maximum, the budget serves as a safe upper bound rather than directly determining token boundaries. Performance is similarly stable across $\lambda_{\text{rec}} \in [0.1, 1.0]$, and token count is non-monotonic in $\lambda_{\text{rec}}$, confirming that reconstruction regularizes representation quality without inflating anchor density. Fixed-patch methods such as PatchTST require extensive grid search over patch size and stride, as these directly determine the tokenization structure. Dywave, in contrast, is stable across all five evaluation datasets without extensive per-domain tuning, whereas PatchTST's performance can fluctuate with the chosen patch size.
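The role of the anchor budget as an upper bound can be illustrated with a greedy one-dimensional NMS over a saliency sequence. This is a sketch under assumptions: the paper's NMS details are not given in this section, so the suppression window `min_gap`, the `threshold` cutoff, and the function name are hypothetical.

```python
def select_anchors(saliency, budget, min_gap, threshold=0.0):
    """Greedy 1D NMS: keep the highest-scoring timesteps, suppress any candidate
    closer than min_gap to a kept anchor, and stop at the budget."""
    order = sorted(range(len(saliency)), key=lambda t: saliency[t], reverse=True)
    kept = []
    for t in order:
        # Scores are visited in descending order, so both conditions end the scan.
        if len(kept) >= budget or saliency[t] <= threshold:
            break
        if all(abs(t - k) >= min_gap for k in kept):
            kept.append(t)
    return sorted(kept)

# Toy saliency landscape with two well-separated peaks.
saliency = [0.0, 0.9, 0.8, 0.0, 0.0, 0.7, 0.0, 0.0]
anchors_64 = select_anchors(saliency, budget=64, min_gap=2)
anchors_512 = select_anchors(saliency, budget=512, min_gap=2)
anchors_1 = select_anchors(saliency, budget=1, min_gap=2)
```

In this toy case a budget of 64 and a budget of 512 return the same two anchors, because suppression and the saliency landscape run out of candidates first. The budget changes the result only when it is smaller than the number of surviving peaks.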

## Appendix F Case Study on Human Activity Recognition

### F.1 Visualization Analysis

![Refer to caption](https://arxiv.org/html/2605.14014v1/x21.png)
Figure 15: Ego4D Boundary Visualization. Signal events are manually annotated with red bounding boxes.

We perform a visualization study to examine how Dywave's learned boundaries align with the semantics of human activity events. Figure [15](https://arxiv.org/html/2605.14014#A6.F15) presents boundary visualizations on long-sequence samples from the Ego4D accelerometer dataset under varying scenarios. The top row shows annotated motion events with red bounding boxes to illustrate the correspondence between the dynamic regions and boundaries generated by Dywave.

Even when the same user performs the same activity (*e.g.*, cleaning), the motion patterns differ substantially across samples due to variations in style, duration, and contextual behavior (row 1). This variability becomes more significant when different users perform the same activity (row 2), with user-dependent rhythms and intensities. Comparing different activities (row 3) further highlights changes in temporal density and dynamic range, reflecting the inherent heterogeneity of real-world motion across the samples.

Under such diverse conditions, tokenization with fixed\-size patching fails to preserve semantic alignment and allocates equal granularity to both active and quiescent intervals\. In comparison, Dywave dynamically adapts segmentation granularity, as high\-motion regions result in fine\-grained tokens while stable regions are compactly represented\. For static activities such as reading, PatchTST produces redundant tokens over long stationary periods, whereas Dywave condenses them into a single event\-level token and focuses on transient motion bursts\.
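The contrast between uniform and event-aligned granularity can be made concrete with token counts. The sketch below assumes a sliding window without end padding for the fixed-size case (implementations that pad the sequence end yield one extra patch) and one token per segment between consecutive boundaries for the dynamic case. The signal length, patch size, and boundary positions are hypothetical.

```python
def fixed_patch_count(length, patch_size, stride):
    """Number of patches produced by a sliding window without padding."""
    if length < patch_size:
        return 0
    return (length - patch_size) // stride + 1

def segments_from_boundaries(length, boundaries):
    """Turn interior boundary indices into [start, end) segments, one token each."""
    edges = [0] + [b for b in sorted(set(boundaries)) if 0 < b < length] + [length]
    return list(zip(edges[:-1], edges[1:]))

# Hypothetical 3000-sample signal with a motion burst between samples 1000 and 1200.
fixed = fixed_patch_count(3000, patch_size=50, stride=50)
dynamic = segments_from_boundaries(3000, [1000, 1050, 1100, 1150, 1200])
```

Here fixed patching spends 60 tokens uniformly, while the boundary-based segmentation uses 6: short segments inside the burst and one long segment for each quiescent stretch.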

These visualizations highlight how Dywave adapts to the inherent temporal heterogeneity of human activity signals, generating event\-aligned representations that remain semantically coherent across users, contexts, and motion patterns\. The boundaries produced by Dywave provide practitioners with a clear structure for analyzing non\-intuitive raw sensing signals and demonstrate efficiency for downstream tasks with minimal overhead\.

### F.2 Human-Centric Micro-Activity Decomposition

![Refer to caption](https://arxiv.org/html/2605.14014v1/x22.png)
(a) Cooking
![Refer to caption](https://arxiv.org/html/2605.14014v1/x23.png)
(b) Household Management

Figure 16: Example of micro-activity decomposition with Dywave on Ego4D.

Real-world human activities are composed of numerous short, fine-grained motions such as *reaching*, *grabbing*, or *walking* that sequentially form higher-level tasks. However, due to the annotation inefficiency and the non-intuitive nature of raw sensor signals, most datasets provide only coarse-grained activity labels (*e.g.*, *cooking*, *cleaning*), omitting these transient micro-activities. This lack of fine-grained labeling limits our ability to understand how humans perform these tasks and how transitions occur between motions. Manual annotation of such micro-activities is time-consuming and error-prone, since many transitions occur in sub-second intervals and lack clear visual or temporal boundaries in the sensor space. Consequently, a deeper, structure-aware understanding of human behavior through sensing signals has remained challenging. This section conducts a qualitative case study of Dywave's capability in mitigating this challenge on the Ego4D dataset (Grauman et al., [2022](https://arxiv.org/html/2605.14014#bib.bib20)), which provides synchronized egocentric video and IMU signals during daily activities such as *cooking*, *crafting*, and *household management*. We extract 15-second continuous IMU segments and apply Dywave to the accelerometer signals to automatically generate event-aligned boundaries without any fine-grained supervision.

Micro-Activity Identification: Figure [16](https://arxiv.org/html/2605.14014#A6.F16) shows two examples from *cooking* and *household management* tasks. For each detected segment, we display the corresponding video frame and manually annotate the micro-activity being performed. Dywave's dynamic boundaries align closely with perceptible changes in human motion, such as *reaching for the cabinet door*, *grabbing a bottle*, or *starting and ending a walk*. These short, contextually meaningful actions are not labeled in the original dataset but are automatically reflected by Dywave's boundary detection. Each resulting input token thus provides an intuitive and temporally localized representation of a micro-activity.

Human-centric Micro-Activity Decomposition: Dywave introduces a human-centric capability for understanding and interacting with human behavior at a more granular level. By transforming coarse, high-level signals into semantically coherent micro-activity units, it enables systems to learn the internal temporal structure of human actions without requiring expensive manual annotation. This decomposition can serve as an automatic proxy for fine-grained labeling, facilitate the construction of hierarchical activity taxonomies, and generate rich behavioral logs for long-term studies of human routines. In essence, Dywave shifts the focus of sensing intelligence from merely classifying what activity a person is doing to uncovering how it unfolds, through fine-grained sub-event detection and reasoning over complex temporal dynamics. Without such adaptive segmentation, these insights would remain hidden in continuous, unstructured IMU signals, underscoring the role of Dywave as a bridge between raw sensor data and human-understandable behavior representations.

## Appendix G Discussion

Dywave demonstrates that adaptive, physics\-informed tokenization can effectively bridge raw time\-series sensing signals with semantically meaningful input representations\. However, we recognize limitations of the current design and outline potential directions for future work\.

Computational Overhead: While Dywave substantially improves encoder efficiency and performance at inference time by reducing redundant input length, it introduces additional computational overhead during preprocessing compared to fixed-size tokenization. Specifically, the wavelet decomposition and hierarchical embedding modules require extra computation before the input tokens are passed to the backbone encoder. Although this cost is amortized by the reduction in input length for the heavy backbone encoder, the preprocessing cost may still be significant for ultra-low-power or edge deployments. Future work could explore alternative techniques that maintain adaptivity with reduced overhead.

Injecting Temporal Information: While Dywave does not explicitly encode real-world elapsed time, its physics-informed hierarchical embeddings already capture relative temporal dynamics across multiple scales. Each token represents an event whose duration is implicitly reflected in the local continuity and wavelet coefficients. Thus, temporal information is indirectly preserved in the learned embeddings. Nonetheless, explicit modeling of absolute or physical time remains an open research direction. Future extensions could explore joint representations that align the model's internal time with human-understandable physical timescales, potentially improving synchronization and interpretability in real-world deployments.

Dependency on Wavelet Basis: The discrete wavelet transform provides Dywave with a physics-informed, structured view of the signals. However, the choice of wavelet basis (*e.g.*, Daubechies, Symlets, Coiflets (Lina & Mayrand, [1995](https://arxiv.org/html/2605.14014#bib.bib35); Graps, [1995](https://arxiv.org/html/2605.14014#bib.bib19))) could yield different time-frequency trade-offs and potentially influence the quality of anchor selection across sensing datasets and modalities. Although our experiments using the Daubechies 4 (db4) wavelet basis show general robustness, we acknowledge that certain applications may require task-specific basis selection to better capture characteristic patterns. An interesting direction for future research is to explore learnable modules for the wavelet transform that can adapt to the signals effectively and efficiently.
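To make the notion of approximation and detail coefficients concrete, the sketch below implements a multi-level decomposition with the Haar wavelet, the simplest member of the Daubechies family. This is a deliberate substitution for illustration only: the experiments use db4, and the ablation text refers to MODWT (an undecimated transform), whereas this sketch is a decimated transform with a shorter filter. The function names are hypothetical.

```python
import math

def haar_dwt_level(signal):
    """One level of the orthonormal Haar DWT on an even-length sequence."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_wavedec(signal, levels):
    """Multi-level decomposition: (final approximation, [detail_1, ..., detail_levels])."""
    approx, details = list(signal), []
    for _ in range(levels):
        if len(approx) < 2 or len(approx) % 2:
            raise ValueError("length must be divisible by 2**levels")
        approx, detail = haar_dwt_level(approx)
        details.append(detail)
    return approx, details

# A flat segment followed by a step to a new flat level.
flat_then_step = [1.0] * 4 + [5.0] * 4
approx, details = haar_wavedec(flat_then_step, levels=3)
```

In this toy signal the fine-scale detail coefficients are all zero and only the coarsest level responds to the step, while the total energy of the coefficients equals that of the input. A longer filter such as db4 would spread the same step over more coefficients, which is one form of the time-frequency trade-off discussed above.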

Multimodal Interaction: Although we have demonstrated Dywave's robustness when applied to multimodal data, it does not explicitly model cross-modal interactions. Dywave primarily focuses on unimodal signals, where adaptive tokenization is guided by temporal dynamics within each individual modality. Extending to multimodal sensing introduces additional challenges such as temporal synchronization and modality-specific phase alignment, which can serve either as informative cues or redundant patterns that may be selectively pruned. Extending Dywave to explicitly capture these multimodal dependencies is a potential direction for future work.

Extension to Additional Tasks: Dywave is designed for event-aligned representation learning in heterogeneous IoT sensing, where the primary task is classification. In its current form, Dywave is therefore a tokenization framework for IoT sensing classification rather than a universal time-series method. Extending dynamic tokenization to time-series *forecasting* and *anomaly detection* is an interesting future direction. Additionally, the current framework requires task-labeled data to supervise the saliency landscape through the downstream objective. An important future direction is *self-supervised* training, where tokenization boundaries are learned jointly with the self-supervised objective without relying on task-specific annotations. This would enable domain-agnostic, label-efficient tokenization that adapts to diverse sensing modalities without per-task fine-tuning.
