VCR: Learning Valid Contextual Representation for Incomplete Wearable Signals

arXiv cs.LG 05/20/26, 04:00 AM Papers

Summary

VCR is a self-supervised framework that learns robust representations from incomplete wearable signals using orthogonal tokenization and missing-aware mixture-of-experts, improving performance under modality missingness.

arXiv:2605.18837v1 Announce Type: new

Abstract

Wearable devices enable continuous health monitoring from multimodal signals, but real-world deployment is hindered by limited labeled data and pervasive sensor incompleteness. While large-scale self-supervised pretraining reduces label dependence, most existing methods assume full modality availability. Current approaches for handling modality missingness often reconstruct entire absent signals, which can encourage hallucinating modality-specific details that are not inferable from the observed sensor signals and degrade robustness.

We propose VCR, a self-supervised framework that learns to extract valid representations robust to modality missingness. VCR employs an orthogonal tokenizer to enforce strict orthogonal disentanglement by rectifying latent manifolds and applying a geometric projection, separating each modality into shared semantics and modality-specific residuals. This design preserves complete information integrity while serving as a structural foundation for robust learning under modality missingness. The resulting tokens are processed by a missing-aware mixture-of-experts backbone that adapts to varying patterns of modality availability. By constraining the objective to reconstruct only the shared components of missing modalities, VCR effectively mitigates hallucinations of non-inferable modality-specific details.

Across multiple health monitoring tasks, VCR consistently improves performance and robustness under full, single-missing, and multiple-missing modality settings compared with strong supervised and self-supervised baselines.
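The abstract does not give the tokenizer or the loss in detail, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a fixed linear shared subspace (the hypothetical `shared_basis`) and uses a plain orthogonal projection to split each modality embedding into a shared part and a residual; the hypothetical `shared_only_loss` then scores reconstruction on the shared part of missing modalities only. All function names, shapes, and the mean-squared-error choice are assumptions made for illustration.

```python
import numpy as np


def orthogonal_split(z, shared_basis):
    """Split embeddings into a shared part and an orthogonal residual.

    z:            (m, d) array, one embedding per modality.
    shared_basis: (d, k) array whose columns span an assumed shared subspace.
    """
    # Orthonormalize the basis so the projector onto its span is Q Q^T.
    q, _ = np.linalg.qr(shared_basis)
    shared = z @ q @ q.T      # component inside the shared subspace
    residual = z - shared     # modality-specific remainder, orthogonal to it
    return shared, residual


def shared_only_loss(pred_shared, target_shared, missing_mask):
    """Mean squared error on shared components, over missing modalities only.

    missing_mask: (m,) array with 1.0 where a modality is missing, else 0.0.
    """
    per_modality = ((pred_shared - target_shared) ** 2).mean(axis=-1)
    denom = max(float(missing_mask.sum()), 1.0)  # avoid dividing by zero
    return float((per_modality * missing_mask).sum() / denom)


rng = np.random.default_rng(0)
d, k, m = 8, 3, 4                      # embedding dim, shared dim, modalities
basis = rng.normal(size=(d, k))        # stand-in for a learned subspace
z = rng.normal(size=(m, d))            # stand-in modality embeddings
shared, residual = orthogonal_split(z, basis)

mask = np.array([0.0, 1.0, 0.0, 1.0])  # modalities 1 and 3 treated as missing
loss = shared_only_loss(shared + 0.1, shared, mask)
```

Because the two parts sum back to the original embedding, nothing is discarded by the split, which mirrors the "information integrity" property the abstract describes. Restricting the loss to the shared part means the model is never asked to predict a missing modality's residual.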

Similar Articles

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Weakly Supervised Concept Learning for Object-centric Visual Reasoning

CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

Vokenization: Multimodel Learning for Vision and Language

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework
