AnyMo is a unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked-modeling transformer. Together with the accompanying OmniHuMo dataset of over 5,000 hours of motion data, it enables high-quality synthesis under arbitrary modality combinations.
AudioMosaic introduces a contrastive-learning-based audio encoder that applies structured time-frequency masking to spectrogram patches, which makes large-batch training efficient. It achieves state-of-the-art performance on audio benchmarks and improves audio-language models.
CSI-JEPA is a self-supervised framework for learning reusable representations from unlabeled Wi-Fi channel state information, enabling label-efficient multi-task sensing. It achieves up to 98% label savings and outperforms supervised models.