MotionVLA: Vision-Language-Action Model for Humanoid Motion
Summary
Proposes MotionVLA, a vision-language-action model for humanoid motion generation. It uses a dual-stream frequency tokenizer that encodes pose and physical dynamics separately, improving motion diversity and motion-condition consistency over single-codebook approaches.
Cached at: 06/17/26, 03:35 AM
Source: https://huggingface.co/papers/2606.15142
Abstract
A dual-stream frequency tokenizer and autoregressive model are proposed to improve humanoid motion generation by separately encoding pose and physical dynamics, achieving better diversity and consistency compared to single-codebook approaches.
Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. Therefore, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
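The energy mismatch described in the abstract can be illustrated with a small sketch. It is not the authors' analysis code: the trajectory is synthetic, the helper names `dct2` and `low_freq_energy_fraction` are hypothetical, and the orthonormal DCT-II convention and finite-difference velocity are assumptions. It only shows why keeping the first five DCT coefficients tends to retain far more position energy than velocity energy, and it does not reproduce the 93% and 37% figures.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a 1D sequence (pure Python, O(n^2))."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def low_freq_energy_fraction(signal, keep=5):
    """Fraction of total energy held by the first `keep` DCT coefficients."""
    coeffs = dct2(signal)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:keep]) / total

# Synthetic one-joint trajectory (an assumption, not real motion data):
# a slow pose component plus a small, fast "dynamics" component.
N = 64
position = [
    math.cos(math.pi * (t + 0.5) * 1 / N) + 0.05 * math.cos(math.pi * (t + 0.5) * 20 / N)
    for t in range(N)
]

# Velocity as a first-order finite difference; differencing scales each
# component roughly by its frequency, so the fast component gains weight.
velocity = [position[t + 1] - position[t] for t in range(N - 1)]

pos_frac = low_freq_energy_fraction(position, keep=5)
vel_frac = low_freq_energy_fraction(velocity, keep=5)
print(f"position energy in first 5 DCT coefficients: {pos_frac:.3f}")
print(f"velocity energy in first 5 DCT coefficients: {vel_frac:.3f}")
```

Because differentiation amplifies high frequencies, a truncation or codebook tuned to position statistics discards a much larger share of the velocity signal, which is the motivation the abstract gives for compressing the two streams independently.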
Similar Articles
TBD-VLA: Temporal Block Diffusion Vision Language Action Model
TBD-VLA introduces a discrete vision-language-action framework that combines block diffusion with autoregressive generation to achieve efficient temporal action modeling and faster inference, significantly outperforming prior VLA approaches in simulation and real-world manipulation tasks.
AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding
AffordanceVLA introduces a unified framework using structured affordance forecasting as an intermediate representation to improve perception-action mapping in robotic manipulation, leveraging vision-language models and a Mixture-of-Transformer architecture.
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
LabVLA is a vision-language-action model for scientific laboratory automation, trained with a two-stage approach combining action token pretraining and flow matching. It achieves state-of-the-art success rates on the LabUtopia benchmark by leveraging simulated data to bridge the gap between household demonstrations and lab-specific tasks.
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
Proposes AR-VLA, an autoregressive action expert that generates continuous action sequences with long-term memory for context-aware robotic policy training, improving trajectory smoothness and task success rates over reactive VLA models.
IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
IntentVLA is a history-conditioned visual-language-action framework that improves robot imitation learning stability by encoding short-horizon intents from visual observations, addressing challenges from partial observability and ambiguous observations. It also introduces AliasBench, an ambiguity-aware benchmark for evaluating such methods.