MotionVLA: Vision-Language-Action Model for Humanoid Motion

Hugging Face Daily Papers 06/13/26, 12:00 AM Papers

Summary

Proposes MotionVLA, a vision-language-action model for humanoid motion generation using a dual-stream frequency tokenizer that separately encodes pose and physical dynamics, achieving better diversity and consistency.

Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. Therefore, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
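The energy-compaction mismatch described above can be illustrated with a small pure-Python sketch. It is not the authors' analysis: the trajectory is synthetic (a slow sway plus a small fast jitter), velocity is taken as a simple finite difference, and the helper names `dct2` and `energy_fraction` are hypothetical. It only shows why a handful of low-order DCT coefficients may describe positions far better than velocities; the 93% and 37% figures come from the paper's own data and are not reproduced here.

```python
import math

def dct2(signal):
    """Orthonormal DCT-II of a 1-D sequence (pure Python, O(N^2))."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs

def energy_fraction(signal, keep):
    """Share of total energy held by the first `keep` DCT coefficients."""
    coeffs = dct2(signal)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:keep]) / total if total > 0 else 0.0

# Synthetic single-joint trajectory (an assumption, NOT real motion data):
# one slow cycle plus a low-amplitude 12-cycle jitter.
frames = 64
position = [math.sin(2 * math.pi * t / frames)
            + 0.05 * math.sin(2 * math.pi * 12 * t / frames)
            for t in range(frames)]

# Finite-difference velocity; differencing amplifies the fast component.
velocity = [position[t + 1] - position[t] for t in range(frames - 1)]

pos_frac = energy_fraction(position, keep=5)
vel_frac = energy_fraction(velocity, keep=5)
print(f"position energy in 5 coefficients: {pos_frac:.3f}")
print(f"velocity energy in 5 coefficients: {vel_frac:.3f}")
```

On this toy signal the position fraction comes out higher than the velocity fraction, because differentiation boosts the high-frequency jitter relative to the slow sway. That is the same qualitative effect the abstract reports on real motion data.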

Original Article

Cached at: 06/17/26, 03:35 AM

Source: https://huggingface.co/papers/2606.15142

Abstract

A dual-stream frequency tokenizer and autoregressive model are proposed to improve humanoid motion generation by separately encoding pose and physical dynamics, achieving better diversity and consistency compared to single-codebook approaches.

Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. Therefore, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
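As a rough illustration of the dual-stream idea, the sketch below tokenizes a pose stream and a dynamics stream separately and then lays them out as one sequence with Base tokens first and Phys tokens after. It is a sketch under stated assumptions, not the DSFT implementation: treating velocity as the physical stream, the uniform rounding quantizer, the coefficient counts, the scales, and the names `quantize`, `to_tokens`, `BASE_OFFSET`, and `PHYS_OFFSET` are all hypothetical, and the BPE compression step mentioned in the abstract is omitted.

```python
import math

def dct2(signal):
    """Orthonormal DCT-II of a 1-D sequence (pure Python, O(N^2))."""
    n = len(signal)
    out = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        out.append(total * (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)))
    return out

def quantize(coeffs, keep, scale):
    """DCT truncation: keep the first `keep` coefficients, round to integer bins."""
    return [round(c * scale) for c in coeffs[:keep]]

def to_tokens(bins, offset, half_range):
    """Clamp signed bins and shift them into a non-negative id range."""
    return [offset + max(-half_range, min(half_range, b)) + half_range
            for b in bins]

HALF_RANGE = 127                   # hypothetical bin range per stream
BASE_OFFSET = 0                    # Base ids: 0 .. 254
PHYS_OFFSET = 2 * HALF_RANGE + 1   # Phys ids: 255 .. 509, disjoint from Base

# Synthetic single-joint trajectory (an assumption, not real motion data).
frames = 64
position = [math.sin(2 * math.pi * t / frames)
            + 0.05 * math.sin(2 * math.pi * 12 * t / frames)
            for t in range(frames)]
velocity = [position[t + 1] - position[t] for t in range(frames - 1)]

# Each stream gets its own truncation length and scale (hypothetical values).
base_tokens = to_tokens(quantize(dct2(position), keep=5, scale=10.0),
                        BASE_OFFSET, HALF_RANGE)
phys_tokens = to_tokens(quantize(dct2(velocity), keep=32, scale=100.0),
                        PHYS_OFFSET, HALF_RANGE)

# Unified sequence: Base tokens first, Phys tokens after them.
sequence = base_tokens + phys_tokens
print(len(base_tokens), len(phys_tokens), len(sequence))
```

Keeping the two id ranges disjoint lets a single autoregressive vocabulary distinguish the streams, and placing Phys tokens after Base tokens means each Phys prediction can condition on the full coarse pose description. Whether the paper uses these exact mechanics is not stated in the abstract.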

Get this paper in your agent:

hf papers read 2606.15142

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
