MotionVLA: Vision-Language-Action Model for Humanoid Motion

Hugging Face Daily Papers

Summary

Proposes MotionVLA, a vision-language-action model for humanoid motion generation using a dual-stream frequency tokenizer that separately encodes pose and physical dynamics, achieving better diversity and consistency.

Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. Therefore, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
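The frequency-domain observation above (a few DCT coefficients cover most position energy but much less velocity energy) can be illustrated with a toy signal. The following is a minimal sketch, not the authors' analysis: the synthetic trajectory, the finite-difference velocity, and the helper names `dct2` and `energy_fraction` are all assumptions made for illustration, and the resulting fractions will not match the 93% and 37% reported for real motion data.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a 1-D sequence (plain Python, O(N^2))."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def energy_fraction(x, k):
    """Share of total squared DCT energy held by the first k coefficients."""
    c = dct2(x)
    total = sum(v * v for v in c)
    return sum(v * v for v in c[:k]) / total if total > 0 else 0.0

# Hypothetical 1-D "joint position": a slow sway plus a small, fast jitter.
n = 64
position = [
    math.sin(2 * math.pi * t / n) + 0.05 * math.sin(2 * math.pi * 12 * t / n)
    for t in range(n)
]
# Velocity as a finite difference; differencing amplifies the fast component.
velocity = [position[t + 1] - position[t] for t in range(n - 1)]

pos_frac = energy_fraction(position, 5)
vel_frac = energy_fraction(velocity, 5)
print(f"first 5 DCT coefficients: position {pos_frac:.2f}, velocity {vel_frac:.2f}")
```

Because differentiation scales each frequency component by roughly its frequency, the jitter that is negligible in position carries a much larger share of velocity energy, so the same five coefficients cover less of the velocity signal.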

Cached at: 06/17/26, 03:35 AM


Source: https://huggingface.co/papers/2606.15142

Abstract

A dual-stream frequency tokenizer and autoregressive model are proposed to improve humanoid motion generation by separately encoding pose and physical dynamics, achieving better diversity and consistency compared to single-codebook approaches.

Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. Therefore, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
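The dual-stream idea described in the abstract can be sketched roughly as follows. This is an illustrative assumption, not the DSFT implementation: the function names, the uniform rounding step, the token id ranges, and the per-stream coefficient counts are hypothetical, velocity is assumed to stand in for the physical stream, and the BPE compression stage is omitted entirely. The sketch only shows each stream being truncated independently in the DCT domain and the resulting tokens being laid out with Base before Phys.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a 1-D sequence (plain Python, O(N^2))."""
    n = len(x)
    return [
        (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
        * sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n)
    ]

def truncate_and_quantize(signal, keep, step):
    """Keep the first `keep` DCT coefficients and round each to an integer level."""
    return [round(c / step) for c in dct2(signal)[:keep]]

BASE_OFFSET = 0       # hypothetical id range for Base tokens: 0..9999
PHYS_OFFSET = 10_000  # hypothetical, disjoint id range for Phys tokens

def to_ids(levels, offset, half_range=5_000):
    """Map signed quantization levels to non-negative ids in one stream's range."""
    return [offset + max(-half_range, min(half_range - 1, q)) + half_range
            for q in levels]

def tokenize(position, keep_base=5, keep_phys=16, step=0.01):
    """Tokenize each stream independently, then place Base before Phys."""
    velocity = [b - a for a, b in zip(position, position[1:])]
    base = to_ids(truncate_and_quantize(position, keep_base, step), BASE_OFFSET)
    phys = to_ids(truncate_and_quantize(velocity, keep_phys, step), PHYS_OFFSET)
    return base + phys

# Hypothetical single-joint trajectory of 32 frames.
position = [math.sin(2 * math.pi * t / 32) for t in range(32)]
sequence = tokenize(position)
print(len(sequence), sequence[:5])
```

Keeping more coefficients for the velocity stream than for the position stream reflects the abstract's point that velocity energy is spread over higher frequencies; the specific counts used here are arbitrary.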

Similar Articles

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

Hugging Face Daily Papers

TBD-VLA introduces a discrete vision-language-action framework that combines block diffusion with autoregressive generation to achieve efficient temporal action modeling and faster inference, significantly outperforming prior VLA approaches in simulation and real-world manipulation tasks.

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Hugging Face Daily Papers

LabVLA is a vision-language-action model for scientific laboratory automation, trained with a two-stage approach combining action token pretraining and flow matching. It achieves state-of-the-art success rates on the LabUtopia benchmark by leveraging simulated data to bridge the gap between household demonstrations and lab-specific tasks.

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

Hugging Face Daily Papers

IntentVLA is a history-conditioned visual-language-action framework that improves robot imitation learning stability by encoding short-horizon intents from visual observations, addressing challenges from partial observability and ambiguous observations. It also introduces AliasBench, an ambiguity-aware benchmark for evaluating such methods.