Tag
Proposes MotionVLA, a vision-language-action model for humanoid motion generation whose dual-stream frequency tokenizer encodes pose and physical dynamics separately, improving the diversity and consistency of generated motion.