Tag
MotionVLA is a vision-language-action model for humanoid motion generation. It uses a dual-stream frequency tokenizer that encodes pose and physical dynamics separately, which improves the diversity and consistency of the generated motion.
ARM is a unified autoregressive framework for image understanding, generation, and editing. It combines discrete semantic tokenization with reinforcement learning optimization and shows synergy across the three tasks.