This paper introduces FFDC, a lightweight verifier for World Action Models. FFDC checks whether the model's predicted observations are consistent with the observations actually received, which enables adaptive action chunk sizes and improves efficiency and robustness in robotic manipulation.
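
The idea of verifying predicted against actual observations can be illustrated with a toy sketch. This is not FFDC's implementation: the function `execute_chunk_with_verifier`, the `ToyEnv` class, the L2 distance check, and the fixed `threshold` are all assumptions made for illustration. The sketch executes a planned action chunk step by step and stops as soon as the actual observation diverges from the predicted one, so the effective chunk size adapts to how well the prediction holds.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length observation vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def execute_chunk_with_verifier(
    actions: List[float],
    predicted_obs: List[Vector],
    step_env: Callable[[float], Vector],
    threshold: float,
) -> int:
    """Run actions in order and stop once prediction and reality diverge.

    Returns the number of actions actually executed, i.e. the effective
    chunk size. The caller would replan from the current state afterwards.
    """
    executed = 0
    for action, predicted in zip(actions, predicted_obs):
        actual = step_env(action)
        executed += 1
        # Hypothetical consistency check: a plain distance threshold.
        if l2_distance(predicted, actual) > threshold:
            break
    return executed


class ToyEnv:
    """1-D toy environment (hypothetical) with an optional disturbance."""

    def __init__(self, disturb_at=None):
        self.position = 0.0
        self.steps = 0
        self.disturb_at = disturb_at

    def step(self, action: float) -> List[float]:
        self.steps += 1
        self.position += action
        if self.steps == self.disturb_at:
            self.position += 0.5  # unmodelled push the predictor did not foresee
        return [self.position]


if __name__ == "__main__":
    actions = [1.0, 1.0, 1.0, 1.0]
    predicted = [[1.0], [2.0], [3.0], [4.0]]
    print(execute_chunk_with_verifier(actions, predicted, ToyEnv().step, 0.2))
    print(execute_chunk_with_verifier(actions, predicted, ToyEnv(disturb_at=3).step, 0.2))
```

In the undisturbed case the whole chunk runs; with a disturbance on the third step, execution stops there and control would return to the planner.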
HiVLA introduces a hierarchical vision-language-action framework that decouples semantic planning from motor control for improved robotic manipulation. The system combines a VLM planner, which handles task decomposition and visual grounding, with a specialized diffusion transformer (DiT) action expert that uses cascaded cross-attention. It outperforms end-to-end baselines, particularly in long-horizon tasks and fine-grained manipulation.