HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
Summary
HiVLA introduces a hierarchical vision-language-action framework that decouples semantic planning from motor control. A VLM planner handles task decomposition and visual grounding, while a specialized diffusion transformer (DiT) action expert with cascaded cross-attention handles execution. The system outperforms end-to-end baselines, particularly on long-horizon tasks and fine-grained manipulation.
Cached at: 04/20/26, 08:28 AM
Paper page - HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
Source: https://huggingface.co/papers/2604.14125
Abstract
HiVLA presents a hierarchical vision-language-action framework that decouples semantic planning from motor control using a diffusion transformer action expert with cascaded cross-attention for improved robotic manipulation.
While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in the low-level part equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM’s zero-shot reasoning while allowing independent improvement of both components.
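To make the planner output concrete, here is a minimal sketch of a structured plan (one subtask instruction plus one target bounding box) and of how an object-centric crop could be cut from it. The names `SubtaskPlan` and `crop_to_bbox`, the `(x_min, y_min, x_max, y_max)` pixel convention, and the example instruction are all hypothetical; the abstract does not specify the plan format or the cropping procedure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SubtaskPlan:
    """Hypothetical container for one planner step (not the paper's format)."""
    instruction: str
    # Assumed convention: (x_min, y_min, x_max, y_max) in pixels, max-exclusive.
    bbox: Tuple[int, int, int, int]


def crop_to_bbox(image: List[List[int]], bbox: Tuple[int, int, int, int]) -> List[List[int]]:
    """Return the image region inside bbox, clamped to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    x0, y0, x1, y1 = bbox
    # Clamp each coordinate so an oversized box still yields a valid crop.
    x0, x1 = max(0, min(x0, width)), max(0, min(x1, width))
    y0, y1 = max(0, min(y0, height)), max(0, min(y1, height))
    if x1 <= x0 or y1 <= y0:
        raise ValueError("bounding box does not overlap the image")
    return [row[x0:x1] for row in image[y0:y1]]


# Toy 4x6 single-channel "image" whose pixel value encodes (row, column).
image = [[10 * r + c for c in range(6)] for r in range(4)]
plan = SubtaskPlan(instruction="pick up the red block", bbox=(2, 1, 5, 3))
crop = crop_to_bbox(image, plan.bbox)
print(plan.instruction, crop)  # crop is [[12, 13, 14], [22, 23, 24]]
```

In a real system the crop would presumably be taken from the full-resolution camera frame and resized before encoding; those steps are omitted here.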
Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and fine-grained manipulation of small objects in cluttered scenes.
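The cascaded cross-attention idea from the abstract, in which action tokens attend to global context, then to object-centric crop features, then to skill semantics, can be illustrated with a toy sketch. This is only an assumption-laden illustration: it uses a single head, no learned projections, and a plain residual add, and the function names are hypothetical. The paper's actual layer design is not described in this page.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], context: List[Vector]) -> List[Vector]:
    """Scaled dot-product cross-attention with a residual connection.

    Context tokens act as both keys and values (no learned projections).
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended = [sum(w * k[i] for w, k in zip(weights, context)) for i in range(dim)]
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


def cascaded_cross_attention(
    action_tokens: List[Vector],
    global_ctx: List[Vector],
    crop_ctx: List[Vector],
    skill_ctx: List[Vector],
) -> List[Vector]:
    """Fuse the three conditioning sources one after another, in the stated order."""
    x = cross_attend(action_tokens, global_ctx)  # stage 1: global scene context
    x = cross_attend(x, crop_ctx)                # stage 2: object-centric crop features
    x = cross_attend(x, skill_ctx)               # stage 3: skill (instruction) semantics
    return x


# With one token per source, each stage simply adds that token to the query.
fused = cascaded_cross_attention(
    action_tokens=[[1.0, 0.0]],
    global_ctx=[[1.0, 1.0]],
    crop_ctx=[[2.0, 0.0]],
    skill_ctx=[[0.0, 3.0]],
)
print(fused)  # [[4.0, 4.0]]
```

Because each stage queries with the output of the previous one, later sources are read through a representation already conditioned on earlier ones, which is the sense of "sequentially fuses" illustrated here.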
arXiv page: https://arxiv.org/abs/2604.14125
PDF: https://arxiv.org/pdf/2604.14125
Project page: https://tianshuoy.github.io/HiVLA-page/
Similar Articles
AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding
AffordanceVLA introduces a unified framework using structured affordance forecasting as an intermediate representation to improve perception-action mapping in robotic manipulation, leveraging vision-language models and a Mixture-of-Transformer architecture.
Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack
HyVLA-0.5 is an end-to-end robotic learning system that integrates data collection, model design, pre-training, fine-tuning, and reinforcement learning for real-world deployment.
IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
IntentVLA is a history-conditioned visual-language-action framework that improves robot imitation learning stability by encoding short-horizon intents from visual observations, addressing challenges from partial observability and ambiguous observations. It also introduces AliasBench, an ambiguity-aware benchmark for evaluating such methods.
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
LabVLA is a vision-language-action model for scientific laboratory automation, trained with a two-stage approach combining action token pretraining and flow matching. It achieves state-of-the-art success rates on the LabUtopia benchmark by leveraging simulated data to bridge the gap between household demonstrations and lab-specific tasks.