HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Hugging Face Daily Papers

Summary

HiVLA introduces a hierarchical vision-language-action framework that decouples semantic planning from motor control using a diffusion transformer action expert for improved robotic manipulation. The system combines a VLM planner for task decomposition and visual grounding with a specialized DiT action expert using cascaded cross-attention, outperforming end-to-end baselines particularly in long-horizon tasks and fine-grained manipulation.
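One way to picture the planner's output: the abstract describes each structured plan as a subtask instruction paired with a target bounding box, and the action expert as consuming high-resolution object-centric crops. The sketch below is illustrative only. The `SubtaskPlan` class, the `crop_to_bbox` helper, the `(x_min, y_min, x_max, y_max)` pixel convention, and the example instruction are all assumptions made for this example, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = List[List[int]]          # hypothetical grayscale image: rows of pixel values
BBox = Tuple[int, int, int, int]  # assumed convention: (x_min, y_min, x_max, y_max)


@dataclass(frozen=True)
class SubtaskPlan:
    """Hypothetical container for one step emitted by the VLM planner."""
    instruction: str  # subtask instruction in natural language
    bbox: BBox        # target bounding box in pixel coordinates


def crop_to_bbox(image: Image, bbox: BBox) -> Image:
    """Return the object-centric crop, clamping the box to the image bounds."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = bbox
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [row[x0:x1] for row in image[y0:y1]]


# Toy 4x6 image whose pixel value encodes its position (10 * row + column).
image = [[10 * r + c for c in range(6)] for r in range(4)]
plan = SubtaskPlan(instruction="pick up the small screw", bbox=(2, 1, 5, 3))
crop = crop_to_bbox(image, plan.bbox)
print(plan.instruction, crop)
```

Keeping the plan as an explicit record is what lets the two levels stay decoupled in this sketch: anything that can produce an instruction and a box could feed the low-level controller.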

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in the low-level part, equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
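The cascaded cross-attention idea can be sketched as a chain of cross-attention stages, where action tokens attend first to global context, then to object-centric crop features, then to skill semantics, in the order the abstract lists them. This is a minimal sketch under strong assumptions: single-head attention, identity projections instead of learned weights, a plain residual connection, and no normalization. The function names are hypothetical, and the paper's actual block design is not specified in this summary.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], context: List[Vector]) -> List[Vector]:
    """Single-head cross-attention with identity projections plus a residual.

    Assumption: a real DiT block would use learned Q/K/V projections,
    multiple heads, and normalization; none of that is modeled here.
    """
    dim = len(queries[0])
    output = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in context]
        weights = softmax(scores)
        attended = [sum(w * k[d] for w, k in zip(weights, context))
                    for d in range(dim)]
        output.append([qi + ai for qi, ai in zip(q, attended)])
    return output


def cascaded_cross_attention(action_tokens: List[Vector],
                             global_ctx: List[Vector],
                             crop_ctx: List[Vector],
                             skill_ctx: List[Vector]) -> List[Vector]:
    """Fuse the three context streams sequentially, one stage after another."""
    x = action_tokens
    for context in (global_ctx, crop_ctx, skill_ctx):
        x = cross_attend(x, context)
    return x


# With one token per stream, every attention weight is 1, so each stage
# simply adds that stream's token to every action token.
fused = cascaded_cross_attention(
    action_tokens=[[0.0, 0.0], [1.0, 1.0]],
    global_ctx=[[1.0, 0.0]],
    crop_ctx=[[0.0, 1.0]],
    skill_ctx=[[2.0, 2.0]],
)
print(fused)
```

The point of the cascade, as opposed to concatenating all context into one sequence, is that each stage conditions on the result of the previous one, so later streams are read in light of earlier ones.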

Cached at: 04/20/26, 08:28 AM

Paper page - HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Source: https://huggingface.co/papers/2604.14125

Abstract

HiVLA presents a hierarchical vision-language-action framework that decouples semantic planning from motor control using a diffusion transformer action expert with cascaded cross-attention for improved robotic manipulation.

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in the low-level part equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM’s zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and fine-grained manipulation of small objects in cluttered scenes.
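The abstract names a flow-matching action expert but does not describe how actions are sampled. A common way to draw a sample from a flow-matching model is to integrate a learned velocity field from noise to data with a few Euler steps, sketched below. Everything here is an assumption for illustration: the `euler_sample` and `make_toy_velocity` names, the convention that noise sits at t = 0 and data at t = 1, the step count, and the toy closed-form velocity that stands in for the trained DiT.

```python
import random
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_sample(velocity: VelocityFn, noise: Vector, num_steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> VelocityFn:
    """Stand-in for a trained network (assumption, not the paper's model).

    For a straight-line path x_t = (1 - t) * noise + t * target, the velocity
    that carries any point x at time t to the target is (target - x) / (1 - t).
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


rng = random.Random(0)
target_action = [0.2, -0.5, 0.1]                   # hypothetical 3-D action
noise = [rng.gauss(0.0, 1.0) for _ in target_action]
action = euler_sample(make_toy_velocity(target_action), noise, num_steps=10)
print(action)
```

In a real system the velocity function would be the conditioned network, and the number of integration steps trades sampling speed against accuracy.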

View arXiv page (https://arxiv.org/abs/2604.14125) View PDF (https://arxiv.org/pdf/2604.14125) Project page (https://tianshuoy.github.io/HiVLA-page/)

Get this paper in your agent:

hf papers read 2604.14125

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

Hugging Face Daily Papers

IntentVLA is a history-conditioned visual-language-action framework that improves robot imitation learning stability by encoding short-horizon intents from visual observations, addressing challenges from partial observability and ambiguous observations. It also introduces AliasBench, an ambiguity-aware benchmark for evaluating such methods.

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Hugging Face Daily Papers

LabVLA is a vision-language-action model for scientific laboratory automation, trained with a two-stage approach combining action token pretraining and flow matching. It achieves state-of-the-art success rates on the LabUtopia benchmark by leveraging simulated data to bridge the gap between household demonstrations and lab-specific tasks.