HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Hugging Face Daily Papers

Summary

HiVLA introduces a hierarchical vision-language-action framework for robotic manipulation that decouples semantic planning from motor control. The system pairs a VLM planner for task decomposition and visual grounding with a specialized diffusion transformer (DiT) action expert that fuses its inputs through cascaded cross-attention, outperforming end-to-end baselines, particularly on long-horizon tasks and fine-grained manipulation.

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce in the low-level part a flow-matching Diffusion Transformer (DiT) action expert equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
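The abstract's description of the action expert suggests the following minimal sketch of how cascaded cross-attention could fuse the three conditioning streams inside a flow-matching DiT block. This is an illustration under assumptions, not the authors' implementation: the module names, dimensions, and the linear-path flow-matching loss are my choices, and timestep conditioning (e.g., adaLN) and the actual image/text encoders are omitted.

import torch
import torch.nn as nn


class CascadedCrossAttentionBlock(nn.Module):
    """Illustrative DiT-style block: action tokens attend, in sequence, to global
    scene features, to features of the high-resolution crop around the planner's
    bounding box, and to the embedded subtask (skill) instruction."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One cross-attention stage per conditioning stream, applied in order.
        self.xattn_global = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.xattn_crop = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.xattn_skill = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(5)])
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, actions, global_ctx, crop_ctx, skill_ctx):
        # actions:    (B, T, dim)  interpolated/noisy action chunk being denoised
        # global_ctx: (B, Ng, dim) whole-scene image features
        # crop_ctx:   (B, Nc, dim) features of the object-centric crop
        # skill_ctx:  (B, Ns, dim) embedded subtask instruction
        x = actions
        q = self.norms[0](x)
        x = x + self.self_attn(q, q, q)[0]
        x = x + self.xattn_global(self.norms[1](x), global_ctx, global_ctx)[0]
        x = x + self.xattn_crop(self.norms[2](x), crop_ctx, crop_ctx)[0]
        x = x + self.xattn_skill(self.norms[3](x), skill_ctx, skill_ctx)[0]
        return x + self.mlp(self.norms[4](x))


def flow_matching_loss(model, actions, global_ctx, crop_ctx, skill_ctx):
    # Linear-path flow matching: interpolate between Gaussian noise and the
    # ground-truth action chunk and regress the constant velocity (actions - noise).
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)
    x_t = (1 - t) * noise + t * actions
    pred_velocity = model(x_t, global_ctx, crop_ctx, skill_ctx)
    return ((pred_velocity - (actions - noise)) ** 2).mean()


# Shape check with random features standing in for the encoders' outputs.
block = CascadedCrossAttentionBlock()
loss = flow_matching_loss(
    block,
    torch.randn(2, 16, 512),   # 16-step action chunk
    torch.randn(2, 196, 512),  # global image tokens
    torch.randn(2, 64, 512),   # object-centric crop tokens
    torch.randn(2, 8, 512),    # skill instruction tokens
)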

Source: https://huggingface.co/papers/2604.14125

View arXiv page (https://arxiv.org/abs/2604.14125) View PDF (https://arxiv.org/pdf/2604.14125) Project page (https://tianshuoy.github.io/HiVLA-page/)

Similar Articles

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Hugging Face Daily Papers

OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving at answer-only inference latency. It is the first latent CoT method to surpass explicit CoT across four benchmarks.

A better method for planning complex visual tasks

MIT News — Artificial Intelligence

MIT researchers developed VLMFP, a two-stage generative AI approach combining vision-language models with formal planning software to achieve 70% success rate on complex visual planning tasks like robot navigation, nearly 2.3x better than existing baselines. The method automatically translates visual scenarios into planning files that classical solvers can process, enabling effective long-horizon planning in novel environments.
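The two-stage pipeline described above can be pictured with a short sketch. It is an assumption-laden illustration, not the released VLMFP code: the "planning files" are assumed to be PDDL, the "classical solver" is assumed to be something like Fast Downward, and query_vlm, domain.pddl, and the goal string are placeholders.

import subprocess

PROMPT = (
    "List the objects and spatial relations visible in this image as PDDL facts, "
    "then write a complete PDDL problem file for the goal: {goal}"
)

def plan_from_image(image_path: str, goal: str, query_vlm) -> str:
    # Stage 1: the VLM translates the visual scenario into a planning file.
    problem_pddl = query_vlm(image=image_path, prompt=PROMPT.format(goal=goal))
    with open("problem.pddl", "w") as f:
        f.write(problem_pddl)
    # Stage 2: a classical planner searches for the long-horizon plan;
    # Fast Downward writes its solution to "sas_plan" by default.
    subprocess.run(
        ["./fast-downward.py", "domain.pddl", "problem.pddl",
         "--search", "astar(lmcut())"],
        check=True,
    )
    with open("sas_plan") as f:
        return f.read()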

Just open-sourced FastVLA

Reddit r/LocalLLaMA

FastVLA, an open-source Vision-Language-Action model, now runs robot control at 5 Hz on an L4 GPU.