HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Hugging Face Daily Papers

Summary

HiVLA introduces a hierarchical vision-language-action framework for robotic manipulation that decouples semantic planning from motor control. The system pairs a VLM planner for task decomposition and visual grounding with a specialized diffusion transformer (DiT) action expert that fuses its inputs through cascaded cross-attention, outperforming end-to-end baselines, particularly on long-horizon tasks and fine-grained manipulation.

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce in the low-level part a flow-matching Diffusion Transformer (DiT) action expert equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
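The abstract's description of the action expert suggests the following minimal sketch of how cascaded cross-attention could fuse the three conditioning streams inside a flow-matching DiT block. This is an illustration under assumptions, not the authors' implementation: the module names, dimensions, and the linear-path flow-matching loss are my choices, and timestep conditioning (e.g., adaLN) and the actual image/text encoders are omitted.

import torch
import torch.nn as nn


class CascadedCrossAttentionBlock(nn.Module):
    """Illustrative DiT-style block: action tokens attend, in sequence, to global
    scene features, to features of the high-resolution crop around the planner's
    bounding box, and to the embedded subtask (skill) instruction."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One cross-attention stage per conditioning stream, applied in order.
        self.xattn_global = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.xattn_crop = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.xattn_skill = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(5)])
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, actions, global_ctx, crop_ctx, skill_ctx):
        # actions:    (B, T, dim)  interpolated/noisy action chunk being denoised
        # global_ctx: (B, Ng, dim) whole-scene image features
        # crop_ctx:   (B, Nc, dim) features of the object-centric crop
        # skill_ctx:  (B, Ns, dim) embedded subtask instruction
        x = actions
        q = self.norms[0](x)
        x = x + self.self_attn(q, q, q)[0]
        x = x + self.xattn_global(self.norms[1](x), global_ctx, global_ctx)[0]
        x = x + self.xattn_crop(self.norms[2](x), crop_ctx, crop_ctx)[0]
        x = x + self.xattn_skill(self.norms[3](x), skill_ctx, skill_ctx)[0]
        return x + self.mlp(self.norms[4](x))


def flow_matching_loss(model, actions, global_ctx, crop_ctx, skill_ctx):
    # Linear-path flow matching: interpolate between Gaussian noise and the
    # ground-truth action chunk and regress the constant velocity (actions - noise).
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)
    x_t = (1 - t) * noise + t * actions
    pred_velocity = model(x_t, global_ctx, crop_ctx, skill_ctx)
    return ((pred_velocity - (actions - noise)) ** 2).mean()


# Shape check with random features standing in for the encoders' outputs.
block = CascadedCrossAttentionBlock()
loss = flow_matching_loss(
    block,
    torch.randn(2, 16, 512),   # 16-step action chunk
    torch.randn(2, 196, 512),  # global image tokens
    torch.randn(2, 64, 512),   # object-centric crop tokens
    torch.randn(2, 8, 512),    # skill instruction tokens
)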

Source: https://huggingface.co/papers/2604.14125

View arXiv page (https://arxiv.org/abs/2604.14125) View PDF (https://arxiv.org/pdf/2604.14125) Project page (https://tianshuoy.github.io/HiVLA-page/)

Similar Articles

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Hugging Face Daily Papers

OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving at answer-only inference latency. It is the first latent CoT method to surpass explicit CoT across four benchmarks.

A better method for planning complex visual tasks

MIT News — Artificial Intelligence

MIT researchers developed VLMFP, a two-stage generative AI approach combining vision-language models with formal planning software to achieve 70% success rate on complex visual planning tasks like robot navigation, nearly 2.3x better than existing baselines. The method automatically translates visual scenarios into planning files that classical solvers can process, enabling effective long-horizon planning in novel environments.
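The two-stage pipeline described above can be pictured with a short sketch. It is an assumption-laden illustration, not the released VLMFP code: the "planning files" are assumed to be PDDL, the "classical solver" is assumed to be something like Fast Downward, and query_vlm, domain.pddl, and the goal string are placeholders.

import subprocess

PROMPT = (
    "List the objects and spatial relations visible in this image as PDDL facts, "
    "then write a complete PDDL problem file for the goal: {goal}"
)

def plan_from_image(image_path: str, goal: str, query_vlm) -> str:
    # Stage 1: the VLM translates the visual scenario into a planning file.
    problem_pddl = query_vlm(image=image_path, prompt=PROMPT.format(goal=goal))
    with open("problem.pddl", "w") as f:
        f.write(problem_pddl)
    # Stage 2: a classical planner searches for the long-horizon plan;
    # Fast Downward writes its solution to "sas_plan" by default.
    subprocess.run(
        ["./fast-downward.py", "domain.pddl", "problem.pddl",
         "--search", "astar(lmcut())"],
        check=True,
    )
    with open("sas_plan") as f:
        return f.read()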

Just open-sourced FastVLA

Reddit r/LocalLLaMA

FastVLA, an open-source Vision-Language-Action model, now runs robot control at 5 Hz on an L4 GPU.