HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
Summary
HiVLA introduces a hierarchical vision-language-action framework that decouples semantic planning from motor control using a diffusion transformer action expert for improved robotic manipulation. The system combines a VLM planner for task decomposition and visual grounding with a specialized DiT action expert using cascaded cross-attention, outperforming end-to-end baselines, particularly in long-horizon tasks and fine-grained manipulation.
Source: https://huggingface.co/papers/2604.14125
Abstract
HiVLA presents a hierarchical vision-language-action framework that decouples semantic planning from motor control using a diffusion transformer action expert with cascaded cross-attention for improved robotic manipulation.
While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in the low-level part equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components.
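The cascaded fusion described above can be sketched as a sequence of residual cross-attention stages, where action latents attend in turn to global image features, object-centric crop features, and skill-instruction tokens. This is an illustrative numpy sketch under our own assumptions (single head, no learned projections, arbitrary token counts and dimensions), not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv, d):
    # Single-head scaled dot-product cross-attention;
    # learned Q/K/V projections are omitted for brevity.
    scores = q @ kv.T / np.sqrt(d)
    return softmax(scores) @ kv

rng = np.random.default_rng(0)
d = 32
action_tokens = rng.normal(size=(8, d))    # DiT action latents (hypothetical sizes)
global_ctx    = rng.normal(size=(64, d))   # full-image context features
crop_feats    = rng.normal(size=(16, d))   # high-resolution object-centric crop
skill_tokens  = rng.normal(size=(4, d))    # subtask-instruction (skill) embedding

# Cascaded fusion: attend to each conditioning stream in sequence,
# with a residual connection around each stage.
x = action_tokens
for cond in (global_ctx, crop_feats, skill_tokens):
    x = x + cross_attention(x, cond, d)

print(x.shape)  # (8, 32)
```

The key design point this illustrates is ordering: coarse global context is fused first, then the high-resolution crop refines object detail, and the skill semantics condition execution last.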
Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and fine-grained manipulation of small objects in cluttered scenes.
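The flow-matching sampling used by such an action expert can be sketched as Euler integration of a learned velocity field from Gaussian noise (t=0) to an action chunk (t=1). The velocity network below is a hypothetical linear stand-in for the trained DiT, and the dimensions and step count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
act_dim, horizon, steps = 7, 16, 10            # hypothetical action/chunk sizes
W = rng.normal(scale=0.1, size=(act_dim, act_dim))  # stand-in for DiT weights

def velocity(a_t, t, cond):
    # Placeholder velocity field; in HiVLA this would be the DiT
    # conditioned on the fused plan features. t is unused by this stub.
    return a_t @ W + cond

cond = rng.normal(scale=0.1, size=(horizon, act_dim))  # fused conditioning
a = rng.normal(size=(horizon, act_dim))                # a_0 ~ N(0, I)
dt = 1.0 / steps
for i in range(steps):                                 # Euler ODE solve, t: 0 -> 1
    t = i * dt
    a = a + dt * velocity(a, t, cond)

print(a.shape)  # (16, 7)
```

At inference, a few such integration steps suffice, which is why flow-matching action heads are attractive for real-time control compared to many-step diffusion sampling.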
arXiv page: https://arxiv.org/abs/2604.14125 · PDF: https://arxiv.org/pdf/2604.14125 · Project page: https://tianshuoy.github.io/HiVLA-page/
Similar Articles
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving at answer-only inference latency. It is the first latent CoT method to surpass explicit CoT across four benchmarks.
A better method for planning complex visual tasks
MIT researchers developed VLMFP, a two-stage generative AI approach combining vision-language models with formal planning software to achieve 70% success rate on complex visual planning tasks like robot navigation, nearly 2.3x better than existing baselines. The method automatically translates visual scenarios into planning files that classical solvers can process, enabling effective long-horizon planning in novel environments.
FastVLA open-sourced
FastVLA, an open-source Vision-Language-Action model, now runs robot control at 5 Hz on an L4 GPU.
HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning
HyperGVL introduces the first benchmark for evaluating Large Vision-Language Models on hypergraph understanding and reasoning, featuring 84,000 QA samples across 12 tasks and real-world applications. The paper also proposes WiseHyGR, a generalizable router that enhances LVLM performance through adaptive hypergraph representations.
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
DeVI introduces a framework that turns text-conditioned synthetic videos into physically plausible dexterous robot control via a hybrid 3D-2D tracking reward, enabling zero-shot generalization to unseen objects.