VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation
Summary
VoLoAgent integrates vision-language models with robot capabilities for open-vocabulary long-horizon manipulation tasks, introducing a physical orchestrator that plans, monitors, and recovers using interruptible tools, and a benchmark called RoboVoLo for evaluation.
View Cached Full Text
Cached at: 06/10/26, 05:45 AM
Paper page - VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation
Source: https://huggingface.co/papers/2606.07723 Authors:
,
,
,
,
,
,
,
,
,
,
Abstract
VoLoAgent enables physical orchestration by integrating vision-language models with robot capabilities for open-vocabulary long-horizon manipulation tasks.
Open-vocabularylong-horizon manipulationrequires robots to reason over flexible instructions and complex multi-object scenes while adaptively planning, executing, monitoring, and recovering from failures. We address these demands with a closed agent loop in which a VLM orchestrates heterogeneousrobot capabilitiesasinterruptible tools. Unlike in virtual AI agents, the timing of decisions, actions and tool calls is important in a physical world that does not pause for reasoning. We refer to this setting asPhysical Orchestration, and propose VoLoAgent, a VLM that plans, monitors, and recovers by treating aVLA/WAMas an interruptible tool it steers mid-rollout alongside vision models and action primitives. To evaluate these long-horizon capabilities, we introduce RoboVoLo, a high-fidelity benchmark foropen-vocabularylong-horizon manipulationacross common sense, memory/state tracking, complex references, and world knowledge, with bothtask-level successandfailure-mode diagnostics. Experiments show VoLoAgent substantially outperforms singleVLA/VLM or tool-based systems, with validation on real-robot experiments. Project page: https://chicychen.github.io/VoLo/
View arXiv pageView PDFProject pageAdd to collection
Get this paper in your agent:
hf papers read 2606\.07723
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper0
No model linking this paper
Cite arxiv.org/abs/2606.07723 in a model README.md to link it from this page.
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2606.07723 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2606.07723 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA introduces a hierarchical vision-language-action framework that decouples semantic planning from motor control using a diffusion transformer action expert for improved robotic manipulation. The system combines a VLM planner for task decomposition and visual grounding with a specialized DiT action expert using cascaded cross-attention, outperforming end-to-end baselines particularly in long-horizon tasks and fine-grained manipulation.
Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack
HyVLA-0.5 is an end-to-end robotic learning system that integrates data collection, model design, pre-training, fine-tuning, and reinforcement learning for real-world deployment.
AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding
AffordanceVLA introduces a unified framework using structured affordance forecasting as an intermediate representation to improve perception-action mapping in robotic manipulation, leveraging vision-language models and a Mixture-of-Transformer architecture.
IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
IntentVLA is a history-conditioned visual-language-action framework that improves robot imitation learning stability by encoding short-horizon intents from visual observations, addressing challenges from partial observability and ambiguous observations. It also introduces AliasBench, an ambiguity-aware benchmark for evaluating such methods.
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Qwen-VLA is a unified vision-language-action model for embodied decision-making, integrating manipulation, navigation, and trajectory prediction across different robot platforms. It uses a DiT-based action decoder and embodiment-aware prompt conditioning, achieving strong performance and out-of-distribution generalization.