VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

Hugging Face Daily Papers

Summary

VoLoAgent pairs a vision-language model with robot capabilities for open-vocabulary long-horizon manipulation. It acts as a physical orchestrator that plans, monitors, and recovers using interruptible tools, and it is evaluated on a new benchmark, RoboVoLo.

Open-vocabulary long-horizon manipulation requires robots to reason over flexible instructions and complex multi-object scenes while adaptively planning, executing, monitoring, and recovering from failures. We address these demands with a closed agent loop in which a VLM orchestrates heterogeneous robot capabilities as interruptible tools. Unlike in virtual AI agents, the timing of decisions, actions, and tool calls is important in a physical world that does not pause for reasoning. We refer to this setting as Physical Orchestration, and propose VoLoAgent, a VLM that plans, monitors, and recovers by treating a VLA/WAM as an interruptible tool it steers mid-rollout alongside vision models and action primitives. To evaluate these long-horizon capabilities, we introduce RoboVoLo, a high-fidelity benchmark for open-vocabulary long-horizon manipulation across common sense, memory/state tracking, complex references, and world knowledge, with both task-level success and failure-mode diagnostics. Experiments show VoLoAgent substantially outperforms single VLA/VLM or tool-based systems, with validation on real-robot experiments.

Project page: https://chicychen.github.io/VoLo/
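
The closed loop described above can be pictured as a scheduler over tool calls that may be cut short. The sketch below is a toy under stated assumptions, not the paper's implementation. Each tool is modeled as a Python generator that yields one observation per control step, a `monitor` callback stands in for the VLM's mid-rollout check, and a `recover` callback stands in for replanning. All names (`orchestrate`, `make_demo`, the `slipped` flag, the `retreat` primitive) are hypothetical, and the real-time aspect the abstract stresses is not modeled here.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Every name below is hypothetical and chosen for illustration only.
Obs = Dict[str, object]
Call = Tuple[str, str]                  # (tool name, subgoal text)
Tool = Callable[[str], Iterator[Obs]]   # generator function: one observation per step
LogEntry = Tuple[str, str, str]         # (status, tool name, subgoal text)


def orchestrate(
    plan: List[Call],
    tools: Dict[str, Tool],
    monitor: Callable[[Obs], bool],
    recover: Callable[[str, str, Obs], List[Call]],
    max_interrupts: int = 3,
) -> List[LogEntry]:
    """Run tool calls in order, interrupting a rollout when the monitor objects."""
    log: List[LogEntry] = []
    queue = list(plan)
    interrupts = 0
    while queue:
        tool_name, subgoal = queue.pop(0)
        rollout = tools[tool_name](subgoal)
        interrupted = False
        for obs in rollout:
            if not monitor(obs):
                rollout.close()          # stop the tool mid-rollout
                interrupted = True
                break
        if not interrupted:
            log.append(("done", tool_name, subgoal))
            continue
        interrupts += 1
        log.append(("interrupted", tool_name, subgoal))
        if interrupts > max_interrupts:
            log.append(("abort", tool_name, subgoal))
            return log
        # Recovery steps run before the rest of the original plan.
        queue = recover(tool_name, subgoal, obs) + queue
    return log


def make_demo():
    """Build a simulated setup in which the first policy rollout slips once."""
    attempts = {"n": 0}

    def vla(subgoal: str) -> Iterator[Obs]:
        attempts["n"] += 1
        for t in range(5):
            slipped = attempts["n"] == 1 and t == 2   # simulated failure
            yield {"subgoal": subgoal, "t": t, "slipped": slipped}

    def primitive(subgoal: str) -> Iterator[Obs]:
        yield {"subgoal": subgoal, "t": 0, "slipped": False}

    def monitor(obs: Obs) -> bool:
        return not obs["slipped"]

    def recover(tool_name: str, subgoal: str, obs: Obs) -> List[Call]:
        # Assumed recovery strategy: back off, then retry the same call.
        return [("primitive", "retreat"), (tool_name, subgoal)]

    return {"vla": vla, "primitive": primitive}, monitor, recover


if __name__ == "__main__":
    tools, monitor, recover = make_demo()
    plan = [("vla", "pick up the cup"), ("primitive", "place on shelf")]
    for entry in orchestrate(plan, tools, monitor, recover):
        print(entry)
```

In this simulated run the log records one interruption of the first `vla` call, followed by the retreat primitive, a successful retry, and the final placement. Modeling tools as generators keeps the example deterministic. A physical system would more plausibly run tools concurrently with the monitor, which this sketch does not attempt.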

Cached at: 06/10/26, 05:45 AM


Source: https://huggingface.co/papers/2606.07723



Similar Articles

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Hugging Face Daily Papers

HiVLA introduces a hierarchical vision-language-action framework that decouples semantic planning from motor control using a diffusion transformer action expert for improved robotic manipulation. The system combines a VLM planner for task decomposition and visual grounding with a specialized DiT action expert using cascaded cross-attention, outperforming end-to-end baselines particularly in long-horizon tasks and fine-grained manipulation.

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

Hugging Face Daily Papers

IntentVLA is a history-conditioned vision-language-action framework that improves robot imitation learning stability by encoding short-horizon intents from visual observations, addressing challenges from partial observability and ambiguous observations. It also introduces AliasBench, an ambiguity-aware benchmark for evaluating such methods.