VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
Summary
VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.
View Cached Full Text
Cached at: 06/01/26, 03:20 PM
Paper page - VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
Source: https://huggingface.co/papers/2605.30011 Authors:
,
,
,
,
,
,
,
,
,
,
Abstract
VisualThinking-VLA enables fast, accurate vision-language-action policies through visual reasoning that preserves spatial precision and reduces latency compared to text-based approaches.
Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, avisual intermediate-reasoningframework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compactvisual-evidence interfacethat preserves spatial precision while avoiding decoding overhead. Besides, to further improve performance and efficiency, VISUALTHINK-VLA adopts a tailoredselective routing mechanismto learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduceVisualEvidence-Kit, a supervision-and-audit resource centered on aVisualEvidence-Agentthat constructs a 754.7k VLA instructionsVisualEvidence-Setfor route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, onBridgeData V2, it reduces step latency from 8.377,s withECoTto 0.367,s, achieving a 22.8 times speedup.
View arXiv pageView PDFGitHub15Add to collection
Get this paper in your agent:
hf papers read 2605\.30011
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper0
No model linking this paper
Cite arxiv.org/abs/2605.30011 in a model README.md to link it from this page.
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2605.30011 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2605.30011 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving at answer-only inference latency. It is the first latent CoT method to surpass explicit CoT across four benchmarks.
IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
IntentVLA is a history-conditioned visual-language-action framework that improves robot imitation learning stability by encoding short-horizon intents from visual observations, addressing challenges from partial observability and ambiguous observations. It also introduces AliasBench, an ambiguity-aware benchmark for evaluating such methods.
TBD-VLA: Temporal Block Diffusion Vision Language Action Model
TBD-VLA introduces a discrete vision-language-action framework that combines block diffusion with autoregressive generation to achieve efficient temporal action modeling and faster inference, significantly outperforming prior VLA approaches in simulation and real-world manipulation tasks.
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Qwen-VLA is a unified vision-language-action model for embodied decision-making, integrating manipulation, navigation, and trajectory prediction across different robot platforms. It uses a DiT-based action decoder and embodiment-aware prompt conditioning, achieving strong performance and out-of-distribution generalization.
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
LabVLA is a vision-language-action model for scientific laboratory automation, trained with a two-stage approach combining action token pretraining and flow matching. It achieves state-of-the-art success rates on the LabUtopia benchmark by leveraging simulated data to bridge the gap between household demonstrations and lab-specific tasks.