VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

Summary

VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. To further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs VisualEvidence-Set, a collection of 754.7k VLA instructions for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, a 22.8× speedup.
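The reported speedup follows directly from the two step latencies quoted for BridgeData V2. The short Python sketch below reproduces that arithmetic and also converts each latency into an upper bound on closed-loop control rate. The helper names `speedup` and `control_rate_hz` are illustrative and do not come from the paper, and the control-rate figure assumes inference is the only per-step cost, which the abstract does not state.

```python
# Step latencies quoted in the abstract for BridgeData V2.
ECOT_LATENCY_S = 8.377         # ECoT (text-based reasoning baseline)
VISUALTHINK_LATENCY_S = 0.367  # VISUALTHINK-VLA


def speedup(baseline_s: float, new_s: float) -> float:
    """Ratio of baseline step latency to new step latency."""
    return baseline_s / new_s


def control_rate_hz(step_latency_s: float) -> float:
    """Upper bound on control steps per second.

    Assumption (not from the paper): inference is the only per-step cost.
    """
    return 1.0 / step_latency_s


if __name__ == "__main__":
    print(f"speedup: {speedup(ECOT_LATENCY_S, VISUALTHINK_LATENCY_S):.1f}x")
    print(f"ECoT max rate: {control_rate_hz(ECOT_LATENCY_S):.2f} Hz")
    print(f"VISUALTHINK-VLA max rate: {control_rate_hz(VISUALTHINK_LATENCY_S):.2f} Hz")
```

Under that assumption, the baseline could issue roughly one action every eight seconds, while the sub-second policy could issue between two and three actions per second, which is the practical meaning of moving into the "sub-second regime".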

Original Article

Cached at: 06/01/26, 03:20 PM

Source: https://huggingface.co/papers/2605.30011 Authors:

Abstract

VisualThink-VLA enables fast, accurate vision-language-action policies through visual reasoning that preserves spatial precision and reduces latency compared to text-based approaches.

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. To further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs VisualEvidence-Set, a collection of 754.7k VLA instructions for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, a 22.8× speedup.

Similar Articles

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
