Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning
Summary
Visual Para-Thinker++ proposes a single-policy multi-agent framework for visual reasoning that uses role-conditioned agents (Main, Worker, Summary) and dedicated training methods to reduce hallucinations and improve efficiency, outperforming baselines on hallucination-sensitive benchmarks.
Cached at: 06/12/26, 06:51 AM
Source: https://huggingface.co/papers/2606.09290
Abstract
A multi-agent framework with shared MLLM policy and role-specific training methods improves visual reasoning by reducing hallucinations and enabling efficient parallel processing.
Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-policy multi-agent framework in which one shared MLLM policy is instantiated as role-conditioned Main, Worker, and Summary Agents. The Main Agent decomposes the task with fixed allocation patterns; Worker Agents reason in parallel under context isolation; and the Summary Agent reconciles full Worker reasoning traces rather than majority-voting on final labels. The shared policy is trained by Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, which assign role-specific rewards and advantages to corresponding token segments to reduce gradient conflict among collaborative roles. A native inference engine enables efficient multi-agent rollout through shared visual prefix and KV cache reuse. Across V*, CountBench, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ consistently outperforms single-trajectory and inference-time parallel baselines, with especially strong gains on hallucination-sensitive visual reasoning.
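The abstract does not give the formulas behind Role-Decoupled Multi-Agent Optimization, so the following is only a minimal sketch of the general idea of assigning role-specific advantages to the corresponding token segments. It assumes a group-normalized advantage (reward minus group mean, divided by group standard deviation) computed separately per role; that choice, along with the names `Segment`, `role_advantages`, and `token_advantages`, is hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Segment:
    role: str        # role that produced these tokens: "main", "worker", or "summary"
    num_tokens: int  # number of tokens in the segment


def role_advantages(rewards_by_role, eps=1e-6):
    """Normalize each role's rewards within its own group of rollouts.

    ASSUMPTION: group normalization is an illustrative choice, not the
    paper's stated method.
    """
    advantages = {}
    for role, rewards in rewards_by_role.items():
        mu = mean(rewards)
        sigma = pstdev(rewards)
        advantages[role] = [(r - mu) / (sigma + eps) for r in rewards]
    return advantages


def token_advantages(segments, rollout_index, advantages):
    """Broadcast one rollout's per-role advantage onto that role's tokens only."""
    per_token = []
    for seg in segments:
        per_token.extend([advantages[seg.role][rollout_index]] * seg.num_tokens)
    return per_token


# Two rollouts, with a separate (made-up) reward per role in each rollout.
rewards_by_role = {
    "main": [1.0, 0.0],
    "worker": [0.5, 0.5],
    "summary": [0.0, 1.0],
}
adv = role_advantages(rewards_by_role)

# Token layout of rollout 0: 2 Main tokens, 3 Worker tokens, 1 Summary token.
segments = [Segment("main", 2), Segment("worker", 3), Segment("summary", 1)]
tok_adv = token_advantages(segments, 0, adv)
print(tok_adv)
```

In this toy example the Main tokens of rollout 0 receive a positive advantage, the Worker tokens receive zero (both rollouts earned the same Worker reward), and the Summary token receives a negative one. Because each token's learning signal depends only on its own role's reward, a weak summary does not penalize the decomposition tokens, which is one plausible reading of how role decoupling could reduce gradient conflict.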
Similar Articles
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.
Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators
The paper proposes Astra, an agentic spatial reasoning framework that couples a reinforcement learning-trained VLM policy with a world simulator to generate novel-view observations for improved spatial reasoning in Vision-Language Models.
Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models
Proposes the Pseudocode-guided Structured Reasoning framework (PStar) that adaptively selects structured pseudocode reasoning paths to reduce hallucinations in Vision-Language Models, achieving state-of-the-art scores on POPE and MMStar benchmarks.
Structured Role-Aware Policy Optimization for Multimodal Reasoning
This paper introduces Structured Role-Aware Policy Optimization (SRPO), a method that improves multimodal reasoning in Large Vision-Language Models by assigning token-level credit based on distinct perception and reasoning roles within reinforcement learning frameworks.
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
This paper uncovers that prolonged reasoning in vision-language models can impair perceptual grounding, causing recognition failures on basic visual questions. It proposes Vision-Anchored Policy Optimization (VAPO) to steer reasoning toward visually grounded trajectories, achieving state-of-the-art performance with the VAPO-Thinker-7B model.