Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

Hugging Face Daily Papers

Summary

Visual Para-Thinker++ proposes a single-policy multi-agent framework for visual reasoning that uses role-conditioned agents (Main, Worker, Summary) and dedicated training methods to reduce hallucinations and improve efficiency, outperforming baselines on hallucination-sensitive benchmarks.

Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-policy multi-agent framework in which one shared MLLM policy is instantiated as role-conditioned Main, Worker, and Summary Agents. The Main Agent decomposes the task with fixed allocation patterns; Worker Agents reason in parallel under context isolation; and the Summary Agent reconciles full Worker reasoning traces rather than majority-voting on final labels. The shared policy is trained by Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, which assign role-specific rewards and advantages to corresponding token segments to reduce gradient conflict among collaborative roles. A native inference engine enables efficient multi-agent rollout through shared visual prefix and KV cache reuse. Across V*, CountBench, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ consistently outperforms single-trajectory and inference-time parallel baselines, with especially strong gains on hallucination-sensitive visual reasoning.
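
The role structure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Policy` call signature, the role tags, the prompt strings, and `toy_policy` are all hypothetical stand-ins for a real MLLM, and thread-based parallelism is an assumption made only to keep the example runnable. It shows one shared policy invoked under three roles, with each Worker seeing only its own subtask and the Summary Agent reading full Worker traces rather than voting on final labels.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical interface: ONE shared policy, conditioned on a role tag.
Policy = Callable[[str, str, str], str]  # (role, image_ref, prompt) -> text


def run_pipeline(policy: Policy, image_ref: str, question: str,
                 n_workers: int = 3) -> str:
    # Main Agent: decompose the task into a fixed number of subtasks,
    # one per line (the "fixed allocation pattern" is simplified here).
    plan = policy("main", image_ref,
                  f"Split into {n_workers} subtasks: {question}")
    subtasks: List[str] = [s.strip() for s in plan.splitlines() if s.strip()]
    subtasks = subtasks[:n_workers]

    # Worker Agents: each call receives only its own subtask, so no Worker
    # can see another Worker's reasoning (context isolation).
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        traces = list(pool.map(lambda t: policy("worker", image_ref, t),
                               subtasks))

    # Summary Agent: reconciles the full traces, not just final labels.
    joined = "\n---\n".join(traces)
    return policy("summary", image_ref,
                  f"Question: {question}\nTraces:\n{joined}")


def toy_policy(role: str, image_ref: str, prompt: str) -> str:
    # Stand-in for a real model; returns canned text per role.
    if role == "main":
        return "count left half\ncount right half\ncheck occlusions"
    if role == "worker":
        return f"trace for [{prompt}]"
    return f"answer from {prompt.count('trace for')} traces"


if __name__ == "__main__":
    print(run_pipeline(toy_policy, "img.png", "How many apples?"))
```

In a real system the three roles would share model weights and, per the abstract, the visual prefix and KV cache; the sketch only mirrors the control flow.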

Cached at: 06/12/26, 06:51 AM


Source: https://huggingface.co/papers/2606.09290

Abstract

A multi-agent framework with shared MLLM policy and role-specific training methods improves visual reasoning by reducing hallucinations and enabling efficient parallel processing.

Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-policy multi-agent framework in which one shared MLLM policy is instantiated as role-conditioned Main, Worker, and Summary Agents. The Main Agent decomposes the task with fixed allocation patterns; Worker Agents reason in parallel under context isolation; and the Summary Agent reconciles full Worker reasoning traces rather than majority-voting on final labels. The shared policy is trained by Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, which assign role-specific rewards and advantages to corresponding token segments to reduce gradient conflict among collaborative roles. A native inference engine enables efficient multi-agent rollout through shared visual prefix and KV cache reuse. Across V*, CountBench, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ consistently outperforms single-trajectory and inference-time parallel baselines, with especially strong gains on hallucination-sensitive visual reasoning.
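
To make the idea of assigning role-specific advantages to token segments concrete, here is a small sketch. The abstract does not give the actual formulas, so everything specific below is an assumption: the per-role normalization across a group of rollouts (in the style of group-relative policy optimization), the `(role, start, end)` segment layout, and the function names `role_advantages` and `token_advantages` are all hypothetical. The point is only that each role's tokens receive credit from that role's own reward, so a weak Worker trace does not penalize a good Main decomposition.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Segment = Tuple[str, int, int]  # (role, start, end), end exclusive


def role_advantages(rewards: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Normalize each role's reward across the rollout group,
    independently per role (assumed normalization scheme)."""
    roles = list(rewards[0].keys())
    stats = {}
    for role in roles:
        vals = [rw[role] for rw in rewards]
        stats[role] = (mean(vals), pstdev(vals))
    eps = 1e-8  # avoids division by zero when all rewards are equal
    return [
        {role: (rw[role] - stats[role][0]) / (stats[role][1] + eps)
         for role in roles}
        for rw in rewards
    ]


def token_advantages(segments: List[Segment], adv: Dict[str, float],
                     length: int) -> List[float]:
    """Broadcast each role's advantage onto the tokens of its segment."""
    out = [0.0] * length
    for role, start, end in segments:
        for i in range(start, end):
            out[i] = adv[role]
    return out


if __name__ == "__main__":
    # Two hypothetical rollouts with per-role rewards.
    rewards = [
        {"main": 1.0, "worker": 0.0, "summary": 1.0},
        {"main": 1.0, "worker": 1.0, "summary": 0.0},
    ]
    segments = [("main", 0, 2), ("worker", 2, 5), ("summary", 5, 6)]
    adv = role_advantages(rewards)
    print(token_advantages(segments, adv[0], length=6))
```

In this toy group the Main reward is identical across rollouts, so Main tokens receive zero advantage, while Worker and Summary tokens receive opposite-signed advantages in the two rollouts.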

Similar Articles

Structured Role-Aware Policy Optimization for Multimodal Reasoning

arXiv cs.AI

This paper introduces Structured Role-Aware Policy Optimization (SRPO), a method that improves multimodal reasoning in Large Vision-Language Models by assigning token-level credit based on distinct perception and reasoning roles within reinforcement learning frameworks.

More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

Papers with Code Trending

This paper uncovers that prolonged reasoning in vision-language models can impair perceptual grounding, causing recognition failures on basic visual questions. It proposes Vision-Anchored Policy Optimization (VAPO) to steer reasoning toward visually grounded trajectories, achieving state-of-the-art performance with the VAPO-Thinker-7B model.