Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

arXiv cs.AI Papers

Summary

This paper introduces a reinforcement learning framework that improves perception-reasoning synergy in vision-language models by explicitly rewarding perceptual fidelity, using a 'blindfolded reasoning' proxy and structured verbal verification to address ambiguity in modality credit assignment.

arXiv:2605.14054v1 Announce Type: new Abstract: Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains, often witnessing a "seesaw effect" on perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding the perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error -- either bad seeing or bad thinking -- enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.

Cached at: 05/15/26, 06:20 AM

# Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
Source: [https://arxiv.org/html/2605.14054](https://arxiv.org/html/2605.14054)
Qixin Xu, Changpeng Wang, Taofeng Xue, Chong Peng, Wenhu Chen, Fangzhen Lin

Perception, Reward, Vision\-Language, Reasoning

Abstract

> Achieving robust perception\-reasoning synergy is a central goal for advanced Vision\-Language Models \(VLMs\)\. Recent advancements have pursued this goal via architectural designs or agentic workflows\. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity\. Worse, this heavy investment does not yield proportional gains, often witnessing a "seesaw effect" on perception and reasoning\. This motivates a fundamental rethinking of the true bottleneck\. In this paper, we argue that the root cause of this trade\-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception \("bad seeing"\) or flawed logic \("bad thinking"\)? To resolve this, we introduce a reinforcement learning framework that improves perception\-reasoning synergy by reliably rewarding the perception fidelity\. We explicitly decompose the generation process into interleaved perception and reasoning steps\. This decoupling enables targeted supervision on perception\. Crucially, we introduce Perception Verification \(PV\), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes\. Furthermore, to scale training across free\-form VL tasks, we propose Structured Verbal Verification, which replaces high\-variance LLM judging with structured algorithmic execution\. These techniques are integrated into a Modality\-Aware Credit Assignment \(MoCA\) mechanism, which routes rewards to the specific source of error – either bad seeing or bad thinking – enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum\.

## 1 Introduction

Humans navigate the world through a seamless, synergistic interplay between visual perception \(seeing\) and cognitive reasoning \(thinking\)\. We do not “see” once and then “think” in isolation; rather, we continuously re\-interrogate visual evidence to form, verify, and refine hypotheses\. This tight loop, where cognition guides perception and perception grounds cognition – a dynamic we term *perception\-reasoning synergy* – is the hallmark of robust, generalizable intelligence\.

Achieving this synergy has been the central pursuit of advanced Vision\-Language Models \(VLMs\), fueling distinct architectural paradigms\. Initial architectures \(e\.g\., Qwen\-VL\) focused on implicitly fusing visual tokens and textual embeddings, relying on the model’s static textual reasoning to process this fused information \(Liu et al\., [2023](https://arxiv.org/html/2605.14054#bib.bib48); Xu et al\., [2025a](https://arxiv.org/html/2605.14054#bib.bib54); Wang et al\., [2024](https://arxiv.org/html/2605.14054#bib.bib47); Bai et al\., [2025](https://arxiv.org/html/2605.14054#bib.bib80)\)\. More recent works, aspiring toward “thinking with images,” have embraced active perception by incorporating visual operations \(Wang et al\., [2025d](https://arxiv.org/html/2605.14054#bib.bib71)\), more closely reflecting the human synergistic loop\.

These active\-perception models, however, are often implemented through complex, multi\-turn function\-calling or agentic frameworks\. This approach introduces a significant compute and engineering burden, such as implementing multi\-turn RL training and handling asynchrony for long\-tailed episodes, and often yields a "seesaw effect," where marginal visual gains come at the expense of reasoning capabilities \(Jiang et al\., [2025](https://arxiv.org/html/2605.14054#bib.bib45); Wang et al\., [2025d](https://arxiv.org/html/2605.14054#bib.bib71)\)\. We posit that a crucial yet under\-explored bottleneck, unique to vision\-language reasoning, is the lack of direct supervision signals over perception\. Current training paradigms fail to provide modality\-aware feedback for vision\-language reasoning: was the reasoning error caused by flawed perception \(bad seeing\) or flawed logic \(bad thinking\)?

This ambiguity in modality credit assignment is both overlooked and challenging because a VLM’s perception is inherently difficult to access: it is either \(a\) buried in latent\-state activations or \(b\) inextricably entangled with the reasoning components in the final textual output\. This entanglement makes it nearly impossible to supervise perception independently of logic\. This prompts the question:

*Can we improve multimodal reasoning by externalizing perception and directly rewarding it?*

Our answer is yes\. In this work, we explore a simple way to externalize perception, leveraging instruction\-following capabilities to enforce a structural separation in the generation process: perception\-centric blocks \(e\.g\., <recognition\>\) that isolate visual evidence are interleaved with reasoning\-centric blocks \(e\.g\., <think\>\) \(Fig\. [2](https://arxiv.org/html/2605.14054#S2.F2)\)\. This choice transforms the opaque "black box" of VLM perception into a transparent sequence, exposing the specific locus of error – bad seeing or bad thinking – and allowing us to reframe the intractable problem of joint supervision into two solvable, component\-level challenges:

1. Perception Verification: How do we specifically supervise the quality of the model’s perception, independent of the final reasoning step?
2. Outcome Verification: To achieve improvement across a wide task spectrum, how do we reliably reward free\-form answers, avoiding the brittleness of regex and the high variance of standard LLM judges?

Given isolated perception explicitly grounded in the textual space, we tackle the first challenge: the lack of ground truth for intermediate visual descriptions\. We introduce Perception Verification via Proxy\. Our key insight is that in explicit vision\-language reasoning, visual details serve as discrete premises for logical deduction; therefore, perception sufficiency can be measured by reasoning feasibility\. We implement this as a "blindfolded reasoning" test: we feed the VLM’s grounded visual evidence to a strong, text\-only reasoner while withholding the image\. If this text\-only proxy can correctly solve the user’s question using *only* the VLM’s descriptions, it demonstrates that the perception block has successfully captured the sufficient statistics required for the task\. This signal rewards the VLM not for generic captioning, but for extracting the precise visual facts necessary to support downstream reasoning\.

While the proxy validates perception, we must also ensure robust supervision for the final reasoning outcomes\. Existing evaluators suffer a dichotomy: flexible LLM judges exhibit high variance, while rigid rule\-based verifiers are brittle\. We bridge this gap with Structured Verbal Verification \(SVV\)\. Rather than asking an LLM for a holistic and subjective "judgment," we provide a structured natural language algorithm \(a verification protocol\) and instruct the judge to explicitly "execute" this protocol step\-by\-step\. This shifts the verifier’s role from subjective estimation to structured execution, significantly reducing variance and ensuring that our reward signal remains reliable even for complex, free\-form responses\.

We unify these techniques into a Modality\-aware Credit Assignment \(MoCA\) mechanism\. By routing these granular rewards to their specific source, punishing "bad seeing" or "bad thinking" distinctly, we dismantle the traditional "seesaw effect" where gains in one modality come at the expense of the other\. Conceptually, our approach mirrors the agentic "thinking with images" paradigm but internalizes it\. Rather than invoking external tools via slow, multi\-turn function loops, our model treats its own perception capabilities as an internal function, calling upon it to retrieve visual evidence before reasoning\. This offers a lightweight, scalable path to perception\-reasoning synergy that is orthogonal to existing external agentic frameworks\.

![Refer to caption](https://arxiv.org/html/2605.14054v1/x1.png)

Figure 1: Overview of MoCA\. MoCA distinguishes "bad seeing" from "bad thinking" by separating Perception and Reasoning\. We introduce Perception Verification \(PV, top loop\), which uses a "blindfolded" text reasoner proxy to reward the VLM’s perception independently of its reasoning\. This, combined with Outcome Verification \(OV, bottom loop\), rewards better perception\-reasoning synergy\. To keep the implementation economical, we use the same verifier for PV and OV\.

We conduct extensive empirical validation across a comprehensive suite of multimodal benchmarks\. Our results demonstrate that the proposed approach outperforms state\-of\-the\-art vision\-language reasoners across a wide task spectrum spanning from perception\-intensive reasoning to rich\-modality reasoning\.

Contributions\. We fundamentally rethink VLM training by shifting from holistic outcome supervision to decoupled Modality\-aware Credit Assignment \(MoCA\)\.

- We enforce an explicit architectural decoupling of perception and reasoning, transforming the VLM from an opaque box into an interpretable, modular reasoner\.
- We introduce Perception Verification via Proxy, a novel "blindfolded reasoning" paradigm that validates visual fidelity by checking if a text\-only reasoner can solve the task using the model’s descriptions alone\.
- We propose Structured Verbal Verification \(SVV\) for robust free\-form evaluation, enabling us to internalize the benefits of agentic "thinking" without external computational overhead\.

## 2 Methodology

Our approach aims to scale VL reasoning across a wide task spectrum\. To achieve this, we focus on addressing the perception supervision and outcome verification problems, optimizing the *internal* perception\-reasoning synergy\. Below we first formalize the problem to expose its core challenge and then introduce our solution\.

### 2\.1 Problem Formulation

![Refer to caption](https://arxiv.org/html/2605.14054v1/x2.png)

Figure 2: Illustration of Interleaved Perception\-Reasoning\.

We formulate the task of generating an answer $Y$ given an image $V$ and question $Q$ as a sequential decision\-making process\. The objective is to learn a single, generalizable policy $\pi_\theta$ \(the VLM\) that generates a reasoning trajectory $\tau = (a_1, a_2, \ldots, a_T)$ to arrive at the correct final answer\.

This process is a \(Partially Observable\) Markov Decision Process \(POMDP\) \(Sondik, [1971](https://arxiv.org/html/2605.14054#bib.bib70)\), where the state $s_t = (V, Q, a_1, \ldots, a_{t-1})$ represents the history of actions\. Our central insight is to deliberately decompose the VLM’s native auto\-regressive action space $\mathcal{A}$ into two distinct, component\-level sets of actions:

- Perception Actions \($\mathcal{A}_P$\) are perception\-centric text blocks \(e\.g\., <recognition\>…</recognition\>\) that serve to ground visual facts and extract relevant evidence from the image $V$\. We focus on multimodal tasks that require explicit reasoning and posit that grounded perception text acts as the discrete premises required for the subsequent logical deduction\.
- Reasoning Actions \($\mathcal{A}_R$\) are reasoning\-centric text blocks \(e\.g\., <think\>\.\.\.</think\>\) that understand user requests, decompose problems, form hypotheses, and draw inferences\.

The policy $\pi_\theta$ is thus an auto\-regressive model $p(a_t \mid s_t)$ that generates an interleaved sequence of these perception and reasoning actions\.

This explicit decomposition exposes the fundamental bottleneck to perception\-reasoning synergy: Ambiguous Credit Assignment\. This challenge is distinct from the standard process supervision challenge of validating reasoning in text\-only LLMs\. In purely textual tasks, intermediate steps can be verified against textual logic\. Our *modality credit assignment* problem is unique: the perception actions \($a_p$\) are grounded in the *visual inputs* \($V$\), not text\. This makes it almost impossible to supervise perception quality by simply looking at the final answer, as a failure could stem from flawed perception \("bad seeing"\) or flawed logic \("bad thinking"\)\.

### 2\.2 Eliciting Perception\-Reasoning Decomposition

To directly supervise the perception of VLMs, we externalize the model’s inherent perception\-reasoning synergy using the interleaved perception\-reasoning format \(see Figure [2](https://arxiv.org/html/2605.14054#S2.F2)\)\. To achieve this, we leverage the instruction\-following capabilities of modern instruction\-tuned VLMs\.

We provide the model with a single, task\-agnostic instruction as part of its system prompt \(Figure [3](https://arxiv.org/html/2605.14054#S2.F3)\)\. This instruction compels the model to "think step\-by\-step" by externalizing its process into perception\-centric <recognition\>\.\.\.</recognition\> blocks and reasoning\-centric <think\>\.\.\.</think\> blocks\. This instruction\-driven approach avoids the need for a large\-scale, synthesized trajectory dataset and allows the subsequent reinforcement learning stage to optimize the model’s native decomposition capabilities from its pre\-trained state\.
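To make the interleaved format concrete, the following is a minimal sketch of how a response in this format could be split back into perception and reasoning blocks. The function names (`split_blocks`, `perception_text`) and the example response are illustrative assumptions, not part of the paper; only the `<recognition>` and `<think>` tag names come from the text above.

```python
import re
from typing import List, Tuple

# Matches <recognition>...</recognition> or <think>...</think>, in order of appearance.
# The backreference \1 requires the closing tag to match the opening tag.
_BLOCK_RE = re.compile(r"<(recognition|think)>(.*?)</\1>", re.DOTALL)


def split_blocks(response: str) -> List[Tuple[str, str]]:
    """Return (tag, text) pairs in generation order."""
    return [(m.group(1), m.group(2).strip()) for m in _BLOCK_RE.finditer(response)]


def perception_text(response: str) -> str:
    """Concatenate only the perception blocks, i.e. the collection {a_p}."""
    return "\n".join(
        text for tag, text in split_blocks(response) if tag == "recognition"
    )


# Hypothetical model output, written by hand for illustration.
example = (
    "<think>The question asks for the trend.</think>"
    "<recognition>The line rises from 2 to 8.</recognition>"
    "<think>So the slope is positive.</think>"
)
```

Separating the blocks this way is what lets a reward be attached to the perception text alone, without touching the reasoning text.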

![Refer to caption](https://arxiv.org/html/2605.14054v1/x3.png)

Figure 3: Instruction to elicit interleaved perception\-reasoning\.

### 2\.3 Rewarding Perception: The "Blindfolded Reasoner" Test

With perception now an explicit output \(<recognition\>\.\.\.\), our goal is to evaluate its quality\. The core challenge is that *there is no ground truth for an intermediate "perceptual thought"*\. We cannot use a static label\. Therefore, we must measure its quality functionally\. Our central insight is that to assign credit, we can *isolate* the perception component’s contribution from the reasoning process\.

We implement this isolation using a strong, text\-only reasoner \(e\.g\., Qwen2\.5\-Instruct\-14B\) as the proxy for a reasoning\-perfect oracle\. Our premise is that the reasoning component of most vision\-language tasks, once visual facts are established, is less complex than pure, abstract reasoning\. We call this the "Blindfolded Reasoner" Test:

1. The VLM generates a perception action $a_p$ \(e\.g\., “After scanning the document, I locate the relevant figure which shows that …”\)\.
2. We provide this strong reasoner with the original question $Q$ and all the VLM’s generated perception text $\{a_p\}$, withholding the image $V$\.
3. We obtain a final answer from the reasoner using only this grounded visual evidence\.

If this "image\-blind" reasoner succeeds, the collection $\{a_p\}$ is demonstrated to be a *sufficient statistic* for the image $V$ relative to the question $Q$\. Conversely, if the reasoner fails, we assign the blame to the flawed perception grounding $\{a_p\}$\. This yields the targeted, modality\-level reward signal, $R_P(\tau)$, that we need\.

We call this framework Perception Verification \(PV\)\. This technique is not ad\-hoc; it serves as a functional proxy for the Information Bottleneck \(IB\) principle \(Tishby et al\., [2000](https://arxiv.org/html/2605.14054#bib.bib68); Alemi et al\., [2016](https://arxiv.org/html/2605.14054#bib.bib67)\)\. It rewards the perception block collection $\{a_p\}$ that is maximally informative about the answer $Y$ \(by being sufficient for the oracle\) while being a minimal \(textual\) representation of the image $V$, i\.e\., $\min_{p(A_p \mid V)} \; I(V; A_p) - \beta\, I(A_p; Y)$\. To ensure minimalism, we explicitly penalize perception blocks $\{a_p\}$ that exceed a predefined token limit \(e\.g\., 800 tokens\)\.

Validity of Textual Grounding\. One might question the feasibility of perception grounding, given that certain visual patterns \(e\.g\., mazes\) cannot be trivially converted to text\. However, within the scope of explicit multimodal reasoning \(System 2\), a solver must extract discrete task\-relevant variables \(e\.g\., "the slope is positive"\) to form a logical chain\. As a result, perception grounding, the identification and grounding of any visual feature critical to the solution, serves as a premise for symbolic reasoning\. To this end, the Blindfolded Reasoner functions as a sufficiency test: it rewards the VLM not for describing image pixels, but for extracting the sufficient statistics required to satisfy the specific reasoning query\.

Reward Definition\. Based on the proxy’s performance, we assign a binary perception reward $R_P(\tau) \in \{0, 1\}$ to the trajectory:

$$R_P(\tau) = \begin{cases} 1, & \text{if } \mathrm{Proxy}(Q, \{a_p\}) = Y_{\text{correct}} \\ 0, & \text{otherwise} \end{cases}$$

This discrete signal is then used to gate the credit assignment in our MoCA framework\.
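The binary reward can be sketched in a few lines, assuming the text-only reasoner and the answer checker are passed in as plain callables. Everything named here (`perception_reward`, `proxy`, `is_correct`, the prompt wording, and the toy stand-ins) is a hypothetical illustration; in the paper the proxy is a strong text-only LLM and correctness is judged by the outcome verifier.

```python
from typing import Callable, Sequence


def perception_reward(
    question: str,
    perception_blocks: Sequence[str],
    reference: str,
    proxy: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> int:
    """Binary R_P: 1 if the image-blind proxy answers correctly, else 0."""
    # The proxy sees only the question and the perception text, never the image.
    prompt = (
        "Answer the question using only the observations below.\n"
        "Observations:\n" + "\n".join(perception_blocks) + f"\nQuestion: {question}"
    )
    answer = proxy(prompt)
    return 1 if is_correct(answer, reference) else 0


# Toy stand-ins so the sketch runs without any model (assumptions, not the paper's setup).
def toy_proxy(prompt: str) -> str:
    return "positive" if "rises" in prompt else "unknown"


def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip().lower() == reference.strip().lower()
```

With these stand-ins, a perception block that states the relevant fact ("The line rises from 2 to 8.") earns a reward of 1, while a generic caption ("There is a line chart.") earns 0, which mirrors the intent of rewarding task-sufficient evidence over generic description.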

Discussion: Robustness to Oracle Fallibility\. We acknowledge that the text\-only reasoner is a high\-capability proxy rather than a perfect, infallible reasoner\. It can, therefore, introduce false\-negative noise \(i\.e\., failing to reason from a truly sufficient $a_p$\), leading to an incorrect penalty\. While our MoCA objective is designed to mitigate the impact of this noise during failed trajectories, we posit that such instances are infrequent\. We include a human\-evaluation study in the experiment section to quantify the agreement rate between our PV oracle and human judgment on "perceptual sufficiency," thereby validating the reliability of $R_P$ as a reward signal\.

### 2\.4 Structured Verbal Verification

Our "blindfolded reasoner" \(Section 2\.3\) provides the perception reward, $R_P$\. To solve the full credit assignment problem, we also need a reliable reward for the final outcome, $R_O(\tau)$\. This is a non\-trivial challenge for a generalist model handling free\-form answers\. Existing methods typically fall into two camps: Rigid Rule\-Based Verifiers \(e\.g\., regex, keyword matching\), which are precise but brittle, and LLM Prompting \(e\.g\., "Is this equivalent?"\), which is flexible but suffers from high variance\.

To bridge this gap, we propose Structured Verbal Verification \(SVV\)\. Rather than asking an LLM Judge to vaguely "judge" the final answer $Y$ or asking it to write Python code, we provide a structured natural language algorithm \(a verification protocol\) and instruct the judge to explicitly "execute" this protocol step\-by\-step \(as shown in Figure 4\)\. This protocol forces the verifier to decompose the evaluation into distinct stages: \(1\) identifying the answer type, \(2\) extracting content, \(3\) reconstructing the reference, and \(4\) performing type\-specific semantic comparison\. This shifts the verifier’s role from subjective estimation to structured execution, significantly reducing variance and ensuring that our reward signal remains reliable even for complex, free\-form responses\.

A critical aspect of SVV is its scalability; we do not hand\-craft unique protocols for every question\. Instead, we employ a universal verification protocol that covers the majority of verification cases \(Figure 4\)\. By forcing the LLM Judge to "execute" this universal program verbally, we obtain the robust, low\-variance outcome reward $R_O$ required for stable RL training\.
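The four stages can be illustrated with a deliberately simple, deterministic Python analog. This is not the SVV protocol itself, which is a natural-language prompt executed by an LLM judge (Figure 4); the type taxonomy, the `Answer:` marker, and all function names below are assumptions made for illustration, and the comparison is far cruder than the semantic comparison a judge would perform.

```python
import re


def identify_type(reference: str) -> str:
    """Stage 1: classify the answer type from the reference."""
    ref = reference.strip()
    if re.fullmatch(r"[A-Za-z]", ref):
        return "choice"
    if re.fullmatch(r"-?\d+(\.\d+)?", ref.replace(",", "")):
        return "number"
    return "text"


def extract_content(response: str) -> str:
    """Stage 2: take what follows the last 'Answer:' marker, else the whole response."""
    return response.rsplit("Answer:", 1)[-1].strip()


def normalize(value: str, kind: str) -> str:
    """Stage 3: bring a value into a canonical form for its type."""
    value = value.strip().rstrip(".")
    if kind == "number":
        return str(float(value.replace(",", "")))  # raises ValueError if not numeric
    if kind == "choice":
        return value.strip("()").upper()
    return " ".join(value.lower().split())


def verify(response: str, reference: str) -> int:
    """Stage 4: type-specific comparison, returning a binary outcome reward."""
    kind = identify_type(reference)
    candidate = extract_content(response)
    try:
        return int(normalize(candidate, kind) == normalize(reference, kind))
    except ValueError:
        return 0
```

The point of the sketch is the fixed order of stages: the verifier commits to an answer type before comparing, so "1,200." and "1200" are compared as numbers rather than as strings.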

![Refer to caption](https://arxiv.org/html/2605.14054v1/x4.png)

Figure 4: Instruction Prompt of Structured Verbal Verification\.

### 2\.5 MoCA: Resolving the "Bad Seeing vs\. Bad Thinking" Ambiguity

Given two core reward signals – $R_O$ for outcome correctness and $R_P$ for perceptual fidelity – we now integrate them into a coherent reinforcement learning objective\. To achieve perception\-reasoning synergy, we define the trajectory\-level return as the sum of the outcome and perception rewards:

$$R(\tau) = R_O(\tau) + \lambda R_P(\tau)$$

Following standard practice in Group Relative Policy Optimization \(GRPO\) \(Shao et al\., [2024](https://arxiv.org/html/2605.14054#bib.bib58)\), we estimate advantages by computing a group\-wise reward baseline\. For a given query, we sample a group of $k$ trajectories $\{\tau_1, \ldots, \tau_k\}$ from our policy $\pi_\theta$\. The advantage $A_{\tau,t}$ for any token $t$ in a trajectory $\tau$ is the trajectory’s return $R(\tau)$ measured against the group’s mean return:

$$A_{\tau,t} = R(\tau) - \frac{1}{k} \sum_{j=1}^{k} R(\tau_j)$$
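As a small worked example of the return and the mean-baselined advantage, the sketch below computes both for a group of four trajectories. The function name `group_advantages` and the value $\lambda = 0.5$ are assumptions chosen for illustration; the paper's setting for $\lambda$ is not stated in this section.

```python
from typing import List


def group_advantages(outcome: List[int], perception: List[int], lam: float) -> List[float]:
    """Per-trajectory advantage: R(tau) minus the group's mean return."""
    # R(tau) = R_O(tau) + lambda * R_P(tau)
    returns = [r_o + lam * r_p for r_o, r_p in zip(outcome, perception)]
    baseline = sum(returns) / len(returns)
    # Every token of a trajectory shares that trajectory's advantage.
    return [r - baseline for r in returns]


# Four sampled trajectories: two succeed; one fails with good perception; one fails on both.
adv = group_advantages(outcome=[1, 1, 0, 0], perception=[1, 1, 1, 0], lam=0.5)
print(adv)  # [0.625, 0.625, -0.375, -0.875]
```

Here the returns are 1.5, 1.5, 0.5, and 0 with a mean of 0.875, so both failed trajectories receive a negative advantage, and the one with verified perception is penalized less.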
The bottleneck lies in the failure cases where the outcome is incorrect \($R_O = 0$\)\. Standard GRPO applies the calculated advantage $A_{\tau,t}$ \(which is negative\) uniformly across the sequence, punishing all tokens regardless of whether the error stemmed from perception or reasoning\. This creates a "seesaw effect": the model risks "unlearning" correct perception behaviors simply because the subsequent reasoning failed\.

To resolve this, our Modality\-Aware Credit Assignment \(MoCA\) mechanism uses the binary perception reward $R_P \in \{0, 1\}$ as a gate to route credit assignment\. We define the modified advantage $A_{\tau,t}^{\text{MoCA}}$ specifically for failed trajectories \($R_O = 0$\) to distinguish between two scenarios:

- Case 1, "Bad Thinking"\. If the outcome is wrong \($R_O = 0$\) but perception was verified as correct \($R_P = 1$\), the error stems from reasoning\. We "protect" the innocent perception tokens \($\tau_P$\) by dampening the penalty with a positive protection term: $A_{\tau,t} + \alpha_{\text{protect}} \cdot |A_{\tau,t}|$\. This prevents the gradient update from degrading valid visual grounding capabilities\.
- Case 2, "Bad Seeing"\. If the outcome is wrong \($R_O = 0$\) and perception also failed \($R_P = 0$\), the error likely stems from the visual grounding\. We amplify the penalty for these perception tokens: $A_{\tau,t} - \alpha_{\text{punish}} \cdot |A_{\tau,t}|$\.

This mechanism precisely routes blame and protection to the correct components\. Furthermore, the "protect" mechanism \(Case 1\) is non\-trivial: it makes our framework robust to the potential false\-negative noise from our perception proxy \(discussed in Section 2\.3\), preventing the model from unlearning good perception due to imperfect verification\. By scaling up the data and verifier capabilities, this approach offers a generalizable solution to the credit assignment problem in vision\-language reasoning\.
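The two cases can be sketched as a per-token adjustment of the shared trajectory advantage. This is an illustrative reading of the rules above, not the authors' implementation: the function name, the boolean perception mask, and the coefficient values are assumptions, and reasoning tokens are assumed to keep the standard advantage.

```python
from typing import List


def moca_advantages(
    adv: float,
    r_o: int,
    r_p: int,
    is_perception: List[bool],
    alpha_protect: float,
    alpha_punish: float,
) -> List[float]:
    """Return one advantage per token, gated by R_P on failed trajectories."""
    out = []
    for is_p in is_perception:
        a = adv
        if r_o == 0 and is_p:
            if r_p == 1:
                # Case 1, "bad thinking": dampen the penalty on perception tokens.
                a = adv + alpha_protect * abs(adv)
            else:
                # Case 2, "bad seeing": amplify the penalty on perception tokens.
                a = adv - alpha_punish * abs(adv)
        out.append(a)
    return out


# Token mask for a toy trajectory: two perception tokens followed by one reasoning token.
mask = [True, True, False]
bad_thinking = moca_advantages(-0.5, r_o=0, r_p=1, is_perception=mask,
                               alpha_protect=0.5, alpha_punish=0.5)
bad_seeing = moca_advantages(-0.5, r_o=0, r_p=0, is_perception=mask,
                             alpha_protect=0.5, alpha_punish=0.5)
print(bad_thinking)  # [-0.25, -0.25, -0.5]
print(bad_seeing)    # [-0.75, -0.75, -0.5]
```

Successful trajectories ($R_O = 1$) pass through unchanged in this sketch, consistent with the modified advantage being defined only for failed ones.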

Table 1: Main Results\.

## 3 Experiments

To empirically validate our framework, we design experiments to answer three key research questions \(RQs\):

- \(RQ1\) Main Claim: Does MoCA achieve simultaneous performance gains across perception\- and reasoning\-intensive tasks? How does it compare against existing state\-of\-the\-art models?
- \(RQ2\) Component Value: Are the core components of our approach, namely Perception Verification \(PV\), Structured Verbal Verification \(SVV\), and modality\-aware credit assignment, all necessary for this success?
- \(RQ3\) Reward Reliability: Is our $R_P$ signal from PV a reliable proxy for "perceptual reliability," especially given potential oracle fallibility?

Training Datasets\. Our training corpus is curated to cover a diverse task spectrum, essential for training a generalizable model\. It includes: \(1\) Instruction & Reasoning Data: A combination of STEM\-focused reasoning tasks from ViRL39K \(Wang et al\., [2025b](https://arxiv.org/html/2605.14054#bib.bib44)\) and general\-purpose visual instructions from VisualWebInstruct\-Verified \(Jia et al\., [2025](https://arxiv.org/html/2605.14054#bib.bib46)\)\. \(2\) Perception\-Intensive Data: A collection of visually\-rich, fine\-grained queries sourced from the Pixel Reasoner dataset \(Wang et al\., [2025d](https://arxiv.org/html/2605.14054#bib.bib71)\)\. \(3\) Modality\-Rich Data: A novel, curated collection of vision\-language queries that involve reasoning over rich modalities, such as the interplay between text, figures, diagrams, layouts, and tables\. This dataset is gathered by crawling and processing documents from arXiv, newspapers, and infographics\. We provide a further breakdown in the appendix\.

Evaluation Benchmarks and Protocols\. To comprehensively measure performance, we categorize benchmarks into three distinct groups\.

- Perception\-Intensive Benchmarks: We use V\* \(Wu and Xie, [2023](https://arxiv.org/html/2605.14054#bib.bib96)\), HRBench \(Wang et al\., [2025f](https://arxiv.org/html/2605.14054#bib.bib97)\), and InfoVQA \(Mathew et al\., [2021](https://arxiv.org/html/2605.14054#bib.bib98)\) to evaluate fine\-grained visual perception and grounding\.
- Reasoning\-Intensive Benchmarks: We use MathVista \(Lu et al\., [2024](https://arxiv.org/html/2605.14054#bib.bib72)\), MMMU \(Yue et al\., [2024](https://arxiv.org/html/2605.14054#bib.bib73)\), and EMMA \(Hao et al\., [2025](https://arxiv.org/html/2605.14054#bib.bib95)\) to test complex, multi\-step reasoning grounded in visual information\.
- Rich\-Modality Benchmarks: We use DUDE \(Landeghem et al\., [2023](https://arxiv.org/html/2605.14054#bib.bib76)\), SlideVQA \(Tanaka et al\., [2023](https://arxiv.org/html/2605.14054#bib.bib77)\), and MMLongBench\-Doc \(Ma et al\., [2024](https://arxiv.org/html/2605.14054#bib.bib78)\) to assess performance on document\-related tasks requiring reasoning over rich, structured modalities, including text, figures, diagrams, and layouts\.

Baselines\. We compare MoCA against a comprehensive suite of models, including capability\-enhanced open\-source models and top\-tier commercial systems\. To save space, we refer interested readers to the appendix for performance comparison with more models\.

- General\-ability Models: GPT\-4o, GPT\-4o\-mini \(OpenAI et al\., [2024](https://arxiv.org/html/2605.14054#bib.bib81)\), Llava\-OV \(Li et al\., [2024](https://arxiv.org/html/2605.14054#bib.bib99)\), Qwen2\.5\-VL\-Instruct\-7B \(our base model\) \(Bai et al\., [2025](https://arxiv.org/html/2605.14054#bib.bib80)\)\.
- Capability\-Enhanced Models: VL\-Rethinker \(Wang et al\., [2025b](https://arxiv.org/html/2605.14054#bib.bib44)\), R1\-VL \(Zhang et al\., [2025a](https://arxiv.org/html/2605.14054#bib.bib100)\), Pixel Reasoner \(Wang et al\., [2025d](https://arxiv.org/html/2605.14054#bib.bib71)\), Docopilot \(Duan et al\., [2025](https://arxiv.org/html/2605.14054#bib.bib82)\), mPLUG\-Owl3 \(Hu et al\., [2024](https://arxiv.org/html/2605.14054#bib.bib34)\)\.

Implementation Details\. Unless otherwise specified, all open\-source methods are built upon the Qwen2\.5\-VL \(7B\-Instruct\) base model to ensure a fair comparison\. For our novel reward mechanisms, we use Qwen2\.5\-Instruct\-14B as the model for both the text\-only reasoner \(PV\) and the judge in Structured Verbal Verification \(SVV\)\. We will release all data, code, and models to facilitate further research\.

### 3\.1 Main Results \(RQ1\)

We present our main quantitative results in Table 1, providing strong empirical evidence for our central claim \(RQ1\): MoCA successfully resolves the perception\-reasoning tradeoff by achieving simultaneous, significant performance gains across all task categories, countering the "seesaw effect" that plagues other methods\. Compared to its Qwen2\.5\-VL\-Instruct \(7B\) base, MoCA demonstrates massive, broad\-spectrum improvements\. This clearly validates that our framework, by explicitly supervising perception and reasoning, successfully optimizes for their synergy without degrading one capability to enhance another\.

Furthermore, MoCA establishes itself as a top\-performing 7B model against specialized and proprietary systems\. It proves highly competitive with perception\-centric models like DeepEyes while setting a new standard in rich\-modality tasks \(58\.3 on DUDE\)\. Most notably, MoCA\-7B can even surpass proprietary models, e\.g\., it outperforms GPT\-4o on perception\- and reasoning\-intensive tasks\. This balanced and elevated performance profile validates MoCA as a robust model that scales across the full spectrum of vision\-language tasks\.

### 3\.2 Ablation Studies: Isolating Component Contributions \(RQ2\)

To validate each component of our framework, we conduct a comprehensive ablation study, with results presented in Table [2](https://arxiv.org/html/2605.14054#S3.T2)\. We compare the full MoCA against variants to isolate the contributions of Perception Verification \(PV\), Structured Verbal Verification \(SVV\), and our Modality\-aware Credit Assignment \(MoCA\) logic\.

Table 2: Ablation Studies\. We analyze the contribution of each component of MoCA \(Full\)\. All variants are trained with the same data and base model\. $\Delta$ indicates the performance drop from the full model\.

#### Total Benefit of RL Optimization\.

First, we compare *MoCA \(Full\)* against the *Instruction\-Only \(No RL\)* baseline\. This baseline uses our decompositional prompt but undergoes no RL optimization\. The results show a massive performance gap across all categories\. Merely prompting the model for decomposition causes a performance drop compared to the base model, because the model is not familiar with the decomposed perception\-reasoning format\. This result also confirms that the proposed overall framework leads to significant improvement across the board\.

#### Benefits of Perception Verification \(PV\)\.

The w/o PV \($R_O$ only\) variant removes the perception reward $R_P$ and relies only on the final outcome reward $R_O$\. This is equivalent to standard GRPO with the VP outcome reward\. Performance drops significantly on perception\-intensive tasks \(\-6\.9 points on V\*, \-4\.7 on HRBench, and \-4\.1 on InfoVQA\), indicating the effectiveness of properly rewarding perception during vision\-language reasoning\. This confirms our core hypothesis: the lack of targeted perception supervision is a crucial bottleneck in vision\-language reasoning\. Without $R_P$, the model cannot distinguish "bad seeing" from "bad thinking" and fails to optimize its perceptual capabilities\. Meanwhile, reasoning\-centric tasks are largely unaffected, because these tasks usually do not require complex perception capabilities\.
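To make the limitation of the outcome\-only variant concrete, the following is a minimal sketch of group\-relative advantage normalization in the style of GRPO\. The function name, the use of the population standard deviation, and the example rewards are illustrative assumptions rather than the training code used in this work\.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's scalar reward against its own sampling group.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of G = 8 rollouts for the same question, scored by outcome only.
outcome_rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(outcome_rewards)

# Every token of a rollout inherits the same scalar advantage, so a failed
# rollout penalizes its perception tokens even when the perception was correct.
rollout_lengths = [5, 4, 6, 5, 4, 4, 6, 5]  # hypothetical token counts
token_advantages = [[a] * n for a, n in zip(advantages, rollout_lengths)]
```

Because the signal is a single scalar per rollout, nothing in this scheme can tell which modality caused a failure, which is the ambiguity the perception reward is meant to remove\.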

#### Benefits of Modality\-aware Credit Assignment \(MoCA\)\.

The baseline w/o MoCA \($R_O + \lambda R_P$\) uses both reward signals but naively combines them, which we argued creates ambiguity in failure cases\. The results strongly support our claim\. This simple reward\-shaping variant performs worse than our full model, lagging by nearly 3 points on perception\-intensive tasks and 2 points on rich\-modality tasks\. This demonstrates that our MoCA logic is non\-trivial and essential\. By protecting "good" perception tokens and punishing "bad" ones during failed trajectories, MoCA correctly assigns blame and prevents the model from "unlearning" good perception habits, which the simple shaping approach fails to do\.
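The contrast between naive shaping and modality\-aware routing can be illustrated with a small sketch\. The routing rule below is a hypothetical interpretation written for illustration only: the function names, the token labels, and the way `alpha_protect` and `alpha_punish` enter the computation are assumptions and should not be read as the exact MoCA objective\.

```python
def naive_shaped_reward(r_outcome, r_perception, lam=0.3):
    """Scalar shaping R_O + lambda * R_P, shared by every token in the rollout."""
    return r_outcome + lam * r_perception


def route_credit(token_kinds, base_advantage, outcome_ok, perception_ok,
                 alpha_protect=0.2, alpha_punish=0.2):
    """Hypothetical per-token routing for one rollout.

    On a failed rollout, verified perception tokens keep only a fraction of
    the blame, while unverified perception tokens receive extra blame.
    Reasoning tokens always keep the rollout-level advantage.
    """
    routed = []
    for kind in token_kinds:
        adv = base_advantage
        if not outcome_ok and kind == "perception":
            if perception_ok:
                adv = alpha_protect * base_advantage  # protect good seeing
            else:
                adv = base_advantage - alpha_punish   # punish bad seeing
        routed.append(adv)
    return routed


kinds = ["perception", "perception", "reasoning", "reasoning"]

# Failed answer, but perception passed verification ("bad thinking").
bad_thinking = route_credit(kinds, -1.0, outcome_ok=False, perception_ok=True)

# Failed answer and perception failed verification ("bad seeing").
bad_seeing = route_credit(kinds, -1.0, outcome_ok=False, perception_ok=False)
```

Under naive shaping, the same two cases differ only by a constant added to every token, so good perception tokens in a failed rollout are still pushed down together with the faulty reasoning\.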

#### Benefits of Structured Verbal Verification\.

Finally, the w/o VP\+PV \(LLM Judge\) variant uses the outcome reward from an LLM judge and applies standard GRPO\. The results show a consistent performance drop from the w/o PV baseline\. The damage caused by a high\-variance reward is indeed surprising\. We find that the model tends toward reward hacking by exploiting the LLM judge, which makes RL training less stable and steers the model toward behaviors irrelevant to the task\.

### 3\.3 Reward Reliability \(RQ3\)

Perception Verification\. To validate our Perception Verification \(PV\) proxy, we test its reliability against human judgment\. We randomly sampled $N=979$ data points, each containing a \(question, VLM\-generated perception text, golden answer\) triplet\. Three human annotators were asked to judge if the perception text alone was "Sufficient" or "Insufficient" to logically deduce the golden answer\. We refer interested readers to the supplementary materials for a full breakdown of the experimental setup, human annotation process, and detailed results\.

We then compared the PV oracle’s verdict \(using Qwen2\.5\-Instruct\-14B\) against the human majority vote\. As shown in Table [3](https://arxiv.org/html/2605.14054#S3.T3), the oracle achieved reasonable fidelity with human judgment \(86\.31% accuracy, Cohen’s $\kappa=0.707$, indicating "Substantial" agreement\)\. Crucially, the analysis of disagreements reveals that the oracle’s primary failure mode is conservative False Negatives \(9\.19%\), where it fails to reason from text that humans found sufficient, rather than False Positives \(4\.49%\)\. This result validates our oracle as a reliable proxy and confirms the necessity of our Decomposed Credit Assignment \(DCA\) objective \(Section 2\.5\), which is specifically designed to handle and mitigate this exact type of "fallible oracle" noise\.
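For readers who wish to reproduce this kind of agreement analysis, the sketch below computes accuracy and Cohen’s $\kappa$ from a 2$\times$2 confusion matrix\. The function name and the counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the counts behind Table 3\.

```python
def cohens_kappa(tp, fn, fp, tn):
    """Cohen's kappa for two binary raters (here: human majority vs. oracle).

    tp: both say "Sufficient"        fn: human "Sufficient", oracle "Insufficient"
    fp: human "Insufficient", oracle "Sufficient"        tn: both say "Insufficient"
    """
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    human_pos, human_neg = tp + fn, fp + tn
    oracle_pos, oracle_neg = tp + fp, fn + tn
    # Agreement expected by chance from the two raters' marginal rates.
    expected = (human_pos * oracle_pos + human_neg * oracle_neg) / (n * n)
    return (observed - expected) / (1 - expected)


# Illustrative counts only (N = 100).
tp, fn, fp, tn = 40, 10, 5, 45
accuracy = (tp + tn) / (tp + fn + fp + tn)  # 0.85
kappa = cohens_kappa(tp, fn, fp, tn)        # approximately 0.70
```

Reporting $\kappa$ alongside accuracy matters because raw accuracy does not correct for the agreement two raters would reach by chance when the classes are imbalanced\.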

Table 3: PV Oracle vs\. Human Majority Confusion Matrix \($N=979$\)\.

Outcome Verification\. We validate our Structured Verbal Verification method against two common baselines: a Rigid Rule verifier \(regex, exact match\) and a standard LLM Prompting verifier \(e\.g\., "Are these answers semantically equivalent?"\)\. We created a "VP\-Challenge\-Set" \($N=273$\) of \(model answer, gold answer\) pairs, specifically including difficult semantic rephrasings and answers with subtle errors\. We measure both accuracy against human\-annotated labels and Consistency \(the percentage of identical verdicts over 5 runs at $T=0.7$\)\. Further details on the challenge set construction and the universal verbal program are available in the supplementary materials\.
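One way to operationalize the Consistency metric is sketched below\. It assumes that an item counts as consistent only when all of its repeated verdicts are identical; this reading of the definition and the helper name are assumptions made for illustration\.

```python
def consistency(verdicts_per_item):
    """Fraction of items whose repeated verdicts are all identical."""
    stable = sum(1 for runs in verdicts_per_item if len(set(runs)) == 1)
    return stable / len(verdicts_per_item)


# Hypothetical verdicts for 4 answer pairs, each judged 5 times.
runs = [
    ["match"] * 5,
    ["no match"] * 5,
    ["match", "match", "no match", "match", "match"],  # the verifier flipped once
    ["no match"] * 5,
]
score = consistency(runs)  # 0.75
```

A verifier can be accurate on average and still score poorly here, which is why the two quantities are reported separately\.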

As shown in Table [4](https://arxiv.org/html/2605.14054#S3.T4), the baselines fail in predictable ways\. Rigid Rule has high precision but fails on all semantic rephrasings, resulting in poor recall\. LLM Prompting is flexible but high\-variance and unreliable, achieving only 78\.6% consistency\. Our VP method achieves the highest accuracy \(91\.9%\) and F1\-score \(92\.7%\) by transforming the vague "judgment" into a structured "execution\." This makes it a robust, low\-variance reward signal, achieving 92\.3% consistency, which is critical for stable RL training\.
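The failure mode of the rigid baseline is easy to reproduce\. The sketch below is a hypothetical regex\-plus\-exact\-match verifier written for illustration, not the baseline implementation evaluated above; it accepts a literal match and rejects an equivalent rephrasing of the same number\.

```python
import re


def rigid_rule_verifier(prediction, gold):
    """Exact string match on the boxed answer (falls back to the whole response)."""
    match = re.search(r"\\boxed\{([^{}]*)\}", prediction)
    answer = match.group(1) if match else prediction
    return answer.strip().lower() == gold.strip().lower()


gold = "9,500,000"
literal = rigid_rule_verifier(r"The answer is \boxed{9,500,000}.", gold)  # True
rephrased = rigid_rule_verifier(r"The answer is \boxed{9.5 million}.", gold)  # False
```

Such a rule never accepts a wrong string, which explains its high precision, but it rejects every correct answer that is phrased differently from the gold label, which explains its poor recall\.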

Table 4: Verifier performance on the VP\-Challenge\-Set \($N=273$\)\. VP provides the best balance of accuracy and reliability\.

## 4 Related Work

Vision\-Language Reasoning\. Recent advancements in Large Vision\-Language Models \(LVLMs\) have significantly improved reasoning capabilities by aligning visual encoders with LLMs \(Liu et al\., [2023](https://arxiv.org/html/2605.14054#bib.bib48); Wang et al\., [2024](https://arxiv.org/html/2605.14054#bib.bib47); Bai et al\., [2025](https://arxiv.org/html/2605.14054#bib.bib80); Dai et al\., [2023](https://arxiv.org/html/2605.14054#bib.bib61); Wei et al\., [2022](https://arxiv.org/html/2605.14054#bib.bib63)\)\. While Chain\-of\-Thought \(CoT\) prompting has unlocked complex reasoning in text, its application in vision remains challenging due to perceptual hallucinations \(Wei et al\., [2023](https://arxiv.org/html/2605.14054#bib.bib53); Zhu et al\., [2025](https://arxiv.org/html/2605.14054#bib.bib24)\)\. Recent works attempt to mitigate this via supervised fine\-tuning, reinforcement learning, or distillation \(Jia et al\., [2025](https://arxiv.org/html/2605.14054#bib.bib46); Wei et al\., [2022](https://arxiv.org/html/2605.14054#bib.bib63); Wang et al\., [2025b](https://arxiv.org/html/2605.14054#bib.bib44); Ma et al\., [2025](https://arxiv.org/html/2605.14054#bib.bib25); Ding et al\., [2025](https://arxiv.org/html/2605.14054#bib.bib26)\), or align modalities via perplexity or logit differences \(Zhou et al\., [2024](https://arxiv.org/html/2605.14054#bib.bib15); Leng et al\., [2024](https://arxiv.org/html/2605.14054#bib.bib28); Zhu et al\., [2024](https://arxiv.org/html/2605.14054#bib.bib27)\)\. However, these methods often treat perception and reasoning as an entangled latent process, making it difficult to diagnose whether errors stem from perception or reasoning\.

Synergy via Agentic Function\-Calling\. To bridge the perception\-reasoning gap, several approaches adopt agentic frameworks that utilize external tools or multi\-turn verification loops \(Wang et al\., [2025d](https://arxiv.org/html/2605.14054#bib.bib71); Zhang et al\., [2025b](https://arxiv.org/html/2605.14054#bib.bib57); Ma et al\., [2026](https://arxiv.org/html/2605.14054#bib.bib21)\)\. While effective, these systems incur significant latency and engineering complexity\. VL\-Scaler proposes an orthogonal direction: internalizing this synergy\. By explicitly interleaving perception and reasoning tokens within a single autoregressive generation, we retain the efficiency of end\-to\-end models while capturing the interpretability and robustness of agentic workflows\.

Reward Engineering and Credit Assignment\. Reinforcement learning \(RL\) has become standard for aligning VLMs \(Bai et al\., [2025](https://arxiv.org/html/2605.14054#bib.bib80); Team et al\., [2025](https://arxiv.org/html/2605.14054#bib.bib84); Wang et al\., [2025d](https://arxiv.org/html/2605.14054#bib.bib71)\)\. However, standard outcome\-based supervision \(e\.g\., RLHF, DPO \(Ouyang et al\., [2022](https://arxiv.org/html/2605.14054#bib.bib66); Rafailov et al\., [2023](https://arxiv.org/html/2605.14054#bib.bib69)\)\) suffers from ambiguous credit assignment in multimodal contexts \(Wang et al\., [2025e](https://arxiv.org/html/2605.14054#bib.bib43)\)\. Unlike methods that rely on high\-variance LLM judges or rigid regex \(Shao et al\., [2024](https://arxiv.org/html/2605.14054#bib.bib58)\), our Structured Verbal Verification and Perception Verification introduce deterministic, modality\-aware reward signals that explicitly decouple and supervise the visual and logical reasoning components\.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning\. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here\.

## References

- A\. A\. Alemi, I\. Fischer, J\. V\. Dillon, and K\. Murphy \(2016\)Deep variational information bottleneck\.arXiv preprint arXiv:1612\.00410\.Cited by:[§2\.3](https://arxiv.org/html/2605.14054#S2.SS3.p4.5)\.
- S\. Bai, K\. Chen, X\. Liu, J\. Wang, W\. Ge, S\. Song, K\. Dang, P\. Wang, S\. Wang, J\. Tang, H\. Zhong, Y\. Zhu, M\. Yang, Z\. Li, J\. Wan, P\. Wang, W\. Ding, Z\. Fu, Y\. Xu, J\. Ye, X\. Zhang, T\. Xie, Z\. Cheng, H\. Zhang, Z\. Yang, H\. Xu, and J\. Lin \(2025\)Qwen2\.5\-vl technical report\.External Links:2502\.13923,[Link](https://arxiv.org/abs/2502.13923)Cited by:[§1](https://arxiv.org/html/2605.14054#S1.p2.1),[1st item](https://arxiv.org/html/2605.14054#S3.I3.i1.p1.1),[§4](https://arxiv.org/html/2605.14054#S4.p1.1),[§4](https://arxiv.org/html/2605.14054#S4.p3.1)\.
- W\. Dai, J\. Li, D\. Li, A\. Tiong, J\. Zhao, W\. Wang, B\. Li, P\. Fung, and S\. Hoi \(2023\)InstructBLIP: towards general\-purpose vision\-language models with instruction tuning\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=vvoWPYqZJA)Cited by:[§4](https://arxiv.org/html/2605.14054#S4.p1.1)\.
- T\. Ding, L\. He, W\. Ma, and X\. Zhou \(2025\)MuSeLLM: sdf generation and understanding via multi\-scale tokenization with position\-aware guidance\.InProceedings of the 2025 International Conference on Multimedia Retrieval,ICMR ’25,New York, NY, USA,pp\. 192–201\.External Links:ISBN 9798400718779,[Link](https://doi.org/10.1145/3731715.3733405),[Document](https://dx.doi.org/10.1145/3731715.3733405)Cited by:[§4](https://arxiv.org/html/2605.14054#S4.p1.1)\.
- Y\. Duan, Z\. Chen, Y\. Hu, W\. Wang, S\. Ye, B\. Shi, L\. Lu, Q\. Hou, T\. Lu, H\. Li,et al\.\(2025\)Docopilot: improving multimodal models for document\-level understanding\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 4026–4037\.Cited by:[2nd item](https://arxiv.org/html/2605.14054#S3.I3.i2.p1.1)\.
- Y\. Hao, J\. Gu, H\. W\. Wang, L\. Li, Z\. Yang, L\. Wang, and Y\. Cheng \(2025\)Can mllms reason in multimodality? emma: an enhanced multimodal reasoning benchmark\.arXiv preprint arXiv:2501\.05444\.Cited by:[2nd item](https://arxiv.org/html/2605.14054#S3.I2.i2.p1.1)\.
- A\. Hu, H\. Xu, J\. Ye, M\. Yan, L\. Zhang, B\. Zhang, C\. Li, J\. Zhang, Q\. Jin, F\. Huang, and J\. Zhou \(2024\)MPLUG\-docowl 1\.5: unified structure learning for ocr\-free document understanding\.External Links:2403\.12895,[Link](https://arxiv.org/abs/2403.12895)Cited by:[2nd item](https://arxiv.org/html/2605.14054#S3.I3.i2.p1.1)\.
- Y\. Jia, Y\. Du, K\. Jiang, Y\. Liang, Q\. Ren, Y\. Xin, R\. Yang, F\. Feng, M\. Chen, H\. Lu,et al\.\(2026\)Benchmarking multimodal knowledge conflict for large multimodal models\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 22283–22291\.Cited by:[Appendix A](https://arxiv.org/html/2605.14054#A1.p1.1)\.
- Y\. Jia, J\. Li, X\. Yue, B\. Li, P\. Nie, K\. Zou, and W\. Chen \(2025\)Visualwebinstruct: scaling up multimodal instruction data through web search\.arXiv preprint arXiv:2503\.10582\.Cited by:[Appendix D](https://arxiv.org/html/2605.14054#A4.p1.1),[§3](https://arxiv.org/html/2605.14054#S3.p2.1),[§4](https://arxiv.org/html/2605.14054#S4.p1.1)\.
- D\. Jiang, Y\. Lu, Z\. Li, Z\. Lyu, P\. Nie, H\. Wang, A\. Su, H\. Chen, K\. Zou, C\. Du,et al\.\(2025\)Verltool: towards holistic agentic reinforcement learning with tool use\.arXiv preprint arXiv:2509\.01055\.Cited by:[§1](https://arxiv.org/html/2605.14054#S1.p3.1)\.
- J\. V\. Landeghem, R\. Tito, Ł\. Borchmann, M\. Pietruszka, P\. Józiak, R\. Powalski, D\. Jurkiewicz, M\. Coustaty, B\. Ackaert, E\. Valveny, M\. Blaschko, S\. Moens, and T\. Stanisławek \(2023\)Document understanding dataset and evaluation \(dude\)\.External Links:2305\.08455,[Link](https://arxiv.org/abs/2305.08455)Cited by:[3rd item](https://arxiv.org/html/2605.14054#S3.I2.i3.p1.1)\.
- S\. Leng, H\. Zhang, G\. Chen, X\. Li, S\. Lu, C\. Miao, and L\. Bing \(2024\)Mitigating object hallucinations in large vision\-language models through visual contrastive decoding\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 13872–13882\.Cited by:[§4](https://arxiv.org/html/2605.14054#S4.p1.1)\.
- B\. Li, Y\. Zhang, D\. Guo, R\. Zhang, F\. Li, H\. Zhang, K\. Zhang, P\. Zhang, Y\. Li, Z\. Liu, and C\. Li \(2024\)LLaVA\-onevision: easy visual task transfer\.External Links:2408\.03326,[Link](https://arxiv.org/abs/2408.03326)Cited by:[1st item](https://arxiv.org/html/2605.14054#S3.I3.i1.p1.1)\.
- H\. Liu, C\. Li, Q\. Wu, and Y\. J\. Lee \(2023\)Visual instruction tuning\.Advances in neural information processing systems36,pp\. 34892–34916\.Cited by:[§1](https://arxiv.org/html/2605.14054#S1.p2.1),[§4](https://arxiv.org/html/2605.14054#S4.p1.1)\.
- P\. Lu, H\. Bansal, T\. Xia, J\. Liu, C\. Li, H\. Hajishirzi, H\. Cheng, K\. Chang, M\. Galley, and J\. Gao \(2024\)MathVista: evaluating mathematical reasoning of foundation models in visual contexts\.External Links:2310\.02255,[Link](https://arxiv.org/abs/2310.02255)Cited by:[2nd item](https://arxiv.org/html/2605.14054#S3.I2.i2.p1.1)\.
- W\. Ma, R\. Chen, K\. Zhang, S\. Wu, and S\. Ding \(2025\)Instruct where the model fails: generative data augmentation via guided self\-contrastive fine\-tuning\.Proceedings of the AAAI Conference on Artificial Intelligence39\(6\),pp\. 5991–5999\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/32640),[Document](https://dx.doi.org/10.1609/aaai.v39i6.32640)Cited by:[§4](https://arxiv.org/html/2605.14054#S4.p1.1)\.
- W\. Ma, S\. Sun, T\. Yu, R\. Wang, T\. Chua, and J\. Bian \(2026\)Thinking with blueprints: assisting vision\-language models in spatial reasoning via structured object representation\.External Links:2601\.01984,[Link](https://arxiv.org/abs/2601.01984)Cited by:[§4](https://arxiv.org/html/2605.14054#S4.p2.1)\.
- Y\. Ma, Y\. Zang, L\. Chen, M\. Chen, Y\. Jiao, X\. Li, X\. Lu, Z\. Liu, Y\. Ma, X\. Dong, P\. Zhang, L\. Pan, Y\. Jiang, J\. Wang, Y\. Cao, and A\. Sun \(2024\)MMLongBench\-doc: benchmarking long\-context document understanding with visualizations\.External Links:2407\.01523,[Link](https://arxiv.org/abs/2407.01523)Cited by:[3rd item](https://arxiv.org/html/2605.14054#S3.I2.i3.p1.1)\.
- M\. Mathew, V\. Bagal, R\. P\. Tito, D\. Karatzas, E\. Valveny, and C\. V\. Jawahar \(2021\)InfographicVQA\.External Links:2104\.12756,[Link](https://arxiv.org/abs/2104.12756)Cited by:[1st item](https://arxiv.org/html/2605.14054#S3.I2.i1.p1.1)\.
- OpenAI, A\. Hurst, A\. Lerer, A\. P\. Goucher, A\. Perelman, A\. Ramesh, A\. Clark, A\. Ostrow, A\. Welihinda, A\. Hayes,et al\.\(2024\)GPT\-4o system card\.External Links:2410\.21276,[Link](https://arxiv.org/abs/2410.21276)Cited by:[1st item](https://arxiv.org/html/2605.14054#S3.I3.i1.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.Advances in neural information processing systems35,pp\. 27730–27744\.Cited by:[§4](https://arxiv.org/html/2605.14054#S4.p3.1)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\)Direct preference optimization: your language model is secretly a reward model\.Advances in neural information processing systems36,pp\. 53728–53741\.Cited by:[§4](https://arxiv.org/html/2605.14054#S4.p3.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.External Links:2402\.03300,[Link](https://arxiv.org/abs/2402.03300)Cited by:[§2\.5](https://arxiv.org/html/2605.14054#S2.SS5.p2.7),[§4](https://arxiv.org/html/2605.14054#S4.p3.1)\.
- E\. J\. Sondik \(1971\)The optimal control of partially observable markov processes\.Stanford University\.Cited by:[§2\.1](https://arxiv.org/html/2605.14054#S2.SS1.p2.2)\.
- R\. Tanaka, K\. Nishida, K\. Nishida, T\. Hasegawa, I\. Saito, and K\. Saito \(2023\)SlideVQA: a dataset for document visual question answering on multiple images\.External Links:2301\.04883,[Link](https://arxiv.org/abs/2301.04883)Cited by:[3rd item](https://arxiv.org/html/2605.14054#S3.I2.i3.p1.1)\.
- C\. Team, Z\. Yue, Z\. Lin, Y\. Song, W\. Wang, S\. Ren, S\. Gu, S\. Li, P\. Li, L\. Zhao, L\. Li, K\. Bao, H\. Tian, H\. Zhang, G\. Wang, D\. Zhu, Cici, C\. He, B\. Ye, B\. Shen, Z\. Zhang, Z\. Jiang, Z\. Zheng, Z\. Song, Z\. Luo, Y\. Yu, Y\. Wang, Y\. Tian, Y\. Tu, Y\. Yan, Y\. Huang, X\. Wang, X\. Xu, X\. Song, X\. Zhang, X\. Yong, X\. Zhang, X\. Deng, W\. Yang, W\. Ma, W\. Lv, W\. Zhuang, W\. Liu, S\. Deng, S\. Liu, S\. Chen, S\. Yu, S\. Liu, S\. Wang, R\. Ma, Q\. Wang, P\. Wang, N\. Chen, M\. Zhu, K\. Zhou, K\. Zhou, K\. Fang, J\. Shi, J\. Dong, J\. Xiao, J\. Xu, H\. Liu, H\. Xu, H\. Qu, H\. Zhao, H\. Lv, G\. Wang, D\. Zhang, D\. Zhang, D\. Zhang, C\. Ma, C\. Liu, C\. Cai, and B\. Xia \(2025\)MiMo\-vl technical report\.External Links:2506\.03569,[Link](https://arxiv.org/abs/2506.03569)Cited by:[§4](https://arxiv.org/html/2605.14054#S4.p3.1)\.
- N\. Tishby, F\. C\. Pereira, and W\. Bialek \(2000\)The information bottleneck method\.arXiv preprint physics/0004057\.Cited by:[§2\.3](https://arxiv.org/html/2605.14054#S2.SS3.p4.5)\.
- H\. Wang, L\. Li, C\. Qu, F\. Zhu, W\. Xu, W\. Chu, and F\. Lin \(2025a\)To code or not to code? adaptive tool integration for math language models via expectation\-maximization\.arXiv preprint arXiv:2502\.00691\.Cited by:[Appendix A](https://arxiv.org/html/2605.14054#A1.p1.1)\.
- H\. Wang, C\. Qu, Z\. Huang, W\. Chu, F\. Lin, and W\. Chen \(2025b\)Vl\-rethinker: incentivizing self\-reflection of vision\-language models with reinforcement learning\.arXiv preprint arXiv:2504\.08837\.Cited by:[Appendix D](https://arxiv.org/html/2605.14054#A4.p1.1),[2nd item](https://arxiv.org/html/2605.14054#S3.I3.i2.p1.1),[§3](https://arxiv.org/html/2605.14054#S3.p2.1),[§4](https://arxiv.org/html/2605.14054#S4.p1.1)\.
- H\. Wang, H\. Que, Q\. Xu, M\. Liu, W\. Zhou, J\. Feng, W\. Zhong, W\. Ye, T\. Yang, W\. Huang,et al\.\(2025c\)Reverse\-engineered reasoning for open\-ended generation\.arXiv preprint arXiv:2509\.06160\.Cited by:[Appendix A](https://arxiv.org/html/2605.14054#A1.p1.1)\.
- H\. Wang, A\. Su, W\. Ren, F\. Lin, and W\. Chen \(2025d\)Pixel reasoner: incentivizing pixel\-space reasoning with curiosity\-driven reinforcement learning\.External Links:2505\.15966,[Link](https://arxiv.org/abs/2505.15966)Cited by:[Appendix D](https://arxiv.org/html/2605.14054#A4.p1.1),[§1](https://arxiv.org/html/2605.14054#S1.p2.1),[§1](https://arxiv.org/html/2605.14054#S1.p3.1),[2nd item](https://arxiv.org/html/2605.14054#S3.I3.i2.p1.1),[§3](https://arxiv.org/html/2605.14054#S3.p2.1),[§4](https://arxiv.org/html/2605.14054#S4.p2.1),[§4](https://arxiv.org/html/2605.14054#S4.p3.1)\.
- H\. Wang, C\. Wei, W\. Ren, J\. Liu, F\. Lin, and W\. Chen \(2026\)RationalRewards: reasoning rewards scale visual generation both training and test time\.arXiv preprint arXiv:2604\.11626\.Cited by:[Appendix A](https://arxiv.org/html/2605.14054#A1.p1.1)\.
- H\. Wang, Q\. Xu, C\. Liu, J\. Wu, F\. Lin, and W\. Chen \(2025e\)Emergent hierarchical reasoning in llms through reinforcement learning\.External Links:2509\.03646,[Link](https://arxiv.org/abs/2509.03646)Cited by:[§4](https://arxiv.org/html/2605.14054#S4.p3.1)\.
- P\. Wang, S\. Bai, S\. Tan, S\. Wang, Z\. Fan, J\. Bai, K\. Chen, X\. Liu, J\. Wang, W\. Ge,et al\.\(2024\)Qwen2\-vl: enhancing vision\-language model’s perception of the world at any resolution\.arXiv preprint arXiv:2409\.12191\.Cited by:[§1](https://arxiv.org/html/2605.14054#S1.p2.1),[§4](https://arxiv.org/html/2605.14054#S4.p1.1)\.
- W\. Wang, L\. Ding, M\. Zeng, X\. Zhou, L\. Shen, Y\. Luo, W\. Yu, and D\. Tao \(2025f\)Divide, conquer and combine: a training\-free framework for high\-resolution image perception in multimodal large language models\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 7907–7915\.Cited by:[1st item](https://arxiv.org/html/2605.14054#S3.I2.i1.p1.1)\.
- J\. Wei, M\. Bosma, V\. Zhao, K\. Guu, A\. W\. Yu, B\. Lester, N\. Du, A\. M\. Dai, and Q\. V\. Le \(2022\)Finetuned language models are zero\-shot learners\.InInternational Conference on Learning Representations,Cited by:[§4](https://arxiv.org/html/2605.14054#S4.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou \(2023\)Chain\-of\-thought prompting elicits reasoning in large language models\.External Links:2201\.11903,[Link](https://arxiv.org/abs/2201.11903)Cited by:[§4](https://arxiv.org/html/2605.14054#S4.p1.1)\.
- P\. Wu and S\. Xie \(2023\)V\*: guided visual search as a core mechanism in multimodal llms\.External Links:2312\.14135,[Link](https://arxiv.org/abs/2312.14135)Cited by:[1st item](https://arxiv.org/html/2605.14054#S3.I2.i1.p1.1)\.
- G\. Xu, P\. Jin, Z\. Wu, H\. Li, Y\. Song, L\. Sun, and L\. Yuan \(2025a\)LLaVA\-cot: let vision language models reason step\-by\-step\.External Links:2411\.10440,[Link](https://arxiv.org/abs/2411.10440)Cited by:[§1](https://arxiv.org/html/2605.14054#S1.p2.1)\.
- Q\. Xu, H\. Wang, C\. Liu, F\. Lin, and W\. Chen \(2025b\)CogDoc: towards unified thinking in documents\.External Links:2512\.12658,[Link](https://arxiv.org/abs/2512.12658)Cited by:[Appendix A](https://arxiv.org/html/2605.14054#A1.p1.1)\.
- X\. Yue, Y\. Ni, K\. Zhang, T\. Zheng, R\. Liu, G\. Zhang, S\. Stevens, D\. Jiang, W\. Ren, Y\. Sun, C\. Wei, B\. Yu, R\. Yuan, R\. Sun, M\. Yin, B\. Zheng, Z\. Yang, Y\. Liu, W\. Huang, H\. Sun, Y\. Su, and W\. Chen \(2024\)MMMU: a massive multi\-discipline multimodal understanding and reasoning benchmark for expert agi\.InProceedings of CVPR,Cited by:[2nd item](https://arxiv.org/html/2605.14054#S3.I2.i2.p1.1)\.
- J\. Zhang, J\. Huang, H\. Yao, S\. Liu, X\. Zhang, S\. Lu, and D\. Tao \(2025a\)R1\-vl: learning to reason with multimodal large language models via step\-wise group relative policy optimization\.External Links:2503\.12937,[Link](https://arxiv.org/abs/2503.12937)Cited by:[2nd item](https://arxiv.org/html/2605.14054#S3.I3.i2.p1.1)\.
- X\. Zhang, Z\. Gao, B\. Zhang, P\. Li, X\. Zhang, Y\. Liu, T\. Yuan, Y\. Wu, Y\. Jia, S\. Zhu, and Q\. Li \(2025b\)Chain\-of\-focus: adaptive visual search and zooming for multimodal reasoning via rl\.External Links:2505\.15436,[Link](https://arxiv.org/abs/2505.15436)Cited by:[§4](https://arxiv.org/html/2605.14054#S4.p2.1)\.
- Y\. Zhou, C\. Cui, R\. Rafailov, C\. Finn, and H\. Yao \(2024\)Aligning modalities in vision large language models via preference fine\-tuning\.arXiv preprint arXiv:2402\.11411\.Cited by:[§4](https://arxiv.org/html/2605.14054#S4.p1.1)\.
- K\. Zhu, L\. Zhao, Z\. Ge, and X\. Zhang \(2024\)Self\-supervised visual preference alignment\.InProceedings of the 32nd ACM International Conference on Multimedia,pp\. 291–300\.Cited by:[§4](https://arxiv.org/html/2605.14054#S4.p1.1)\.
- X\. Zhu, C\. Wang, H\. Wang, X\. Liu, and F\. Lin \(2025\)OOTSM: a decoupled linguistic framework for effective scene graph anticipation\.arXiv preprint arXiv:2509\.05661\.Cited by:[§4](https://arxiv.org/html/2605.14054#S4.p1.1)\.

## Appendix A Appendix

\(Wang et al\., [2025c](https://arxiv.org/html/2605.14054#bib.bib19), [2026](https://arxiv.org/html/2605.14054#bib.bib17), [a](https://arxiv.org/html/2605.14054#bib.bib16); Xu et al\., [2025b](https://arxiv.org/html/2605.14054#bib.bib101); Jia et al\., [2026](https://arxiv.org/html/2605.14054#bib.bib31)\)

## Appendix B Training Setup and Implementation Details

We utilize Qwen2\.5\-VL\-Instruct\-7B as the base policy model $\pi_{\theta}$, while employing the text\-only Qwen2\.5\-Instruct\-14B for both perception verification and outcome verification\. Training is conducted using Group Relative Policy Optimization \(GRPO\) on a cluster of 16$\times$ NVIDIA H100 \(80GB\) GPUs, taking approximately 16 hours to converge\. Key hyperparameters include a learning rate of $1\times 10^{-6}$ with a cosine decay scheduler, a global batch size of 256 \(with accumulated gradient steps\) and 512 QA rollouts per policy sync, a group size of $G=8$, and a KL divergence coefficient of $\beta=0.0$\. The reward signal combines an outcome weight of $1.0$ and a perception weight of $\lambda=0.3$, utilizing MoCA penalties with $\alpha_{protect}=0.2$ and $\alpha_{punish}=0.2$\. The hyperparameters are selected based on an initial hyperparameter sweep using smaller 3B models\.
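For convenience, the hyperparameters above can be collected into a single configuration object\. The sketch below is one possible layout; the class name and field names are illustrative assumptions and do not correspond to any released code, while the values are those stated in the paragraph above\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoCATrainingConfig:
    """Hypothetical container for the reported training setup."""
    policy_model: str = "Qwen2.5-VL-Instruct-7B"
    verifier_model: str = "Qwen2.5-Instruct-14B"  # text-only, used for PV and outcome checks
    num_gpus: int = 16                            # NVIDIA H100 (80GB)
    learning_rate: float = 1e-6
    lr_schedule: str = "cosine"
    global_batch_size: int = 256
    rollouts_per_sync: int = 512
    group_size: int = 8                           # G
    kl_coefficient: float = 0.0                   # beta
    outcome_weight: float = 1.0
    perception_weight: float = 0.3                # lambda
    alpha_protect: float = 0.2
    alpha_punish: float = 0.2


config = MoCATrainingConfig()
```

Freezing the dataclass keeps the reported values from being modified by accident during a run; a sweep would construct new instances with overridden fields instead\.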

## Appendix C Benchmark Categorization and Details

We categorize benchmarks based on the primary bottleneck for model performance\.

#### 1\. Perception\-Intensive Benchmarks\.

These tasks require fine\-grained visual recognition, OCR, or grounding, where the reasoning is straightforward once the visual element is identified\.

- V\* \(V\-Star\): Focuses on guided visual search and spotting small details in high\-resolution images\.
- HRBench \(4K\): Evaluates hyper\-resolution perception, critical for detecting minute objects often lost in standard resizing\.
- InfoVQA: Requires reading and extracting specific text from large high\-resolution document images and infographics\.

#### 2\. Reasoning\-Intensive Benchmarks\.

These tasks provide visually distinct information but require complex logical chains, mathematical calculation, or domain knowledge\.

- MathVista: Requires multi\-step mathematical reasoning and geometric logic\.
- MMMU: A massive multi\-discipline benchmark requiring expert\-level knowledge and logic across college\-level subjects\.
- EMMA: Specifically designed to test "Chain\-of\-Thought" reasoning in multimodality\.

#### 3\. Rich\-Modality Benchmarks\.

These tasks occupy the intersection, requiring the model to parse complex structures \(layouts, tables, diagrams\) and reason over them\.

- DUDE: Document Understanding Dataset and Evaluation, involving multi\-page documents with diverse layouts\.
- SlideVQA: Requires retrieving information from a deck of presentation slides, testing layout understanding and sequence logic\.
- MMLongBench\-Doc: Tests the model’s ability to handle long\-context multimodal documents\.

Table 5: Extended Main Results\.

## Appendix D Modality\-Rich Training Dataset

Table [6](https://arxiv.org/html/2605.14054#A4.T6) details the composition of our curated training dataset, including STEM-related reasoning-intensive QAs from VL-Rethinker (Wang et al., [2025b](https://arxiv.org/html/2605.14054#bib.bib44)) and VisualWebInstruct (Jia et al., [2025](https://arxiv.org/html/2605.14054#bib.bib46)), high-resolution natural scene QAs from Pixel Reasoner (Wang et al., [2025d](https://arxiv.org/html/2605.14054#bib.bib71)), and the collected modality-rich queries. We show a few examples of the curated modality-rich queries in Fig. [5](https://arxiv.org/html/2605.14054#A6.F5). We will release the code and dataset upon paper acceptance.

Table 6: Composition of the Training Dataset.

| Source Domain | Description | Samples | Key Modalities |
| --- | --- | --- | --- |
| Sci-Graph (arXiv) | Extracted figures and captions from STEM papers. | 5,902 | Line Plots, Histograms, Schematics |
| Fin-Sheet | Financial reports and earning statements. | 3,345 | Dense Tables, Excel Grids |
| News-Layout | Newspaper front pages and newsletter scans. | 4,722 | Multi-column Text, Headlines, Insets |
| Info-Graphic | Educational and marketing infographics. | 3,738 | Icons, Flowcharts, stylized text |
| Manuals | Technical instruction manuals. | 1,968 | Diagrams with callouts, Step-by-step visual instructions |
| High-Res Scenes | High-quality natural images and complex scenes. | 8,905 | Dense Natural Scenes |
| STEM-QAs | Question and Answer pairs in STEM domains. | 16,271 | Text, Equations, Scientific Notation |
| Total | | 44,851 | |
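As a quick arithmetic check on Table 6, the per-source sample counts can be summed and converted to shares of the training mix. The counts are copied from the table, and the variable names are illustrative only.

```python
# Sample counts copied from Table 6.
samples = {
    "Sci-Graph (arXiv)": 5902,
    "Fin-Sheet": 3345,
    "News-Layout": 4722,
    "Info-Graphic": 3738,
    "Manuals": 1968,
    "High-Res Scenes": 8905,
    "STEM-QAs": 16271,
}

total = sum(samples.values())

# Percentage of the full training set contributed by each source.
shares = {name: round(100 * count / total, 1) for name, count in samples.items()}

for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:<20} {share:5.1f}%")
print(f"{'Total':<20} {total}")
```

The counts sum to the reported total of 44,851.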

## Appendix E Universal Verbal Program (VP) Prompt

Below is the full system instruction used for the Verification by Verbal Program \(VP\)\.

System Prompt for Verification by Verbal Program

You are an impartial AI judge evaluating the semantic similarity between a Model Response and a Ground Truth Answer.

\{question\_section\}

Ground Truth Answer: \{ground\_truth\}

Model Response: \{prediction\}

\#\# EVALUATION ALGORITHM

Follow these steps sequentially:

\#\#\# STEP 1: Identify Answer Type

Determine if the ground truth is:

- Type A: Numerical (integers, floats, percentages, mathematical expressions)
- Type B: Simple Phrase (short answer, typically ≤ 5 words, factual)
- Type C: Multi-Choice options (letters such as A, B, C, D, ...)
- Type D: Sentence/Paragraph (complete sentence or longer explanation)

Write: "Ground Truth Type: \[A/B/C/D\]"

\#\#\# STEP 2: Extract Boxed Content

Check if the model response contains \\boxed\{\} command.

Write:

- "Boxed content found: \[content inside \\boxed\{\}\]" OR
- "Boxed content found: None"

\#\#\# STEP 3: Apply Type-Specific Scoring

\#\#\#\# IF Type A (Numerical) OR Type B (Simple Phrase):

3.1 Compare boxed content with ground truth:

- Extract the value/phrase from the model response
- For numerical: Check exact match (account for equivalent representations: 9.5 million = 9,500,000)
- For simple phrase: Check semantic equivalence

Write:

- "Model Answer: \[extracted content\]"
- "Ground truth: \[ground truth value\]"
- "Match status: \[Exact match / Equivalent / No match\]"

3.2 Calculate score:

- If exact/equivalent match: Base score = 1.0
- If no match: Base score = 0.0
- If \\boxed\{\} is missing or improperly used: Deduct 0.1 from base score (minimum 0.0)

Write: "Type A/B Score: \[score\]"

\#\#\# IF Type C (Multi-Choice Options):

3.1 Identify the selected option:

- Check boxed content for the letter choice (A, B, C, D, etc.)
- If boxed content contains choice description/content instead of letter, match it to the corresponding option
- If boxed content is missing, scan the model response for explicit option selection

Write:

- "Identified option letter: \[letter\]" OR "Identified option letter: None (content only)" OR "Identified option letter: None"
- "Match method: \[Direct letter / Matched by content / Not found\]"

3.2 Ground truth comparison:

Write:

- "Ground truth option: \[letter\] - \[full content of the correct option\]"
- "Model’s answer: \[letter/content as identified in 3.1\]"
- "Comparison result: \[Correct letter / Correct content only / Incorrect\]"

3.3 Scoring rules:

Apply the following score based on the comparison:

- Correct letter in \\boxed\{\} \| 1.0
- Correct letter without \\boxed\{\} \| 0.9
- Correct letter in \\boxed\{\} + correct content following \| 0.7
- Correct content in \\boxed\{\} without letter \| 0.3
- Else: 0.0

Write: "Type C Score: \[score\] - \[reason based on table above\]"

Note: The scoring prioritizes the option letter, but provides partial credit when the correct choice content is identified without the letter

\#\#\#\# IF Type D (Sentence/Paragraph):

3.1 Reconstruct reference answer:

Write: "Reference answer sentence: \[complete sentence based on question context and ground truth\]"

3.2 Extract implied answer from model:

Write: "Model’s implied answer: \[sentence/meaning extracted from model response\]"

3.3 Semantic comparison:

Analyze whether both sentences convey the same meaning. Consider:

- Core facts and claims
- Logical equivalence
- Key information preservation

Special case: if the reference answer implies the question is not answerable, it means the relevant information is not provided in the given queries. So the model response is correct if it implies similar meanings.

Write: "Analysis: \[detailed comparison\]"

3.4 Assign semantic similarity score:

- 1.0: Perfect semantic equivalence (same meaning, different wording acceptable)
- 0.7-0.9: High similarity with minor differences
- 0.4-0.6: Partial overlap, some key information matches
- 0.1-0.3: Minimal overlap
- 0.0: No semantic overlap or contradictory

Write: "Type C Score: \[score\]"

\#\#\# STEP 4: Final Judgment

Provide:

1. "Conclusion: \[brief summary of evaluation\]"
2. Final score on a new line: \\boxed\{score\}

Important: Use only ONE \\boxed\{\} for the final score.

Now execute the algorithm above step by step.
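A minimal sketch of how such a prompt might be filled in and how the judge's final score might be read back. The helper names, the abbreviated template, and the "take the last boxed value" rule are assumptions rather than the authors' implementation. Placeholders are substituted with `str.replace` because the full prompt contains literal braces that `str.format` would misread as fields. The last boxed value is used because Step 2 asks the judge to quote the model's own boxed content, which can put an earlier boxed string in the output.

```python
import re
from typing import Optional

# Abbreviated stand-in for the full system prompt; only the placeholders matter here.
VP_TEMPLATE = (
    "You are an impartial AI judge ...\n"
    "{question_section}\n"
    "Ground Truth Answer: {ground_truth}\n"
    "Model Response: {prediction}\n"
    "## EVALUATION ALGORITHM\n"
    "..."
)

# Matches a boxed command with no nested braces, e.g. \boxed{0.9}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def build_prompt(question: Optional[str], ground_truth: str, prediction: str) -> str:
    """Fill the three placeholders without touching other braces in the prompt."""
    fields = {
        "question_section": f"Question: {question}" if question else "",
        "ground_truth": ground_truth,
        "prediction": prediction,
    }
    prompt = VP_TEMPLATE
    for key, value in fields.items():
        prompt = prompt.replace("{" + key + "}", value)
    return prompt


def parse_final_score(judge_output: str) -> Optional[float]:
    """Return the last boxed value as a score clipped to [0, 1], or None."""
    matches = BOXED.findall(judge_output)
    if not matches:
        return None
    try:
        score = float(matches[-1].strip())
    except ValueError:
        return None  # the last boxed value was not a number
    return min(max(score, 0.0), 1.0)


judge_output = (
    "Boxed content found: \\boxed{B}\n"
    "Type C Score: 0.9 - correct letter without box\n"
    "Conclusion: the response selects the correct option.\n"
    "\\boxed{0.9}"
)
score = parse_final_score(judge_output)
```

Returning `None` instead of a default score leaves the decision about malformed judge outputs (retry, zero reward, or discard) to the caller.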

## Appendix F Experiment Details

### F.1 Construction of Eval Set for Validating VP

This experiment is designed to validate the central claims of our Verification by Verbal Program (VP) method. We aim to show that VP is not only more accurate than standard LLM prompting but also significantly more reliable (lower variance). The experiment tests VP’s ability to handle semantic ambiguity, where Rigid Rule verifiers fail, and subtle inaccuracies, where standard LLM Prompting verifiers often fail.

#### Experimental Setup\.

- Challenge Set: We manually curated a "VP-Challenge-Set" of 273 (model answer, gold answer) pairs, selected from both the training set and the test set. Human experts labeled each pair with a binary (Correct/Incorrect) verdict. The set is composed of 144 "Correct" (positive) samples (61 exact matches, 83 semantic rephrasings) and 129 "Incorrect" (negative) samples (79 subtle errors, 50 mixed-fact answers).
- Verifiers Compared:
  1. Rigid Rule: A verifier using a combination of exact string match, keyword matching, and regular expressions.
  2. LLM Prompting: Our base LLM (Qwen2.5-Instruct-14B) prompted with a natural language question (i.e., "Is the model prediction semantically equivalent to the given ground truth answers? output TRUE if equivalent, otherwise FALSE.").
  3. VP (Ours): The same LLM judge (Qwen2.5-Instruct-14B) instructed to "execute" our universal Verbal Program (VP) to arrive at a verdict.
- Metrics:
  1. Accuracy (Acc.), F1-Score (F1): Standard classification metrics comparing the verifier’s verdict against the 273 human-annotated ground truth labels.
  2. Consistency: To measure reliability, we ran the stochastic verifiers (LLM Prompting and VP) five times for each sample with a non-zero temperature ($T = 0.7$). This metric reports the percentage of samples for which the verifier produced the exact same verdict in all 5 runs.
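The three metrics can be sketched as follows. This is an illustrative implementation on toy data, not the evaluation script used in the paper. It assumes verdicts are booleans with "Correct" as the positive class, and that consistency is computed over a per-sample list of repeated verdicts.

```python
def accuracy(preds: list[bool], labels: list[bool]) -> float:
    """Fraction of verifier verdicts that match the human label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def f1_score(preds: list[bool], labels: list[bool]) -> float:
    """F1 with 'Correct' (True) treated as the positive class."""
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum(not p and y for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def consistency(runs: list[list[bool]]) -> float:
    """Fraction of samples whose verdict is identical across all repeated runs."""
    return sum(len(set(verdicts)) == 1 for verdicts in runs) / len(runs)


# Toy data: four samples, five stochastic runs each.
labels = [True, True, False, False]
runs = [
    [True] * 5,
    [True, True, False, True, True],  # the verdict flips once, so this sample is inconsistent
    [False] * 5,
    [False] * 5,
]
first_run = [verdicts[0] for verdicts in runs]

print(accuracy(first_run, labels), f1_score(first_run, labels), consistency(runs))
```

Consistency is computed independently of the labels, so a verifier can be fully consistent and still wrong; it complements accuracy rather than replacing it.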

![Refer to caption](https://arxiv.org/html/2605.14054v1/x5.png)

Figure 5: Example Queries of Collected Modality-Rich Dataset.

### F.2 Validation of the Perception Verification (PV) Oracle

A core premise of our VL-Scaler framework is the utility of our Perception Verification (PV) mechanism. This mechanism relies on a text-only LLM (Qwen2.5-Instruct-14B), which acts as a "reasoning-oracle proxy," to evaluate whether a VLM’s generated perception blocks ($a_p$) are a "sufficient statistic" for answering a given question ($Q$). The proxy’s verdict, $R_P \in \{\text{Sufficient}, \text{Insufficient}\}$, is a critical reward signal.

However, this proxy is a high-capability reasoner, not an infallible one (as discussed in Section 2.3). It can err, particularly through "false negatives," where it fails to reason from a text block that a human would find sufficient. This experiment is designed to quantify the reliability of our PV oracle by comparing its verdicts against a human-annotated ground truth.

#### Data Sampling\.

We randomly sampled $N = 979$ data points from our model’s rollouts. Each data point consisted of a triplet: (1) the input Question ($Q$), (2) the VLM-generated Perception Blocks ($a_p$), and (3) the Golden Answer ($Y_{gold}$).

#### Task Design & Annotation\.

We recruited three independent annotators via the Amazon Mechanical Turk (AMT) platform. To ensure high-quality, non-expert annotations, the annotators were given a clear, binary-choice task:

"Given the ‘Question’ and the ‘Golden Answer’, does the provided ‘Perception Text’ contain all the necessary factual information to logically deduce the ‘Golden Answer’? You must not use any external knowledge or information from the (omitted) image. Judge based only on the text provided."

The output for each annotator was a binary label: Sufficient or Insufficient. We established our human ground truth by taking the majority vote of the three annotators.
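The aggregation step can be sketched as below. The majority vote follows the description above. The two summary statistics (agreement and false-negative rate of the proxy relative to the human label) are assumptions about how such a comparison might be reported, and the toy labels are invented for illustration.

```python
from collections import Counter

SUFFICIENT, INSUFFICIENT = "Sufficient", "Insufficient"


def majority_vote(labels: list[str]) -> str:
    """Most common label; with three annotators and two classes there are no ties."""
    return Counter(labels).most_common(1)[0][0]


def oracle_report(human: list[str], proxy: list[str]) -> dict[str, float]:
    """Compare proxy verdicts against the human majority label."""
    agreement = sum(h == p for h, p in zip(human, proxy)) / len(human)
    # False negative: humans judge the text sufficient, the proxy does not.
    on_sufficient = [p for h, p in zip(human, proxy) if h == SUFFICIENT]
    fn_rate = on_sufficient.count(INSUFFICIENT) / len(on_sufficient) if on_sufficient else 0.0
    return {"agreement": agreement, "false_negative_rate": fn_rate}


# Toy annotations for four rollouts (three annotators each) and toy proxy verdicts.
annotations = [
    [SUFFICIENT, SUFFICIENT, INSUFFICIENT],
    [SUFFICIENT, SUFFICIENT, SUFFICIENT],
    [INSUFFICIENT, INSUFFICIENT, SUFFICIENT],
    [INSUFFICIENT, INSUFFICIENT, INSUFFICIENT],
]
proxy_verdicts = [SUFFICIENT, INSUFFICIENT, INSUFFICIENT, INSUFFICIENT]

human_labels = [majority_vote(a) for a in annotations]
report = oracle_report(human_labels, proxy_verdicts)
```

In this toy case the proxy misses one of the two human-sufficient items, which is the false-negative pattern the experiment is designed to measure.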

## Appendix G Examples

Example Query

Question: Which of the following company logos do not appear in the slides? Facebook, AWS, Cargo and Manta.

Ground Truth: AWS

Model Response of VL-Scaler

<think\>First, I need to understand the nature of these documents. They appear to be slides from a presentation about mastering strategic SEO audits. The slides seem to discuss background information, avoidances, and possibly some company logos. My goal is to determine which company logos are missing from the slides. Let’s break down the user’s request. They want me to identify which of the following company logos are not present: Facebook, AWS, Cargo, and Manta. I’ll start by looking for these logos in the images provided.</think\>

<recognition\>Scanning through the images, I notice various company logos, but I need to carefully check for Facebook, AWS, Cargo, and Manta. After a thorough examination, I can see that Facebook, Cargo, and Manta are indeed present in the images. However, AWS does not appear anywhere in the slides.</recognition\>

<think\>Great, I’ve found that AWS is the only logo among the four that is not present in the slides. Now, I need to confirm my findings. Let me double-check the images one more time to ensure accuracy.</think\>

<recognition\>After another careful scan, I confirm that Facebook, Cargo, and Manta are all present in the images. AWS is still missing. This means that AWS is the correct answer to the user’s request.</recognition\>

<think\>So, the company logo that does not appear in the slides is AWS. This matches the solution I was given. I’m confident in this conclusion because I’ve double-checked the images multiple times. Before finalizing, I should reflect on whether I missed anything. Wait, let me double-check again just to be sure.</think\>

<recognition\>After the final scan, I reaffirm that Facebook, Cargo, and Manta are all present, while AWS is missing. This confirms that AWS is the correct answer.</recognition\>

<think\>Alright, everything checks out. The company logo that does not appear in the slides is AWS. This is the final answer. I’m satisfied with this conclusion because it aligns with the evidence I’ve gathered from the images.</think\>

<answer\>The company logo that does not appear in the slides is AWS, as it is the only one among the options Facebook, AWS, Cargo, and Manta that is missing from the images.</answer\>

Model Response of Baseline GRPO

Let’s analyze the company logos shown in the images step by step:

1. Image 1 (Background Information):
   - Petco
   - Disney Interactive Studios
   - NBC Universal
   - Mashable
   - GSN
   - UGG Australia
   - Beats by Dr. Dre
   - Chamber of Commerce
   - …
2. Image 2 (Shiny Object Distractions):
   - Addthis
   - AIM
   - AIM (alt)
   - Apple
   - Bebo
   - …

Now, let’s compare the logos from Image 1 and Image 2 to the list of logos provided in the question (Facebook, AWS, Cargo, and Manta):

- Facebook is present in Image 2.
- AWS is not present in either image.
- Cargo is present in Image 2.

Based on the analysis, the company logos that do not appear in the slides are AWS and Manta.
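The tagged format of the VL-Scaler response makes the perception content easy to isolate. The sketch below shows one way to split a response into its blocks and assemble a text-only input for the blindfolded reasoning proxy. The function names and the wording of the proxy input are hypothetical, and the tag set is taken from the example above.

```python
import re

# Matches <think>...</think>, <recognition>...</recognition>, <answer>...</answer>.
BLOCK = re.compile(r"<(think|recognition|answer)>(.*?)</\1>", re.DOTALL)


def split_blocks(response: str) -> list[tuple[str, str]]:
    """Return (tag, text) pairs in the order they appear in the response."""
    return [(tag, body.strip()) for tag, body in BLOCK.findall(response)]


def perception_text(response: str) -> str:
    """Concatenate only the recognition (perception) blocks."""
    return "\n".join(body for tag, body in split_blocks(response) if tag == "recognition")


def blindfolded_input(question: str, response: str) -> str:
    """Text-only input for the proxy: the question plus perception blocks, no image."""
    return f"Question: {question}\nObservations:\n{perception_text(response)}"


response = (
    "<think>I should look for the four logos.</think>"
    "<recognition>Facebook, Cargo, and Manta are present.</recognition>"
    "<think>Let me verify the remaining one.</think>"
    "<recognition>AWS does not appear in the slides.</recognition>"
    "<answer>AWS</answer>"
)
print(blindfolded_input("Which logo does not appear in the slides?", response))
```

Because the reasoning and answer blocks are dropped, the proxy can only succeed if the recognition text alone carries the needed facts, which is the property the perception reward targets. The baseline response has no such tags, so the same parser returns an empty perception string for it.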
