Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness

arXiv cs.CL Papers

Summary

Researchers introduce Groupwise Ranking Reward to fix reasoning-answer inconsistency in multimodal RL, boosting reliability-conditioned accuracy from 47.4% to 54.7% over standard RLVR.

arXiv:2604.18892v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves multimodal reasoning by rewarding verifiable final answers. Yet answer-correct trajectories may still rely on incomplete derivations, weak evidence, or statements that contradict their conclusions. This gap between answer correctness and reasoning validity, which we call reasoning-answer inconsistency, motivates trajectory supervision in multimodal RL. We compare two main approaches: reward models (RMs) and generative rewards (GRs). RMs are efficient and help early in training, but their gains weaken as the policy distribution shifts; GRs improve performance, but may give unstable rewards and are computationally expensive. We therefore propose Groupwise Ranking Reward, which ranks verifier-passed trajectories for the same prompt in one pass and redistributes reward accordingly. Groupwise comparison better separates stronger and weaker correct trajectories with lower judge overhead than GRs. Experiments show that RLVR aggravates reasoning-answer inconsistency, while trajectory supervision alleviates it. Groupwise Ranking Reward performs best overall, improving reliability-conditioned accuracy from 47.4% to 54.7% over RLVR.

# Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness
Source: [https://arxiv.org/html/2604.18892](https://arxiv.org/html/2604.18892)
###### Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) improves multimodal reasoning by rewarding verifiable final answers. Yet answer-correct trajectories may still rely on incomplete derivations, weak evidence, or statements that contradict their conclusions. This gap between answer correctness and reasoning validity, which we call *reasoning-answer inconsistency*, motivates trajectory supervision in multimodal RL. We compare two main approaches: reward models (RMs) and generative rewards (GRs). RMs are efficient and help early in training, but their gains weaken as the policy distribution shifts; GRs improve performance, but may give unstable rewards and are computationally expensive. We therefore propose Groupwise Ranking Reward, which ranks verifier-passed trajectories for the same prompt in one pass and redistributes reward accordingly. Groupwise comparison better separates stronger and weaker correct trajectories with lower judge overhead than GRs. Experiments show that RLVR aggravates reasoning-answer inconsistency, while trajectory supervision alleviates it. Groupwise Ranking Reward performs best overall, improving reliability-conditioned accuracy from 47.4% to 54.7% over RLVR.

Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness

Mengzhao Jia, Zhihan Zhang, Meng Jiang
University of Notre Dame
mjia2@nd.edu

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2604.18892v1/x1.png)

Figure 1: Reasoning-answer inconsistency under outcome-only RLVR and the motivation for Groupwise Ranking Reward. (a) An example of reasoning-answer inconsistency: the trajectory derives $X=-11$ in its reasoning, but the final boxed answer is 10. (b) Training dynamics under outcome-only RLVR and Groupwise Ranking Reward. Outcome-only RLVR steadily increases inconsistency and eventually reduces reliability-conditioned accuracy ($\mathrm{RC\text{-}Acc}$), whereas Groupwise Ranking Reward keeps inconsistency stable and continues improving $\mathrm{RC\text{-}Acc}$. (c) Comparison of trajectory supervision schemes. PRMs and pointwise GRs score each trajectory independently, while Groupwise Ranking Reward compares verifier-passed trajectories for the same prompt jointly and assigns ranking-based rewards according to their relative quality.

Multimodal Large Language Models (MLLMs) are increasingly capable of visually grounded reasoning (Shi et al., 2024; Wang et al., 2025c), and Reinforcement Learning with Verifiable Rewards (RLVR) (DeepSeek-AI et al., 2025) has become a standard way to push this progress further by directly optimizing verifiable outcomes such as final-answer correctness (Zhang et al., 2025b; Wang et al., 2025e; Wei et al., 2025). At the same time, recent studies have noted that under outcome-only supervision, MLLMs' reasoning process can become less reliable as training continues, even when the final answer is correct (Chen et al., 2025; Kan et al., 2025; Huang et al., 2025; Wang et al., 2025b; Jia et al., 2025). A closer look at the training process shows that answer-correct responses increasingly rely on incomplete derivations, weak evidence, or even statements that conflict with their own conclusions. As shown in Figure 1(a), the reasoning trajectory derives $X=-11$ while the final boxed answer is 10, yet the response would still be rewarded by outcome-only supervision despite an unfaithful reasoning chain. We trace this problem to a basic limitation of outcome-only RLVR: it supervises whether the final answer is correct, but not whether the intermediate reasoning process actually justifies that answer. We refer to this failure mode as *reasoning-answer inconsistency*. Left unresolved, this failure mode can make MLLMs unreliable in applications such as medical decision support, autonomous driving, and scientific analysis, where a correct-looking answer with contradictory reasoning is not acceptable.

A natural way to address this problem is to add supervision over the validity of the reasoning trajectory. However, how to introduce such supervision signals into multimodal RL in a way that is both effective and computationally efficient remains underexplored. In this paper, we compare two main families of trajectory supervision methods: Reward Models (RMs) and Generative Rewards (GRs). RMs directly assign scalar quality scores to the reasoning process. We use the fine-grained Process Reward Model (PRM) that assigns a score to each intermediate reasoning step (Uesato et al., 2022; Lightman et al., 2023; Wang et al., 2023). On the other hand, GRs use a Large Language Model (LLM) to follow a predefined evaluation instruction, generate a textual judgment, and then assign a scalar score (Zheng et al., 2023; Yuan et al., 2024; Wang et al., 2025d). Our experiments show that RMs are computationally efficient and help in the early stages of training, but their benefits weaken as the policy distribution shifts; GRs improve performance to some extent, but their gains are limited by instability and come with severe efficiency costs. Based on these observations, we propose a Groupwise Ranking Reward, which ranks verifier-passed trajectories of the same problem in a single pass and redistributes rewards based on that ranking. Figure 1(c) highlights the key distinction: PRMs and pointwise GRs score each trajectory independently, whereas our method compares multiple trajectories jointly and rewards them according to relative quality. By using groupwise comparison instead of independent scoring, the method better distinguishes stronger from weaker correct trajectories with lower judge computation overhead than GRs. It achieves the strongest performance among all trajectory supervision methods tested.

We conduct extensive experiments with different judge models, reward designs, and training data to systematically analyze this problem and evaluate the proposed method. The results show that RLVR can aggravate reasoning-answer inconsistency, whereas trajectory supervision alleviates this effect. Figure 1(b) previews this trend: under outcome-only RLVR, inconsistency rises over training and eventually reduces Reliability-Conditioned Accuracy ($\mathrm{RC\text{-}Acc}$), while Groupwise Ranking Reward keeps inconsistency stable and continues improving $\mathrm{RC\text{-}Acc}$. Among the approaches we study, the Groupwise Ranking Reward achieves both the best accuracy and reasoning faithfulness, improving $\mathrm{RC\text{-}Acc}$ from 47.4% to 54.7% over RLVR. These gains indicate that even when answer correctness is already verified, multimodal RL still lacks a way to prefer better-grounded correct trajectories.

## 2 Related Work

Reinforcement Learning for Multimodal Reasoning. Chain-of-Thought prompting and RLHF laid the foundation for training LLMs to produce explicit multi-step reasoning (Wei et al., 2022; Ouyang et al., 2022). Building on this trend, recent MLLMs extend stepwise reasoning to visually grounded tasks through reasoning-oriented supervision and post-training, especially in multimodal mathematical reasoning and general-purpose visual reasoning (Shi et al., 2024; Wang et al., 2025c; Zhang et al., 2025b; Wang et al., 2025e; Wei et al., 2025). RLVR has become a natural recipe in this setting because verifiable rewards allow direct optimization of answer correctness without requiring dense trajectory annotations, and recent pipelines such as GRPO and DeepSeek-R1 show that this paradigm can scale beyond narrow domains (Shao et al., 2024; DeepSeek-AI et al., 2025; Su et al., 2025). Our work complements these efforts by studying how RLVR should assign differential credit among verifier-passed rollouts, encouraging the policy to prefer better-grounded and more reliable reasoning trajectories.

Reasoning-Answer Inconsistency in MLLMs. Recent multimodal RL work identifies reasoning-answer inconsistency as a common side effect of outcome-only RLVR, where answer accuracy can improve even as answer-reasoning coherence degrades. Existing fixes mainly follow three directions. One uses reference-policy calibration (Chen et al., 2025; Kan et al., 2025), but this signal becomes less reliable as the online policy drifts. Another enforces consistency under semantics-preserving perturbations (Huang et al., 2025; Wang et al., 2025b), though these methods are most natural for closed-answer settings. A third line uses rubric-based generative rewards (Jia et al., 2025), which provide richer supervision but introduce substantial judge overhead without being updated jointly with the evolving policy.

Trajectory Supervision. A parallel line of work improves reasoning by supervising intermediate steps rather than only final answers. Trajectory supervision was first studied in text-only mathematical reasoning (Uesato et al., 2022; Lightman et al., 2023) and has recently been extended to multimodal settings (Zhang et al., 2025a). Another line of work uses language models themselves as evaluators or reward models (Zheng et al., 2023; Yuan et al., 2024; Gu et al., 2024). Recent work further shows that listwise or distribution-aware judge inference can be more reliable than independently assigning a single absolute score to each candidate (Zhao et al., 2024; Wang et al., 2025d). Closely related, RLRR replaces absolute reward shaping with relative ranking in group-based RL and trains a ranking reward model to produce groupwise preferences (Niu et al., 2026). However, that work mainly uses ranking to improve reward modeling and optimization. Our method aligns with these directions: we study the effects of both RMs and GRs for trajectory supervision in multimodal RL. Moreover, we propose a groupwise ranking method aimed at improving the reliability of multimodal reasoning.

## 3 Method

RLVR improves answer correctness, but it does not distinguish among verifier-passed trajectories whose reasoning quality differs substantially. As a result, the policy model can still be rewarded for correct answers supported by incomplete or contradictory traces. To address this limitation, we study three trajectory supervision reward designs within the same RLVR pipeline. The first is a Reward Model (RM), which assigns scalar scores; in our experiments, this RM is instantiated as a fine-grained Process Reward Model (PRM) that scores intermediate reasoning steps (Uesato et al., 2022; Lightman et al., 2023; Wang et al., 2023; Zhang et al., 2025a). The second is a standard Generative Reward (GR), which uses an LLM-as-a-Judge to produce a textual judgment and score each trajectory independently (Zheng et al., 2023; Yuan et al., 2024; Gu et al., 2024). The third is our Groupwise Ranking Reward, a groupwise variant of generative reward that compares verifier-passed trajectories for the same prompt and redistributes reward according to their relative quality ranking (Zhao et al., 2024; Wang et al., 2025d).

### 3.1 Problem Setup

Let $\mathcal{D}$ be a multimodal training set. Each training example is denoted by $(x, a^{*})$, where $x=(v,q)$, $v$ is the input image, $q$ is the textual question, and $a^{*}$ is the ground-truth answer. Given $x$, the policy $\pi_{\theta}(\cdot\mid x)$ generates a structured response

$$y=[z,a], \qquad (1)$$

where $z$ is the reasoning trajectory enclosed in `<think>...</think>` and $a$ is the final answer extracted from the `\boxed{...}` expression. In the RLVR pipeline, each sampled complete response $y=[z,a]$ is referred to as a *rollout*.

Following standard RLVR pipelines for reasoning (Shao et al., 2024; DeepSeek-AI et al., 2025), a rule-based verifier provides a deterministic verification reward

$$r_{\text{ver}}(y)=\mathbb{I}\big[\operatorname{Verify}(a,a^{*})=1\big], \qquad (2)$$

where $\mathbb{I}[\cdot]$ is the indicator function. RLVR maximizes $\mathbb{E}_{(x,a^{*})\sim\mathcal{D},\, y\sim\pi_{\theta}}[r_{\text{ver}}(y)]$. This objective gives identical reward to all verifier-passed responses; however, their reasoning quality may differ substantially. We use the standard GRPO (Shao et al., 2024) optimization pipeline throughout: for each prompt $x$, we sample a rollout group $Y=\{y_{i}=[z_{i},a_{i}]\}_{i=1}^{N}$ from $\pi_{\theta}(\cdot\mid x)$ and optimize over this group using verifier and trajectory rewards. Here we focus on how different methods define the trajectory supervision signal; the GRPO optimization objective and equations are given in Appendix A.
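As a concrete illustration of Eq. (2), the snippet below is a minimal sketch of such a rule-based verifier reward, assuming a simple extractor for the final \boxed{...} answer and exact-match or numeric-tolerance equivalence; the function names and equivalence rules are illustrative assumptions, not the paper's actual verifier.

```python
import re

def extract_boxed(response: str) -> str:
    # Pull the final \boxed{...} expression from the rollout text (assumes no nested braces).
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else ""

def verify(answer: str, gold: str) -> bool:
    # Hypothetical equivalence check: exact match after lowercasing, else numeric tolerance.
    a, g = answer.strip().lower(), gold.strip().lower()
    if a == g:
        return True
    try:
        return abs(float(a) - float(g)) < 1e-6
    except ValueError:
        return False

def r_ver(response: str, gold: str) -> float:
    # r_ver(y) = 1[Verify(a, a*) = 1], Eq. (2).
    return 1.0 if verify(extract_boxed(response), gold) else 0.0
```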

### 3.2 Trajectory Reward Variants

Beyond the verifier reward $r_{\text{ver}}(y_{i})$, we add an auxiliary trajectory reward $r_{\text{aux},i}$ to provide extra supervision on rollout $y_{i}=[z_{i},a_{i}]$. Common trajectory rewards include RMs and GRs.

##### RMs\.

RMs assign scalar quality scores without generating extra natural-language justifications. They can operate at different supervision granularities. In this paper, we instantiate the RM as a fine-grained PRM that decomposes the reasoning trajectory into intermediate steps and assigns a step-level quality signal along the trajectory. Let $\mathcal{S}_{\text{prm}}$ denote the reward model and let $\{s^{\text{prm}}_{i,t}\}_{t=1}^{T_{i}}=\mathcal{S}_{\text{prm}}(x,z_{i})$ be the scores for the $T_{i}$ reasoning steps in rollout $i$. The rollout-level PRM reward is obtained by aggregating these step scores,

$$r_{\text{prm},i}=\operatorname{Agg}\big(\{s^{\text{prm}}_{i,t}\}_{t=1}^{T_{i}}\big), \qquad (3)$$

where $\operatorname{Agg}(\cdot)$ denotes the aggregation rule. PRM provides dense supervision over the reasoning process.
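The aggregation rule $\operatorname{Agg}(\cdot)$ in Eq. (3) is left generic; a minimal sketch, assuming mean aggregation over step scores as used in the PRM setup of Section 4.1, could look like:

```python
from statistics import mean
from typing import Callable, Sequence

def prm_reward(step_scores: Sequence[float],
               agg: Callable[[Sequence[float]], float] = mean) -> float:
    # r_prm,i = Agg({s_prm_i,t}) over the T_i step scores of rollout i, Eq. (3).
    # Mean aggregation matches the per-step averaging described in Section 4.1.
    if not step_scores:
        return 0.0
    return float(agg(step_scores))
```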

##### GRs\.

GRs use an LLM judge that follows an evaluation instruction with explicit scoring rubrics\. For each rollout, the judge generates a textual judgment together with a rubric\-grounded scalar score, from which the reward is parsed:

$$r_{\text{gr},i}=\operatorname{ParseScore}\big(\mathcal{S}_{\text{gr}}(q,z_{i},a_{i})\big). \qquad (4)$$

Compared with RMs, GRs do not directly emit scores; instead, the judge first produces an evaluative judgment, and the scalar reward is then extracted from that judgment. In our setup, the rubric casts the judge as an expert evaluator of reasoning quality, defines anchored score levels in $[0,1]$, and asks it to assess dimensions such as logical coherence, calculation quality, completeness of the conclusion, and honest handling of uncertainty. The judge returns a short feedback string together with a JSON-formatted scalar score, which we parse as the reward. We refer to the setting where each rollout is judged independently as *pointwise GR*, to distinguish it from the *groupwise GR* setting introduced next.
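A minimal sketch of the ParseScore step in Eq. (4), assuming the judge returns the JSON schema with a `judge_score` field described in Appendix B.1; the fallback behavior for unparseable outputs is an illustrative assumption.

```python
import json
import re

def parse_score(judge_output: str) -> float:
    # ParseScore in Eq. (4): extract the scalar "judge_score" from the judge's JSON
    # output and clamp it to [0, 1]; unparseable outputs fall back to 0.0.
    match = re.search(r"\{.*\}", judge_output, flags=re.DOTALL)
    if match is None:
        return 0.0
    try:
        score = float(json.loads(match.group(0))["judge_score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0
    return min(max(score, 0.0), 1.0)
```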

RMs and GRs are both widely used trajectory supervision signals, but they also have limitations: PRM can weaken under policy shift, while pointwise GR depends on the absolute score calibration and requires separate judging for each rollout\. Motivated by this, we further introduce Groupwise Ranking Reward as a groupwise variant of GRs\.

##### Groupwise Ranking Reward\.

Groupwise Ranking Reward extends GRs from independent scoring to within-group ranking. For each prompt $x$, we keep only the verifier-passed subset $I^{+}=\{i: r_{\text{ver}}(y_{i})=1\}$ within the sampled rollout group $Y$, where $|I^{+}|$ denotes the number of verifier-passed rollouts, because incorrect rollouts are already handled by the verifier. We focus on finding which correct rollout is best in reliability and quality.

The groupwise judge $\mathcal{S}_{\text{cmp}}$ takes the shared question $q$ together with all verifier-passed reasoning-answer pairs $\{(z_{i},a_{i})\}_{i\in I^{+}}$, compares them side-by-side, and outputs $M$ ordered tie-aware tiers

$$(\tau_{1},\tau_{2},\dots,\tau_{M})=\mathcal{S}_{\text{cmp}}\big(q,\{(z_{i},a_{i})\}_{i\in I^{+}}\big), \qquad (5)$$

where $\tau_{1}\succ\tau_{2}\succ\cdots\succ\tau_{M}$, $\tau_{m}$ contains the candidates assigned to the $m$-th rank tier, and $\tau_{1}$ is the top tier.

We design a rank-to-score conversion based on a simple intuition: a rollout should receive a higher score if it is ranked ahead of more verifier-passed rollouts, and a lower score if it is ranked ahead of fewer. We also preserve ties instead of forcing arbitrary tie-breaking, which avoids injecting artificial noise when the judge views several traces as equally rigorous. Concretely, for any rollout $i\in\tau_{m}$, let $b_{i}=\sum_{\ell=m+1}^{M}|\tau_{\ell}|$ be the number of lower-ranked correct rollouts and let $c_{i}=|\tau_{m}|-1$ be the number of tied rollouts. The raw score is

$$\tilde{s}_{i}=\frac{b_{i}+\tfrac{1}{2}c_{i}}{|I^{+}|-1}, \qquad (6)$$

where each lower-ranked rollout contributes one point, each tie contributes half a point, and the result is normalized by the number of available comparisons. Thus, candidates in $\tau_{1}$ receive the highest raw score and candidates in $\tau_{M}$ the lowest, while prompts with different $|I^{+}|$ remain on the same $[0,1]$ scale. This normalization makes the rank-to-score mapping well-defined for prompt groups with any number of verifier-passed rollouts.

The final ranking reward is obtained by centering these raw scores within the verifier-passed set. Let $\bar{s}=\frac{1}{|I^{+}|}\sum_{j\in I^{+}}\tilde{s}_{j}$ denote the mean raw score over $I^{+}$.

$$r_{\text{rank},i}=\begin{cases}\tilde{s}_{i}-\bar{s}, & \text{if } i\in I^{+} \text{ and } |I^{+}|\geq 2,\\ 0, & \text{otherwise}.\end{cases} \qquad (7)$$

This centering makes the ranking reward zero-mean within the verifier-passed set. Under this mapping, $r_{\text{rank},i}$ is bounded in $[-0.5, 0.5]$. Thus, the judge acts as a pure redistribution mechanism: better-supported correct rollouts receive positive ranking reward, while weaker correct rollouts can still have positive verifier reward but negative ranking reward, which effectively penalizes weaker correct reasoning among answer-correct rollouts. Incorrect rollouts remain untouched and still receive zero total reward. If all verifier-passed rollouts fall into the same tier, the centered reward is again zero. Appendix B gives the full implementation details and summarizes alternative rank-to-score mappings.
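Putting Eqs. (6) and (7) together, the following is a minimal sketch of the rank-to-score conversion and within-group centering, assuming the judge's output has already been parsed into ordered tie-aware tiers; it is an illustrative reimplementation, not the authors' released code.

```python
from typing import Dict, List

def groupwise_ranking_reward(tiers: List[List[str]]) -> Dict[str, float]:
    # tiers[0] is the top tier; each tier holds the ids of rollouts judged as tied.
    indices = [i for tier in tiers for i in tier]
    k = len(indices)                       # |I+|, number of verifier-passed rollouts
    if k < 2:
        return {i: 0.0 for i in indices}   # Eq. (7): no comparison possible

    raw = {}
    lower = k                              # rollouts ranked strictly below the current tier
    for tier in tiers:
        lower -= len(tier)
        ties = len(tier) - 1               # c_i: other rollouts in the same tier
        for i in tier:
            raw[i] = (lower + 0.5 * ties) / (k - 1)          # Eq. (6)

    mean_raw = sum(raw.values()) / k
    return {i: s - mean_raw for i, s in raw.items()}         # Eq. (7): zero-mean over I+
```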

### 3.3 Training Objective

The final rollout reward combines verification with one auxiliary trajectory reward:

$$\mathcal{R}_{i}=r_{\text{ver}}(y_{i})+\lambda\, r_{\text{aux},i}, \qquad (8)$$

where $\lambda\geq 0$ controls the strength of trajectory supervision and $r_{\text{aux},i}\in\{r_{\text{prm},i},\, r_{\text{gr},i},\, r_{\text{rank},i}\}$. In our experiments, we use $\lambda=1$ for all three variants. The optimization algorithm is unchanged.
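A small sketch of how Eq. (8) could combine the two signals for a rollout group, assuming a hypothetical helper that already produced the auxiliary rewards; for the groupwise variant, only verifier-passed rollouts carry a nonzero auxiliary term.

```python
from typing import Dict, List

def combine_group_rewards(ver: List[float],
                          aux: Dict[int, float],
                          lam: float = 1.0) -> List[float]:
    # R_i = r_ver(y_i) + lambda * r_aux,i (Eq. 8), with lambda = 1 in all experiments.
    # Rollouts without an auxiliary signal (e.g., verifier-failed rollouts under
    # Groupwise Ranking Reward) default to r_aux,i = 0.
    return [v + lam * aux.get(i, 0.0) for i, v in enumerate(ver)]
```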

## 4 Experiments

### 4.1 Experimental Setup

Table 1: Model performance comparison results under a matched training budget. The second and third columns report the five-benchmark average $\mathrm{RC\text{-}Acc}$ and $\mathrm{Acc}$, respectively, while the remaining benchmark columns report $\mathrm{RC\text{-}Acc}$. Groupwise Ranking Reward achieves the best overall performance, yielding the highest average $\mathrm{RC\text{-}Acc}$ and leading on four of the five benchmarks. We bold the best value in each numeric column and underline the second best.

Training Hyperparameters. All RL training runs start from the same base checkpoint, Qwen2.5-VL-7B-Instruct (Bai et al., 2025). Across all RL variants, we use rollout budget $n=8$, auxiliary-reward coefficient $\lambda=1$, learning rate $1\times 10^{-6}$, and KL coefficient $1\times 10^{-2}$. All experiments use the same ViRL (Wang et al., 2025a) training data and are run for 2 epochs.

Methods to Compare. Our main experiment compares four RL variants on ViRL: outcome-only RLVR, w/ PRM, w/ Pointwise GR, and w/ Groupwise Ranking Reward. All four use the same GRPO optimization pipeline and differ only in how trajectory supervision is added. The PRM variant uses a visual process-reward model, Qwen-VL-PRM-7B (Ong et al., 2025), that scores each trajectory step with a binary true/false decision and averages rewards across all steps. The two GR-based variants both use gpt-oss-20b (OpenAI, 2025) as the judge over textualized question-trajectory-answer tuples: w/ Pointwise GR gives a scalar `judge_score` in $[0,1]$, while w/ Groupwise Ranking Reward assigns tie-aware ranks for verifier-passed candidates and maps them to centered groupwise rewards. Appendix B.1 lists the judge prompts.

Benchmarks. We evaluate final checkpoints on five multimodal reasoning benchmarks: three math-focused benchmarks, MathVision (Wang et al., 2024), MathVista (Lu et al., 2024), and WeMath (Qiao et al., 2025), and two broader general-reasoning benchmarks, MMMU (Yue et al., 2024a) and MMMU-Pro (Yue et al., 2024b). To make the reported average reasoning-answer inconsistency across all benchmarks fair, we randomly sample 500 instances from each dataset to construct our final test set. In this way, each benchmark contributes equally to the overall average; otherwise, we find the value could be artificially inflated.

Evaluation Metrics. The standard answer Accuracy ($\mathrm{Acc}$) checks only whether the final answer matches the ground truth. In our setting, this is not sufficient because we frequently observe *reasoning-answer inconsistency*: the final answer is correct, but the trajectory supports a different conclusion or does not justify the boxed answer. Such predictions are hard to trust even when they count as correct under the standard accuracy metric. To measure this failure mode, we ask a *judge model* to read the whole trajectory, including the final answer, and generate a binary consistency score. We then report the overall inconsistency rate ($\mathrm{IncR}$), the correct-but-inconsistent rate ($\mathrm{CBIR}$: the final answer is correct but the trajectory fails the consistency judgment), and Reliability-Conditioned Accuracy ($\mathrm{RC\text{-}Acc}$), defined as $\mathrm{RC\text{-}Acc}=\mathrm{Acc}-\mathrm{CBIR}$. That is, $\mathrm{RC\text{-}Acc}$ removes answer-correct but reasoning-inconsistent cases, making it a stricter criterion, and we therefore use it as the primary metric throughout the paper.
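A minimal sketch of these metrics, assuming per-example answer-correctness flags and judge consistency verdicts (with `None` marking abstentions, as described in the next paragraph):

```python
from typing import Optional, Sequence

def reliability_metrics(correct: Sequence[bool],
                        consistent: Sequence[Optional[bool]]) -> dict:
    # correct[i]: final answer matches the ground truth.
    # consistent[i]: judge's binary consistency verdict, or None if it abstains.
    n = len(correct)
    judged = [c for c in consistent if c is not None]
    acc = sum(correct) / n                                           # Acc
    inc_r = sum(1 for c in judged if not c) / max(len(judged), 1)    # IncR (abstentions excluded)
    cbir = sum(1 for a, c in zip(correct, consistent)
               if a and c is False) / n                              # CBIR
    return {"Acc": acc, "IncR": inc_r, "CBIR": cbir, "RC-Acc": acc - cbir}
```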

Reasoning\-Answer Inconsistency Judge\.When judging whether a trajectory is inconsistent, we decompose the task into four simple steps: \(i\) extract the conclusion implied by the trajectory, \(ii\) extract and normalize the final boxed answer, including mapping option letters to option content when the prompt provides it, \(iii\) compare the two under lightweight equivalence rules such as unit omission and numeric reformatting, and \(iv\) abstain with N/A when the model did not properly provide a valid answer \(abstentions do not count toward inconsistency rate\)\. This decomposition makes the intermediate decision points explicit and easier to audit\.
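A highly simplified sketch of steps (ii)-(iv), assuming string-level normalization and a numeric-tolerance equivalence rule; the actual judge is an LLM following the full decomposition, so this only illustrates the comparison logic.

```python
import re
from typing import Optional

def normalize(ans: str) -> str:
    # Lightweight normalization: lowercase, drop whitespace, "$", and "%" (unit omission).
    return re.sub(r"[\s$%]", "", ans.strip().lower())

def judge_consistency(trajectory_conclusion: str, boxed_answer: str) -> Optional[bool]:
    # Steps (ii)-(iv): normalize the boxed answer, compare it with the conclusion
    # implied by the trajectory, and abstain (None) when no valid answer is given.
    if not boxed_answer.strip():
        return None  # abstain with N/A
    a, b = normalize(trajectory_conclusion), normalize(boxed_answer)
    if a == b:
        return True
    try:
        return abs(float(a) - float(b)) < 1e-6  # numeric reformatting rule
    except ValueError:
        return False
```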

### 4.2 Main Results

Table 1 reports the five-benchmark average $\mathrm{RC\text{-}Acc}$ and $\mathrm{Acc}$, together with the benchmark-wise $\mathrm{RC\text{-}Acc}$, after training with different methods for 2 epochs. Appendix Table 2 provides the complete scores, including $\mathrm{Acc}$, $\mathrm{RC\text{-}Acc}$, and $\mathrm{CBIR}$.

RLVR improves answer accuracy but not reasoning reliability. Compared to the base checkpoint, RLVR raises average answer accuracy, but its $\mathrm{RC\text{-}Acc}$ remains unchanged because correct-but-inconsistent predictions become much more frequent (see Table 2 for $\mathrm{CBIR}$ scores). This reveals a significant side effect of outcome-only RL: the policy gets the final answer right more often, but the reward still cannot tell whether the trajectory actually supports that answer.

GR-based rewards recover this gap, and Groupwise Ranking Reward gives the best overall trade-off. Under the same training budget, both GR-based variants outperform RLVR and the PRM variant, and Groupwise Ranking Reward reaches the strongest average $\mathrm{RC\text{-}Acc}$ at 54.7%, compared with 52.3% for Pointwise GR and 47.4% for RLVR. This suggests that comparing correct candidates within the same prompt gives a more reliable signal than scoring each rollout in isolation.

The gains are broad rather than benchmark-specific. Groupwise Ranking Reward performs best on four of the five benchmarks and remains competitive on MMMU-Pro, with especially large gains on WeMath. The improvement is therefore not tied to one dataset or one answer format.

![Refer to caption](https://arxiv.org/html/2604.18892v1/x2.png)

Figure 2: Training dynamics across reward variants. The four panels report Accuracy, RC-Acc, IncR, and CBIR; all reported values are percentages. The last two panels use a broken y-axis so the large RLVR values remain visible while differences among the lower-valued methods are still readable. Although outcome-only RL can improve final-answer correctness, it simultaneously degrades reasoning reliability in multimodal reasoning tasks. By contrast, Groupwise Ranking Reward is the only method that consistently improves both answer accuracy and $\mathrm{RC\text{-}Acc}$ while simultaneously reducing $\mathrm{IncR}$ and $\mathrm{CBIR}$.
### 4.3 Analysis

#### 4.3.1 Training Dynamics

Figure 2 shows the five-benchmark average trends for $\mathrm{Acc}$, $\mathrm{RC\text{-}Acc}$, $\mathrm{IncR}$, and $\mathrm{CBIR}$, so we can see how accuracy and inconsistency evolve throughout training rather than only at the final checkpoint.

Outcome-only reward harms reasoning reliability as training progresses. Although RLVR improves standard answer accuracy from 51% to 54%, as shown in the first panel of Figure 2, this gain does not translate into more reliable or faithful reasoning trajectories. Its $\mathrm{RC\text{-}Acc}$ peaks early at around 51.20% between 20 and 30 training steps, but then declines to 47.36% by step 140. This deterioration is driven by a substantial increase in reasoning-answer inconsistency during the later stages of training: $\mathrm{IncR}$ rises from 6% at step 60 to 19% at step 140, while $\mathrm{CBIR}$ increases steadily from 2% to 6% over the full training process. These results suggest that while outcome-only RL can improve final-answer correctness, it simultaneously degrades reasoning reliability in multimodal reasoning tasks.

PRM offers only limited benefits for reducing reasoning-answer inconsistency. Although PRM has been reported to be effective for some text-only reasoning tasks, incorporating a PRM reward does not yield a substantial improvement in $\mathrm{RC\text{-}Acc}$ under our multimodal reasoning setup. Instead, its trend resembles that of vanilla RLVR: $\mathrm{RC\text{-}Acc}$ plateaus during the middle stage of training and then declines afterward. PRM-based training is able to prevent inconsistency rates from growing markedly, but it does not reduce them in a meaningful way either. This finding suggests that PRMs remain constrained by their generalization ability, which is a common criticism of such approaches, because the policy model can generate responses that differ substantially from the PRM's training distribution.

Our group-wise ranking reward is the most effective at reducing inconsistent reasoning trajectories. It is the only method that delivers stable and continuous improvement across all four panels: it consistently improves both answer accuracy and $\mathrm{RC\text{-}Acc}$, while simultaneously reducing $\mathrm{IncR}$ and $\mathrm{CBIR}$. These results indicate that our method provides more stable and reliable reward signals. More importantly, they offer clear evidence that our generative reward alters the direction of optimization, rather than merely adding training compute.

![Refer to caption](https://arxiv.org/html/2604.18892v1/x3.png)

Figure 3: Qwen2.5-VL-3B and 7B models trained with vanilla RLVR: Accuracy (left) and Inconsistency (right). Here, Inconsistency denotes the correct-but-inconsistent rate ($\mathrm{CBIR}$). In each panel, the first two bars show the initial checkpoint and the last two bars show the final checkpoint. All values are percentages.
#### 4.3.2 Effect of Model Size on Inconsistency

We next ask whether the significance of the reasoning-answer inconsistency problem is related to the size of the model. Figure 3 compares Qwen2.5-VL 3B and 7B checkpoints under the same ViRL training setup. We find that scaling mitigates the model's inconsistency, but it does not remove the incentive that creates it. The 7B model starts from a lower correct-but-inconsistent rate than the 3B model, but both models still drift upward under outcome-only RL. The result suggests that model capacity changes the severity of the failure mode, whereas the underlying cause remains the reward signal.

#### 4.3.3 Effect of the Rank-to-Score Mapping Function

After the judge produces an ordering over verifier-passed responses, we still need a scalar mapping from that ordering to rollout reward. We compare four simple options. Pairwise-Comparison Score (PCS), our default mapping, converts the same $K$-candidate ranking into a tie-aware pairwise win rate: each candidate gets 1 point against every lower-ranked verifier-passed candidate, 0.5 against every tied candidate, and 0 against every higher-ranked candidate, then normalizes by $K-1$. The three alternatives are Exponential-Decay Normalization (EDN), which uses an exponentially decaying rank score before normalization; Tiered-Rank Score (TRS), which maps each final rank tier directly to evenly spaced linear rewards; and Inverse-Rank Normalization (IRN), which assigns each candidate a reciprocal score $1/r$ and then min-max normalizes it within the prompt. Figure 4 compares these four mappings using the two most relevant summary metrics for this design choice: five-benchmark average $\mathrm{RC\text{-}Acc}$ and $\mathrm{IncR}$.

![Refer to caption](https://arxiv.org/html/2604.18892v1/x4.png)

Figure 4: Ablation on the rank-to-score mapping function under the matched training setup. The left group reports $\mathrm{RC\text{-}Acc}$ with the left y-axis, and the right group reports $\mathrm{IncR}$ with the right y-axis. Bars are ordered as PCS (default), EDN, TRS, and IRN.

As the figure suggests, the main gain comes from group-wise comparison itself, not from a fragile post-ranking mapping. Across the four mappings, average $\mathrm{RC\text{-}Acc}$ varies only from 53.24% to 54.72%, a spread of 1.48 points. PCS performs the best, but the performance differences among the four variants are still minor. This shows that once the judge can reliably compare correct candidates within a prompt, the exact monotonic rank-to-reward mapping matters much less.
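Since the paper names EDN, TRS, and IRN but does not spell out their exact parametric forms, the following sketch shows one plausible instantiation of each alternative mapping (the decay rate and normalization details are assumptions); like PCS, the resulting raw scores would then be centered within the verifier-passed set.

```python
from typing import Dict, List

def tiers_to_ranks(tiers: List[List[str]]) -> Dict[str, int]:
    # Convert tie-aware tiers (tiers[0] is best) into 1-based rank numbers.
    return {i: m + 1 for m, tier in enumerate(tiers) for i in tier}

def edn(tiers: List[List[str]], decay: float = 0.5) -> Dict[str, float]:
    # Exponential-Decay Normalization: decaying score per rank, then min-max rescale.
    # The decay rate 0.5 is an assumed value; the paper only names the scheme.
    raw = {i: decay ** (r - 1) for i, r in tiers_to_ranks(tiers).items()}
    lo, hi = min(raw.values()), max(raw.values())
    return {i: (s - lo) / (hi - lo) if hi > lo else 0.0 for i, s in raw.items()}

def trs(tiers: List[List[str]]) -> Dict[str, float]:
    # Tiered-Rank Score: tiers mapped to evenly spaced linear rewards (top = 1, bottom = 0).
    m = len(tiers)
    return {i: 1.0 if m == 1 else (m - 1 - t) / (m - 1)
            for t, tier in enumerate(tiers) for i in tier}

def irn(tiers: List[List[str]]) -> Dict[str, float]:
    # Inverse-Rank Normalization: reciprocal score 1/r, min-max normalized within the prompt.
    raw = {i: 1.0 / r for i, r in tiers_to_ranks(tiers).items()}
    lo, hi = min(raw.values()), max(raw.values())
    return {i: (s - lo) / (hi - lo) if hi > lo else 0.0 for i, s in raw.items()}
```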

#### 4.3.4 Stability Comparison Between Pointwise GRs and Groupwise Ranking

Groupwise Ranking Reward is motivated by the fact that all rollouts for the same prompt are judged together in a single pass under a shared comparison standard. In contrast, pointwise GR scores each response independently, so judging $N$ rollouts requires $N$ separate judge calls, and the judge may apply slightly different internal standards across those calls. We therefore expect Groupwise Ranking Reward to have a stability advantage over pointwise GR. We evaluate this stability through a controlled simulation: if a judge is stable, repeated calls on the same fixed responses should produce similar rewards.

Specifically, we randomly sample 500 prompts, generate 8 rollouts for each prompt as in the training setting, and rerun the reward assignment process 4 times at a relatively low temperature (0.7). In each repeat, we construct rewards exactly as in training: pointwise GR scores each response independently, whereas Groupwise Ranking Reward jointly ranks the verifier-passed responses within the same prompt and maps that ranking to scalar rewards. We then compute the variance of the repeated reward values and average this quantity over the full sampled set. This metric measures how much the assigned reward changes under small perturbations; lower values therefore indicate more stable reward assignment. We find that groupwise ranking yields substantially lower reward variance. The mean variance drops from **0.026** under pointwise GR to **0.016** under Groupwise Ranking Reward, a **37.6%** reduction. This suggests that relative comparison makes the assigned reward less sensitive to judge randomness and provides a more stable supervision signal.
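A minimal sketch of this stability metric, assuming rewards from repeated judge calls have been collected into nested lists (prompt x repeat x rollout):

```python
from statistics import mean, pvariance
from typing import List

def mean_reward_variance(repeated_rewards: List[List[List[float]]]) -> float:
    # repeated_rewards[p][r][j]: reward of rollout j of prompt p on repeat r
    # (4 repeats of 8 rollouts per prompt in the paper's simulation).
    variances = []
    for prompt_repeats in repeated_rewards:
        num_rollouts = len(prompt_repeats[0])
        for j in range(num_rollouts):
            variances.append(pvariance([rep[j] for rep in prompt_repeats]))
    return mean(variances)  # lower means more stable reward assignment
```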

## 5 Conclusion

We study reasoning-answer inconsistency in multimodal RLVR and show that outcome-only optimization can increase this mismatch even when answer accuracy improves. We compare trajectory supervision methods and find that they mitigate this effect. More broadly, these results suggest that answer correctness alone is an incomplete objective for multimodal reasoning, and that trajectory-level supervision remains valuable even after verifiable rewards already improve end accuracy. Among them, Groupwise Ranking Reward performs best by jointly ranking verifier-passed trajectories for the same prompt. It achieves the strongest balance between accuracy and faithfulness, improving average accuracy from 53.6% to 55.9% and reliability-conditioned accuracy from 47.4% to 54.7% over RLVR, while requiring less judge computation than pointwise generative rewards. We also find that groupwise ranking is more stable than pointwise GR, suggesting that shared within-prompt comparison provides a cleaner optimization signal.

## Limitations

Our study focuses on a specific phenomenon that becomes visible in answer\-verifiable RLVR settings: the model can receive reward for a correct final answer even when the reasoning trajectory does not actually support it\. This framing fits domains where correctness can be checked reliably, but it does not directly cover less verifiable generation settings without a single canonical answer, such as creative writing, free\-form visual dialogue, subjective caption revision, or brainstorming tasks\. In such settings, the same failure mode may not appear in the same explicit form, and both inconsistency measurement and reward design may need to be reformulated\. Extending the analysis to these less verifiable multimodal tasks is an important direction for future work\.

## References

- Bai et al\. \(2025\)Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, and 8 others\. 2025\.[Qwen2\.5\-VL technical report](https://arxiv.org/abs/2502.13923)\.*arXiv preprint arXiv:2502\.13923*\.
- Chen et al\. \(2025\)Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, and Xihui Liu\. 2025\.[Grpo\-care: Consistency\-aware reinforcement learning for multimodal reasoning](https://arxiv.org/abs/2506.16141)\.*arXiv preprint arXiv:2506\.16141*\.
- Cobbe et al\. \(2021\)Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman\. 2021\.[Training verifiers to solve math word problems](https://arxiv.org/abs/2110.14168)\.*arXiv preprint arXiv:2110\.14168*\.
- DeepSeek\-AI et al\. \(2025\)DeepSeek\-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others\. 2025\.[Deepseek\-R1: Incentivizing reasoning capability in LLMs via reinforcement learning](https://arxiv.org/abs/2501.12948)\.*arXiv preprint arXiv:2501\.12948*\.
- Gu et al\. \(2024\)Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, and 1 others\. 2024\.[A survey on llm\-as\-a\-judge](https://arxiv.org/abs/2411.15594)\.*arXiv preprint arXiv:2411\.15594*\.
- Huang et al\. \(2025\)Minbin Huang, Runhui Huang, Chuanyang Zheng, Jingyao Li, Guoxuan Chen, Han Shi, and Hong Cheng\. 2025\.[Answer\-consistent chain\-of\-thought reinforcement learning for multi\-modal large langauge models](https://arxiv.org/abs/2510.10104)\.*arXiv preprint arXiv:2510\.10104*\.
- Jia et al\. \(2025\)Mengzhao Jia, Zhihan Zhang, Ignacio Cases, Zheyuan Liu, Meng Jiang, and Peng Qi\. 2025\.[Autorubric\-r1v: Rubric\-based generative rewards for faithful multimodal reasoning](https://arxiv.org/abs/2510.14738)\.*arXiv preprint arXiv:2510\.14738*\.
- Kan et al\. \(2025\)Zhehan Kan, Yanlin Liu, Kun Yin, Xinghua Jiang, Xin Li, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun, Qingmin Liao, and Wenming Yang\. 2025\.[Taco: Think\-answer consistency for optimized long\-chain reasoning and efficient data learning via reinforcement learning in lvlms](https://arxiv.org/abs/2505.20777)\.*arXiv preprint arXiv:2505\.20777*\.
- Lightman et al\. \(2023\)Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe\. 2023\.[Let’s verify step by step](https://arxiv.org/abs/2305.20050)\.*arXiv preprint arXiv:2305\.20050*\.
- Lu et al\. \(2024\)Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai\-Wei Chang, Michel Galley, and Jianfeng Gao\. 2024\.[MathVista: Evaluating mathematical reasoning of foundation models in visual contexts](https://proceedings.iclr.cc/paper_files/paper/2024/hash/663bce02a0050c4a11f1eb8a7f1429d3-Abstract-Conference.html)\.In*The Twelfth International Conference on Learning Representations*\.
- Niu et al\. \(2026\)Wenzhe Niu, Wei He, Zongxia Xie, Jinpeng Ou, Huichuan Fan, Yuchen Ge, Yanru Sun, Ziyin Wang, Yizhao Sun, Chengshun Shi, Jiuchong Gao, Jinghua Hao, and Renqing He\. 2026\.[From absolute to relative: Rethinking reward shaping in group\-based reinforcement learning](https://arxiv.org/abs/2601.23058)\.*arXiv preprint arXiv:2601\.23058*\.
- Ong et al\. \(2025\)Brandon Ong, Tej Deep Pala, Vernon Toh, William Chandra Tjhi, and Soujanya Poria\. 2025\.Training vision\-language process reward models for test\-time scaling in multimodal reasoning: Key insights and lessons learned\.*arXiv preprint arXiv:2509\.23250*\.
- OpenAI \(2025\)OpenAI\. 2025\.[gpt\-oss\-120b & gpt\-oss\-20b model card](https://openai.com/index/gpt-oss-model-card/)\.OpenAI model card\.Accessed 2026\-03\-13\.
- Ouyang et al\. \(2022\)Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L\. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, and 1 others\. 2022\.[Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)\.*arXiv preprint arXiv:2203\.02155*\.
- Qiao et al\. \(2025\)Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma GongQue, Shanglin Lei, Yifan Zhang, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Xiao Zong, Yida Xu, Peiqing Yang, Zhimin Bao, Muxi Diao, Chen Li, and Honggang Zhang\. 2025\.[We\-Math: Does your large multimodal model achieve human\-like mathematical reasoning?](https://doi.org/10.18653/v1/2025.acl-long.983)In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 20023–20070\.
- Shao et al\. \(2024\)Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y\. K\. Li, Y\. Wu, and Daya Guo\. 2024\.[Deepseekmath: Pushing the limits of mathematical reasoning in open language models](https://arxiv.org/abs/2402.03300)\.*arXiv preprint arXiv:2402\.03300*\.
- Shi et al\. \(2024\)Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See\-Kiong Ng, Lidong Bing, and Roy Ka\-Wei Lee\. 2024\.[Math\-LLaVA: Bootstrapping mathematical reasoning for multimodal large language models](https://doi.org/10.18653/v1/2024.findings-emnlp.268)\.In*Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 4663–4680\.
- Su et al\. \(2025\)Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, and Dong Yu\. 2025\.[Crossing the reward bridge: Expanding rl with verifiable rewards across diverse domains](https://arxiv.org/abs/2503.23829)\.*arXiv preprint arXiv:2503\.23829*\.
- Uesato et al\. \(2022\)Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins\. 2022\.[Solving math word problems with process\- and outcome\-based feedback](https://arxiv.org/abs/2211.14275)\.*arXiv preprint arXiv:2211\.14275*\.
- Wang et al\. \(2025a\)Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen\. 2025a\.[VL\-Rethinker: Incentivizing self\-reflection of vision\-language models with reinforcement learning](https://arxiv.org/abs/2504.08837)\.*arXiv preprint arXiv:2504\.08837*\.
- Wang et al\. \(2025b\)Jiahao Wang, Weiye Xu, Aijun Yang, Wengang Zhou, Lewei Lu, Houqiang Li, Xiaohua Wang, and Jinguo Zhu\. 2025b\.[Enhancing the outcome reward\-based rl training of mllms with self\-consistency sampling](https://arxiv.org/abs/2511.10648)\.*arXiv preprint arXiv:2511\.10648*\.
- Wang et al\. \(2024\)Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li\. 2024\.[Measuring multimodal mathematical reasoning with MATH\-Vision dataset](https://arxiv.org/abs/2402.14804)\.*arXiv preprint arXiv:2402\.14804*\.
- Wang et al\. \(2025c\)Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, and Hongsheng Li\. 2025c\.[Mathcoder\-VL: Bridging vision and code for enhanced multimodal mathematical reasoning](https://doi.org/10.18653/v1/2025.findings-acl.128)\.In*Findings of the Association for Computational Linguistics: ACL 2025*, pages 2505–2534\.
- Wang et al\. \(2023\)Peiyi Wang, Lei Li, Zhihong Shao, R\. X\. Xu, Damai Dai, Yifei Li, Deli Chen, Y\. Wu, and Zhifang Sui\. 2023\.[Math\-shepherd: Verify and reinforce llms step\-by\-step without human annotations](https://arxiv.org/abs/2312.08935)\.*arXiv preprint arXiv:2312\.08935*\.
- Wang et al\. \(2025d\)Victor Wang, Michael JQ Zhang, and Eunsol Choi\. 2025d\.[Improving LLM\-as\-a\-judge inference with the judgment distribution](https://doi.org/10.18653/v1/2025.findings-emnlp.1259)\.In*Findings of the Association for Computational Linguistics: EMNLP 2025*, pages 23173–23199, Suzhou, China\. Association for Computational Linguistics\.
- Wang et al\. \(2025e\)Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, and 2 others\. 2025e\.[Internvl3\.5: Advancing open\-source multimodal models in versatility, reasoning, and efficiency](https://arxiv.org/abs/2508.18265)\.*arXiv preprint arXiv:2508\.18265*\.
- Wei et al\. \(2022\)Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou\. 2022\.[Chain\-of\-thought prompting elicits reasoning in large language models](https://arxiv.org/abs/2201.11903)\.*arXiv preprint arXiv:2201\.11903*\.
- Wei et al\. \(2025\)Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, and Vishal M\. Patel\. 2025\.[Open vision reasoner: Transferring linguistic cognitive behavior for visual reasoning](https://arxiv.org/abs/2507.05255)\.*arXiv preprint arXiv:2507\.05255*\.
- Yuan et al\. \(2024\)Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston\. 2024\.[Self\-rewarding language models](https://arxiv.org/abs/2401.10020)\.*arXiv preprint arXiv:2401\.10020*\.
- Yue et al\. \(2024a\)Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, and 3 others\. 2024a\.[MMMU: A massive multi\-discipline multimodal understanding and reasoning benchmark for expert AGI](https://cvpr.thecvf.com/virtual/2024/poster/31040)\.In*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*\.
- Yue et al\. \(2024b\)Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig\. 2024b\.[MMMU\-Pro: A more robust multi\-discipline multimodal understanding benchmark](https://arxiv.org/abs/2409.02813)\.*arXiv preprint arXiv:2409\.02813*\.
- Zhang et al\. \(2025a\)Jianghangfan Zhang, Yibo Yan, Kening Zheng, Xin Zou, Song Dai, and Xuming Hu\. 2025a\.[Gm\-prm: A generative multimodal process reward model for multimodal mathematical reasoning](https://arxiv.org/abs/2508.04088)\.*arXiv preprint arXiv:2508\.04088*\.
- Zhang et al\. \(2025b\)Kaichen Zhang, Keming Wu, Zuhao Yang, Bo Li, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, and Lidong Bing\. 2025b\.[Openmmreasoner: Pushing the frontiers for multimodal reasoning with an open and general recipe](https://arxiv.org/abs/2511.16334)\.*arXiv preprint arXiv:2511\.16334*\.
- Zhao et al\. \(2024\)Yang Zhao, Yixin Wang, and Mingzhang Yin\. 2024\.[Permutative preference alignment from listwise ranking of human judgments](https://arxiv.org/abs/2410.04346)\.*arXiv preprint arXiv:2410\.04346*\.
- Zheng et al\. \(2023\)Lianmin Zheng, Wei\-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P\. Xing, Hao Zhang, Joseph E\. Gonzalez, and Ion Stoica\. 2023\.[Judging LLM\-as\-a\-judge with MT\-bench and chatbot arena](https://arxiv.org/abs/2306.05685)\.*arXiv preprint arXiv:2306\.05685*\.

## Appendix A: Standard GRPO Objective

For completeness, we summarize the standard GRPO objective used in DeepSeekMath and later RLVR systems (Shao et al., 2024; DeepSeek-AI et al., 2025). Given rollout rewards $\{\mathcal{R}_{i}\}_{i=1}^{N}$ for one sampled group, GRPO first standardizes them within the prompt:

$$A_{i}=\frac{\mathcal{R}_{i}-\bar{\mathcal{R}}}{\operatorname{std}(\{\mathcal{R}_{j}\}_{j=1}^{N})+\epsilon_{A}},\qquad \bar{\mathcal{R}}=\frac{1}{N}\sum_{j=1}^{N}\mathcal{R}_{j}, \qquad (9)$$

where $\epsilon_{A}$ is a small constant for numerical stability. Let

$$\rho_{i}(\theta)=\frac{\pi_{\theta}(y_{i}\mid x)}{\pi_{\theta_{\text{old}}}(y_{i}\mid x)}. \qquad (10)$$

The clipped GRPO surrogate is

$$\mathcal{L}_{\text{GRPO}}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\,\{y_{i}\}_{i=1}^{N}\sim\pi_{\theta_{\text{old}}}(\cdot\mid x)}\Big[\frac{1}{N}\sum_{i=1}^{N}\ell_{\text{clip}}(\theta;x,y_{i},A_{i})-\beta\,\mathbb{D}_{\mathrm{KL}}\big(\pi_{\theta}(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big], \qquad (11)$$

where

$$\ell_{\text{clip}}(\theta;x,y_{i},A_{i})=\min\big(\rho_{i}(\theta)A_{i},\ \operatorname{clip}(\rho_{i}(\theta),1-\epsilon,1+\epsilon)\,A_{i}\big). \qquad (12)$$

Our method does not modify this optimizer; it only changes how $\mathcal{R}_{i}$ is constructed.
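A one-function sketch of the within-group standardization in Eq. (9), assuming NumPy arrays of rollout rewards; the clipped surrogate in Eqs. (11)-(12) is handled by the RL framework and is not reimplemented here.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # A_i = (R_i - mean(R)) / (std(R) + eps_A), Eq. (9):
    # standardize the rollout rewards of one prompt group.
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```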

## Appendix B: More Details of Groupwise Ranking Reward

##### Groupwise Ranking Reward\.

For the same prompt, let $I^{+}=\{i: r_{\text{ver}}(y_{i})=1\}$ be the verifier-passed set and let $K=|I^{+}|$. The groupwise judge reads all textualized candidates $\{t_{i}\}_{i\in I^{+}}$ together and returns ordered tie-aware tiers $(\tau_{1},\dots,\tau_{M})$. We then apply exactly the tie-aware pairwise-win-rate mapping and within-group centering defined in Section 3.2; if $K<2$, the auxiliary reward is set to zero. This is only a post-processing of the same $K$-way ranking, not a separate round of pairwise judge calls. A candidate is treated as beating every lower tier, tying candidates in its own tier, and losing to every higher tier. For example, if $K=4$ and the judge outputs $\tau_{1}=\{A\}$, $\tau_{2}=\{B,C\}$, and $\tau_{3}=\{D\}$, then the raw scores before centering are $\tilde{s}_{A}=1$, $\tilde{s}_{B}=\tilde{s}_{C}=0.5$, and $\tilde{s}_{D}=0$. If all verifier-passed rollouts tie, the centered reward becomes zero for all of them.
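As a quick check, the hypothetical `groupwise_ranking_reward` helper sketched after Eq. (7) reproduces this worked example:

```python
# Reusing the hypothetical groupwise_ranking_reward helper sketched after Eq. (7).
tiers = [["A"], ["B", "C"], ["D"]]           # tau_1 > tau_2 > tau_3, K = 4
print(groupwise_ranking_reward(tiers))
# Raw scores 1.0, 0.5, 0.5, 0.0 have mean 0.5, so the centered rewards are:
# {'A': 0.5, 'B': 0.0, 'C': 0.0, 'D': -0.5}
```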

### B.1 Reward Judge Instructions

To make the controlled comparison concrete, we summarize the prompt families used by the three trajectory-supervision variants. The PRM follows a visual stepwise process-reward setup, while the two GR-based variants use the same gpt-oss-20b judge backbone with different user instructions.

##### PRM instruction\.

For each trajectory step, the PRM uses the following prompt pair and asks the model to emit exactly one token for the current step:

PRM Prompt
System: You are a process reward model. Given the current reasoning step and prior context, output exactly one token: + or -. Use + only if the current step is valid and consistent with the question/context.
User:
Question: {question}
Solution Process: {step}

When the trajectory has multiple steps, previous steps are replayed as earlier user turns paired with positive assistant replies, and the current step is queried last. The rollout-level PRM reward is then obtained by averaging the per-step rewards.
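A minimal sketch of this replay-and-average scheme is below. The message layout, the `PRM_SYSTEM` constant standing in for the system prompt above, and the mapping of `+`/`-` tokens to 1/0 step rewards are illustrative assumptions.

```python
PRM_SYSTEM = "You are a process reward model. ..."  # system prompt shown above (abridged)

def prm_messages(question, steps):
    """Multi-turn PRM query: earlier steps are replayed as user turns answered '+',
    and the current (last) step is queried at the end."""
    msgs = [{"role": "system", "content": PRM_SYSTEM}]
    for prev in steps[:-1]:
        msgs.append({"role": "user", "content": f"Question\n{question}\nSolution Process\n{prev}"})
        msgs.append({"role": "assistant", "content": "+"})
    msgs.append({"role": "user", "content": f"Question\n{question}\nSolution Process\n{steps[-1]}"})
    return msgs

def rollout_prm_reward(step_labels):
    """Average per-step PRM outputs into one rollout-level reward ('+' -> 1, '-' -> 0)."""
    vals = [1.0 if lab == "+" else 0.0 for lab in step_labels]
    return sum(vals) / len(vals) if vals else 0.0
```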

##### Pointwise GR instruction.

Pointwise GR uses an instruction that asks the judge to score one textualized trajectory independently:

Pointwise GR Instruction

You are an expert evaluator for reasoning quality. Evaluate the assistant response based on reasoning-quality assessment.

Scoring Rubric. Score the overall reasoning quality on a scalar in [0, 1], where:
- 1.0: clear, logically coherent, and mathematically sound derivation.
- 0.8: mostly solid reasoning with only minor slips that do not break the core logic.
- 0.5: partially valid reasoning; noticeable gaps or mistakes reduce reliability.
- 0.2: largely flawed reasoning with major logical/calculation issues.
- 0.0: reasoning is missing, nonsensical, or clearly fails to solve the task.

Evaluation Guidance. When assigning the single score, consider the overall logical coherence of the steps, correct use of given conditions or visual facts, calculation and transformation quality, clarity/completeness of the conclusion, and whether uncertainty is handled honestly without unsupported guessing.

Note. The original problem may include image input, but you cannot access the image. If the reasoning references information that could plausibly come from the image and is not contradicted by available text, do not treat that as an error by itself; score based on textual reasoning quality.

Output Schema. Return JSON only with this schema:
{"reasoning_feedback": "<short explanation of strengths, weaknesses, and key issues>", "judge_score": <float between 0 and 1>}
Generate "reasoning_feedback" first and place "judge_score" last. If some information is missing, still provide the best-effort score from available text.

Problem: {problem}
Response: {response}
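Because the judge is instructed to return JSON only, the pointwise reward reduces to extracting `judge_score` from the reply. The sketch below assumes that schema; the regex fallback and the default value for malformed output are added for robustness and are not described in the paper.

```python
import json
import re

def parse_pointwise_score(judge_output, default=0.0):
    """Extract the scalar "judge_score" from the pointwise judge's JSON reply."""
    try:
        return float(json.loads(judge_output)["judge_score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Fallback in case the judge wrapped the JSON in extra text (assumption).
        m = re.search(r'"judge_score"\s*:\s*([01](?:\.\d+)?)', judge_output)
        return float(m.group(1)) if m else default
```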

##### Groupwise Ranking Reward instruction.

The groupwise variant uses the same judge backbone but replaces scalar scoring with a tie-aware ranking instruction over all verifier-passed candidates for the same prompt:

Groupwise Ranking Instruction

You are ranking multiple candidate solutions to the same geometry problem.

Problem: {problem}
Reference Answer: {reference_answer}
Candidate Solutions: {candidate_solutions}

Ranking Principles
1. Audit the reasoning steps, not just the final answer; flag leaps, contradictions, or missing justification.
2. Prioritise solutions whose final answers align with the reference answer.
3. If solutions are equally correct and use materially similar reasoning, give them the same rank; do not force an ordering without a clear qualitative difference. If all are equally correct, all ranks must be 1.
4. When reasoning differs, choose the mathematically valid, well-supported argument even if multiple solutions reach the same final answer.
5. Penalise calculation mistakes, invalid assumptions, or logical gaps relative to error-free solutions.
6. Keep the comparison focused on mathematical correctness and clarity.

Output Format
Return strict JSON with no commentary:
{"solutions": [{"index": 1, "rank": 1, "justification": "Short comparison that explains the placement.", "agreement_with_reference": "match" | "different" | "unknown", "errors": ["optional list of key mistakes"]}, ...]}

Rules
- Lower rank numbers correspond to better solutions.
- Use the same rank for tied solutions that should be treated equally.
- Include every candidate exactly once.
- If you assign the same rank, the numeric rank fields must be identical (e.g., all 1s for a full tie); do *not* output 1, 2, 3, ... when you describe them as equal.
- Keep the justification concise, citing key reasoning differences or errors that justify the rank.
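The strict-JSON rank output maps directly onto the tier representation used earlier in this appendix. A small conversion sketch follows, assuming the judge obeyed the schema (identical rank fields for ties, every candidate listed exactly once); the helper name is illustrative.

```python
import json

def tiers_from_judge_json(judge_output):
    """Group the judge's ranked "solutions" into ordered tiers (best rank first),
    in the format consumed by groupwise_ranking_reward above."""
    ranked = json.loads(judge_output)["solutions"]
    by_rank = {}
    for item in ranked:
        by_rank.setdefault(int(item["rank"]), []).append(item["index"])
    return [by_rank[r] for r in sorted(by_rank)]

# e.g. ranks 1, 2, 2, 3 over candidate indices 1..4 -> [[1], [2, 3], [4]]
```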

## Appendix C Inconsistency Judge Design and Reliability

##### Qualitative manual inspection and observed failure modes.

We further conducted targeted manual inspection of the inconsistency judgments across benchmarks to understand the judge's remaining failure cases. These inspections suggest that the problematic cases concentrate in a small number of recurring patterns rather than arbitrary misclassifications. The first is *image-grounded option semantics*. Some multiple-choice problems place the semantic content of the options in the image itself, while the textual prompt only exposes letters such as A/B/C/D. In those cases, a text-only judge may see that the reasoning discusses a concrete visual choice while the boxed answer contains only a letter, and it can incorrectly flag the sample as inconsistent because the letter-to-content mapping is visually hidden. The second is *implicit answer revision*. Some responses mention an early provisional answer, then later switch to a different boxed answer with only a terse correction or an incomplete explanation. Humans often read the later resolution as the model's true conclusion, but a strict extractor may attach to the earlier explicit statement and mark the sample inconsistent. These observations clarify the main residual gap between automatic judging and human inspection, and suggest that future improvements should focus on better handling visually grounded option mappings and late-stage answer revisions.

## Appendix D Detailed Controlled Comparison

Table [2](https://arxiv.org/html/2604.18892#A4.T2) provides the full version of the main controlled comparison. Each cell reports the three-number breakdown used throughout the paper: RC-Acc, Acc, and CBIR. We place this detailed table in the appendix so the main text can emphasize the faithfulness-aware metric directly, while still making the complete accuracy-versus-inconsistency trade-off explicit.

Table 2: Detailed controlled comparison at global step 140. Each cell reports Strict Accuracy [Standard Accuracy / Correct-but-Inconsistent Rate]. We bold the best Strict Accuracy in each column and underline the second best.

| Method | Avg. | MathVision | MathVista | MMMU | MMMU-Pro | WeMath |
| --- | --- | --- | --- | --- | --- | --- |
| *Reference checkpoints* | | | | | | |
| Qwen2.5-VL-7B-IT | 47.3 [49.0 / 1.7] | 24.8 [26.2 / 1.4] | 67.2 [68.6 / 1.4] | 52.6 [55.0 / 2.4] | 35.2 [37.4 / 2.2] | 56.6 [58.0 / 1.4] |
| RLVR | 47.4 [53.6 / 6.2] | 22.8 [28.6 / 5.8] | 70.8 [75.6 / 4.8] | 48.2 [55.8 / 7.6] | 36.2 [40.4 / 4.2] | 58.8 [67.4 / 8.6] |
| *Controlled comparison with different reward methods* | | | | | | |
| w/ PRM | 46.8 [49.1 / 2.3] | 26.4 [26.8 / 0.4] | 66.4 [70.0 / 3.6] | 50.0 [51.6 / 1.6] | 30.6 [35.2 / 4.6] | 60.8 [61.8 / 1.0] |
| w/ Pointwise GR | <u>52.3</u> [53.8 / 1.5] | <u>31.0</u> [32.4 / 1.4] | <u>71.0</u> [72.2 / 1.2] | <u>56.6</u> [58.0 / 1.4] | **39.0** [40.0 / 1.0] | <u>63.8</u> [66.2 / 2.4] |
| w/ Groupwise Ranking Reward | **54.7** [55.9 / 1.2] | **31.4** [33.2 / 1.8] | **74.8** [76.4 / 1.6] | **57.8** [58.6 / 0.8] | <u>38.8</u> [39.6 / 0.8] | **70.8** [71.6 / 0.8] |
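For readers cross-checking the breakdown, every row above is consistent with reading the bracketed pair as a decomposition of the leading number, i.e. Strict Accuracy equals Standard Accuracy minus the Correct-but-Inconsistent Rate; for example, the RLVR average:

$$
\mathrm{RC\text{-}Acc}=\mathrm{Acc}-\mathrm{CBIR}:\qquad 53.6\%-6.2\%=47.4\%.
$$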
## Appendix E More Results on Factors Affecting Inconsistency

### E.1 Question Type and Training Objective

The five-benchmark evaluation suite is not balanced by question type, so raw inconsistency counts are hard to interpret. Across the 2,500 evaluated problems, 1,990 are multiple-choice and 510 are open-ended. The open-ended subset appears only in MathVision (247), MathVista (233), and a small 30-example subset of MMMU, while MMMU-Pro and WeMath are entirely multiple-choice. We therefore normalize every rate in this subsection by the total number of examples within the same question type rather than by the global benchmark total.
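As a small illustration of this normalization (the helper name and the back-computed count are illustrative), each rate divides a flagged count by the size of the same question-type pool rather than by the 2,500-example total:

```python
# Question-type pools from the text: 1,990 multiple-choice and 510 open-ended examples.
POOL = {"multiple_choice": 1990, "open_ended": 510}

def type_normalized_rate(num_flagged, qtype):
    """Share of examples of one question type that are flagged, in percent."""
    return 100.0 * num_flagged / POOL[qtype]

# e.g. 117 flagged multiple-choice examples -> ~5.88%, matching the before-training
# IncR reported below (the count 117 is back-computed from that percentage).
print(round(type_normalized_rate(117, "multiple_choice"), 2))
```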

Table 3: Question-type breakdown under three training settings. Every percentage is normalized by the total number of examples within the same question type, not by the global benchmark total.

![Refer to caption](https://arxiv.org/html/2604.18892v1/x5.png)

Figure 5: Question-type trade-off trajectories under three training settings. Color denotes question type and marker denotes training setup. For each question type, both trained checkpoints are connected back to the shared before-training model, so the figure shows how RLVR and w/ Groupwise Ranking Reward move the model in strict-accuracy versus inconsistency space. The panel plots StrictAcc against Inconsistent Rate, where Inconsistent Rate denotes the correct-but-inconsistent rate. Every percentage is normalized within the same question type.

Training objective matters more than the raw question-type split. Table [3](https://arxiv.org/html/2604.18892#A5.T3) and Figure [5](https://arxiv.org/html/2604.18892#A5.F5) show that the low open-ended inconsistency of the reference checkpoint does not survive outcome-only RL. Before training, IncR is 5.88% for multiple-choice and 1.76% for open-ended. Under RLVR, these rates jump to 20.40% and 14.71%, and CBIR rises to 6.98% and 3.14%. Question type still changes the absolute level, but the dominant effect is the reward design: outcome-only training pushes both question types toward much larger reasoning-answer mismatch.

Groupwise Ranking Reward recovers faithfulness without giving up answer gains. On multiple-choice problems, Groupwise Ranking Reward improves StdAcc from 51.76% to 58.74% while driving IncR below the starting checkpoint, from 5.88% to 3.12%. The open-ended comparison is even cleaner: RLVR and Groupwise Ranking Reward reach the same StdAcc of 44.71%, yet Groupwise Ranking Reward cuts IncR from 14.71% to 2.75% and CBIR from 3.14% to 0.39%, which lifts StrictAcc from 41.57% to 44.31%. In other words, the open-ended improvement comes almost entirely from better faithfulness rather than better Standard Accuracy.

The qualitative failure mode still differs by question type. For multiple-choice problems, the dominant pattern is still option-selection drift: the reasoning converges to one content answer, but the final boxed output switches to a different option letter or to a nearest-choice guess. For open-ended problems, the mismatch is more often a late revision or answer-type collapse: the model derives one free-form result and then boxes a different scalar, count, or endpoint without a new derivation. These mechanisms differ, but both are strongly amplified by RLVR and sharply reduced by Groupwise Ranking Reward.

### E.2 Training Data Distribution: ViRL vs. GSM8K

A second factor is the RL training data distribution. To study it, we compare two runs that start from the same initialization and are evaluated with the same five-benchmark multimodal inconsistency pipeline, but are trained on two very different data sources: a multimodal image-grounded RL corpus, ViRL (Wang et al., [2025a](https://arxiv.org/html/2604.18892#bib.bib20)), and GSM8K, a pure text arithmetic dataset (Cobbe et al., [2021](https://arxiv.org/html/2604.18892#bib.bib3)). Both runs use the same GRPO training framework, but the underlying task distributions and correctness signals are not identical across the two datasets. We therefore present this as an empirical cross-dataset comparison rather than a clean modality-only causal ablation.

![Refer to caption](https://arxiv.org/html/2604.18892v1/x6.png)

Figure 6: Average evaluation dynamics when the same RL framework is trained on text-only GSM8K or multimodal ViRL. Both runs start from the same initialization and are evaluated with the same five-benchmark multimodal inconsistency pipeline. ViRL reaches higher answer accuracy, but its inconsistency grows much faster during training.

Figure [6](https://arxiv.org/html/2604.18892#A5.F6) shows that GSM8K keeps inconsistency low while still improving multimodal evaluation. Relative to the shared initialization, training on GSM8K improves average Standard Accuracy from 49.12% to 51.96% and Strict Accuracy from 47.20% to 50.72%. At the same time, overall inconsistency drops from 5.04% to 3.44%, and the correct-but-inconsistent rate drops from 1.92% to 1.24%. Across training, the inconsistency curves remain in a relatively tight band and stay well below the shared initialization. This means that text-only RL data can still transfer useful reasoning discipline to multimodal evaluation, even though it does not directly train image-grounded skills.

ViRL offers higher upside, but the faithfulness trade-off is much less stable. On the same evaluation pipeline, ViRL reaches higher answer accuracy and a stronger best RC-Acc than GSM8K. However, this gain is accompanied by substantially higher inconsistency, and the later part of training degrades further rather than remaining stable. At the end of training, ViRL still keeps Standard Accuracy above the shared initialization at 52.88%, but overall inconsistency rises to 8.24% and RC-Acc falls to 49.88%. The same pattern appears broadly across benchmarks rather than in one isolated dataset: at the final checkpoints, ViRL has higher inconsistency than GSM8K on all five evaluation sets, for example 10.0% vs. 3.0% on MathVision and 7.8% vs. 3.2% on WeMath.

The training logs support the same asymmetry. We also inspect the judge-flagged inconsistent samples saved from training-time logs. On GSM8K, these cases are nearly absent: only 0.0%-0.4% of the 500 inspected rollouts per checkpoint are flagged inconsistent, and high-score inconsistent samples never exceed 0.2%. On ViRL, the corresponding numbers are much larger, ranging from 6.2% to 10.8% for all flagged inconsistencies and from 2.4% to 5.6% for inconsistent samples with score > 0.5. This does not prove that multimodality alone causes inconsistency, but it does show that the ViRL training distribution exposes the policy to many more internally mismatched yet nontrivially rewarded rollouts, which is consistent with the evaluation-side drift in Figure [6](https://arxiv.org/html/2604.18892#A5.F6).

Overall, this comparison suggests that the training data distribution has a first-order effect on faithfulness dynamics under RL. GSM8K behaves like a conservative regularizer: it transfers some answer-format and reasoning discipline, keeps inconsistency low, and improves Strict Accuracy steadily, but its multimodal accuracy ceiling is limited. ViRL has a higher ceiling on multimodal tasks, yet it also produces a much stronger late-stage inconsistency drift. The practical implication is not that text-only data is universally preferable, but that multimodal RL appears to need stronger inconsistency control if we want to keep its accuracy gains from turning into reasoning-answer mismatch.
