Improving Multimodal Reasoning via Worst Dimension Optimization

arXiv cs.AI Papers

Summary

This paper introduces Multimodal Multi-Dimensional Scalarization Process Reward Modeling (MMS-PRM), which enforces the worst dimension's robustness in multimodal reasoning to prevent failures like visual hallucinations from being masked by strong text logic.

arXiv:2606.07801v1 Announce Type: new

Cached at: 06/09/26, 08:53 AM

# Improving Multimodal Reasoning via Worst Dimension Optimization
Source: [https://arxiv.org/html/2606.07801](https://arxiv.org/html/2606.07801)
Huaping Zhang, Qiuchi Li, Lei Li & Chunxiao Gao, Beijing Institute of Technology, {3120255822, kevinzhang, liqiuchi, lilei, gao\_chunxiao}@bit.edu.cn. Corresponding author.

###### Abstract

Multimodal reasoning requires a path that retains integrity over a wide range of constraints, from visual grounding to logical consistency. However, current Process Reward Models (PRMs) rely on heuristically defined rewards that weigh these factors equally, so dominant factors can conceal failures in an individual dimension (such as visual hallucinations), and the validity of the overall reasoning process is not guaranteed. To overcome this limitation, the paper proposes Multimodal Multi-Dimensional Scalarization Process Reward Modeling (MMS-PRM), a paradigm specifically developed to enforce the worst dimension's robustness in multimodal reasoning. Specifically, a hierarchical fine-grained reward space is developed to represent the multimodal risks in reasoning tasks, and a Chebyshev-based Monte Carlo Tree Search (MCTS) algorithm is introduced in which path search gives primary focus to the worst-performing dimension. Moreover, a curriculum-based Direct Preference Optimization (DPO) approach is developed to gradually learn balanced reasoning skills in the policy. The experimental results show that, without the dimension collapse issue, MMS-PRM significantly improves the reliability of multimodal reasoning and reaches competitive results on various challenging tasks. The code is available at [https://github.com/leibniz-Man/MMS-PRM](https://github.com/leibniz-Man/MMS-PRM).

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.07801v1/x1.png)

Figure 1: Comparison between existing methods and our method.

Multimodal Large Language Models (MLLMs) perform well on complex reasoning tasks involving mathematical diagrams and scientific figures. Unlike pure-text reasoning, multimodal reasoning must satisfy multiple constraints simultaneously, including visual grounding and logical correctness, and a violation of any single one invalidates the entire reasoning trajectory.

To supervise such processes, prior work introduces Process Reward Models (PRMs) that provide step-level feedback. However, existing PRMs Wang et al. ([2025](https://arxiv.org/html/2606.07801#bib.bib47)); Luo et al. ([2025](https://arxiv.org/html/2606.07801#bib.bib48)); Ong et al. ([2025](https://arxiv.org/html/2606.07801#bib.bib49)); Gao et al. ([2025](https://arxiv.org/html/2606.07801#bib.bib50)) typically collapse multiple quality dimensions into a single scalar reward. This allows strong performance on some dimensions to compensate for poor performance on others, so incorrect reasoning paths can be reinforced.

The problem caused by this compensation mechanism is even more severe in multimodal reasoning tasks. As shown in Figure 1, a reasoning chain can be mathematically well organized while relying on hallucinated visual relations. Because the text logic is well organized, the scalar PRM will normally assign the chain a high confidence score without penalizing the factual error. This failure exposes a severe weakness in the current averaging approach to reward design.

These findings lead to the primary tenet of this paper: *the integrity of an active multimodal reasoning trajectory is measured not by average quality but by the worst active dimension.* A reasoning step that is valid in its logic but ungrounded in the image is inherently invalid and should not be reinforced. Therefore, successful multimodal alignment should progress from maximizing expected rewards to non-compensatory objectives that penalize worst-dimension degeneration.

To operationalize this principle, we propose MMS\-PRM, a search\-enhanced, multi\-dimensional process reward framework for multimodal reasoning\. We reformulate multimodal process rewarding as a multi\-objective trajectory optimization problem, where the quality of each reasoning step and consequently the whole reasoning trajectory is governed by its weakest relevant dimension rather than an aggregated score\. First, we construct a hierarchical, fine\-grained reward space that decomposes multimodal reasoning quality into interpretable dimensions and sub\-dimensions, allowing rewards to be dynamically activated based on the evolving reasoning context\. Second, we introduce a Chebyshev\-guided Monte Carlo Tree Search \(MCTS\) that explicitly prioritizes the worst\-performing reward dimension during trajectory exploration, preventing compensation across conflicting criteria and promoting balanced reasoning paths\. Finally, we integrate a curriculum\-style Direct Preference Optimization \(DPO\) strategy that progressively trains MLLMs with our obtained balanced reasoning trajectories, from short, high\-confidence trajectories to long\-horizon multimodal reasoning chains\.

Through this closed\-loop framework, MMS\-PRM enforces balance between visual grounding, logical coherence, and semantic correctness at both step and trajectory levels\. Extensive experiments on diverse and challenging multimodal reasoning benchmarks demonstrate that MMS\-PRM significantly improves reasoning reliability and robustness, particularly on long\-horizon and visually demanding tasks\.

The contributions of this work are as follows:

- We identify a fundamental limitation of scalar process rewards in multimodal reasoning and reformulate process reward modeling as a non-compensatory, multi-dimensional trajectory optimization problem.
- We propose a worst-dimension-aware search framework that combines hierarchical process rewards with Chebyshev-scalarized MCTS to enforce balanced multimodal reasoning.
- We introduce a curriculum-based preference alignment strategy that effectively transfers search-discovered reasoning behaviors into the model, achieving strong performance and generalization across benchmarks.

## 2 Related Work

VLM Reasoning. As VLMs continue to be used for increasingly complex tasks in mathematics and science Yue et al. ([2024](https://arxiv.org/html/2606.07801#bib.bib9)); Yao et al. ([2025](https://arxiv.org/html/2606.07801#bib.bib55)); Chen et al. ([2025](https://arxiv.org/html/2606.07801#bib.bib64)); Yan et al. ([2024](https://arxiv.org/html/2606.07801#bib.bib63)); Zhang et al. ([2026](https://arxiv.org/html/2606.07801#bib.bib62)); Guan et al. ([2025b](https://arxiv.org/html/2606.07801#bib.bib61)); Liu et al. ([2024c](https://arxiv.org/html/2606.07801#bib.bib60)), the need for improving their reasoning ability has become crucial. Previous studies align visual regions with reasoning steps Shao et al. ([2024](https://arxiv.org/html/2606.07801#bib.bib46)); Yan et al. ([2026](https://arxiv.org/html/2606.07801#bib.bib59)); Jia et al. ([2026](https://arxiv.org/html/2606.07801#bib.bib58)); Cai et al. ([2025b](https://arxiv.org/html/2606.07801#bib.bib57), [a](https://arxiv.org/html/2606.07801#bib.bib56)), or decompose long-chain reasoning via multi-agent frameworks Dong et al. ([2025b](https://arxiv.org/html/2606.07801#bib.bib14)); Shi et al. ([2026](https://arxiv.org/html/2606.07801#bib.bib53)); Li et al. ([2026](https://arxiv.org/html/2606.07801#bib.bib52), [2025](https://arxiv.org/html/2606.07801#bib.bib54)); Li ([2024](https://arxiv.org/html/2606.07801#bib.bib51)). These advances demonstrate the importance of modeling intermediate reasoning steps Zhang et al. ([2025b](https://arxiv.org/html/2606.07801#bib.bib15)). Nevertheless, the majority of previous studies use coarse-grained reasoning supervision Li et al. ([2024](https://arxiv.org/html/2606.07801#bib.bib16)); Shao et al. ([2024](https://arxiv.org/html/2606.07801#bib.bib46)); Li ([2024](https://arxiv.org/html/2606.07801#bib.bib51)). In contrast, our work introduces a fine-grained, structured reasoning paradigm to more precisely enhance VLM reasoning.

Process Reward Model. As LLMs address more complex tasks, they must reason over longer trajectories, which makes an Outcome Reward Model (ORM) inadequate for checking the coherence and correctness of reasoning Snell et al. ([2024](https://arxiv.org/html/2606.07801#bib.bib18)); Luo et al. ([2024](https://arxiv.org/html/2606.07801#bib.bib19)); Cai et al. ([2025a](https://arxiv.org/html/2606.07801#bib.bib56)). Early PRM work focuses on objective domains such as mathematics, where intermediate steps admit clear correctness criteria, either by directly modeling step-wise correctness Wang et al. ([2024a](https://arxiv.org/html/2606.07801#bib.bib24)) or estimating its likelihood Wang et al. ([2024d](https://arxiv.org/html/2606.07801#bib.bib25)); Guan et al. ([2025a](https://arxiv.org/html/2606.07801#bib.bib20)). Recent studies extend PRMs to multimodal reasoning. VisualPRM Wang et al. ([2025](https://arxiv.org/html/2606.07801#bib.bib47)) aligns intermediate reasoning with visual evidence via step-level rewards, URSA Luo et al. ([2025](https://arxiv.org/html/2606.07801#bib.bib48)) integrates process supervision into policy learning through reinforcement learning, while VL-PRM300K Ong et al. ([2025](https://arxiv.org/html/2606.07801#bib.bib49)) and SVIP Gao et al. ([2025](https://arxiv.org/html/2606.07801#bib.bib50)) enable detailed multimodal supervision with large-scale annotated datasets and visual programming. The current state of the art in multimodal PRMs is based on scalarized rewards, which may hide important dimensions. To address this issue, we introduce MMS-PRM, which represents multimodal process-level supervision in a multi-dimensional reward space to capture the quality of reasoning more accurately.

Tree-based Search in LLMs. Tree structures have demonstrated significant potential in language models Qi et al. ([2025](https://arxiv.org/html/2606.07801#bib.bib4)); Wu et al. ([2024](https://arxiv.org/html/2606.07801#bib.bib5)). Recent efforts explore applying these tree search methods to identify effective reasoning paths for MLLMs. AR-MCTS Dong et al. ([2025a](https://arxiv.org/html/2606.07801#bib.bib6)) enhances multimodal reasoning by integrating MCTS with active retrieval, but its high computational overhead and extensive iterations limit its practicality. Similarly, Mulberry Yao et al. ([2026](https://arxiv.org/html/2606.07801#bib.bib7)) distills 260K long-chain reasoning samples via tree structures from powerful models like GPT-4o, but relies heavily on resource-intensive teacher models.

## 3 Method

Multimodal reasoning differs from pure-text reasoning in that the model must keep every intermediate step visually grounded, logically coherent, and eventually answer-correct. This requires supervising both single-step quality and the dynamics from steps to the full trajectory. Simple final-answer supervision may overlook intermediate reasoning errors, while only step-level supervision may produce locally sound steps that fail to compose into a globally valid reasoning chain. To cope with this, we build a closed-loop alignment framework consisting of three parts: (1) a hierarchical, fine-grained reward space that models multimodal reasoning quality at the single-step level via multi-dimensional rewards; (2) a reward-guided Chebyshev MCTS that dynamically searches for trajectories balancing all activated reward dimensions; and (3) a curriculum-style DPO that progressively aligns the MLLM policy from easy/short chains to hard/long chains. The whole pipeline is shown in Figure [2](https://arxiv.org/html/2606.07801#S3.F2).

![Refer to caption](https://arxiv.org/html/2606.07801v1/x2.png)

Figure 2: Overview of MMS-PRM. The framework consists of three components: (a) a hierarchical, fine-grained reward space that decomposes multimodal reasoning quality into interpretable dimensions and assigns step-level rewards; (b) a Chebyshev-guided Monte Carlo Tree Search (MCTS) that explores reasoning trajectories by explicitly prioritizing the worst-performing dimension; and (c) a curriculum-style Direct Preference Optimization (DPO) that transfers balanced reasoning behaviors discovered by search into the policy. Together, these components form a closed-loop process that enforces non-compensatory, balanced multimodal reasoning.

### 3.1 Hierarchical Fine-Grained Reward Space

The first component turns raw model outputs into reusable, structured reward dimensions\.

##### Candidate Criteria Generation\.

For each instance in the training set, we derive criteria based on the reasoning steps. Given an input $x$ and a reasoning sequence $\hat{y}$ produced by model sampling, we apply an automated analyzer $J(\cdot)$, typically instantiated as a vision-language model, to extract at most 5 relevant criteria:

$$J(x,\hat{y})=\{c_{1},c_{2},\dots\}. \tag{1}$$

These criteria assess the reasoning quality, addressing various factors such as visual alignment, semantic correctness, logical consistency, stepwise coherence, and conciseness.

##### Embedding and Clustering\.

The gathered criteria are embedded into a $d$-dimensional vector space:

$$V(c_{i})=[v^{(1)}_{i},\dots,v^{(d)}_{i}], \tag{2}$$

where $d$ is the embedding dimensionality. Hierarchical clustering is performed to group similar criteria together, halting when the similarity between criteria drops below a certain threshold $\xi$. This results in a hierarchical reward structure $\mathcal{H}=\{\mathcal{H}_{1},\mathcal{H}_{2},\dots\}$, where higher levels are broad (e.g., "visually grounded") and lower levels are more specific (e.g., "accurately references bar #3").
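The threshold-stopped grouping described above can be illustrated with a minimal sketch. It is not the authors' implementation: the paper reports using BIRCH over 4096-dimensional embeddings, whereas this toy version uses greedy centroid-linkage merging on hand-written 2-D vectors. The function names `cosine_sim`, `centroid`, and `cluster_criteria` are hypothetical.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(cluster, vectors):
    """Mean vector of the criteria whose indices are in `cluster`."""
    dim = len(vectors[0])
    return [sum(vectors[i][k] for i in cluster) / len(cluster) for k in range(dim)]


def cluster_criteria(vectors, xi):
    """Greedily merge the two most similar clusters until the best
    similarity falls below the threshold xi (assumed stopping rule)."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best, best_pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cosine_sim(centroid(clusters[a], vectors),
                               centroid(clusters[b], vectors))
                if s > best:
                    best, best_pair = s, (a, b)
        if best < xi:
            break  # remaining clusters are too dissimilar to merge
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy embeddings: two "visual" criteria and two "logical" criteria.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = cluster_criteria(toy, xi=0.9)
```

Each merge level corresponds to one layer of the hierarchy; recording the merge order, rather than only the final partition, would recover the broad-to-specific structure.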

##### Stepwise Reward Allocation\.

The dynamic process of reward assignment employs a reward tree $T$ to assign stepwise reward signals. For each step $\hat{y}(t)$ in the $n$-step reasoning trajectory $\hat{y}=\{\hat{y}(1),\hat{y}(2),\dots,\hat{y}(n)\}$, the corresponding reward is allocated dynamically.

Specifically, for the $t$-th step $\hat{y}(t)$, a selection function first chooses a coarse-grained parent reward node $r_{\text{parent}}$ from the top level of the reward tree $T$, representing the primary risk dimension to be evaluated at this step. Conditioned on $\hat{y}(t)$ and $r_{\text{parent}}$, an analysis function $\Phi$ determines whether finer-grained evaluation is required, and if so, generates a set of candidate textual evaluation criteria:

$$\mathcal{C}_{t}=\Phi(\hat{y}(t),r_{\text{parent}}),\quad\mathcal{C}_{t}=\{c_{t}^{1},c_{t}^{2},\dots,c_{t}^{K_{t}}\}. \tag{3}$$

Each criterion $c_{t}^{i}\in\mathcal{C}_{t}$ is embedded into a $d$-dimensional semantic space using the embedding function $V(\cdot)$. To filter out criteria that are weakly related to, or semantically drifting away from, the selected parent reward, we compute the cosine distance between each criterion embedding and the parent reward embedding:

$$\delta_{i}^{t}=D\bigl(V(c_{t}^{i}),V(r_{\text{parent}})\bigr), \tag{4}$$

where $D(\cdot)$ denotes cosine distance.

This distance-based filtering serves as a heuristic safeguard rather than a strict semantic entailment test, aiming to remove candidate criteria that are loosely related to the intended risk dimension or shift focus away from the core evaluation objective. Accordingly, the set of activated reward nodes for step $t$ is defined as:

$$R_{t}=\{r_{t}^{i}\mid c_{t}^{i}\in\mathcal{C}_{t},\;\delta_{i}^{t}\leq\zeta\}, \tag{5}$$

where each $r_{t}^{i}$ denotes the reward tree node associated with criterion $c_{t}^{i}$, and $\zeta$ is a predefined distance threshold.

Finally, each activated reward node $r_{t}^{i}\in R_{t}$ is scored using a step-level scoring function $S(\cdot)$:

$$s_{i}(t)=S\bigl(\hat{y}(t),r_{t}^{i}\bigr). \tag{6}$$

All step-level reward scores produced by the scoring function $S(\cdot)$ are initially defined on a discrete scale of $[1,10]$. To ensure consistency across heterogeneous reward dimensions, we linearly normalize each reward dimension into $[0,1]$ before any trajectory aggregation or scalarization is applied. Specifically, a raw score $s\in[1,10]$ is mapped to $\tilde{s}=(s-1)/9$.
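The filtering rule of Eqs. (4) and (5) and the score normalization can be sketched as follows. This is an illustrative sketch only: the embeddings are toy 2-D vectors, the threshold value is arbitrary, and the helper names `cosine_distance`, `activate`, and `normalize` are hypothetical rather than taken from the released code.

```python
import math


def cosine_distance(a, b):
    """D(a, b) = 1 - cosine similarity, as in Eq. (4)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def activate(criteria_vecs, parent_vec, zeta):
    """Indices of criteria whose distance to the parent is at most zeta (Eq. 5)."""
    return [i for i, vec in enumerate(criteria_vecs)
            if cosine_distance(vec, parent_vec) <= zeta]


def normalize(score):
    """Map a raw score on the [1, 10] scale linearly into [0, 1]."""
    if not 1 <= score <= 10:
        raise ValueError("raw score must lie in [1, 10]")
    return (score - 1) / 9


parent = [1.0, 0.0]                                # toy parent reward embedding
candidates = [[1.0, 0.1], [0.0, 1.0], [0.8, 0.6]]  # toy criterion embeddings
active = activate(candidates, parent, zeta=0.3)    # the orthogonal criterion is dropped
```

Normalizing before aggregation matters for the later worst-dimension comparison: without a shared $[0,1]$ scale, the "worst" dimension could simply be the one with the smallest raw range.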

### 3.2 Dynamic Reward-Guided Chebyshev MCTS

The second component uses search to explore trajectories that are simultaneously good across all activated dimensions, where the reward space assigns a multi\-dimensional reward vector to each reasoning step rather than a single scalar or final outcome\.

##### Chebyshev scalarization\.

Given a reward vector $v=(v_{1},\dots,v_{M})$ and an adaptive ideal point $z^{\ast}=(z^{\ast}_{1},\dots,z^{\ast}_{M})$, we first consider the standard Chebyshev scalarization, defined as:

$$g_{\text{cheby}}(v;z^{\ast})=-\max_{j}\bigl(z^{\ast}_{j}-v_{j}\bigr). \tag{7}$$

However, multimodal reasoning tasks often involve noisy reward signals, where a single isolated misclassification by the reward model could excessively penalize a valid trajectory under the strict min-max criterion. To mitigate this noise and enhance robustness, we employ the *Augmented Chebyshev Scalarization*:

$$g_{\text{aug}}(v;z^{\ast})=g_{\text{cheby}}(v;z^{\ast})-\rho\sum_{j=1}^{M}(z_{j}^{\ast}-v_{j}), \tag{8}$$

where $\rho$ is a small positive constant serving as a regularization term. This formulation preserves two properties: (1) the Chebyshev term keeps the focus on the worst dimension; and (2) for trajectories that are good on all aspects, the augmented term captures their subtle differences and identifies the best candidate.
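A short numerical sketch of Eqs. (7) and (8) makes the non-compensatory behavior concrete. The reward vectors below are invented for illustration (ordered as visual grounding, logic, semantics), and $\rho=0.1$ matches the value reported in the implementation details; the function names are hypothetical.

```python
def g_cheby(v, z):
    """Standard Chebyshev scalarization, Eq. (7): negative of the largest gap."""
    return -max(zj - vj for zj, vj in zip(z, v))


def g_aug(v, z, rho=0.1):
    """Augmented Chebyshev scalarization, Eq. (8)."""
    return g_cheby(v, z) - rho * sum(zj - vj for zj, vj in zip(z, v))


ideal = (1.0, 1.0, 1.0)
hallucinated = (0.2, 1.0, 1.0)   # strong logic, poor visual grounding
balanced = (0.7, 0.7, 0.7)       # moderate on every dimension

# A plain average prefers the hallucinated step ...
mean_h = sum(hallucinated) / 3
mean_b = sum(balanced) / 3

# ... while the augmented Chebyshev score prefers the balanced one.
score_h = g_aug(hallucinated, ideal)
score_b = g_aug(balanced, ideal)

# With equal worst dimensions, the sum term breaks the tie.
tie_low = g_aug((0.7, 0.7, 0.7), ideal)
tie_high = g_aug((0.7, 0.9, 0.9), ideal)
```

Here the average ranks the hallucinated step higher, whereas `g_aug` ranks it lower because its largest gap (0.8) dominates the score.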

##### MCTS Implementation\.

The MCTS algorithm constructs a tree of reasoning steps, where each node corresponds to a partial reasoning trajectory\. In a nutshell, the algorithm dynamically searches for the best trajectory based on Chebyshev scalarized rewards:

- Selection: Starting from the root, the algorithm traverses the tree by selecting the child node that maximizes the Upper Confidence Bound applied to Trees (UCT) score:

  $$\text{UCT}(n)=g_{\text{aug}}(R_{\text{step}}(n);z^{\ast})+c\sqrt{\frac{\log N(\text{parent}(n))}{N(n)}}, \tag{9}$$

  where $R_{\text{step}}(n)$ denotes the $M$-dimensional reward vector associated with node $n$, and the scalarized step quality is always obtained via $g_{\text{aug}}(\cdot)$. The constant $c$ controls the exploration-exploitation trade-off. Here, $N(n)$ is the number of times node $n$ has been visited, while $N(\text{parent}(n))$ denotes the visit count of its parent node. The next node is selected via $n_{\text{next}}=\arg\max_{n'\in\text{children}(n)}\text{UCT}(n')$.
- Expansion: Upon reaching a leaf node, we expand the tree by sampling $k$ candidate steps $\{y^{(t)}_{1},\dots,y^{(t)}_{k}\}$ from the current policy $\pi_{\theta}(y^{(t)}\mid x,y^{(<t)})$. Each candidate instantiates a new child node.
- Evaluation: For a newly expanded node $n$, we evaluate its immediate quality. Instead of random rollouts, we compute the fine-grained process reward vector $R_{\text{step}}(n)$. We then apply the Augmented Chebyshev Scalarization $g_{\text{aug}}(\cdot)$ to condense this vector into a scalar, strictly penalizing the worst-performing dimension while retaining discrimination via regularization.
- Backpropagation: The scalarized value and visit counts are propagated from the leaf to the root. For every node along the path, $N(n)$ is incremented and statistics are updated, dynamically guiding the search toward paths that are balanced across all reward dimensions.
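The selection and backpropagation steps above can be sketched in a few lines. This is a simplified illustration, not the released implementation: the `Node` class and function names are hypothetical, unvisited children are given an infinite score (an assumption, since Eq. (9) is undefined for $N(n)=0$), and backpropagation here updates only visit counts.

```python
import math


class Node:
    """One partial reasoning trajectory in the search tree (hypothetical structure)."""

    def __init__(self, reward_vec, parent=None):
        self.reward_vec = reward_vec   # M-dimensional step reward, values in [0, 1]
        self.parent = parent
        self.children = []
        self.visits = 0


def g_aug(v, z, rho=0.1):
    """Augmented Chebyshev scalarization, Eq. (8)."""
    gaps = [zj - vj for zj, vj in zip(z, v)]
    return -max(gaps) - rho * sum(gaps)


def uct(node, z, c=1.0, rho=0.1):
    """UCT score of Eq. (9); unvisited nodes are explored first (assumption)."""
    if node.visits == 0:
        return math.inf
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return g_aug(node.reward_vec, z, rho) + explore


def select_child(node, z, c=1.0, rho=0.1):
    """Pick the child with the highest UCT score."""
    return max(node.children, key=lambda child: uct(child, z, c, rho))


def backpropagate(leaf):
    """Increment visit counts from the leaf up to the root."""
    node = leaf
    while node is not None:
        node.visits += 1
        node = node.parent


ideal = (1.0, 1.0, 1.0)
root = Node(None)
skewed = Node((0.2, 1.0, 1.0), root)
even = Node((0.7, 0.7, 0.7), root)
root.children = [skewed, even]
root.visits, skewed.visits, even.visits = 10, 5, 5
chosen = select_child(root, ideal)   # equal exploration bonus, so g_aug decides
```

With equal visit counts the exploration bonus cancels, so the choice between the two children is made entirely by the worst-dimension-aware score.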

##### Adaptive Ideal Point Update\.

In our framework, the ideal point is updated according to the worst-performing reward dimension observed in the current MCTS simulation. Given the current ideal point $z^{\ast}=(z^{\ast}_{1},\ldots,z^{\ast}_{M})$ and the reward vector $v=(v_{1},\ldots,v_{M})$, we compute the dimension-wise gap between the ideal point and the observed reward:

$$\Delta_{j}=z^{\ast}_{j}-v_{j},\quad j=1,\ldots,M. \tag{10}$$

The worst-performing dimension is selected as the dimension with the largest gap:

$$j^{\ast}=\arg\max_{j\in\{1,\ldots,M\}}\Delta_{j}. \tag{11}$$

The maximum gap is then used as the adaptive signal for the next-round ideal point:

$$\Delta_{\max}=\Delta_{j^{\ast}}=\max_{j\in\{1,\ldots,M\}}\left(z^{\ast}_{j}-v_{j}\right). \tag{12}$$

Finally, we broadcast this worst-dimension gap to construct the next ideal-point vector:

$$z^{\ast}_{\mathrm{next}}=\Delta_{\max}\mathbf{1}_{M}. \tag{13}$$
This update directly uses the largest deviation from the ideal point as the next reference signal, thereby forcing the following MCTS iteration to concentrate on the currently weakest dimension\. Unlike coordinate\-wise exponential moving average updates, the proposed rule does not allow strong dimensions to dilute the failure of weak ones\. Therefore, it better matches the non\-compensatory objective of Chebyshev\-guided search and strengthens the worst\-dimension optimization principle of MMS\-PRM\.
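For concreteness, the update rule can be transcribed literally from Eqs. (10) to (13) as written. The function name `update_ideal_point` and the example numbers are hypothetical; the sketch makes no claim about how the released code schedules or clips this update.

```python
def update_ideal_point(z, v):
    """Literal transcription of Eqs. (10)-(13).

    Returns the index of the worst dimension and the next ideal-point
    vector, which repeats the largest gap in every coordinate.
    """
    gaps = [zj - vj for zj, vj in zip(z, v)]                  # Eq. (10)
    j_star = max(range(len(gaps)), key=gaps.__getitem__)      # Eq. (11)
    delta_max = gaps[j_star]                                  # Eq. (12)
    z_next = [delta_max] * len(z)                             # Eq. (13)
    return j_star, z_next


# Example: visual grounding (index 0) lags far behind the other dimensions.
worst, z_next = update_ideal_point([1.0, 1.0, 1.0], [0.25, 0.75, 1.0])
```

In this example the first dimension has the largest gap (0.75), so it is flagged as the worst and its gap is broadcast to all coordinates of the next reference vector.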

### 3.3 Curriculum DPO for Policy Alignment

After applying the MCTS algorithm, we obtain an ideal point $z^{\ast}$ together with a set of candidate trajectories for each input sample. The third component then converts these searched trajectories into preference data and aligns the policy using an easy-to-hard curriculum.

##### Score fusion and pair construction\.

Our method allows us to evaluate the overall quality of any reasoning trajectory $Y$ via a fused scalar score $G(Y)$. Based on this evaluation, preference pairs $(Y^{+},Y^{-})$ are constructed from the searched trajectories produced by MCTS, satisfying $G(Y^{+})-G(Y^{-})\geq\delta$. Given a reasoning trajectory $Y=\{y^{(1)},y^{(2)},\dots,y^{(n)}\}$, we define the trajectory-level reward vector $\mathbf{R}_{\text{traj}}(Y)\in\mathbb{R}^{M}$ as:

$$\mathbf{R}_{\text{traj}}(Y)=\frac{1}{n}\sum_{t=1}^{n}\mathbf{R}_{\text{step}}(y^{(t)}), \tag{14}$$

where each $\mathbf{R}_{\text{step}}(y^{(t)})$ represents the $M$-dimensional reward scores (e.g., logical consistency, visual alignment) at step $t$. This definition ensures that the trajectory reward maintains the same dimensionality and physical meaning as the step-wise rewards. After $K$ simulations, each trajectory $Y$ has a fused score

$$\bar{G}(Y)=\eta\,g_{\text{aug}}(\mathbf{R}_{\text{traj}}(Y))+(1-\eta)R_{\text{ans}}(Y), \tag{15}$$

where $\eta\in[0,1]$ is a balancing coefficient. $R_{\text{ans}}(n)$ denotes the outcome correctness reward, defined as $1$ if the reasoning chain at terminal node $n$ matches the ground-truth answer, and $0$ otherwise. For non-terminal nodes, $R_{\text{ans}}(n)$ is set to $0$.

Finally, given multiple reasoning trajectories sampled for the same input $x$, we induce preference pairs $(Y^{+},Y^{-})$ for DPO-style optimization, where the trajectory with a higher aggregated score is treated as preferred, i.e.,

$$\bar{G}(Y^{+})-\bar{G}(Y^{-})\geq\delta. \tag{16}$$
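The score fusion and pair construction can be sketched as follows. This is a minimal illustration under stated assumptions: the ideal point is passed explicitly to `g_aug` (Eq. (15) leaves it implicit), the margin value and trajectory labels are invented, and the function names are hypothetical.

```python
from itertools import permutations


def traj_reward(step_rewards):
    """Eq. (14): dimension-wise mean of the step reward vectors."""
    n, m = len(step_rewards), len(step_rewards[0])
    return [sum(step[j] for step in step_rewards) / n for j in range(m)]


def g_aug(v, z, rho=0.1):
    """Augmented Chebyshev scalarization, Eq. (8)."""
    gaps = [zj - vj for zj, vj in zip(z, v)]
    return -max(gaps) - rho * sum(gaps)


def fused_score(step_rewards, answer_correct, z, eta=0.5, rho=0.1):
    """Eq. (15): blend of process quality and outcome correctness."""
    r_ans = 1.0 if answer_correct else 0.0
    return eta * g_aug(traj_reward(step_rewards), z, rho) + (1 - eta) * r_ans


def build_pairs(scored, delta):
    """Eq. (16): ordered (preferred, rejected) pairs separated by a margin delta > 0."""
    return [(a, b) for (a, sa), (b, sb) in permutations(scored, 2) if sa - sb >= delta]


ideal = [1.0, 1.0]
score_a = fused_score([[1.0, 1.0], [1.0, 1.0]], True, ideal)    # perfect and correct
score_b = fused_score([[0.5, 1.0], [0.5, 1.0]], False, ideal)   # weak first dimension, wrong answer
pairs = build_pairs([("A", score_a), ("B", score_b)], delta=0.2)
```

The margin $\delta$ filters out near-ties, so only pairs with a clear quality gap contribute preference signal.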

##### Warm\-up and DPO\.

Given the constructed preference pairs\(Y\+,Y−\)\(Y^\{\+\},Y^\{\-\}\)derived from the search process, this stage aims to align the policy with step\-level reasoning preferences through a two\-phase training procedure\. Specifically, we first warm up the policy using supervised fine\-tuning \(SFT\) to stabilize generation, and then apply Direct Preference Optimization \(DPO\) to directly optimize the policy toward preferred trajectories\.

$$\mathcal{L}_{\text{SFT}}(\pi_{\theta})=-\mathbb{E}_{(x,Y)}\sum_{t=1}^{T}\log\pi_{\theta}(y^{(t)}\mid x,y^{(<t)}). \tag{17}$$

Then we apply DPO with a reference model $\pi_{\text{ref}}$:

$$L(\theta)=-\mathbb{E}_{x,\hat{y}^{<t},(\hat{y}^{+},\hat{y}^{-})\in V}\left[\log\sigma(F^{+}-F^{-})\right], \tag{18}$$

where

$$F^{+}=\beta\log\frac{\pi_{\theta}(\hat{y}^{+}\mid x;\hat{y}^{<t})}{\pi_{\text{ref}}(\hat{y}^{+}\mid x;\hat{y}^{<t})}, \tag{19}$$

$$F^{-}=\beta\log\frac{\pi_{\theta}(\hat{y}^{-}\mid x;\hat{y}^{<t})}{\pi_{\text{ref}}(\hat{y}^{-}\mid x;\hat{y}^{<t})}. \tag{20}$$
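The per-pair DPO objective of Eqs. (18) to (20) reduces to a few arithmetic operations once the log-probabilities are available. The sketch below works on scalar log-probabilities supplied by the caller; the function name `dpo_loss` and the default $\beta=0.1$ are assumptions, since the text does not state the value used.

```python
import math


def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Per-pair DPO loss, -log(sigmoid(F+ - F-)), from Eqs. (18)-(20).

    Inputs are log-probabilities of the preferred (+) and rejected (-)
    continuations under the policy and the frozen reference model.
    """
    f_pos = beta * (logp_pos - ref_logp_pos)   # Eq. (19)
    f_neg = beta * (logp_neg - ref_logp_neg)   # Eq. (20)
    margin = f_pos - f_neg
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log 2.
at_init = dpo_loss(-1.0, -2.0, -1.0, -2.0)

# Raising the preferred continuation's log-probability lowers the loss.
improved = dpo_loss(-0.5, -2.0, -1.0, -2.0)
```

In practice the log-probabilities would be summed token log-likelihoods computed in batches by a deep learning framework; the scalar version only illustrates the shape of the objective.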

##### Difficulty scheduling\.

We define the difficulty of a reasoning trajectory based on two intuitive factors: its overall reward quality and its reasoning length\.

A trajectory whose reward vector is closer to the ideal point $z^{\ast}$ corresponds to higher overall reward, and is therefore considered easier. Similarly, shorter reasoning trajectories generally indicate simpler reasoning processes and are also treated as easier cases.

Taking both factors into account, we define the difficulty score of a trajectory $Y$ as a linear combination of its normalized depth and its distance to the ideal point:

$$D(Y)=\alpha\frac{\text{depth}(Y)}{T_{\max}}+(1-\alpha)\frac{\|z^{\ast}-\mathbf{R}_{\text{traj}}(Y)\|_{2}}{\sqrt{M}}, \tag{21}$$

where $T_{\max}$ denotes the maximum allowed number of reasoning steps, $M$ is the number of reward dimensions, and $\mathbf{R}_{\text{traj}}(Y)$ represents the aggregated reward vector of trajectory $Y$. The $L_{2}$ distance measures how far the trajectory performance deviates from the ideal point, and dividing by $\sqrt{M}$ normalizes the term to $[0,1]$.

By construction, a smaller value of $D(Y)$ indicates an easier trajectory, while larger values correspond to harder cases. We use this difficulty score to perform curriculum-style difficulty scheduling during training, prioritizing easier trajectories in early stages and gradually incorporating harder ones as training progresses.
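A minimal sketch of the difficulty score in Eq. (21) and the resulting easy-to-hard ordering is given below. The weight $\alpha=0.5$, the dictionary layout of each trajectory, and the function names are assumptions made for illustration; rewards and the ideal point are taken to lie in $[0,1]$ so that the distance term stays within $[0,1]$.

```python
import math


def difficulty(depth, traj_reward, z, t_max, alpha=0.5):
    """Eq. (21): weighted sum of normalized depth and normalized L2 distance."""
    m = len(z)
    dist = math.sqrt(sum((zj - rj) ** 2 for zj, rj in zip(z, traj_reward)))
    return alpha * depth / t_max + (1 - alpha) * dist / math.sqrt(m)


def curriculum_order(trajectories, z, t_max, alpha=0.5):
    """Sort trajectories from easiest (smallest D) to hardest."""
    return sorted(
        trajectories,
        key=lambda tr: difficulty(tr["depth"], tr["reward"], z, t_max, alpha),
    )


ideal = [1.0, 1.0]
pool = [
    {"name": "long_weak", "depth": 10, "reward": [0.2, 0.4]},
    {"name": "short_strong", "depth": 2, "reward": [0.9, 0.9]},
    {"name": "mid", "depth": 5, "reward": [0.7, 0.6]},
]
ordered = [tr["name"] for tr in curriculum_order(pool, ideal, t_max=10)]
```

A training schedule could then expose the policy to a growing prefix of this ordering in each DPO round; how the paper partitions the rounds is not specified here.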

##### Closed\-loop training\.

Searching with the current policy yields better\-balanced trajectories; these trajectories give us preference data with graded difficulty; the updated policy, in turn, improves the quality of the search\. This closes the loop and steadily pushes the model toward high\-quality multimodal reasoning\.

## 4 Experiment

Table 1: Comparison with open-source VLMs. We evaluate our method on six benchmarks covering both general and task-specific reasoning capacities. Our method performs consistently well across these benchmarks, surpassing different baselines and other state-of-the-art VLMs by large margins. Bold and underlined entries mark the highest and second-highest scores, respectively.

### 4.1 Experimental Setup

#### 4.1.1 Implementation details

We employ InternVL-2.5-MPO Chen et al. ([2024b](https://arxiv.org/html/2606.07801#bib.bib37)) as our base Vision-Language Model (VLM). Supervised fine-tuning is conducted on ShareGPT-step-300K for one epoch. Qwen2.5-VL-32B-Instruct is adopted as the backbone of the Reward and Criteria Generation Model. The BAAI/bge-en-icl model Lee et al. ([2025](https://arxiv.org/html/2606.07801#bib.bib40)) is used to construct the embedding space $V\in\mathbb{R}^{d}$, where each element corresponds to a vector representation of a generated criterion, and the dimensionality is set to $d=4096$. To organize these criteria at multiple granularities, hierarchical clustering is performed using the BIRCH algorithm Zhang et al. ([1996](https://arxiv.org/html/2606.07801#bib.bib41)), resulting in a hierarchical structure $H$ over the criterion embeddings. In terms of training time, SFT takes about 9 hours on a single A800 node, and three rounds of Curriculum DPO take about 18 hours in total on a 4×A100 node. We set the branch factor $k=3$ for a good balance between exploration and computation time. We set the search depth $D=10$ to limit the search space while maintaining adequate coverage of reasoning paths. We empirically set the reward parameters $\eta=0.5$, $\rho=0.1$, and $\lambda=0.2$ to optimize reasoning quality while avoiding overfitting to any single dimension.

#### 4.1.2 Benchmarks

We evaluate our method on several representative benchmarks, spanning a wide range of tasks that demand complex multimodal reasoning abilities. For most of these benchmarks, performance is measured by accuracy (or relaxed accuracy), i.e., the ratio of correct predictions to total examples, unless specified otherwise. MathVista Lu et al. ([2024a](https://arxiv.org/html/2606.07801#bib.bib42)) is a well-known benchmark for multimodal mathematical reasoning, focusing on tasks such as plane geometry, functions, and visual puzzles. MMMU Yue et al. ([2024](https://arxiv.org/html/2606.07801#bib.bib9)) evaluates multimodal reasoning across diverse university-level disciplines. ChartQA Masry et al. ([2022](https://arxiv.org/html/2606.07801#bib.bib43)) aims to assess logical and numerical reasoning over chart-based visual data. MMStar Chen et al. ([2024a](https://arxiv.org/html/2606.07801#bib.bib44)) focuses on diverse multimodal problems that require fine-grained visual understanding. M3CoT Shao et al. ([2024](https://arxiv.org/html/2606.07801#bib.bib46)) evaluates multimodal chain-of-thought reasoning capabilities. AI2D Kembhavi et al. ([2016](https://arxiv.org/html/2606.07801#bib.bib45)) is designed to test scientific diagram understanding and reasoning.

### 4\.2 Main Results

Table 1 reports the comparison between MMS\-PRM and state\-of\-the\-art open\-source VLMs across six multimodal reasoning benchmarks\. Our method achieves performance gains on the majority of benchmarks\. Starting from InternVL2\.5\-MPO, supervised fine\-tuning alone yields limited gains, indicating that performance is close to saturation under standard training\. In contrast, introducing MMS\-PRM leads to consistent improvements across all benchmarks, demonstrating the effectiveness of search\-enhanced, multi\-dimensional process supervision beyond conventional SFT\. The gains are particularly pronounced on reasoning\-intensive benchmarks such as M3CoT and MathVista, which require long\-horizon, multi\-step reasoning\. This suggests that MMS\-PRM is especially effective at refining complex multimodal reasoning behaviors rather than optimizing for final\-answer correctness alone\.

Overall, these results show that explicitly balancing multiple reward dimensions at the process level remains a powerful mechanism for improving strong vision\-language models\.

### 4\.3 Efficiency and Generalization Analysis

To further validate the practicality of MMS\-PRM, we compare it against state\-of\-the\-art multimodal PRMs that require extensive supervised fine\-tuning\. As shown in Table [2](https://arxiv.org/html/2606.07801#S4.T2), training\-based methods such as Visual\-PRM and SVIP achieve slightly higher accuracy by optimizing billions of parameters on large\-scale step\-level datasets, yet MMS\-PRM remains highly competitive while using a training\-free reward mechanism that avoids training a separate reward model\. The minor performance trade\-off is offset by our model's deployment efficiency and its robust generalization across diverse benchmarks, as it avoids the risk of overfitting to specific training distributions or reward hacking\. Consequently, MMS\-PRM offers a more resource\-efficient and scalable paradigm for real\-world multimodal reasoning\.

Table 2: Comparison with state\-of\-the\-art multimodal PRMs\. Our method achieves competitive performance using a training\-free process reward model, whereas baselines require extensive supervision to train their reward models\.

### 4\.4 Ablation Study

We perform ablation studies on the M3CoT validation set to analyze the contribution of each component in MMS\-PRM\. As shown in Table [3](https://arxiv.org/html/2606.07801#S4.T3), introducing the hierarchical fine\-grained process rewards consistently improves the SFT baseline, indicating the effectiveness of step\-level supervision for multimodal reasoning\. Further incorporating reward\-guided Chebyshev MCTS leads to additional gains, demonstrating that explicitly searching for balanced reasoning trajectories is more effective than direct optimization\. Compared with weighted\-sum MCTS, Chebyshev scalarization achieves better performance, suggesting that penalizing the weakest reward dimension is crucial for multimodal reasoning\. Finally, the full MMS\-PRM framework, which combines hierarchical rewards, Chebyshev MCTS, and curriculum\-style DPO, achieves the best performance\. In contrast, removing MCTS and applying DPO alone results in a noticeable drop, highlighting the necessity of search\-based trajectory optimization\.
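
The difference between the two scalarizations compared in this ablation can be sketched as follows. The example uses the textbook augmented Chebyshev form, $-\big(\max_j w_j (z^{*}_j - v_j) + \rho \sum_j w_j (z^{*}_j - v_j)\big)$, with an assumed ideal point $z^{*}_j = 1$ and uniform weights; it is not necessarily the exact scoring rule used by MMS\-PRM, and the `rho` here is the generic augmentation coefficient rather than a confirmed mapping to the paper's $\rho$. The reward vectors and dimension names are hypothetical.

```python
def weighted_sum(rewards: list[float], weights: list[float]) -> float:
    """Linear scalarization: a strong dimension can compensate for a weak one."""
    return sum(w * r for w, r in zip(weights, rewards))


def augmented_chebyshev(rewards: list[float], weights: list[float],
                        ideal: float = 1.0, rho: float = 0.1) -> float:
    """Negated augmented Chebyshev distance to the ideal point (higher is better)."""
    gaps = [w * (ideal - r) for w, r in zip(weights, rewards)]
    # The max term is driven by the worst dimension; the sum term breaks ties.
    return -(max(gaps) + rho * sum(gaps))


# Hypothetical rewards ordered as (visual grounding, logical consistency, calculation).
candidates = {
    "hallucinated": [0.20, 0.95, 0.95],  # fluent logic built on a misread image
    "balanced": [0.65, 0.70, 0.70],      # no dimension collapses
}
weights = [1 / 3, 1 / 3, 1 / 3]

best_by_sum = max(candidates, key=lambda name: weighted_sum(candidates[name], weights))
best_by_chebyshev = max(candidates, key=lambda name: augmented_chebyshev(candidates[name], weights))
print(best_by_sum, best_by_chebyshev)  # hallucinated balanced
```

In this toy case the weighted sum prefers the trajectory with the higher average reward even though its visual grounding score is very low, while the Chebyshev score prefers the trajectory whose weakest dimension is highest.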

Table 3: Ablation study results on the M3CoT validation set\.

### 4\.5 Impact of Reward Count

We further investigate the impact of supervision density by analyzing the average reward counts\. In this experiment, we control the density of process supervision by explicitly constraining the number of output rewards in the prompt provided to the criteria generation model\. Specifically, we adjust the instructions to request a varying number of criteria, ranging from a strict constraint \(e\.g\., "output only the single most critical criterion"\) to a relaxed setting that encourages generating comprehensive criteria across multiple dimensions\. As shown in Table [4](https://arxiv.org/html/2606.07801#S4.T4), the reasoning performance consistently improves as the average number of activated process rewards increases from 1\.0 to 4\.22\. Here, the average values are induced by prompt\-level constraints on the number of generated criteria \(e\.g\., exactly 1, 1–3, $\leq 5$, and 3–5\), rather than being directly set\. This trend indicates that MMS\-PRM benefits significantly from joint multi\-dimensional constraints\.

Table 4: Effect of reward dimensionality measured by the average number of activated rewards per reasoning step\.

### 4\.6 Worst\-Dimension Analysis

We analyze the minimum reward dimension $v_{\min}=\min_{j} v_{j}$ of final trajectories generated by Chebyshev\-guided MCTS and weighted\-sum MCTS under identical rollout budgets and reward models\. As shown in Figure 3, Chebyshev scalarization consistently shifts the distribution of $v_{\min}$ toward higher values, indicating that the weakest reward dimension is significantly improved\. In contrast, weighted\-sum MCTS exhibits a heavier tail in the low\-$v_{\min}$ region, suggesting that severe failures in individual dimensions can be masked by strong performance in others\. This confirms that Chebyshev\-based optimization effectively enforces balanced multimodal reasoning by penalizing worst\-dimension collapse rather than allowing compensation across dimensions\.

![Refer to caption](https://arxiv.org/html/2606.07801v1/Figure/reward_prob.png)

Figure 3: Distribution of the minimum reward dimension $v_{\min}$ across reasoning trajectories\.
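
One way to summarize such a comparison numerically is to compute $v_{\min}$ per trajectory and then measure how much of the distribution falls below a failure threshold. The sketch below does this on hypothetical reward vectors; the threshold of 0.4, the function names, and the data are assumptions for illustration and are not results from the paper.

```python
def worst_dimension(rewards: list[float]) -> float:
    """v_min: the lowest reward across all dimensions of one trajectory."""
    return min(rewards)


def low_tail_fraction(trajectories: list[list[float]], threshold: float) -> float:
    """Fraction of trajectories whose worst dimension falls below the threshold."""
    if not trajectories:
        raise ValueError("at least one trajectory is required")
    below = sum(1 for rewards in trajectories if worst_dimension(rewards) < threshold)
    return below / len(trajectories)


# Hypothetical final-trajectory rewards for the two search variants.
chebyshev_runs = [[0.7, 0.8, 0.6], [0.6, 0.7, 0.9], [0.5, 0.6, 0.7], [0.8, 0.7, 0.75]]
weighted_sum_runs = [[0.2, 0.95, 0.9], [0.7, 0.8, 0.6], [0.3, 0.9, 0.95], [0.9, 0.85, 0.8]]

print(low_tail_fraction(chebyshev_runs, threshold=0.4))     # 0.0
print(low_tail_fraction(weighted_sum_runs, threshold=0.4))  # 0.5
```

A heavier low\-$v_{\min}$ tail shows up as a larger fraction for the same threshold, even when the mean reward of the two sets is similar.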

## 5 Conclusion

This paper introduced MMS\-PRM, which replaces superficial scalar rewards with fine\-grained, multi\-dimensional supervision via Chebyshev\-scalarized MCTS\. Our framework effectively balances visual and logical reasoning, achieving competitive results and robust generalization\. These findings highlight the critical role of structured process rewards and search mechanisms in advancing reliable, interpretable multimodal reasoning\.

## Acknowledgments

This work was supported by the National Key R&D Program of China under Grant No\. 2024YFC3308101, for the project “Long\- and Short\-Term Holographic Profiling of Bond Investors Based on Trading Behavior Features”, led by Prof\. Huaping Zhang\.

## References

- C\. Cai, H\. Liu, X\. Zhao, Z\. Jiang, T\. Zhang, Z\. Wu, J\. Lee, J\. Hwang, and L\. Li \(2025a\) Bayesian Optimization for Controlled Image Editing via LLMs\. In Proceedings of the Annual Meeting of the Association for Computational Linguistics \(ACL\)\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- C\. Cai, X\. Zhao, H\. Liu, Z\. Jiang, T\. Zhang, Z\. Wu, J\. Hwang, and L\. Li \(2025b\) The Role of Deductive and Inductive Reasoning in Large Language Models\. In Proceedings of the Annual Meeting of the Association for Computational Linguistics \(ACL\)\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- L\. Chen, J\. Li, X\. Dong, P\. Zhang, Y\. Zang, Z\. Chen, H\. Duan, J\. Wang, Y\. Qiao, D\. Lin, et al\. \(2024a\) Are we on the right way for evaluating large vision\-language models? Advances in Neural Information Processing Systems 37, pp\. 27056–27087\. Cited by: [§4\.1\.2](https://arxiv.org/html/2606.07801#S4.SS1.SSS2.p1.1)\.
- S\. Chen, J\. Zhou, and L\. Li \(2025\) Dense point clouds matter: dust\-gs for scene reconstruction from sparse viewpoints\. In ICASSP 2025\-2025 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\), pp\. 1–5\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- Z\. Chen, W\. Wang, Y\. Cao, Y\. Liu, Z\. Gao, E\. Cui, J\. Zhu, S\. Ye, H\. Tian, Z\. Liu, et al\. \(2024b\) Expanding performance boundaries of open\-source multimodal models with model, data, and test\-time scaling\. arXiv preprint arXiv:2412\.05271\. Cited by: [§4\.1\.1](https://arxiv.org/html/2606.07801#S4.SS1.SSS1.p1.8), [Table 1](https://arxiv.org/html/2606.07801#S4.T1.1.14.14.1)\.
- G\. Dong, C\. Zhang, M\. Deng, Y\. Zhu, Z\. Dou, and J\. Wen \(2025a\) Progressive multimodal reasoning via active retrieval\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 3579–3602\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- Y\. Dong, Z\. Liu, H\. Sun, J\. Yang, W\. Hu, Y\. Rao, and Z\. Liu \(2025b\) Insight\-v: exploring long\-chain visual reasoning with multimodal large language models\. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp\. 9062–9072\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1), [Table 1](https://arxiv.org/html/2606.07801#S4.T1.1.13.13.1)\.
- M\. Gao, X\. Liu, Z\. Yue, Y\. Wu, S\. Chen, J\. Li, S\. Tang, F\. Wu, T\. Chua, and Y\. Zhuang \(2025\) Benchmarking multimodal cot reward model stepwise by visual program\. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp\. 1718–1728\. Cited by: [§1](https://arxiv.org/html/2606.07801#S1.p2.1), [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- X\. Guan, L\. L\. Zhang, Y\. Liu, N\. Shang, Y\. Sun, Y\. Zhu, F\. Yang, and M\. Yang \(2025a\) RStar\-math: small llms can master math reasoning with self\-evolved deep thinking\. arXiv preprint arXiv:2501\.04519\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- Y\. Guan, Y\. Liu, K\. Zhou, H\. Li, S\. Jia, Z\. Shen, Z\. Wang, X\. Zhang, T\. Chen, J\. Hwang, et al\. \(2025b\) Learning an efficient optimizer via hybrid\-policy sub\-trajectory balance\. arXiv preprint arXiv:2511\.00543\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- W\. Huang, B\. Jia, Z\. Zhai, S\. Cao, Z\. Ye, F\. Zhao, Z\. Xu, X\. Tang, Y\. Hu, and S\. Lin \(2025\) Vision\-r1: incentivizing reasoning capability in multimodal large language models\. arXiv preprint arXiv:2503\.06749\. Cited by: [Table 1](https://arxiv.org/html/2606.07801#S4.T1.1.7.7.1)\.
- S\. Jia, N\. Zhu, J\. Zhong, J\. Zhou, H\. Zhang, J\. Hwang, and L\. Li \(2026\) RAM: recover any 3d human motion in\-the\-wild\. arXiv preprint arXiv:2603\.19929\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- A\. Kembhavi, M\. Salvato, E\. Kolve, M\. Seo, H\. Hajishirzi, and A\. Farhadi \(2016\) A diagram is worth a dozen images\. In European conference on computer vision, pp\. 235–251\. Cited by: [§4\.1\.2](https://arxiv.org/html/2606.07801#S4.SS1.SSS2.p1.1)\.
- C\. Lee, R\. Roy, M\. Xu, J\. Raiman, M\. Shoeybi, B\. Catanzaro, and W\. Ping \(2025\) Nv\-embed: improved techniques for training llms as generalist embedding models\. In International Conference on Learning Representations, Vol\. 2025, pp\. 79310–79333\. Cited by: [§4\.1\.1](https://arxiv.org/html/2606.07801#S4.SS1.SSS1.p1.8)\.
- B\. Li, Y\. Zhang, D\. Guo, R\. Zhang, F\. Li, H\. Zhang, K\. Zhang, P\. Zhang, Y\. Li, Z\. Liu, et al\. \(2024\) Llava\-onevision: easy visual task transfer\. arXiv preprint arXiv:2408\.03326\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1), [Table 1](https://arxiv.org/html/2606.07801#S4.T1.1.11.11.1)\.
- L\. Li, S\. Jia, and J\. Hwang \(2026\) Multiple human motion understanding\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 40, pp\. 6297–6305\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- L\. Li, S\. Jia, J\. Wang, Z\. Jiang, F\. Zhou, J\. Dai, T\. Zhang, Z\. Wu, and J\. Hwang \(2025\) Human motion instruction tuning\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- L\. Li \(2024\) Image semantic segmentation via chain\-of\-thought prompts\. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision \(WACV\)\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- H\. Liu, C\. Li, Y\. Li, and Y\. J\. Lee \(2024a\) Improved baselines with visual instruction tuning\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp\. 26296–26306\. Cited by: [Table 1](https://arxiv.org/html/2606.07801#S4.T1.1.2.2.1)\.
- H\. Liu, C\. Li, Y\. Li, B\. Li, Y\. Zhang, S\. Shen, and Y\. J\. Lee \(2024b\) Llavanext: improved reasoning, ocr, and world knowledge\. Cited by: [Table 1](https://arxiv.org/html/2606.07801#S4.T1.1.15.15.1)\.
- L\. Liu, S\. Chen, S\. Jia, J\. Shi, Z\. Jiang, C\. Jin, W\. Zongkai, J\. Hwang, and L\. Li \(2024c\) Graph canvas for controllable 3d scene generation\. arXiv preprint arXiv:2412\.00091\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- P\. Lu, H\. Bansal, T\. Xia, J\. Liu, C\. Li, H\. Hajishirzi, H\. Cheng, K\. Chang, M\. Galley, and J\. Gao \(2024a\) Mathvista: evaluating mathematical reasoning of foundation models in visual contexts\. In International Conference on Learning Representations, Vol\. 2024, pp\. 23439–23554\. Cited by: [§4\.1\.2](https://arxiv.org/html/2606.07801#S4.SS1.SSS2.p1.1)\.
- S\. Lu, Y\. Li, Q\. Chen, Z\. Xu, W\. Luo, K\. Zhang, and H\. Ye \(2024b\) Ovis: structural embedding alignment for multimodal large language model\. arXiv preprint arXiv:2405\.20797\. Cited by: [Table 1](https://arxiv.org/html/2606.07801#S4.T1.1.4.4.1)\.
- L\. Luo, Y\. Liu, R\. Liu, S\. Phatale, M\. Guo, H\. Lara, Y\. Li, L\. Shu, Y\. Zhu, L\. Meng, et al\. \(2024\) Improve mathematical reasoning in language models by automated process supervision\. arXiv preprint arXiv:2406\.06592\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- R\. Luo, Z\. Zheng, Y\. Wang, Y\. Yu, X\. Ni, Z\. Lin, J\. Zeng, and Y\. Yang \(2025\) Ursa: understanding and verifying chain\-of\-thought reasoning in multimodal mathematics\. arXiv e\-prints, pp\. arXiv–2501\. Cited by: [§1](https://arxiv.org/html/2606.07801#S1.p2.1), [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- A\. Masry, X\. L\. Do, J\. Q\. Tan, S\. Joty, and E\. Hoque \(2022\) Chartqa: a benchmark for question answering about charts with visual and logical reasoning\. In Findings of the association for computational linguistics: ACL 2022, pp\. 2263–2279\. Cited by: [§4\.1\.2](https://arxiv.org/html/2606.07801#S4.SS1.SSS2.p1.1)\.
- B\. Ong, T\. D\. Pala, V\. Toh, W\. C\. Tjhi, and S\. Poria \(2025\) Training vision\-language process reward models for test\-time scaling in multimodal reasoning: key insights and lessons learned\. arXiv preprint arXiv:2509\.23250\. Cited by: [§1](https://arxiv.org/html/2606.07801#S1.p2.1), [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- Z\. Qi, M\. Ma, J\. Xu, L\. L\. Zhang, F\. Yang, and M\. Yang \(2025\) Mutual reasoning makes smaller llms stronger problem\-solver\. In International Conference on Learning Representations, Vol\. 2025, pp\. 20788–20807\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- H\. Shao, S\. Qian, H\. Xiao, G\. Song, Z\. Zong, L\. Wang, Y\. Liu, and H\. Li \(2024\) Visual cot: advancing multi\-modal language models with a comprehensive dataset and benchmark for chain\-of\-thought reasoning\. Advances in Neural Information Processing Systems 37, pp\. 8612–8642\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1), [§4\.1\.2](https://arxiv.org/html/2606.07801#S4.SS1.SSS2.p1.1)\.
- J\. Shi, Q\. Ma, H\. Liu, H\. Zhao, J\. Hwang, and L\. Li \(2026\) Intrinsic entropy of context length scaling in llms\. In The Fourteenth International Conference on Learning Representations\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- C\. Snell, J\. Lee, K\. Xu, and A\. Kumar \(2024\) Scaling llm test\-time compute optimally can be more effective than scaling model parameters\. arXiv preprint arXiv:2408\.03314\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- O\. Thawakar, D\. Dissanayake, K\. P\. More, R\. Thawkar, A\. Heakl, N\. Ahsan, Y\. Li, I\. Z\. M\. Zumri, J\. Lahoud, R\. M\. Anwer, et al\. \(2025\) Llamav\-o1: rethinking step\-by\-step visual reasoning in llms\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 24290–24315\. Cited by: [Table 1](https://arxiv.org/html/2606.07801#S4.T1.1.6.6.1)\.
- P\. Wang, L\. Li, Z\. Shao, R\. Xu, D\. Dai, Y\. Li, D\. Chen, Y\. Wu, and Z\. Sui \(2024a\) Math\-shepherd: verify and reinforce llms step\-by\-step without human annotations\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 9426–9439\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- P\. Wang, S\. Bai, S\. Tan, S\. Wang, Z\. Fan, J\. Bai, K\. Chen, X\. Liu, J\. Wang, W\. Ge, et al\. \(2024b\) Qwen2\-vl: enhancing vision\-language model’s perception of the world at any resolution\. arXiv preprint arXiv:2409\.12191\. Cited by: [Table 1](https://arxiv.org/html/2606.07801#S4.T1.1.12.12.1)\.
- W\. Wang, Z\. Chen, W\. Wang, Y\. Cao, Y\. Liu, Z\. Gao, J\. Zhu, X\. Zhu, L\. Lu, Y\. Qiao, et al\. \(2024c\) Enhancing the reasoning ability of multimodal large language models via mixed preference optimization\. arXiv preprint arXiv:2411\.10442\. Cited by: [Table 1](https://arxiv.org/html/2606.07801#S4.T1.1.17.17.1)\.
- W\. Wang, Z\. Gao, L\. Chen, Z\. Chen, J\. Zhu, X\. Zhao, Y\. Liu, Y\. Cao, S\. Ye, X\. Zhu, et al\. \(2025\) Visualprm: an effective process reward model for multimodal reasoning\. arXiv preprint arXiv:2503\.10291\. Cited by: [§1](https://arxiv.org/html/2606.07801#S1.p2.1), [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- Z\. Wang, Y\. Li, Y\. Wu, L\. Luo, L\. Hou, H\. Yu, and J\. Shang \(2024d\) Multi\-step problem solving through a verifier: an empirical analysis on model\-induced process supervision\. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp\. 7309–7319\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- J\. Wu, M\. Feng, S\. Zhang, F\. Che, Z\. Wen, C\. Liao, and J\. Tao \(2024\) Beyond examples: high\-level automated reasoning paradigm in in\-context learning via mcts\. arXiv preprint arXiv:2411\.18478\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- G\. Xu, P\. Jin, Z\. Wu, H\. Li, Y\. Song, L\. Sun, and L\. Yuan \(2025\) Llava\-cot: let vision language models reason step\-by\-step\. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp\. 2087–2098\. Cited by: [Table 1](https://arxiv.org/html/2606.07801#S4.T1.1.5.5.1)\.
- Z\. Yan, L\. Li, Y\. Shao, S\. Chen, Z\. Wu, J\. Hwang, H\. Zhao, and F\. Remondino \(2024\) 3dsceneeditor: controllable 3d scene editing with gaussian splatting\. arXiv preprint arXiv:2412\.01583\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- Z\. Yan, Y\. Shao, M\. Liao, S\. Chen, N\. Wang, M\. Lin, J\. Hwang, H\. Zhao, F\. Remondino, and L\. Li \(2026\) 3DSceneEditor: controllable 3d scene editing with gaussian splatting\. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision \(WACV\), pp\. 1852–1863\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- Y\. Yang, X\. He, H\. Pan, X\. Jiang, Y\. Deng, X\. Yang, H\. Lu, D\. Yin, F\. Rao, M\. Zhu, et al\. \(2025\) R1\-onevision: advancing generalized multimodal reasoning through cross\-modal formalization\. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp\. 2376–2385\. Cited by: [Table 1](https://arxiv.org/html/2606.07801#S4.T1.1.9.9.1)\.
- H\. Yao, J\. Huang, W\. Wu, J\. Zhang, Y\. Wang, S\. Liu, Y\. Wang, Y\. Song, H\. Feng, L\. Shen, et al\. \(2026\) Mulberry: empowering mllm with o1\-like reasoning and reflection via collective monte carlo tree search\. Advances in Neural Information Processing Systems 38, pp\. 29918–29952\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- Y\. Yao, T\. Yu, A\. Zhang, C\. Wang, J\. Cui, H\. Zhu, T\. Cai, H\. Li, W\. Zhao, Z\. He, et al\. \(2024\) Minicpm\-v: a gpt\-4v level mllm on your phone\. arXiv preprint arXiv:2408\.01800\. Cited by: [Table 1](https://arxiv.org/html/2606.07801#S4.T1.1.10.10.1)\.
- Z\. Yao, X\. Cheng, Z\. Huang, and L\. Li \(2025\) CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- X\. Yue, Y\. Ni, K\. Zhang, T\. Zheng, R\. Liu, G\. Zhang, S\. Stevens, D\. Jiang, W\. Ren, Y\. Sun, et al\. \(2024\) Mmmu: a massive multi\-discipline multimodal understanding and reasoning benchmark for expert agi\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp\. 9556–9567\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1), [§4\.1\.2](https://arxiv.org/html/2606.07801#S4.SS1.SSS2.p1.1)\.
- J\. Zhang, J\. Huang, H\. Yao, S\. Liu, X\. Zhang, S\. Lu, and D\. Tao \(2025a\) R1\-vl: learning to reason with multimodal large language models via step\-wise group relative policy optimization\. arXiv preprint arXiv:2503\.12937\. Cited by: [Table 1](https://arxiv.org/html/2606.07801#S4.T1.1.8.8.1)\.
- P\. Zhang, X\. Dong, Y\. Zang, Y\. Cao, R\. Qian, L\. Chen, Q\. Guo, H\. Duan, B\. Wang, L\. Ouyang, et al\. \(2024\) Internlm\-xcomposer\-2\.5: a versatile large vision language model supporting long\-contextual input and output\. arXiv preprint arXiv:2407\.03320\. Cited by: [Table 1](https://arxiv.org/html/2606.07801#S4.T1.1.3.3.1)\.
- R\. Zhang, B\. Zhang, Y\. Li, H\. Zhang, Z\. Sun, Z\. Gan, Y\. Yang, R\. Pang, and Y\. Yang \(2025b\) Improve vision language model chain\-of\-thought reasoning\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 1631–1662\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
- T\. Zhang, R\. Ramakrishnan, and M\. Livny \(1996\) BIRCH: an efficient data clustering method for very large databases\. ACM SIGMOD Record 25\(2\), pp\. 103–114\. Cited by: [§4\.1\.1](https://arxiv.org/html/2606.07801#S4.SS1.SSS1.p1.8)\.
- X\. Zhang, S\. Chen, J\. Zhou, and L\. Li \(2026\) PSGS: text\-driven panorama sliding scene generation via gaussian splatting\. arXiv preprint arXiv:2602\.00463\. Cited by: [§2](https://arxiv.org/html/2606.07801#S2.p1.1)\.
