ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

arXiv cs.CL Papers

Summary

ProcessThinker introduces a practical post-training pipeline that provides step-level process rewards without training an explicit process reward model. It uses rollout-based rewards to give dense credit assignment for multi-step reasoning in multimodal LLMs, consistently improving performance on video benchmarks.

arXiv:2606.11209v1 Announce Type: new Abstract: Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start. A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-thought annotations and additional training cost. We propose ProcessThinker, a practical post-training pipeline that provides step-level process rewards without training an explicit PRM. ProcessThinker first rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and our rollout-based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final-answer verification) as the step reward. This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self-contradictory progress across steps -- a key issue in logical reasoning. Across four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3-VL-8B-Instruct

Cached at: 06/11/26, 01:35 PM

# ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward
Source: [https://arxiv.org/html/2606.11209](https://arxiv.org/html/2606.11209)
Jingpei Wu¹,⁵, Xiao Han¹, Weixiang Shen¹, Boer Zhang², Zifeng Ding³,⁴, Volker Tresp¹,⁵

¹LMU Munich, ²Harvard University, ³University of Cambridge, ⁴Mina AI, ⁵Konrad Zuse School of Excellence in Reliable AI (relAI)

###### Abstract

Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start. A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-thought annotations and additional training cost. We propose ProcessThinker, a practical post-training pipeline that provides step-level *process rewards* without training an explicit PRM. ProcessThinker first rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and our rollout-based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final-answer verification) as the step reward. This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self-contradictory progress across steps – a key issue in logical reasoning. Across four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3-VL-8B-Instruct.

## 1 Introduction

Multimodal large language models (MLLMs) have made rapid progress in open-ended visual understanding and question answering. With chain-of-thought (CoT) prompting, MLLMs can produce multi-step reasoning traces and often improve task performance. This long-horizon reasoning ability is increasingly important for complex problems where the answer depends on a sequence of intermediate inferences. Recently, post-training with reinforcement learning under verifiable rewards (RLVR), often paired with group-based objectives such as Group Relative Policy Optimization (GRPO), has further strengthened multi-step reasoning in both text-only and multimodal settings (Shao et al., [2024](https://arxiv.org/html/2606.11209#bib.bib4); DeepSeek-AI, [2025](https://arxiv.org/html/2606.11209#bib.bib3); Su et al., [2025](https://arxiv.org/html/2606.11209#bib.bib42); Sim et al., [2025](https://arxiv.org/html/2606.11209#bib.bib41); Feng et al., [2025](https://arxiv.org/html/2606.11209#bib.bib5); Zhang et al., [2025b](https://arxiv.org/html/2606.11209#bib.bib6); Park et al., [2025](https://arxiv.org/html/2606.11209#bib.bib8); Wang et al., [2025b](https://arxiv.org/html/2606.11209#bib.bib9); Zhang et al., [2025d](https://arxiv.org/html/2606.11209#bib.bib12); Yang et al., [2025](https://arxiv.org/html/2606.11209#bib.bib11); Huang et al., [2025](https://arxiv.org/html/2606.11209#bib.bib7)). However, the supervision signal in GRPO-style RLVR is typically *sparse*: the verifier only checks the final answer.
For long reasoning traces, many samples within a group can receive identical outcome rewards, which weakens learning signals and motivates reward shaping and sampling strategies (Yao et al., [2025](https://arxiv.org/html/2606.11209#bib.bib39); Chen et al., [2025](https://arxiv.org/html/2606.11209#bib.bib44); Zhang et al., [2025c](https://arxiv.org/html/2606.11209#bib.bib45); Niu et al., [2026](https://arxiv.org/html/2606.11209#bib.bib47); Yari and Koto, [2026](https://arxiv.org/html/2606.11209#bib.bib46); Tao et al., [2025](https://arxiv.org/html/2606.11209#bib.bib40); Lyu et al., [2025](https://arxiv.org/html/2606.11209#bib.bib43)). Even with these improvements, outcome-only supervision still provides little information about *which intermediate steps* were helpful when the final answer is wrong.

A natural way to densify supervision is to score intermediate steps. Process reward models (PRMs) provide step-level feedback and have been used for reranking, search, and test-time scaling by evaluating the quality of the reasoning process (Lightman et al., [2023](https://arxiv.org/html/2606.11209#bib.bib13); Setlur et al., [2024](https://arxiv.org/html/2606.11209#bib.bib14); Khalifa et al., [2025](https://arxiv.org/html/2606.11209#bib.bib21); Zhao et al., [2025a](https://arxiv.org/html/2606.11209#bib.bib24); Wang et al., [2025a](https://arxiv.org/html/2606.11209#bib.bib23); Du et al., [2025](https://arxiv.org/html/2606.11209#bib.bib25)). Recent work also explores using PRMs to supply process rewards during RL training (Luo et al., [2025](https://arxiv.org/html/2606.11209#bib.bib32)). However, PRM-based supervision usually requires high-quality step annotations (or a complex pipeline to synthesize them), and automated approaches often rely on Monte-Carlo rollouts or MCTS-style search, which can be noisy and sensitive to how “steps” are defined (Zhang et al., [2024](https://arxiv.org/html/2606.11209#bib.bib15); [2025a](https://arxiv.org/html/2606.11209#bib.bib16); [2025e](https://arxiv.org/html/2606.11209#bib.bib22); Tan et al., [2025](https://arxiv.org/html/2606.11209#bib.bib26); Ding et al., [2025](https://arxiv.org/html/2606.11209#bib.bib27)). Moreover, training and maintaining a separate PRM adds engineering overhead and can introduce a mismatch between the PRM and the final policy.

This raises a question central to logical multi-step reasoning: *can we obtain step-level training signals that encourage more consistent reasoning traces without training a separate PRM?* StepGRPO (Zhang et al., [2025b](https://arxiv.org/html/2606.11209#bib.bib6)) is an important step in this direction, using rule-based step-wise rewards (e.g., rewarding the presence of key steps and enforcing a well-structured reasoning format). However, it still does not directly measure whether a specific intermediate step actually makes the problem easier to solve. VinePPO (Kazemnejad et al., [2025](https://arxiv.org/html/2606.11209#bib.bib1)) shares a similar intuition, using Monte Carlo rollouts to estimate step-level values for credit assignment in PPO; however, it targets text-only LLMs and does not provide rollout-based process rewards within a GRPO framework. We propose ProcessThinker, which assigns a rollout-based *process reward* to each reasoning step via *continuation solvability* (Figure [1](https://arxiv.org/html/2606.11209#S1.F1)). The key idea is simple: an intermediate step is useful if, conditioned on the partial trace, the model is more likely to reach the correct final answer. We estimate this by sampling multiple continuations from the current policy starting from each step prefix and computing the empirical success rate under the same final-answer verifier used in RLVR. This provides a direct, model-free estimate of step utility and encourages reasoning traces whose steps more reliably support a correct conclusion, reducing inconsistent progress across steps, a core challenge in logical reasoning (Lightman et al., [2023](https://arxiv.org/html/2606.11209#bib.bib13)).
We instantiate ProcessThinker on Qwen3-VL-8B-Instruct (Bai et al., [2025](https://arxiv.org/html/2606.11209#bib.bib2)) and train in two stages: (i) an SFT warm-up on a step-tagged dataset obtained by rewriting Video-R1-CoT traces into explicit step decompositions using a stronger teacher model, and (ii) GRPO post-training with a weighted combination of sparse outcome reward and our rollout-based process reward, along with lightweight formatting incentives. We evaluate in the video domain, but the proposed reward construction is model- and modality-agnostic.

In summary, we make the following contributions: (1) We propose a simple GRPO-based post-training framework that incorporates step-level rewards *without* training an explicit process reward model (PRM). (2) We introduce a rollout-based process reward that scores each reasoning step by the empirical success rate of multiple continuations conditioned on the step prefix. (3) We demonstrate consistent improvements over Qwen3-VL-8B-Instruct on four video reasoning benchmarks, and ablations show that increasing the weight of the process reward yields larger gains.

![Refer to caption](https://arxiv.org/html/2606.11209v1/figures/ProcessThinker_pipeline.png)

Figure 1: Rollout-based process reward inside one GRPO update. For a question $Q$, we sample a group of $G$ candidate responses from the current policy. For each response, we extract step segments $\{s_t\}$ (incl. multiple $a$) and score each step $k$ by the success rate of $M$ continuation rollouts from previous steps $(s_1,\dots,s_k)$ under the policy model, producing step scores $c_t$ and an averaged process score. The final reward for each response combines format reward, process reward, and step-count shaping (bonus + penalty gate), which is then used to compute group-relative advantages for GRPO.

## 2 Method

Given a multimodal context $x$ (video frames or an image plus text prompt), the model generates an output $y$. We enforce a step-tagged reasoning format

$$y=\langle\texttt{think}\rangle\dots\langle\texttt{step}\rangle s_{k}\langle/\texttt{step}\rangle\dots\langle/\texttt{think}\rangle\langle\texttt{answer}\rangle ans\langle/\texttt{answer}\rangle, \tag{1}$$

which makes intermediate steps explicit and enables step-wise scoring (Figure [1](https://arxiv.org/html/2606.11209#S1.F1)).
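As an illustration of how such a trace could be split into steps and a final answer, the following is a minimal Python sketch. The literal tag strings (`<think>`, `<step>`, `<answer>`) and the function name `parse_step_tagged` are assumptions made for this example; the exact strings used in training are not shown in this text.

```python
import re
from typing import List, Optional, Tuple

# Assumed literal tags; the real tag strings may differ.
_OUTER = re.compile(
    r"^\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*$",
    re.DOTALL,
)
_STEP = re.compile(r"<step>(.*?)</step>", re.DOTALL)


def parse_step_tagged(output: str) -> Optional[Tuple[List[str], str]]:
    """Return (steps, answer) if the output follows the step-tagged layout, else None."""
    match = _OUTER.match(output)
    if match is None:
        return None
    think = match.group("think")
    steps = [s.strip() for s in _STEP.findall(think)]
    # Reject traces with no steps or with stray text outside the step tags.
    leftover = _STEP.sub("", think).strip()
    if not steps or leftover:
        return None
    return steps, match.group("answer").strip()
```

A parser of this kind returns `None` for malformed traces, which is one simple way to decide whether a response is eligible for step-wise scoring at all.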

### 2.1 SFT data construction (format + filtering)

Starting from Video-R1-CoT-165k (Feng et al., [2025](https://arxiv.org/html/2606.11209#bib.bib5)), we rewrite each sample into the step-tagged format using a stronger teacher model, Qwen3-VL-30B-A3B-Instruct (Bai et al., [2025](https://arxiv.org/html/2606.11209#bib.bib2)). The teacher is instructed to preserve the original solution while segmenting the reasoning into non-trivial, non-redundant steps. To reduce rewriting noise (missing/duplicated steps, semantic drift, step-answer mismatch), we apply a second-pass filter with the teacher that scores: (i) answer fidelity vs. the original solution, (ii) consistency between steps and the final answer, and (iii) step quality. We keep the top 19k samples for SFT and sample 1,250 prompts for RL. We fine-tune Qwen3-VL-8B-Instruct on the 19k set to obtain ProcessThinker-SFT, which reliably emits parsable step-tagged traces.

### 2.2 GRPO with rollout-based process rewards

GRPO. For each prompt $x$, we sample a group of $G$ responses $\{y^{(g)}\}_{g=1}^{G}$ from the current policy $\pi_{\theta}(\cdot|x)$, compute a scalar reward $r^{(g)}$ for each response, and normalize rewards within the group to obtain relative advantages (mean/variance normalization). The policy is then updated using the standard GRPO recipe with KL regularization to a reference policy. ProcessThinker differs from prior RLVR work mainly in the reward design below.
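The group normalization can be sketched in a few lines of Python. This is a hedged illustration, not the training code: it assumes that "mean/variance normalization" means subtracting the group mean and dividing by the group standard deviation, uses the population standard deviation, and adds a small `eps` for numerical safety. All three choices, and the name `group_relative_advantages`, are assumptions of this example.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one group to zero mean and (roughly) unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is the sparsity problem that denser rewards are meant to reduce.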

Rollout-based process reward (continuation solvability). For a sampled response $y$ with steps $s_{1:K}$ and ground-truth answer $ans^{\star}$, we score each prefix $p_{i}=(s_{1},\ldots,s_{i})$ by how often the model can successfully *finish* the problem when conditioned on that prefix. We sample $M$ continuations $\hat{y}_{i}^{(m)}\sim\pi_{\theta}(\cdot|x,p_{i})$ and define the step score as the empirical success rate:

$$c_{i}=\frac{1}{M}\sum_{m=1}^{M}\mathbf{I}\!\left[\mathrm{Ans}(\hat{y}_{i}^{(m)})=ans^{\star}\right]. \tag{2}$$

The trajectory-level process reward averages prefix solvability:

$$R_{\text{proc}}(y)=\frac{1}{\min(K,K_{\max})}\sum_{i=1}^{\min(K,K_{\max})}c_{i}, \tag{3}$$

using $M=4$ and $K_{\max}=6$ unless noted. This gives dense credit assignment: early steps can receive partial credit even if the final answer in $y$ is wrong.
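Equations (2) and (3) can be sketched as follows. This is a minimal illustration under stated assumptions: `sample_answer` is a hypothetical callable that stands in for sampling one continuation from the policy and extracting its final answer, verification is simplified to exact string match, and returning 0.0 for an empty step list is a choice made here for safety rather than something specified above.

```python
from typing import Callable, List, Sequence

# Hypothetical sampler: (context, prefix_steps) -> predicted final answer.
Sampler = Callable[[str, Sequence[str]], str]


def step_scores(
    context: str,
    steps: Sequence[str],
    gold: str,
    sample_answer: Sampler,
    num_rollouts: int = 4,  # M
    k_max: int = 6,         # K_max
) -> List[float]:
    """Empirical success rate c_i of continuations from each prefix (s_1..s_i)."""
    scores = []
    for i in range(1, min(len(steps), k_max) + 1):
        prefix = steps[:i]
        hits = sum(sample_answer(context, prefix) == gold for _ in range(num_rollouts))
        scores.append(hits / num_rollouts)
    return scores


def process_reward(scores: Sequence[float]) -> float:
    """Average of the step scores; 0.0 for an empty list (assumption of this sketch)."""
    return sum(scores) / len(scores) if scores else 0.0
```

With a real policy, `sample_answer` would be stochastic, so each $c_i$ is a noisy estimate whose granularity is limited to multiples of $1/M$.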

Format reward, bounded step bonus, and penalty gate. We use a strict format reward $r_{\text{fmt}}$ that is awarded only if tags are properly nested and $K\in[K_{\min},K_{\max}]$ (and optionally a length bonus if within $[L_{\min},L_{\max}]$), similar in spirit to step-structured RL recipes. To encourage using more than the minimum number of steps without step inflation, we add a bounded step bonus

$$B(K)=\alpha\sqrt{\mathrm{clip}\Big(\frac{K-K_{\min}}{K_{\max}-K_{\min}},\,0,\,1\Big)}. \tag{4}$$

To reduce reward hacking (too few steps or shallow steps that only satisfy format), we gate rewards with a simple penalty:

$$\bar{R}_{\text{acc}}=\begin{cases}1,&R_{\text{acc}}=1,\\ -\,B(K),&\text{otherwise},\end{cases}\qquad\bar{R}_{\text{proc}}=\begin{cases}R_{\text{proc}},&R_{\text{proc}}\geq\tau,\\ -\,B(K),&\text{otherwise},\end{cases} \tag{5}$$

where $R_{\text{acc}}\in\{0,1\}$ is final-answer accuracy and $\tau=0.5$.
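A small Python sketch of equations (4) and (5) is given below. The values of $\alpha$ and $K_{\min}$ are not stated in this text, so they are left as parameters; the function names are illustrative, and the sketch assumes $K_{\max}>K_{\min}$.

```python
import math


def step_bonus(k: int, k_min: int, k_max: int, alpha: float) -> float:
    """Bounded step bonus B(K) from equation (4); assumes k_max > k_min."""
    ratio = (k - k_min) / (k_max - k_min)
    clipped = min(max(ratio, 0.0), 1.0)
    return alpha * math.sqrt(clipped)


def gate_accuracy(r_acc: int, bonus: float) -> float:
    """Gated accuracy reward: 1 if correct, otherwise the negated step bonus."""
    return 1.0 if r_acc == 1 else -bonus


def gate_process(r_proc: float, bonus: float, tau: float = 0.5) -> float:
    """Gated process reward: kept if at least tau, otherwise the negated step bonus."""
    return r_proc if r_proc >= tau else -bonus
```

The square root makes the bonus grow quickly for the first few steps above $K_{\min}$ and flatten toward $K_{\max}$, and the clip keeps it within $[0,\alpha]$.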

Final reward. For format-valid responses, the reward is

$$r^{(g)}=(r_{\text{fmt}}+\beta)\;+\;\lambda_{\text{acc}}\bar{R}_{\text{acc}}\;+\;\lambda_{\text{proc}}\bar{R}_{\text{proc}}\;+\;B(K),\qquad\lambda_{\text{acc}}+\lambda_{\text{proc}}=1. \tag{6}$$

If formatting is invalid, we set $r^{(g)}=0$ and skip continuation rollouts for efficiency.
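The combination in equation (6) can be written as a short function. This sketch takes the gated terms and the step bonus as already-computed inputs; the function name, argument names, and all numeric values in the usage lines are hypothetical, since $r_{\text{fmt}}$, $\beta$, and $\alpha$ are not given numerically in this text.

```python
def final_reward(
    format_valid: bool,
    r_fmt: float,
    beta: float,
    gated_acc: float,   # gated accuracy reward from equation (5)
    gated_proc: float,  # gated process reward from equation (5)
    bonus: float,       # step bonus B(K) from equation (4)
    lambda_proc: float,
) -> float:
    """Overall reward from equation (6); invalid formatting yields 0."""
    if not format_valid:
        return 0.0
    lambda_acc = 1.0 - lambda_proc  # enforces lambda_acc + lambda_proc = 1
    return (r_fmt + beta) + lambda_acc * gated_acc + lambda_proc * gated_proc + bonus


# Hypothetical values: process-only mixture, correct answer, R_proc = 0.75.
print(final_reward(True, 1.0, 0.0, 1.0, 0.75, 0.05, lambda_proc=1.0))
# Both gates failed: the two penalties cancel the bonus.
print(final_reward(True, 1.0, 0.0, -0.05, -0.05, 0.05, lambda_proc=0.3))
```

One consequence of equations (5) and (6) as written: when both gates fail, the penalties sum to $-B(K)$ because the weights sum to one, so they cancel the additive bonus and the response is left with only $r_{\text{fmt}}+\beta$.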

## 3 Experiments

Table 1: Main results on four video reasoning benchmarks. All ProcessThinker variants share the same SFT warm-up and are trained with GRPO using the overall reward in equation [6](https://arxiv.org/html/2606.11209#S2.E6). “process-only” uses our rollout-based step-wise process reward (Section [2.2](https://arxiv.org/html/2606.11209#S2.SS2)), while “outcome-only” uses only final-answer correctness.

We evaluate on four video reasoning benchmarks: Video-MMMU (Hu et al., [2025](https://arxiv.org/html/2606.11209#bib.bib35)), MMVU (Zhao et al., [2025b](https://arxiv.org/html/2606.11209#bib.bib36)), VideoMathQA (Rasheed et al., [2025](https://arxiv.org/html/2606.11209#bib.bib37)), and LongVideoBench (Wu et al., [2024](https://arxiv.org/html/2606.11209#bib.bib38)), and report accuracy following each benchmark’s official protocol.

Training setup. All ProcessThinker variants share the same SFT warm-up and differ only in the reward mixture $(\lambda_{\text{acc}},\lambda_{\text{proc}})$ in equation [6](https://arxiv.org/html/2606.11209#S2.E6). Unless otherwise stated, we sample $G=4$ responses per prompt for GRPO and compute process rewards with $M=4$ continuation rollouts per step, capped at $K_{\max}=6$.
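To give a sense of the rollout budget these settings imply, the following Python sketch counts continuation rollouts for one prompt. It assumes that every prefix up to $\min(K,K_{\max})$ is scored with $M$ rollouts and that only format-valid responses are scored; the function name and the example step counts are hypothetical.

```python
from typing import Sequence


def continuation_rollouts(
    step_counts: Sequence[int],  # K for each format-valid response in the group
    rollouts_per_step: int = 4,  # M
    k_max: int = 6,              # K_max
) -> int:
    """Number of continuation rollouts needed to score one group of responses."""
    return rollouts_per_step * sum(min(k, k_max) for k in step_counts)


# Upper bound for G = 4 responses that all use K_max = 6 steps.
print(continuation_rollouts([6, 6, 6, 6]))  # 4 * 6 * 4 = 96
```

Under these assumptions a single prompt needs at most 96 continuation rollouts on top of the 4 group samples.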

Main results. As shown in Table [1](https://arxiv.org/html/2606.11209#S3.T1), ProcessThinker (process-only) improves over the Qwen3-VL-8B-Instruct baseline on all four benchmarks, raising the average score from 56.30 to 59.72 (+3.42). Video-R1-7B (Feng et al., [2025](https://arxiv.org/html/2606.11209#bib.bib5)) is included for reference, though it uses the older Qwen2.5-VL backbone and is not directly comparable. The largest gain is on VideoMathQA (+6.47), which requires multi-step reasoning while integrating sparse cues over time; we also observe a clear improvement on LongVideoBench (+3.90), suggesting better reasoning under long-context inputs. Overall, the consistent gains across benchmarks indicate that step-level credit assignment from rollout-based process rewards transfers beyond a single dataset type.

SFT warm-up: necessary but insufficient. ProcessThinker-SFT underperforms the instruction-tuned baseline despite improved format compliance, suggesting that step-tag supervision mainly teaches *how to write* structured traces, but does not directly optimize verified task success. GRPO post-training is therefore critical to turn well-formed reasoning traces into higher final-answer accuracy.

Process reward vs. outcome reward. Among reward mixtures, *process-only* performs best, while *outcome-only* and a balanced mixture lag behind. This supports our hypothesis that outcome-only supervision is too sparse for long-horizon reasoning, whereas rollout-based process rewards provide denser signals that better capture the usefulness of intermediate steps.

## 4 Conclusion

ProcessThinker shows that step-wise credit assignment can be obtained from sparse verifiable outcomes without training a separate PRM: we score each step by its *continuation solvability* via rollout success rates. This rollout-based process reward encourages intermediate steps that more reliably support a correct conclusion, addressing a key issue in logical multi-step reasoning: unproductive or inconsistent progress across steps. Combined with strict formatting constraints and GRPO-style RLVR, ProcessThinker consistently improves Qwen3-VL-8B-Instruct on four video reasoning benchmarks, with the strongest gains when emphasizing process rewards over outcome-only supervision. The main limitation is efficiency: although we avoid PRM annotation/training, continuation-solvability requires multiple rollouts per step, and the added inference cost can offset the benefit of using fewer data. The resulting reward can also be noisy due to rollout stochasticity and sensitivity to step segmentation.

#### Acknowledgments

This paper is supported by the DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the Federal Ministry of Research, Technology and Space\.

## References

- S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y. Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y. Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang, Q. Wang, Y. Wang, T. Xie, Y. Xu, H. Xu, J. Xu, Z. Yang, M. Yang, J. Yang, A. Yang, B. Yu, F. Zhang, H. Zhang, X. Zhang, B. Zheng, H. Zhong, J. Zhou, F. Zhou, J. Zhou, Y. Zhu, and K. Zhu (2025). Qwen3-vl technical report. arXiv:2511.21631.
- W. Chen, X. Li, Z. Yang, J. Jin, and Y. Yang (2025). XRPO: pushing the limits of grpo with targeted exploration and exploitation. arXiv:2510.06672.
- DeepSeek-AI (2025). DeepSeek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv:2501.12948.
- Y. Ding, X. Shi, J. Li, X. Liang, Z. Tu, and M. Zhang (2025). SCAN: self-denoising monte carlo annotation for robust process reward learning. arXiv:2509.16548.
- L. Du, F. Meng, Z. Liu, Z. Zhou, P. Luo, Q. Zhang, and W. Shao (2025). MM-prm: enhancing multimodal mathematical reasoning with scalable step-level supervision. arXiv:2505.13427.
- K. Feng, K. Gong, B. Li, Z. Guo, Y. Wang, T. Peng, J. Wu, X. Zhang, B. Wang, and X. Yue (2025). Video-r1: reinforcing video reasoning in multimodal large language models. arXiv:2503.21776.
- K. Hu, P. Wu, F. Pu, W. Xiao, Y. Zhang, X. Yue, B. Li, and Z. Liu (2025). Video-mmmu: evaluating knowledge acquisition from multi-discipline professional videos. arXiv:2501.13826.
- W. Huang, B. Jia, Z. Zhai, S. Cao, Z. Ye, F. Zhao, Y. Hu, and S. Lin (2025). Vision-r1: incentivizing reasoning capability in multimodal large language models. arXiv:2503.06749.
- A. Kazemnejad, M. Aghajohari, E. Portelance, A. Sordoni, S. Reddy, A. Courville, and N. Le Roux (2025). VinePPO: unlocking rl potential for llm reasoning through refined credit assignment. In International Conference on Machine Learning. arXiv:2410.01679.
- M. Khalifa, R. Agarwal, L. Logeswaran, J. Kim, H. Peng, M. Lee, H. Lee, and L. Wang (2025). Process reward models that think. arXiv:2504.16828.
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023). Let’s verify step by step. arXiv:2305.20050.
- R. Luo, Z. Zheng, Y. Wang, X. Ni, Z. Lin, S. Jiang, Y. Yu, C. Shi, L. Wang, R. Chu, Y. Qian, and D. Liu (2025). Unlocking multimodal mathematical reasoning via process reward model. arXiv:2501.04686.
- C. Lyu, S. Gao, Y. Gu, W. Zhang, J. Gao, K. Liu, Z. Wang, S. Li, Q. Zhao, H. Huang, W. Cao, J. Liu, H. Liu, J. Liu, S. Zhang, D. Lin, and K. Chen (2025). Exploring the limit of outcome reward for learning mathematical reasoning. arXiv:2502.06781.
- R. Niu, Y. Wu, B. Wang, and X. Xiang (2026). From absolute to relative: rethinking reward shaping in group-based reinforcement learning. arXiv:2601.23058.
- J. Park, J. Na, J. Kim, and H. J. Kim (2025). DeepVideo-r1: video reinforcement fine-tuning via difficulty-aware regressive grpo. arXiv:2506.07464.
- H. Rasheed, A. Shaker, A. Tang, M. Maaz, M. Yang, S. Khan, and F. S. Khan (2025). VideoMathQA: benchmarking mathematical reasoning via multimodal understanding in videos. arXiv:2506.05349.
- A. Setlur, C. Nagpal, A. Fisch, X. Geng, J. Eisenstein, R. Agarwal, A. Agarwal, J. Berant, and A. Kumar (2024). Rewarding progress: scaling automated process verifiers for llm reasoning. arXiv:2410.08146.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300.
- S. H. Sim, T. D. Pala, V. Toh, H. L. Chieu, A. Zadeh, C. Li, N. Majumder, and S. Poria (2025). Lessons from training grounded llms with verifiable rewards. arXiv:2506.15522.
- Y. Su, D. Yu, L. Song, J. Li, H. Mi, Z. Tu, M. Zhang, and D. Yu (2025). Crossing the reward bridge: expanding rl with verifiable rewards across diverse domains. arXiv:2503.23829.
- X. Tan, T. Yao, C. Qu, B. Li, M. Yang, D. Lu, H. Wang, X. Qiu, W. Chu, Y. Xu, and Y. Qi (2025). AURORA: automated training framework of universal process reward models via ensemble prompting and reverse verification. arXiv:2502.11520.
- L. Tao, I. Kulikov, S. Saha, T. Wang, J. Xu, S. Li, J. E. Weston, and P. Yu (2025). Hybrid reinforcement: when reward is sparse, it’s better to be dense. arXiv:2510.07242.
- W. Wang, Z. Gao, L. Chen, Z. Wang, Z. Zhang, K. Zhang, R. Liu, and J. Zhao (2025a). VisualPRM: an effective process reward model for multimodal reasoning. arXiv:2503.10291.
- Y. Wang, Z. Wang, B. Xu, Y. Du, K. Lin, Z. Xiao, Z. Yue, J. Ju, L. Zhang, D. Yang, X. Fang, Z. He, Z. Luo, W. Wang, J. Lin, J. Luan, and Q. Jin (2025b). Time-r1: post-training large vision language model for temporal video grounding. arXiv:2506.12520.
- H. Wu, D. Li, B. Chen, and J. Li (2024). LongVideoBench: a benchmark for long-context interleaved video-language understanding. arXiv:2407.15754.
- Y. Yang, X. He, H. Pan, X. Jiang, Y. Deng, X. Yang, H. Lu, D. Yin, F. Rao, M. Zhu, B. Zhang, and W. Chen (2025). R1-onevision: advancing generalized multimodal reasoning through cross-modal formalization. pp. 2376–2385.
- H. Yao, Q. Yin, J. Zhang, M. Yang, Y. Wang, W. Wu, F. Su, L. Shen, M. Qiu, D. Tao, and J. Huang (2025). R1-sharevl: incentivizing reasoning capability of multimodal large language models via share-grpo. arXiv:2505.16673.
- S. Yari and F. Koto (2026). AMIR-grpo: inducing implicit preference signals into grpo. arXiv:2601.03661.
- D. Zhang, Z. Hu, Y. Yue, Y. Dong, and J. Tang (2025a). ReST-rl: process reward guided reinforcement learning for large language model reasoning. arXiv:2501.07302.
- D. Zhang, S. Zhoubian, Z. Hu, Y. Yue, Y. Dong, and J. Tang (2024). ReST-mcts\*: llm self-training via process reward guided tree search. arXiv:2406.03816.
- J. Zhang, J. Huang, H. Yao, S. Liu, X. Zhang, S. Lu, and D. Tao (2025b). R1-vl: learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv:2503.12937.
- X. Zhang, W. Wu, Y. Peng, H. Zhou, R. Wu, and L. Pang (2025c). GRPO-lead: a difficulty-aware reinforcement learning approach for concise mathematical reasoning in language models. arXiv:2504.09696.
- X. Zhang, S. Wen, W. Wu, and L. Huang (2025d). TinyLLaVA-video-r1: towards smaller lmms for video reasoning. arXiv:2504.09641.
- Z. Zhang, C. Zheng, Y. Wu, B. Zhang, R. Lin, B. Yu, D. Liu, J. Zhou, and J. Lin (2025e). The lessons of developing process reward models in mathematical reasoning. arXiv:2501.07301.
- J. Zhao, R. Liu, K. Zhang, Z. Zhou, J. Gao, D. Li, J. Lyu, Z. Qian, B. Qi, X. Li, and B. Zhou (2025a). GenPRM: scaling test-time compute of process reward models via generative reasoning. arXiv:2504.00891.
- Y. Zhao, Y. Li, H. Wu, Y. Zhang, Z. Liu, and Y. Qiao (2025b). MMVU: measuring expert-level multi-discipline video understanding. arXiv:2501.12380.
