Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

arXiv cs.CL Papers

Summary

Introduces Metacognition-as-Reward (MaR), a reinforcement learning framework that guides LLM reasoning via metacognitive knowledge and regulation signals, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO across 22 reasoning benchmarks.

arXiv:2605.23384v1

Abstract: Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limited guidance for intermediate reasoning behaviors. (2) Rubrics-as-reward (RaR) goes beyond final-answer checking by using natural-language rubrics to assess reasoning quality and task compliance, but often requires instance-specific rubrics and substantial design effort. To address these issues, we introduce Metacognition-as-Reward (MaR), a metacognition-inspired RL framework that guides LLM reasoning through two general process dimensions: i) metacognitive knowledge, which identifies task-relevant information without hand-crafted instance-specific rubrics, and ii) metacognitive regulation, which plans and adjusts the reasoning process to provide reward guidance beyond final-answer outcomes. MaR scaffolds model rollouts into explicit metacognitive components and optimizes them with a trajectory-level reward over task knowledge coverage, regulation fidelity, and final-answer correctness. In this way, MaR extends reward feedback to reasoning trajectories while grounding the reward signals in general metacognitive dimensions. Experiments on 22 benchmarks show that MaR consistently improves model performance, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO. Notably, Qwen3.5-9B + MaR narrows the gap to frontier models, surpassing GPT-OSS-120B on overall average and outperforming stronger models on several individual benchmarks. Process-level analysis further shows substantial improvements in reasoning process quality. MaR also generalizes to out-of-domain datasets, where MaR-trained models improve over their corresponding base models on average.

Cached at: 05/25/26, 09:01 AM

# Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals
Source: [https://arxiv.org/abs/2605.23384](https://arxiv.org/abs/2605.23384)
[View PDF](https://arxiv.org/pdf/2605.23384)

> Abstract: Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limited guidance for intermediate reasoning behaviors. (2) Rubrics-as-reward (RaR) goes beyond final-answer checking by using natural-language rubrics to assess reasoning quality and task compliance, but often requires instance-specific rubrics and substantial design effort. To address these issues, we introduce Metacognition-as-Reward (MaR), a metacognition-inspired RL framework that guides LLM reasoning through two general process dimensions: i) metacognitive knowledge, which identifies task-relevant information without hand-crafted instance-specific rubrics, and ii) metacognitive regulation, which plans and adjusts the reasoning process to provide reward guidance beyond final-answer outcomes. MaR scaffolds model rollouts into explicit metacognitive components and optimizes them with a trajectory-level reward over task knowledge coverage, regulation fidelity, and final-answer correctness. In this way, MaR extends reward feedback to reasoning trajectories while grounding the reward signals in general metacognitive dimensions. Experiments on 22 benchmarks show that MaR consistently improves model performance, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO. Notably, Qwen3.5-9B + MaR narrows the gap to frontier models, surpassing GPT-OSS-120B on overall average and outperforming stronger models on several individual benchmarks. Process-level analysis further shows substantial improvements in reasoning process quality. MaR also generalizes to out-of-domain datasets, where MaR-trained models improve over their corresponding base models on average.
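
The abstract names three terms in the trajectory-level reward (task knowledge coverage, regulation fidelity, and final-answer correctness) but does not say how they are scored or combined. The sketch below is one possible reading, not the authors' implementation: the weighted sum, the weight values, the set-overlap proxy for coverage, the in-order step-matching proxy for fidelity, and every name in the code are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights: the abstract does not state how the terms are combined.
    knowledge: float = 0.25
    regulation: float = 0.25
    answer: float = 0.5


def knowledge_coverage(required: set[str], identified: set[str]) -> float:
    """Assumed proxy: fraction of task-relevant items the rollout identified."""
    if not required:
        return 1.0
    return len(required & identified) / len(required)


def regulation_fidelity(planned: list[str], executed: list[str]) -> float:
    """Assumed proxy: fraction of planned steps found, in order, in the executed steps."""
    if not planned:
        return 1.0
    position = 0  # search only after the last matched step to respect ordering
    followed = 0
    for step in planned:
        try:
            position = executed.index(step, position) + 1
            followed += 1
        except ValueError:
            continue  # planned step was skipped or done out of order
    return followed / len(planned)


def trajectory_reward(
    coverage: float,
    fidelity: float,
    correct: bool,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Assumed combination: weighted sum of the three terms named in the abstract."""
    return (
        weights.knowledge * coverage
        + weights.regulation * fidelity
        + weights.answer * float(correct)
    )


# Example rollout: one of two required facts identified, plan fully followed, answer correct.
coverage = knowledge_coverage({"a", "b"}, {"a"})
fidelity = regulation_fidelity(["p", "q"], ["p", "x", "q"])
reward = trajectory_reward(coverage, fidelity, correct=True)
```

Under these assumptions, a rollout that reaches the correct answer while missing task-relevant information or abandoning its plan earns less than one that does all three, which is the kind of guidance beyond final-answer outcomes the abstract describes.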

## Submission history

From: Sirui Chen [[view email](https://arxiv.org/show-email/0f9c2a67/2605.23384)] **[v1]** Fri, 22 May 2026 08:54:37 UTC (981 KB)

Similar Articles

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

arXiv cs.CL

This paper investigates whether reinforcement learning can improve the direct recall of parametric knowledge in LLMs beyond reasoning tasks. It demonstrates that RL with binary rewards yields significant gains in factual QA benchmarks by redistributing probability mass to unlock latent knowledge rather than acquiring new facts.

Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards

arXiv cs.CL

This paper identifies and addresses the problem of 'Miracle Steps' in LLM mathematical reasoning—unjustified jumps to correct answers that indicate reward hacking—by proposing Rubric Reward Model (RRM), a process-oriented reward function that evaluates entire reasoning trajectories. RRM achieves significant improvements on AIME2024 (26.7% to 62.6% Verified Pass@1024) and reduces Miracle Steps by 71%.