Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals
Summary
Introduces Metacognition-as-Reward (MaR), a reinforcement learning framework that guides LLM reasoning with metacognitive knowledge and regulation signals, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO across 22 benchmarks.
Cached at: 05/25/26, 09:01 AM
# Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

Source: [https://arxiv.org/abs/2605.23384](https://arxiv.org/abs/2605.23384)

[View PDF](https://arxiv.org/pdf/2605.23384)

> Abstract: Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limited guidance for intermediate reasoning behaviors. (2) Rubrics-as-reward (RaR) goes beyond final-answer checking by using natural-language rubrics to assess reasoning quality and task compliance, but often requires instance-specific rubrics and substantial design effort. To address these issues, we introduce Metacognition-as-Reward (MaR), a metacognition-inspired RL framework that guides LLM reasoning through two general process dimensions: i) metacognitive knowledge, which identifies task-relevant information without hand-crafted instance-specific rubrics, and ii) metacognitive regulation, which plans and adjusts the reasoning process to provide reward guidance beyond final-answer outcomes. MaR scaffolds model rollouts into explicit metacognitive components and optimizes them with a trajectory-level reward over task knowledge coverage, regulation fidelity, and final-answer correctness. In this way, MaR extends reward feedback to reasoning trajectories while grounding the reward signals in general metacognitive dimensions. Experiments on 22 benchmarks show that MaR consistently improves model performance, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO. Notably, Qwen3.5-9B + MaR narrows the gap to frontier models, surpassing GPT-OSS-120B on overall average and outperforming stronger models on several individual benchmarks. Process-level analysis further shows substantial improvements in reasoning process quality. MaR also generalizes to out-of-domain datasets, where MaR-trained models improve over their corresponding base models on average.

## Submission history

From: Sirui Chen

**[v1]** Fri, 22 May 2026 08:54:37 UTC (981 KB)
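
The abstract describes a trajectory-level reward over three components: task knowledge coverage, regulation fidelity, and final-answer correctness. It does not say how these are combined, so the following is only a minimal sketch that assumes a weighted sum. The names `RolloutScores` and `trajectory_reward`, the [0, 1] score ranges, and the default weights are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutScores:
    """Per-rollout scores (hypothetical container, not from the paper)."""
    knowledge_coverage: float   # assumed to lie in [0, 1]
    regulation_fidelity: float  # assumed to lie in [0, 1]
    answer_correct: bool        # verifiable final-answer check


def trajectory_reward(
    scores: RolloutScores,
    w_knowledge: float = 0.25,   # assumed weight, for illustration only
    w_regulation: float = 0.25,  # assumed weight, for illustration only
    w_answer: float = 0.5,       # assumed weight, for illustration only
) -> float:
    """Combine the three components into one scalar via a weighted sum.

    The weighted-sum form is an assumption; the abstract only names the
    three components of the trajectory-level reward.
    """
    for name, value in (
        ("knowledge_coverage", scores.knowledge_coverage),
        ("regulation_fidelity", scores.regulation_fidelity),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (
        w_knowledge * scores.knowledge_coverage
        + w_regulation * scores.regulation_fidelity
        + w_answer * float(scores.answer_correct)
    )


if __name__ == "__main__":
    # A rollout with partial process quality and a correct final answer.
    print(trajectory_reward(RolloutScores(0.5, 0.5, True)))
```

Under these assumed weights, a rollout with a wrong final answer can still earn partial reward from its process scores, which is one way a reward signal might extend beyond final-answer outcomes.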
Similar Articles
Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs
Introduces Latent Reward Steering (LRS), an adaptive inference-time framework that uses sparse autoencoder latent states and a learned reward model to implicitly promote cognitive behaviors like verification and backtracking in reasoning LLMs, improving performance across multiple models and benchmarks.
Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
This paper investigates whether reinforcement learning can improve the direct recall of parametric knowledge in LLMs beyond reasoning tasks. It demonstrates that RL with binary rewards yields significant gains in factual QA benchmarks by redistributing probability mass to unlock latent knowledge rather than acquiring new facts.
LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling
This paper proposes a metacognitive harness that separates monitoring from reasoning in LLMs, using pre-solve feeling-of-knowing and post-solve judgment-of-learning signals to control when to trust, retry, or aggregate answers, improving accuracy on text, code, and multimodal benchmarks without parameter updates.
Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards
This paper identifies and addresses the problem of 'Miracle Steps' in LLM mathematical reasoning (unjustified jumps to correct answers that indicate reward hacking) by proposing the Rubric Reward Model (RRM), a process-oriented reward function that evaluates entire reasoning trajectories. RRM achieves significant improvements on AIME2024 (26.7% to 62.6% Verified Pass@1024) and reduces Miracle Steps by 71%.
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
This paper introduces a mutual reasoning technique that enhances the problem-solving capabilities of smaller LLMs by iteratively refining candidate solutions through self-feedback and reward functions.