Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

arXiv cs.CL 05/25/26, 04:00 AM Papers

Summary

Introduces Metacognition-as-Reward (MaR), a reinforcement learning framework that rewards LLM reasoning along two metacognitive dimensions, knowledge and regulation, in addition to final-answer correctness. Across 22 benchmarks it gains up to 7.7% over the base model and up to 11.0% over vanilla DAPO.

arXiv:2605.23384v1 Announce Type: new Abstract: Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limited guidance for intermediate reasoning behaviors. (2) Rubrics-as-reward (RaR) goes beyond final-answer checking by using natural-language rubrics to assess reasoning quality and task compliance, but often requires instance-specific rubrics and substantial design effort. To address these issues, we introduce Metacognition-as-Reward (MaR), a metacognition-inspired RL framework that guides LLM reasoning through two general process dimensions: i) metacognitive knowledge, which identifies task-relevant information without hand-crafted instance-specific rubrics, and ii) metacognitive regulation, which plans and adjusts the reasoning process to provide reward guidance beyond final-answer outcomes. MaR scaffolds model rollouts into explicit metacognitive components and optimizes them with a trajectory-level reward over task knowledge coverage, regulation fidelity, and final-answer correctness. In this way, MaR extends reward feedback to reasoning trajectories while grounding the reward signals in general metacognitive dimensions. Experiments on 22 benchmarks show that MaR consistently improves model performance, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO. Notably, Qwen3.5-9B + MaR narrows the gap to frontier models, surpassing GPT-OSS-120B on overall average and outperforming stronger models on several individual benchmarks. Process-level analysis further shows substantial improvements in reasoning process quality. MaR also generalizes to out-of-domain datasets, where MaR-trained models improve over their corresponding base models on average.

# Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals
Source: [https://arxiv.org/abs/2605.23384](https://arxiv.org/abs/2605.23384)
[View PDF](https://arxiv.org/pdf/2605.23384)

> Abstract: Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limited guidance for intermediate reasoning behaviors. (2) Rubrics-as-reward (RaR) goes beyond final-answer checking by using natural-language rubrics to assess reasoning quality and task compliance, but often requires instance-specific rubrics and substantial design effort. To address these issues, we introduce Metacognition-as-Reward (MaR), a metacognition-inspired RL framework that guides LLM reasoning through two general process dimensions: i) metacognitive knowledge, which identifies task-relevant information without hand-crafted instance-specific rubrics, and ii) metacognitive regulation, which plans and adjusts the reasoning process to provide reward guidance beyond final-answer outcomes. MaR scaffolds model rollouts into explicit metacognitive components and optimizes them with a trajectory-level reward over task knowledge coverage, regulation fidelity, and final-answer correctness. In this way, MaR extends reward feedback to reasoning trajectories while grounding the reward signals in general metacognitive dimensions. Experiments on 22 benchmarks show that MaR consistently improves model performance, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO. Notably, Qwen3.5-9B + MaR narrows the gap to frontier models, surpassing GPT-OSS-120B on overall average and outperforming stronger models on several individual benchmarks. Process-level analysis further shows substantial improvements in reasoning process quality. MaR also generalizes to out-of-domain datasets, where MaR-trained models improve over their corresponding base models on average.
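
The abstract names three terms in the trajectory-level reward (task knowledge coverage, regulation fidelity, final-answer correctness) but does not say how they are scored or combined. The following is a minimal sketch under stated assumptions, not the authors' formulation: the `RolloutScores` fields, the weighted sum, and the default weights are all hypothetical, and each score is assumed to be already normalized to [0, 1]. The `group_advantages` helper shows the kind of within-group standardization used by GRPO-style methods; whether MaR's DAPO setup uses exactly this is not stated in the abstract.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class RolloutScores:
    """Per-rollout scores, each assumed to lie in [0, 1] (hypothetical schema)."""
    knowledge_coverage: float   # share of task-relevant knowledge the rollout identified
    regulation_fidelity: float  # how closely the reasoning followed its own plan
    answer_correct: float       # 1.0 if the final answer is verified correct, else 0.0


def trajectory_reward(s, w_know=0.25, w_reg=0.25, w_ans=0.5):
    """Hypothetical weighted sum of the three reward terms (weights are assumptions)."""
    return (
        w_know * s.knowledge_coverage
        + w_reg * s.regulation_fidelity
        + w_ans * s.answer_correct
    )


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards across the rollouts sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three illustrative rollouts for the same prompt (made-up scores).
rollouts = [
    RolloutScores(0.8, 0.9, 1.0),  # correct answer, well-regulated reasoning
    RolloutScores(0.8, 0.3, 1.0),  # correct answer, poorly regulated reasoning
    RolloutScores(0.2, 0.1, 0.0),  # wrong answer
]
rewards = [trajectory_reward(s) for s in rollouts]
advantages = group_advantages(rewards)
```

Under these assumed weights, the first two rollouts both reach the correct answer yet receive different rewards, which is the distinction an outcome-only reward cannot make.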

## Submission history

From: Sirui Chen **[v1]** Fri, 22 May 2026 08:54:37 UTC (981 KB)

Similar Articles

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling

Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards

Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
