Weak-to-Strong Elicitation via Mismatched Wrong Drafts

arXiv cs.CL Papers

Summary

The paper proposes injecting mismatched wrong drafts from a smaller, weaker model into a stronger learner's GRPO training context to elicit stronger reasoning. With Mathstral-7B as the learner, the recipe reaches 71.98% on MATH-500 and lifts pass@k on AIME 2025/2026 above both the base model and the draft model.

arXiv:2605.17314v1 Announce Type: new

Cached at: 05/19/26, 06:39 AM

# Weak-to-Strong Elicitation via Mismatched Wrong Drafts
Source: [https://arxiv.org/html/2605.17314](https://arxiv.org/html/2605.17314)
###### Abstract

We consider whether off-policy experience from a smaller, weaker model can elicit capability in a stronger learner that on-policy RL fine-tuning (e.g., GRPO) does not reach. We find that injecting mathematically *wrong* drafts from a smaller but more domain-trained model, *mismatched* to the current problem, into a stronger learner's GRPO context consistently outperforms standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. Concretely, we use Mathstral-7B as the learner, Qwen2.5-Math-1.5B as the draft model, 8.8K Level 3–5 MATH problems (with MATH-500 held out), and train with Dr. GRPO. Mismatch is an active ingredient: shuffling drafts to mismatched problems while holding everything else constant yields $+1.62$pp on MATH-500 (greedy pass@1) over the matched-wrong variant ($n=10$ seeds, $p=0.0015$, Welch's $t$). In fact, the mismatched-wrong variant leads all other variants we tested on MATH-500 across both greedy pass@1 and sampling pass@$k$. On out-of-distribution AIME 2025 and 2026, the mismatched-wrong variant uniquely lifts pass@$k$ above both Mathstral-7B (in its native [INST] format) and the Qwen2.5-Math-1.5B draft model at every sample budget from $k=1$ to $k=1024$ across 2 seeds ($+14.2$pp on 2025 and $+9.0$pp on 2026 at pass@1024 over Mathstral-7B), and at pass@1024 also leads no-draft, matched-wrong, and mismatched-correct variants on both years. All variants use the same prompt with no draft injection at test time. The recipe (trained on a single GPU with no SFT, no reward models, no synthesized data, and no produce-critique-revise inner loop) reaches $71.98\%$ MATH-500 on Mathstral-7B-v0.1, the highest published result on this model to our knowledge, surpassing the heavier WizardMath pipeline at $70.9\%$ on full MATH (SFT + PPO with process/instruction reward models).

![Refer to caption](https://arxiv.org/html/2605.17314v1/aime_passk_2025_2026.png)

Figure 1: On out-of-distribution AIME 2025 and 2026, the mismatched-wrong variant (ours, red) uniquely lifts pass@$k$ above both Mathstral-7B and the Qwen2.5-Math-1.5B draft model at every sample budget from $k=1$ to $k=1024$. Mathstral-7B is evaluated in its native [INST] chat format; all other variants use the training-matched nodraft prompt (literal N/A placeholder). $N=2048$ samples per problem at $T=0.6$, top-$p=0.95$, max 4096 completion tokens; mean across 2 seeds $s=\{42,137\}$.

## 1 Introduction

Several paradigms aim to improve large language model reasoning: supervised fine-tuning on *correct* traces, either from stronger models (e.g., DeepSeek-R1-Distill-Qwen (DeepSeek-AI, [2025](https://arxiv.org/html/2605.17314#bib.bib1))) or self-bootstrapped from the model's own correct rollouts (STaR (Zelikman et al., [2022](https://arxiv.org/html/2605.17314#bib.bib2)), Huang et al. ([2023](https://arxiv.org/html/2605.17314#bib.bib3))); iterative correction-and-refinement pipelines that produce, critique, and revise their own outputs (Madaan et al., [2023](https://arxiv.org/html/2605.17314#bib.bib4)), including RL-trained self-correction (Kumar et al., [2025](https://arxiv.org/html/2605.17314#bib.bib5)); reinforcement learning from human feedback (RLHF (Ouyang et al., [2022](https://arxiv.org/html/2605.17314#bib.bib6))), which trains against a learned reward model fitted on human preferences; and on-policy reinforcement learning with verifiable rewards (RLVR), most prominently GRPO (Shao et al., [2024](https://arxiv.org/html/2605.17314#bib.bib7)), which trains on the model's own rollouts using a verifier. On-policy RL is appealing because it requires no supervision beyond a verifier, but in its standard form the input distribution is narrow: each training prompt is the bare problem statement, and reward can only select among trajectories the strong model already samples in response. This is a recognized limitation: a growing line of empirical analyses argues that on-policy RL fine-tuning sharpens existing modes rather than expanding the base model's intrinsic coverage, with pass@$k$ at large $k$ often matching or falling below the base (Yue et al., [2025](https://arxiv.org/html/2605.17314#bib.bib8)).

A natural way to expand what the learner produces under GRPO rollouts, and therefore what reward can score and select, is to broaden the training prompt distribution while keeping the learner robust under the resulting training–inference distribution discrepancy. Consider another model that has been more domain-trained: it has seen more data and accumulated a record of attempts, mistakes, and partial solutions that remain uncharted and dormant in the learner. We focus on the special case where the other model is *smaller*, with different training experience from the learner, and ask whether its *wrong* draft traces, placed in the learner's prompt context window, can elicit capability that on-policy GRPO from bare prompts does not reach.

The answer hinges on a second choice: whether the injected draft is about the current problem or about a different one. With everything else fixed, namely the learner (Mathstral-7B), draft model (Qwen2.5-Math-1.5B), data ($\sim$8.8K Level 3–5 MATH problems with MATH-500 held out), algorithm (Dr. GRPO (Liu et al., [2025](https://arxiv.org/html/2605.17314#bib.bib9))), and eval protocol, we isolate two axes simultaneously: draft content (correct vs. wrong) and draft assignment (matched vs. mismatched). We compare the four resulting variants, a no-draft GRPO baseline, and the Mathstral-7B base on MATH-500 and out-of-distribution AIME 2025/2026. Only mismatched-wrong consistently exceeds no-draft GRPO on both evaluations and uniquely lifts pass@$k$ above both Mathstral-7B and the Qwen2.5-Math-1.5B draft model at every sample budget from $k=1$ to $k=1024$ on AIME 2025/2026 (Figure [1](https://arxiv.org/html/2605.17314#S0.F1)).

Both the mismatch step and the wrongness of the draft are active ingredients. We randomly select a draft with a wrong answer (avoiding wrong-but-quasi-correct drafts when possible), and shuffle it to a different problem; the draft, now about a different problem, implicitly *lifts* the training prompt to a more general but *masked* task, of which the original bare problem is a degenerate special case. The mismatched wrong draft is an *observation*: an off-policy trace of an attempt at a masked problem, sitting in context alongside the actual question. The strong model produces a solution from scratch in a single rollout per prompt, with no produce-critique-revise loop or second pass. The recipe is standard on-policy RL fine-tuning. GRPO's reward then selects, across rollouts, the solutions the strong model finds from its own intrinsic capabilities. Because the task of interest is a degenerate special case of the training prompt, the training–inference discrepancy is minimal. The weak model is not supervised-fine-tuning the strong learner (Burns et al., [2023](https://arxiv.org/html/2605.17314#bib.bib10)), and the strong learner is not correcting the weaker draft.

The recipe is materially simpler than the strongest published Mathstral-7B-v0.1 pipeline yet beats it: with a single GPU, no SFT, no reward models, no synthesized data, and no produce-critique-revise inner loop, the mismatched-wrong variant reaches $71.98\%$ on MATH-500 ($n=10$ seeds, 95% CI $\pm 0.80$pp). For reference, WizardMath (Luo et al., [2025](https://arxiv.org/html/2605.17314#bib.bib11)) reports 70.9% on full MATH using a synthesized SFT stage followed by PPO with both a process and an instruction reward model.

#### Contributions\.

- Weak-to-strong elicitation can simultaneously sharpen and expand the strong learner's coverage under on-policy RLVR with GRPO. Recent analyses argue that on-policy RL fine-tuning only sharpens existing modes. Our recipe is a counterexample: MATH-500 greedy pass@1 lifts by $+17.78$pp over Mathstral-7B base ($n=10$ seeds, $p<0.0001$) and pass@$k$ on out-of-distribution AIME 2025/2026 lifts above Mathstral-7B base at every sample budget from $k=1$ to $k=1024$ (2 seeds).
- We show that mismatch $\times$ wrongness is the active ingredient. We isolate the full $2\times 2$ (draft assignment matched/mismatched $\times$ draft content correct/wrong) variants under the same draft model, training data, and recipe; only the mismatched-wrong variant consistently lifts above the Mathstral-7B base.
- A small recipe that beats heavier pipelines on Mathstral-7B. $71.98\%$ on MATH-500, exceeding WizardMath's heavier 70.9% (full MATH), with a single-GPU, outcome-reward-only recipe.

## 2 Related Work

RLVR for mathematical reasoning. Reinforcement learning has driven much of the recent progress in LLMs for mathematics, exemplified by GRPO and descendants (Shao et al., [2024](https://arxiv.org/html/2605.17314#bib.bib7); Liu et al., [2025](https://arxiv.org/html/2605.17314#bib.bib9); Yu et al., [2025](https://arxiv.org/html/2605.17314#bib.bib12)) and the "zero"-style line of work showing that strong reasoning emerges directly from RL without an SFT stage (DeepSeek-AI, [2025](https://arxiv.org/html/2605.17314#bib.bib1); Hu et al., [2025](https://arxiv.org/html/2605.17314#bib.bib13); Zeng et al., [2025](https://arxiv.org/html/2605.17314#bib.bib14)). WizardMath (Luo et al., [2025](https://arxiv.org/html/2605.17314#bib.bib11)) represents the heavier end of the spectrum, combining synthesized SFT data with PPO and process/instruction reward models; it is our headline $70.9\%$ Mathstral-7B comparison. Our recipe uses Dr. GRPO (Liu et al., [2025](https://arxiv.org/html/2605.17314#bib.bib9)) unchanged, and the novelty sits at the *task* the learner is trained on.

Coverage vs. sharpening under RL post-training. A growing line of empirical analyses argues that on-policy RL fine-tuning sharpens existing modes while leaving the base model's pass@$k$ coverage at large $k$ unchanged or even reduced (Yue et al., [2025](https://arxiv.org/html/2605.17314#bib.bib8)); concurrently, methods that explicitly trade off generation diversity against quality during RL have been proposed (Li et al., [2025](https://arxiv.org/html/2605.17314#bib.bib15)). Our recipe is a counterexample to the sharpen-only reading (see §[4](https://arxiv.org/html/2605.17314#S4)).

Weak-to-strong and self-improvement. Prior approaches all use the weaker (or earlier) model as a supervision signal: weak-to-strong supervision distills a weaker model's labels into a stronger one (Burns et al., [2023](https://arxiv.org/html/2605.17314#bib.bib10)); self-bootstrapping methods iteratively retrain on the model's own correct rollouts filtered by reward (STaR (Zelikman et al., [2022](https://arxiv.org/html/2605.17314#bib.bib2)), ReST$^{EM}$ (Singh et al., [2024](https://arxiv.org/html/2605.17314#bib.bib16))); iterative correction-and-refinement pipelines train models to revise their own attempts via produce–critique–revise loops (Welleck et al., [2023](https://arxiv.org/html/2605.17314#bib.bib17)), and SCoRe (Kumar et al., [2025](https://arxiv.org/html/2605.17314#bib.bib5)) uses multi-turn RL and reward shaping to train models to correct their own first-attempt mistakes. Closest to our setting, both Burns et al. ([2023](https://arxiv.org/html/2605.17314#bib.bib10)) and Bansal et al. ([2025](https://arxiv.org/html/2605.17314#bib.bib18)) use a weaker model to produce supervised training data for a stronger one (labels and synthesized data respectively); we instead inject wrong drafts into the strong model's GRPO context window. In all of these prior approaches the weaker (or earlier) model serves as a teacher or starting point for revision; in ours, it is an off-policy explorer that lifts the training task to a more general one, while the loss remains on-policy with respect to the strong learner.

## 3 Method

### 3.1 Data

Training uses $\sim$8.8K Level 3–5 problems among the 12K problems in MATH (Hendrycks et al., [2021](https://arxiv.org/html/2605.17314#bib.bib19)) after removing the 500 problems of MATH-500 (Lightman et al., [2024](https://arxiv.org/html/2605.17314#bib.bib20)). Testing uses the held-out MATH-500 and AIME 2024/2025/2026 (MathArena, [2025](https://arxiv.org/html/2605.17314#bib.bib21)).

### 3.2 Wrong Drafts

For each training problem $x$, we sample 32 draft completions from the weaker model $\pi_{W}$ at temperature $T=0.8$, top-$p=0.95$, max 2560 completion tokens. We define a helper `mathematically_quasi_correct(·)` that runs math-verify (Kydlíček and Hugging Face, [2025](https://arxiv.org/html/2605.17314#bib.bib22)) against an answer extracted via a prioritized fallback chain: `\boxed{·}` first, then natural-language patterns ("the answer is X"), inline math expressions (`$…$`), and bare assignment lines ("var = VALUE"). Among the 32, we randomly sample a completion that is wrong and non-trivially so (`mathematically_quasi_correct=False`), falling back to one rejected by the strict boxed-only criterion if all are quasi-correct, and finally to any completion. The result is an offline paired set $\{(x, d^{-}_{x})\}_{x\in\mathcal{D}}$ with $\sim$8.8K problems, each carrying one selected draft, sampled once before RL training begins.
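
The three-tier selection described above can be sketched as follows. This is a minimal illustration, not the authors' code: `select_wrong_draft` and the two toy checkers are hypothetical names, and the checkers are simple string matches standing in for the math-verify based checks the paper uses.

```python
import random
from typing import Callable, Sequence

Checker = Callable[[str, str], bool]  # (completion, gold) -> judged correct?


def select_wrong_draft(
    drafts: Sequence[str],
    gold: str,
    quasi_correct: Checker,
    strict_boxed_correct: Checker,
    rng: random.Random,
) -> str:
    """Pick one draft for a problem, preferring non-trivially wrong ones."""
    # Tier 1: wrong even under the lenient (quasi-correct) check.
    clearly_wrong = [d for d in drafts if not quasi_correct(d, gold)]
    if clearly_wrong:
        return rng.choice(clearly_wrong)
    # Tier 2: every draft is quasi-correct; take one the strict check rejects.
    strictly_wrong = [d for d in drafts if not strict_boxed_correct(d, gold)]
    if strictly_wrong:
        return rng.choice(strictly_wrong)
    # Tier 3: nothing is wrong under either check; fall back to any draft.
    return rng.choice(list(drafts))


# Toy stand-ins for the two checkers (hypothetical; not math-verify).
def toy_quasi_correct(completion: str, gold: str) -> bool:
    return gold in completion


def toy_strict_boxed_correct(completion: str, gold: str) -> bool:
    return "\\boxed{" + gold + "}" in completion
```

In the paper's setting `drafts` would hold the 32 sampled completions for one problem; the selection runs once, offline, before RL training.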

### 3.3 Mismatched Wrong Drafts

We apply a random 1-1 derangement $\sigma:\mathcal{D}\to\mathcal{D}$, pairing each problem with the wrong draft of another problem:

$$\mathrm{train\;dataset} \;=\; \{(x,\, d^{-}_{\sigma(x)}) : x\in\mathcal{D}\}. \tag{1}$$

(In an unconstrained random permutation the expected number of fixed points is 1.) We then run on-policy Dr. GRPO (Liu et al., [2025](https://arxiv.org/html/2605.17314#bib.bib9)) on $\pi_{S}$ over augmented prompts $\tilde{x}=\mathrm{Template}(x, d^{-}_{\sigma(x)})$; rollouts and gradients remain on-policy with respect to $\pi_{S}$. The exact prompt template is shown in Figure [2](https://arxiv.org/html/2605.17314#S3.F2). The derangement is fixed once at the start of training.
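
One simple way to draw such a derangement is rejection sampling: shuffle, and retry whenever some index maps to itself. The sketch below assumes this approach (the paper does not say how $\sigma$ is sampled); `random_derangement` and `mismatch_pairs` are illustrative names. A uniform shuffle has no fixed points with probability close to $1/e$, so only a few retries are needed on average.

```python
import random


def random_derangement(n: int, rng: random.Random) -> list[int]:
    """Return a uniformly random permutation of range(n) with no fixed points."""
    if n < 2:
        raise ValueError("a derangement needs at least 2 items")
    perm = list(range(n))
    while True:
        rng.shuffle(perm)
        # Accept only if no problem is paired with its own draft.
        if all(perm[i] != i for i in range(n)):
            return list(perm)


def mismatch_pairs(problems: list[str], drafts: list[str], rng: random.Random):
    """Pair problem i with the draft originally selected for problem sigma[i]."""
    sigma = random_derangement(len(problems), rng)
    return [(problems[i], drafts[sigma[i]]) for i in range(len(problems))]
```

Seeding `rng` once and reusing the resulting pairs for the whole run mirrors the statement that the derangement is fixed at the start of training.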

Problem: \{problem\} Thinking: \{draft\} The thinking section may contain errors\. Solve the math problem step by step\. Write your own correct solution\. Put your final answer within \\boxed\{\}\. Correct Solution:

Figure 2: Prompt template. At training time, `{draft}` is the (mismatched, wrong) draft $d^{-}_{\sigma(x)}$. At evaluation time, `{draft}` is the literal string "N/A".

### 3.4 Reward

The reward is binary and outcome-only: $1$ if `mathematically_quasi_correct(completion, gold)` returns True, and $0$ otherwise. We opt for this lenient check rather than a strict boxed-only requirement to accelerate reward signal acquisition during training. We use no format, length, or process reward. We apply Dr. GRPO (Liu et al., [2025](https://arxiv.org/html/2605.17314#bib.bib9)) to make efficient use of our limited completion-length budget. Training details are in §[4.1](https://arxiv.org/html/2605.17314#S4.SS1).
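
To make the reward concrete, the sketch below scores one group of rollouts with a binary outcome reward and converts the rewards into group-relative advantages. It assumes the Dr. GRPO convention, as we understand it from Liu et al. (2025), of subtracting the group mean without dividing by the group standard deviation; the function names and the `quasi_correct` callable are illustrative, not the paper's code.

```python
from typing import Callable, Sequence


def binary_rewards(
    completions: Sequence[str],
    gold: str,
    quasi_correct: Callable[[str, str], bool],
) -> list[float]:
    """Outcome-only reward: 1.0 if the checker accepts the completion, else 0.0."""
    return [1.0 if quasi_correct(c, gold) else 0.0 for c in completions]


def mean_centered_advantages(rewards: Sequence[float]) -> list[float]:
    """Group-relative advantage: reward minus group mean (no std scaling, assumed)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

With a group of 4 rollouts of which one is correct, the correct rollout gets advantage $0.75$ and the others $-0.25$; a group that is all correct or all wrong yields zero advantage everywhere and so contributes no gradient signal.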

## 4 Experiments

### 4.1 Experimental setup

Training. We fine-tune Mathstral-7B (Mistral AI, [2024](https://arxiv.org/html/2605.17314#bib.bib23)) via LoRA adapters of rank $16$ on all 7 linear projections per transformer block (attention + MLP) (Hu et al., [2022](https://arxiv.org/html/2605.17314#bib.bib24)), drawing drafts from Qwen2.5-Math-1.5B (Yang et al., [2024](https://arxiv.org/html/2605.17314#bib.bib25)), on a single B200 GPU. Optimizer: AdamW with constant learning rate $5\times 10^{-6}$, $\beta_{2}=0.99$. RL config (Dr. GRPO): $\beta=0$ (no KL penalty), group size $G=16$, gradient accumulation $4$, 2222 steps (1 epoch). Generation: max completion length $4096$ tokens, max prompt length $3072$. Checkpoints saved every $50$ steps. Each run takes up to 30+ hours wall-clock. Implementation uses TRL (von Werra et al., [2020](https://arxiv.org/html/2605.17314#bib.bib26)), vLLM (Kwon et al., [2023](https://arxiv.org/html/2605.17314#bib.bib27)), and Unsloth (Han et al., [2023](https://arxiv.org/html/2605.17314#bib.bib28)).
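
The hyperparameters above can be collected in one place, which also allows a quick consistency check. The field names below are hypothetical and are not TRL or Unsloth argument names. The check assumes each accumulated micro-step processes one prompt, which the text does not state; under that assumption, 2222 steps with accumulation 4 cover 8888 prompts, in line with one epoch over roughly 8.8K problems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted from the setup paragraph; field names are illustrative."""
    lora_rank: int = 16
    learning_rate: float = 5e-6
    adam_beta2: float = 0.99
    kl_beta: float = 0.0            # no KL penalty
    group_size: int = 16            # rollouts per prompt (G)
    grad_accum_steps: int = 4
    total_steps: int = 2222         # 1 epoch
    max_completion_tokens: int = 4096
    max_prompt_tokens: int = 3072
    save_every_steps: int = 50


cfg = TrainConfig()
# Assumption: one prompt per accumulated micro-step.
prompts_seen = cfg.total_steps * cfg.grad_accum_steps
periodic_checkpoints = cfg.total_steps // cfg.save_every_steps
print(prompts_seen, periodic_checkpoints)
```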

Evaluation. Our evaluation spans the MATH-500 and AIME 2024–2026 datasets, tracking two primary metrics: (1) greedy pass@1 ($T=0$, max 4096 completion tokens) and (2) sampling pass@$k$ across various budgets ($N=256$ samples per problem for MATH-500, $N=2048$ for AIME, max 4096 completion tokens, $T=0.6$, top-$p=0.95$), calculated via the unbiased estimator from Chen et al. ([2021](https://arxiv.org/html/2605.17314#bib.bib29)). We maintain a consistent prompt template (Figure [2](https://arxiv.org/html/2605.17314#S3.F2)) during evaluation for our trained models and the Qwen2.5-Math-1.5B drafter. Specifically, the `{draft}` field is populated with the literal string "N/A" during evaluations (as well as during training of the no-draft variant). The only exception is the Mathstral-7B base model, which we test using its default [INST] chat format: `{problem}\n\n` followed by "Please reason step by step, and put your final answer within `\boxed{}`.", all enclosed in [INST]…[/INST] tokens. Section [4.3](https://arxiv.org/html/2605.17314#S4.SS3) confirms that our performance gains over Mathstral-7B remain valid despite this formatting difference.
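
For reference, the unbiased pass@$k$ estimator of Chen et al. (2021) for a single problem with $n$ samples, of which $c$ are correct, is $1 - \binom{n-c}{k}/\binom{n}{k}$. The sketch below uses the numerically stable product form; averaging over problems (not shown in the paper's text, but the usual convention) gives the benchmark score. `pass_at_k` and `mean_pass_at_k` are illustrative names.

```python
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n-c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    prob_all_wrong = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_wrong *= 1.0 - k / i
    return 1.0 - prob_all_wrong


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average the per-problem estimates over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

As a worked case in the AIME setting, a problem solved exactly once in $N=2048$ samples contributes $1 - (1 - 1024/2048) = 0.5$ to pass@1024, which is why rare successes matter so much at large $k$.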

### 4.2 MATH-500

The mismatched-wrong variant achieves $71.98\%$ on MATH-500 ($n=10$ seeds, 95% CI $\pm 0.80$pp), surpassing the heavier WizardMath pipeline at $70.9\%$ on full MATH (Table [1](https://arxiv.org/html/2605.17314#S4.T1)). Beyond greedy pass@1, the mismatched-wrong variant also leads on sampling pass@$k$ (Figure [3](https://arxiv.org/html/2605.17314#S4.F3)).
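
The abstract reports seed-level comparisons with Welch's $t$-test. As a reminder of what that test computes, the sketch below returns the Welch statistic and the Welch–Satterthwaite degrees of freedom for two independent samples of per-seed accuracies. It uses only the standard library and stops short of the p-value (which needs a Student-$t$ CDF); `welch_t` is an illustrative name, and no per-seed numbers from the paper are reproduced here.

```python
import math
import statistics
from typing import Sequence


def welch_t(a: Sequence[float], b: Sequence[float]) -> tuple[float, float]:
    """Welch's t statistic and approximate degrees of freedom (unequal variances)."""
    na, nb = len(a), len(b)
    # Per-sample variance of the mean, using the unbiased sample variance.
    va = statistics.variance(a) / na
    vb = statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

Unlike the pooled-variance $t$-test, this form does not assume the two variants have equal seed-to-seed variance, which is a reasonable default when comparing different training recipes.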

Table 1: Greedy pass@1 on MATH for methods fine-tuning Mathstral-7B-v0.1. WizardMath reports on the full MATH test (5,000 problems); our results are on the MATH-500 subset (Lightman et al., [2024](https://arxiv.org/html/2605.17314#bib.bib20)).

![Refer to caption](https://arxiv.org/html/2605.17314v1/math500_passk_overall_L5.png)

Figure 3: MATH-500 pass@$k$. Left: overall (500 problems). Right: Level 5 only (134 problems). 2-seed mean ($s=\{42,137\}$) for all 5 lines (Mathstral-7B base, no-draft GRPO, matched-wrong, mismatched-correct, mismatched-wrong).

### 4.3 AIME 2025 and 2026

Table 2: Mean across 2 seeds ($s=\{42,137\}$). Endpoint values of Figure [1](https://arxiv.org/html/2605.17314#S0.F1).

If the recipe merely sharpens the strong model's distribution, reweighting probability mass toward already-reachable solutions, its pass@$k$ curve at large $k$ should saturate at or below the base model. If it expands the policy's reachable set, the curve should dominate the baseline at every $k$. We probe this on out-of-distribution AIME 2025 and AIME 2026, where contamination of the underlying models is implausible (both years post-date the training cutoff of Mathstral-7B and Qwen2.5-Math-1.5B). The data falls on the side of expansion (Figure [1](https://arxiv.org/html/2605.17314#S0.F1), Table [2](https://arxiv.org/html/2605.17314#S4.T2)): $+14.2$pp on 2025 and $+9.0$pp on 2026 at $k=1024$ over Mathstral-7B in its native [INST] format. Table [2](https://arxiv.org/html/2605.17314#S4.T2) reports both prompting formats for completeness; within each model the two formats give comparable numbers, but the training-consistent format ([INST] for base, nodraft for the trained variants) generally does better.

Per-problem analysis. The overall improvement stems from large, concentrated gains on specific problems rather than marginal improvements across the board. Furthermore, these gains outweigh the losses in both frequency and magnitude. Out of 60 AIME 2025+2026 problems, $11$ see a pass@1024 increase of $\geq 30$pp over the Mathstral-7B baseline, compared to only $4$ that lose $\geq 30$pp. Similarly, $6$ problems gain $\geq 50$pp while only $3$ lose $\geq 50$pp (Figure [4](https://arxiv.org/html/2605.17314#S4.F4)).

We also observe $13$ "capability-creation" cases: instances where the baseline scores $0\%$ but our model achieves a positive pass rate. The most striking of these reach near-perfect success (e.g., AIME 2026 P8: $0\% \to 100\%$; AIME 2025 P15: $0\% \to 84.4\%$). Conversely, the inverse scenario, where our model collapses to $0\%$ on a problem the baseline could solve, is rare, occurring on just $2$ problems (AIME 2026 P22: $87.5\% \to 0\%$; AIME 2026 P15: $50\% \to 0\%$).
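
The per-problem tallies above follow from a simple comparison of two pass-rate tables. The sketch below shows the counting logic on the four problems whose values are quoted in the text; it is not the full 60-problem analysis, and `summarize_per_problem` and the dictionary keys are illustrative names.

```python
def summarize_per_problem(
    base: dict[str, float], ours: dict[str, float], threshold_pp: float = 30.0
) -> dict[str, int]:
    """Count large gains/losses and creation/collapse cases (rates in percent)."""
    return {
        "gain": sum(1 for p in base if ours[p] - base[p] >= threshold_pp),
        "loss": sum(1 for p in base if base[p] - ours[p] >= threshold_pp),
        # Capability creation: base never solves it, ours sometimes does.
        "created": sum(1 for p in base if base[p] == 0.0 and ours[p] > 0.0),
        # Inverse case: base sometimes solves it, ours never does.
        "collapsed": sum(1 for p in base if base[p] > 0.0 and ours[p] == 0.0),
    }


# pass@1024 values quoted in the text (base -> ours), in percent.
base = {"2026-P8": 0.0, "2025-P15": 0.0, "2026-P22": 87.5, "2026-P15": 50.0}
ours = {"2026-P8": 100.0, "2025-P15": 84.4, "2026-P22": 0.0, "2026-P15": 0.0}
summary = summarize_per_problem(base, ours)
```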

![Refer to caption](https://arxiv.org/html/2605.17314v1/aime_per_problem_scatter.png)

Figure 4: Per-problem AIME pass@1024 (one dot = one problem, 2-seed mean, slight diagonal jitter): mismatched-wrong (ours, nodraft) vs Mathstral-7B base ([INST]). Points above the diagonal indicate our variant wins. Green: capability-creation cases (base $0\%$, ours $>0\%$; $13$ problems). Red: inverse (ours $0\%$, base $>0\%$; $2$ problems).

### 4.4 AIME 2024

AIME 2024 predates the release of both Mathstral-7B-v0.1 and Qwen2.5-Math-1.5B, raising the possibility that one or both models were exposed to it during training. We exclude AIME 2024 from the headline claim and report it here for completeness, with caveats discussed in §[5](https://arxiv.org/html/2605.17314#S5) and reasoning-rigor results in §[D.3](https://arxiv.org/html/2605.17314#A4.SS3).

![Refer to caption](https://arxiv.org/html/2605.17314v1/aime2024_passk.png)

Figure 5: AIME 2024 pass@$k$ ($n=2048$ samples per problem, mean across 2 seeds $s\in\{42,137\}$). Base lags trained variants at low $k$ but catches up at $k=1024$ to within $\sim 1$pp of the leaders.

Table 3: Mean across 2 seeds ($s=\{42,137\}$). Endpoint values of Figure [5](https://arxiv.org/html/2605.17314#S4.F5).

![Refer to caption](https://arxiv.org/html/2605.17314v1/aime2024_per_problem_scatter.png)

Figure 6: Per-problem AIME 2024 pass@1024 (one dot = one problem, 2-seed mean, slight diagonal jitter): $y$-axis = mismatched-wrong (ours, nodraft), $x$-axis = Mathstral-7B base ([INST]). Of 30 problems: 12 above the diagonal, 11 on (gap $<0.1$pp; most are problems both models solve at $\sim$100%), 7 below. Green: capability-creation cases (base $0\%$, ours $>0\%$; $3$ problems). Red: inverse (ours $0\%$, base $>0\%$; $4$ problems). The asymmetry observed on AIME 2025/2026 (Figure [4](https://arxiv.org/html/2605.17314#S4.F4)) reverses on AIME 2024. Per §[D.3](https://arxiv.org/html/2605.17314#A4.SS3), however, the creation case I-6 yields a rigorous solution from our variant, while none of the $4$ inverse cases yield any valid solution.

Unlike on AIME 2025/2026, the mismatched-wrong recipe (ours, nodraft) does not lead the trained variants on AIME 2024: no-draft GRPO is $\sim 4.7$pp ahead at pass@$1024$ ($72.97\%$ vs $68.28\%$). Within the wrong-draft axis, however, the mismatched-wrong variant still leads matched-wrong by $1$–$4$pp across $k$ on AIME 2024 (Figure [5](https://arxiv.org/html/2605.17314#S4.F5)).

### 4.5 The $2\times 2$ variants vs. no-draft GRPO vs. Mathstral-7B

To isolate the active ingredient, we fix the learner (Mathstral-7B), draft model (Qwen2.5-Math-1.5B), training data ($\sim$8.8K Level 3–5 MATH problems), algorithm (Dr. GRPO), and vary two binary axes: draft assignment (matched to the current problem vs. shuffled to a different problem) and draft content (correct vs. wrong). We compare these four variants against no-draft GRPO and the Mathstral-7B base (Figure [7](https://arxiv.org/html/2605.17314#S4.F7), Table [4](https://arxiv.org/html/2605.17314#S4.T4)). Together they reveal a strict interaction effect: neither mismatch alone nor wrongness alone advances the policy. Only their intersection, mismatched-wrong, consistently outperforms no-draft GRPO.

![Refer to caption](https://arxiv.org/html/2605.17314v1/v6X_train_metrics_compare.png)

Figure 7: Training dynamics across the $2\times 2$ + no-draft GRPO, 1 epoch. Left: completion entropy in nats. Right: completion length per step in tokens. Both panels show a rolling mean over a 20-step window for clarity. A few patterns are visible: (i) *Correct+Matched* collapses into a copying shortcut. (ii) *Correct+Mismatched* sits below no-draft GRPO on entropy, while both wrong-draft variants lie above it, suggesting that correct content in context constrains the rollout distribution while wrong content widens it. (iii) *Wrong+Mismatched (Ours)* reaches the longest completions and the highest entropy throughout training, suggesting reasoning development.

Table 4: MATH-500 greedy pass@1 accuracy (%, 10-seed mean) by difficulty level, comparing the $2\times 2$ ablation (draft assignment $\times$ draft content) against no-draft GRPO and Mathstral-7B base.

The other three quadrants each fail differently. With a matched, correct draft, the policy collapses into a *copying shortcut*: completion entropy plummets and rollouts shrink to near-direct-copy (Figure [7](https://arxiv.org/html/2605.17314#S4.F7)), and the model has learned to extract the visible answer without reasoning. Matched-wrong drafts let the policy fall into an *anchoring trap*: the relevant-but-wrong trace acts as a local-optimum prior, and the policy stays near it and edits it minimally into reward. Mismatched-correct drafts fail for two reasons: (i) the correct draft acts as a reasoning analogy, in that the strong learner can often infer the masked problem from the correct draft (Morris et al., [2024](https://arxiv.org/html/2605.17314#bib.bib30)) and anchor its reasoning on the solution to the inferred problem; (ii) nontrivially-wrong traces are more information-dense than correct ones.

The mismatched\-wrong variant avoids all three failure modes\. The path of least resistance is for the learner to reason from its own intrinsic capabilities\. Consistent with this, mismatched\-wrong sustains the highest completion entropy and longest rollouts during training \(Figure[7](https://arxiv.org/html/2605.17314#S4.F7)\)\.
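
The two training-dynamics quantities discussed here, entropy in nats and a 20-step rolling mean, can be sketched as below. How the paper aggregates token-level entropy into a per-step "completion entropy" is not stated, so `entropy_nats` is shown for a single next-token distribution only, and the trailing-window form of `rolling_mean` is an assumption.

```python
import math
from typing import Sequence


def entropy_nats(probs: Sequence[float]) -> float:
    """Shannon entropy -sum(p * ln p) of one probability distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def rolling_mean(values: Sequence[float], window: int = 20) -> list[float]:
    """Trailing rolling mean; early entries average over the steps seen so far."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

A distribution concentrated on one token has entropy 0, which is the signature of the copying-shortcut collapse; a uniform distribution over two tokens has entropy $\ln 2 \approx 0.693$ nats.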

## 5 Discussion

Capability expansion under on-policy RL. A growing consensus in recent empirical analyses suggests that on-policy RL fine-tuning merely sharpens a model's existing modes, reweighting probability mass toward already-reachable solutions without expanding the base model's intrinsic coverage at large $k$. Our results challenge that reading. By lifting the training task to a more general one, our recipe yields strict pass@$k$ improvements at large sample budgets on out-of-distribution AIME 2025/2026 (§[4.3](https://arxiv.org/html/2605.17314#S4.SS3)), demonstrating that under the same on-policy GRPO algorithm, an altered context distribution can drive capability *expansion* rather than mere sharpening.

Closing optimization shortcuts. The $2\times 2$ ablation (§[4.5](https://arxiv.org/html/2605.17314#S4.SS5)) makes a case for *what fails*: a copying-shortcut collapse, an anchoring trap, and information-density loss. Why the remaining mismatched-wrong quadrant *succeeds* is less pinned down. Our working hypothesis is that closing all three failure modes pushes the strong learner to fall back on its own intrinsic capability. But success may also hinge on the draft model's training, on training data distribution and curation (we did little here), and possibly more. A precise characterization is left to future work.

Eliciting latent capability. We view this dynamic as an instance of *eliciting latent knowledge* (Christiano et al., [2021](https://arxiv.org/html/2605.17314#bib.bib31)): surfacing reasoning capabilities already present but dormant in the model by applying an appropriate contextual transformation. The weak draft model acts not as a teacher providing explicit supervision but as a contextual probe: its off-policy traces alter the path of least resistance and expose the strong learner to regions of the solution space it wouldn't otherwise explore, surfacing capabilities that don't emerge from a collapsed bare prompt.

Caveats and limitations. On AIME 2024 the recipe does not outperform no-draft GRPO at high $k$ (§[4.4](https://arxiv.org/html/2605.17314#S4.SS4)). Our working explanation is test-distribution contamination in either the learner or the draft model, but this is one hypothesis that we have not verified; understanding why AIME 2024 differs from AIME 2025/2026 under this recipe is open future work. The recipe also relies on the strong model carrying the latent capability to pick up the lifted task within a finite generation budget (4096 tokens in our experiments); tasks demanding substantially longer reasoning chains, or capability genuinely beyond the model's intrinsic reach, would require a different setup. Our recipe uses outcome-only reward, which carries a reward-hacking risk amplified by AIME's finite answer space ($[0,999]$): solutions can be scored correct despite mathematically wrong reasoning. A rigor scan on 239 correct rollouts (Appendix [D.3](https://arxiv.org/html/2605.17314#A4.SS3)) finds this pattern is pervasive ($96.7\%$ reward-hacked), affecting all trained model variants as well as the Mathstral-7B base. Specific cases are documented in Appendix [C.2](https://arxiv.org/html/2605.17314#A3.SS2) (our trained variant) and Appendix [C.3](https://arxiv.org/html/2605.17314#A3.SS3) (Mathstral-7B base). Addressing this is open future work. Finally, our results come from a single learner, a single draft model, and a single domain; whether the recipe transfers to other models, domains, or scales is open.
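
To see why a finite answer space matters at large sample budgets, consider a deliberately crude reference point: a guesser that draws integers uniformly and independently from $[0,999]$. This is an illustrative assumption, not a model of how any evaluated system behaves, and `chance_pass_at_k` is a hypothetical helper. Under it, the chance of hitting the gold answer at least once in $k$ tries is $1-(1-1/1000)^{k}$.

```python
def chance_pass_at_k(k: int, answer_space: int = 1000) -> float:
    """P(at least one of k independent uniform guesses equals the gold answer)."""
    return 1.0 - (1.0 - 1.0 / answer_space) ** k


# Reference values at a few sample budgets (uniform-guess assumption only).
for k in (1, 64, 1024):
    print(k, round(chance_pass_at_k(k), 3))
```

At $k=1024$ this evaluates to roughly $0.64$, so answer-only scoring at large $k$ cannot by itself distinguish sound reasoning from broad coverage of the answer range, which is the concern the rigor scan addresses.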

## 6 Conclusion

We demonstrate that weak\-to\-strong elicitation can simultaneously sharpen and expand a strong learner’s reasoning coverage under on\-policy RLVR, challenging the assumption that on\-policy RL strictly sharpens existing modes\. The active ingredient is the interaction of two axes: the draft is nontrivially wrong, and permuted to a mismatched problem\. This combination injects off\-policy tokens into the context while closing optimization shortcuts, pushing the learner to elicit its own intrinsic reasoning capabilities\. We see this as a small step toward injecting off\-policy exploration into on\-policy RL post\-training\.

## References

- DeepSeek\-AI \[2025\] DeepSeek\-AI\. DeepSeek\-R1: Incentivizing reasoning capability in LLMs via reinforcement learning\. *arXiv preprint arXiv:2501\.12948*, 2025\.
- Zelikman et al\. \[2022\] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D\. Goodman\. STaR: Bootstrapping reasoning with reasoning\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*, 2022\.
- Huang et al\. \[2023\] Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han\. Large language models can self\-improve\. In *Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, 2023\. URL [https://aclanthology\.org/2023\.emnlp\-main\.67/](https://aclanthology.org/2023.emnlp-main.67/)\.
- Madaan et al\. \[2023\] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark\. Self\-Refine: Iterative refinement with self\-feedback\. *arXiv preprint arXiv:2303\.17651*, 2023\.
- Kumar et al\. \[2025\] Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D\. Co\-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M\. Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, and Aleksandra Faust\. Training language models to self\-correct via reinforcement learning\. In *International Conference on Learning Representations \(ICLR\)*, 2025\.
- Ouyang et al\. \[2022\] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L\. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe\. Training language models to follow instructions with human feedback\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*, 2022\.
- Shao et al\. \[2024\] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y\. K\. Li, Y\. Wu, and Daya Guo\. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models\. *arXiv preprint arXiv:2402\.03300*, 2024\.
- Yue et al\. \[2025\] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang\. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In *Advances in Neural Information Processing Systems \(NeurIPS\)*, 2025\. URL [https://openreview\.net/forum?id=4OsgYD7em5](https://openreview.net/forum?id=4OsgYD7em5)\.
- Liu et al\. \[2025\] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin\. Understanding R1\-Zero\-like training: A critical perspective\. In *Conference on Language Modeling \(COLM\)*, 2025\. URL [https://openreview\.net/forum?id=5PAF7PAY2Y](https://openreview.net/forum?id=5PAF7PAY2Y)\.
- Burns et al\. \[2023\] Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu\. Weak\-to\-strong generalization: Eliciting strong capabilities with weak supervision\. *arXiv preprint arXiv:2312\.09390*, 2023\.
- Luo et al\. \[2025\] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jian\-Guang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Yansong Tang, and Dongmei Zhang\. WizardMath: Empowering mathematical reasoning for large language models via reinforced evol\-instruct\. In *International Conference on Learning Representations \(ICLR\)*, 2025\.
- Yu et al\. \[2025\] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al\. DAPO: An open\-source LLM reinforcement learning system at scale\. *arXiv preprint arXiv:2503\.14476*, 2025\.
- Hu et al\. \[2025\] Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung\-Yeung Shum\. Open\-Reasoner\-Zero: An open source approach to scaling up reinforcement learning on the base model\. *arXiv preprint arXiv:2503\.24290*, 2025\.
- Zeng et al\. \[2025\] Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He\. SimpleRL\-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild\. *arXiv preprint arXiv:2503\.18892*, 2025\.
- Li et al\. \[2025\] Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, and Tianlu Wang\. Jointly reinforcing diversity and quality in language model generations\. *arXiv preprint arXiv:2509\.02534*, 2025\.
- Singh et al\. \[2024\] Avi Singh, John D\. Co\-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J\. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron T\. Parisi, Abhishek Kumar, Alexander A\. Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Fathy Elsayed, Hanie Sedghi, Igor Mordatch, et al\. Beyond human data: Scaling self\-training for problem\-solving with language models\. *Transactions on Machine Learning Research*, 2024\.
- Welleck et al\. \[2023\] Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi\. Generating sequences by learning to self\-correct\. In *International Conference on Learning Representations \(ICLR\)*, 2023\.
- Bansal et al\. \[2025\] Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q\. Tran, and Mehran Kazemi\. Smaller, weaker, yet better: Training LLM reasoners via compute\-optimal sampling\. In *International Conference on Learning Representations \(ICLR\)*, 2025\. URL [https://openreview\.net/forum?id=3OyaXFQuDl](https://openreview.net/forum?id=3OyaXFQuDl)\.
- Hendrycks et al\. \[2021\] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt\. Measuring mathematical problem solving with the MATH dataset\. In *Advances in Neural Information Processing Systems Datasets and Benchmarks Track*, 2021\.
- Lightman et al\. \[2024\] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe\. Let’s verify step by step\. In *International Conference on Learning Representations \(ICLR\)*, 2024\.
- MathArena \[2025\] MathArena\. MathArena: Evaluating LLMs on uncontaminated math olympiad problems\. [https://matharena\.ai](https://matharena.ai/), 2025\.
- Kydlíček and Hugging Face \[2025\] Hynek Kydlíček and Hugging Face\. Math\-Verify: Robust mathematical expression verification for language models\. [https://github\.com/huggingface/Math\-Verify](https://github.com/huggingface/Math-Verify), 2025\.
- Mistral AI \[2024\] Mistral AI\. Mathstral 7B\. [https://huggingface\.co/mistralai/Mathstral\-7B\-v0\.1](https://huggingface.co/mistralai/Mathstral-7B-v0.1), 2024\.
- Hu et al\. \[2022\] Edward J\. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen\-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen\. LoRA: Low\-rank adaptation of large language models\. In *International Conference on Learning Representations \(ICLR\)*, 2022\.
- Yang et al\. \[2024\] An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al\. Qwen2\.5\-Math technical report: Toward mathematical expert model via self\-improvement\. *arXiv preprint arXiv:2409\.12122*, 2024\.
- von Werra et al\. \[2020\] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, and Shengyi Huang\. TRL: Transformer reinforcement learning\. [https://github\.com/huggingface/trl](https://github.com/huggingface/trl), 2020\.
- Kwon et al\. \[2023\] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E\. Gonzalez, Hao Zhang, and Ion Stoica\. Efficient memory management for large language model serving with PagedAttention\. In *Proceedings of the ACM Symposium on Operating Systems Principles \(SOSP\)*, 2023\.
- Han et al\. \[2023\] Daniel Han, Michael Han, and Unsloth team\. Unsloth: Fast and memory\-efficient fine\-tuning of LLMs\. [https://github\.com/unslothai/unsloth](https://github.com/unslothai/unsloth), 2023\.
- Chen et al\. \[2021\] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al\. Evaluating large language models trained on code\. *arXiv preprint arXiv:2107\.03374*, 2021\.
- Morris et al\. \[2024\] John Xavier Morris, Wenting Zhao, Justin T\. Chiu, Vitaly Shmatikov, and Alexander M\. Rush\. Language model inversion\. In *International Conference on Learning Representations \(ICLR\)*, 2024\.
- Christiano et al\. \[2021\] Paul Christiano, Ajeya Cotra, and Mark Xu\. Eliciting latent knowledge: How to tell if your eyes deceive you\. Alignment Research Center technical report, 2021\.

## Appendix A Experimental setup details

Table [5](https://arxiv.org/html/2605.17314#A1.T5) consolidates the settings underlying our experiments\.

Table 5: Experimental setup details\.

## Appendix B MATH\-500 greedy pass@1 trajectories across late\-training checkpoints

We report MATH\-500 greedy pass@1 accuracy \(mean $\pm$ std, $T{=}0$, no\-draft prompt\) evaluated at 50\-step intervals across late\-training checkpoints\. This sweep establishes our checkpoint\-selection protocol: we select ckpt\-2000 as our headline checkpoint because it performs best globally across all four variants, ensuring our results are not cherry\-picked\. Unless otherwise noted, all cells use $n{=}10$ seeds\. We additionally evaluated Correct \+ Mismatched at $n{=}4$ seeds outside this sweep, yielding $69.35 \pm 0.90$ \(Overall\) / $40.30 \pm 1.22$ \(L5\) at ckpt\-1950 and $69.10 \pm 0.89$ / $38.06 \pm 2.72$ at ckpt\-2222\.

| ckpt | No\-draft GRPO | Wrong \+ Matched | Correct \+ Mismatched\* | Wrong \+ Mismatched \(Ours\) |
|---|---|---|---|---|
| *Overall MATH\-500* | | | | |
| 2000 | $70.82 \pm 0.74$ | $70.36 \pm 1.09$ | $69.02 \pm 0.79$ | $\mathbf{71.98 \pm 0.80}$ |
| 2050 | $68.90 \pm 0.87$ | $68.88 \pm 1.03$ | $68.75 \pm 0.66$ | $71.34 \pm 0.71$ |
| 2100 | $70.16 \pm 0.85$ | $68.64 \pm 0.91$ | — | $71.40 \pm 0.74$ |
| 2150 | $70.04 \pm 0.88$ | $68.54 \pm 0.81$ | — | $71.08 \pm 0.69$ |
| 2200 | $67.90 \pm 0.61$ | $69.16 \pm 0.93$ | — | $70.22 \pm 1.15$ |
| *L5 only* | | | | |
| 2000 | $41.79 \pm 1.45$ | $43.13 \pm 1.98$ | $42.61 \pm 1.74$ | $44.48 \pm 1.90$ |
| 2050 | $40.90 \pm 1.16$ | $40.30 \pm 1.65$ | $40.49 \pm 1.54$ | $\mathbf{47.39 \pm 1.23}$ |
| 2100 | $40.00 \pm 2.06$ | $41.42 \pm 2.12$ | — | $43.51 \pm 2.11$ |
| 2150 | $39.63 \pm 2.06$ | $41.42 \pm 2.75$ | — | $41.72 \pm 1.88$ |
| 2200 | $34.85 \pm 1.41$ | $39.78 \pm 1.54$ | — | $41.79 \pm 3.07$ |

Table 6: Wrong \+ Mismatched \(Ours\) maintains a consistent lead\. \*Correct \+ Mismatched uses $n{=}4$ seeds at ckpt\-2050\.
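
The ckpt\-2000 Overall row is the source of the headline $+1.62$pp gap between Wrong \+ Mismatched and Wrong \+ Matched, which the abstract tests with Welch's $t$\. As a rough sanity check, the statistic and its degrees of freedom can be recomputed from the summary numbers alone\. This is a minimal sketch that assumes the reported $\pm$ values are sample standard deviations over the 10 seeds \(the table does not say so explicitly\); the function name is illustrative\.

```python
from math import sqrt


def welch_from_summary(mean1, std1, n1, mean2, std2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    computed from per-group mean, standard deviation, and sample size."""
    v1 = std1 ** 2 / n1  # squared standard error, group 1
    v2 = std2 ** 2 / n2  # squared standard error, group 2
    t_stat = (mean1 - mean2) / sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, dof


# ckpt-2000, Overall MATH-500: Wrong + Mismatched vs. Wrong + Matched.
# Assumption: the +/- values are sample standard deviations over n = 10 seeds.
t_stat, dof = welch_from_summary(71.98, 0.80, 10, 70.36, 1.09, 10)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Turning the statistic into a two\-sided $p$\-value additionally needs the Student\-$t$ survival function \(for example `scipy.stats.t.sf`\), which is left out here to keep the sketch dependency\-free\.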
## Appendix C AIME 2025/2026 case studies

To contextualize the quantitative pass@$1024$ metrics, we present three qualitative case studies examining the raw reasoning traces of both the Mathstral\-7B base model and our mismatched\-wrong variant\. These examples are specifically selected to illustrate true mathematical capability expansion, as well as the reward\-hacking vulnerabilities of outcome\-only RLVR discussed in §[5](https://arxiv.org/html/2605.17314#S5)\.

- Genuine Capability Creation \(AIME 2026 P8, §[C\.1](https://arxiv.org/html/2605.17314#A3.SS1)\): The base model fails completely \($0\%$\), while our variant achieves $100\%$ pass@$1024$\. The sample trace demonstrates a true positive: the model arrives at the correct final answer through a rigorous, mathematically valid derivation\.
- Reward\-Hacked Capability Creation \(AIME 2025 P15, §[C\.2](https://arxiv.org/html/2605.17314#A3.SS2)\): The base model fails \($0\%$\), while our variant achieves $84.4\%$ pass@$1024$\. However, the sample trace reveals a false positive: the model reaches the correct final numerical answer via wrong logic, concretely illustrating the outcome\-reward caveat from §[5](https://arxiv.org/html/2605.17314#S5)\.
- Reward\-Hacked Baseline / Inverse Case \(AIME 2026 P22, §[C\.3](https://arxiv.org/html/2605.17314#A3.SS3)\): The base model achieves $87.5\%$ pass@$1024$ while our variant fails completely \($0\%$\)\. Crucially, analysis reveals that all $6$ of the base model’s successful traces \(across $4096$ rollouts\) rely on mathematically wrong reasoning to land on the correct integer, demonstrating that the Mathstral\-7B base is equally susceptible to the reward\-hacking failure mode\.
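
High pass@$1024$ values can coexist with only a handful of correct rollouts, which is easy to see with the unbiased pass@$k$ estimator of Chen et al\. \[2021\]\. The sketch below assumes that this estimator is applied per seed with $n = 2048$ samples; that is an assumption, since this appendix does not spell out the estimator, and the three\-correct\-per\-seed split in the example is purely illustrative\.

```python
from math import prod


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k),
    for n samples of which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    # The binomial ratio written as a product, avoiding huge intermediate values.
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))


# Hypothetical split: 3 correct rollouts out of 2048 in a single seed.
print(f"{pass_at_k(2048, 3, 1024):.3f}")
```

Under these assumptions, three correct rollouts in one seed already give a pass@$1024$ estimate of roughly $0.875$, so a large pass@$k$ at high $k$ says little about how often, or how rigorously, a problem is solved\.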

### C\.1 AIME 2026 Problem 8

Problem\. Let $N$ be the number of positive integer divisors of $17017^{17}$ that leave a remainder of $5$ when divided by $12$\. Find the remainder when $N$ is divided by $1000$\.

Gold answer\. $244$\.

Raw correctness counts \($n{=}2048$ samples per cell\):

Mathstral\-7B never solves this problem \($0/4096$\); Mismatched \+ Wrong solves it $74$ times\. Below is a correct Mismatched \+ Wrong sample:

> 1\. Prime Factorization of 17017: We start by finding the prime factorization of 17017\.
> $$17017 = 17 \times 1001$$
> Next, we factorize 1001:
> $$1001 = 7 \times 143$$
> Further factorizing 143: Therefore, the prime factorization of 17017 is:
> $$17017 = 17 \times 7 \times 11 \times 13$$
>
> 2\. Exponentiation: We need to find the number of divisors of $17017^{17}$\. First, we write the prime factorization of $17017^{17}$:
> $$17017^{17} = (17 \times 7 \times 11 \times 13)^{17} = 17^{17} \times 7^{17} \times 11^{17} \times 13^{17}$$
>
> 3\. Divisors of $17017^{17}$: Any positive divisor of $17017^{17}$ can be written as $17^{a} \times 7^{b} \times 11^{c} \times 13^{d}$, where $0 \leq a, b, c, d \leq 17$\.
>
> 4\. Condition for Divisibility by 12: We need to count the number of divisors of $17017^{17}$ that leave a remainder of 5 when divided by 12\. This is equivalent to finding the number of divisors $n$ such that:
> $$n \equiv 5 \pmod{12}$$
>
> 5\. Using Chinese Remainder Theorem: We will use the Chinese Remainder Theorem to solve the congruence for each prime factor’s exponent\. Recall that a number $n$ can be expressed as:
> $$n = 17^{a} \times 7^{b} \times 11^{c} \times 13^{d}$$
> We need $n \equiv 5 \pmod{12}$\.
>
> 6\. Modulo 12 Analysis: Note that $12 = 3 \times 4$\. We analyze the conditions modulo 3 and modulo 4 separately\.
>
> \- Modulo 3: The prime factors involved are $17, 7, 11,$ and $13$\. We reduce these modulo 3:
> $$17 \equiv 2 \pmod{3},\quad 7 \equiv 1 \pmod{3},\quad 11 \equiv 2 \pmod{3},\quad 13 \equiv 1 \pmod{3}$$
> Therefore:
> $$17^{a} \times 7^{b} \times 11^{c} \times 13^{d} \equiv 2^{a} \times 1^{b} \times 2^{c} \times 1^{d} \equiv 2^{a+c} \pmod{3}$$
> We need:
> $$2^{a+c} \equiv 5 \pmod{3}$$
> Simplifying $5 \pmod{3}$:
> $$5 \equiv 2 \pmod{3}$$
> Therefore:
> $$2^{a+c} \equiv 2 \pmod{3}$$
> This is true if and only if $a+c$ is odd\.
>
> \- Modulo 4: We reduce the prime factors modulo 4:
> $$17 \equiv 1 \pmod{4},\quad 7 \equiv 3 \pmod{4},\quad 11 \equiv 3 \pmod{4},\quad 13 \equiv 1 \pmod{4}$$
> Therefore:
> $$17^{a} \times 7^{b} \times 11^{c} \times 13^{d} \equiv 1^{a} \times 3^{b} \times 3^{c} \times 1^{d} \equiv 3^{b+c} \pmod{4}$$
> We need:
> $$3^{b+c} \equiv 5 \pmod{4}$$
> Simplifying $5 \pmod{4}$:
> $$5 \equiv 1 \pmod{4}$$
> Therefore:
> $$3^{b+c} \equiv 1 \pmod{4}$$
> This is true if and only if $b+c$ is even\.
>
> 7\. Combining Conditions: We need $a+c$ to be odd and $b+c$ to be even\.
>
> \- $a+c$ is odd: $a$ and $c$ must have different parities \(one is even, the other is odd\)\.
>
> \- $b+c$ is even: $b$ and $c$ must have the same parity \(both are even or both are odd\)\.
>
> 8\. Counting Valid Combinations: We count the number of valid combinations for $a, b, c, d$:
>
> \- $a$ can be any of $0, 1, 2, \ldots, 17$ \(18 choices\)\.
>
> \- $c$ must have opposite parity to $a$\. If $a$ is even, $c$ is odd, and vice versa\. This gives us 9 choices for $c$ \(since half of 18 is 9\)\.
>
> \- $b$ must have the same parity as $c$\. This also gives us 9 choices for $b$\.
>
> \- $d$ can be any of $0, 1, 2, \ldots, 17$ \(18 choices\)\.
>
> Therefore, the total number of valid divisors is:
> $$18 \times 9 \times 9 \times 18$$
>
> 9\. Calculation:
> $$162 \times 9 = 1458$$
> $$1458 \times 18 = 26244$$
>
> 10\. Finding Remainder when Divided by 1000: We need to find $26244 \bmod 1000$:
> $$26244 \div 1000 = 26 \text{ remainder } 244$$
> Therefore:
> $$26244 \equiv 244 \pmod{1000}$$
>
> Conclusion:
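
The count in the trace above can be cross\-checked by brute force, independently of the parity argument: enumerate all $18^4$ exponent tuples and keep those whose product is $5 \pmod{12}$\. This is only an illustrative check written for this appendix \(the function name and defaults are not from the paper\)\.

```python
from itertools import product


def count_divisors_with_residue(primes=(17, 7, 11, 13), max_exp=17,
                                modulus=12, target=5):
    """Count divisors p1^a * p2^b * ... (0 <= exponents <= max_exp)
    whose residue modulo `modulus` equals `target`."""
    # Residues of every prime power, computed once per prime.
    residues = [[pow(p, e, modulus) for e in range(max_exp + 1)]
                for p in primes]
    count = 0
    for combo in product(*residues):
        value = 1
        for r in combo:
            value = value * r % modulus
        if value == target:
            count += 1
    return count


n_divisors = count_divisors_with_residue()
print(n_divisors, n_divisors % 1000)
```

The enumeration agrees with the trace's $18 \times 9 \times 9 \times 18$ count and with the gold answer, which supports the "true positive" reading of this sample\.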

### C\.2 AIME 2025 Problem 15

Problem\. Let $N$ denote the number of ordered triples of positive integers $(a, b, c)$ such that $a, b, c \leq 3^{6}$ and $a^{3} + b^{3} + c^{3}$ is a multiple of $3^{7}$\. Find the remainder when $N$ is divided by $1000$\.

Gold answer\. $735$\.

Raw correctness counts \($n{=}2048$ samples per cell\):

Mathstral\-7B never solves this problem \($0/4096$\); Mismatched \+ Wrong solves it $6$ times\. The Mismatched \+ Wrong sample below reaches the correct final answer $735$, but the underlying reasoning is wrong: the asserted lifting $x^{3} \equiv x \pmod{27}$ is incorrect \($2^{3} = 8 \not\equiv 2 \pmod{27}$\), illustrating the reward\-hacking caveat from §[5](https://arxiv.org/html/2605.17314#S5)\.

> We start by analyzing the conditions under which $a^{3} + b^{3} + c^{3}$ is a multiple of $3^{7}$\.
>
> #### Step 1: Analyzing $a^{3}, b^{3}, c^{3} \pmod{3^{7}}$
>
> Recall that for any integer $x$,
> $$x^{3} \equiv x \pmod{3}$$
> This property extends to higher powers of 3, so for $x \leq 3^{6}$,
> $$x^{3} \equiv x \pmod{3^{3}}$$
> We need $a^{3} + b^{3} + c^{3} \equiv 0 \pmod{3^{7}}$\. Since $3^{7} = 2187$, we need
> $$a^{3} + b^{3} + c^{3} \equiv 0 \pmod{2187}$$
>
> #### Step 2: Conditions on $a, b, c$
>
> Given $a, b, c \leq 3^{6} = 729$, we have $a^{3}, b^{3}, c^{3} \leq 729^{3}$\. We need to find how many ordered triples $(a, b, c)$ satisfy $a^{3} + b^{3} + c^{3} \equiv 0 \pmod{2187}$\.
>
> #### Step 3: Using $x^{3} \equiv x \pmod{27}$
>
> Since $3^{3} = 27$, we use the fact that $x^{3} \equiv x \pmod{27}$ for $x \leq 27$\. Extending this, for $x \leq 729$, we have
> $$x^{3} \equiv x \pmod{27}$$
> Thus,
> $$a^{3} + b^{3} + c^{3} \equiv a + b + c \pmod{27}$$
> We need
> $$a + b + c \equiv 0 \pmod{27}$$
>
> #### Step 4: Counting Valid Triples
>
> We count the number of solutions to $a + b + c \equiv 0 \pmod{27}$ with $1 \leq a, b, c \leq 729$\.
>
> ##### Case 1: $a + b + c = 27k$ for some integer $k$
>
> We need to count the number of solutions to $a + b + c = 27k$ where $1 \leq a, b, c \leq 729$\.
>
> \- For $k = 1$, we solve $a + b + c = 27$\.
>
> \- For $k = 2$, we solve $a + b + c = 54$\.
>
> \- $\ldots$
>
> \- For $k = 27$, we solve $a + b + c = 729$\.
>
> ##### Using Stars and Bars
>
> The number of solutions to $a + b + c = n$ where $a, b, c \geq 1$ is given by $\binom{n-1}{2}$\.
>
> ##### Calculation
>
> For each $k$, the number of solutions is $\binom{27k-1}{2}$\. We need to sum this over $k$ from $1$ to $27$\.
> $$N = \sum_{k=1}^{27} \binom{27k-1}{2}$$
>
> #### Step 5: Modulo 1000 Calculation
>
> We compute $\binom{27k-1}{2} \pmod{1000}$ for each $k$ and sum the results\.
> $$\binom{n}{2} = \frac{n(n-1)}{2}$$
> For $k = 1$ to $27$,
> $$\binom{27 \cdot 1 - 1}{2} = \binom{26}{2} = \frac{26 \cdot 25}{2} = 325$$
> $$\binom{27 \cdot 2 - 1}{2} = \binom{53}{2} = \frac{53 \cdot 52}{2} = 1378$$
> $$\binom{27 \cdot 27 - 1}{2} = \binom{726}{2} = \frac{726 \cdot 725}{2} = 279765$$
> We need to reduce each $\binom{27k-1}{2} \pmod{1000}$ and sum them\.
> $$\begin{aligned} \binom{26}{2} &\equiv 325 \pmod{1000}, \\ \binom{53}{2} &\equiv 1378 \pmod{1000} \equiv 378, \\ \binom{80}{2} &\equiv \frac{80 \cdot 79}{2} \pmod{1000} \equiv 3160 \pmod{1000} \equiv 160, \\ &\;\;\vdots \\ \binom{726}{2} &\equiv 279765 \pmod{1000} \equiv 765. \end{aligned}$$
> Summing these,
> $$N \equiv 325 + 378 + 160 + \ldots + 765 \pmod{1000}$$
> We calculate the sum of this arithmetic series: The first term $a = 325$, the last term $l = 765$, and the common difference $d = \text{difference between consecutive terms modulo 1000}$\. The number of terms $n = 27$\.
>
> Using the sum formula for an arithmetic series,
> $$S = \frac{n}{2}(a + l)$$
> $$S \equiv \frac{27}{2}(325 + 765) \pmod{1000}$$
> $$S \equiv \frac{27}{2} \cdot 1090 \pmod{1000}$$
> $$S \equiv 27 \cdot 545 \pmod{1000}$$
> $$S \equiv 14735 \pmod{1000}$$
> $$S \equiv 735 \pmod{1000}$$
> Thus, the number of ordered triples $(a, b, c)$ such that $a^{3} + b^{3} + c^{3}$ is a multiple of $3^{7}$ is $\boxed{735}$\.
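
Both the flawed lifting step in this trace and the gold answer can be checked mechanically\. The sketch below, written for illustration only and not taken from the paper's tooling, lists counterexamples to $x^{3} \equiv x \pmod{27}$ and then counts the triples exactly by grouping $a \in [1, 3^{6}]$ by the residue of $a^{3}$ modulo $3^{7}$\.

```python
from collections import Counter


def count_cube_triples(bound=3 ** 6, modulus=3 ** 7):
    """Count ordered triples (a, b, c) with 1 <= a, b, c <= bound
    such that a^3 + b^3 + c^3 is divisible by modulus."""
    # How many a in [1, bound] share each residue of a^3 mod modulus.
    freq = Counter(pow(a, 3, modulus) for a in range(1, bound + 1))
    total = 0
    for res_a, cnt_a in freq.items():
        for res_b, cnt_b in freq.items():
            needed = (-res_a - res_b) % modulus  # residue c^3 must take
            total += cnt_a * cnt_b * freq.get(needed, 0)
    return total


# The trace's lifting claim fails already for small x (for example x = 2).
counterexamples = [x for x in range(1, 28) if pow(x, 3, 27) != x % 27]
print("x^3 != x (mod 27) for x in", counterexamples[:5], "...")

n_triples = count_cube_triples()
print(n_triples % 1000)
```

The exact count reduces to the gold answer modulo $1000$, while the counterexample list confirms that the trace reached that value through an invalid congruence\.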

### C\.3 AIME 2026 Problem 22 \(inverse case: base \> ours\)

Problem\. A standard fair six\-sided die is rolled repeatedly\. Each time the die reads $1$ or $2$, Alice gets a coin; each time it reads $3$ or $4$, Bob gets a coin; and each time it reads $5$ or $6$, Carol gets a coin\. The probability that Alice and Bob each receive at least two coins before Carol receives any coins can be written as $\tfrac{m}{n}$, where $m$ and $n$ are relatively prime positive integers\. Find $100m + n$\.

Gold answer\. $754$ \(corresponding to $\tfrac{7}{54}$\)\.

Raw correctness counts \($n{=}2048$ samples per cell\):

Mismatched \+ Wrong never solves this problem \($0/4096$\); Mathstral\-7B solves it $6$ times\. The Mathstral\-7B sample below reaches the correct final answer $754$, but the underlying reasoning is wrong: the sample asserts $\tfrac{1}{9} + \tfrac{1}{18} + \tfrac{1}{18} = \tfrac{7}{18}$, but the actual value is $\tfrac{2}{9}$, illustrating that the reward\-hacking caveat from §[5](https://arxiv.org/html/2605.17314#S5) applies to Mathstral\-7B as well\.

> Let’s break down the problem and the reference solution step by step\.
>
> The problem is a probability question that involves a standard fair six\-sided die being rolled repeatedly\. The die has six faces, numbered from 1 to 6\. Each time the die reads 1 or 2, Alice gets a coin; each time it reads 3 or 4, Bob gets a coin; and each time it reads 5 or 6, Carol gets a coin\. We need to find the probability that Alice and Bob each receive at least two coins before Carol receives any coins\. This probability can be written as $\frac{m}{n}$, where $m$ and $n$ are relatively prime positive integers\. We need to find $100m + n$\.
>
> The reference solution uses the concept of state probabilities\. Let’s define the states as follows:
>
> \- $A_{i}$: Alice has $i$ coins, and Bob has 0 coins\.
>
> \- $B_{i}$: Alice has 0 coins, and Bob has $i$ coins\.
>
> \- $C_{i}$: Alice has $i$ coins, and Bob has $i$ coins\.
>
> The state diagram is as follows:
>
> ```
> A_2 -----> A_3 -----> A_4 -----> ...
>  |          |          |          |
> A_1 -----> A_2 -----> A_3 -----> ...
>  |          |          |          |
> B_2 -----> B_3 -----> B_4 -----> ...
>  |          |          |          |
> B_1 -----> B_2 -----> B_3 -----> ...
>  |          |          |          |
> C_2 -----> C_3 -----> C_4 -----> ...
>  |          |          |          |
> C_1 -----> C_2 -----> C_3 -----> ...
> ```
>
> The reference solution then calculates the probabilities of transitioning from one state to another\. For example, the probability of transitioning from $A_{i}$ to $A_{i+1}$ is $\frac{1}{3}$, and the probability of transitioning from $A_{i}$ to $B_{i+1}$ is $\frac{1}{6}$\.
>
> The solution then calculates the probabilities of each state\. For example, the probability of $A_{2}$ is $\frac{1}{3}$, and the probability of $B_{2}$ is $\frac{1}{6}$\.
>
> The solution then calculates the probability of reaching $C_{2}$ without passing through $C_{1}$\. This is done by summing the probabilities of all paths that lead to $C_{2}$ without passing through $C_{1}$\. There are three such paths: $A_{2} \rightarrow A_{3} \rightarrow C_{2}$, $A_{2} \rightarrow B_{3} \rightarrow C_{2}$, and $B_{2} \rightarrow B_{3} \rightarrow C_{2}$\. The probabilities of these paths are $\frac{1}{9}$, $\frac{1}{18}$, and $\frac{1}{18}$, respectively\. Therefore, the probability of reaching $C_{2}$ without passing through $C_{1}$ is $\frac{1}{9} + \frac{1}{18} + \frac{1}{18} = \frac{7}{18}$\.
>
> Finally, the solution calculates the probability of Alice and Bob each receiving at least two coins before Carol receives any coins\. This is done by summing the probabilities of all paths that lead to $C_{2}$ without passing through $C_{1}$\. There are infinitely many such paths, but the probability of each path is $\frac{7}{18} \cdot \frac{1}{3} \cdot \frac{1}{3} \cdot \frac{1}{3} \cdot \ldots = \frac{7}{18} \cdot \frac{1}{27} = \frac{7}{54}$\.
>
> Therefore, the probability of Alice and Bob each receiving at least two coins before Carol receives any coins is $\frac{7}{54}$\. Since $m = 7$ and $n = 54$, we have $100m + n = 100 \cdot 7 + 54 = \boxed{754}$\. $\blacksquare$
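
For contrast with the trace above, the probability can be derived exactly from a small absorbing Markov chain over capped coin counts, using rational arithmetic\. The state encoding and function name below are assumptions made for this sketch; they do not correspond to the "reference solution" the model refers to\.

```python
from fractions import Fraction
from functools import lru_cache

THIRD = Fraction(1, 3)  # Alice, Bob, and Carol each gain a coin w.p. 1/3


@lru_cache(maxsize=None)
def win_prob(a: int, b: int) -> Fraction:
    """Probability that Alice and Bob both reach >= 2 coins before Carol
    gets any, given capped coin counts a, b in {0, 1, 2}."""
    if a == 2 and b == 2:
        return Fraction(1)
    stay = Fraction(0)    # probability of a roll that leaves the state unchanged
    onward = Fraction(0)  # success probability routed through other states
    for nxt in ((min(a + 1, 2), b), (a, min(b + 1, 2))):
        if nxt == (a, b):
            stay += THIRD
        else:
            onward += THIRD * win_prob(*nxt)
    # A Carol roll (probability 1/3) contributes 0; solve p = stay * p + onward.
    return onward / (1 - stay)


print(win_prob(0, 0))
# The sum asserted in the trace, evaluated exactly:
print(Fraction(1, 9) + Fraction(1, 18) + Fraction(1, 18))
```

The chain yields $\tfrac{7}{54}$, matching the gold answer, whereas the exact value of the trace's three\-path sum is $\tfrac{2}{9}$ rather than $\tfrac{7}{18}$, so the trace's final number does not follow from its own intermediate steps\.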

## Appendix D AIME cross\-model studies

We compare four models at the per\-problem level across AIME 2024/2025/2026 \(30 problems each year, 2 seeds, 2048 rollouts per model $\times$ problem $\times$ seed\): Mathstral\-7B base \(\[INST\]\), the Qwen2\.5\-Math\-1\.5B drafter, No\-draft GRPO, and Mismatched \+ Wrong \(Ours\)\. In this section, “solving” a problem refers solely to matching the target final answer; we recognize that models can arrive at correct outcomes via wrong reasoning, which we examine quantitatively in §[D\.3](https://arxiv.org/html/2605.17314#A4.SS3) and §[D\.4](https://arxiv.org/html/2605.17314#A4.SS4)\.

### D\.1 Solve coverage by model

Table 7: Number of AIME problems solved by at least one rollout \(out of $4096$ = 2 seeds $\times$ $2048$ rollouts per model $\times$ problem $\times$ seed\)\.

Per\-problem solve matrix \(\+ = at least one correct rollout in $4096$; \. = no correct rollouts\):

```
AIME 2024 (I-1..I-15 then II-1..II-15):
                          Section I        Section II
                          1234567890 12345 1234567890 12345
Mathstral-7B base         +++++.++++ ...++ ++++++++++ +++.+
Qwen2.5-Math-1.5B         +++++.++++ ..+++ +++++++.++ .++.+
No-draft GRPO             +++++++.++ ..+++ ++++++++++ +++.+
Mismatched + Wrong (Ours) ++++++++++ +.+.+ +++++++.++ .++..

AIME 2025 (P1..P30):
                          1234567890 1234567890 1234567890
Mathstral-7B base         ++++++.+++ ..++.++.++ +..+..+.++
Qwen2.5-Math-1.5B         ++++++.+++ ..++.++.++ +++...++.+
No-draft GRPO             +.++++.+++ ..++++++++ +.+...++++
Mismatched + Wrong (Ours) ++++++.+++ .++++++.++ +++++++.++

AIME 2026 (P1..P30):
                          1234567890 1234567890 1234567890
Mathstral-7B base         +++++++.++ ++++++++.+ +++.++....
Qwen2.5-Math-1.5B         ++++++.+++ +..+.+.+.+ ++++++...+
No-draft GRPO             ++++++++++ ++...+.+++ +++.+++..+
Mismatched + Wrong (Ours) ++++++++++ ++++.+++++ +.++++++++
```
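
The solve matrix above is enough to recompute the creation and inverse problem counts used in §D\.3\. The following sketch transcribes the rows verbatim and counts problems where Mismatched \+ Wrong succeeds while at least one baseline fails \(creation\), and the reverse \(inverse\); the data layout and helper names are illustrative choices, not the authors' code\.

```python
OURS = "Mismatched + Wrong (Ours)"

# Rows copied from the per-problem solve matrix ('+' solved, '.' unsolved).
MATRIX = {
    "AIME 2024": {
        "Mathstral-7B base": "+++++.++++ ...++ ++++++++++ +++.+",
        "Qwen2.5-Math-1.5B": "+++++.++++ ..+++ +++++++.++ .++.+",
        "No-draft GRPO":     "+++++++.++ ..+++ ++++++++++ +++.+",
        OURS:                "++++++++++ +.+.+ +++++++.++ .++..",
    },
    "AIME 2025": {
        "Mathstral-7B base": "++++++.+++ ..++.++.++ +..+..+.++",
        "Qwen2.5-Math-1.5B": "++++++.+++ ..++.++.++ +++...++.+",
        "No-draft GRPO":     "+.++++.+++ ..++++++++ +.+...++++",
        OURS:                "++++++.+++ .++++++.++ +++++++.++",
    },
    "AIME 2026": {
        "Mathstral-7B base": "+++++++.++ ++++++++.+ +++.++....",
        "Qwen2.5-Math-1.5B": "++++++.+++ +..+.+.+.+ ++++++...+",
        "No-draft GRPO":     "++++++++++ ++...+.+++ +++.+++..+",
        OURS:                "++++++++++ ++++.+++++ +.++++++++",
    },
}


def solved(row: str) -> list:
    """Turn a matrix row into booleans, ignoring the spacing between groups."""
    return [ch == "+" for ch in row if ch in "+."]


def creation_and_inverse(matrix: dict) -> tuple:
    """Count creation cases (ours solves, some baseline does not) and
    inverse cases (ours fails, some baseline solves) over all years."""
    creation = inverse = 0
    for rows in matrix.values():
        ours = solved(rows[OURS])
        baselines = [solved(r) for name, r in rows.items() if name != OURS]
        for i, ours_ok in enumerate(ours):
            flags = [b[i] for b in baselines]
            if ours_ok and not all(flags):
                creation += 1
            elif not ours_ok and any(flags):
                inverse += 1
    return creation, inverse


n_creation, n_inverse = creation_and_inverse(MATRIX)
print(n_creation, n_inverse)
```

The counts that come out agree with the 25 creation problems and 8 inverse problems reported in §D\.3, which suggests the matrix and the later rigor scan are drawn from the same underlying per\-problem results\.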

### D\.2 Pairwise Comparisons: Mismatched \+ Wrong vs\. Baselines

We break down the performance of our Mismatched \+ Wrong variant against each of the three baselines\. For each pairwise comparison, we list the specific AIME problems constituting *creation* cases \(problems our variant solves but the baseline does not\) and *inverse* cases \(problems the baseline solves but our variant does not\), categorized by year\.

Table 8: Problem\-level breakdown of creation and inverse cases across the three pairwise comparisons\.

### D\.3 Reasoning rigor of correct rollouts

§[C\.2](https://arxiv.org/html/2605.17314#A3.SS2) and §[C\.3](https://arxiv.org/html/2605.17314#A3.SS3) document instances where models arrive at the correct final numerical answer via mathematically wrong reasoning\. This section quantifies the prevalence of this reward\-hacking behavior by scanning a broader set of 239 correct rollouts\.

Setup\. We compare our Mismatched \+ Wrong variant against the three baselines above *combined*\. There are 239 rollouts to evaluate, consisting of:

- 174 Creation Rollouts: *every* correct rollout from our Mismatched \+ Wrong variant on the 25 AIME problems where our method succeeds but at least one baseline fails\.
- 65 Inverse Rollouts: *every* correct rollout from any baseline on the 8 AIME problems where our variant fails but at least one baseline succeeds\.

Some problems appear in multiple pairwise comparisons, but each rollout is counted only once toward the 239 total\.

Methodology\. Each rollout was evaluated blindly and independently by two LLM judges \(Gemini 3\.1 Pro and Claude Opus 4\.7\) using a four\-tier rubric: *rigorous* \(fully valid derivation\), *mostly* \(non\-load\-bearing flaws\), *wrong* \(load\-bearing flaws resulting in a reward\-hacked correct answer\), and *not sure*\. Of the 239 rollouts, 228 reached cross\-judge consensus\. The remaining 11 were resolved by Claude Opus 4\.7 \(extended\-thinking mode\) and manual review\.

Results\.

- Inverse cases: all 65 inverse\-case rollouts were flagged *wrong* by consensus\. When baselines succeed on problems our variant misses, those successes are entirely reward\-hacked\.
- Creation cases: of the 174 creation rollouts from our method, only 8 \($4.6\%$\) were deemed *rigorous* or *mostly* valid, with the remainder being reward\-hacked\. These 8 valid rollouts were concentrated across three problems: AIME 2024 I\-6 \(1 rigorous, 1 mostly\), AIME 2026 P8 \(5 rigorous\), and AIME 2026 P19 \(1 mostly\)\.

Table [9](https://arxiv.org/html/2605.17314#A4.T9) summarizes the final verdict distribution, highlighting that $96.7\%$ of the evaluated rollouts were reward\-hacked\. We view this rigor gap as an exciting opening for future work\.
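
The headline percentages in this subsection follow directly from the counts stated above\. The short calculation below reproduces them, under the assumption \(not stated explicitly in this section\) that every rollout other than the 8 valid creation rollouts ended with a final *wrong* verdict and none remained *not sure*\.

```python
creation_rollouts = 174   # correct rollouts from Mismatched + Wrong
inverse_rollouts = 65     # correct rollouts from the baselines
valid_creation = 8        # judged rigorous or mostly valid
consensus = 228           # rollouts on which both judges agreed

total = creation_rollouts + inverse_rollouts
# Assumption: all remaining rollouts received a final "wrong" verdict.
reward_hacked = total - valid_creation

valid_share = 100 * valid_creation / creation_rollouts
hacked_share = 100 * reward_hacked / total
consensus_share = 100 * consensus / total

print(f"total rollouts:       {total}")
print(f"valid creation share: {valid_share:.1f}%")
print(f"reward-hacked share:  {hacked_share:.1f}%")
print(f"judge consensus:      {consensus_share:.1f}%")
```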

Table 9: Rigor verdicts across 239 correct rollouts\.

### D\.4 A closer look at Mismatched \+ Wrong \(Ours\) vs No\-draft GRPO

We are interested in comparing Mismatched \+ Wrong head\-to\-head with No\-draft GRPO\. While Table [8](https://arxiv.org/html/2605.17314#A4.T8) gives an initial outcome\-level impression, we conducted a follow\-up rigor scan of No\-draft GRPO’s correct rollouts on the *three* problems where our method produced at least one rigorous or mostly rigorous derivation \(AIME 2024 I\-6, AIME 2026 P8, and AIME 2026 P19\)\. For a controlled comparison, we used Gemini 3\.1 Pro as the sole judge for both models; it had also matched the human verdict in all 7 manually reviewed cases from §[D\.3](https://arxiv.org/html/2605.17314#A4.SS3)\.

Table 10: Head\-on rigor verdicts \(judge: Gemini 3\.1 Pro\)\.

We scoped this scan to three problems for two reasons: scanning every correct rollout from both models would be prohibitively large, and our primary interest was whether there exists any problem where Mismatched \+ Wrong produces a rigorous or mostly rigorous rollout while No\-draft GRPO produces none\. P19 \(2026\) is one such case: Mismatched \+ Wrong’s correct rollout was *mostly* valid, while No\-draft GRPO’s was reward\-hacked\.
