Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning
Summary
Introduces Strategy-Guided Policy Optimization (SGPO) for LLM reasoning, which replaces trajectory imitation with strategy distillation, improving generalization on math benchmarks.
# Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning
Source: [https://arxiv.org/html/2606.24064](https://arxiv.org/html/2606.24064)
Tianyuan Shi¹, Canbin Huang¹, Bei Li², Xin Chen², Xiaojun Quan³, Jingang Wang², Qifan Wang⁴
¹School of Computer Science and Engineering, Sun Yat-sen University, China; ²Meituan, Inc., China; ³Shenzhen Loop Area Institute, China; ⁴Meta AI, USA
{shity6,huangcb3}@mail2.sysu.edu.cn, xiaojunquan@slai.edu.cn, {libei17,chenxin148,wangjingang02}@meituan.com, wqfcr@fb.com
###### Abstract
Distilling reasoning capabilities from strong to weak language models typically involves imitating specific solution trajectories, effectively transferring *what to answer* rather than *how to reason*. This trajectory-level imitation encourages memorization of instance-specific steps rather than acquisition of transferable problem-solving skills, limiting generalization to novel problems. We propose Strategy-Guided Policy Optimization (SGPO), which replaces instance-level trajectory imitation with reusable *strategy distillation*. SGPO extracts structured strategy descriptions from strong-model responses and, for each problem, constructs both autonomous and strategy-guided trajectories to enable direct comparison of the model's behavior with and without strategic guidance. The framework then addresses two key questions. For *how* to distill, a token-level forward-KL objective selectively transfers the distributional shift induced by strategy conditioning into the unguided policy, with proximal constraints ensuring stability. For *when* to distill, adaptive instance-level weighting strengthens guidance when autonomous exploration falls short and reduces it as the model's own competence grows. Experiments on four mathematical benchmarks across two model families show that SGPO consistently outperforms SFT, on-policy RL, and hybrid-policy baselines, improving the average score by 2.2 points over the strongest baseline on Qwen2.5-7B-Instruct. Analysis reveals that the forward-KL objective provides an inherently selective distillation signal that outperforms direct trajectory imitation, and that strategy distillation exhibits complementary scaling with base model capability.
## 1 Introduction
Large language models \(LLMs\) have demonstrated remarkable reasoning capabilities\(Daya et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib4); OpenAI et al\.,[2024](https://arxiv.org/html/2606.24064#bib.bib15)\), motivating a range of methods to transfer these capabilities to smaller models\. Whether through supervised fine\-tuning on expert traces\(Daya et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib4); Ye et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib23); Wu et al\.,[2026](https://arxiv.org/html/2606.24064#bib.bib20); Zhu et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib27)\)or hybrid objectives that integrate expert demonstrations into policy optimization\(Yan et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib21); Lv et al\.,[2026](https://arxiv.org/html/2606.24064#bib.bib12); Fu et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib6)\), existing approaches operate on the same unit of knowledge transfer:*instance\-level solution trajectories*\. The student is trained to reproduce what the expert wrote, that is, specific sequences of reasoning steps for specific problems, but is never taught the reusable problem\-solving strategy that explains*why*those steps were chosen\. This trajectory\-level imitation encourages memorization of instance\-specific patterns rather than acquisition of transferable skills, limiting generalization to novel problems\. A natural question arises:*Can we shift the distillation target from specific solutions to the reusable problem\-solving strategies that make those solutions effective?*
We propose Strategy-Guided Policy Optimization (SGPO), a framework designed to accomplish this shift. SGPO extracts structured *strategy descriptions* from strong-model responses. Each description specifies the problem type, the solving approach, and the general procedural steps, without carrying out computations or revealing the answer. Rather than asking the student to reproduce what the expert generates, SGPO uses these descriptions to reshape the student's own reasoning distribution, distilling *how to reason* rather than *what to answer*.
A central design principle is that strategic knowledge should be *internalized* into the model's unguided policy rather than remain an external dependency unavailable at inference time. To this end, SGPO constructs both autonomous and strategy-guided trajectories for each problem (§[3.1](https://arxiv.org/html/2606.24064#S3.SS1)), enabling direct comparison of the model's behavior with and without strategic guidance. This dual construction serves as the foundation for addressing two complementary questions: (1) How to distill. A token-level forward-KL objective measures the divergence between guided and unguided next-token distributions and selectively distills strategy-critical information back into the unguided policy, while proximal constraints at both trajectory and token levels ensure stability (§[3.3](https://arxiv.org/html/2606.24064#S3.SS3), §[3.4](https://arxiv.org/html/2606.24064#S3.SS4)). (2) When to distill. Adaptive per-instance weighting modulates distillation strength based on the marginal benefit of strategy guidance, strengthening distillation when autonomous exploration falls short and reducing it as the model's own competence grows (§[3.4](https://arxiv.org/html/2606.24064#S3.SS4)). This naturally transitions training from strategy-driven rapid improvement in early stages to autonomous-policy-dominated steady optimization in later stages, without manual scheduling.
Crucially, SGPO never imitates any trajectory, whether autonomous or guided. Instead, it distills the *distributional shift* caused by strategy conditioning, operating at the level of token-level probability changes rather than sequence matching. This enables selective transfer of strategic knowledge while preserving the reasoning diversity the model has already acquired through autonomous exploration.
Experiments on four mathematical benchmarks across two model families \(Qwen2\.5 and Llama\-3\.2\) show that SGPO consistently outperforms strong baselines including SFT, on\-policy RL, and state\-of\-the\-art hybrid\-policy methods, improving the average score by 2\.2 points over the strongest baseline on Qwen2\.5\-7B\-Instruct\. Analysis reveals two findings\. First, the forward\-KL objective provides an inherently selective distillation signal: without any token\-level annotation, optimization pressure concentrates on tokens whose probability shifts most under strategy conditioning, which empirically correspond to strategy\-critical decision points rather than routine linguistic tokens\. Second, strategy distillation exhibits complementary scaling with base model capability: as foundational reasoning competence grows, the ability to internalize strategic guidance increases at a faster rate, suggesting that a minimum reasoning competence is needed to benefit from strategy\-level transfer\.
## 2 Related Work
### 2.1 Supervised Fine-Tuning for LLM Reasoning
Training a weaker model on expert reasoning traces is the most common form of reasoning distillation\(Daya et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib4); Ye et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib23)\)\. While simple and effective, this trajectory\-level distillation is sensitive to data quality\(Ye et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib23)\), prone to memorization\(Chu et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib3)\), and vulnerable to exposure bias\(Wu et al\.,[2026](https://arxiv.org/html/2606.24064#bib.bib20)\)\. Recent work mitigates these issues from RL perspectives along two directions\. The first introduces proximal constraints:Wu et al\. \([2026](https://arxiv.org/html/2606.24064#bib.bib20)\)downweight the loss on well\-learned tokens using the model’s own predictions, while PSFT\(Zhu et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib27)\)applies proximal clipping to bound update magnitudes\. The second recasts SFT through a reward\-optimization lens; for example, IW\-SFT\(Qin & Springenberg,[2025](https://arxiv.org/html/2606.24064#bib.bib16)\)tightens the RL lower bound via importance weighting on policy probability ratios\.
These methods improve *how* the teacher's output is transferred but do not change *what* is transferred: the student still imitates specific solutions. In contrast, our work operates at a higher level of abstraction, distilling reusable *strategies* rather than concrete trajectories.
### 2.2 Hybrid Policy Optimization for LLM Reasoning
Incorporating expert demonstrations into policy optimization is a long\-standing theme in RL\(Rajeswaran et al\.,[2018](https://arxiv.org/html/2606.24064#bib.bib17); Nair et al\.,[2018](https://arxiv.org/html/2606.24064#bib.bib14)\)\. Recent LLM methods follow two directions\.Unified\-loss methodsmix expert data into the RL objective: LUFFY\(Yan et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib21)\)adds off\-policy expert trajectories to GRPO rollout groups with regularized importance sampling; AMPO\(Yuan et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib25)\)replaces incorrect on\-policy samples with diverse off\-policy alternatives; other approaches interleave RL and SFT updates\(Ma et al\.,[2026](https://arxiv.org/html/2606.24064#bib.bib13); Chen et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib2)\); and SRFT\(Fu et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib6)\)unifies these ideas in a single\-stage framework with sample\-level modulation\.Prefix\-guided methodsuse expert trajectories to structure generation: UFT\(Liu et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib11)\)progressively masks expert suffixes to encourage autonomy, while BREAD\(Zhang et al\.,[2025](https://arxiv.org/html/2606.24064#bib.bib26)\)branches rollouts from intermediate expert steps\.
All these methods transfer knowledge at the level of specific solution steps. Our approach differs in two respects: (i) it conditions on strategy descriptions that specify problem-solving direction without prescribing concrete computations, and (ii) rather than imitating the teacher's output, it distills the *distributional shift* that strategy conditioning induces in the student's own policy, making the transfer inherently adapted to the student's current capability while removing dependence on external hints at inference time.
## 3 Method
### 3.1 Problem Formulation and Overview
Let $\pi_\theta(\cdot \mid q)$ denote the policy of the target model for a reasoning problem $q$. For each problem, we assume access to a *strategy description* $s$ extracted from a strong-model response. A strategy description is a concise natural-language summary that specifies the problem type, the solving approach, and the general procedural steps, without carrying out intermediate computations or revealing the final answer. It encodes actionable strategic information while omitting solution-specific details, occupying a middle ground between a generic hint and a complete solution. Details of the extraction pipeline and prompt templates are provided in Appendix [B](https://arxiv.org/html/2606.24064#A2).
For each training problem $q$, we construct two trajectory groups: (1) Autonomous group $\{o_i\}_{i=1}^{G_1}$, sampled from $\pi_\theta(\cdot \mid q)$; and (2) Strategy-guided group $\{\tilde{o}_j\}_{j=1}^{G_2}$, sampled from $\pi_\theta(\cdot \mid q, s)$, where $s$ is prepended to the prompt. The central challenge is to convert the information gained under the guided condition into reusable reasoning ability under the original unguided condition. The remainder of this section describes three mechanisms that jointly address this challenge: autonomous GRPO optimization (§[3.2](https://arxiv.org/html/2606.24064#S3.SS2)), token-level forward-KL distillation (§[3.3](https://arxiv.org/html/2606.24064#S3.SS3)), and proximal constraints with adaptive weighting (§[3.4](https://arxiv.org/html/2606.24064#S3.SS4)).
Figure 1: Overview of the SGPO framework. For each problem, we jointly construct an autonomous group and a strategy-guided group. The autonomous group is optimized with GRPO. Correct trajectories from the strategy-guided group provide proximal KL distillation signals to the unguided policy. An adaptive weight $\alpha(q)$ controls distillation strength.
### 3.2 Autonomous Exploration with GRPO
The autonomous trajectory group is optimized with Group Relative Policy Optimization (GRPO; Shao et al. [2024](https://arxiv.org/html/2606.24064#bib.bib18)). For each problem $q$, the $G_1$ sampled responses are scored by a verifiable reward function $R(o_i, q) \in \{0, 1\}$. Rewards within each group are normalized to zero-mean, unit-variance advantages $\hat{A}_i$, and the policy is updated via the clipped objective:

$$\mathcal{L}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{q,\,\{o_i\}}\Bigg[\frac{1}{G_1}\sum_{i=1}^{G_1}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\Big(\rho_{i,t}\,\hat{A}_i,\;\mathrm{clip}\big(\rho_{i,t},1-\varepsilon,1+\varepsilon\big)\hat{A}_i\Big)\Bigg], \tag{1}$$

where $\rho_{i,t}=\pi_\theta(o_{i,t}\mid q,o_{i,<t})\,/\,\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})$ is the token-level importance ratio. This objective ensures continued improvement on the model's own distribution independently of strategy guidance.
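To make Eq. (1) concrete, the following is a minimal pure-Python sketch of the group-normalized advantages and the clipped surrogate for a single group. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of the population standard deviation, and the small `eps` guard for groups with identical rewards are all assumptions introduced here.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group to zero mean and unit variance.

    eps is an assumed guard against division by zero when all rewards match.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped objective of Eq. (1) for one group of responses.

    logp_new / logp_old: one list of per-token log-probabilities per response,
    under the current and the old policy respectively.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        per_token = []
        for a, b in zip(lp_new, lp_old):
            rho = math.exp(a - b)  # token-level importance ratio
            rho_clipped = min(max(rho, 1.0 - clip_eps), 1.0 + clip_eps)
            per_token.append(min(rho * adv, rho_clipped * adv))
        total += sum(per_token) / len(per_token)  # length-normalized per response
    return total / len(logp_new)  # average over the group
```

With binary rewards, a group in which every response is correct (or every response is wrong) yields all-zero advantages, so such a group contributes no GRPO signal in this sketch.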
### 3.3 Strategy-Guided Distillation
Rather than directly applying SFT to successful guided trajectories, which would collapse back into trajectory imitation and fail to separate strategic knowledge from surface realization, we construct a distillation signal that captures how strategy conditioning shifts the model's own token-level distribution.
Concretely, let $\mathcal{C}=\{j : R(\tilde{o}_j, q)=1\}$ be the set of correct guided trajectories. We define the token-level KL distillation objective over $\mathcal{C}$ as:

$$\mathcal{L}_{\mathrm{KD}}(\theta)=\frac{1}{|\mathcal{C}|}\sum_{j\in\mathcal{C}}\frac{1}{|\tilde{o}_j|}\sum_{t=1}^{|\tilde{o}_j|}d_{j,t}(\theta), \tag{2}$$

where

$$d_{j,t}(\theta)=D_{\mathrm{KL}}\left(\operatorname{sg}\left[\pi_\theta(\cdot\mid q,s,\tilde{o}_{j,<t})\right]\;\middle\|\;\pi_\theta(\cdot\mid q,\tilde{o}_{j,<t})\right). \tag{3}$$
Figure 2: KL direction. (a) The unguided policy covers multiple strategies; the guided distribution reflects one. (b) Reverse KL collapses onto the guided mode. (c) Forward KL absorbs the guided strategy while preserving alternatives.

*First*, we adopt the forward KL direction (guided as reference). The strategy-guided distribution concentrates on a particular strategy, whereas the unguided policy may cover multiple effective strategies discovered through autonomous exploration (Figure [2](https://arxiv.org/html/2606.24064#S3.F2)). Reverse KL would actively penalize mass outside the guided mode, collapsing the policy onto a single strategy. Forward KL only encourages coverage of the guided mode without suppressing alternatives, and together with the GRPO objective that continuously reinforces independently discovered strategies, preserves existing reasoning diversity while absorbing new strategic insight (empirical comparison in Appendix [F](https://arxiv.org/html/2606.24064#A6)). *Second*, the stop-gradient $\operatorname{sg}[\cdot]$ restricts gradients to the unguided distribution. Both distributions in Eq. ([3](https://arxiv.org/html/2606.24064#S3.E3)) share the same output prefix $\tilde{o}_{j,<t}$ but differ in prompt context: $(q, s)$ for the guided side vs. $q$ alone for the unguided side. Without the stop-gradient, the shared $\theta$ could trivially reduce the KL by shifting the guided distribution toward the unguided one.
The resulting objective has a natural selectivity property: at each token position, the KL value is large when the next\-token distribution shifts substantially under strategy conditioning \(e\.g\., choosing to apply the quadratic formula vs\. factoring\), and small when the token mainly reflects generic linguistic realization \(e\.g\., formatting\)\. \(§[3\.3](https://arxiv.org/html/2606.24064#S3.SS3)\) Distillation pressure thus automatically concentrates on positions that encode strategic decisions, without requiring any explicit token\-level annotation\.
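The selectivity property can be illustrated with a toy calculation of the per-token forward KL in Eq. (3). The sketch below uses a hypothetical three-token vocabulary and made-up probabilities; the function name and the `eps` guard are assumptions. In actual training the guided distribution would be held constant by the stop-gradient, which has no counterpart in this gradient-free example.

```python
import math

def forward_kl(p_guided, q_unguided, eps=1e-12):
    """KL(p_guided || q_unguided) for one next-token distribution.

    p_guided plays the role of the (stop-gradient) reference in Eq. (3).
    eps is an assumed guard against log of zero.
    """
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(p_guided, q_unguided)
        if p > 0.0
    )

# Hypothetical position A: a strategic decision. Conditioning on the strategy
# moves most of the probability mass from token 0 to token 1.
kl_decision = forward_kl([0.15, 0.80, 0.05], [0.60, 0.30, 0.10])

# Hypothetical position B: routine formatting. The strategy barely changes
# the next-token distribution.
kl_routine = forward_kl([0.90, 0.05, 0.05], [0.88, 0.07, 0.05])

print(f"decision position: {kl_decision:.4f}, routine position: {kl_routine:.4f}")
```

In this toy setting the decision position carries a KL more than two orders of magnitude larger than the routine position, so an average over positions as in Eq. (2) is dominated by the former.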
### 3.4 Proximal Constraints and Adaptive Weighting
Directly minimizing the KL term in Eq\. \([2](https://arxiv.org/html/2606.24064#S3.E2)\) can be unstable when the guided and unguided distributions diverge substantially, risking entropy collapse \(§[5\.2](https://arxiv.org/html/2606.24064#S5.SS2)\)\. We introduce proximal constraints for training stability and an adaptive weighting mechanism for distillation efficiency, operating at three complementary levels\.
Trajectory level: reachable target selection. Rather than averaging over all correct guided trajectories as in Eq. ([2](https://arxiv.org/html/2606.24064#S3.E2)), we select the single trajectory most reachable by the current unguided policy:

$$o^{+}=\arg\max_{\tilde{o}_j\in\mathcal{C}}\sum_{t=1}^{|\tilde{o}_j|}\log\pi_\theta(\tilde{o}_{j,t}\mid q,\tilde{o}_{j,<t}). \tag{4}$$

Here we evaluate the guided trajectory under the *unguided* policy (without $s$ in the conditioning) to measure how easily the model could produce this trajectory on its own. This preserves correctness while minimizing the distributional gap that the distillation step must bridge. When $|\mathcal{C}|=0$, the distillation term is skipped entirely.
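The selection rule in Eq. (4) can be sketched as follows. This is a minimal illustration, assuming the per-token log-probabilities of each guided trajectory have already been re-scored under the unguided prompt; the function and argument names are hypothetical.

```python
def select_reachable_target(unguided_logprobs, rewards):
    """Pick the correct guided trajectory that the unguided policy finds most likely.

    unguided_logprobs[j]: per-token log pi_theta(o_t | q, o_<t) for guided
        trajectory j, scored WITHOUT the strategy in the prompt.
    rewards[j]: 1 if guided trajectory j is correct, else 0.
    Returns the index of o+, or None when no guided trajectory is correct
    (in which case the distillation term is skipped).
    """
    correct = [j for j, r in enumerate(rewards) if r == 1]
    if not correct:
        return None
    # Eq. (4) sums log-probabilities without length normalization.
    return max(correct, key=lambda j: sum(unguided_logprobs[j]))

# Usage with made-up scores: trajectory 0 is most likely but incorrect,
# so the choice is between trajectories 1 and 2.
scores = [[-0.1, -0.2], [-1.0, -1.0, -1.0], [-0.5, -0.4]]
print(select_reachable_target(scores, rewards=[0, 1, 1]))
```

Because the sum in Eq. (4) is not divided by trajectory length, this rule tends to favor shorter trajectories when per-token likelihoods are comparable.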
Token level: KL clipping. We mask token positions where the guided-unguided divergence exceeds a threshold $\delta$, yielding the clipped objective over the selected trajectory:

$$\mathcal{L}_{\mathrm{KD}}^{\mathrm{clip}}(\theta)=\frac{1}{|o^{+}|}\sum_{t=1}^{|o^{+}|}w_t\,d_t(\theta),\quad w_t=\mathbf{1}\left[d_t(\theta)\leq\delta\right], \tag{5}$$

where $d_t(\theta)$ denotes the per-token KL from Eq. ([3](https://arxiv.org/html/2606.24064#S3.E3)) evaluated on $o^{+}$. This prevents outlier positions from dominating optimization.
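A small sketch of the masking in Eq. (5) is given below, operating on precomputed per-token KL values. The function name is hypothetical, and the default threshold of 1.0 is used only as an illustrative value. Treating the mask as a constant with respect to the parameters is an assumption of this sketch; in a gradient-free example the distinction does not arise.

```python
def clipped_kd_loss(token_kls, delta=1.0):
    """Eq. (5): mean per-token KL over o+, with positions above delta masked out.

    The sum is divided by the full trajectory length |o+|, not by the number
    of positions that survive the mask, following the equation as written.
    """
    if not token_kls:
        return 0.0
    kept = [d if d <= delta else 0.0 for d in token_kls]
    return sum(kept) / len(token_kls)

# Made-up per-token KL values; the outlier 3.0 exceeds delta and is masked.
print(clipped_kd_loss([0.2, 3.0, 0.4, 0.0], delta=1.0))
```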
Instance level: adaptive weighting. Let $p_{\text{auto}}=\frac{1}{G_1}\sum_{i}R(o_i,q)$ and $p_{\text{guide}}=\frac{1}{G_2}\sum_{j}R(\tilde{o}_j,q)$ denote the autonomous and guided pass rates. We modulate the per-instance distillation strength as:

$$\alpha(q)=\mathrm{clip}\left(\alpha_0\cdot\frac{p_{\text{guide}}-p_{\text{auto}}+\gamma}{p_{\text{auto}}+\gamma},\;0,\;\alpha_{\max}\right), \tag{6}$$

where $\alpha_0$ is a base coefficient, $\gamma>0$ a smoothing constant, and $\alpha_{\max}$ a cap. The weight is large when strategy guidance substantially raises the pass rate and vanishes when the model already solves the problem autonomously. A comparison with alternative weighting strategies is given in Appendix [G](https://arxiv.org/html/2606.24064#A7).
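As a worked illustration of Eq. (6), the sketch below evaluates the weight for a few hypothetical pass-rate pairs. The function name is an assumption; the default values $\alpha_0=0.5$, $\gamma=0.1$, and $\alpha_{\max}=0.8$ are those reported in §4.4.

```python
def adaptive_weight(p_auto, p_guide, alpha0=0.5, gamma=0.1, alpha_max=0.8):
    """Per-instance distillation weight alpha(q) from Eq. (6)."""
    raw = alpha0 * (p_guide - p_auto + gamma) / (p_auto + gamma)
    return min(max(raw, 0.0), alpha_max)

# Hypothetical pass-rate pairs (p_auto, p_guide):
cases = {
    "guidance rescues an unsolved problem": (0.0, 0.5),   # raw weight 3.0, capped at alpha_max
    "guidance doubles the pass rate": (0.25, 0.5),
    "problem already solved autonomously": (1.0, 1.0),    # small residual from gamma
    "guidance hurts": (0.5, 0.25),                        # negative raw weight, floored at 0
}
for name, (p_auto, p_guide) in cases.items():
    print(f"{name}: alpha = {adaptive_weight(p_auto, p_guide):.3f}")
```

With these defaults, the smoothing constant leaves a small positive weight (about 0.045) even when both pass rates equal 1, so the weight becomes small rather than exactly zero in that case.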
### 3.5 Overall Objective
For each problem $q$, the training loss combines autonomous GRPO with proximal strategy distillation:

$$\mathcal{L}(\theta;q)=\mathcal{L}_{\mathrm{GRPO}}(\theta;q)\;+\;\alpha(q)\,\mathcal{L}_{\mathrm{KD}}^{\mathrm{clip}}(\theta;q), \tag{7}$$

and is averaged over the mini-batch. Conceptually, strategy extraction determines *what* to transfer; the token-level forward-KL objective identifies *where* to transfer by concentrating on strategy-critical positions; and the three-level proximal constraints control *how much* to transfer at the trajectory, token, and instance granularity. The full training algorithm is given in Appendix [D](https://arxiv.org/html/2606.24064#A4).
## 4 Experimental Setup
### 4.1 Models and Data
We conduct experiments on two model families: Qwen2.5-{1.5B, 7B}-Instruct (Yang et al., [2025](https://arxiv.org/html/2606.24064#bib.bib22)) and Llama-3.2-8B-Instruct (Dubey et al., [2024](https://arxiv.org/html/2606.24064#bib.bib5)), covering both different architectures and different scales. Training data consist of 8.5K problems randomly sampled from the LUFFY dataset (Yan et al., [2025](https://arxiv.org/html/2606.24064#bib.bib21)), with reference solutions generated by DeepSeek-R1. Each RL training instance requires $G_1+G_2=12$ sampled trajectories, yet stable gains are already observed at this moderate scale, indicating favorable data efficiency. Strategy descriptions are extracted from the corresponding reference solutions via the procedure described in Appendix [B](https://arxiv.org/html/2606.24064#A2). An analysis of strategy description quality is provided in Appendix [H](https://arxiv.org/html/2606.24064#A8).
### 4.2 Evaluation Benchmarks
We evaluate on four mathematical reasoning benchmarks of increasing difficulty: MATH500 (Hendrycks et al., [2021](https://arxiv.org/html/2606.24064#bib.bib8)), AMC 23, OlympiadBench (He et al., [2024](https://arxiv.org/html/2606.24064#bib.bib7)), and AIME 24. Because AMC 23 and AIME 24 are relatively small test sets, we report avg@32 (the accuracy averaged over 32 independent samples per problem) to reduce variance; for the larger MATH500 and OlympiadBench, we report avg@4. All evaluations use a maximum generation length of 32K tokens, temperature 0.6, and top-$p=0.95$.
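For clarity, the avg@$k$ metric described above can be computed as in the following sketch. The function name and the input layout (one list of 0/1 correctness flags per problem) are assumptions for illustration, and reporting the result as a percentage is likewise an assumed convention.

```python
def avg_at_k(correct_flags):
    """avg@k: per-problem accuracy over k samples, averaged over problems.

    correct_flags[i][j] is 1 if the j-th sample for problem i is correct.
    Returns a percentage.
    """
    per_problem = [sum(flags) / len(flags) for flags in correct_flags]
    return 100.0 * sum(per_problem) / len(per_problem)

# Made-up example with two problems and k = 4 samples each.
print(avg_at_k([[1, 0, 1, 1], [0, 0, 0, 1]]))
```

Averaging over many samples per problem matters most for small test sets, where a single sampled answer per problem would make the reported score highly sensitive to sampling noise.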
### 4.3 Baselines
We compare against four baselines. SFT performs supervised fine-tuning on expert reasoning traces only. SFT + GRPO follows SFT with standard on-policy GRPO. These two baselines test whether strategy distillation provides gains beyond conventional supervised and reinforcement learning. HPT (Lv et al., [2026](https://arxiv.org/html/2606.24064#bib.bib12)) combines on-policy RL with an auxiliary SFT loss on expert trajectories within a unified objective. LUFFY (Yan et al., [2025](https://arxiv.org/html/2606.24064#bib.bib21)) incorporates off-policy expert trajectories into GRPO rollout groups with regularized importance sampling. HPT and LUFFY are the most directly relevant comparisons, as they also leverage strong-model knowledge during policy optimization.
### 4.4 Implementation Details
We split the 8.5K training problems into roughly one third for SFT warm-up and the remainder for RL training. The SFT stage uses learning rate $1\times10^{-5}$ for 10 epochs to equip base models with basic long-chain reasoning ability. The RL stage uses learning rate $1\times10^{-6}$ for 10 epochs with total group size 12 ($G_1=8$ autonomous, $G_2=4$ strategy-guided). Key hyperparameters include GRPO clipping bounds $\varepsilon_{\text{low}}=0.2$ and $\varepsilon_{\text{high}}=0.28$ (Yu et al., [2025](https://arxiv.org/html/2606.24064#bib.bib24)), KL threshold $\delta=1.0$, base adaptive coefficient $\alpha_0=0.5$, smoothing constant $\gamma=0.1$, and maximum adaptive weight $\alpha_{\max}=0.8$. All baselines share the same SFT warm-up configuration and total sample budget to ensure fair comparison. Full details are provided in Appendix [E](https://arxiv.org/html/2606.24064#A5).
## 5 Results
### 5.1 Main Results
We summarize the main results in Table [1](https://arxiv.org/html/2606.24064#S5.T1) and discuss two key observations.
Consistent improvements over all baselines. SGPO achieves the best average performance across all three base models and both model families. Notably, HPT and LUFFY also leverage strong-model knowledge during RL, yet SGPO consistently outperforms them by transferring knowledge at the strategy level rather than the trajectory level. The mechanism underlying this advantage is examined in §[5.3.2](https://arxiv.org/html/2606.24064#S5.SS3.SSS2).
Complementary scaling with base model capability. The gains from SGPO scale with the strength of the base model: on Qwen2.5-7B-Instruct, the average score improves by 9.9 points over the base model (42.2 $\to$ 52.1), compared with 6.7 points for Qwen2.5-1.5B-Instruct (23.2 $\to$ 29.9). This suggests that stronger models are better able to convert strategy descriptions into effective guided exploration and internalize the resulting gains into the unguided policy.
Table 1: Main results on four mathematical reasoning benchmarks. Average is the arithmetic mean of the four benchmark scores.

Table 2: Ablation study on Qwen2.5-7B-Instruct.

| Setting | MATH500 | AMC23 | Olympiad | AIME24 | Average |
|---|---|---|---|---|---|
| SGPO | 82.7 | 55.9 | 50.0 | 19.7 | 52.1 |
| w/o strategy distillation | 80.3 | 52.7 | 48.4 | 18.0 | 49.9 |
| w/o autonomous GRPO | 78.0 | 53.1 | 47.1 | 16.3 | 48.6 |
| w/o all proximal constraints | 76.9 | 47.3 | 42.4 | 14.4 | 45.3 |
| w/o KL clipping | 79.2 | 52.1 | 47.1 | 16.3 | 48.7 |
| w/o target selection | 78.8 | 50.4 | 47.0 | 16.0 | 48.1 |
| w/o adaptive weighting | 81.0 | 54.8 | 48.9 | 18.7 | 50.9 |
### 5.2 Ablation Studies
Autonomous GRPO and strategy distillation are complementary. Removing the strategy distillation term ($\alpha(q)=0$, reducing to SFT+GRPO) lowers the average score from 52.1 to 49.9, confirming that strategy-level transfer provides gains beyond on-policy RL alone. Removing autonomous GRPO (i.e., training with only the KL distillation objective on strategy-guided trajectories) leads to a larger drop (52.1 $\to$ 48.6). As shown in Figure [3](https://arxiv.org/html/2606.24064#S5.F3)(a)(b), this setting yields fast early reward growth but the progress is unsustainable: entropy drops sharply and reward subsequently stagnates. A plausible explanation is that without autonomous rollouts, the policy loses the distributional diversity needed to explore beyond the strategies prescribed by the extracted descriptions, causing optimization to plateau once the easily exploitable strategy descriptions are exhausted. These results demonstrate the necessity of combining both objectives: GRPO maintains broad exploration while distillation injects targeted strategic knowledge.
Proximal constraints are essential for stable distillation. Removing all proximal constraints causes the largest single-component drop (52.1 $\to$ 45.3, $-6.8$). Both sub-components contribute, with reachable target selection ($-4.0$) slightly more impactful than KL clipping ($-3.4$), indicating that whether the distillation target is reachable has a particularly direct effect on performance. Unlike the w/o GRPO setting where entropy declines gradually due to lack of exploration diversity, removing proximal constraints triggers abrupt entropy collapse driven by excessively large KL updates, as shown in Figure [3](https://arxiv.org/html/2606.24064#S5.F3)(a)(b). The two constraints address complementary failure modes: target selection limits excessive distributional shifts at the trajectory level, while KL clipping suppresses local token-level outliers.
Adaptive weighting improves efficiency and ceiling. Removing adaptive weighting leads to a moderate drop (52.1 $\to$ 50.9). Figure [3](https://arxiv.org/html/2606.24064#S5.F3)(c) further shows that adaptive weighting accelerates optimization, reaching the same reward level in fewer steps. This confirms that strategy guidance is not equally useful across instances, and concentrating distillation budget on high-marginal-benefit problems improves both convergence speed and final performance. A comparison with alternative weighting strategies is provided in Appendix [G](https://arxiv.org/html/2606.24064#A7).
Figure 3: Training dynamics under ablation settings. (a)(b) Removing either autonomous GRPO or all proximal constraints leads to a rapid entropy collapse. (c) Adaptive weighting accelerates convergence and raises the final reward ceiling.

Figure 4: Training dynamics of SGPO and SFT+GRPO. (a) Policy entropy: SGPO maintains consistently higher entropy. (b) Response length: SGPO reaches longer, more structured responses faster. (c) Reward: SGPO achieves a higher final reward. (d) Average adaptive weight $\bar{\alpha}$: the weight decreases as the autonomous policy improves, reflecting a natural transition from strategy-driven exploration to autonomous optimization.
### 5.3 Further Analysis
#### 5.3.1 Training Dynamics
Figure[4](https://arxiv.org/html/2606.24064#S5.F4)\(a\)\(b\)\(c\) compares SGPO with SFT\+GRPO\. SGPO maintains higher policy entropy throughout training\. Two factors likely contribute: the KL distillation objective introduces new high\-probability options at strategy\-critical positions, enriching the unguided policy’s distribution, while the proximal constraints prevent this distillation from over\-concentrating the distribution\. Both response length and reward increase rapidly in early training for both methods, but SGPO exhibits faster growth, suggesting that strategy conditioning helps the policy discover structured solution patterns earlier\.
Figure [4](https://arxiv.org/html/2606.24064#S5.F4)(d) tracks the average adaptive weight $\bar{\alpha}$ across training steps. In early training, the gap between guided and autonomous pass rates is large, yielding high $\bar{\alpha}$ and strong distillation pressure. As the autonomous policy improves and the pass-rate gap narrows, $\bar{\alpha}$ decreases steadily. This confirms the intended training dynamics: the framework transitions naturally from *strategy-driven fast improvement* in early stages to *autonomous-policy-dominated steady optimization* in later stages, with no manual scheduling required.
Figure 5: Step-level KL analysis on a training example. Gray bars: direct SFT loss, declining monotonically. Colored bars: forward-KL signal, peaking at strategy-critical steps (red, e.g., determining the LCD, isolating the variable) and minimal at routine steps (blue).

Table 3: KL-based distillation vs. direct SFT on successful strategy-guided trajectories.
#### 5\.3\.2Selective Amplification of Strategy\-Critical Tokens
To validate that KL\-based distillation is more effective than uniform imitation, we replace the forward\-KL objective with direct SFT on the same successful strategy\-guided trajectories, keeping all other components identical\. Table[3](https://arxiv.org/html/2606.24064#S5.T3)shows that KL\-based distillation outperforms direct SFT by 2\.8 points on average \(52\.1 vs\. 49\.3\), and Figure[6](https://arxiv.org/html/2606.24064#S5.F6)confirms that the advantage holds throughout training in both convergence speed and final ceiling\.
Figure 6: KL\-based distillation yields faster convergence and a higher reward ceiling than direct SFT on guided trajectories\.

The performance gap arises from the granularity of the learning signal\. Direct SFT applies uniform fitting pressure across all tokens, whereas the KL objective automatically concentrates optimization on tokens whose generation probability shifts most under strategy conditioning\. Figure [5](https://arxiv.org/html/2606.24064#S5.F5) illustrates this on a training example involving a linear equation with fractional coefficients, where none of the 8 autonomous samples are correct while 2 of 4 strategy\-guided samples succeed\. We segment the successful trajectory into ten semantic steps and compute the average KL within each step\. The KL signal peaks at strategy\-critical steps such as determining the least common denominator and distributing it across both sides, both of which correspond directly to the extracted strategy\. In contrast, the SFT loss declines monotonically across steps, reflecting generic token prediction difficulty rather than strategy relevance\. This confirms that the forward\-KL objective provides an inherently strategy\-aware signal that distinguishes strategic decisions from routine verbalization\.
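The contrast between the two signals can be illustrated with a toy computation. The sketch below is not the paper's implementation: it uses a three-token vocabulary, hand-picked probabilities, and hypothetical helper names (`forward_kl`, `sft_nll`), and it omits the proximal clipping that the full objective applies. It only shows that a per-token forward KL between the guided and unguided distributions vanishes wherever strategy conditioning changes nothing, while the SFT negative log-likelihood stays positive at every position.

```python
import math

def forward_kl(p_guided, p_unguided, eps=1e-12):
    """KL(p_guided || p_unguided) at a single token position."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(p_guided, p_unguided)
        if p > 0.0
    )

def sft_nll(p_unguided, token_id, eps=1e-12):
    """Negative log-likelihood of the target token under the unguided policy."""
    return -math.log(p_unguided[token_id] + eps)

# Toy next-token distributions over a 3-token vocabulary (illustrative values).
# Position 1 plays the role of a strategy-critical step: conditioning on the
# strategy moves most of the mass to token 0. Positions 0 and 2 are routine.
guided = [[0.7, 0.2, 0.1], [0.9, 0.05, 0.05], [0.6, 0.3, 0.1]]
unguided = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]
target_tokens = [0, 0, 0]

kl_per_token = [forward_kl(g, u) for g, u in zip(guided, unguided)]
nll_per_token = [sft_nll(u, t) for u, t in zip(unguided, target_tokens)]

for pos, (kl, nll) in enumerate(zip(kl_per_token, nll_per_token)):
    print(f"position {pos}: forward KL = {kl:.3f}, SFT NLL = {nll:.3f}")
```

In this toy setting the KL is nonzero only at the position where the strategy shifts the distribution, whereas the NLL assigns fitting pressure to the routine positions as well.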
## 6 Conclusion
We presented SGPO, a framework that distills reusable strategy descriptions from strong\-model responses and transfers them into the target model’s reasoning policy through token\-level forward\-KL distillation between strategy\-guided and unguided distributions\. Combined with autonomous GRPO exploration, proximal constraints at the trajectory and token levels, and adaptive instance\-level weighting, the framework consistently outperforms strong baselines across mathematical benchmarks and model families\.
Our analysis further shows that the forward\-KL objective provides an inherently selective distillation signal that concentrates on strategy\-critical tokens without requiring any explicit annotation, offering a more effective alternative to uniform trajectory imitation\. Moreover, strategy distillation and base model capability scale complementarily, with gains increasing as the model’s foundational reasoning competence grows\. Together, these findings suggest that transferring knowledge at the strategy level, rather than the trajectory level, represents a promising direction for reasoning distillation\.
## Ethics Statement
This work focuses on improving the reasoning capabilities of language models through strategy distillation\. All training data are drawn from publicly available mathematical problem datasets, and no human subjects or personally identifiable information are involved\. The strong\-model responses used for strategy extraction are generated by existing publicly accessible models\. We acknowledge that improved reasoning capabilities could in principle be misused, but the mathematical reasoning setting studied here poses minimal direct risk\. We will release our code and strategy extraction prompts to support reproducibility\.
## References
- Ailin Huang \(2026\) Ailin Huang, Ang Li, Aobo Kong, et al\. Step 3\.5 flash: Open frontier\-level intelligence with 11b active parameters, 2026\. URL [https://arxiv\.org/abs/2602\.10604](https://arxiv.org/abs/2602.10604)\.
- Chen et al\. \(2025\)Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie\.Sft or rl? an early investigation into training r1\-like reasoning large vision\-language models, 2025\.URL[https://arxiv\.org/abs/2504\.11468](https://arxiv.org/abs/2504.11468)\.
- Chu et al\. \(2025\)Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V\. Le, Sergey Levine, and Yi Ma\.Sft memorizes, rl generalizes: A comparative study of foundation model post\-training, 2025\.URL[https://arxiv\.org/abs/2501\.17161](https://arxiv.org/abs/2501.17161)\.
- Daya et al\. \(2025\) Guo Daya, Yang Dejian, Zhang Haowei, et al\. Deepseek\-r1 incentivizes reasoning in llms through reinforcement learning\. *Nature*, 645\(8081\):633–638, September 2025\. ISSN 1476\-4687\. doi:10\.1038/s41586\-025\-09422\-z\. URL [http://dx\.doi\.org/10\.1038/s41586\-025\-09422\-z](http://dx.doi.org/10.1038/s41586-025-09422-z)\.
- Dubey et al\. \(2024\)Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al\-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al\.The llama 3 herd of models\.*arXiv preprint arXiv:2407\.21783*, 2024\.
- Fu et al\. \(2025\)Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, and Dongbin Zhao\.Srft: A single\-stage method with supervised and reinforcement fine\-tuning for reasoning, 2025\.URL[https://arxiv\.org/abs/2506\.19767](https://arxiv.org/abs/2506.19767)\.
- He et al\. \(2024\)Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun\.Olympiadbench: A challenging benchmark for promoting agi with olympiad\-level bilingual multimodal scientific problems, 2024\.URL[https://arxiv\.org/abs/2402\.14008](https://arxiv.org/abs/2402.14008)\.
- Hendrycks et al\. \(2021\)Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt\.Measuring mathematical problem solving with the math dataset, 2021\.URL[https://arxiv\.org/abs/2103\.03874](https://arxiv.org/abs/2103.03874)\.
- Kwon et al\. \(2023\)Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E\. Gonzalez, Hao Zhang, and Ion Stoica\.Efficient memory management for large language model serving with pagedattention, 2023\.URL[https://arxiv\.org/abs/2309\.06180](https://arxiv.org/abs/2309.06180)\.
- Kydlíček \(2025\)Hynek Kydlíček\.Math\-verify: Math verification library, 2025\.URL[https://github\.com/huggingface/math\-verify](https://github.com/huggingface/math-verify)\.
- Liu et al\. \(2025\)Mingyang Liu, Gabriele Farina, and Asuman Ozdaglar\.Uft: Unifying supervised and reinforcement fine\-tuning, 2025\.URL[https://arxiv\.org/abs/2505\.16984](https://arxiv.org/abs/2505.16984)\.
- Lv et al\. \(2026\)Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, and Bowen Zhou\.Towards a unified view of large language model post\-training, 2026\.URL[https://arxiv\.org/abs/2509\.04419](https://arxiv.org/abs/2509.04419)\.
- Ma et al\. \(2026\)Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Yanhao Li, Bin Cui, and Wentao Zhang\.Learning what reinforcement learning can’t: Interleaved online fine\-tuning for hardest questions, 2026\.URL[https://arxiv\.org/abs/2506\.07527](https://arxiv.org/abs/2506.07527)\.
- Nair et al\. \(2018\)Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel\.Overcoming exploration in reinforcement learning with demonstrations, 2018\.URL[https://arxiv\.org/abs/1709\.10089](https://arxiv.org/abs/1709.10089)\.
- OpenAI et al\. \(2024\) OpenAI, Aaron Jaech, Adam Kalai, et al\. Openai o1 system card, 2024\. URL [https://arxiv\.org/abs/2412\.16720](https://arxiv.org/abs/2412.16720)\.
- Qin & Springenberg \(2025\)Chongli Qin and Jost Tobias Springenberg\.Supervised fine tuning on curated data is reinforcement learning \(and can be improved\), 2025\.URL[https://arxiv\.org/abs/2507\.12856](https://arxiv.org/abs/2507.12856)\.
- Rajeswaran et al\. \(2018\)Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine\.Learning complex dexterous manipulation with deep reinforcement learning and demonstrations, 2018\.URL[https://arxiv\.org/abs/1709\.10087](https://arxiv.org/abs/1709.10087)\.
- Shao et al\. \(2024\)Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y\. K\. Li, Y\. Wu, and Daya Guo\.Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024\.URL[https://arxiv\.org/abs/2402\.03300](https://arxiv.org/abs/2402.03300)\.
- Sheng et al\. \(2025\)Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu\.Hybridflow: A flexible and efficient rlhf framework\.In*Proceedings of the Twentieth European Conference on Computer Systems*, pp\. 1279–1297\. ACM, March 2025\.doi:10\.1145/3689031\.3696075\.URL[http://dx\.doi\.org/10\.1145/3689031\.3696075](http://dx.doi.org/10.1145/3689031.3696075)\.
- Wu et al\. \(2026\)Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming\-Hsuan Yang, and Xu Yang\.On the generalization of sft: A reinforcement learning perspective with reward rectification, 2026\.URL[https://arxiv\.org/abs/2508\.05629](https://arxiv.org/abs/2508.05629)\.
- Yan et al\. \(2025\)Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang\.Learning to reason under off\-policy guidance, 2025\.URL[https://arxiv\.org/abs/2504\.14945](https://arxiv.org/abs/2504.14945)\.
- Yang et al\. \(2025\)An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu\.Qwen2\.5 technical report, 2025\.URL[https://arxiv\.org/abs/2412\.15115](https://arxiv.org/abs/2412.15115)\.
- Ye et al\. \(2025\)Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu\.Limo: Less is more for reasoning, 2025\.URL[https://arxiv\.org/abs/2502\.03387](https://arxiv.org/abs/2502.03387)\.
- Yu et al\. \(2025\)Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei\-Ying Ma, Ya\-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, and Mingxuan Wang\.Dapo: An open\-source llm reinforcement learning system at scale, 2025\.URL[https://arxiv\.org/abs/2503\.14476](https://arxiv.org/abs/2503.14476)\.
- Yuan et al\. \(2025\)Xiaoyang Yuan, Yujuan Ding, Yi Bin, Wenqi Shao, Jinyu Cai, Jingkuan Song, Yang Yang, and Heng Tao Shen\.More than one teacher: Adaptive multi\-guidance policy optimization for diverse exploration, 2025\.URL[https://arxiv\.org/abs/2510\.02227](https://arxiv.org/abs/2510.02227)\.
- Zhang et al\. \(2025\)Xuechen Zhang, Zijian Huang, Yingcong Li, Chenshun Ni, Jiasi Chen, and Samet Oymak\.Bread: Branched rollouts from expert anchors bridge SFT & RL for reasoning, 2025\.URL[https://arxiv\.org/abs/2506\.17211](https://arxiv.org/abs/2506.17211)\.
- Zhu et al\. \(2025\)Wenhong Zhu, Ruobing Xie, Rui Wang, Xingwu Sun, Di Wang, and Pengfei Liu\.Proximal supervised fine\-tuning, 2025\.URL[https://arxiv\.org/abs/2508\.17784](https://arxiv.org/abs/2508.17784)\.
## Appendix A Limitations
Several limitations point to future directions\. First, our experiments focus on mathematical reasoning; validating the framework on other reasoning domains such as code generation, logical reasoning, and scientific problem solving remains important\. Second, strategy extraction currently relies on a strong model and is not fully quality\-controllable; developing more robust extraction methods or enabling the student model to discover strategies autonomously would reduce this dependence\. Third, the complementary scaling pattern implies that very weak base models may not benefit substantially from strategy distillation, and understanding the precise capability threshold deserves further investigation\. Finally, scaling the framework to larger training sets and stronger base models would help clarify its broader applicability\.
## Appendix B Strategy Extraction Pipeline
Strategy descriptions are extracted as a preprocessing step before RL training\. Given a strong\-model \(DeepSeek\-R1\) response to each training problem, we prompt the same model to produce a strategy description conforming to the three\-component structure described in §[3\.1](https://arxiv.org/html/2606.24064#S3.SS1): problem type identification, strategy selection, and general procedural steps\. To mitigate extraction variance, we sample five candidate descriptions per response and rank them with a separate scoring prompt that evaluates four criteria: \(i\) correctness of problem type identification, \(ii\) appropriateness and specificity of the chosen strategy, \(iii\) abstraction level of the procedural steps \(neither too vague nor too detailed\), and \(iv\) absence of intermediate computations or the final answer\. The highest\-scoring candidate is retained\. Prompt templates are given in Appendix[C](https://arxiv.org/html/2606.24064#A3)\.
Table 4: Prompt template for extracting strategy descriptions\.

You are given a math problem together with a complete solution response written by a strong reasoning model\.

Your task is to extract the problem\-solving strategy behind the response\. The extracted strategy must contain exactly three components:

1\. Problem type: classify the problem into a specific, recognizable category\.
2\. Strategy: state the specific principle, theorem, or technique used to solve this type of problem\.
3\. Procedural steps: list the high\-level steps for executing the strategy, in order\.

Do not include intermediate numerical computations or the final answer\.

The output should be concise, generalizable, and helpful for guiding another model to solve similar problems\.

Output format:
Problem type: \[…\]
Strategy: \[…\]
Steps: \(a\) … \(b\) … \(c\) …

Problem: \[problem text\]
Response: \[strong\-model solution\]
Strategy:
## Appendix C Prompt Templates
Table 5: Prompt template for scoring candidate strategy descriptions\.

You are given a math problem, a reference solution, and several candidate extracted strategies\.

Assign a quality score \(0\.0–1\.0\) to each candidate based on the following criteria:

1\. Whether the problem type is correctly and specifically identified\.
2\. Whether the chosen strategy is appropriate, specific, and non\-trivial\.
3\. Whether the procedural steps are at the right abstraction level \(not too vague, not too detailed\)\.
4\. Whether the strategy avoids revealing intermediate computations or the final answer\.

Output format: Score: \[score1, score2, …\]

Problem / Response / Candidate Strategies \(omitted for brevity\)
Score:
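The best-of-five selection described in Appendix B can be sketched as a small post-processing step on the scorer's reply. The code below is an illustration under assumptions: the function names `parse_scores` and `select_best` are hypothetical, the reply is assumed to follow the `Score: [score1, score2, …]` format of the scoring prompt exactly, and the candidate strings are placeholders. The authors' released code may handle parsing and ties differently.

```python
import re

def parse_scores(scorer_output):
    """Extract the list of floats from a reply of the form 'Score: [0.4, 0.9]'."""
    match = re.search(r"Score:\s*\[([^\]]*)\]", scorer_output)
    if match is None:
        raise ValueError("no 'Score: [...]' list found in scorer output")
    return [float(item) for item in match.group(1).split(",") if item.strip()]

def select_best(candidates, scorer_output):
    """Return the highest-scoring candidate and its score (first one wins ties)."""
    scores = parse_scores(scorer_output)
    if len(scores) != len(candidates):
        raise ValueError("number of scores does not match number of candidates")
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]

# Placeholder candidates standing in for five sampled strategy descriptions.
candidates = ["strategy A", "strategy B", "strategy C", "strategy D", "strategy E"]
reply = "Score: [0.4, 0.9, 0.7, 0.55, 0.6]"
best, best_score = select_best(candidates, reply)
print(best, best_score)
```

Raising on a malformed reply, rather than defaulting to the first candidate, keeps a failed scoring call from silently admitting an unranked description.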
## Appendix D Training Algorithm
Algorithm [1](https://arxiv.org/html/2606.24064#alg1) summarizes the full procedure\. For each problem, we jointly sample an autonomous group and a strategy\-guided group\. The autonomous group is optimized through GRPO\. The guided group contributes only when it contains correct trajectories, in which case the most reachable correct response \(scored under the *unguided* policy\) is selected for proximal KL distillation with adaptive weighting\.
Algorithm 1: Strategy\-Guided Policy Optimization \(SGPO\)

1: Input: problem distribution $\mathcal{P}(Q)$, policy $\pi_{\theta}$, strategy dataset $\mathcal{D}_{m} = \{(q, s)\}$, group sizes $G_{1}, G_{2}$
2: Output: updated policy $\pi_{\theta}$
3: for each training iteration do
4:   Sample minibatch $\{q^{(b)}\}_{b=1}^{B} \sim \mathcal{P}(Q)$
5:   for each problem $q^{(b)}$ do
6:     Retrieve strategy $s^{(b)}$
7:     Sample autonomous group $\{o_{i}\}_{i=1}^{G_{1}} \sim \pi_{\theta}(\cdot \mid q^{(b)})$
8:     Sample strategy\-guided group $\{\tilde{o}_{j}\}_{j=1}^{G_{2}} \sim \pi_{\theta}(\cdot \mid q^{(b)}, s^{(b)})$
9:     Compute rewards $\{r(o_{i}, q^{(b)})\}$ and $\{r(\tilde{o}_{j}, q^{(b)})\}$
10:     Compute GRPO advantages $\{\hat{A}_{i}\}$ and loss $\mathcal{L}_{\mathrm{GRPO}}$
11:     $\mathcal{C} \leftarrow \{j : r(\tilde{o}_{j}, q^{(b)}) = 1\}$
12:     if $\mathcal{C} \neq \emptyset$ then
13:       Select reachable target: $o_{m}^{+} = \arg\max_{\tilde{o}_{j} \in \mathcal{C}} \sum_{t} \log \pi_{\theta}(\tilde{o}_{j,t} \mid q^{(b)}, \tilde{o}_{j,<t})$ \(unguided log\-prob\)
14:       Compute clipped KL loss $\mathcal{L}_{\mathrm{KD}}^{\mathrm{clip}}$ on $o_{m}^{+}$
15:       Compute adaptive weight $\alpha(q^{(b)})$ via Eq\. [6](https://arxiv.org/html/2606.24064#S3.E6)
16:     else
17:       $\mathcal{L}_{\mathrm{KD}}^{\mathrm{clip}} \leftarrow 0$, $\alpha(q^{(b)}) \leftarrow 0$
18:     end if
19:     $\mathcal{L}^{(b)} \leftarrow \mathcal{L}_{\mathrm{GRPO}} + \alpha(q^{(b)})\,\mathcal{L}_{\mathrm{KD}}^{\mathrm{clip}}$
20:   end for
21:   Update $\theta$ by gradient descent on $\frac{1}{B}\sum_{b} \mathcal{L}^{(b)}$
22: end for
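The per-problem control flow of Algorithm 1 (lines 9 to 19) can be sketched in plain Python. This is a simplified illustration, not the authors' code: all function names are hypothetical, rewards are assumed binary, the GRPO and clipped KL losses are passed in as precomputed scalars, the population standard deviation is an assumed choice for advantage normalization, and the adaptive weight is supplied by the caller because Eq. 6 is defined in the main text and is not reproduced here.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]

def select_reachable_target(guided_rewards, unguided_logprobs):
    """Index of the correct guided trajectory with the highest sequence
    log-probability under the unguided policy, or None if none is correct."""
    correct = [j for j, r in enumerate(guided_rewards) if r == 1]
    if not correct:
        return None
    return max(correct, key=lambda j: unguided_logprobs[j])

def combine_losses(loss_grpo, loss_kd_clip, alpha, target):
    """Line 19 of Algorithm 1; the KD term is dropped when no target exists."""
    if target is None:
        return loss_grpo
    return loss_grpo + alpha * loss_kd_clip

# Illustrative values for one problem with G1 = 8 and G2 = 4.
auto_rewards = [0, 0, 1, 0, 0, 0, 0, 0]
guided_rewards = [0, 1, 1, 0]
unguided_logprobs = [-5.0, -40.0, -25.0, -3.0]  # summed token log-probs

advantages = grpo_advantages(auto_rewards)
target = select_reachable_target(guided_rewards, unguided_logprobs)
total = combine_losses(loss_grpo=1.0, loss_kd_clip=2.0, alpha=0.5, target=target)
print(target, total)
```

Note that the incorrect guided samples never become targets, even when the unguided policy assigns them a higher log-probability than any correct one.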
## Appendix E Implementation Details
Hardware and frameworks\. All experiments are conducted on 4 nodes, each equipped with 8 GPUs \(32 GPUs in total\)\. We use the veRL framework \(Sheng et al\., [2025](https://arxiv.org/html/2606.24064#bib.bib19)\) for all training \(both SFT and RL stages\) and vLLM \(Kwon et al\., [2023](https://arxiv.org/html/2606.24064#bib.bib9)\) for inference during evaluation\. Answer correctness is verified using Math\-Verify \(Kydlíček, [2025](https://arxiv.org/html/2606.24064#bib.bib10)\)\. End\-to\-end training \(SFT warm\-up \+ RL\) takes approximately 17 hours for the 7B model\.
SFT warm\-up\. We use roughly one third of the 8\.5K training instances for supervised warm\-up to endow the base models with basic thinking ability\. We use batch size 64, train for 10 epochs with learning rate $1 \times 10^{-5}$, warm\-up ratio 0\.1, and maximum sequence length 16K\. The checkpoint with the lowest loss on a randomly sampled one\-tenth validation split is used to initialize RL\.
Reinforcement learning\. The total group size is 12 \($G_{1} = 8$ autonomous, $G_{2} = 4$ strategy\-guided\)\. We use temperature 1\.0, top\-$p = 0.95$, maximum generation length 16K, and train for 10 epochs with learning rate $1 \times 10^{-6}$\. Hyperparameters: GRPO clipping bounds $\varepsilon_{\text{low}} = 0.2$ and $\varepsilon_{\text{high}} = 0.28$ \(Yu et al\., [2025](https://arxiv.org/html/2606.24064#bib.bib24)\), KL threshold $\delta = 1.0$, base adaptive coefficient $\alpha_{0} = 0.5$, smoothing constant $\gamma = 0.1$, and maximum adaptive weight $\alpha_{\max} = 0.8$\.
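To make the asymmetric clipping bounds concrete, the sketch below evaluates a PPO-style clipped surrogate for a single token with the decoupled lower and upper bounds used in DAPO (Yu et al., 2025). It is an assumption that SGPO's GRPO term takes exactly this per-token form; the paper's objective is defined in the main text. The function name `clipped_surrogate` is hypothetical.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped objective (to be maximized):
    min(ratio * A, clip(ratio, 1 - eps_low, 1 + eps_high) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: gains are capped once the ratio exceeds 1 + eps_high.
up = clipped_surrogate(ratio=1.5, advantage=1.0)
# Negative advantage: the penalty is floored once the ratio drops below 1 - eps_low.
down = clipped_surrogate(ratio=0.5, advantage=-1.0)
# Inside the trust region the objective is simply ratio * advantage.
inside = clipped_surrogate(ratio=1.1, advantage=2.0)
print(up, down, inside)
```

The larger upper bound leaves more room to raise the probability of low-probability tokens with positive advantage than a symmetric bound of 0.2 would.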
Baselines\. SFT and SFT\+GRPO use the same warm\-up hyperparameters as our method\. For HPT and LUFFY, we match the total number of samples and sampling parameters to ensure fair comparison, while keeping their method\-specific hyperparameters at the default values reported in the original papers, which already yield stable performance in our setting\.
## Appendix F KL Direction Ablation
We compare the forward KL direction $D_{\mathrm{KL}}(\pi_{\text{guided}} \,\|\, \pi_{\text{unguided}})$ used in our method with the reverse direction $D_{\mathrm{KL}}(\pi_{\text{unguided}} \,\|\, \pi_{\text{guided}})$ on Qwen2\.5\-7B\-Instruct\.
Table 6: KL direction ablation on Qwen2\.5\-7B\-Instruct\.

The forward KL consistently outperforms the reverse direction\. The strategy\-guided distribution reflects a particular strategy, but a problem may admit multiple effective strategies that the model has already discovered through autonomous exploration\. Forward KL only requires coverage of the guided distribution, preserving these existing alternatives while absorbing the new methodological insight\. Reverse KL would instead collapse the unguided policy onto the guided modes, risking loss of alternative strategies\.
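The coverage argument can be checked on a toy distribution. The sketch below is not drawn from the paper's experiments: the three outcomes (strategy A, strategy B, other) and all probabilities are invented for illustration, and `kl` is a hypothetical helper. It compares how each KL direction scores an unguided policy that has absorbed the guided strategy A while keeping substantial mass on its own alternative B.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

# Outcomes: [strategy A, strategy B, other]. Values are illustrative.
guided = [0.98, 0.01, 0.01]    # strategy conditioning concentrates on A
covering = [0.60, 0.39, 0.01]  # unguided policy: covers A, still keeps B

forward = kl(guided, covering)  # D_KL(guided || unguided), the SGPO direction
reverse = kl(covering, guided)  # D_KL(unguided || guided), the ablated direction

print(f"forward KL = {forward:.3f}, reverse KL = {reverse:.3f}")
```

In this example the reverse direction assigns the larger penalty, because the mass retained on strategy B falls where the guided distribution is nearly zero; minimizing it would push the unguided policy to abandon B, which matches the collapse risk described above.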
## Appendix G Adaptive Weighting Strategy Comparison
We compare the pass\-rate\-based adaptive weighting \(Eq\. [6](https://arxiv.org/html/2606.24064#S3.E6)\) against two alternatives on Qwen2\.5\-7B\-Instruct: \(i\) a fixed coefficient $\alpha = 1.0$ \(equivalent to the “w/o adaptive weighting” setting in Table [2](https://arxiv.org/html/2606.24064#S5.T2)\), and \(ii\) a group\-advantage weight described below\.
Group\-advantage weight\. Let $o^{+}$ be the selected correct strategy\-guided trajectory and $\{o_{i}\}_{i=1}^{G_{1}}$ the autonomous group\. We merge them into a combined group $\mathcal{G} = \{o_{1}, \dots, o_{G_{1}}, o^{+}\}$ and compute a GRPO\-style normalized advantage for the guided trajectory:

$$\hat{A}_{m^{+}} = \frac{R(o^{+}, q) - \mu_{\mathcal{G}}}{\sigma_{\mathcal{G}} + \epsilon}, \tag{8}$$

where $\mu_{\mathcal{G}}$ and $\sigma_{\mathcal{G}}$ are the mean and standard deviation of rewards within $\mathcal{G}$\. The distillation weight is then set to $\alpha_{\text{group}}(q) = \mathrm{clip}(\hat{A}_{m^{+}},\, 0,\, \alpha_{\max})$\.
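Eq. 8 and the clipping step are simple enough to evaluate by hand, and the following sketch does so for a few autonomous pass rates. It assumes binary rewards, a guided reward of 1, the population standard deviation, and $\alpha_{\max} = 0.8$ from Appendix E; the function name `group_advantage_weight` and the value of `eps` are hypothetical.

```python
import statistics

def group_advantage_weight(auto_rewards, guided_reward=1.0, alpha_max=0.8, eps=1e-6):
    """alpha_group(q) = clip(A_hat, 0, alpha_max), with A_hat from Eq. 8."""
    group = list(auto_rewards) + [guided_reward]
    mean = statistics.fmean(group)
    std = statistics.pstdev(group)  # assumption: population std over the merged group
    advantage = (guided_reward - mean) / (std + eps)
    return min(max(advantage, 0.0), alpha_max)

# Autonomous groups of size G1 = 8 with different numbers of correct samples.
weights = {
    n_correct: group_advantage_weight([1] * n_correct + [0] * (8 - n_correct))
    for n_correct in (0, 4, 7, 8)
}
for n_correct, weight in weights.items():
    print(f"{n_correct}/8 autonomous correct -> alpha_group = {weight:.3f}")
```

Under these assumptions the weight saturates at $\alpha_{\max}$ whenever half or fewer of the autonomous samples are correct, so it does not separate a problem that is merely hard from one where the strategy is what made the difference.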
Table 7: Comparison of weighting strategies on Qwen2\.5\-7B\-Instruct\.

The pass\-rate\-based design consistently outperforms both alternatives\. The fixed coefficient applies uniform distillation strength regardless of instance difficulty\. The group\-advantage weight reflects the relative quality of the guided trajectory within the combined group but conflates two distinct signals: problem difficulty and strategy effectiveness\. A correct guided trajectory receives a high advantage whenever the autonomous group mostly fails, yet this does not indicate how much of the success is attributable to the strategy itself versus the inherent difficulty of the problem\. In contrast, the pass\-rate\-based design disentangles these factors by directly measuring the pass\-rate gap between guided and autonomous groups, providing a more faithful estimate of the marginal benefit of strategy guidance\.
## Appendix H Strategy Description Quality Analysis
We randomly sample 500 strategy descriptions and score each using Step\-3\.5\-Flash \(Ailin Huang, [2026](https://arxiv.org/html/2606.24064#bib.bib1)\) on the same four criteria used in the extraction scoring prompt \(Appendix [C](https://arxiv.org/html/2606.24064#A3)\): correctness of problem type identification, appropriateness of strategy, abstraction level of procedural steps, and absence of answer leakage\. Each criterion is rated on a 0–1 scale and the four scores are averaged into an overall quality score\. Table [8](https://arxiv.org/html/2606.24064#A8.T8) reports the distribution\.
Table 8: Quality distribution of extracted strategy descriptions\.

The majority of extracted descriptions fall in the High and Medium bands, indicating that the extraction pipeline combined with the scoring\-based selection \(Appendix [B](https://arxiv.org/html/2606.24064#A2)\) produces generally reliable strategy descriptions\. The small fraction of Low\-quality descriptions is handled by the adaptive weighting mechanism \(Eq\. [6](https://arxiv.org/html/2606.24064#S3.E6)\): when a strategy is unhelpful, the guided pass rate $p_{\text{guide}}$ will not substantially exceed $p_{\text{auto}}$, automatically reducing $\alpha(q)$ and limiting the distillation signal from these instances\.