SocraticPO: Policy Optimization via Interactive Guidance

arXiv cs.LG Papers

Summary

SocraticPO augments RL rollouts with Socratic-style natural language guidance and reward decay to improve scientific reasoning in LLMs, outperforming strong baselines on SciKnowEval benchmarks.

arXiv:2606.09887v1 Announce Type: new

Cached at: 06/10/26, 06:16 AM

# SocraticPO: Policy Optimization via Interactive Guidance
Source: [https://arxiv.org/html/2606.09887](https://arxiv.org/html/2606.09887)

Zirui Liu¹, Jie Ouyang¹, Qi Liu¹ (corresponding author), Xianquan Wang¹, Jiayu Liu¹, Tingyue Pan¹, Qingchuan Li¹, Jing Sha², Zhenya Huang¹, Shijin Wang², Enhong Chen¹

¹ State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China

² iFLYTEK AI Research (Central China), iFLYTEK Co., Ltd

Correspondence: qiliuql@ustc.edu.cn

###### Abstract

Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness. Such rewards provide an optimization direction but rarely explain how a model should revise its mistaken reasoning, which can encourage shortcut learning and brittle policies. We propose SocraticPO (Socratic Policy Optimization), a policy-optimization framework that augments RL rollouts with Socratic-style natural-language guidance. During rollout, the student first answers independently; if the answer is incorrect, a teacher diagnoses the attempt and provides concise corrective guidance, after which the student continues under the expanded context. Crucially, this guidance is paired with reward decay: correct answers obtained after teacher intervention only receive decayed rewards, preventing the policy from treating teacher help as a free path to reward. Since SocraticPO only modifies the rollout process while leaving the standard expected-reward objective intact, it can be plugged into existing policy-gradient backends such as Reinforce++. Moreover, because the teacher provides only text-level guidance, SocraticPO can leverage stronger black-box teacher models without requiring access to logits or distribution matching. On undergraduate-level scientific reasoning benchmarks from SciKnowEval, SocraticPO improves over strong RL and self-distillation baselines. Ablations show that both targeted guidance and reward decay are necessary, with reward decay mitigating reliance on assisted correction.

![Refer to caption](https://arxiv.org/html/2606.09887v1/x1.png)

Figure 1: Overview of SocraticPO. Conventional RL rollouts provide only scalar outcome feedback after a student response, whereas SocraticPO inserts conditional teacher guidance after failed attempts.

## 1 Introduction

Large language models \(LLMs\) have advanced rapidly with transformer architectures, scaling, and instruction tuning\[vaswani2017attention,brown2020language,touvron2023llama,achiam2023gpt,bai2023qwen\]\. Their post\-training increasingly relies on reinforcement learning \(RL\), where a policy is optimized from task rewards or preference\-derived signals\[christiano2017deep,stiennon2020learning,ouyang2022training\]\. For reasoning tasks, however, RL feedback is often reduced to a scalar outcome signal, such as binary correctness\. Such feedback tells the model which trajectories should be reinforced, but not how a mistaken reasoning process should be repaired\.

This limitation is especially problematic in multi\-step scientific reasoning\. A wrong final answer may arise from a localized conceptual error, an incorrect comparison, or a misplaced intermediate value\. Outcome rewards provide gradient\-level optimization pressure, but they do not provide thought\-level correction\. As a result, the policy may discover shortcuts that satisfy the verifier without learning robust reasoning\.

Human learning offers a different pattern of supervision. Teachers rarely provide only a grade; they inspect the learner’s attempt, identify the source of the mistake, and give hints, critiques, or next-step suggestions [wood1976role,vanlehn2011relative]. The Socratic tradition frames this process as guided inquiry rather than direct answer transfer [scott2006plato,liu2024socraticlm]. Inspired by this view, we introduce Socratic Policy Optimization (SocraticPO), which brings Socratic-style guidance into the rollout stage of RL. A student first attempts the problem independently. If the attempt is incorrect, a teacher provides natural-language guidance conditioned on the student’s response, and the student continues under the expanded context.

Guidance must be introduced carefully. If a response corrected after teacher intervention receives the same reward as one solved independently, the policy may learn a training-time shortcut: wait for help, use the teacher-provided clue, and still collect full reward. Assisted correctness can then improve without a corresponding improvement in unassisted reasoning. SocraticPO therefore pairs teacher guidance with reward decay: correct answers obtained after teacher intervention receive decayed rewards, and the reward decreases as more guidance rounds are needed. This encourages the student not only to reach the correct answer, but also to internalize the correction and solve earlier with less assistance.

SocraticPO changes only the rollout process\. The student still optimizes a standard expected\-reward objective, while teacher utterances are appended as text context for subsequent student attempts rather than optimized as imitation targets\. As a result, SocraticPO can be instantiated with existing policy\-gradient backends such as Reinforce\+\+\[hu2025reinforce\]\. Since the teacher is required only to produce text\-level guidance, the framework can also leverage stronger black\-box teacher models without accessing logits, token\-level likelihoods, or matching teacher and student distributions, unlike probability\-matching approaches such as OPD\[agarwal2024policy,fu2026revisiting\]\.

We evaluate SocraticPO on undergraduate-level scientific reasoning benchmarks derived from SciKnowEval [feng2024sciknoweval]. Compared with strong RL baselines, including Reinforce++ [hu2025reinforce] and GRPO [shao2024deepseekmath], as well as self-distillation-style training such as SDPO [hubotter2026reinforcement], SocraticPO achieves the best average performance. Ablations further show that both components are necessary: targeted guidance provides useful correction beyond blind retrying, while reward decay mitigates reliance on teacher-assisted correction.

Our main contributions are:

- A general SocraticPO framework. We propose an interactive policy-optimization framework that augments RL with teacher-guided reasoning correction and remains compatible with existing policy-gradient backends.
- Reward decay for SocraticPO. To make teacher-guided RL train the student rather than reward dependence on assistance, we introduce a reward decay mechanism for assisted correctness. We further characterize its normalized-advantage properties, showing non-negativity, monotonicity, and saturation decay.
- Black-box teacher compatibility. Our method only requires teachers to provide text-level guidance instead of distribution matching, making it compatible with stronger API-based or otherwise black-box teachers.
- Empirical evidence and analysis. We show gains on scientific reasoning benchmarks and provide ablations on teacher guidance, reward decay, teacher-side information, and teacher–student similarity.

## 2 Related Work

### 2.1 Reinforcement Learning for LLMs

Reinforcement learning has become a central paradigm for aligning LLMs with human preferences and task rewards [christiano2017deep,ziegler2019fine,stiennon2020learning,ouyang2022training]. PPO [schulman2017proximal] is widely used in RLHF pipelines, and feedback can also be provided by AI systems rather than direct human annotation [bai2022constitutional]. More recent LLM training methods adopt simpler or more scalable policy-gradient variants, including REINFORCE-style updates [williams1992simple,sutton1999policy,hu2025reinforce] and group-wise relative objectives such as GRPO [shao2024deepseekmath].

Most of these methods optimize scalar reward signals\. In reasoning tasks, such rewards identify whether a trajectory should be reinforced but do not explain which part of the reasoning process should be corrected\. Process\-supervision work shows that intermediate feedback can be more informative than outcome supervision alone\[uesato2022solving\], yet how to inject corrective information into on\-policy RL rollouts remains underexplored\. SocraticPO addresses this gap by adding natural\-language correction to rollout sampling while retaining a standard policy\-gradient objective\.

### 2.2 Human Guided Teaching

Educational research emphasizes that effective teaching is interactive rather than purely evaluative\. Tutors scaffold learning by inspecting a student’s current attempt, diagnosing misconceptions, and providing hints or next\-step guidance\[wood1976role,vanlehn2011relative\]\. The Socratic tradition similarly frames learning as guided inquiry: questions and critiques help the learner revise an answer without simply copying an expert solution\[scott2006plato\]\. Recent NLP work has explored Socratic tutoring and personalized teaching dialogues with LLMs\[liu2024socraticlm\]\.

SocraticPO adapts this teaching principle to RL training\. The teacher does not merely label an answer as wrong; it provides language\-level correction conditioned on the student’s failed attempt\. At the same time, SocraticPO differs from tutoring systems whose goal is to optimize dialogue quality\. Our goal is to improve the student policy through RL, so teacher guidance is paired with a reward design that encourages the student to solve earlier and rely less on future intervention\.

### 2.3 Student–Teacher Paradigms in LLMs

Teacher–student learning is widely used in LLM post\-training to transfer capability from stronger models to smaller or weaker students\. One common paradigm is teacher\-generated supervised fine\-tuning \(SFT\), where a stronger model produces instructions, answers, or reasoning traces that are then treated as demonstration data for the student\. Self\-Instruct, Alpaca, WizardLM, Orca, and step\-by\-step distillation all follow this broad pattern\[wang2023selfinstruct,taori2023alpaca,xu2023wizardlm,mukherjee2023orca,hsieh2023distilling\]\. Recent work also studies how to synthesize student\-consistent SFT data from a teacher so that the demonstrations better match the student’s learning state\[huang2026fine\]\.

Another paradigm is knowledge or policy distillation, where the student is trained to approximate a teacher’s behavior more directly\. Classical distillation matches teacher distributions or soft targets\[hinton2015distilling\], while on\-policy distillation adapts this idea to LLMs by using model\-generated or teacher\-shaped trajectories during training\[agarwal2024policy,song2026survey,fu2026revisiting\]\. Self\-distillation and hybrid distillation variants further use the model’s own successful responses, privileged trajectories, or routed samples as training targets\[hubotter2026reinforcement,li2026srpo,ding2026hdpo,ye2026opcd\]\. Despite differences in data source and objective, these methods generally ask the student to imitate, approximate, or absorb teacher\-like behavior\.

SocraticPO adopts a different teacher–student relation\. The teacher is not a demonstration generator whose answer should be copied, nor a distributional target whose probabilities should be matched\. Instead, it acts as an interactive guide: after a failed student attempt, the teacher diagnoses the mistake and provides concise corrective guidance, while the student still produces its own subsequent response and is optimized only through task reward\. Because this guidance is text\-level, SocraticPO can also use black\-box teachers as an additional benefit of the design\.

## 3 Preliminaries

We first introduce the standard policy\-gradient formulation for single\-turn LLM training, then use it to clarify what SocraticPO changes: the rollout distribution rather than the expected\-reward objective\.

### 3.1 Language Model Policy

Given an input prompt $x$, an autoregressive language model with parameters $\theta$ defines a policy $\pi_{\theta}(y \mid x)$ over response sequences $y = (y_{1}, \dots, y_{T})$. The response probability factorizes token by token:

$$\pi_{\theta}(y \mid x) = \prod_{t=1}^{T} \pi_{\theta}(y_{t} \mid x, y_{<t}). \tag{1}$$

In the standard single-turn setting, one sampled response is treated as a trajectory $\tau = y$ and assigned a scalar reward $R(\tau)$, such as a correctness score.
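As a small illustration of Eq. (1), the sketch below computes a sequence probability from per-token probabilities by summing logs, which is how the product is usually evaluated in practice to avoid underflow. The function name and the three token probabilities are hypothetical values chosen for this example, not quantities from the paper.

```python
import math


def sequence_log_prob(token_probs):
    """Return log pi(y | x) as the sum of per-token log-probabilities (Eq. 1 in log space)."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical values of pi(y_t | x, y_<t) for a three-token response.
token_probs = [0.5, 0.25, 0.8]

log_p = sequence_log_prob(token_probs)
seq_prob = math.exp(log_p)  # equals the product 0.5 * 0.25 * 0.8
print(f"log-prob = {log_p:.4f}, prob = {seq_prob:.4f}")
```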

### 3.2 Expected-Reward Objective

Policy optimization maximizes the expected reward under the current policy [williams1992simple,sutton1999policy]:

$$\mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\cdot \mid x)}[R(\tau)]. \tag{2}$$

For a sequence-level reward, the corresponding policy-gradient estimator can be written as:

$$\nabla_{\theta}\mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[R(\tau)\sum_{t=1}^{T}\nabla_{\theta}\log\pi_{\theta}(y_{t} \mid x, y_{<t})\right]. \tag{3}$$

Practical implementations often add baselines, reward normalization, clipping, or group-wise comparison to reduce variance and stabilize training [schulman2017proximal,shao2024deepseekmath,hu2025reinforce].
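To make Eq. (3) concrete, the following toy sketch compares a Monte Carlo estimate of the score-function gradient with the exact gradient for a one-token "response" drawn from a softmax policy over three actions. Everything here (the three-action policy, the reward function, the sample count) is an assumption made for illustration; it is not the training setup used in the paper.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(probs, action):
    """Gradient of log softmax(logits)[action] w.r.t. the logits: one_hot(action) - probs."""
    return [(1.0 if a == action else 0.0) - p for a, p in enumerate(probs)]


def reinforce_estimate(logits, reward_fn, num_samples, rng):
    """Monte Carlo estimate of E[R(tau) * grad log pi(tau)], as in Eq. (3)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        action = rng.choices(range(len(logits)), weights=probs)[0]
        score = grad_log_prob(probs, action)
        r = reward_fn(action)
        grad = [g + r * s / num_samples for g, s in zip(grad, score)]
    return grad


def exact_gradient(logits, reward_fn):
    """Exact expectation, obtained by enumerating all actions."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for action, p in enumerate(probs):
        score = grad_log_prob(probs, action)
        grad = [g + p * reward_fn(action) * s for g, s in zip(grad, score)]
    return grad


def reward(action):
    # Hypothetical binary correctness: only action 2 counts as correct.
    return 1.0 if action == 2 else 0.0


logits = [0.0, 0.0, 0.0]
exact = exact_gradient(logits, reward)
estimate = reinforce_estimate(logits, reward, 20000, random.Random(0))
print("exact   :", [round(g, 4) for g in exact])
print("estimate:", [round(g, 4) for g in estimate])
```

The estimate pushes probability mass toward the rewarded action and away from the others, which is the behavior the scalar outcome reward induces.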

### 3.3 Rollout Sampling

The rollout distribution determines which trajectories the policy learns from. In conventional LLM RL, the rollout is sampled only from the student policy:

$$\tau \sim \pi_{\theta}(\cdot \mid x). \tag{4}$$

The reward is then computed after generation, and the same trajectory is used for the policy update.

### 3.4 Verifier and Correctness Notation

For each prompt $x$, let $\mathcal{G}$ denote the information used by the verifier, such as the ground-truth answer and optional reference solution. The teacher may also receive side information $\mathcal{I}$, which can be empty, restricted, or include reference material depending on the teaching protocol. We define a correctness function

$$\delta(y, \mathcal{G}) = \mathbb{I}[\text{correct}(y, \mathcal{G})]. \tag{5}$$

## 4 Method

Traditional scalar rewards only tell the model whether an answer is correct or incorrect\. They provide an optimization signal, but not a diagnosis of the reasoning error that caused the failure\. In human learning, a teacher often plays this diagnostic role: by pointing out where the student’s current attempt goes wrong, the teacher can help the student revise more efficiently than a grade alone\. SocraticPO turns this idea into a policy\-optimization mechanism\. It modifies the rollout stage by inserting conditional teacher feedback after failed student attempts, so that the policy learns from trajectories that contain both the original mistake and a correction\-oriented context for the next attempt\.

### 4.1 Interactive Rollout

Using the notation from Section [3](https://arxiv.org/html/2606.09887#S3), the rollout is organized as a sequence of student attempts rather than a single long generation. This step-wise structure gives the verifier a chance to identify failure points and allows the teacher to respond to the student’s actual mistake, instead of providing generic hints before the model has committed to a line of reasoning. At each interaction step $k$ with $1 \leq k \leq K$, the dialogue context is denoted by $c^{(k)}$, with $c^{(1)} = x$. The student then generates a complete response:

$$y^{(k)} = (y^{(k)}_{1}, \dots, y^{(k)}_{T_{k}}) \sim \pi_{\theta}(\cdot \mid c^{(k)}). \tag{6}$$

Equivalently, the response probability factorizes at the token level as

$$\pi_{\theta}(y^{(k)} \mid c^{(k)}) = \prod_{t=1}^{T_{k}} \pi_{\theta}(y^{(k)}_{t} \mid c^{(k)}, y^{(k)}_{<t}). \tag{7}$$

The response is then evaluated by

$$\delta^{(k)} = \delta(y^{(k)}, \mathcal{G}). \tag{8}$$

If $\delta^{(k)} = 1$, the rollout terminates. If $\delta^{(k)} = 0$ and $k < K$, the teacher is queried with a teaching prompt that includes the current dialogue context, the student’s latest response, and the teacher-side information $\mathcal{I}$. The teacher then generates concise natural-language feedback, such as an error diagnosis, a correction hint, or a suggested next step. We provide the prompt template in Appendix [B.2](https://arxiv.org/html/2606.09887#A2.SS2). Formally,

$$f^{(k)} \sim \pi_{\phi}(\cdot \mid c^{(k)}, y^{(k)}, \mathcal{I}), \tag{9}$$

and the next context is constructed as

$$c^{(k+1)} = c^{(k)} \,\|\, y^{(k)} \,\|\, f^{(k)}. \tag{10}$$

Thus teacher feedback appears only after an incorrect student response. A sampled interaction trajectory can be written as

$$\tau = \left(x, y^{(1)}, f^{(1)}, \dots, f^{(L-1)}, y^{(L)}\right), \tag{11}$$

where $L \leq K$ is the number of student attempts actually made, and $f^{(k)}$ is present only when $\delta^{(k)} = 0$ and $k < L$. This defines an interaction-conditioned rollout distribution

$$\tau \sim \pi_{\theta,\phi}^{\mathrm{int}}(\cdot \mid x, \mathcal{G}, \mathcal{I}, K). \tag{12}$$

Only the student factors in this distribution depend on $\theta$; teacher outputs are sampled conditioning variables.
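The rollout in Eqs. (6) to (11) can be sketched as a short loop. This is a minimal illustration under stated assumptions: `student`, `teacher`, and `is_correct` are hypothetical callables standing in for $\pi_{\theta}$, $\pi_{\phi}$, and the verifier $\delta(\cdot, \mathcal{G})$, and contexts are concatenated as plain strings. The paper's actual prompt template (Appendix B.2) is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    context: str                    # c^(k)
    response: str                   # y^(k)
    correct: bool                   # delta^(k)
    feedback: Optional[str] = None  # f^(k), set only after a failed attempt with k < K


def interactive_rollout(
    prompt: str,
    student: Callable[[str], str],
    teacher: Callable[[str, str, object], str],
    is_correct: Callable[[str], bool],
    side_info: object,
    max_steps: int,
) -> List[Step]:
    context = prompt  # c^(1) = x
    steps: List[Step] = []
    for k in range(1, max_steps + 1):
        response = student(context)
        step = Step(context, response, is_correct(response))
        steps.append(step)
        if step.correct:
            break  # rollout terminates on the first correct attempt
        if k < max_steps:
            # Teacher sees the context, the failed attempt, and its side information.
            step.feedback = teacher(context, response, side_info)
            context = "\n".join([context, response, step.feedback])  # Eq. (10)
    return steps


# Toy stand-ins (assumptions for this example only).
def toy_student(context: str) -> str:
    return "4" if "hint" in context else "5"


def toy_teacher(context: str, response: str, side_info: object) -> str:
    return "hint: re-check the addition."


steps = interactive_rollout(
    "What is 2 + 2?", toy_student, toy_teacher, lambda r: r == "4",
    side_info=None, max_steps=3,
)
for s in steps:
    print(s.response, s.correct, s.feedback)
```

In this toy run the student fails once, receives one round of feedback, and succeeds on the second attempt, so the trajectory has $L = 2$ attempts and a single teacher utterance.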

### 4.2 Policy Optimization

We keep the expected-reward objective but evaluate it under the interaction-conditioned rollout distribution. This preserves compatibility with standard policy-gradient training: SocraticPO changes which trajectories are sampled, not the principle that the student is reinforced by task reward.

$$\mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta,\phi}^{\mathrm{int}}}\left[R(\tau)\right]. \tag{13}$$

In our step-wise rollout, each generated student response receives a reward $r^{(k)}$, and the trajectory return is

$$R(\tau) = \sum_{k=1}^{L} r^{(k)}. \tag{14}$$

Using step-wise rewards is important for within-trajectory credit assignment [wang2026steppo]. A final trajectory-level reward would mark the whole interaction as successful once the last answer is correct, but it would not identify which student attempt should receive credit. Step-wise rewards instead attach the learning signal to the specific response that becomes correct and leave earlier failed attempts unrewarded. This separates independent success, recovery after one targeted hint, and recovery after repeated intervention inside the same trajectory. The policy-gradient update is applied only to student tokens:

$$\nabla_{\theta}\mathcal{J}(\theta) \approx \sum_{k=1}^{L} r^{(k)} \sum_{t=1}^{T_{k}} \nabla_{\theta}\log\pi_{\theta}(y^{(k)}_{t} \mid c^{(k)}, y^{(k)}_{<t}). \tag{15}$$
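One way to read Eq. (15) is as the gradient of a surrogate objective in which each student token's log-probability is weighted by the reward of the attempt it belongs to, while teacher tokens are masked out. The sketch below computes that surrogate value for a hand-written trajectory. The token tuples, log-probabilities, and rewards are invented for illustration, and a real implementation would obtain the gradient by differentiating this quantity with an autodiff framework.

```python
def surrogate_objective(tokens, step_rewards):
    """Sum of r^(k) * log pi(token) over student tokens only.

    tokens: iterable of (role, step_index, log_prob) tuples.
    step_rewards: mapping from step index k to the step reward r^(k).
    """
    total = 0.0
    for role, step, log_prob in tokens:
        if role == "student":  # teacher tokens serve as context and carry no gradient
            total += step_rewards[step] * log_prob
    return total


# Hypothetical two-attempt trajectory: attempt 1 fails, attempt 2 succeeds after one hint.
tokens = [
    ("student", 1, -0.1), ("student", 1, -0.2),
    ("teacher", 1, -5.0),
    ("student", 2, -0.3), ("student", 2, -0.4),
]
step_rewards = {1: 0.0, 2: 0.5}

value = surrogate_objective(tokens, step_rewards)
print(round(value, 4))
```

Because the failed first attempt has zero reward, only the tokens of the second attempt contribute, which is the within-trajectory credit assignment the step-wise reward is meant to provide.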

### 4.3 Reward Decay

![Refer to caption](https://arxiv.org/html/2606.09887v1/x2.png)

Figure 2: Step-wise reward and reward decay in SocraticPO. Step-wise rewards assign credit to the specific student attempt that becomes correct within an interaction trajectory, while reward decay calibrates credit across the batch by assigning smaller rewards to answers that require more rounds of teacher guidance.

Step-wise rewards alone, however, do not solve credit assignment across samples in the same batch. If every correct response receives the same reward, then a problem solved independently at step $1$ and another problem solved only after several rounds of guidance would contribute the same positive reward, even though they reflect different degrees of student competence. This creates an assisted reward-hacking failure mode: the student can benefit from teacher intervention during training without learning to solve earlier or independently. We discuss this situation in Section [5.3](https://arxiv.org/html/2606.09887#S5.SS3). Therefore, after locating credit within each trajectory, SocraticPO further calibrates credit across the batch by decaying rewards for later-step correctness.

Consider a training batch. Let $\mathcal{A}_{k}$ be the set of samples that are active at step $k$, meaning they have not been solved before step $k$. For sample $i \in \mathcal{A}_{k}$, let

$$\delta_{i}^{(k)} = \mathbb{I}[\text{correct}(y_{i}^{(k)}, \mathcal{G}_{i})]. \tag{16}$$

The first-step reward is

$$r_{i}^{(1)} = \delta_{i}^{(1)}. \tag{17}$$

For $k \geq 2$, we compute the historical average reward over previous active attempts:

$$\mu_{k-1} = \frac{\sum_{j=1}^{k-1}\sum_{i \in \mathcal{A}_{j}} r_{i}^{(j)}}{\sum_{j=1}^{k-1}|\mathcal{A}_{j}|}, \tag{18}$$

and define

$$r_{i}^{(k)} = \mu_{k-1} \cdot \delta_{i}^{(k)}, \qquad i \in \mathcal{A}_{k}. \tag{19}$$

Incorrect responses always receive zero reward. Correct responses at later steps receive a reward scaled by the average success value of earlier attempts, which makes delayed correction less valuable than immediate success. This reward rule yields the properties stated in Theorem 1; the proof is provided in Appendix [A](https://arxiv.org/html/2606.09887#A1).
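The reward rule in Eqs. (17) to (19) can be traced on a small batch. The sketch below is an illustrative implementation, not the authors' code: the function name, the dictionary-based data layout, and the four-sample batch are assumptions made for this example.

```python
def decayed_rewards(correct_by_step):
    """Apply Eqs. (17)-(19).

    correct_by_step[k-1] maps each sample id active at step k to its
    correctness delta in {0, 1}. Returns a list of {sample id: reward} dicts.
    """
    rewards = []
    reward_sum = 0.0    # numerator of Eq. (18): rewards over all previous attempts
    attempt_count = 0   # denominator of Eq. (18): number of previous active attempts
    for k, correct in enumerate(correct_by_step, start=1):
        # Step 1 uses raw correctness; later steps scale by the historical mean mu_{k-1}.
        scale = 1.0 if k == 1 else reward_sum / attempt_count
        step_rewards = {i: scale * delta for i, delta in correct.items()}
        rewards.append(step_rewards)
        reward_sum += sum(step_rewards.values())
        attempt_count += len(step_rewards)
    return rewards


# Hypothetical batch of four prompts with K = 3.
correct_by_step = [
    {0: 1, 1: 1, 2: 0, 3: 0},  # step 1: samples 0 and 1 solve independently
    {2: 1, 3: 0},              # step 2: sample 2 recovers after one hint
    {3: 1},                    # step 3: sample 3 recovers after two hints
]
rewards = decayed_rewards(correct_by_step)
for k, step_rewards in enumerate(rewards, start=1):
    print(k, {i: round(r, 4) for i, r in step_rewards.items()})
```

In this example $\mu_{1} = 2/4 = 0.5$, so the step-2 success earns $0.5$, and $\mu_{2} = 2.5/6 \approx 0.417$, so the step-3 success earns about $0.417$: each extra guidance round lowers the reward for an eventually correct answer.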

Algorithm 1: Socratic Policy Optimization with Conditional Teacher Intervention

Input: Student policy $\pi_{\theta}$; teacher policy $\pi_{\phi}$; prompt dataset $\mathcal{D}$; verifier information $\mathcal{G}$; teacher-side information $\mathcal{I}$; maximum interaction steps $K$; batch size $B$.

Output: Optimized student policy $\pi_{\theta}$.

- repeat
  - Sample a batch $\{(x_{i}, \mathcal{G}_{i}, \mathcal{I}_{i})\}_{i=1}^{B} \sim \mathcal{D}$;
  - Initialize $c_{i}^{(1)} \leftarrow x_{i}$, active set $\mathcal{A}_{1} \leftarrow \{1, \dots, B\}$, and trajectory buffer $\mathcal{T} \leftarrow \emptyset$;
  - for $k = 1$ to $K$ do
    - foreach $i \in \mathcal{A}_{k}$ do
      - Generate $y_{i}^{(k)} \sim \pi_{\theta}(\cdot \mid c_{i}^{(k)})$; compute $\delta_{i}^{(k)} \leftarrow \mathbb{I}[\text{correct}(y_{i}^{(k)}, \mathcal{G}_{i})]$; store $(c_{i}^{(k)}, y_{i}^{(k)}, \delta_{i}^{(k)})$ in $\mathcal{T}$;
      - if $\delta_{i}^{(k)} = 0$ and $k < K$ then
        - Generate $f_{i}^{(k)} \sim \pi_{\phi}(\cdot \mid c_{i}^{(k)}, y_{i}^{(k)}, \mathcal{I}_{i})$; set $c_{i}^{(k+1)} \leftarrow c_{i}^{(k)} \,\|\, y_{i}^{(k)} \,\|\, f_{i}^{(k)}$;
    - Compute rewards $\{r_{i}^{(k)} : i \in \mathcal{A}_{k}\}$ using the step-wise rule;
    - Set $\mathcal{A}_{k+1} \leftarrow \{i \in \mathcal{A}_{k} : \delta_{i}^{(k)} = 0\}$;
    - if $\mathcal{A}_{k+1} = \emptyset$ then break;
  - Update $\pi_{\theta}$ with Reinforce++ using student responses in $\mathcal{T}$ and their rewards;
- until convergence

###### Theorem 1 (Assistance-Aware Advantage under Reward Decay).

Define the normalized advantage

$$\hat{a}_{i}^{(k)} = \frac{r_{i}^{(k)} - \mu_{t}}{\sigma_{t} + \epsilon}, \tag{20}$$

where $\mu_{t}, \sigma_{t}$ are the mean and standard deviation after step $t$ with $1 \leq t \leq k \leq K$. Then the following properties hold:

1. (Non-negativity) If sample $i$ is correct at step $k$, i.e., $\delta_{i}^{(k)} = 1$, then $\hat{a}_{i}^{(k)} \geq 0$.
2. (Monotonicity) If $t \geq 2$ and samples $i, j$ are correct at steps $k, l$ ($k < l$), i.e., $\delta_{i}^{(k)} = \delta_{j}^{(l)} = 1$, then $\hat{a}_{i}^{(k)} > \hat{a}_{j}^{(l)}$.
3. (Saturation decay) Let $p_{k} = \frac{1}{|\mathcal{A}_{k}|}\sum_{i \in \mathcal{A}_{k}} \delta_{i}^{(k)}$ be the correction rate among active samples. Then if $p_{t} \rightarrow 1$, we have $\hat{a}_{i}^{(t)} \rightarrow 0$.

The first property shows that SocraticPO does not punish corrected reasoning. Once a sample is answered correctly, the corresponding student tokens receive a non-negative normalized advantage, even if the answer is obtained after teacher guidance. Thus, reward decay does not discard assisted successes; it keeps them as valid positive learning signals.

The second property formalizes the batch-level credit assignment discussed above. Among samples that eventually become correct within the same batch, SocraticPO gives a larger normalized advantage to those corrected at earlier steps. This prevents an independently solved problem and a problem solved only after multiple rounds of guidance from receiving indistinguishable credit. In this sense, reward decay calibrates positive reward across samples by favoring trajectories that become correct with fewer guidance rounds.

The third property explains how reward decay mitigates assisted reward hacking. If teacher guidance makes almost all active samples correct, then $p_{k} \rightarrow 1$ and the advantage approaches zero. Once a guidance round becomes non-discriminative because nearly everyone can solve the task with that amount of help, trajectories from that round no longer provide strong positive reinforcement. This means that reward decay can prevent the student policy from gaining excessive advantage merely by relying on teacher intervention.
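The three properties can be checked numerically on a toy batch. The sketch below assumes that $\mu_{t}$ and $\sigma_{t}$ are the mean and population standard deviation pooled over every student attempt made up to the last step of the batch, with $\epsilon = 10^{-8}$. That pooling choice is an assumption of this example, since the exact normalization is determined by the policy-gradient backend. The reward values are a hypothetical batch of four prompts: two solved at step 1, one of the remaining two solved at step 2 (reward $\mu_{1} = 0.5$), and the last one solved at step 3 (reward $\mu_{2} = 2.5/6$).

```python
import statistics


def normalized_advantages(rewards, eps=1e-8):
    """Eq. (20) with statistics pooled over all attempts (an assumption of this sketch)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One reward per student attempt, in order: four step-1 attempts,
# two step-2 attempts, one step-3 attempt.
rewards = [1.0, 1.0, 0.0, 0.0, 0.5, 0.0, 2.5 / 6]
adv = normalized_advantages(rewards)

step1_success, step2_success, step3_success = adv[0], adv[4], adv[6]
print(f"step-1 success: {step1_success:.4f}")
print(f"step-2 success: {step2_success:.4f}")
print(f"step-3 success: {step3_success:.4f}")
print(f"failed attempt: {adv[2]:.4f}")
```

Correct attempts have non-negative advantage, earlier successes rank above later ones, and the step-3 success has an advantage of essentially zero because the only active sample at that step was corrected ($p_{3} = 1$), matching the saturation behavior described above.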

## 5 Experiments

We evaluate SocraticPO on scientific reasoning tasks\. The experiments are designed to answer two main questions: \(1\) whether SocraticPO improves over standard RL and student–teacher baselines; and \(2\) whether its two key components, teacher guidance and reward decay, are both necessary for effective learning\. We further include diagnostic analyses on teacher\-side information and teacher–student similarity to better understand when the interaction mechanism helps\.

### 5.1 Experimental Setup

#### Models.

We use Qwen3-4B-Instruct as the student policy $\pi_{\theta}$. For teacher guidance, we consider two settings: Qwen3-4B-Instruct as a self-guidance teacher and Qwen3.5-27B as a stronger teacher. This allows us to separate the effect of the interaction mechanism from the effect of teacher capacity.

#### Datasets and metric.

Following the SDPO evaluation setup, we use four undergraduate-level scientific reasoning benchmarks from the reasoning subsets of SciKnowEval [feng2024sciknoweval]: Chemistry, Physics, Biology, and Material. We report accuracy on each domain and the unweighted average across domains.

Table 1: Accuracy on SciKnowEval reasoning benchmarks. SocraticPO variants specify the teacher model in parentheses. Bold indicates the best result in each column, and underline indicates the second-best result.

![Refer to caption](https://arxiv.org/html/2606.09887v1/x3.png)

Figure 3: Training dynamics of SocraticPO compared with baseline policy optimization. The curve illustrates how interaction-conditioned rollouts affect performance over training steps.

#### Baselines.

We compare against Reinforce++ [hu2025reinforce], GRPO [shao2024deepseekmath], and SDPO [hubotter2026reinforcement]. Reinforce++ and GRPO represent RL methods driven by scalar rewards, while SDPO represents a self-distillation-style baseline.

### 5.2 Main Results

Table [1](https://arxiv.org/html/2606.09887#S5.T1) reports the main results. With Qwen3.5-27B as the teacher, SocraticPO achieves the best average accuracy, improving from 0.7388 with Reinforce++ to 0.7624. It also obtains the best score on Physics, Biology, and Material, suggesting that Socratic guidance is especially useful when errors can be localized and corrected through intermediate reasoning feedback. Chemistry is the main exception, where SDPO achieves the highest single-domain score. Under self-guidance, SocraticPO remains competitive and reaches an average accuracy of 0.7498, which is still higher than Reinforce++ and GRPO. This indicates that the interaction mechanism itself contributes to learning, while a stronger teacher further amplifies the benefit.

In Figure[3](https://arxiv.org/html/2606.09887#S5.F3), we observe a two\-stage pattern in the training dynamics\. In the early stage, the student rapidly improves its ability to answer correctly after receiving teacher guidance, as reflected by the guided\-performance curve\. This is consistent with the role of interaction\-conditioned rollouts: the model sees its own failed attempts followed by targeted corrective feedback, and first learns how to use guidance to repair its reasoning\. In the later stage, reward decay gradually reduces the advantage of answers that are correct only after guidance, while unassisted correct answers receive relatively stronger credit\. As a result, the learning pressure shifts from “being correct with help” to “being correct earlier and without help\.” Combined with the correction ability learned in the early stage, this shift eventually improves the student’s unassisted performance as well\. This pattern supports our design intuition: teacher guidance creates learnable repairs, and reward decay encourages the student to internalize those repairs into independent reasoning\.

### 5\.3 Ablation Study

Table 2: Ablation results for teacher guidance and reward decay\.

![Refer to caption](https://arxiv.org/html/2606.09887v1/x4.png)

Figure 4: Ablation comparison for teacher guidance and reward decay\.

We evaluate two variants of SocraticPO\. SocraticPO \(w/o tg\) removes adaptive teacher guidance and replaces it with a fixed retry prompt\. SocraticPO \(w/o rd\) keeps teacher guidance but removes reward decay, assigning full reward to any correct answer regardless of the interaction step\.

Table [2](https://arxiv.org/html/2606.09887#S5.T2) separates the component analysis from the main comparison\. The variant that removes teacher guidance and uses a fixed retry prompt underperforms full SocraticPO, especially on Biology and Material\. This shows that the improvement is not explained by repeated attempts alone: the teacher must provide feedback conditioned on the student’s actual mistake so that the next rollout context contains useful correction\.

Reward decay is even more critical\. As visualized in Figure [4](https://arxiv.org/html/2606.09887#S5.F4), removing reward decay can make the guided ability improve faster in the early stage, as shown by the orange curve\. This is expected: once corrected answers after guidance receive the same reward as independent correct answers, the model is strongly incentivized to exploit teacher\-assisted contexts\. However, this early improvement does not translate effectively into unassisted ability in the later stage\. The resulting performance drop is substantial, especially with the stronger teacher, where the average decreases from 0\.7624 to 0\.7057\. This supports the assisted reward\-hacking concern: teacher guidance can quickly improve assisted correctness, but reward decay is needed to turn those guided repairs into independent reasoning ability\.

### 5\.4 Additional Analysis

#### Teacher access to ground truth\.

We study whether the teacher should receive ground\-truth answers\. Although such access seems useful for generating precise feedback, Figure [5](https://arxiv.org/html/2606.09887#S5.F5) shows that it can degrade student performance in most cases\. We attribute this to two factors\. First, even when the prompt explicitly asks the teacher not to reveal the answer, a teacher with ground\-truth access may still leak solution\-specific hints\. This failure mode is inherently limited by the teacher model’s instruction\-following ability\. Second, ground\-truth access can shift the teacher from diagnosing the student’s reasoning process to reverse\-engineering hints from the known answer\. Both effects may encourage shortcut learning rather than robust correction\.

At the same time, ground\-truth information is not intrinsically harmful\. If the teacher can avoid answer leakage and use the reference solution to diagnose the student’s error, then access to the answer or even a standard reasoning process can substantially improve the quality of guidance, as suggested by the Qwen3\-8B\-Instruct self\-guidance result on Chemistry\. The effect of ground\-truth access is therefore conditional: it can strengthen Socratic guidance when the teacher follows the teaching protocol well, but it can become negative when the teacher turns the answer into overly direct hints\.

![Refer to caption](https://arxiv.org/html/2606.09887v1/x5.png)

Figure 5: Effect of teacher access to ground\-truth information\.

![Refer to caption](https://arxiv.org/html/2606.09887v1/x6.png)

Figure 6: Teacher–student KL divergence under SocraticPO on Material\.

#### Teacher–student similarity\.

To examine whether SocraticPO works by implicitly making the student imitate the teacher, we measure the KL divergence between the student and teacher during training\. We use Qwen3\-8B as the teacher and Qwen3\-4B\-Instruct as the student on the Material benchmark\. This pair is chosen because KL computation requires aligned tokenization, and the two models share a compatible tokenizer\.

If SocraticPO were primarily a distillation mechanism, we would expect teacher–student divergence to decrease as training proceeds\. Figure [6](https://arxiv.org/html/2606.09887#S5.F6) shows that SocraticPO does not consistently reduce the KL divergence between teacher and student, even though it improves downstream task performance\. This supports the distinction made in our method: the teacher participates in rollout as an interactive guide, but the student is not optimized to reproduce the teacher’s token distribution or reasoning trace\. In other words, SocraticPO improves the student through guided correction rather than by making the student become more teacher\-like\.
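
To make the measurement concrete, here is a minimal sketch of a mean token\-level KL divergence computed from per\-position logits over a shared vocabulary. The direction KL(student || teacher), the averaging over positions, and the function names are assumptions made for illustration; the text does not specify how the divergence is aggregated.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def token_kl(student_logits: Sequence[float], teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) at a single token position (assumed direction)."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_sequence_kl(student_seq, teacher_seq) -> float:
    """Average the per-position KL over a sequence (assumed aggregation).

    Both inputs hold one logit vector per position and must come from
    models that share a tokenizer, so that positions and vocab entries align.
    """
    kls = [token_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(kls) / len(kls)


# Toy two-token vocabulary: student is uniform, teacher prefers token 1.
student = [[0.0, 0.0], [0.0, 0.0]]
teacher = [[0.0, math.log(3.0)], [0.0, 0.0]]
kl_value = mean_sequence_kl(student, teacher)
print(kl_value)
```

A flat or rising value of such a statistic during training, alongside improving accuracy, is the pattern the paragraph above interprets as evidence against implicit distillation.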

Together, these analyses suggest that effective teaching protocols require more than a stronger teacher\. The information exposed to the teacher, the form of feedback, and the training objective jointly determine whether interaction improves policy learning\.

## 6 Conclusion

We presented SocraticPO, a policy\-optimization framework that addresses a limitation of scalar reward feedback in LLM reinforcement learning\. Outcome rewards indicate whether a response should be reinforced, but they do not explain how a mistaken reasoning process should be repaired\. SocraticPO augments rollout sampling with natural\-language teacher guidance: the student first attempts a solution, receives targeted correction only after an incorrect response, and continues under the expanded dialogue context\.

Teacher guidance must be controlled carefully\. If assisted correct answers are rewarded as strongly as independent correct answers, the student may exploit teacher intervention instead of improving its own reasoning\. SocraticPO therefore pairs guidance with reward decay, assigning smaller rewards to answers that become correct only after more interaction rounds\. Experiments on SciKnowEval scientific reasoning benchmarks show that SocraticPO improves average performance over strong RL and self\-distillation baselines, while ablations highlight the importance of both targeted guidance and reward decay\.

Our findings suggest that natural\-language correction can serve as process\-level supervision inside RL without requiring teacher distribution matching\. Because the teacher is used only as a text\-producing guide, SocraticPO can leverage black\-box mentors and remains modular with respect to the underlying RL optimizer\. Future work can further study teaching protocols, teacher\-side information, and broader tasks where reasoning errors are diagnosable before the final answer\.

## References

## Appendix A Proof of Theorem 1

###### Proof\.

We prove the three properties in Theorem [1](https://arxiv.org/html/2606.09887#Thmtheorem1)\. Let $m_k = |\mathcal{A}_k|$ and

$$p_k = \frac{1}{m_k}\sum_{i\in\mathcal{A}_k}\delta_i^{(k)}. \tag{21}$$

Let $N_{k-1} = \sum_{j=1}^{k-1}|\mathcal{A}_j|$\. The batch mean after including the step\-$k$ active samples is

$$\mu_k = \frac{N_{k-1}\mu_{k-1} + \sum_{i\in\mathcal{A}_k} r_i^{(k)}}{N_{k-1} + m_k}. \tag{22}$$

By the reward definition,

$$\sum_{i\in\mathcal{A}_k} r_i^{(k)} = \mu_{k-1}\sum_{i\in\mathcal{A}_k}\delta_i^{(k)} = \mu_{k-1} m_k p_k. \tag{23}$$

Thus,

$$\mu_k = \frac{N_{k-1}\mu_{k-1} + m_k p_k \mu_{k-1}}{N_{k-1} + m_k} = \mu_{k-1}\,\frac{N_{k-1} + m_k p_k}{N_{k-1} + m_k}. \tag{24}$$

For any correct sample $i$ at step $k$, we have $\delta_i^{(k)} = 1$ and therefore $r_i^{(k)} = \mu_{k-1}$\. Its centered advantage is

$$a_i^{(k)} = r_i^{(k)} - \mu_t = \mu_{k-1} - \mu_t = \mu_{k-1}\left(1 - \frac{N_{k-1} + \sum_{h=k}^{t} m_h p_h}{N_{k-1} + \sum_{h=k}^{t} m_h}\right) \geq 0. \tag{25}$$

The normalized advantage has the same sign because $\sigma_t + \epsilon > 0$, so $\hat{a}_i^{(k)} \geq 0$\. This proves the first property\.

For the second property, consider two correct samples at steps $k$ and $l$ with $k < l$\. Their centered advantages differ by

$$a_i^{(k)} - a_j^{(l)} = r_i^{(k)} - \mu_t - \left(r_j^{(l)} - \mu_t\right) = \mu_{k-1} - \mu_{l-1} = \mu_{k-1}\left(1 - \frac{N_{k-1} + \sum_{h=k}^{l-1} m_h p_h}{N_{k-1} + \sum_{h=k}^{l-1} m_h}\right). \tag{26}$$

Since $l > k$, at least one sample is still active after step $k$, so by the definition of the rollout $p_k < 1$\. Hence $a_i^{(k)} - a_j^{(l)} > 0$, which leads to $\hat{a}_i^{(k)} \geq \hat{a}_j^{(l)}$\.

For the third property, the expression above shows that the centered advantage for a correct sample at step $t$ is proportional to $1 - p_t$:

$$a_i^{(t)} = \frac{\mu_{t-1}\, m_t}{N_{t-1} + m_t}\,(1 - p_t). \tag{27}$$

Therefore, when $p_t \rightarrow 1$, i\.e\., almost all active samples are corrected at that step, we have $a_i^{(t)} \rightarrow 0$\. Since $\sigma_t + \epsilon \geq \epsilon > 0$, the normalized advantage also satisfies $\hat{a}_i^{(t)} \rightarrow 0$\. ∎
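
As a sanity check on the recursion in Eq. (24) and the sign and ordering properties, the following small sketch computes the batch means two ways, directly from the reward list and via the recursion, using exact rational arithmetic. The toy counts, the initial value $\mu_0 = 1$, and all function names are assumptions for illustration; only the reward rule $r_i^{(k)} = \mu_{k-1}\delta_i^{(k)}$ is taken from the proof.

```python
from fractions import Fraction
from typing import List


def direct_means(active: List[int], correct: List[int], mu0=Fraction(1)) -> List[Fraction]:
    """mu_0..mu_t computed by averaging every reward issued so far."""
    means, rewards = [mu0], []
    for m_k, c_k in zip(active, correct):
        mu_prev = means[-1]
        # Correct samples at step k receive mu_{k-1}; incorrect ones receive 0.
        rewards += [mu_prev] * c_k + [Fraction(0)] * (m_k - c_k)
        means.append(sum(rewards) / len(rewards))
    return means


def recursion_means(active: List[int], correct: List[int], mu0=Fraction(1)) -> List[Fraction]:
    """mu_0..mu_t computed with the recursion of Eq. (24)."""
    means, n_prev = [mu0], 0  # n_prev plays the role of N_{k-1}
    for m_k, c_k in zip(active, correct):
        p_k = Fraction(c_k, m_k)
        means.append(means[-1] * (n_prev + m_k * p_k) / (n_prev + m_k))
        n_prev += m_k
    return means


# Toy batch: 8 prompts, half of the active samples are corrected at each step.
active = [8, 4, 2]   # m_1, m_2, m_3
correct = [4, 2, 1]  # number of correct answers at each step
mu = direct_means(active, correct)
assert mu == recursion_means(active, correct)

t = len(active)
# Centered advantage of a correct sample solved at step k: mu_{k-1} - mu_t.
adv = [mu[k - 1] - mu[t] for k in range(1, t + 1)]
print([str(a) for a in adv])  # ['103/168', '19/168', '5/168']

# Eq. (27) for the final step: mu_{t-1} * m_t / (N_{t-1} + m_t) * (1 - p_t).
p_t = Fraction(correct[-1], active[-1])
eq27 = mu[t - 1] * active[-1] / (sum(active[:-1]) + active[-1]) * (1 - p_t)
assert eq27 == adv[-1]
```

In this toy batch the advantages are non\-negative and strictly decrease with the step at which the answer becomes correct, matching the first two properties.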

## Appendix B Detailed Experimental Settings

### B\.1 Example Interaction Trace

The following trace shows one SocraticPO interaction from the Material domain\. Ellipses indicate text omitted for brevity while preserving the original dialogue structure\.

> Task instruction\. Given a question and four options, select the right answer\. Respond with <reasoning\> \.\.\. </reasoning\> and <answer\> \.\.\. </answer\>\. For the answer, only output the letter corresponding to the correct option\.
>
> Question\. Based on the given data, determine which material is most favorable for the stability of perovskite materials\. The stability can be predicted by element properties such as density and first ionization potential\.
>
> Given data\. Y4Sr4Fe2Co6O24: Density 6\.07425, FIP 6\.95; Y4Sr4Fe2Ni6O24: Density 6\.07425, FIP 6\.865625; Y4Sr4Mn1Co7O24: Density 6\.11125, FIP 6\.9221875; Y4Sr4Mn1Ni7O24: Density 7\.61475, FIP 8\.3265625\.
>
> Options\. A: Y4Sr4Mn1Co7O24; B: Y4Sr4Mn1Ni7O24; C: Y4Sr4Fe2Co6O24; D: Y4Sr4Fe2Ni6O24\.
>
> Student attempt 1\. <reasoning\>To determine which perovskite material is most favorable for stability, we are given two key properties: density and first ionization potential\. … Therefore, the best choice is C\.</reasoning\><answer\>C</answer\>
>
> Teacher feedback 1\. Your reasoning contains a critical contradiction in Step 2\. You stated that “Material C also has a slightly higher density \(6\.111\) than Material D,” but the data table shows Material C has density 6\.07425, while Material A is the one with 6\.11125\. … Next Step: \(1\) Re\-check the density values for each material\. \(2\) Re\-evaluate the trend: …
>
> Student attempt 2\. <reasoning\>I apologize for the earlier error\. Let’s carefully re\-examine the data and correct the reasoning step by step\. … While high density alone is not a guarantee, in combination with high FIP, it suggests a stable, tightly bound structure\. Final decision: Material B has the best combination of high ionization potential and high density\. Therefore, the most favorable for stability is B\.</reasoning\><answer\>B</answer\>
>
> Verifier outcome\. Attempt 1 is incorrect; attempt 2 is correct\. The trajectory is solved at interaction step 2\.
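
The control flow behind a trace like this one can be sketched as below. The callables `student`, `teacher`, and `verifier`, the `max_steps` cap, and the `(role, text)` context format are hypothetical stand\-ins chosen for illustration, not the authors' implementation.

```python
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (role, text); a hypothetical context format


def socratic_rollout(
    question: str,
    student: Callable[[List[Turn]], str],
    teacher: Callable[[List[Turn]], str],
    verifier: Callable[[str], bool],
    max_steps: int = 3,  # assumed cap on interaction steps
) -> Tuple[List[Turn], Optional[int]]:
    """Return the dialogue context and the step at which it was solved."""
    context: List[Turn] = [("question", question)]
    for step in range(1, max_steps + 1):
        answer = student(context)
        context.append(("student", answer))
        if verifier(answer):
            return context, step  # solved at this interaction step
        if step < max_steps:
            # The teacher sees the failed attempt and adds corrective guidance;
            # the student then continues under the expanded context.
            context.append(("teacher", teacher(context)))
    return context, None  # not solved within the step budget


# Scripted stand-ins mirroring the trace: wrong first, correct after feedback.
scripted_answers = iter(["C", "B"])
context, solved_step = socratic_rollout(
    question="Which material is most favorable for stability?",
    student=lambda ctx: next(scripted_answers),
    teacher=lambda ctx: "Re-check the density values for each material.",
    verifier=lambda ans: ans == "B",
)
print(solved_step)                    # 2
print([role for role, _ in context])  # ['question', 'student', 'teacher', 'student']
```

The returned step index is what a step\-dependent reward would consume; a first\-attempt success returns step 1 and never calls the teacher.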

### B\.2 Teacher Prompt Templates

> You are an expert teacher\. Student’s answer is incorrect\. Please review the student’s latest answer and provide concise, actionable guidance\. Prefer hints, corrections, and next\-step suggestions over directly revealing the answer\.
>
> TEACHING PRINCIPLE:
>
> - Point out the main issue in the student’s answer\.
> - Give the next step or correction they should try\.
> - Do not solve the full problem for them unless explicitly allowed\.
> - Keep the guidance short and concrete\.
>
> Question: \{question\}
>
> Student’s latest answer: \{student\_answer\}
>
> Reference solution: \{reference\_solution\}
>
> Ground\-truth block: \{ground\_truth\_block\}
>
> Output Format: Write a short piece of guidance for the student\.
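
One way to fill this template is sketched below. The placeholder names come from the template itself; the helper `build_teacher_prompt`, its defaults, and the `(not provided)` filler used when the teacher has no ground\-truth access are assumptions, since the text does not say how the reference and ground\-truth fields are populated in that setting.

```python
from typing import Optional

TEACHER_TEMPLATE = """You are an expert teacher. Student's answer is incorrect. \
Please review the student's latest answer and provide concise, actionable guidance. \
Prefer hints, corrections, and next-step suggestions over directly revealing the answer.

TEACHING PRINCIPLE:
- Point out the main issue in the student's answer.
- Give the next step or correction they should try.
- Do not solve the full problem for them unless explicitly allowed.
- Keep the guidance short and concrete.

Question: {question}
Student's latest answer: {student_answer}
Reference solution: {reference_solution}
Ground-truth block: {ground_truth_block}

Output Format: Write a short piece of guidance for the student."""

NOT_PROVIDED = "(not provided)"  # assumed filler for withheld fields


def build_teacher_prompt(
    question: str,
    student_answer: str,
    reference_solution: Optional[str] = None,
    ground_truth: Optional[str] = None,
) -> str:
    """Fill the teacher template; withheld fields get a neutral filler."""
    return TEACHER_TEMPLATE.format(
        question=question,
        student_answer=student_answer,
        reference_solution=reference_solution or NOT_PROVIDED,
        ground_truth_block=ground_truth or NOT_PROVIDED,
    )


# Teacher without ground-truth access (both optional fields withheld).
prompt = build_teacher_prompt(
    question="Which material is most favorable for stability?",
    student_answer="<answer>C</answer>",
)
print(prompt)
```

Keeping ground\-truth access behind an explicit argument makes it straightforward to compare the two teacher settings discussed in Section 5\.4 with the same template.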

### B\.3 Dataset Statistics

We use the reasoning subsets of SciKnowEval in four scientific domains\. Table [3](https://arxiv.org/html/2606.09887#A2.T3) reports the number of training and evaluation examples in each domain\.

Table 3: SciKnowEval dataset statistics\.

Table [4](https://arxiv.org/html/2606.09887#A2.T4) shows representative examples from the four benchmark domains\. Each task is formatted as a multiple\-choice question, and the model is evaluated by the final option letter\.

Table 4: Representative examples from the SciKnowEval reasoning subsets\.

### B\.4 Implementation Details

Table [5](https://arxiv.org/html/2606.09887#A2.T5) summarizes the main training and inference settings, grouped by category\. All experiments are conducted on a machine with six H100 GPUs\. We allocate 4 GPUs to student\-policy training and 2 GPUs to teacher inference during SocraticPO rollouts, so that teacher feedback is generated online while blocking the student update pipeline as little as possible\.

Table 5: Hyperparameters for Reinforce\+\+, GRPO, SDPO, and SocraticPO\. Entries marked “–” indicate parameters not applicable to that method\.
