@Ankur_Samanta_: New work on credit assignment in multi-step reasoning RL post-training Introducing Self-Reset Policy Optimization (SRPO…

X AI KOLs Timeline 06/22/26, 07:26 PM Papers

Summary

Self-Reset Policy Optimization (SRPO) addresses credit assignment in multi-step reasoning RL post-training by localizing the first wrong reasoning step and learning from counterfactual continuations without external supervision.

🚀New work on credit assignment in multi-step reasoning RL post-training🚀 Introducing Self-Reset Policy Optimization (SRPO): i) localize the first wrong reasoning step, ii) reset to that step, iii) learn from counterfactual continuations from there – no external supervision.🧵 https://t.co/A1KHt2CRCF


In RL, the ability to reset to an arbitrary state is powerful (see, e.g., Go-Explore), but often unrealistic.

For LLMs, though, a state is just a token prefix, so resets are natural! In work led by @Ankur_Samanta_, we propose a GRPO variant where the model “self-resets then resamples.”

RLVR today applies one outcome reward across every step of a multi-step trajectory. The step that actually caused the failure gets the same credit as the good ones. Credit assignment is broken, and that wastes signal.
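To make the problem concrete, here is a minimal sketch of outcome-only credit, assuming a GRPO-style group-normalized advantage. The function names and the toy rewards are illustrative assumptions, not taken from the paper or its code.

```python
from statistics import mean, pstdev

def outcome_advantages(rewards, eps=1e-6):
    """Group-normalized outcome advantage: one scalar per sampled trajectory."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_steps(advantage, num_steps):
    """Outcome-only credit: every step in the trajectory gets the same scalar."""
    return [advantage] * num_steps

# Hypothetical group of 4 sampled solutions; 1.0 means the verifier accepted it.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = outcome_advantages(rewards)

# Trajectory 1 failed. Suppose only one of its 5 steps was actually wrong:
# all 5 steps are still pushed down by the same amount.
per_step = broadcast_to_steps(advs[1], num_steps=5)
```

In this toy setup the correct steps of the failed trajectory are penalized exactly as much as the faulty one, which is the wasted signal the thread refers to.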

How does this improve credit assignment? Resampling from wrong reasoning steps gives a dense, on-policy signal for self-correction. These are the improvable decision points: each admits a strictly better action, so sampling more continuations there yields a better gradient.
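The reset-then-resample idea can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the SRPO implementation: `reset_and_resample`, `group_advantages`, the toy sampler, and the toy verifier are all hypothetical names, and the index of the first wrong step is simply given rather than self-localized.

```python
import random

def reset_and_resample(steps, first_wrong, sample_suffix, verify, k=4):
    """Keep the steps before `first_wrong`, then draw k fresh continuations."""
    prefix = steps[:first_wrong]
    group = []
    for _ in range(k):
        suffix = sample_suffix(prefix)
        group.append((suffix, verify(prefix + suffix)))
    return prefix, group

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within the group of continuations sharing one prefix."""
    mu = sum(rewards) / len(rewards)
    sigma = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy stand-ins (assumptions): a two-outcome "policy" and an exact-match verifier.
rng = random.Random(0)

def toy_sample_suffix(prefix):
    x = 4 if rng.random() < 0.5 else 5
    return [f"x = {x}", f"answer: {x}"]

def toy_verify(steps):
    return 1.0 if steps[-1] == "answer: 4" else 0.0

# The third step (index 2) is the first wrong one, so we reset to just before it.
steps = ["2x + 1 = 9", "2x = 8", "x = 5", "answer: 5"]
prefix, group = reset_and_resample(steps, first_wrong=2,
                                   sample_suffix=toy_sample_suffix,
                                   verify=toy_verify, k=4)
suffix_advantages = group_advantages([reward for _, reward in group])
```

Because all continuations share the same correct prefix, any difference in reward within the group is attributable to the resampled suffix, so the learning signal lands on the decision point that caused the failure.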

To quantify the gain from resetting to the first erroneous reasoning step rather than to a random one, we analyze both through the lens of Conservative Policy Iteration (CPI; Kakade & Langford, 2002). CPI-RR resets to a state chosen at random within a trajectory.
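For readers unfamiliar with CPI, the standard update from Kakade & Langford mixes the current policy with an approximately greedy policy, using a small step size. The symbols below follow the usual textbook presentation; how the paper adapts this update for CPI-RR is not stated in this thread.

```latex
\pi_{\text{new}}(a \mid s) = (1 - \alpha)\,\pi(a \mid s) + \alpha\,\pi'(a \mid s),
\qquad \alpha \in [0, 1]
```

Here $\pi'$ is fit to be approximately greedy with respect to the advantages of $\pi$ on states drawn via a restart distribution, which is where the choice of reset state enters the analysis.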

CPI-CARO uses a credit-assignment oracle, a test for improvable states (advantage > τ), to reset only where there is room to improve. We establish that CPI-CARO reduces sample complexity and increases per-iteration improvement relative to CPI-RR.
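The difference between the two reset rules can be sketched in a few lines. This is only an illustration of the selection step under assumed inputs: the per-state advantage values are made up, and `random_reset` and `oracle_reset_states` are hypothetical helper names, not functions from the paper.

```python
import random

def random_reset(num_states, rng):
    """CPI-RR style: pick a reset state uniformly along the trajectory."""
    return rng.randrange(num_states)

def oracle_reset_states(advantages, tau):
    """CPI-CARO style: keep only states whose advantage exceeds tau."""
    return [i for i, a in enumerate(advantages) if a > tau]

# Hypothetical best-action advantage at each of 5 states in one trajectory.
# Only state 2 has real room to improve.
max_adv = [0.0, 0.02, 0.6, 0.01, 0.0]
improvable = oracle_reset_states(max_adv, tau=0.1)

# Count how often uniform random resets happen to land on an improvable state.
rng = random.Random(0)
hits = sum(random_reset(len(max_adv), rng) in improvable for _ in range(1000))
```

In this toy trajectory the oracle spends every reset on the one improvable state, while uniform random resets land there only about one time in five, so most of their samples carry little usable signal.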

The two sources of improvement: i) better signal-to-noise, and ii) targeted update only at states that can be significantly improved.

We hope this analysis provides a framework for better understanding credit assignment in policy optimization.

SRPO instantiates CPI-CARO in practice: self-localization stands in for the oracle. Localization quality is the lever: clean prefixes yield nearly 2× as many correct-suffix groups, so effective step-level self-localization is a key direction for future work.

Across a 10-benchmark suite (math, science, strategic, commonsense), SRPO beats GRPO and other self-correction/tree-based baselines. Gains extend to coding: SRPO converges to higher pass rates and learns 2–3× faster than no resets (GRPO) and random resets (RRPO).

Overall, we study how resets can be used as a credit-assignment primitive for RL post-training. Self-localized resets beat random and no resets in performance and sample efficiency, with self-localization serving as an imperfect-but-effective proxy for the credit-assignment oracle.

As agentic traces grow longer, with many intermediate decision steps, we expect targeted credit assignment to be essential for designing more efficient post-training methods.

Paper: https://arxiv.org/abs/2605.25507
Code: https://github.com/Ankur-Samanta/SRPO…

Big thanks to the team: Akshayaa Magesh, @Ayushj240, Youliang Yu, @danielrjiang, Kavosh Asadi, @KavehHassani, @BMEChairCU, Jalaj Bhandari, @EfroniYonathan

Similar Articles

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

Structured Role-Aware Policy Optimization for Multimodal Reasoning

RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents

StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

GraphPO: Graph-based Policy Optimization for Reasoning Models
