@Ankur_Samanta_: New work on credit assignment in multi-step reasoning RL post-training Introducing Self-Reset Policy Optimization (SRPO…
Summary
Self-Reset Policy Optimization (SRPO) addresses credit assignment in multi-step reasoning RL post-training by localizing the first wrong reasoning step and learning from counterfactual continuations without external supervision.
Cached at: 06/23/26, 04:12 PM
🚀New work on credit assignment in multi-step reasoning RL post-training🚀 Introducing Self-Reset Policy Optimization (SRPO): i) localize the first wrong reasoning step, ii) reset to that step, iii) learn from counterfactual continuations from there – no external supervision.🧵 https://t.co/A1KHt2CRCF
In RL, the ability to reset to an arbitrary state is powerful (see, e.g., Go-Explore), but often unrealistic.
For LLMs, though, states are tokens, so resets are natural! In work led by @Ankur_Samanta_, we propose a GRPO variant where the model “self-resets then resamples.”
RLVR today applies one outcome reward across every step of multi-step trajectories. The step that actually caused the failure gets the same credit as the good ones. Credit assignment is broken, and that wastes signal.
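To make the uniform-credit problem concrete, here is a minimal sketch of a GRPO-style outcome advantage being copied onto every step of a trajectory. The function names, the toy rewards, and the step counts are illustrative assumptions, not code from the SRPO repository.

```python
from statistics import mean, pstdev

def outcome_advantages(rewards, eps=1e-6):
    """Group-normalized outcome advantage: one scalar per sampled trajectory."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_steps(advantages, step_counts):
    """Copy each trajectory's scalar advantage onto every one of its steps."""
    return [[a] * n for a, n in zip(advantages, step_counts)]

# Toy group of four trajectories (hypothetical values): 1.0 = correct final answer.
rewards = [1.0, 0.0, 0.0, 1.0]
step_counts = [3, 4, 2, 3]

adv = outcome_advantages(rewards)
per_step = broadcast_to_steps(adv, step_counts)

# Every step of the failed 4-step trajectory carries the same negative credit,
# including any steps that were actually correct.
print(per_step[1])
```

In this sketch nothing distinguishes the step that caused the failure from the steps around it, which is the gap the thread describes.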
How does this improve credit assignment? Resampling from wrong reasoning steps gives a dense, on-policy signal for self-correction. These are the improvable decision points, where a strictly better action exists, so sampling more continuations there yields a better gradient.
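The reset-then-resample idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: `localize`, `sample_suffix`, and `reward` are hypothetical callables, the mean-baseline advantage is a simplification, and the toy localizer below cheats by comparing against a known target, whereas SRPO has the model localize the error itself.

```python
import random

def self_reset_resample(prompt, steps, localize, sample_suffix, reward, k=4, rng=None):
    """Reset to the first step flagged as wrong and resample k continuations.

    Hypothetical callables (assumptions for this sketch):
      localize(prompt, steps)            -> index of first wrong step, or None
      sample_suffix(prompt, prefix, rng) -> list of new steps continuing the prefix
      reward(prompt, steps)              -> float outcome reward
    """
    rng = rng or random.Random(0)
    idx = localize(prompt, steps)
    if idx is None:
        return None  # nothing flagged as wrong, so no reset
    prefix = steps[:idx]  # keep only the steps before the first wrong one
    suffixes = [sample_suffix(prompt, prefix, rng) for _ in range(k)]
    rewards = [reward(prompt, prefix + s) for s in suffixes]
    baseline = sum(rewards) / k
    # Group-relative advantages, meant to be credited to the suffix only.
    advantages = [r - baseline for r in rewards]
    return prefix, suffixes, advantages

# Toy task: a "correct" trajectory is the step sequence TARGET.
TARGET = [1, 2, 3]

def localize(prompt, steps):
    for i, (s, t) in enumerate(zip(steps, TARGET)):
        if s != t:
            return i
    return None

def sample_suffix(prompt, prefix, rng):
    # Each remaining step is right or wrong with equal probability.
    return [rng.choice([t, t + 10]) for t in TARGET[len(prefix):]]

def reward(prompt, steps):
    return 1.0 if steps == TARGET else 0.0

prefix, suffixes, advantages = self_reset_resample(
    "toy", [1, 7, 3], localize, sample_suffix, reward, k=4
)
print(prefix, suffixes, advantages)
```

All continuations share the same prefix, so differences in reward within the group can only come from decisions made at or after the reset point.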
To quantify the gain of resetting to the first erroneous reasoning step over random resets, we analyze both through the lens of Conservative Policy Iteration (CPI; Kakade & Langford, 2002). CPI-RR resets to a randomly chosen state within a trajectory.
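For reference, the conservative update at the core of CPI mixes the current policy with an improved policy instead of replacing it outright. In the notation below (symbols chosen here, not taken from the thread), π' is a policy that is approximately greedy with respect to the advantages of π on states drawn from a reset distribution, and α is the mixing weight:

```latex
\pi_{\text{new}}(a \mid s) = (1 - \alpha)\,\pi(a \mid s) + \alpha\,\pi'(a \mid s), \qquad \alpha \in [0, 1]
```

This is the standard CPI update only; the precise definitions of the reset-based variants analyzed in the thread are in the paper.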
CPI-CARO uses a credit-assignment oracle, a test for improvable states (advantage > τ), to reset only where there’s room to improve. We establish that CPI-CARO reduces the sample complexity and increases per-iteration improvement over CPI-RR.
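The difference between the two reset rules can be illustrated with a toy selection routine. This is a sketch under assumptions: `advantage` is a hypothetical callable returning an estimated advantage for a state, the advantage table is made up, and neither function is taken from the paper or its code.

```python
import random

def random_reset(states, rng):
    """CPI-RR style: reset to any state of the trajectory, chosen uniformly."""
    return rng.choice(states)

def oracle_reset(states, advantage, tau, rng):
    """CPI-CARO style: reset only to states the oracle flags as improvable."""
    improvable = [s for s in states if advantage(s) > tau]
    return rng.choice(improvable) if improvable else None

# Hypothetical per-state advantages: only state 2 has real room to improve.
toy_advantage = {0: 0.0, 1: 0.02, 2: 0.6, 3: 0.0}
states = list(toy_advantage)
rng = random.Random(0)

rr_state = random_reset(states, rng)  # may land on a state with nothing to fix
caro_state = oracle_reset(states, toy_advantage.get, tau=0.1, rng=rng)
print(rr_state, caro_state)
```

With the threshold at 0.1, the oracle rule always picks state 2 in this toy table, while the random rule spends most of its resets on states where no better action exists.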
The two sources of improvement: i) better signal-to-noise, and ii) targeted update only at states that can be significantly improved.
We hope this analysis provides a framework for better understanding credit assignment in policy optimization.
SRPO instantiates CPI-CARO in practice: self-localization stands in for the oracle. Localization quality is the lever: clean prefixes yield nearly 2× as many correct-suffix groups, so effective step-level self-localization is a key direction for future work.
Across a 10-benchmark suite (math, science, strategic, commonsense), SRPO beats GRPO and other self-correction/tree-based baselines. Gains extend to coding: SRPO converges to higher pass rates and learns 2–3× faster than no resets (GRPO) and random resets (RRPO).
Overall, we study how resets can be used as a credit-assignment primitive for RL post-training. Self-localized resets beat random and no resets in performance and sample efficiency, with self-localization an imperfect-but-effective proxy for the credit-assignment oracle.
As agentic traces grow longer and include more intermediate decision steps, we expect targeted credit assignment to be essential for designing more efficient post-training methods.
Paper: https://arxiv.org/abs/2605.25507 Code: https://github.com/Ankur-Samanta/SRPO…
Big thanks to the team: Akshayaa Magesh, @Ayushj240, Youliang Yu, @danielrjiang, Kavosh Asadi, @KavehHassani, @BMEChairCU, Jalaj Bhandari, @EfroniYonathan