Tag
Self-Reset Policy Optimization (SRPO) addresses credit assignment in multi-step reasoning RL post-training by localizing the first wrong reasoning step and learning from counterfactual continuations without external supervision.