counterfactual-learning


@Ankur_Samanta_: New work on credit assignment in multi-step reasoning RL post-training. Introducing Self-Reset Policy Optimization (SRPO…

X AI KOLs Timeline · yesterday

Self-Reset Policy Optimization (SRPO) addresses credit assignment in multi-step reasoning RL post-training by localizing the first wrong reasoning step and learning from counterfactual continuations without external supervision.
