Tag
This paper introduces causal reinforcement learning (CRL), unifying causal inference and reinforcement learning under a structural causal model framework, and explores novel learning settings such as generalized policy learning and counterfactual learning.
Self-Reset Policy Optimization (SRPO) addresses credit assignment in multi-step reasoning RL post-training by localizing the first wrong reasoning step and learning from counterfactual continuations without external supervision.