OPID is a framework that extracts dense token-level supervision from completed on-policy trajectories for reinforcement learning of language agents, using hierarchical skills (episode-level and step-level) to improve sample efficiency and robustness.
OPID is an on-policy skill distillation framework that extracts dense hindsight supervision from completed trajectories. It combines outcome-based RL with token-level self-distillation to improve the training efficiency and performance of language agents on multi-turn tasks.