Tag
This paper develops a statistical theory for offline reinforcement learning from trajectory-level outcome supervision, proposing the OPAC algorithm and characterizing when such supervision enables efficient learning versus when fundamental barriers arise.