Tag
This paper frames LLM-generated reward shaping for sparse structured RL as a debugging problem, identifying failure modes like reward flooding and semantic misunderstanding. The authors propose diagnostic-driven iterative refinement, achieving dramatic success rate improvements (e.g., DoorKey-8×8 from 2.3% to 97.6%) compared to one-shot generation.
This paper presents a self-play reinforcement learning framework for the four-player imperfect-information card game Big 2, comparing policy-gradient and value-based methods and finding that PPO with entropy regularization outperforms the other methods tested.
A unified Python framework using PPO-based deep reinforcement learning for optimizing HVAC control with economizer logic and CO2-constrained ventilation is presented, showing improved energy efficiency and temperature stability over traditional PID controllers.
This paper investigates the temporal correlation problem in on-policy reinforcement learning with PPO, showing that randomly dropping a fixed fraction of transitions from rollouts reduces gradient redundancy and stabilizes training without degrading performance.
This paper introduces the value-gradient hypothesis to explain why critic-free RL methods like PPO and GRPO work well for LLMs, showing that the actor backward pass carries a value-gradient-like signal. It derives a predictive criterion for when RL is most effective along the pretraining trajectory.
A web-based tool that visualizes a neural network (trained with PPO) learning to play Snake in real time, with configurable parameters and 3D rendering.
OpenAI demonstrates that competitive self-play in simulated 3D robot environments enables AI agents to discover complex physical behaviors like tackling, ducking, and faking without explicit instruction, suggesting self-play will be fundamental to future powerful AI systems.
OpenAI introduces Proximal Policy Optimization (PPO), a reinforcement learning algorithm that matches or outperforms state-of-the-art methods while being simpler to implement and tune. PPO uses a novel clipped objective function to constrain policy updates and has since become OpenAI's default RL algorithm.
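As a rough illustration of the clipped objective mentioned in the PPO summary above, here is a minimal pure-Python sketch. It is not taken from any of the summarized works: the function name `ppo_clip_loss`, its argument names, and the default `clip_eps=0.2` are assumptions chosen for illustration. It computes the probability ratio between the new and old policy for each sampled action, clips that ratio to the interval [1 - clip_eps, 1 + clip_eps], and takes the smaller of the clipped and unclipped terms, so a single update gains nothing from moving the policy far from the one that collected the data.

```python
import math


def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over a batch.

    All names and the default clip_eps are illustrative assumptions.
    new_logp / old_logp: log-probabilities of the sampled actions under
    the current policy and the data-collecting policy.
    advantages: advantage estimates for the same samples.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped_ratio * adv)
    # Negate so that minimizing the loss maximizes the surrogate objective.
    return -total / len(advantages)
```

When the new and old policies agree, the ratio is 1 and the loss reduces to the negative mean advantage. When the ratio leaves the clipping interval in the direction that would increase the objective, the clipped term takes over and that sample's contribution stops growing.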