ppo

Tag

Cards List
#ppo

When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

arXiv cs.LG · 6d ago

This paper frames LLM-generated reward shaping for sparse structured RL as a debugging problem, identifying failure modes like reward flooding and semantic misunderstanding. The authors propose diagnostic-driven iterative refinement, achieving dramatic success rate improvements (e.g., DoorKey-8×8 from 2.3% to 97.6%) compared to one-shot generation.

Self-Play Reinforcement Learning under Imperfect Information in Big 2

arXiv cs.LG · 6d ago

This paper presents a self-play reinforcement learning framework for the four-player imperfect-information card game Big 2. It compares policy-gradient and value-based methods and finds that PPO with entropy regularization outperforms the other methods tested.
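
As a rough illustration of what "entropy regularization" usually means for a discrete-action policy, the sketch below adds a scaled entropy bonus to a policy objective. It is not taken from the paper: the function names and the `entropy_coef=0.01` default are assumptions chosen for illustration.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of one discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def objective_with_entropy_bonus(surrogate, probs_per_state, entropy_coef=0.01):
    """Add a mean-entropy bonus to a policy objective that is maximized.

    surrogate       -- scalar policy objective (hypothetical input)
    probs_per_state -- one action-probability list per sampled state
    entropy_coef    -- bonus weight; 0.01 is an illustrative assumption
    """
    mean_entropy = sum(categorical_entropy(p) for p in probs_per_state) / len(
        probs_per_state
    )
    return surrogate + entropy_coef * mean_entropy
```

A higher-entropy policy earns a larger bonus, which discourages the policy from collapsing onto a single action too early.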

A Unified Python Framework for Direct PPO-based Control of AHUs with Economizer Logic and CO2-Constrained Ventilation

arXiv cs.LG · 2026-05-26

This paper presents a unified Python framework that uses PPO-based deep reinforcement learning to optimize HVAC control with economizer logic and CO2-constrained ventilation. It reports improved energy efficiency and temperature stability over traditional PID controllers.

Not All Transitions Matter: Evidence from PPO

arXiv cs.LG · 2026-05-26

This paper investigates the temporal correlation problem in on-policy reinforcement learning with PPO, showing that randomly dropping a fixed fraction of transitions from rollouts reduces gradient redundancy and stabilizes training without degrading performance.
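
The idea of dropping a fixed fraction of transitions can be sketched as below. This is a minimal illustration, not the paper's procedure: the summary does not say at which stage transitions are dropped, so the sketch assumes returns and advantages were already computed on the full rollout. The name `drop_transitions` and its parameters are hypothetical.

```python
import random


def drop_transitions(rollout, drop_fraction, rng=None):
    """Return the rollout with a fixed fraction of transitions removed.

    rollout       -- list of transitions, in collection order
    drop_fraction -- fraction to discard, in [0, 1)
    rng           -- optional random.Random for reproducibility
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    if rng is None:
        rng = random.Random()
    n_keep = len(rollout) - int(len(rollout) * drop_fraction)
    # Sample indices uniformly without replacement, then restore time order.
    keep_idx = sorted(rng.sample(range(len(rollout)), n_keep))
    return [rollout[i] for i in keep_idx]
```

Sampling indices uniformly thins out runs of consecutive, highly correlated transitions while keeping the surviving ones in their original order.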

Value-Gradient Hypothesis of RL for LLMs

arXiv cs.LG · 2026-05-22

This paper introduces the value-gradient hypothesis to explain why critic-free RL methods like PPO and GRPO work well for LLMs, showing that the actor backward pass carries a value-gradient-like signal. It derives a predictive criterion for when RL is most effective along the pretraining trajectory.

Show HN: Watch a neural net learn to play Snake

Hacker News Top · 2026-05-14

A web-based tool that visualizes a neural network, trained with PPO, learning to play Snake in real time, with configurable parameters and 3D rendering.

Competitive self-play

OpenAI Blog · 2017-10-11

OpenAI demonstrates that competitive self-play in simulated 3D robot environments enables AI agents to discover complex physical behaviors like tackling, ducking, and faking without explicit instruction, suggesting self-play will be fundamental to future powerful AI systems.

Proximal Policy Optimization

OpenAI Blog · 2017-07-20

OpenAI introduces Proximal Policy Optimization (PPO), a reinforcement learning algorithm that matches or outperforms state-of-the-art methods while being simpler to implement and tune. PPO uses a novel clipped objective function to constrain policy updates and has since become OpenAI's default RL algorithm.
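
The clipped objective mentioned above can be sketched in a few lines of plain Python. This is an illustrative sketch, not OpenAI's implementation: it takes per-sample log-probabilities and advantage estimates as given, and the `clip_eps=0.2` default is a commonly used value assumed here rather than stated in the summary.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to maximize).

    new_logp, old_logp -- log-probabilities of the taken actions under the
                          current policy and the data-collecting policy
    advantages         -- advantage estimates for the same samples
    clip_eps           -- clipping range; 0.2 is an assumed default
    """
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes any gain from pushing the ratio
        # outside [1 - clip_eps, 1 + clip_eps].
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, the objective stops growing once the ratio passes `1 + clip_eps`; with a negative advantage, it stops growing once the ratio falls below `1 - clip_eps`. That is how the clip constrains the size of each policy update.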
