A comprehensive blog post reviewing the state of reinforcement learning for reasoning LLMs, covering methods from REINFORCE and PPO to GRPO and beyond, with connections to key models like InstructGPT and DeepSeek-R1.
GFT (Group Fine-Tuning) is a unified post-training framework for LLMs that addresses limitations of supervised fine-tuning by using Group Advantage Learning and Dynamic Coefficient Rectification to improve training stability and generalization. The paper shows SFT can be interpreted as a special case of policy gradient optimization with sparse implicit rewards, and GFT consistently outperforms SFT-based methods while integrating more smoothly with subsequent RL training.
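A one-line derivation makes the "SFT as policy gradient" reading concrete (notation mine, not the paper's):

$$
\nabla_\theta \mathcal{L}_{\text{SFT}}(\theta)
= -\sum_{t} \nabla_\theta \log \pi_\theta\!\left(y^*_t \mid x, y^*_{<t}\right)
= -\sum_{t} r_t \,\nabla_\theta \log \pi_\theta\!\left(y^*_t \mid x, y^*_{<t}\right)
\quad\text{with } r_t \equiv 1,
$$

which is a REINFORCE-style gradient whose "rollouts" are the fixed demonstrations and whose implicit reward is 1 on every demonstrated token and 0 on every other continuation, i.e., the sparse implicit reward referred to above.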
This paper identifies and addresses aggregation bias in GRPO-style reinforcement learning for LLMs, proposing Balanced Aggregation (BA), which improves training stability and final performance by computing token-level means separately for the positive and negative subsets.
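A rough sketch of what such balanced aggregation could look like in a GRPO-style token loss (my reading of the summary; the function name, tensor layout, and equal weighting of the two subset means are assumptions, not the paper's code):

```python
import torch

def balanced_token_loss(logprobs: torch.Tensor,
                        advantages: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """GRPO-style policy-gradient loss with balanced aggregation (sketch).

    Instead of averaging the per-token terms over all valid tokens, which
    lets whichever sign of advantage has more tokens dominate the batch,
    average the positive- and negative-advantage tokens separately and
    combine the two means. Shapes: [batch, seq_len]; mask is 1 for real
    tokens, 0 for padding.
    """
    per_token = -advantages * logprobs          # standard policy-gradient term per token
    pos = (advantages > 0) & mask.bool()        # positive-advantage tokens
    neg = (advantages < 0) & mask.bool()        # negative-advantage tokens

    def masked_mean(x, m):
        m = m.float()
        return (x * m).sum() / m.sum().clamp(min=1.0)

    # one token-level mean per subset, then an unweighted combination
    return masked_mean(per_token, pos) + masked_mean(per_token, neg)
```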
OpenAI released 'Spinning Up in Deep RL,' an educational toolkit featuring introductory materials, curated paper lists, and clean standalone implementations of key RL algorithms (VPG, TRPO, PPO, DDPG, TD3, SAC) designed to help newcomers learn deep reinforcement learning from scratch.
OpenAI researchers derive a bias-free, action-dependent baseline for variance reduction in policy gradient methods, demonstrating improved learning efficiency on high-dimensional control tasks and in multi-agent and partially observed environments.
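The gist, stated for a factorized policy $\pi_\theta(a\mid s)=\prod_i \pi_\theta(a_i\mid s)$ (my paraphrase, not the paper's exact notation): a per-dimension baseline may depend on all the other action dimensions without biasing the gradient,

$$
\hat g = \sum_i \nabla_\theta \log \pi_\theta(a_i \mid s)\,\big(\hat Q(s,a) - b_i(s, a_{-i})\big),
\qquad
\mathbb{E}_{a_i \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a_i \mid s)\, b_i(s, a_{-i})\big] = 0,
$$

since $b_i$ is constant in $a_i$ the score-function term integrates to zero, so the baseline can remove variance that a state-only baseline cannot.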
OpenAI presents LOLA (Learning with Opponent-Learning Awareness), a multi-agent reinforcement learning method in which each agent shapes the anticipated learning of the other agents. The approach demonstrates the emergence of cooperation in the iterated prisoner's dilemma and convergence to the Nash equilibrium in other game-theoretic settings.
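A minimal sketch of the core idea, differentiating through one anticipated (naive) learning step of the opponent (the exact-look-ahead form, names, and toy game below are my simplification, not the paper's estimator):

```python
import torch

def lola_step(theta1, theta2, loss_fn1, loss_fn2, lr=0.1, opp_lr=0.1):
    """One LOLA-style update for agent 1 (sketch).

    theta1 and theta2 are tensors with requires_grad=True; loss_fn1 and
    loss_fn2 map (theta1, theta2) to scalar losses. Instead of following the
    naive gradient of loss_fn1, agent 1 minimizes its loss at the opponent's
    anticipated parameters after one naive gradient step.
    """
    # anticipated opponent update, kept differentiable w.r.t. theta1
    grad2 = torch.autograd.grad(loss_fn2(theta1, theta2), theta2, create_graph=True)[0]
    theta2_look_ahead = theta2 - opp_lr * grad2

    # agent 1's loss at the opponent's anticipated parameters; the gradient
    # includes the extra path through theta2_look_ahead, which is the shaping term
    loss1 = loss_fn1(theta1, theta2_look_ahead)
    grad1 = torch.autograd.grad(loss1, theta1)[0]
    return (theta1 - lr * grad1).detach().requires_grad_(True)

# Purely illustrative usage on a differentiable two-player game:
# theta1 = torch.zeros(2, requires_grad=True); theta2 = torch.ones(2, requires_grad=True)
# L1 = lambda t1, t2: (t1 * t2).sum(); L2 = lambda t1, t2: ((t2 - t1) ** 2).sum()
# theta1 = lola_step(theta1, theta2, L1, L2)
```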
OpenAI introduces Proximal Policy Optimization (PPO), a reinforcement learning algorithm that matches or outperforms state-of-the-art methods while being simpler to implement and tune. PPO uses a novel clipped objective function to constrain policy updates and has since become OpenAI's default RL algorithm.
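For reference, the clipped surrogate objective is compact enough to state directly (a standard PPO-clip sketch in the usual notation; tensor names are mine, not OpenAI's reference code):

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (negated so it can be minimized).

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    Clipping the ratio to [1 - eps, 1 + eps] removes the incentive to move
    far from the policy that collected the data, which is what allows
    several epochs of minibatch updates per batch of rollouts.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```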