EMAgnet introduces parameter-space exponential moving average regularization for policy gradient self-play in large two-player zero-sum games, achieving lower exploitability than uniform regularization targets.
This paper shows that the gradient of the clipped surrogate in Proximal Policy Optimization (PPO) is exactly reproduced by a per-sample Kullback-Leibler penalty with a variable coefficient, revealing structural features of the clipped surrogate and suggesting new design directions.
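For orientation, the clipped surrogate this entry refers to is, per sample, min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the probability ratio and A the advantage. The plain-Python sketch below evaluates that standard PPO term and its derivative with respect to r. The function names and the default ε = 0.2 are illustrative assumptions, and the sketch does not reproduce the paper's KL-penalty equivalence.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def surrogate_grad_wrt_ratio(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Derivative of the clipped surrogate with respect to the ratio.

    The derivative equals the advantage while the unclipped term is the
    active one, and is zero once the ratio has left the clip range in the
    direction the advantage favours.
    """
    if advantage > 0 and ratio > 1.0 + eps:
        return 0.0  # clipped from above: no further incentive to raise the ratio
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0  # clipped from below: no further incentive to lower the ratio
    return advantage


if __name__ == "__main__":
    for r, a in [(1.5, 2.0), (0.5, 2.0), (0.5, -1.0), (1.5, -1.0)]:
        print(r, a, clipped_surrogate(r, a), surrogate_grad_wrt_ratio(r, a))
```

The zero-gradient region is one-sided: a sample whose ratio has moved the "wrong" way still receives the full advantage as its gradient signal.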
MIT researchers co-authored a paper showing that general-purpose policy gradient algorithms can outperform specialized game-theoretic algorithms in imperfect-information games, challenging long-held assumptions in the field.
This paper formalizes embedding model routing as an adversarial contextual linear bandit with low-rank experts, proposing the Hypentropy Policy Gradient (HPG) algorithm that achieves Õ(s√(MT)) policy regret, avoiding the curse of dimensionality.
DiPOD stabilizes diffusion policy optimization by interleaving self-distillation with policy-gradient updates to maintain a tight ELBO, preventing the double-drift phenomenon and achieving higher rewards in both language and continuous control tasks.
SDPG (Self-Distilled Policy Gradient) is a new RL training framework for LLMs that combines group-relative verifier advantages with on-policy self-distillation and KL regularization to address sparse rewards and instability in RLVR training. The method uses a shared model as both student and teacher by conditioning on privileged context, showing improved stability and performance over RLVR and self-distillation baselines.
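As a rough illustration of what "group-relative" advantages usually means in GRPO-style training, the sketch below standardises verifier rewards within one group of responses to the same prompt. This is the commonly used normalisation, assumed here for illustration. The summary above does not state the exact form SDPG uses, and the function name and `eps` stabiliser are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within a single group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so the
    signal is relative to the other responses for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Binary verifier rewards for four responses to one prompt.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group where every response gets the same reward carries no signal.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

The second call shows the sparse-reward problem the entry mentions: when all responses in a group receive the same reward, every advantage is zero.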
This paper introduces ReMax, a new objective for reinforcement learning that induces exploration as an emergent property by evaluating policies based on expected maximum return over multiple samples, without explicit exploration bonuses. The authors derive a policy gradient formulation and propose RePPO, a PPO variant that achieves efficient exploration on MinAtar and Craftax benchmarks.
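To make "expected maximum return over multiple samples" concrete, the sketch below estimates the expected best-of-k return from n sampled returns by averaging the maximum over every size-k subset, computed in closed form with order statistics. This is a generic estimator chosen for illustration; whether ReMax or RePPO uses this particular estimator is not stated in the summary, and the function name is hypothetical.

```python
from math import comb


def expected_max_of_k(returns, k):
    """Average of max(subset) over all size-k subsets of the sampled returns."""
    n = len(returns)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(returns)")
    ordered = sorted(returns)
    total_subsets = comb(n, k)
    # The i-th smallest return (1-indexed) is the maximum of exactly
    # C(i-1, k-1) of the size-k subsets; comb() is 0 when i-1 < k-1.
    weighted = sum(comb(i - 1, k - 1) * x for i, x in enumerate(ordered, start=1))
    return weighted / total_subsets


if __name__ == "__main__":
    sampled = [1.0, 2.0, 3.0]
    for k in (1, 2, 3):
        print(k, expected_max_of_k(sampled, k))
```

With k = 1 the estimate reduces to the ordinary mean return. Larger k puts more weight on the best outcomes, which is one way to see why such an objective can reward high-variance, exploratory behaviour.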
This paper proposes SDPG, a self-distilled policy-gradient framework that combines on-policy self-distillation with verifier advantages and KL regularization to improve reinforcement learning stability and performance.
This paper identifies two failure modes for policy-gradient methods in long-horizon cumulative-damage problems, completion and optimality, and proposes a decomposition that addresses them separately, validated on two calibrated environments.
This paper provides a refined theoretical analysis of actor-critic methods with entropy regularization, showing that an exact critic acts as a strong variance reducer and enables sample complexity comparable to deterministic policy gradient, and that with a sufficiently accurate learned critic the benefits are preserved.
ECHO introduces a hybrid objective that combines policy-gradient loss with environment observation prediction to provide dense supervision from terminal feedback, doubling performance on TerminalBench-2.0 for Qwen3 models.
This paper identifies surrogate hacking and temporal uncertainty as failure modes in multi-timescale RL, and proposes a Target Decoupling architecture that removes routing from the actor, using the critic for auxiliary representation learning. The method eliminates policy collapse on the LunarLander-v2 benchmark and stably surpasses the 'Environment Solved' threshold without hyperparameter hacking.
Introduces DelTA, a discriminative token credit assignment method for reinforcement learning from verifiable rewards (RLVR) that amplifies distinctive token-gradient directions and reduces noise from shared patterns, achieving significant improvements on mathematical and code generation benchmarks.
Introduces Nexa, a trainable response-conditioned policy that combines parallel and sequential execution in multi-agent systems, using a lightweight transformer to predict sparse communication graphs, improving accuracy while minimizing latency.
This solo-author ICML paper introduces Amortized Group Relative Policy Optimization (AGRPO) to enable effective reinforcement learning post-training for diffusion language models.
A comprehensive blog post reviewing the state of reinforcement learning for reasoning LLMs, covering methods from REINFORCE and PPO to GRPO and beyond, with connections to key models like InstructGPT and DeepSeek-R1.
This paper introduces Listwise Policy Optimization (LPO), a method for RLVR that explicitly handles target projection via divergence minimization on the response simplex to improve training stability and performance in LLMs.
GFT (Group Fine-Tuning) is a unified post-training framework for LLMs that addresses limitations of supervised fine-tuning by using Group Advantage Learning and Dynamic Coefficient Rectification to improve training stability and generalization. The paper shows SFT can be interpreted as a special case of policy gradient optimization with sparse implicit rewards, and GFT consistently outperforms SFT-based methods while integrating more smoothly with subsequent RL training.
This paper identifies and addresses aggregation bias in GRPO-style reinforcement learning for LLMs, proposing Balanced Aggregation (BA) which improves training stability and final performance by computing token-level means separately for positive and negative subsets.
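The contrast between pooled and per-subset token means can be sketched as follows. The data layout (a list of `(advantage, per_token_terms)` pairs) and the function names are assumptions made for this example. The summary does not say how Balanced Aggregation combines the two means, so the sketch returns them separately rather than guessing.

```python
def token_mean(responses):
    """Mean over every token term in the given responses (0.0 if there are none)."""
    tokens = [t for _, token_terms in responses for t in token_terms]
    return sum(tokens) / len(tokens) if tokens else 0.0


def split_token_means(responses):
    """Token-level means computed separately for positive- and negative-advantage responses."""
    positive = [r for r in responses if r[0] > 0]
    negative = [r for r in responses if r[0] < 0]
    return token_mean(positive), token_mean(negative)


if __name__ == "__main__":
    # One short positive-advantage response and one long negative-advantage response.
    batch = [
        (1.0, [1.0, 1.0]),
        (-1.0, [-1.0, -1.0, -1.0, -1.0, -1.0, -1.0]),
    ]
    print("pooled:", token_mean(batch))
    print("split:", split_token_means(batch))
```

In this toy batch the pooled mean is dominated by the longer negative response, while the split means weigh each subset by its own token count.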
OpenAI released 'Spinning Up in Deep RL,' an educational toolkit featuring introductory materials, curated paper lists, and clean standalone implementations of key RL algorithms (VPG, TRPO, PPO, DDPG, TD3, SAC) designed to help newcomers learn deep reinforcement learning from scratch.