policy-gradient

#policy-gradient

EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games

arXiv cs.LG · 2d ago

EMAgnet introduces parameter-space exponential moving average regularization for policy gradient self-play in large two-player zero-sum games, achieving lower exploitability than uniform regularization targets.

KLip-PPO: A per-sample KL perspective on PPO-Clip

arXiv cs.LG · 2d ago

This paper shows that the gradient of the clipped surrogate in Proximal Policy Optimization (PPO) is exactly reproduced by a per-sample Kullback-Leibler penalty with a variable coefficient, revealing structural features of the clipped surrogate and suggesting new design directions.

In game theory, generalists sometimes win out over specialists

MIT News — Artificial Intelligence · 2026-06-17

MIT researchers co-authored a paper showing that general-purpose policy gradient algorithms can outperform specialized game-theoretic algorithms in imperfect-information games, challenging long-held assumptions in the field.

Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts

arXiv cs.LG · 2026-06-16

This paper formalizes embedding model routing as an adversarial contextual linear bandit with low-rank experts, proposing the Hypentropy Policy Gradient (HPG) algorithm that achieves Õ(s√(MT)) policy regret, avoiding the curse of dimensionality.

Diffusion Policy Optimization without Drifting Apart

arXiv cs.LG · 2026-06-15

DiPOD stabilizes diffusion policy optimization by interleaving self-distillation with policy-gradient updates to maintain a tight ELBO, preventing the double-drift phenomenon and achieving higher rewards in both language and continuous control tasks.

Self-Distilled Policy Gradient

arXiv cs.LG · 2026-06-04

SDPG (Self-Distilled Policy Gradient) is a new RL training framework for LLMs that combines group-relative verifier advantages with on-policy self-distillation and KL regularization to address sparse rewards and instability in RLVR training. The method uses a shared model as both student and teacher by conditioning on privileged context, showing improved stability and performance over RLVR and self-distillation baselines.

Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

arXiv cs.LG · 2026-06-02

This paper introduces ReMax, a new objective for reinforcement learning that induces exploration as an emergent property by evaluating policies based on expected maximum return over multiple samples, without explicit exploration bonuses. The authors derive a policy gradient formulation and propose RePPO, a PPO variant that achieves efficient exploration on MinAtar and Craftax benchmarks.

Self-Distilled Policy Gradient

Hugging Face Daily Papers · 2026-06-02

This paper proposes SDPG, a self-distilled policy-gradient framework that combines on-policy self-distillation with verifier advantages and KL regularization to improve reinforcement learning stability and performance.

Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems

arXiv cs.AI · 2026-05-27

This paper identifies two failure modes for policy-gradient methods in long-horizon cumulative-damage problems—completion and optimality—and proposes a decomposition to address them separately, validated on two calibrated environments.

Refined Analysis of Entropy-Regularized Actor-Critic

arXiv cs.LG · 2026-05-26

This paper provides a refined theoretical analysis of actor-critic methods with entropy regularization, showing that an exact critic acts as a strong variance reducer and enables sample complexity comparable to deterministic policy gradient, and that with a sufficiently accurate learned critic the benefits are preserved.

ECHO: Terminal Agents Learn World Models for Free

Hugging Face Daily Papers · 2026-05-23

ECHO introduces a hybrid objective that combines policy-gradient loss with environment observation prediction to provide dense supervision from terminal feedback, doubling performance on TerminalBench-2.0 for Qwen3 models.

Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO

Hugging Face Daily Papers · 2026-05-21

This paper identifies surrogate hacking and temporal uncertainty as failure modes in multi-timescale RL, and proposes a Target Decoupling architecture that removes routing from the actor, using the critic for auxiliary representation learning. The method eliminates policy collapse on the LunarLander-v2 benchmark and stably surpasses the 'Environment Solved' threshold without hyperparameter hacking.

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Hugging Face Daily Papers · 2026-05-20

This paper introduces DelTA, a discriminative token credit assignment method for reinforcement learning from verifiable rewards (RLVR) that amplifies distinctive token-gradient directions and reduces noise from shared patterns, achieving significant improvements on mathematical and code generation benchmarks.

Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems

arXiv cs.CL · 2026-05-18

This paper introduces Nexa, a trainable response-conditioned policy that combines parallel and sequential execution in multi-agent systems, using a lightweight transformer to predict sparse communication graphs and improving accuracy while minimizing latency.

@probablynotaz9: Solo-author ICML paper alert Ever wanted to post-train your diffusion LLM with good old policy gradients, without havin…

X AI KOLs Following · 2026-05-09

This solo-author ICML paper introduces Amortized Group Relative Policy Optimization (AGRPO) to enable effective reinforcement learning post-training for diffusion language models.

@jiqizhixin: Awesome blog! State of RL for reasoning LLMs https://aweers.de/blog/2026/rl-for-llms/…

X AI KOLs Timeline · 2026-05-08

A comprehensive blog post reviewing the state of reinforcement learning for reasoning LLMs, covering methods from REINFORCE and PPO to GRPO and beyond, with connections to key models like InstructGPT and DeepSeek-R1.

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

Hugging Face Daily Papers · 2026-05-07

This paper introduces Listwise Policy Optimization (LPO), a method for RLVR that explicitly handles target projection via divergence minimization on the response simplex to improve training stability and performance in LLMs.

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

Hugging Face Daily Papers · 2026-04-15

GFT (Group Fine-Tuning) is a unified post-training framework for LLMs that addresses limitations of supervised fine-tuning by using Group Advantage Learning and Dynamic Coefficient Rectification to improve training stability and generalization. The paper shows SFT can be interpreted as a special case of policy gradient optimization with sparse implicit rewards, and GFT consistently outperforms SFT-based methods while integrating more smoothly with subsequent RL training.

Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

Hugging Face Daily Papers · 2026-04-14

This paper identifies and addresses aggregation bias in GRPO-style reinforcement learning for LLMs, proposing Balanced Aggregation (BA), which improves training stability and final performance by computing token-level means separately for positive and negative subsets.

Spinning Up in Deep RL

OpenAI Blog · 2018-11-08

OpenAI released 'Spinning Up in Deep RL,' an educational toolkit featuring introductory materials, curated paper lists, and clean standalone implementations of key RL algorithms (VPG, TRPO, PPO, DDPG, TD3, SAC) designed to help newcomers learn deep reinforcement learning from scratch.
