Metric-Gradient Projection for Stable Multi-Agent Policy Learning
Summary
Introduces HPML, a method that projects the joint update field of a multi-agent system onto a metric-gradient component to stabilize and improve multi-agent reinforcement learning. The paper provides theoretical guarantees and reports improved stability and returns on centralized-training, decentralized-execution (CTDE) benchmarks.
Similar Articles
Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents
This paper introduces HCL-GP, a dynamic policy-learning framework that integrates generalized planning and hierarchical task decomposition to enable LLM-based agents to learn and reuse executable policy components, significantly improving performance on the AppWorld benchmark.
Gradient Extrapolation-Based Policy Optimization
The article introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It demonstrates improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.
Hybrid Policy Distillation for LLMs
Introduces Hybrid Policy Distillation (HPD), a unified knowledge distillation approach that balances forward and reverse KL divergences and combines off-policy data with lightweight on-policy sampling, improving LLM compression across math, dialogue, and code tasks.
Hölder Policy Optimisation
HölderPO introduces a generalized policy optimization framework that uses the Hölder mean for token-level probability aggregation in GRPO, with a dynamic annealing schedule to balance gradient concentration and variance. The method achieves state-of-the-art results on mathematical benchmarks (54.9% average, 7.2% relative gain over GRPO) and a 93.8% success rate on ALFWorld.
Evolved Policy Gradients
OpenAI introduces Evolved Policy Gradients (EPG), a meta-learning approach that evolves loss functions rather than learning policies directly, enabling RL agents to generalize better across tasks by leveraging prior experience, much as humans transfer skills.