Hölder Policy Optimisation
Summary
HölderPO is a generalised policy optimisation framework that uses the Hölder mean to aggregate token-level probabilities in GRPO, with a dynamic annealing schedule for the exponent p that balances gradient concentration against gradient variance. The method reports a state-of-the-art average accuracy of 54.9% across mathematical benchmarks (a 7.2% relative gain over standard GRPO) and a 93.8% success rate on ALFWorld.
Cached at: 05/18/26, 10:25 AM
Source: https://huggingface.co/papers/2605.12058
Abstract
Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm’s adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter p, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger p concentrates the gradient to amplify sparse learning signals, whereas a smaller p strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules p across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, yielding a substantial 7.2% relative gain over standard GRPO and secures an exceptional 93.8% success rate on ALFWorld.
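For readers unfamiliar with the Hölder (power) mean, the sketch below shows how a single exponent p moves the aggregate of a sequence's token-level values between min-like behaviour (very negative p), the geometric mean (p near 0), the arithmetic mean (p = 1) and max-like behaviour (large p). It is an illustration of the standard power mean only, not the paper's implementation: the abstract does not say exactly which per-token quantity is aggregated, nor the shape or direction of the annealing schedule, so `holder_mean`, `linear_p_schedule`, the example values and the start/end exponents are all hypothetical.

```python
import math


def holder_mean(values, p, eps=1e-8):
    """Standard power (Hölder) mean of positive values with exponent p.

    Computed in the log domain so that small probabilities and large |p|
    do not underflow or overflow.
    """
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty sequence of positive numbers")
    logs = [math.log(v) for v in values]
    n = len(logs)
    if abs(p) < eps:
        # Limit p -> 0 of the power mean is the geometric mean.
        return math.exp(sum(logs) / n)
    # log( (1/n) * sum(v^p) ) via a log-sum-exp, then take the 1/p power.
    scaled = [p * l for l in logs]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return math.exp((lse - math.log(n)) / p)


def linear_p_schedule(step, total_steps, p_start, p_end):
    """Hypothetical linear annealing of p; the paper's actual schedule is not given here."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


if __name__ == "__main__":
    # Assumed per-token probabilities for one sampled sequence (made-up numbers).
    token_probs = [0.9, 0.8, 0.05, 0.95]
    total_steps = 4
    for step in range(total_steps + 1):
        # Assumed endpoints: anneal from p = 2.0 down to p = -1.0.
        p = linear_p_schedule(step, total_steps, p_start=2.0, p_end=-1.0)
        print(f"step={step} p={p:+.2f} aggregate={holder_mean(token_probs, p):.4f}")
```

Because the power mean is non-decreasing in p, a larger p lets the highest token values dominate the aggregate, while a smaller p pulls it toward the lowest ones. This is the single knob that the abstract describes as trading gradient concentration against variance.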
Similar Articles
BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization
BiasGRPO proposes a framework using Group Relative Policy Optimization (GRPO) to stabilize social bias mitigation in LLMs by normalizing rewards across sampled completions, outperforming DPO and PPO on multiple benchmarks. The authors also release a compute-efficient bias reward model designed for integration into multi-objective RLHF pipelines.
GenPO++: Generative Policy Optimization with Jacobian-free Likelihood Ratios
GenPO++ proposes a reversible generative policy optimization framework that uses history states as auxiliary memory in a high-order reversible ODE solver, enabling exact inversion and Jacobian-free likelihood-ratio computation for flow-based policies in reinforcement learning. It achieves competitive performance on large-scale control, fine-tuning, and real-world robotic tasks while improving stability and efficiency.
Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs
The paper introduces mmGRPO, a multi-module extension of Group Relative Policy Optimization (GRPO) that improves accuracy in modular AI systems by optimizing language model calls and prompts. It reports an average 11% accuracy improvement across various tasks and provides an open-source implementation in DSPy.
Generative OOD-regularized Model-based Policy Optimization
Introduces GORMPO, a density-regularized offline RL algorithm that uses generative density modeling to restrict policy updates to high-density areas, achieving 17% improvement on a real-world medical dataset and outperforming state-of-the-art baselines.
Gradient Extrapolation-Based Policy Optimization
The article introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It demonstrates improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.