Hölder Policy Optimisation

Hugging Face Daily Papers

Summary

HölderPO introduces a generalized policy optimization framework that uses the Hölder mean for token-level probability aggregation in GRPO, with a dynamic annealing schedule to balance gradient concentration and variance. The method achieves state-of-the-art results on mathematical benchmarks (54.9% average, 7.2% relative gain over GRPO) and a 93.8% success rate on ALFWorld.
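
For reference, the standard textbook definition of the Hölder (power) mean of positive values x_1, ..., x_n is given below. Exactly how the paper plugs token-level probabilities into this mean, and the form of the resulting objective, is not stated in this summary, so the symbols here are generic and not taken from the paper.

```latex
M_p(x_1,\dots,x_n) = \left( \frac{1}{n} \sum_{i=1}^{n} x_i^{p} \right)^{1/p}, \quad p \neq 0,
\qquad
M_0(x_1,\dots,x_n) = \lim_{p \to 0} M_p = \left( \prod_{i=1}^{n} x_i \right)^{1/n}

% Special cases and monotonicity (power mean inequality):
M_{-\infty} = \min_i x_i, \quad
M_{1} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad
M_{+\infty} = \max_i x_i, \qquad
p \le q \;\Rightarrow\; M_p \le M_q
```

The arithmetic mean (p = 1) and the geometric mean (p = 0) are thus two points on one continuous family indexed by p.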

Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter p, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger p concentrates the gradient to amplify sparse learning signals, whereas a smaller p strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules p across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, a substantial 7.2% relative gain over standard GRPO, and secures an exceptional 93.8% success rate on ALFWorld.
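
The aggregation idea in the abstract can be illustrated with a minimal Python sketch. It computes the Hölder mean of a sequence's token probabilities for a given p and pairs it with a simple schedule for p. This is not the authors' implementation: the function names `holder_mean` and `linear_anneal`, the linear shape of the schedule, its direction, and the example probabilities are all assumptions made for illustration, since the abstract only says that p is scheduled progressively during training.

```python
import math


def holder_mean(probs, p):
    """Hölder (power) mean of strictly positive values with exponent p."""
    if not probs:
        raise ValueError("probs must be non-empty")
    if any(x <= 0 for x in probs):
        raise ValueError("all values must be strictly positive")
    n = len(probs)
    if p == 0:
        # The limit p -> 0 is the geometric mean.
        return math.exp(sum(math.log(x) for x in probs) / n)
    if math.isinf(p):
        # p -> +inf gives the maximum, p -> -inf the minimum.
        return max(probs) if p > 0 else min(probs)
    return (sum(x ** p for x in probs) / n) ** (1.0 / p)


def linear_anneal(step, total_steps, p_start, p_end):
    """Hypothetical schedule: interpolate p linearly from p_start to p_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


# Illustrative token probabilities for one sampled sequence (made up).
token_probs = [0.9, 0.5, 0.1]

# The aggregate grows with p: low p is dominated by the smallest
# probabilities, high p by the largest ones.
for p in (-1.0, 0.0, 1.0, 2.0):
    print(p, round(holder_mean(token_probs, p), 4))

# Assumed direction for illustration only: start at p = 1, end at p = -1.
for step in (0, 50, 100):
    print(step, linear_anneal(step, 100, p_start=1.0, p_end=-1.0))
```

Because the power mean is non-decreasing in p, sweeping p moves the aggregate smoothly between the minimum and maximum token probability, which is the single control knob the abstract describes.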

Cached at: 05/18/26, 10:25 AM

Source: https://huggingface.co/papers/2605.12058

Similar Articles

GenPO++: Generative Policy Optimization with Jacobian-free Likelihood Ratios

arXiv cs.LG

GenPO++ proposes a reversible generative policy optimization framework that uses history states as auxiliary memory in a high-order reversible ODE solver, enabling exact inversion and Jacobian-free likelihood-ratio computation for flow-based policies in reinforcement learning. It achieves competitive performance on large-scale control, fine-tuning, and real-world robotic tasks while improving stability and efficiency.

Generative OOD-regularized Model-based Policy Optimization

arXiv cs.LG

Introduces GORMPO, a density-regularized offline RL algorithm that uses generative density modeling to restrict policy updates to high-density areas, achieving 17% improvement on a real-world medical dataset and outperforming state-of-the-art baselines.

Gradient Extrapolation-Based Policy Optimization

arXiv cs.LG

The article introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It demonstrates improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.