Hölder Policy Optimisation

Hugging Face Daily Papers 05/12/26, 12:00 AM Papers

Summary

HölderPO introduces a generalized policy optimization framework that uses the Hölder mean for token-level probability aggregation in GRPO, with a dynamic annealing schedule to balance gradient concentration and variance. The method achieves state-of-the-art results on mathematical benchmarks (54.9% average, 7.2% relative gain over GRPO) and a 93.8% success rate on ALFWorld.

Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter p, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger p concentrates the gradient to amplify sparse learning signals, whereas a smaller p strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules p across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, a 7.2% relative gain over standard GRPO, and secures a 93.8% success rate on ALFWorld.
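To make the aggregation step concrete, here is a minimal sketch of the Hölder (power) mean applied to a list of token probabilities. It uses the standard definition M_p(x) = ((1/n) * sum(x_i^p))^(1/p), with the geometric mean as the p -> 0 limit. The function name `holder_mean` and the example probabilities are illustrative assumptions, and the sketch does not reproduce how the paper plugs this mean into the GRPO objective, which the abstract does not specify.

```python
import math


def holder_mean(probs, p):
    """Hölder (power) mean of strictly positive values.

    p == 0 returns the geometric mean, which is the p -> 0 limit.
    Hypothetical helper for illustration, not the paper's code.
    """
    if not probs:
        raise ValueError("probs must be non-empty")
    if any(x <= 0.0 for x in probs):
        raise ValueError("all values must be strictly positive")
    n = len(probs)
    logs = [math.log(x) for x in probs]
    if p == 0:
        return math.exp(sum(logs) / n)
    # Work in log-space (log-sum-exp) so that large |p| does not overflow.
    scaled = [p * v for v in logs]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return math.exp((lse - math.log(n)) / p)


# Made-up token probabilities for one sampled sequence.
token_probs = [0.9, 0.5, 0.1]
for p in (-200, -1, 0, 1, 200):
    print(p, holder_mean(token_probs, p))
```

The mean is non-decreasing in p: very negative p approaches the smallest token probability, p = 1 is the arithmetic mean, and very large p approaches the largest token probability. This is the single knob that the abstract describes as trading gradient concentration against variance.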

Original Article

View Cached Full Text

Cached at: 05/18/26, 10:25 AM

Paper page - Hölder Policy Optimisation

Source: https://huggingface.co/papers/2605.12058 Authors:

Abstract

Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm’s adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter p, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger p concentrates the gradient to amplify sparse learning signals, whereas a smaller p strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules p across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, a 7.2% relative gain over standard GRPO, and secures a 93.8% success rate on ALFWorld.
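The abstract says only that p is scheduled progressively over training. As a rough illustration of what such a schedule could look like, the sketch below interpolates p linearly between two endpoints. The function name `anneal_p`, the linear shape, the endpoint values, and the direction of the anneal are all assumptions made for this example and are not taken from the paper.

```python
def anneal_p(step, total_steps, p_start, p_end):
    """Linearly interpolate p from p_start (first step) to p_end (last step).

    Hypothetical schedule for illustration; the paper's actual
    annealing rule is not given in the abstract.
    """
    if total_steps <= 1:
        return p_end
    # Clamp so that out-of-range steps stay at the endpoints.
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


# Example with assumed endpoints: start at p = 1.0 and end at p = -1.0.
for step in (0, 25, 50, 75, 100):
    print(step, anneal_p(step, total_steps=101, p_start=1.0, p_end=-1.0))
```

In a training loop, the value returned at each step would be passed as the exponent of the Hölder mean used to aggregate token-level probabilities for that update.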

Get this paper in your agent:

hf papers read 2605.12058

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

GenPO++: Generative Policy Optimization with Jacobian-free Likelihood Ratios

Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

Generative OOD-regularized Model-based Policy Optimization

Gradient Extrapolation-Based Policy Optimization