Tag
MOCHA introduces a multi-objective optimization method for LLM agent skills. It uses Chebyshev scalarization and exponential annealing to handle hard platform constraints and discover Pareto-optimal variants, and it reports significant improvements over existing optimizers.
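For readers unfamiliar with the two named ingredients, the following is a minimal sketch of textbook weighted Chebyshev scalarization and a generic exponential annealing schedule. It is not MOCHA's implementation: the function names, the minimization convention, the `ideal` reference point, and the `decay` parameter are all assumptions made for illustration, and the summary does not say how MOCHA combines the two.

```python
from typing import Sequence


def chebyshev_scalarize(
    objectives: Sequence[float],
    weights: Sequence[float],
    ideal: Sequence[float],
) -> float:
    """Weighted Chebyshev scalarization: max_i w_i * |f_i - z_i|.

    Assumes every objective is minimized and `ideal` holds the best
    known value per objective (hypothetical convention, lower is better).
    """
    if not (len(objectives) == len(weights) == len(ideal)):
        raise ValueError("objectives, weights and ideal must have equal length")
    return max(w * abs(f - z) for f, w, z in zip(objectives, weights, ideal))


def exponential_anneal(step: int, start: float, decay: float) -> float:
    """Generic exponential schedule: start * decay**step, with 0 < decay < 1."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must lie strictly between 0 and 1")
    return start * decay ** step


# Illustrative values only: two objectives (e.g. cost and error) with equal weights.
score = chebyshev_scalarize([0.2, 0.5], [0.5, 0.5], [0.0, 0.0])
temperature = exponential_anneal(step=3, start=2.0, decay=0.5)
```

Because the scalarized score is driven by the worst weighted objective, varying the weight vector steers the search toward different points on the Pareto front, including points in non-convex regions that a weighted sum cannot reach.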
HölderPO introduces a generalized policy optimization framework that uses the Hölder mean for token-level probability aggregation in GRPO, with a dynamic annealing schedule to balance gradient concentration and variance. The method achieves state-of-the-art results on mathematical benchmarks (54.9% average, 7.2% relative gain over GRPO) and a 93.8% success rate on ALFWorld.
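The Hölder (power) mean referred to in the HölderPO summary can be sketched as follows. This shows only the standard definition applied to a list of token probabilities; the function name is hypothetical, and how HölderPO plugs the mean into the GRPO objective or schedules the exponent `p` is not stated in the summary, so neither is reproduced here.

```python
import math
from typing import Sequence


def holder_mean(probs: Sequence[float], p: float) -> float:
    """Hölder (power) mean M_p = (mean(x_i ** p)) ** (1 / p) of positive values.

    p = 1 gives the arithmetic mean and p -> 0 the geometric mean.
    Larger p weights high-probability tokens more; smaller p weights
    low-probability tokens more.
    """
    if not probs:
        raise ValueError("probs must be non-empty")
    if any(x <= 0.0 for x in probs):
        raise ValueError("all values must be strictly positive")
    n = len(probs)
    if abs(p) < 1e-12:
        # Limit case p -> 0: geometric mean, computed in log space.
        return math.exp(sum(math.log(x) for x in probs) / n)
    return (sum(x ** p for x in probs) / n) ** (1.0 / p)


# Illustrative token probabilities for one sampled response (made-up values).
token_probs = [0.9, 0.6, 0.3]
harmonic = holder_mean(token_probs, -1.0)
geometric = holder_mean(token_probs, 0.0)
arithmetic = holder_mean(token_probs, 1.0)
```

The power mean is non-decreasing in `p`, so a single exponent interpolates between aggregations dominated by the least likely tokens and aggregations dominated by the most likely ones.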