DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

Hugging Face Daily Papers

Summary

DVAO adaptively weights objectives based on reward variance to improve multi-reward RL training stability and multi-objective performance.

Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.
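
The two scalarization baselines named above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's code: it assumes Reward Combination means "mix the rewards, then apply group-relative normalization once" and Advantage Combination means "normalize each objective within the group, then mix the advantages with fixed weights". The function names, the `EPS` constant, and the example rewards are hypothetical.

```python
from statistics import mean, pstdev

EPS = 1e-8  # assumed guard against division by zero for constant reward groups


def group_normalize(values):
    """Group-relative normalization: (x - group mean) / (group std + EPS)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def reward_combination(rewards, weights):
    """Mix rewards across objectives first, then normalize the mixed reward once.

    rewards[k][i] is the reward of rollout i under objective k.
    """
    n = len(rewards[0])
    combined = [sum(w * r[i] for w, r in zip(weights, rewards)) for i in range(n)]
    return group_normalize(combined)


def advantage_combination(rewards, weights):
    """Normalize each objective separately, then mix advantages with static weights."""
    per_objective = [group_normalize(r) for r in rewards]
    n = len(rewards[0])
    return [sum(w * a[i] for w, a in zip(weights, per_objective)) for i in range(n)]


# Hypothetical rollout group: 4 responses scored on two objectives.
correctness = [1.0, 0.0, 1.0, 0.0]
format_ok = [1.0, 1.0, 0.0, 1.0]
rewards = [correctness, format_ok]
weights = [0.5, 0.5]  # static hyperparameters, fixed for the whole run

adv_rc = reward_combination(rewards, weights)
adv_ac = advantage_combination(rewards, weights)
```

In this sketch the `weights` list is the static hyperparameter the abstract refers to: it stays the same for every rollout group, regardless of how informative each objective is in that group.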

Cached at: 05/26/26, 02:41 AM

Source: https://huggingface.co/papers/2605.25604

Abstract

Dynamic Variance-adaptive Advantage Optimization (DVAO) addresses training instability in multi-reward reinforcement learning by adaptively weighting objectives based on empirical reward variance, maintaining bounded advantage magnitudes and improving multi-objective performance.
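
One way to picture variance-adaptive weighting is the sketch below. The exact DVAO weighting rule is not given on this page, so the rule used here is an assumption: each objective's weight is taken to be proportional to its empirical reward variance within the rollout group, so an objective that does not separate the rollouts at all receives zero weight. The names `variance_adaptive_weights` and `variance_adaptive_advantages` and the example rewards are hypothetical.

```python
from statistics import mean, pvariance

EPS = 1e-8  # assumed guard against division by zero


def group_normalize(values):
    """Group-relative normalization: (x - group mean) / (group std + EPS)."""
    mu = mean(values)
    sigma = pvariance(values) ** 0.5
    return [(v - mu) / (sigma + EPS) for v in values]


def variance_adaptive_weights(rewards):
    """ASSUMPTION (not the paper's formula): weight_k is proportional to the
    within-group reward variance of objective k, and the weights sum to 1."""
    variances = [pvariance(r) for r in rewards]
    total = sum(variances)
    if total < EPS:
        # No objective distinguishes the rollouts; fall back to uniform weights.
        return [1.0 / len(rewards)] * len(rewards)
    return [v / total for v in variances]


def variance_adaptive_advantages(rewards):
    """Mix per-objective normalized advantages with weights recomputed per group."""
    weights = variance_adaptive_weights(rewards)
    per_objective = [group_normalize(r) for r in rewards]
    n = len(rewards[0])
    return [sum(w * a[i] for w, a in zip(weights, per_objective)) for i in range(n)]


# Hypothetical rollout group: 4 responses scored on three objectives.
correctness = [1.0, 0.0, 1.0, 0.0]  # variance 0.25
constant = [0.5, 0.5, 0.5, 0.5]     # variance 0, carries no group-relative signal
format_ok = [1.0, 1.0, 0.0, 1.0]    # variance 0.1875
rewards = [correctness, constant, format_ok]

weights = variance_adaptive_weights(rewards)
advantages = variance_adaptive_advantages(rewards)
mean_square = sum(a * a for a in advantages) / len(advantages)
```

Because the weights in this sketch are non-negative and sum to 1, the combined advantage is a convex combination of per-objective advantages that each have at most unit mean square, so `mean_square` cannot exceed 1. That is a property of this particular sketch and only loosely mirrors the bounded-magnitude guarantee the paper claims to prove.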
