DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
Summary
DVAO adaptively weights objectives based on reward variance to improve multi-reward RL training stability and multi-objective performance.
Source: https://huggingface.co/papers/2605.25604
Abstract
Dynamic Variance-adaptive Advantage Optimization (DVAO) addresses training instability in multi-reward reinforcement learning by adaptively weighting objectives based on empirical reward variance, maintaining bounded advantage magnitudes and improving multi-objective performance.
Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.
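The abstract does not state the exact weighting rule, so the following is only a minimal sketch of the general idea, not the paper's formulation. It assumes that each objective's rewards are normalized within the rollout group (GRPO-style) and that the combination weights are simply proportional to each objective's empirical reward variance in that group. The function names, the proportional rule, the uniform fallback, and the `EPS` constant are all hypothetical choices made for illustration.

```python
from statistics import fmean, pstdev, pvariance

EPS = 1e-8  # assumed guard against division by zero for constant rewards


def group_normalize(values):
    """GRPO-style advantage: center and scale rewards by the group's own statistics."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def advantage_combination(rewards, weights):
    """Normalize each objective separately, then mix with the given weights."""
    advantages = [group_normalize(r) for r in rewards]
    group_size = len(rewards[0])
    return [
        sum(w * a[i] for w, a in zip(weights, advantages))
        for i in range(group_size)
    ]


def variance_adaptive_combination(rewards):
    """Mix per-objective advantages with variance-proportional weights.

    Assumption: weight_k = Var_k / sum_j Var_j. The paper's actual rule may differ.
    """
    variances = [pvariance(r) for r in rewards]
    total = sum(variances)
    if total < EPS:
        # Every objective is constant in this group: fall back to uniform weights.
        weights = [1.0 / len(rewards)] * len(rewards)
    else:
        weights = [v / total for v in variances]
    return advantage_combination(rewards, weights), weights


# One rollout group of four responses scored on two objectives.
correctness = [1.0, 0.0, 1.0, 0.0]  # varies across the group
formatting = [1.0, 1.0, 1.0, 1.0]   # constant, so it cannot rank the responses
advantages, weights = variance_adaptive_combination([correctness, formatting])
print(weights)  # [1.0, 0.0]
```

In this toy group the constant formatting reward receives zero weight, so the combined advantage is driven entirely by correctness. A static-weight Advantage Combination would instead call `advantage_combination` with fixed weights such as `[0.5, 0.5]` regardless of what the group looks like.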
Similar Articles
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
D-VLA proposes a high-concurrency distributed asynchronous reinforcement learning framework for Vision-Language-Action models, using plane decoupling and a swimlane pipeline to improve throughput and efficiency in large-scale embodied AI training.
Dual Advantage Fields
Dual Advantage Fields (DAF) is a policy-extraction method for offline goal-conditioned RL that converts a bilinear dual value model into a local advantage signal by learning an action-effect model predicting feature displacement and scoring actions by alignment with the goal direction. Accepted at the ICML 2026 Workshop on Decision Making, DAF shows improved performance on OGBench locomotion, manipulation, and puzzle tasks.
Vector Policy Optimization: Training for Diversity Improves Test-Time Search
This paper introduces Vector Policy Optimization (VPO), a reinforcement learning algorithm that trains LLMs to produce diverse solutions by optimizing across multiple reward dimensions, significantly improving test-time search performance compared to scalar RL baselines.
ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization
Introduces ODRPO, a framework that decomposes discrete rewards into ordinal binary indicators to improve robustness of policy optimization in RLAIF for LLMs, achieving up to 14.8% relative improvement with minimal overhead.
Optimistic Dual Averaging Unifies Modern Optimizers
This paper introduces SODA, a generalization of Optimistic Dual Averaging that unifies various modern optimizers like Muon and Lion. It proposes a practical wrapper that improves performance across different scales without requiring additional hyperparameter tuning for weight decay.