advantage-function

Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL

arXiv cs.LG · yesterday

This paper introduces FADE (Focal Advantage with Dynamic Entropy), a self-adapting advantage function that dynamically schedules gradient weights during RL post-training of LLMs. It reports faster convergence and better accuracy-diversity trade-offs than static baselines.
