Tag
The paper proposes GAC, a noise-aware adaptive mixing controller for hybrid SFT-RL post-training of LLMs. It derives a closed-form mixing weight that balances gradient noise and SFT-RL disagreement, achieving consistent improvements across multiple benchmarks with minimal overhead.
This paper introduces PNAPO, an offline preference optimization framework for rectified flow models that augments preference data with noise samples and uses dynamic regularization to improve training and sample efficiency.