GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

arXiv cs.LG Papers

Summary

The paper proposes GAC, a noise-aware adaptive mixing controller for hybrid SFT-RL post-training of LLMs. It derives a closed-form mixing weight that balances gradient noise and SFT-RL disagreement, achieving consistent improvements across multiple benchmarks with minimal overhead.

arXiv:2605.26184v1. Abstract: Hybrid post-training usually combines supervised fine-tuning and reinforcement learning, but fixed mixing schedules cannot adapt when the relative noise of the two signals changes over time. We propose GAC, a noise-aware controller that derives an adaptive mixing weight from online estimates of gradient variance and disagreement between the two training signals. The method adds smoothing, prior guidance, and bounded updates while reusing existing training tensors. Experiments on math, code, science, and logic benchmarks show that GAC consistently improves hybrid post-training over strong fixed and rule-based baselines, with larger gains at larger model scales and less than 1% training overhead.

Cached at: 05/27/26, 09:05 AM

# GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training
Source: [https://arxiv.org/html/2605.26184](https://arxiv.org/html/2605.26184)
Yuelin Hu¹, Zhenbo Yu¹, Zhengxue Cheng¹, Wei Liu², Li Song¹ (¹Shanghai Jiao Tong University, ²Shanghai Maritime University), {huyuelin51717221, yuzhenbo, zxcheng, songli}@sjtu.edu.cn

###### Abstract

Hybrid post-training combining supervised fine-tuning (SFT) and reinforcement learning (RL) is the standard paradigm for aligning large language models, yet fixed mixing schedules cannot adapt when the relative noise of the two signals evolves. We derive a noise-aware mixing weight $\mu^{*}$ by minimizing an MSE upper bound on the mixed stochastic gradient, yielding a closed form that balances gradient noise variance and SFT–RL disagreement. Building on the token-wise reweighting of CHORD Zhang et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib30)), the practical Guided Adaptive Controller (GAC) adds EMA smoothing, a schedule prior, and capped updates around this estimator, with all statistics estimated online from existing training tensors. The noise-aware controller alone outperforms the best rule-based controller by +3.0pp on AMC; the full system gains up to +3.8pp over HPT across math, code, science, and logic benchmarks, while reducing KL-drift area by 28% and large-$|\Delta\mu|$ events by >70%, at <1% overhead. Gains are larger at 7B and 14B than at 1.5B (Table [6](https://arxiv.org/html/2605.26184#S4.T6)). Code: [https://github.com/anonymous/GAC](https://github.com/anonymous/GAC).


![Refer to caption](https://arxiv.org/html/2605.26184v1/figures/Noise-Guided-Adaptive.png)

Figure 1: Overview of GAC and the training pipeline. Top: limitations of fixed schedules versus the noise-aware controller. Middle: the Guided Adaptive Controller integrates $\sigma_r^{2}$ (RL uncertainty), $\sigma_s^{2}$ (SFT uncertainty), and $\Delta\tilde{g}^{2}$ (disagreement proxy) to compute the mixing weight in Eq. ([3](https://arxiv.org/html/2605.26184#S3.E3)), followed by EMA smoothing, prior blending, and capped updates. Bottom: the pipeline sources signals from GRPO and SFT with token-wise weighting $\varphi(p) = p(1-p)$ Zhang et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib30)), composing $L = (1-\mu)\,L_{\mathrm{GRPO}} + \mu\,L_{\mathrm{SFT}\text{-}\varphi}$. All statistics are estimated online from existing training tensors.

## 1 Introduction

Large language models (LLMs) are typically post-trained by combining supervised fine-tuning (SFT) with reinforcement learning (RL). SFT stabilizes format from expert demonstrations while RL improves reward-seeking behavior from on-policy rollouts. However, SFT and RL objectives cannot be fully decoupled without mutual degradation Niu et al. ([2026](https://arxiv.org/html/2605.26184#bib.bib36)), and fixed mixtures cannot adapt to evolving policy drift and reward noise, inducing entropy collapse or late over-imitation. Recent hybrid methods address this via interleaved SFT–RL Su et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib34)), anchored regularization Zhu et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib35)), or conflict-aware coupling Zeng et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib37)). We take a complementary approach: a noise-aware global controller that adapts the mixing weight based on online gradient uncertainty estimates, combined with the token-wise stabilization $\varphi(\cdot)$ from CHORD Zhang et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib30)). Figure [1](https://arxiv.org/html/2605.26184#S0.F1) provides an overview.

#### Contributions

The new contribution is a global noise-aware mixing controller and its proxy-based online estimation; the token-wise reweighting $\varphi(p)$ is adopted from CHORD Zhang et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib30)). (C1) A closed-form $\mu^{*}$ from MSE minimization (Eq. [3](https://arxiv.org/html/2605.26184#S3.E3)), instantiated via z-normalized proxies (Appendix [D](https://arxiv.org/html/2605.26184#A4)), wrapped in a guided controller with motivating stability analysis (Proposition [2](https://arxiv.org/html/2605.26184#Thmtheorem2)), reusing existing tensors with <1% overhead. (C2) Token-wise SFT reweighting $\varphi(p) = p(1-p)$ adopted from CHORD, contributing +0.4–1.4pp orthogonal to the controller (Section [4.3](https://arxiv.org/html/2605.26184#S4.SS3)). (C3) Systematic evaluation across math, code, science, and logic at 1.5B, 7B, and 14B scales.

## 2 Related Work

Post-training paradigms. Sequential SFT-then-RL exhibits a “shift–readapt–overfit” progression; scheduled mixtures remain heuristic Ouyang et al. ([2022](https://arxiv.org/html/2605.26184#bib.bib14)); Christiano et al. ([2017](https://arxiv.org/html/2605.26184#bib.bib5)); Rafailov et al. ([2023](https://arxiv.org/html/2605.26184#bib.bib15)). Niu et al. ([2026](https://arxiv.org/html/2605.26184#bib.bib36)) prove that SFT and RL cannot be decoupled without mutual degradation, motivating integrated training.

Dynamic weighting and stability. Multi-task learning employs uncertainty weighting Kendall et al. ([2018](https://arxiv.org/html/2605.26184#bib.bib7)), gradient normalization Chen et al. ([2018](https://arxiv.org/html/2605.26184#bib.bib4)), or conflict resolution (MGDA/PCGrad/CAGrad) Sener and Koltun ([2018](https://arxiv.org/html/2605.26184#bib.bib20)); Yu et al. ([2020](https://arxiv.org/html/2605.26184#bib.bib24)); Liu et al. ([2021](https://arxiv.org/html/2605.26184#bib.bib10)). Recent approaches include Nash-MTL Navon et al. ([2022](https://arxiv.org/html/2605.26184#bib.bib13)), FAMO Liu et al. ([2023](https://arxiv.org/html/2605.26184#bib.bib26)), Aligned-MTL Senushkin et al. ([2023](https://arxiv.org/html/2605.26184#bib.bib27)), SDMGrad Xiao et al. ([2023](https://arxiv.org/html/2605.26184#bib.bib28)), and MoCo Fernando et al. ([2023](https://arxiv.org/html/2605.26184#bib.bib29)). None explicitly models both gradient noise variance and SFT–RL disagreement. Table [1](https://arxiv.org/html/2605.26184#S2.T1) summarizes key differences.

Table 1: Comparison of dynamic weighting methods for SFT–RL mixing. †: derived but omitted due to high estimation variance. ‡: loss-ratio dynamics, not explicit EMA.

Columns: Method; Noise ($\sigma^{2}$); Disagree. ($\Delta g^{2}$); Corr. ($c$); Temporal Smooth; Closed Form. Marks per method are listed in source order; the column positions of empty cells are not recoverable from this copy.

- Uncertainty Kendall et al. ([2018](https://arxiv.org/html/2605.26184#bib.bib7)): ✓ ✓
- GradNorm Chen et al. ([2018](https://arxiv.org/html/2605.26184#bib.bib4)): ✓
- MGDA/PCGrad: ✓
- Nash-MTL Navon et al. ([2022](https://arxiv.org/html/2605.26184#bib.bib13)): ✓
- DWA Liu et al. ([2019](https://arxiv.org/html/2605.26184#bib.bib9)): ‡
- FAMO Liu et al. ([2023](https://arxiv.org/html/2605.26184#bib.bib26)): ✓ ✓ ✓
- Aligned-MTL Senushkin et al. ([2023](https://arxiv.org/html/2605.26184#bib.bib27)): ✓
- SDMGrad Xiao et al. ([2023](https://arxiv.org/html/2605.26184#bib.bib28)): ✓
- MoCo Fernando et al. ([2023](https://arxiv.org/html/2605.26184#bib.bib29)): ✓ ✓
- CHORD Zhang et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib30)): ✓
- SRFT Fu et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib31)): none
- LUFFY Yan et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib32)): ✓
- HPT Lv et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib33)): ✓ ✓
- TRAPO Su et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib34)): ✓
- ASFT Zhu et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib35)): none
- GAC (ours): Noise ✓, Disagree. ✓, Corr. †, Temporal Smooth ✓, Closed Form ✓

Hybrid SFT–RL post-training. CHORD Zhang et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib30)) proposes dual control with a global $\mu$ and token-wise $\varphi(p) = p(1-p)$; GAC builds on CHORD, replacing its heuristic schedule with a noise-aware controller. SRFT Fu et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib31)) uses entropy-aware weighting; LUFFY Yan et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib32)) augments RL with off-policy traces; HPT Lv et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib33)) derives accuracy-gated signal selection. TRAPO Su et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib34)) interleaves SFT and RL within each instance via trust-region SFT. ASFT Zhu et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib35)) anchors the policy to the base distribution via KL regularization. GTA Zeng et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib37)) combines supervised and RL signals with conflict mitigation. GAC differs by operating at the gradient noise level with a closed-form $\mu^{*}$ (Theorem 1, Eq. [3](https://arxiv.org/html/2605.26184#S3.E3)).

## 3 Method

Notation. We write $\mu^{*}$ for the idealized optimal mixing weight, $\sigma_s^{2}, \sigma_r^{2}$ for SFT/RL gradient noise variance (estimated via proxies), $\Delta g^{2}$ for gradient disagreement, and $\alpha_{\mathrm{tgt}}$ for the theoretical target ratio in the MSE derivation. In practice we use a KL-controlled ratio $\alpha_{\mathrm{ctrl}}$ (Eq. [11](https://arxiv.org/html/2605.26184#S3.E11)). Throughout, “SFT” denotes $L_{\mathrm{SFT}\text{-}\varphi}$ with token-wise weighting $\varphi(p) = p(1-p)$ Zhang et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib30)). A full notation table is in the Appendix.

### 3.1 Closed-Form $\mu$ via MSE Minimization

Let SFT and RL provide gradient estimators $\hat{g}_s = g_s^{*} + \varepsilon_s$ and $\hat{g}_r = g_r^{*} + \varepsilon_r$ (noise-free gradients $g_s^{*}, g_r^{*}$ plus zero-mean noise with variances $\sigma_s^{2}, \sigma_r^{2}$). For the mixed gradient $\hat{g}(\mu) = \mu\hat{g}_s + (1-\mu)\hat{g}_r$ and target $g^{\star} = \alpha_{\mathrm{tgt}}\,g_s^{*} + (1-\alpha_{\mathrm{tgt}})\,g_r^{*}$, we derive the optimal mixture weight.

#### On the “target gradient” assumption.

$g^{\star}$ is a local control objective encoding a desired trade-off via $\alpha_{\mathrm{tgt}}$, analogous to trust-region surrogates, not a global optimality claim. Under a first-order approximation, minimizing the MSE to $g^{\star}$ is equivalent to minimizing a local upper bound on $L_{\mathrm{mix}}$ with variance-aware regularization. When $\Delta g^{2} \to \infty$, $\mu^{*} \to \alpha_{\mathrm{tgt}}$: the controller defaults to the user-specified preference. We further employ KL-stabilized $\alpha_{\mathrm{ctrl}}$ and capped updates to prevent abrupt shifts (see Limitations).

###### Definition 1 (Mean Squared Error Objective).

The expected squared error between the mixed gradient and the target is:

$$\mathcal{E}(\mu) \triangleq \mathbb{E}\left[\|\hat{g}(\mu) - g^{\star}\|^{2}\right]. \tag{1}$$

Under the assumption that $\mathbb{E}[\varepsilon_s] = \mathbb{E}[\varepsilon_r] = 0$ and independence $\mathbb{E}[\varepsilon_s \varepsilon_r^{\top}] = 0$, we expand the MSE (see Appendix [A](https://arxiv.org/html/2605.26184#A1) for details). Substituting $\hat{g}(\mu)$ and $g^{\star}$ and using $\hat{g}(\mu) - g^{\star} = (\mu - \alpha_{\mathrm{tgt}})(g_s^{*} - g_r^{*}) + \mu\varepsilon_s + (1-\mu)\varepsilon_r$:

$$\mathcal{E}(\mu) = (\mu - \alpha_{\mathrm{tgt}})^{2}\,\Delta g^{2} + \mu^{2}\sigma_s^{2} + (1-\mu)^{2}\sigma_r^{2}, \tag{2}$$

where $\Delta g^{2} = \|g_s^{*} - g_r^{*}\|^{2}$ denotes gradient disagreement, and cross-terms vanish under independence.

###### Theorem 1 (Optimal Mixture Weight).

The unique minimizer of $\mathcal{E}(\mu)$ over $\mu \in \mathbb{R}$ is:

$$\mu^{*} = \frac{\alpha_{\mathrm{tgt}}\,\Delta g^{2} + \sigma_r^{2}}{\Delta g^{2} + \sigma_s^{2} + \sigma_r^{2}}. \tag{3}$$

###### Proof sketch.

Setting $\frac{\partial \mathcal{E}}{\partial \mu} = 0$ in Eq. [2](https://arxiv.org/html/2605.26184#S3.E2):

$$2(\mu - \alpha_{\mathrm{tgt}})\,\Delta g^{2} + 2\mu\sigma_s^{2} - 2(1-\mu)\sigma_r^{2} = 0. \tag{4}$$

Solving for $\mu$ yields Eq. [3](https://arxiv.org/html/2605.26184#S3.E3). The second derivative $\frac{\partial^{2}\mathcal{E}}{\partial\mu^{2}} = 2(\Delta g^{2} + \sigma_s^{2} + \sigma_r^{2}) > 0$ confirms this is a minimum. ∎

This embodies a bias–variance trade-off: when $\Delta g^{2} \to 0$, $\mu^{*}$ reduces to inverse-variance weighting, $\mu^{*} = \sigma_r^{2}/(\sigma_s^{2} + \sigma_r^{2})$; as $\Delta g^{2} \to \infty$, $\mu^{*} \to \alpha_{\mathrm{tgt}}$.
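
As a quick numerical illustration of Eq. 3 and its two limits, the following minimal Python sketch evaluates the closed form directly. The function names `optimal_mu` and `mse` are illustrative and are not taken from the GAC codebase; the fallback for a zero denominator is an assumption (the objective of Eq. 2 is flat in that case, so any weight is optimal).

```python
def mse(mu, alpha_tgt, delta_g2, sigma_s2, sigma_r2):
    """MSE objective of Eq. 2 for a scalar mixing weight mu."""
    return ((mu - alpha_tgt) ** 2 * delta_g2
            + mu ** 2 * sigma_s2
            + (1.0 - mu) ** 2 * sigma_r2)


def optimal_mu(alpha_tgt, delta_g2, sigma_s2, sigma_r2):
    """Closed-form minimizer of Eq. 2, i.e. Eq. 3."""
    denom = delta_g2 + sigma_s2 + sigma_r2
    if denom <= 0.0:
        # Assumption: with no noise and no disagreement the objective is flat,
        # so we fall back to the target ratio.
        return alpha_tgt
    return (alpha_tgt * delta_g2 + sigma_r2) / denom


# No disagreement: inverse-variance weighting, sigma_r2 / (sigma_s2 + sigma_r2).
print(optimal_mu(0.3, 0.0, 1.0, 3.0))   # 0.75
# Very large disagreement: the weight approaches alpha_tgt = 0.3.
print(optimal_mu(0.3, 1e12, 1.0, 3.0))
```

Noisier RL gradients (larger `sigma_r2`) push the weight toward SFT, and noisier SFT gradients push it toward RL, which is the behavior the controller relies on.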

#### Correlated noise extension.

When independence is violated, the extension with $c = \mathrm{tr}\,\mathrm{Cov}(\varepsilon_s, \varepsilon_r)$ yields:

$$\mu_c^{*} = \frac{\alpha_{\mathrm{tgt}}\,\Delta g^{2} + \sigma_r^{2} - c}{\Delta g^{2} + \sigma_s^{2} + \sigma_r^{2} - 2c}. \tag{5}$$

Empirically, $c$ has a coefficient of variation >0.8; including it yields +0.2pp (not statistically significant) but triples large-$|\Delta\mu|$ events. We omit $c$ in all main experiments (Appendix [B](https://arxiv.org/html/2605.26184#A2)).
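
For readers who want the intermediate step, Eq. 5 can be recovered by the same argument as Theorem 1. The sketch below is a worked derivation rather than a reproduction of the appendix; it assumes, consistently with Eq. 2, that $\sigma_s^{2} = \mathbb{E}\|\varepsilon_s\|^{2}$, $\sigma_r^{2} = \mathbb{E}\|\varepsilon_r\|^{2}$, and $c = \mathbb{E}[\varepsilon_s^{\top}\varepsilon_r]$.

```latex
\begin{align*}
\mathcal{E}_c(\mu)
  &= (\mu-\alpha_{\mathrm{tgt}})^{2}\,\Delta g^{2}
   + \mu^{2}\sigma_s^{2} + (1-\mu)^{2}\sigma_r^{2}
   + 2\mu(1-\mu)\,c, \\
\frac{\partial \mathcal{E}_c}{\partial \mu}
  &= 2(\mu-\alpha_{\mathrm{tgt}})\,\Delta g^{2}
   + 2\mu\sigma_s^{2} - 2(1-\mu)\sigma_r^{2}
   + 2(1-2\mu)\,c = 0, \\
\Longrightarrow\quad
\mu\,\bigl(\Delta g^{2} + \sigma_s^{2} + \sigma_r^{2} - 2c\bigr)
  &= \alpha_{\mathrm{tgt}}\,\Delta g^{2} + \sigma_r^{2} - c .
\end{align*}
```

Dividing by the bracket gives Eq. 5, and setting $c = 0$ recovers Eq. 3. The bracket is nonnegative because $\sigma_s^{2} + \sigma_r^{2} - 2c = \mathbb{E}\|\varepsilon_s - \varepsilon_r\|^{2} \geq 0$, so the stationary point is a minimum whenever the bracket is strictly positive.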

#### Biased estimators.

Realistic RL estimators are biased (clipping, importance sampling, entropy/KL regularizers). With biases $b_s, b_r$, minimizing the MSE upper bound yields (Appendix [A.3](https://arxiv.org/html/2605.26184#A1.SS3)):

$$\tilde{\mu}^{*} = \frac{\alpha_{\mathrm{tgt}}\,\Delta g^{2} + \sigma_r^{2} - c + \langle \Delta b,\,\bar{g}\rangle}{\Delta g^{2} + \sigma_s^{2} + \sigma_r^{2} - 2c + \|\Delta b\|^{2}}, \tag{6}$$

reducing to Eq. [5](https://arxiv.org/html/2605.26184#S3.E5) when $b_{\cdot} = 0$, and further to Eq. [3](https://arxiv.org/html/2605.26184#S3.E3) when $c = 0$.

### 3.2 Proxy Signals

The closed-form $\mu^{*}$ depends on gradient-level quantities $(\sigma_s^{2}, \sigma_r^{2}, \Delta g^{2})$ that are expensive to compute at every step. We employ computationally tractable *proxy uncertainty signals*: (i) advantage dispersion for RL uncertainty, and (ii) length-normalized NLL variance for SFT uncertainty.

#### Theoretical motivation.

For RL, the policy gradient is directly scaled by advantages $A_t$; thus mini-batch advantage variance serves as a proxy for $\mathrm{Var}(\nabla L_r)$. For SFT, per-sample NLL variance captures gradient heterogeneity. Pearson correlations with true gradient statistics are $r = 0.82 \pm 0.04$ for $\sigma_r^{2}$ and $r = 0.76 \pm 0.05$ for $\sigma_s^{2}$, with sanity checks confirming genuine coefficient structure (Appendix [D](https://arxiv.org/html/2605.26184#A4)).

### 3.3 Stability-Motivated Design Guidelines

We provide a stability-motivated analysis under idealized assumptions ($L$-smooth losses, KL-bounded updates). It serves as a motivating analysis for design choices rather than a strict guarantee (Appendix [C](https://arxiv.org/html/2605.26184#A3)).

###### Proposition 2 (Motivating Analysis).

Under idealized smoothness and KL-bounded updates, with $\eta \leq \frac{1}{2L}$, $\lambda \leq \frac{\rho}{\rho+1}$, and bounded cap $\bar{c}$:

$$\mathbb{E}[V_{t+1} - V_t] \leq -\zeta\,\mathbb{E}[\|\nabla L\|^{2}] + \mathcal{O}(\bar{c}^{2}), \tag{7}$$

where $V_t = \|\theta_t - \theta^{*}\|^{2} + \rho\,(\mu_t - \alpha_{\mathrm{tgt}})^{2}$ is a Lyapunov potential and $\zeta = \min\{\frac{1}{2L}, \frac{\rho}{\rho+1}\}$.

Practical implications. (i) Small $\bar{c}$: the $\mathcal{O}(\bar{c}^{2})$ term motivates capping per-step changes; (ii) moderate $\lambda \leq 0.5$: for balanced $\theta$–$\mu$ dynamics. Ablations confirm: increasing $\bar{c}$ from 0.01 to 0.02 raises large-shift events from 3% to 8% (Appendix [L](https://arxiv.org/html/2605.26184#A12)).

### 3.4 Online Estimation

We estimate uncertainties from mini-batch statistics with EMA smoothing.

#### RL uncertainty $\sigma_r^{2}$.

We employ sequence-level advantage dispersion:

$$\sigma_r^{2} = \frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\left(\bar{A}_i - \frac{1}{|\mathcal{B}|}\sum_{j}\bar{A}_j\right)^{2}, \tag{8}$$

where $\bar{A}_i$ denotes the sequence-level normalized advantage for trajectory $i$.
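
Eq. 8 is the population variance of the per-sequence advantages in a mini-batch. A minimal sketch, assuming the sequence-level advantages are already available as a list of floats (the name `rl_uncertainty` is hypothetical):

```python
def rl_uncertainty(seq_advantages):
    """Population variance of sequence-level advantages (Eq. 8)."""
    n = len(seq_advantages)
    if n == 0:
        raise ValueError("need at least one sequence-level advantage")
    mean = sum(seq_advantages) / n
    return sum((a - mean) ** 2 for a in seq_advantages) / n


# Illustrative batch: half the rollouts beat the group baseline, half do not.
print(rl_uncertainty([1.0, -1.0, 1.0, -1.0]))  # 1.0
```

Because the advantages are computed by the RL objective anyway, this statistic reuses an existing tensor and adds only a scalar reduction per update.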

#### SFT uncertainty $\sigma_s^{2}$.

We use length-normalized, trimmed NLL variance:

$$\sigma_s^{2} = \mathrm{Var}_{\mathrm{trim}}(\mathrm{nll}_i), \tag{9}$$

where $\mathrm{nll}_i$ is the length-normalized negative log-likelihood for sample $i$, and trimming excludes the top and bottom 10% of values for robustness.
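
One way to realize the trimmed variance of Eq. 9 is sketched below. Two details are assumptions, since the text does not fix them: the number of values dropped from each tail is `floor(0.10 * n)`, and the variance of the kept values is the population variance (matching the $1/|\mathcal{B}|$ normalization of Eq. 8). The name `trimmed_variance` is hypothetical.

```python
def trimmed_variance(values, trim_frac=0.10):
    """Variance after dropping the lowest and highest trim_frac of values."""
    xs = sorted(values)
    k = int(len(xs) * trim_frac)          # values dropped from EACH tail
    kept = xs[k:len(xs) - k] if k > 0 else xs
    if not kept:
        raise ValueError("trimming removed every value")
    mean = sum(kept) / len(kept)
    return sum((x - mean) ** 2 for x in kept) / len(kept)


# Illustrative per-sample NLLs with one extreme value in each tail.
nll = [100.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, -100.0]
print(trimmed_variance(nll))  # 5.25, the variance of 1..8
```

The two outliers would dominate an untrimmed variance; after trimming they have no effect, which is the robustness property the estimator is chosen for.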

#### Gradient disagreement proxy $\Delta\tilde{g}^{2}$.

Computing true per-objective gradients at every step is expensive. Since both RL and SFT gradients share the form $\nabla_{\theta} L = \mathbb{E}[\sum_t c_t \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)]$ with different coefficients $c_t$ (advantages for RL, weights for SFT), the coefficient difference captures disagreement. With z-score normalization:

$$\Delta\tilde{g}^{2} = \mathbb{E}_{i\in\mathcal{B}}\Big[\mathrm{mean}_{t\in\mathrm{resp}(i)}\big(\tilde{g}_{s,it} - \tilde{g}_{r,it}\big)^{2}\Big], \tag{10}$$

where $\tilde{g}_{s,it}, \tilde{g}_{r,it}$ are z-normalized coefficients. This proxy correlates with the true $\Delta g^{2}$ at $r = 0.84$, a correlation that is destroyed by shuffling (Appendix [D](https://arxiv.org/html/2605.26184#A4)). Statistics are updated every $f_{\mu}$ steps with EMA smoothing.
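
The proxy of Eq. 10 can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the per-token coefficients are given as nested lists `coefs[i][t]` aligned on the same response tokens for both objectives, and z-normalization uses the batch-wide mean and standard deviation of each coefficient set (the exact normalization scope is described in the paper's Appendix D and is not reproduced here). The names `z_normalize` and `disagreement_proxy` are hypothetical.

```python
import math


def z_normalize(rows, eps=1e-8):
    """Z-score every coefficient with the batch-wide mean and std (assumed scope)."""
    flat = [c for row in rows for c in row]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((c - mean) ** 2 for c in flat) / len(flat))
    return [[(c - mean) / (std + eps) for c in row] for row in rows]


def disagreement_proxy(sft_coefs, rl_coefs):
    """Eq. 10: batch mean of the per-sequence mean squared coefficient gap."""
    zs, zr = z_normalize(sft_coefs), z_normalize(rl_coefs)
    per_seq = [
        sum((a - b) ** 2 for a, b in zip(s_row, r_row)) / len(s_row)
        for s_row, r_row in zip(zs, zr)
    ]
    return sum(per_seq) / len(per_seq)


sft = [[1.0, 2.0], [3.0, 4.0]]
print(disagreement_proxy(sft, [[2.0, 4.0], [6.0, 8.0]]))      # near 0: same shape
print(disagreement_proxy(sft, [[-1.0, -2.0], [-3.0, -4.0]]))  # near 4: opposite sign
```

Because of the z-scoring, the proxy is insensitive to the overall scale of either coefficient set and responds only to how differently the two objectives weight individual tokens.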

#### KL-controlled target ratio $\alpha_{\mathrm{ctrl}}$.

We adapt $\alpha_{\mathrm{ctrl}}$ using a smoothed KL controller to prevent destabilizing drift. With hysteresis band $h$ and step sizes $\eta_{\uparrow}, \eta_{\downarrow}$:

$$\alpha_{\mathrm{ctrl},t+1} = \mathrm{clip}\big(\alpha_{\mathrm{ctrl},t}\cdot\exp(s_t),\;[\alpha_{\min}, \alpha_{\max}]\big), \tag{11}$$

where the step $s_t$ follows a hysteresis rule (Appendix [G](https://arxiv.org/html/2605.26184#A7)). This yields the online estimator:

$$\mu_t = \frac{\hat{\alpha}_{\mathrm{ctrl}}\,\Delta\tilde{g}_t^{2} + \hat{\sigma}_{r,t}^{2}}{\Delta\tilde{g}_t^{2} + \hat{\sigma}_{s,t}^{2} + \hat{\sigma}_{r,t}^{2}}. \tag{12}$$
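
The multiplicative update of Eq. 11 is easy to sketch once a step rule is fixed. The hysteresis rule itself is given in the paper's Appendix G and is not reproduced here, so `kl_step` below is a hypothetical stand-in: it assumes the step is $+\eta_{\uparrow}$ when the smoothed KL exceeds the upper edge of the band $\mathrm{KL}_{\mathrm{tgt}}(1+h)$, $-\eta_{\downarrow}$ when it falls below $\mathrm{KL}_{\mathrm{tgt}}(1-h)$, and zero inside the band. The direction of the response (more SFT weight when KL is high) is also an assumption. Default values are the mainline hyperparameters reported in the Setup section.

```python
import math


def kl_step(kl_ema, kl_tgt, h, eta_up, eta_down):
    """Hypothetical hysteresis rule; the paper's actual rule is in its Appendix G."""
    if kl_ema > kl_tgt * (1.0 + h):
        return eta_up        # assumed: drift too large, raise the SFT target ratio
    if kl_ema < kl_tgt * (1.0 - h):
        return -eta_down     # assumed: drift small, relax toward RL
    return 0.0               # inside the band: hold


def update_alpha_ctrl(alpha, kl_ema, kl_tgt=0.02, h=0.1,
                      eta_up=0.2, eta_down=0.3,
                      alpha_min=0.1, alpha_max=0.95):
    """Eq. 11: multiplicative update followed by clipping to [alpha_min, alpha_max]."""
    s = kl_step(kl_ema, kl_tgt, h, eta_up, eta_down)
    return min(max(alpha * math.exp(s), alpha_min), alpha_max)


print(update_alpha_ctrl(0.5, kl_ema=0.02))   # inside the band: stays at 0.5
print(update_alpha_ctrl(0.9, kl_ema=0.05))   # would exceed alpha_max: clipped to 0.95
```

The dead band means small KL fluctuations around the target leave the ratio untouched, and the multiplicative form keeps the update proportional to the current value.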

#### Relationship between $\alpha_{\mathrm{tgt}}$ and $\alpha_{\mathrm{ctrl}}$.

The theoretical closed form uses a fixed $\alpha_{\mathrm{tgt}}$ to derive the controller structure. In practice, we instantiate it with a time-varying $\alpha_{\mathrm{ctrl}}(t)$ responding to KL feedback, preserving the bias–variance–disagreement trade-off while adding safety control.

### 3.5 Guided Adaptive Controller

Directly using $\mu^{*}$ (in practice, the online estimate of Eq. 12) can induce abrupt shifts. We compute the training $\mu$ in three guarded steps:

1. EMA smoothing: $\mu_{\mathrm{ada}} = \beta\,\mu_{t-1} + (1-\beta)\,\mu^{*}$ with $\beta \in [0, 1)$.
2. Schedule prior blending: $\mu_{\mathrm{blend}} = (1-\lambda)\,\mu_{\mathrm{prior}} + \lambda\,\mu_{\mathrm{ada}}$, where $\mu_{\mathrm{prior}}$ is a warmup+cosine schedule.
3. Per-step change cap: let $\delta_t = \mathrm{clip}(\mu_{\mathrm{blend}} - \mu_{t-1},\,-\bar{c},\,\bar{c})$, then

   $$\mu_t = \mathrm{clip}\big(\mu_{t-1} + \delta_t,\,[\mu_{\min}, \mu_{\max}]\big). \tag{13}$$

Relationship between theory and practice. The noise-aware $\mu^{*}$ provides a principled initialization that encodes how uncertainty and disagreement should shift the mixture. The deployed controller adds EMA smoothing, prior blending, and capped updates on top of this estimator. These are standard control mechanisms analogous to PPO clipping Schulman et al. ([2017](https://arxiv.org/html/2605.26184#bib.bib19)) and Adam momentum Loshchilov and Hutter ([2019](https://arxiv.org/html/2605.26184#bib.bib12)), and the empirical gains should be attributed to the full stack rather than to the closed form alone. That said, causal attribution (Table [7](https://arxiv.org/html/2605.26184#S4.T7)) confirms the noise-aware estimator contributes +3.7pp, whereas EMA alone adds only +0.2pp, indicating that the MSE-derived signal is the dominant contributor. The complete procedure is given in Algorithm [1](https://arxiv.org/html/2605.26184#alg1) (Appendix [O](https://arxiv.org/html/2605.26184#A15)).

## 4 Experiments

![Refer to caption](https://arxiv.org/html/2605.26184v1/x1.png)

Figure 2: Performance and stability metrics across training under four mixing policies (HPCD, WCF, QCM, GAC). (a) Rollout accuracy: GAC consistently leads from ~200 steps. (b) Response length: GAC maintains a moderate regime (~1.6–2.0k tokens), while QCM and HPCD exhibit length spikes (2.5–3.0k) indicative of reward hacking. (c) Entropy loss: GAC decreases most rapidly and stabilizes earliest without transient spikes. (d) Policy-gradient loss: GAC shows the smallest oscillation amplitude. Raw traces (faint) and EMA-smoothed curves (bold) are overlaid.

![Refer to caption](https://arxiv.org/html/2605.26184v1/x2.png)

Figure 3: Mixing weight dynamics and driving uncertainty signals. (e) $\mu$: GAC starts near 0.85 (SFT-dominated), gradually decreasing to ~0.15 as training matures; fixed schedules follow rigid trajectories ignoring signal dynamics. (f) SFT uncertainty $\sigma_s^{2}$: gradual decline as the model converges on supervised targets. (g) RL uncertainty $\sigma_r^{2}$: GAC dampens RL noise amplification, with 20–40% lower variance during QCM’s $\sigma_r^{2}$ spikes. Notably, $\mu$ tracks $\sigma_r^{2}$ rather than KL, confirming the noise-aware estimator drives $\mu$ during >93% of steps. (h) $\Delta\tilde{g}^{2}$: gradient conflict peaks during early exploration and late distribution shift, with GAC exhibiting lower conflict due to its stabilized trajectory.

### 4.1 Setup

Data. We sample $N_{\mathrm{SFT}} = 5{,}000$ SFT instances and $N_{\mathrm{RL}} = 20{,}000$ RL prompts from OpenR1-Math-220k AI-MO ([2024](https://arxiv.org/html/2605.26184#bib.bib8)); Guo et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib6)). For cross-domain experiments, we use MBPP Austin et al. ([2021](https://arxiv.org/html/2605.26184#bib.bib1)), HumanEval Chen et al. ([2021](https://arxiv.org/html/2605.26184#bib.bib3)), GPQA Rein et al. ([2023](https://arxiv.org/html/2605.26184#bib.bib16)), SciBench Wang et al. ([2023](https://arxiv.org/html/2605.26184#bib.bib23)), and BBH logical subsets Suzgun et al. ([2022](https://arxiv.org/html/2605.26184#bib.bib22)).

Model and training. Base model: Qwen2.5-7B-Instruct. Training: SFT via AdamW Loshchilov and Hutter ([2019](https://arxiv.org/html/2605.26184#bib.bib12)) (cosine annealing, batch 64); RL via GRPO (batch 32, $K = 8$ rollouts, PPO-style KL control). Mainline configuration: $\beta = 0.99$, $\bar{c} = 0.01$, $\lambda = 0.5$, $f_{\mu} = 10$, trimmed NLL variance with 10% tail trimming, and KL controller with $(\mathrm{KL}_{\mathrm{tgt}}, \alpha_{\min}, \alpha_{\max}, \eta_{\uparrow}, \eta_{\downarrow}, h) = (0.02, 0.1, 0.95, 0.2, 0.3, 0.1)$ and KL-EMA coefficient 0.9. Additional hyperparameter details are provided in Appendix [H](https://arxiv.org/html/2605.26184#A8).

Memory and compute overhead. GAC adds only 3 EMA scalars beyond the standard SFT–RL batches, incurring <1% wall-time overhead and no measurable memory increase (Appendix [E](https://arxiv.org/html/2605.26184#A5)).

Baselines. We compare against: (i) mixing schedules (HPCD, WCF, QCM); (ii) multi-objective solvers (MGDA, PCGrad, CAGrad, Nash-MTL, DWA); (iii) rule-based controllers (KL-ctrl, RewVar-ctrl, GradNorm-ctrl); (iv) RL-free methods (DPO, IPO); (v) recent hybrid SFT–RL methods (CHORD Zhang et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib30)), SRFT Fu et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib31)), LUFFY Yan et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib32)), HPT Lv et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib33))). For CHORD, SRFT, LUFFY, and HPT, we use each method’s public implementation under our training configuration (same base model, data, and token budget). Implementation details are provided in Appendix [I](https://arxiv.org/html/2605.26184#A9).

Baseline fairness. Table [2](https://arxiv.org/html/2605.26184#S4.T2) summarizes adaptation details for each recent hybrid baseline. All methods use the same base model, data splits, and token budget (≈1.2B). Multi-objective solvers compute $\nabla L_{\mathrm{SFT}}$ and $\nabla L_{\mathrm{RL}}$ on shared mini-batches (2 backward passes), with $\ell_{2}$-normalized RL gradients and grid-searched hyperparameters (Appendix [I](https://arxiv.org/html/2605.26184#A9)).

Table 2: Baseline adaptation details for recent hybrid SFT–RL methods. All share the same base model, data, and token budget.

Reproducibility. All main results report mean ± std over 3 random seeds. We report Cohen’s $d$ alongside $p$-values; significance claims require both $p < 0.05$ and $d > 0.8$. With only 3 seeds, these inferential statistics should be interpreted as indicative rather than definitive. Additional robustness checks (weak priors, $\alpha_{\mathrm{ctrl}}$ stability, overhead, baseline tuning) are in the Appendix.
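
For reference, the effect-size half of this criterion can be computed as below. The text does not say which variant of Cohen's $d$ is used, so the pooled-standard-deviation form is an assumption; the function name `cohens_d` and the sample values are purely illustrative and are not results from the paper.

```python
import math


def cohens_d(a, b):
    """Cohen's d with pooled sample standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two observations")
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # unbiased sample variance
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    if pooled == 0.0:
        raise ValueError("pooled standard deviation is zero")
    return (ma - mb) / pooled


# Made-up three-seed scores for two methods, for illustration only.
print(cohens_d([2.0, 3.0, 4.0], [1.0, 2.0, 3.0]))  # 1.0
```

With three seeds per method the pooled standard deviation rests on four degrees of freedom, which is one reason the reported statistics are best read as indicative.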

A practitioner summary with recommended defaults, robustness analysis, and deployment guidelines is provided in Appendix [P](https://arxiv.org/html/2605.26184#A16).

### 4.2 Main Results

Table 3: Mathematical reasoning and knowledge tasks (mean ± std, 3 seeds). Best in bold. †: $p < 0.05$, Cohen’s $d > 0.8$ vs. best baseline.

#### Mathematical reasoning.

Table [3](https://arxiv.org/html/2605.26184#S4.T3) shows GAC achieves the highest scores across all benchmarks. Among recent hybrid methods, HPT (63.4% AMC) and LUFFY (63.1% AMC) are the strongest competitors; GAC + Token-$\varphi$ surpasses both by +3.8–4.1pp on AMC, with consistent gains across all four metrics.

Contribution separation. GAC w/o $\varphi$ (plain SFT) achieves 65.8% AMC, outperforming KL-ctrl (62.8%) by +3.0pp and CHORD (62.5%) by +3.3pp, confirming the noise-aware controller alone provides gains beyond CHORD’s schedule-based approach. Combined with Token-$\varphi$, GAC reaches 67.2%, demonstrating complementary benefits. To isolate GAC’s independent contribution from Token-$\varphi$, we also evaluate HPT + Token-$\varphi$ (64.2%) and LUFFY + Token-$\varphi$ (63.8%) in Table [5](https://arxiv.org/html/2605.26184#S4.T5); GAC + Token-$\varphi$ still outperforms HPT + Token-$\varphi$ by +3.0pp ($p < 0.05$, though with only 3 seeds this should be interpreted cautiously). GAC also exhibits lower cross-seed variance (±0.4 vs. ±0.5–1.0 for baselines).

#### Code generation.

Table [4](https://arxiv.org/html/2605.26184#S4.T4) shows GAC achieves 78.8% MBPP (+3.4pp over CHORD, +2.8pp over HPT) and 83.5% HumanEval (+2.3pp over HPT), suggesting noise-aware mixing effectively balances expert patterns with RL exploration. Notably, the gain on MBPP exceeds that on HumanEval, consistent with MBPP’s higher reward variance amplifying the benefit of noise-aware $\mu$ adaptation.

Table 4: Code generation (pass@1, %, 3 seeds). †: $p < 0.05$, $d > 0.8$.

#### Scientific and logical reasoning.

GAC achieves 43.5% on GPQA (+3.9pp over CHORD, +3.1pp over HPT) and 41.2% on SciBench (+2.5pp over HPT). On BBH logical subsets, GAC averages 65.7% (+3.1pp over HPT). The asymmetric gains across benchmarks reflect varying noise profiles: tasks with higher reward variance (GPQA) benefit more from noise-aware adaptation than tasks with lower variance (SciBench). Full results are provided in Appendix [J](https://arxiv.org/html/2605.26184#A10).

#### Training dynamics\.

Figure[2](https://arxiv.org/html/2605.26184#S4.F2)shows GAC maintains the highest rollout accuracy, a moderate response length regime, rapid entropy stabilization, and the smallest policy\-gradient loss oscillation\.

#### Mixing weight dynamics\.

Figure[3](https://arxiv.org/html/2605.26184#S4.F3)reveals how GAC alignsμ\\muwith evolving uncertainty signals across three phases: early SFT\-dominated mixing \(μ∼0\.85\\mu\{\\sim\}0\.85, steps 0–200\), mid\-training adaptive shifting \(steps 200–800\) whereμ\\mutracksσr2\\sigma\_\{r\}^\{2\}rather than KL, and late RL\-dominated refinement \(μ∼0\.15\\mu\{\\sim\}0\.15, steps 800\+\)\. TheΔ​g~2\\Delta\\tilde\{g\}^\{2\}\-dominated fallback accounts for<<7% of steps \(Appendix[M](https://arxiv.org/html/2605.26184#A13)\)\.

#### Pure\-GRPO underperformance\.

Pure-GRPO achieves only 52.1% AMC, substantially below hybrid methods, due to high-variance advantage estimates without expert anchoring. GAC shifts $\mu$ toward SFT when $\sigma_r^2$ is elevated, preventing collapse (Appendix [N](https://arxiv.org/html/2605.26184#A14)).

#### Orthogonality of Token-$\varphi$.

Table [5](https://arxiv.org/html/2605.26184#S4.T5) reports results when Token-$\varphi$ is added to HPT and LUFFY. HPT + Token-$\varphi$ reaches 64.2% AMC (+0.8pp over HPT alone), while GAC + Token-$\varphi$ (67.2%) still outperforms it by +3.0pp ($p<0.05$, $d=0.72$), isolating GAC’s independent contribution.

Table 5: Token-$\varphi$ is orthogonal: adding it to other methods. AMC accuracy (%, 3 seeds).

#### Model scale experiments\.

Table [6](https://arxiv.org/html/2605.26184#S4.T6) reports AMC results on Qwen2.5-1.5B, 7B, and 14B-Instruct. At 14B, GAC + Token-$\varphi$ achieves 74.1%, outperforming HPT by +3.3pp, comparable to the 7B margin (+3.8pp). At 1.5B, the gain attenuates to +2.2pp (within one standard deviation), expected because smaller models exhibit lower $\sigma_s^2/\sigma_r^2$ dynamic range, reducing the scope for noise-aware adaptation.

Table 6: Model scale experiments (AMC accuracy %, 3 seeds). Gains over HPT are larger at 7B and 14B than at 1.5B: +2.2pp at 1.5B, +3.8pp at 7B, +3.3pp at 14B.

#### Training health\.

Process metrics (effective RL tokens, clipping ratio, KL-trigger rate) are comparable across methods, supporting token-budget fairness (Appendix [L](https://arxiv.org/html/2605.26184#A12)).

### 4.3 Ablations and Signal Analysis

Causal attribution. Table [7](https://arxiv.org/html/2605.26184#S4.T7) isolates the contribution of each component. Starting from a fixed-$\mu$ baseline (QCM, 62.1%), EMA smoothing alone adds only +0.2pp, and Token-$\varphi$ alone adds +0.4pp. The noise-aware estimator (GAC w/o $\varphi$) contributes +3.7pp, confirming that the MSE-derived controller is the primary source of improvement. The full system reaches 67.2%, within 0.6pp of an oracle $\mu$ upper bound (67.8%).

Table 7: Causal attribution (AMC accuracy, 3 seeds). The noise-aware estimator contributes +3.7pp; EMA and Token-$\varphi$ provide complementary but smaller gains.

Proxy degradation. To validate that the proxy signals genuinely drive the controller, we replace the proposed proxies with deliberately degraded alternatives: (a) constant $\sigma_r^2=1$; (b) shuffled $\Delta\tilde{g}^2$ (breaking temporal structure); (c) random $\Delta\tilde{g}^2\sim\mathcal{U}(0,1)$. Table [8](https://arxiv.org/html/2605.26184#S4.T8) shows that degrading any proxy consistently hurts performance and increases instability, confirming that the controller relies on genuine signal structure rather than incidental regularization from the guardrails alone.
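The three degradations in the proxy ablation can be sketched as simple transforms of a recorded proxy trace. This is an illustrative sketch only: the function names, the seeding, and the idea of operating on a stored list of per-update proxy values are assumptions, not details taken from the released implementation.

```python
import random


def degrade_constant(trace, value=1.0):
    """(a) Replace every entry with a constant, removing all signal."""
    return [value] * len(trace)


def degrade_shuffle(trace, seed=0):
    """(b) Keep the values but permute them, breaking temporal structure."""
    out = list(trace)
    random.Random(seed).shuffle(out)
    return out


def degrade_random(trace, seed=0):
    """(c) Draw i.i.d. values from U(0, 1), ignoring the trace entirely."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, 1.0) for _ in trace]


if __name__ == "__main__":
    trace = [0.9, 0.7, 0.4, 0.3, 0.2]  # made-up proxy values, not paper data
    print(degrade_constant(trace))
    print(degrade_shuffle(trace))
    print(degrade_random(trace))
```

Shuffling preserves the marginal distribution of the proxy while destroying its timing, which is what separates case (b) from the fully random case (c).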

Table 8: Proxy degradation ablation (AMC, 3 seeds). Degrading any proxy signal reduces accuracy and increases large-shift events, confirming genuine signal dependence.

Stability and component ablations. Table [9](https://arxiv.org/html/2605.26184#Sx2.T9) shows GAC reduces KL-drift area by 28% and large-shift events by 73% relative to constant-$\mu$ mixing. Removing the cap degrades AMC by 2.5pp; removing EMA drops AMC by 1.8pp. With $\lambda=1.0$ (no prior), AMC still reaches 66.1%, +3.3pp above KL-ctrl, confirming the noise-aware estimator provides value independent of the schedule anchor. Full component ablation details are in Appendices [K](https://arxiv.org/html/2605.26184#A11) and [L](https://arxiv.org/html/2605.26184#A12).

## 5 Conclusion

GAC reframes SFT–RL mixing as a noise-aware control problem. The MSE-derived estimator (Eq. [3](https://arxiv.org/html/2605.26184#S3.E3)), instantiated via validated proxy signals, provides a theoretically motivated mixing signal; the guided controller (Algorithm [1](https://arxiv.org/html/2605.26184#alg1)) adds standard regularization for stable deployment. Causal attribution confirms the noise-aware estimator as the dominant contributor (+3.7pp), with EMA and Token-$\varphi$ providing complementary gains. On tasks with verifiable rewards, the controller alone (GAC w/o $\varphi$) outperforms KL-ctrl by +3.0pp and CHORD by +3.3pp on AMC; the full system outperforms HPT + Token-$\varphi$ by +3.0pp. Scale experiments (1.5B, 7B, 14B) show larger gains at 7B and 14B (+3.8pp and +3.3pp) than at 1.5B (+2.2pp). KL-drift area decreases by 28% and large $|\Delta\mu|$ events by >70%, at <1% wall-time overhead. The method is validated on structured-reward tasks; extension to open-ended alignment with learned reward models remains future work.

## Ethics Statement

This work uses only publicly available datasets\. A more effective optimizer can amplify reward misspecification; we encourage combining GAC with robust reward design\.

## Limitations

Proposition [2](https://arxiv.org/html/2605.26184#Thmtheorem2) relies on idealized assumptions, providing design guidelines rather than convergence guarantees. All experiments use verifiable-reward tasks; transfer to open-ended alignment with learned reward models remains open. The deployed controller adds engineering layers (EMA, prior, cap) beyond the closed-form; gains should be attributed to the full stack (see causal attribution in Table [7](https://arxiv.org/html/2605.26184#S4.T7)).

Table 9: Stability metrics across 3 seeds (mean ± std).

## References

- Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program synthesis with large language models. *arXiv preprint arXiv:2108.07732*.
- Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. *arXiv preprint arXiv:2204.05862*.
- Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. 2021. Evaluating large language models trained on code. *arXiv preprint arXiv:2107.03374*.
- Chen et al. (2018) Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. 2018. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In *Proceedings of ICML*.
- Christiano et al. (2017) Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, et al. 2017. Deep reinforcement learning from human preferences. In *Proceedings of NeurIPS*.
- Guo et al. (2025) Guo D., Yang D., Zhang H., et al. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. *arXiv preprint arXiv:2501.12948*.
- Kendall et al. (2018) Alex Kendall, Yarin Gal, and Roberto Cipolla. 2018. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In *Proceedings of CVPR*.
- AI-MO (2024) AI-MO. 2024. NuminaMath-1.5 dataset card. Hugging Face Datasets. URL: [https://huggingface.co/datasets/AI-MO/NuminaMath-1.5](https://huggingface.co/datasets/AI-MO/NuminaMath-1.5).
- Liu et al. (2019) Shikun Liu, Edward Johns, and Andrew J. Davison. 2019. End-to-end multi-task learning with attention. In *Proceedings of CVPR*.
- Liu et al. (2021) Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. 2021. Conflict-averse gradient descent for multi-task learning. In *Proceedings of NeurIPS*.
- Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. SGDR: Stochastic gradient descent with warm restarts. In *Proceedings of ICLR*.
- Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In *Proceedings of ICLR*.
- Navon et al. (2022) Aviv Navon, Idan Achituve, Haggai Maron, Gal Chechik, and Ethan Fetaya. 2022. Multi-task learning as a bargaining game. In *Proceedings of ICML*.
- Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, et al. 2022. Training language models to follow instructions with human feedback. In *Proceedings of NeurIPS*.
- Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. In *Proceedings of NeurIPS*.
- Rein et al. (2023) David Rein, Betty Li Hou, Asa Cooper Stickland, et al. 2023. GPQA: A graduate-level google-proof Q&A benchmark. *arXiv preprint arXiv:2311.12022*.
- Schulman et al. (2015a) John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 2015. High-dimensional continuous control using generalized advantage estimation. *arXiv preprint arXiv:1506.02438*.
- Schulman et al. (2015b) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In *Proceedings of ICML*.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. *arXiv preprint arXiv:1707.06347*.
- Sener and Koltun (2018) Ozan Sener and Vladlen Koltun. 2018. Multi-task learning as multi-objective optimization. In *Proceedings of NeurIPS*.
- Shao et al. (2024) Shao Z., Wang P., Zhu Q., et al. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. *arXiv preprint arXiv:2402.03300*.
- Suzgun et al. (2022) Mirac Suzgun, Nathan Scales, Nathanael Schärli, et al. 2022. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. *arXiv preprint arXiv:2210.09261*.
- Wang et al. (2023) Xiaoxuan Wang, Ziniu Hu, Pan Lu, et al. 2023. SciBench: Evaluating college-level scientific problem-solving abilities of large language models. *arXiv preprint arXiv:2307.10635*.
- Yu et al. (2020) Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. 2020. Gradient surgery for multi-task learning. In *Proceedings of NeurIPS*.
- Azar et al. (2024) Azar M. G., Rowland M., Piot B., Guo Z. D., Calandriello D., Valko M., and Munos R. 2024. A general theoretical paradigm to understand learning from human preferences. In *Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS)*, Proceedings of Machine Learning Research, pages 4447–4455.
- Liu et al. (2023) Bo Liu, Yihao Feng, Peter Stone, and Qiang Liu. 2023. FAMO: Fast adaptive multitask optimization. In *Proceedings of NeurIPS*.
- Senushkin et al. (2023) Dmitry Senushkin, Nikolay Patakin, Arseny Kuznetsov, and Anton Konushin. 2023. Independent component alignment for multi-task learning. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*.
- Xiao et al. (2023) Peiyao Xiao, Hao Ban, and Kaiyi Ji. 2023. Direction-oriented multi-objective learning: Simple and provable stochastic algorithms. In *Proceedings of NeurIPS*.
- Fernando et al. (2023) Heshan Fernando, Han Shen, Miao Liu, Subhajit Chaudhury, Keerthiram Murugesan, and Tianyi Chen. 2023. Mitigating gradient bias in multi-objective learning: A provably convergent approach. In *Proceedings of ICLR*.
- Zhang et al. (2025) Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, and Jingren Zhou. 2025. On-policy RL meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting. *arXiv preprint arXiv:2508.11408*.
- Fu et al. (2025) Yuqian Fu, Tinghong Chen, Jianhao Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, and Dongbin Zhao. 2025. SRFT: A single-stage method with supervised and reinforcement fine-tuning for reasoning. *arXiv preprint arXiv:2506.19767*.
- Yan et al. (2025) Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. 2025. Learning to reason under off-policy guidance. *arXiv preprint arXiv:2504.14945*.
- Lv et al. (2025) Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, and Bowen Zhou. 2025. Towards a unified view of large language model post-training. *arXiv preprint arXiv:2509.04419*.
- Su et al. (2025) Mingyu Su, Jian Guan, Yuxian Gu, Minlie Huang, and Hongning Wang. 2025. Trust-region adaptive policy optimization. *arXiv preprint arXiv:2512.17636*. (ICLR 2026)
- Zhu et al. (2025) He Zhu, Junyou Su, Peng Lai, Ren Ma, Wenjia Zhang, Linyi Yang, and Guanhua Chen. 2025. Anchored supervised fine-tuning. *arXiv preprint arXiv:2509.23753*. (ICLR 2026)
- Niu et al. (2026) Xueyan Niu, Bo Bai, Wei Han, and Weixi Zhang. 2026. On the non-decoupling of supervised fine-tuning and reinforcement learning in post-training. *arXiv preprint arXiv:2601.07389*.
- Zeng et al. (2025) Min Zeng, Jingfei Sun, Xueyou Luo, Shiqi Zhang, Li Xie, Caiquan Liu, and Xiaoxin Chen. 2025. GTA: Supervised-guided reinforcement learning for text classification with large language models. In *Findings of the Association for Computational Linguistics: EMNLP 2025*, pages 1050–1060.

## Appendix A Mathematical Derivations

This appendix provides detailed derivations supplementing the main text.

### A.1 Full MSE Expansion

Starting from $\hat{g}(\mu)=\mu\hat{g}_s+(1-\mu)\hat{g}_r$ and target $g^{\star}=\alpha_{\mathrm{tgt}}g_s^{*}+(1-\alpha_{\mathrm{tgt}})g_r^{*}$, and writing $\hat{g}_s=g_s^{*}+\varepsilon_s$ and $\hat{g}_r=g_r^{*}+\varepsilon_r$:

$$\hat{g}(\mu)-g^{\star}=(\mu-\alpha_{\mathrm{tgt}})(g_s^{*}-g_r^{*})+\mu\varepsilon_s+(1-\mu)\varepsilon_r. \tag{14}$$

Taking the expectation of the squared norm under $\mathbb{E}[\varepsilon_s]=\mathbb{E}[\varepsilon_r]=0$ and $\mathbb{E}[\varepsilon_s\varepsilon_r^{\top}]=0$, cross-terms vanish, yielding Eq. [2](https://arxiv.org/html/2605.26184#S3.E2) in the main text.

### A.2 Proof of Theorem [1](https://arxiv.org/html/2605.26184#Thmtheorem1)

The proof sketch appears in the main text. Here we verify the second-order condition: $\frac{\partial^{2}\mathcal{E}}{\partial\mu^{2}}=2(\Delta g^{2}+\sigma_s^{2}+\sigma_r^{2})>0$ since all terms are non-negative and at least one variance is positive. This confirms $\mu^{*}$ is a global minimum.
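As a numerical sanity check on this argument, the sketch below minimizes the quadratic implied by Eq. 14 under zero-mean, uncorrelated noise. Eq. 2 and Eq. 3 are not reproduced in this appendix, so the form $\mathcal{E}(\mu)=(\mu-\alpha_{\mathrm{tgt}})^2\Delta g^2+\mu^2\sigma_s^2+(1-\mu)^2\sigma_r^2$ and the resulting stationary point $(\alpha_{\mathrm{tgt}}\Delta g^2+\sigma_r^2)/(\Delta g^2+\sigma_s^2+\sigma_r^2)$ are assumptions inferred from Eq. 14 and the second derivative above. Function names and input values are illustrative.

```python
def mse_bound(mu, alpha_tgt, dg2, var_s, var_r):
    """Quadratic in mu implied by Eq. 14 with zero-mean, uncorrelated noise (assumed form)."""
    return (mu - alpha_tgt) ** 2 * dg2 + mu ** 2 * var_s + (1.0 - mu) ** 2 * var_r


def mu_star(alpha_tgt, dg2, var_s, var_r):
    """Stationary point of mse_bound; the denominator is half the second derivative."""
    denom = dg2 + var_s + var_r
    if denom <= 0.0:
        raise ValueError("at least one of dg2, var_s, var_r must be positive")
    return (alpha_tgt * dg2 + var_r) / denom


if __name__ == "__main__":
    args = dict(alpha_tgt=0.5, dg2=0.4, var_s=0.1, var_r=0.3)  # made-up values
    closed = mu_star(**args)
    # Brute-force minimization on a fine grid over [0, 1] as an independent check.
    grid = [i / 10000 for i in range(10001)]
    numeric = min(grid, key=lambda m: mse_bound(m, **args))
    print(closed, numeric)
```

Raising `var_r` moves the minimizer toward 1 (more SFT) and raising `var_s` moves it toward 0, which matches the qualitative behavior described for the controller.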

### A.3 Biased Estimator Upper Bound

Under biased estimators $\hat{g}_s=g_s^{*}+b_s+\varepsilon_s$ and $\hat{g}_r=g_r^{*}+b_r+\varepsilon_r$, the MSE satisfies:

$$\mathcal{E}_{\mathrm{bias}}(\mu)\leq(\mu-\alpha_{\mathrm{tgt}})^{2}\Delta g^{2}+\mu^{2}(\sigma_s^{2}+\|b_s\|^{2})+(1-\mu)^{2}(\sigma_r^{2}+\|b_r\|^{2})+2\mu(1-\mu)c. \tag{15}$$

Minimizing this upper bound via calculus yields Equation [6](https://arxiv.org/html/2605.26184#S3.E6).

For the bias term $\langle b_r-b_s,\bar{g}\rangle$, we adopt an isotropic surrogate: $\langle b_r-b_s,\bar{g}\rangle\approx\gamma\cdot\|\bar{g}\|^{2}$ where $\gamma$ captures the alignment between bias difference and target gradient. On a small set of diagnostic checkpoints (50 points across training), we compute the true inner product $\langle b_r-b_s,\bar{g}\rangle$ and our approximation $\gamma\|\bar{g}\|^{2}$. Empirically, $\gamma\in[-0.12,0.15]$ across training, with mean $|\gamma|=0.08\pm 0.04$. The approximation error contributes <5% to the total MSE numerator.

## Appendix B Cross-Covariance Analysis

Table 10: Cross-covariance $c$ statistics and denominator validity (3 seeds, 800 training steps each).

We tracked cross-covariance $c$ throughout training on AMC and MBPP tasks (Table [10](https://arxiv.org/html/2605.26184#A2.T10)). Key findings:

- $c$ has mean $0.12\pm 0.08$ (AMC) and $0.09\pm 0.06$ (MBPP), with coefficient of variation $>0.8$.
- The denominator $\Delta g^{2}+\sigma_s^{2}+\sigma_r^{2}-2c$ remains positive for 98.7% of training steps; the 1.3% violations occur during early warmup when $\sigma_s^{2},\sigma_r^{2}$ are small, and are handled by clipping $\mu_c^{*}$ to $[0,1]$.
- $|c|$ averages 18% of $\sigma_s^{2}+\sigma_r^{2}$, insufficient to dominate the denominator.
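A minimal sketch of the correlated-noise weight with the clipping guard described above. The denominator is the one quoted in the list; the numerator $\alpha_{\mathrm{tgt}}\Delta g^{2}+\sigma_r^{2}-c$ is an assumption obtained by adding the $2\mu(1-\mu)c$ term from Eq. 15 to the uncorrelated quadratic and setting the derivative to zero (Eq. 5 itself is not reproduced here). The near-zero-denominator fallback to `alpha_tgt` is a hypothetical choice, not something the paper specifies.

```python
def mu_star_correlated(alpha_tgt, dg2, var_s, var_r, c, eps=1e-8):
    """Correlated-noise mixing weight, clipped to [0, 1].

    Assumed numerator: alpha_tgt * dg2 + var_r - c (derived, not quoted).
    Denominator: dg2 + var_s + var_r - 2c, as stated in the text.
    """
    denom = dg2 + var_s + var_r - 2.0 * c
    if abs(denom) < eps:
        # Hypothetical fallback: avoid dividing by ~0 and return the target weight.
        return alpha_tgt
    mu = (alpha_tgt * dg2 + var_r - c) / denom
    # Clipping also absorbs steps where the denominator is not positive.
    return min(1.0, max(0.0, mu))


if __name__ == "__main__":
    # With c = 0 this reduces to the uncorrelated closed form.
    print(mu_star_correlated(0.5, 0.4, 0.1, 0.3, 0.0))
    print(mu_star_correlated(0.5, 0.4, 0.1, 0.3, 0.05))
```

The clip makes the output safe to use even when a noisy estimate of $c$ pushes the raw ratio outside the unit interval.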

#### Summary and honest assessment.

The +0.2pp gain from including $c$ falls within error bars (±0.4–0.6) and is *not statistically significant* (paired $t$-test $p=0.38$). Meanwhile, large-shift events triple from 3% to 9%. We therefore omit $c$ in all main experiments.

#### Why retain the correlated-noise derivation?

Equation [5](https://arxiv.org/html/2605.26184#S3.E5) serves as (i) a theoretical reference showing how correlation *would* affect optimal mixing if reliably estimated, and (ii) a guide for future work on low-variance estimators (e.g., shrinkage, longer EMA windows). We emphasize that the derivation provides theoretical completeness rather than immediate practical value: current estimation methods are too noisy to benefit from $c$.

## Appendix C Stability Analysis

### C.1 Idealized Assumptions for Proposition [2](https://arxiv.org/html/2605.26184#Thmtheorem2)

The sufficient conditions in Proposition [2](https://arxiv.org/html/2605.26184#Thmtheorem2) rely on the following idealized assumptions: (i) $L_s,L_r$ are $L$-smooth; (ii) KL penalty enforces $\|\theta_{t+1}-\theta_t\|\leq B_\theta$; (iii) GAC evolves as $\mu_{t+1}-\mu_t=\mathrm{clip}(\lambda(\tilde{\mu}_t^{*}-\mu_t),\pm\bar{c})$; (iv) proxy signals provide unbiased estimates. *These assumptions are idealized and do not hold exactly in practice* (e.g., RL gradients are biased due to clipping; noise is non-zero-mean). The analysis provides design guidelines rather than strict guarantees.

### C.2 Relating EMA to Convergence Rate

The convergence coefficient $\zeta=\min\{\tfrac{1}{2L},\tfrac{\rho}{\rho+1}\}$ controls expected potential decrease. With $\rho=1$ (equal weighting of $\theta$ and $\mu$ deviations), we have $\zeta=\min\{\tfrac{1}{2L},0.5\}$. The EMA coefficient $\beta=0.99$ corresponds to an effective update rate $(1-\beta)=0.01$ for $\mu$, satisfying $\lambda\leq\frac{\rho}{\rho+1}=0.5$ with margin.

### C.3 Empirical Stability Verification

We verify that key stability conditions hold approximately during training:

- The KL constraint $\|\theta_{t+1}-\theta_t\|\leq B_\theta$ is satisfied for 97% of steps with $B_\theta=0.1$.
- The cap $\bar{c}=0.01$ limits $|\Delta\mu|$ effectively (only 3% of steps exceed 0.02).
- EMA reduces high-frequency $\mu$ oscillations by 73% based on spectral analysis.

The 3% of steps violating $\|\theta_{t+1}-\theta_t\|\leq B_\theta$ occur predominantly during early training (steps 1–100) when gradients are large. These violations do not trigger performance drops: validation accuracy improves monotonically through these steps. The violations are bounded ($\|\theta_{t+1}-\theta_t\|\leq 1.3B_\theta$ in the worst case), and subsequent EMA smoothing dampens any induced $\mu$ oscillations within 5 steps.

## Appendix D Proxy Signal Validation

Table 11: Pearson correlation between proxy signals and true gradient statistics (50 diagnostic points, 3 seeds).

Table 12: Sanity checks for $\Delta\tilde{g}^{2}$ proxy: correlation with true gradient disagreement under different perturbations.

Table 13: Proxy–gradient correlation across tasks (Pearson $r$, 50 diagnostic points, 3 seeds).

Tables [11](https://arxiv.org/html/2605.26184#A4.T11)–[12](https://arxiv.org/html/2605.26184#A4.T12) provide the main proxy validation results referenced in Section [3.2](https://arxiv.org/html/2605.26184#S3.SS2). Table [13](https://arxiv.org/html/2605.26184#A4.T13) extends the analysis to code and scientific reasoning tasks, demonstrating consistent validity across domains.

## Appendix E Compute and Memory Overhead

#### Why the default implementation has negligible overhead.

Our released implementation estimates $(\sigma_s^{2},\sigma_r^{2},\Delta\tilde{g}^{2})$ using only already-available tensors (token log-probs, masks, and advantages), and updates statistics every $f_\mu$ steps (default $f_\mu=10$). The only distributed synchronization is a low-cost all-reduce of first/second moments for $\sigma_r^{2}$; no additional forward/backward passes are introduced. Empirically, this results in $<1\%$ wall-time overhead and no measurable increase in peak memory in our training setup.
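The moment-based synchronization can be illustrated without any distributed library: each worker contributes a count, a sum, and a sum of squares, and the summed triple gives the global variance. This is a sketch under assumptions: the helper names are hypothetical, the summation stands in for a real all-reduce, and which per-sample quantity Eq. 8 feeds into $\sigma_r^{2}$ (rewards or advantages) is not restated in this appendix.

```python
def local_moments(values):
    """Per-worker sufficient statistics: (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def merge_moments(parts):
    """Element-wise sum of per-worker triples, standing in for an all-reduce."""
    n = sum(p[0] for p in parts)
    s1 = sum(p[1] for p in parts)
    s2 = sum(p[2] for p in parts)
    return n, s1, s2


def variance_from_moments(n, s1, s2):
    """Population variance E[x^2] - E[x]^2, floored at 0 against round-off."""
    if n == 0:
        return 0.0
    mean = s1 / n
    return max(0.0, s2 / n - mean * mean)


if __name__ == "__main__":
    # Two simulated workers holding made-up per-sample values.
    workers = [[1.0, 2.0], [3.0, 4.0]]
    merged = merge_moments([local_moments(w) for w in workers])
    print(variance_from_moments(*merged))
```

Only three scalars per worker cross the network, which is consistent with the claim that the synchronization cost is small.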

Table 14: Runtime and peak-memory impact of GAC (Qwen2.5-7B, FSDP, 8 GPUs; mean over 3 runs). “Overhead” is relative to Token-$\varphi$ with the same batch/length.

#### Sensitivity to early-training non-stationarity.

To address the concern that disagreement can change rapidly early in training, we tested smaller $f_\mu$ during the first 100 steps and then restored the default: $f_\mu=2$ for steps 1–100, then $f_\mu=10$. This improves responsiveness (lower KL-area by 3–5%) at a small cost (+0.3% wall-time), while final accuracy remains within ±0.2pp of the default across 3 seeds.
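The two-stage update period described here amounts to a small step-dependent schedule. The sketch below uses the stated values ($f_\mu=2$ for steps 1–100, then $f_\mu=10$); the function names and the `step % period == 0` trigger (mirroring the `mod` test in Algorithm 1) are illustrative assumptions.

```python
def update_period(step, warm_steps=100, warm_period=2, default_period=10):
    """Return f_mu for a 1-indexed training step."""
    return warm_period if step <= warm_steps else default_period


def should_update(step, warm_steps=100, warm_period=2, default_period=10):
    """True when the EMA statistics are refreshed at this step."""
    return step % update_period(step, warm_steps, warm_period, default_period) == 0


if __name__ == "__main__":
    updates = [t for t in range(1, 201) if should_update(t)]
    print(len(updates), updates[:5], updates[-3:])
```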

## Appendix F Robustness to Weak or Misspecified Priors

#### Question.

How strongly does GAC depend on the prior schedule $\mu_{\mathrm{prior}}(t)$? We stress-test GAC by intentionally using *bad priors* (constant $\mu_{\mathrm{prior}}=0.1$ or $0.9$) while keeping the same controller hyperparameters and blending weight $\lambda=0.5$.

Table 15: Robustness to weak/misspecified priors on AMC (3 seeds). “Bad priors” are constant schedules; GAC can still correct due to the adaptive term and capped updates.

## Appendix G Stability of the KL-Controlled $\alpha_{\mathrm{ctrl}}$

#### Hysteresis rule for $\alpha_{\mathrm{ctrl}}$ update.

The step $s_t$ in Eq. [11](https://arxiv.org/html/2605.26184#S3.E11) follows a hysteresis rule:

$$s_t=\begin{cases}\eta_{\uparrow}\big(\frac{\mathrm{KL}_t}{\mathrm{KL}_{\mathrm{tgt}}}-1\big),&\text{if }\frac{\mathrm{KL}_t}{\mathrm{KL}_{\mathrm{tgt}}}>1+h,\\[3pt] -\eta_{\downarrow}\big(1-\frac{\mathrm{KL}_t}{\mathrm{KL}_{\mathrm{tgt}}}\big),&\text{if }\frac{\mathrm{KL}_t}{\mathrm{KL}_{\mathrm{tgt}}}<1-h,\\[3pt] 0,&\text{otherwise}.\end{cases} \tag{16}$$

This design (EMA + hysteresis) prevents rapid oscillations while responding when KL persistently deviates.
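Eq. 16 translates directly into a three-branch function. The sketch below computes only the step $s_t$; how $s_t$ enters the $\alpha_{\mathrm{ctrl}}$ update is defined by Eq. 11, which is not reproduced here. The numeric values of $\eta_{\uparrow}$, $\eta_{\downarrow}$, and $h$ in the usage lines are hypothetical, since this appendix does not state them.

```python
def hysteresis_step(kl, kl_target, eta_up, eta_down, h):
    """Step s_t from Eq. 16: zero inside the dead band [1 - h, 1 + h]."""
    if kl_target <= 0.0:
        raise ValueError("kl_target must be positive")
    ratio = kl / kl_target
    if ratio > 1.0 + h:
        return eta_up * (ratio - 1.0)       # KL too high: positive step
    if ratio < 1.0 - h:
        return -eta_down * (1.0 - ratio)    # KL too low: negative step
    return 0.0                              # inside the band: no change


if __name__ == "__main__":
    # Hypothetical gains and band width, chosen only for illustration.
    for kl in (0.005, 0.011, 0.02):
        print(kl, hysteresis_step(kl, kl_target=0.01, eta_up=0.1, eta_down=0.05, h=0.2))
```

The dead band means small KL fluctuations around the target produce no step at all, which is the mechanism that suppresses chattering.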

#### Does a dynamic $\alpha_{\mathrm{ctrl}}$ cause objective oscillations?

We monitor the variability of $\alpha_{\mathrm{ctrl}}$ and the (smoothed) KL during training. With EMA smoothing and hysteresis, $\alpha_{\mathrm{ctrl}}$ changes slowly and does not exhibit high-frequency oscillations.

Table 16: Stability statistics of $\alpha_{\mathrm{ctrl}}$ and KL on AMC (800 steps, 3 seeds).

## Appendix H Hyperparameter Details

Token budget definition. We define *training token budget* as total forward+backward pass tokens during optimization, excluding evaluation. Specifically: $\mathrm{Budget}=\sum_{t}(|\mathcal{B}_s^{t}|+K_{\mathrm{roll}}\cdot|\mathcal{B}_r^{t}|)\times\bar{L}$, where $K_{\mathrm{roll}}$ is the number of RL rollouts per prompt in GRPO and $\bar{L}$ is the average sequence length. All methods use identical budgets of ≈1.2B training tokens.
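The budget formula can be evaluated with a few lines of code. The sketch below is illustrative: the function name and the batch sizes, rollout count, and average length in the usage lines are made up and are not the paper's configuration.

```python
def token_budget(sft_batch_sizes, rl_batch_sizes, k_roll, avg_len):
    """Budget = sum_t (|B_s^t| + K_roll * |B_r^t|) * L_bar."""
    if len(sft_batch_sizes) != len(rl_batch_sizes):
        raise ValueError("need one SFT and one RL batch size per step")
    sequences = sum(bs + k_roll * br for bs, br in zip(sft_batch_sizes, rl_batch_sizes))
    return sequences * avg_len


if __name__ == "__main__":
    steps = 800  # hypothetical step count
    print(token_budget([64] * steps, [32] * steps, k_roll=8, avg_len=1024))
```

Because each RL prompt is expanded into $K_{\mathrm{roll}}$ rollouts, the RL term usually dominates the budget even when the SFT batch has more prompts.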

SFT variants. *SFT-light*: 1,000 instances, 1 epoch, early stopping at validation loss plateau. *SFT-best*: 5,000 instances, 3 epochs with cosine annealing, checkpoint selected by validation accuracy. Both use identical optimizer settings (AdamW, lr = $2\times 10^{-5}$).

Table 17: Hyperparameter sensitivity (AMC accuracy, mean over 3 seeds). Bold: default configuration.

Table [17](https://arxiv.org/html/2605.26184#A8.T17) presents sensitivity analysis. GAC is robust within tested ranges: $\beta\in[0.95,0.99]$ yields ≤0.5pp variation; the cap $\bar{c}\in[0.003,0.02]$ trades smoothness for responsiveness; prior blend $\lambda\in[0.3,0.7]$ balances data-driven adaptivity with schedule robustness.

## Appendix I Baseline Implementation Details

Multi-objective baselines. Implementation details for multi-objective methods:

- Gradient computation: We compute $\nabla L_{\mathrm{SFT}}$ and $\nabla L_{\mathrm{RL}}$ on shared mini-batches, then apply each solver’s surgery/weighting.
- RL gradient normalization: Following Chen et al. ([2018](https://arxiv.org/html/2605.26184#bib.bib4)), we apply $\ell_2$ normalization to RL gradients before combining, as raw RL gradients exhibit 3–5× higher variance.
- Hyperparameters: MGDA uses Frank-Wolfe solver with 10 iterations; PCGrad uses cosine similarity threshold 0; CAGrad uses $c=0.5$; Nash-MTL uses 5 optimization steps per update; DWA uses temperature $T=2.0$. All tuned via grid search on validation set.

Table 18: Baseline hyperparameter tuning protocol (summary). For each baseline we run a small grid on AMC validation and report the best configuration (same token budget).

RL-free baselines (DPO/IPO). We use identical preference data constructed from SFT responses (chosen: correct solutions; rejected: incorrect solutions from the same prompts). DPO uses $\beta=0.1$ following Rafailov et al. ([2023](https://arxiv.org/html/2605.26184#bib.bib15)); IPO uses $\tau=0.5$. Both train for 3 epochs with lr = $5\times 10^{-6}$, batch size 32, consuming ≈1.2B tokens (matching our budget).

Rule-based controllers:

- KL-ctrl: $\mu_t=1-\mathrm{KL}(\pi_\theta\|\pi_{\mathrm{ref}})/\kappa$, directly using KL divergence.
- RewVar-ctrl: $\mu_t\propto 1/\mathrm{Var}(R)$, inversely proportional to reward variance.
- GradNorm-ctrl: $\mu_t=\|\nabla L_s\|/(\|\nabla L_s\|+\|\nabla L_r\|)$, gradient norm ratio.
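The three rule-based controllers above can be sketched as scalar functions. Several details are assumptions rather than statements from the paper: the clipping of each output to $[0,1]$, the proportionality constant `scale` for RewVar-ctrl (the text gives only $\propto$), and the small `eps` terms that guard against division by zero.

```python
def _clip01(x):
    """Clamp a mixing weight to [0, 1] (assumed; not stated in the text)."""
    return min(1.0, max(0.0, x))


def kl_ctrl(kl, kappa):
    """KL-ctrl: mu_t = 1 - KL / kappa."""
    return _clip01(1.0 - kl / kappa)


def rewvar_ctrl(reward_var, scale=1.0, eps=1e-8):
    """RewVar-ctrl: mu_t proportional to 1 / Var(R); `scale` is a hypothetical constant."""
    return _clip01(scale / (reward_var + eps))


def gradnorm_ctrl(norm_sft, norm_rl, eps=1e-12):
    """GradNorm-ctrl: mu_t = ||grad L_s|| / (||grad L_s|| + ||grad L_r||)."""
    return norm_sft / (norm_sft + norm_rl + eps)


if __name__ == "__main__":
    # Made-up inputs for illustration only.
    print(kl_ctrl(kl=0.05, kappa=0.1))
    print(rewvar_ctrl(reward_var=4.0))
    print(gradnorm_ctrl(norm_sft=1.0, norm_rl=3.0))
```

Each rule reacts to a single signal, whereas the GAC estimator combines two variance terms with a disagreement term.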

Recent hybrid SFT–RL methods. For CHORD Zhang et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib30)), we use the public implementation from the Trinity-RFT repository with the same data and token budget as our method. We run both the CHORD-$\mu$ (global schedule) and CHORD-$\varphi$ (token-wise) configurations; the CHORD (Token-$\varphi$, fixed sched.) entry in our tables corresponds to using both components with the default heuristic schedule. For SRFT Fu et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib31)), we adapt the entropy-aware weighting mechanism to our training pipeline with the default hyperparameters from the original paper. For LUFFY Yan et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib32)), we use advantage-weighted off-policy mixing with the recommended configuration. For HPT Lv et al. ([2025](https://arxiv.org/html/2605.26184#bib.bib33)), we implement the unified policy gradient estimator with accuracy-gated signal switching. All four methods use identical data splits, base model, and token budget (≈1.2B training tokens) as GAC for fair comparison.

## Appendix J Additional Results

Table 19: Scientific reasoning tasks (accuracy %, mean ± std). †: $p<0.05$ vs. best baseline.

Table 20: Logical reasoning on BBH subsets (accuracy %, mean ± std). †: $p<0.05$ vs. best baseline.

## Appendix K Additional Ablations

The causal attribution results appear in the main text (Table [7](https://arxiv.org/html/2605.26184#S4.T7)). Below we provide additional ablation details.

#### Entropy collapse in fixed-schedule baselines.

QCM (constant $\mu=0.58$) exhibits entropy collapse at steps 280–320 in 2 of 3 seeds: entropy drops from 0.035 to 0.008 within 40 steps, accompanied by a 4.2pp accuracy drop. GAC’s noise-guided $\mu$ increases from 0.52 to 0.67 during the same $\sigma_r^{2}$ spike, preventing collapse.

## Appendix L Training Health and Component Ablations

Table 21: Training health indicators across methods (AMC task, mean over 3 seeds). Effective RL tokens = tokens passing importance sampling threshold; Clip ratio = fraction of clipped gradients.

Table 22: Ablation study on GAC components (mean over 3 seeds). $\Delta$: change from full GAC.

## Appendix M Stage-Wise $\mu$ Analysis

We identify three distinct training phases in GAC’s mixing weight dynamics:

*Early phase (steps 0–200):* $\mu$ is initialized high (∼0.85) and both uncertainty signals ($\sigma_s^{2}$, $\sigma_r^{2}$) are elevated. Gradient disagreement $\Delta\tilde{g}^{2}$ peaks as the model explores a broad policy space. GAC begins with SFT-dominated mixing, leveraging expert demonstrations to anchor the policy before RL exploration introduces instability.

*Mid-training (steps 200–800):* $\sigma_s^{2}$ declines steadily while $\sigma_r^{2}$ exhibits periodic spikes correlated with exploration bursts (Figure [3](https://arxiv.org/html/2605.26184#S4.F3)g). GAC responds by dynamically lowering $\mu$ during stable periods and temporarily increasing it when $\sigma_r^{2}$ spikes. This noise-tracking behavior is directly visible in panels (e) and (g). $\mu$ tracks $\sigma_r^{2}$ fluctuations rather than mirroring KL divergence, confirming that the noise-guided estimator drives adaptation during >93% of steps.

*Late phase (steps 800+):* $\mu$ settles near 0.15–0.20, reflecting a mature policy that derives most benefit from RL refinement. $\sigma_r^{2}$ rises modestly due to distribution shift, and $\Delta\tilde{g}^{2}$ increases as objectives diverge. GAC’s capped updates ($\bar{c}=0.01$) prevent overreaction to late-stage fluctuations.

## Appendix N Pure-GRPO Analysis

Pure-GRPO achieves only 52.1% AMC, substantially below SFT-best + RL (58.4%) and all hybrid methods. This underperformance stems from high-variance advantage estimates inherent to GRPO without expert anchoring: the policy simultaneously explores and evaluates, producing noisy reward signals that destabilize KL and entropy regimes. Figure [3](https://arxiv.org/html/2605.26184#S4.F3)g illustrates the resulting $\sigma_r^{2}$ instability under unregulated RL, with variance spikes exceeding 2–3× the GAC-controlled levels. The accompanying response length volatility (Figure [2](https://arxiv.org/html/2605.26184#S4.F2)b) further confirms that pure RL fails to maintain consistent generation behavior.

## Appendix O Algorithm Pseudocode

**Algorithm 1** GAC: Guided Adaptive Controller

**Require:** Learning rate $\eta$, EMA coefficient $\beta$, blend weight $\lambda$, step cap $\bar{c}$, update frequency $f_\mu$

1. Initialize $\mu_0 \leftarrow \mu_{\mathrm{init}} = 0.5$, $\alpha_0 \leftarrow \alpha_{\mathrm{init}}$, $\theta_0$
2. **for** $t = 1, 2, \ldots, T$ **do**
3. Sample RL batch $\mathcal{B}_r$, SFT batch $\mathcal{B}_s$
4. **if** $t \bmod f_\mu = 0$ **then**
5. Update EMA statistics: $\sigma_r^2$ via ([8](https://arxiv.org/html/2605.26184#S3.E8)), $\sigma_s^2$ via ([9](https://arxiv.org/html/2605.26184#S3.E9)), $\Delta\tilde{g}^2$ via ([10](https://arxiv.org/html/2605.26184#S3.E10))
6. **end if**
7. Update $\alpha_t$ via ([11](https://arxiv.org/html/2605.26184#S3.E11)); compute $\mu^*_t$ via ([12](https://arxiv.org/html/2605.26184#S3.E12))
8. $\mu_{\mathrm{ada}} \leftarrow \beta \mu_{t-1} + (1-\beta)\mu^*_t$
9. $\mu_{\mathrm{blend}} \leftarrow (1-\lambda)\mu_{\mathrm{prior}}(t) + \lambda \mu_{\mathrm{ada}}$
10. $\mu_t \leftarrow \mathrm{clip}\big(\mu_{t-1} + \mathrm{clip}(\mu_{\mathrm{blend}} - \mu_{t-1}, \pm\bar{c}),\ [\mu_{\min}, \mu_{\max}]\big)$
11. $L \leftarrow (1-\mu_t) L_{\mathrm{RL}} + \mu_t L_{\mathrm{SFT}}$; $\theta_t \leftarrow \theta_{t-1} - \eta \nabla_\theta L$
12. **end for**

## Appendix P Practitioner Summary

**Practitioner Summary: Deploying GAC**

**What GAC requires:** 3 EMA scalars ($\hat{\sigma}_s^2$, $\hat{\sigma}_r^2$, $\widehat{\Delta g}^2$), one scalar division per $\mu$-update (Eq. [3](https://arxiv.org/html/2605.26184#S3.E3)), and a prior schedule $\mu_{\mathrm{prior}}(t)$. No extra forward/backward passes; $<$1% wall-time overhead.

**Recommended defaults:** $\beta=0.99$ (EMA), $\bar{c}=0.01$ (cap), $\lambda=0.5$ (blend), $f_\mu=10$ (update frequency), 10% tail trimming on NLL variance.

**When to increase $\mu$ toward SFT:** when $\sigma_r^2$ spikes (RL noise is high) or $\Delta\tilde{g}^2$ is large (objectives conflict). The controller handles this automatically.

**Robustness:** GAC is insensitive to the choice of prior schedule: even constant bad priors ($\mu_{\mathrm{prior}}=0.1$ or $0.9$) degrade AMC by $\leq$0.4pp (Table [15](https://arxiv.org/html/2605.26184#A6.T15)). Hyperparameters are stable across $\beta\in[0.95,0.99]$, $\bar{c}\in[0.003,0.02]$, $\lambda\in[0.3,0.7]$.

**Guardrails are standard control mechanisms** analogous to PPO's surrogate clipping Schulman et al. ([2017](https://arxiv.org/html/2605.26184#bib.bib19)) (cf. our per-step cap $\bar{c}$) and Adam's momentum Loshchilov and Hutter ([2019](https://arxiv.org/html/2605.26184#bib.bib12)) (cf. our EMA smoothing $\beta$). They stabilize a principled estimator rather than replace it.
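The recommended defaults and the reported stable ranges can be captured in a small configuration object. The sketch below is one possible way to do so, not part of the paper's code: the class name `GACDefaults`, its field names, and the helper `outside_reported_ranges` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GACDefaults:
    """Recommended defaults from the practitioner summary (names are hypothetical)."""
    beta: float = 0.99           # EMA coefficient
    step_cap: float = 0.01       # per-update cap on |delta mu|
    blend: float = 0.5           # blend weight lambda
    update_freq: int = 10        # update frequency f_mu
    nll_trim_frac: float = 0.10  # tail trimming on NLL variance

    def outside_reported_ranges(self):
        """Return the names of fields outside the ranges reported as stable."""
        ranges = {
            "beta": (0.95, 0.99),
            "step_cap": (0.003, 0.02),
            "blend": (0.3, 0.7),
        }
        return [name for name, (lo, hi) in ranges.items()
                if not lo <= getattr(self, name) <= hi]


# Usage: flag settings that leave the reported stable region.
cfg = GACDefaults(beta=0.9, blend=0.8)
flagged = cfg.outside_reported_ranges()
```

A check like this only says that a setting falls outside the ranges the summary reports as stable; it does not imply that such a setting will fail.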
