Compatibility-Aware Dynamic Fine-Tuning for Large Language Models
Summary
Introduces Compatibility-Aware Dynamic Fine-Tuning (CADFT), an extension of Dynamic Fine-Tuning that controls sample-level optimization variance in LLM supervised fine-tuning, improving stability and generalization.
Cached at: 06/11/26, 01:35 PM
# Compatibility-Aware Dynamic Fine-Tuning for Large Language Models
Source: [https://arxiv.org/html/2606.11206](https://arxiv.org/html/2606.11206)
Yucheng Zhou¹†, Junwei Sheng¹†, Qianning Wang², Jianbing Shen¹🖂
¹SKL-IOTSC, CIS, University of Macau, ²Auckland University of Technology
yucheng.zhou@connect.um.edu.mo, jianbingshen@um.edu.mo
###### Abstract
Supervised Fine-Tuning (SFT) is the predominant paradigm for aligning large language models (LLMs), yet it suffers from optimization instability and limited generalization. Recent work attributes this issue to pathological gradient scaling and proposes Dynamic Fine-Tuning (DFT) to correct it at the token level. However, DFT assumes all demonstrations are equally suitable learning targets, an assumption violated by the strong heterogeneity of large-scale instruction data, where demonstration-policy mismatch induces high-variance updates at the sample level. We introduce Compatibility-Aware Dynamic Fine-Tuning (CADFT), a principled extension of DFT that controls sample-level optimization variance. CADFT derives a dynamic, policy-dependent compatibility signal from model likelihoods to modulate supervised updates, suppressing high-variance gradients from incompatible demonstrations. We further propose a delayed, low-frequency compatibility-guided rewriting strategy to transform persistently incompatible demonstrations into learnable targets. We show that CADFT can be interpreted as a variance-controlled estimator that generalizes token-level stabilization in DFT to the sample level. Extensive experiments demonstrate improved stability, generalization, and cold-start reinforcement learning initialization, while remaining fully supervised and independent of explicit reward modeling.
Yucheng Zhou and Junwei Sheng contributed equally. Corresponding author (🖂): Jianbing Shen.
## 1 Introduction
Supervised fine-tuning (SFT) is the dominant paradigm for aligning large language models (LLMs) with downstream tasks and instruction-following behaviors. By maximizing the likelihood of expert demonstrations under a teacher-forcing regime, SFT provides a simple, stable, and scalable training framework (Ouyang et al., [2022](https://arxiv.org/html/2606.11206#bib.bib6); Chung et al., [2024](https://arxiv.org/html/2606.11206#bib.bib3); Wei et al., [2022](https://arxiv.org/html/2606.11206#bib.bib1)). Despite its empirical success, recent theoretical and empirical studies have revealed that standard SFT suffers from a fundamental optimization pathology. When viewed through the lens of policy optimization, the SFT gradient implicitly corresponds to a distorted objective in which low-probability tokens induce disproportionately large gradient updates (Wu et al., [2025](https://arxiv.org/html/2606.11206#bib.bib21); Chu et al., [2025](https://arxiv.org/html/2606.11206#bib.bib22)). This inverse-probability amplification leads to high gradient variance, training instability, and poor generalization, particularly on reasoning-intensive tasks under distribution shift (Chu et al., [2025](https://arxiv.org/html/2606.11206#bib.bib22); Lin et al., [2017](https://arxiv.org/html/2606.11206#bib.bib19); Zheng et al., [2025](https://arxiv.org/html/2606.11206#bib.bib49)).
Dynamic Fine-Tuning (DFT) was recently proposed to address this issue at the token level (Wu et al., [2025](https://arxiv.org/html/2606.11206#bib.bib21)). DFT reformulates SFT into a probability-aware objective that corrects the pathological gradient scaling induced by rare tokens, yielding bounded and more stable updates. Crucially, DFT achieves this without introducing reinforcement learning components such as reward models, on-policy sampling, or policy optimization algorithms. However, DFT implicitly assumes that all demonstrations in the dataset are equally suitable learning targets. In practice, large-scale instruction datasets are highly heterogeneous. Some demonstrations are well-aligned with the model’s current inductive biases and capability level, while others are excessively complex, poorly structured, or semantically mismatched. Even when token-level gradient instability is corrected, such *demonstration-policy mismatch* can induce high-variance updates at the sample level, leading to inefficient learning and unstable optimization (Zhou et al., [2023](https://arxiv.org/html/2606.11206#bib.bib2); Bengio et al., [2009](https://arxiv.org/html/2606.11206#bib.bib26)).
This observation suggests that stabilizing supervised fine-tuning requires controlling not only *how* token-level gradients are scaled, but also *which* demonstrations exert strong influence on parameter updates, and *to what extent*, under the current model state. In other words, effective fine-tuning demands a mechanism for regulating sample-level compatibility between demonstrations and the evolving policy.
In this work, we propose Compatibility-Aware Dynamic Fine-Tuning (CADFT), a principled extension of DFT that incorporates a dynamic, policy-dependent compatibility signal into the supervised objective. CADFT treats compatibility as a relative measure of demonstration-policy alignment, computed from the model’s own likelihoods and normalized adaptively during training. This signal is used to modulate the strength of sample-level updates, suppressing high-variance gradients induced by incompatible demonstrations while preserving informative supervision.
Importantly, CADFT does not discard low\-compatibility demonstrations outright\. To avoid permanently ignoring difficult but potentially valuable data, we further introduce a conservative, delayed rewriting mechanism that selectively reformulates persistently incompatible demonstrations into targets that lie within the model’s current feasible region\. Rewriting is activated only after a warm\-up phase and at low frequency, preventing premature self\-reinforcement and maintaining training stability\.
CADFT preserves the supervised learning paradigm of DFT and introduces no reinforcement learning, reward modeling, or policy optimization machinery\. From a theoretical perspective, CADFT can be understood as a variance\-controlled extension of DFT that generalizes token\-level stabilization to the sample level\. Empirically, CADFT consistently improves optimization stability, generalization performance, and downstream reinforcement learning initialization across language, code, and multimodal reasoning tasks\.
Our main contributions are as follows:
- We show that samples with low compatibility induce higher-variance gradient updates, and that mitigating such updates improves optimization stability and generalization.
- We propose Compatibility-Aware Dynamic Fine-Tuning (CADFT), a simple and principled method that incorporates a dynamic, normalized compatibility signal to modulate sample-level update strength within a fully supervised framework.
- We introduce a delayed, low-frequency compatibility-guided rewriting strategy that transforms incompatible demonstrations into learnable targets.
- We provide a theoretical interpretation of CADFT as a variance-controlled estimator and empirically demonstrate its effectiveness across mathematical reasoning, code generation, multimodal reasoning, and cold-start reinforcement learning settings.
## 2 Related Work
### 2.1 SFT and RL in LLM Alignment
Supervised Fine-Tuning (SFT) aligns LLMs with downstream tasks (Zhou et al., [2025a](https://arxiv.org/html/2606.11206#bib.bib41), [2026a](https://arxiv.org/html/2606.11206#bib.bib42), [2026b](https://arxiv.org/html/2606.11206#bib.bib44); Hu et al., [2025](https://arxiv.org/html/2606.11206#bib.bib45)) by maximizing the likelihood of expert demonstrations (Wei et al., [2022](https://arxiv.org/html/2606.11206#bib.bib1); Zhou et al., [2023](https://arxiv.org/html/2606.11206#bib.bib2); Chung et al., [2024](https://arxiv.org/html/2606.11206#bib.bib3)), effectively performing imitation learning or behavioral cloning (Mandlekar et al., [2021](https://arxiv.org/html/2606.11206#bib.bib4)). However, SFT overfits training distributions and generalizes poorly to OOD inputs (Zhou et al., [2024](https://arxiv.org/html/2606.11206#bib.bib43)), whereas RL optimizes task-level objectives via reward signals for improved generalization (Christiano et al., [2017](https://arxiv.org/html/2606.11206#bib.bib5); Ouyang et al., [2022](https://arxiv.org/html/2606.11206#bib.bib6); Bai et al., [2022](https://arxiv.org/html/2606.11206#bib.bib7)), albeit with substantial computational overhead and instability (Schulman et al., [2017](https://arxiv.org/html/2606.11206#bib.bib8); Strubell et al., [2019](https://arxiv.org/html/2606.11206#bib.bib9)). Empirical studies confirm that RL-based fine-tuning yields superior robustness on reasoning-intensive tasks, making the SFT–RL generalization gap a central alignment challenge (Chu et al., [2025](https://arxiv.org/html/2606.11206#bib.bib22); Swamy et al., [2025](https://arxiv.org/html/2606.11206#bib.bib23)).
To bridge this gap, hybrid methods combine SFT and RL: RLHF refines SFT with a learned reward model (Ouyang et al., [2022](https://arxiv.org/html/2606.11206#bib.bib6)), DPO directly optimizes from preference data without explicit rewards (Rafailov et al., [2023](https://arxiv.org/html/2606.11206#bib.bib11)), group-relative variants reduce reliance on absolute rewards (Shao et al., [2024](https://arxiv.org/html/2606.11206#bib.bib12)), Negative-aware Fine-Tuning uses incorrect generations as implicit negative feedback (Chen et al., [2025](https://arxiv.org/html/2606.11206#bib.bib13)), and self-rewarding vision-language models optimize prompts via iterative self-feedback (Yang et al., [2025](https://arxiv.org/html/2606.11206#bib.bib52)), all extending beyond pure supervised learning (Ouyang et al., [2022](https://arxiv.org/html/2606.11206#bib.bib6); Rafailov et al., [2023](https://arxiv.org/html/2606.11206#bib.bib11)). Theoretically, Du et al. ([2025](https://arxiv.org/html/2606.11206#bib.bib14)) reinterpret RLHF as reward-weighted SFT, Wang et al. ([2025](https://arxiv.org/html/2606.11206#bib.bib15)) analyze SFT as RL with an implicit reward, and Qin and Springenberg ([2025](https://arxiv.org/html/2606.11206#bib.bib17)) model SFT as offline RL with importance weighting; however, these expectation-level analyses do not characterize the variance of the resulting gradient estimators, which is critical for stable optimization.
From a broader perspective, structured constraints and feedback mechanisms further underscore the importance of principled objective design (Askell et al., [2021](https://arxiv.org/html/2606.11206#bib.bib24)), as also evidenced by abnormal-aware feedback in medical VL models (Zhou et al., [2025b](https://arxiv.org/html/2606.11206#bib.bib46); Zheng et al., [2026](https://arxiv.org/html/2606.11206#bib.bib48)), rubric-guided reinforcement learning for emotional support (Yuan et al., [2025](https://arxiv.org/html/2606.11206#bib.bib50)), and reinforcing VL frameworks for sign language translation (Rao et al., [2025](https://arxiv.org/html/2606.11206#bib.bib51)).
### 2.2 Stabilizing SFT and Gradient Reweighting
Several works stabilize SFT through loss reweighting or objective modification. MixCE combines forward and reverse cross-entropy to balance mode-covering and mode-seeking behaviors (Zhang et al., [2023](https://arxiv.org/html/2606.11206#bib.bib18)), and related importance-weighting ideas appear in offline RL from demonstrations (Mandlekar et al., [2021](https://arxiv.org/html/2606.11206#bib.bib4); Qin and Springenberg, [2025](https://arxiv.org/html/2606.11206#bib.bib17)). Entropy-guided optimization for autoregressive generation (Song et al., [2026](https://arxiv.org/html/2606.11206#bib.bib47)) also demonstrates that controlling entropy during training yields more stable and coherent synthesis. Wu et al. ([2025](https://arxiv.org/html/2606.11206#bib.bib21)) show that the SFT gradient is equivalent to an offline policy gradient estimator with implicit rewards and inverse-probability importance weighting, and propose Dynamic Fine-Tuning (DFT) to rectify the resulting pathological scaling by rescaling token-level gradients with the model’s own probabilities. Unlike heuristic methods such as Focal Loss (Lin et al., [2017](https://arxiv.org/html/2606.11206#bib.bib19)), DFT focuses on variance correction rather than emphasizing difficult samples. Abdolmaleki et al. ([2025](https://arxiv.org/html/2606.11206#bib.bib25)) further show that improper feedback weighting under mixtures of positive and negative feedback leads to instability, reinforcing the need for principled gradient control.
### 2.3 Data Quality and Sample Compatibility
While DFT stabilizes token-level optimization, it assumes all demonstrations are equally suitable learning targets (Wu et al., [2025](https://arxiv.org/html/2606.11206#bib.bib21)). In practice, heterogeneous datasets can induce demonstration-policy mismatch and high-variance updates at the sample level (Mandlekar et al., [2021](https://arxiv.org/html/2606.11206#bib.bib4); Liu and Zhang, [2025](https://arxiv.org/html/2606.11206#bib.bib20)). Liu and Zhang ([2025](https://arxiv.org/html/2606.11206#bib.bib20)) show in knowledge distillation that selectively downweighting low-compatibility samples improves stability, and prior work on curriculum learning further confirms that supervision effectiveness depends on the alignment between sample difficulty and model capacity (Mandlekar et al., [2021](https://arxiv.org/html/2606.11206#bib.bib4); Liu and Zhang, [2025](https://arxiv.org/html/2606.11206#bib.bib20); Chu et al., [2025](https://arxiv.org/html/2606.11206#bib.bib22)). Indiscriminately updating on low-compatibility demonstrations forces the model to memorize patterns it cannot reliably internalize (Liu and Zhang, [2025](https://arxiv.org/html/2606.11206#bib.bib20); Chu et al., [2025](https://arxiv.org/html/2606.11206#bib.bib22)).
Inspired by these findings, our work introduces a compatibility-aware extension of DFT that addresses demonstration-policy mismatch at the sample level within a fully supervised framework. In summary, prior work has improved SFT either by combining it with RL or by stabilizing token-level gradients; our work is the first to unify token-level and sample-level variance control (Wu et al., [2025](https://arxiv.org/html/2606.11206#bib.bib21); Liu and Zhang, [2025](https://arxiv.org/html/2606.11206#bib.bib20)).
## 3 Compatibility-Aware Dynamic Fine-Tuning
Figure 1: Overall framework of Compatibility-Aware Dynamic Fine-Tuning (CADFT). CADFT extends DFT by incorporating a sample-level compatibility signal that modulates update strength and optionally guides delayed demonstration rewriting.
In this section, we present Compatibility-Aware Dynamic Fine-Tuning (CADFT), a robust alignment framework designed to stabilize supervised fine-tuning under heterogeneous data distributions. We first revisit Dynamic Fine-Tuning (DFT), then introduce a dynamic, sample-level compatibility signal for reweighting updates, and finally describe an optional delayed rewriting mechanism. The overall procedure is summarized in Algorithm [1](https://arxiv.org/html/2606.11206#alg1).
### 3.1 Preliminaries
Let $\mathcal{D}=\{(x,y)\}$ denote a dataset of instruction-response pairs, and $\pi_{\theta}(y|x)$ a language model parameterized by $\theta$. Standard Supervised Fine-Tuning (SFT) minimizes the negative log-likelihood:
$$\mathcal{L}_{\text{SFT}}(x,y) = -\sum_{t=1}^{|y|} \log \pi_{\theta}(y_t \mid x, y_{<t}). \tag{1}$$
The gradient magnitude of $\mathcal{L}_{\text{SFT}}$ scales inversely with $\pi_{\theta}(y_t|\cdot)$, causing low-probability tokens to induce disproportionately large updates and high optimization variance.
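To make the inverse-probability scaling explicit, the per-token gradient of Eq. (1) can be expanded with the chain rule. This is a standard identity, written here with $p_t$ as shorthand for $\pi_{\theta}(y_t \mid x, y_{<t})$:

```latex
\nabla_{\theta}\bigl(-\log p_t\bigr) \;=\; -\frac{1}{p_t}\,\nabla_{\theta}\, p_t
```

For a fixed $\nabla_{\theta} p_t$, the update magnitude therefore grows like $1/p_t$ as $p_t \to 0$.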
Dynamic Fine-Tuning (DFT) (Wu et al., [2025](https://arxiv.org/html/2606.11206#bib.bib21)) addresses this issue by rectifying token-level gradient scaling. From a gradient perspective, DFT induces updates proportional to $(1+\log p_t)$, which remain bounded as $p_t \to 0$. This effectively neutralizes the inverse-probability amplification present in SFT and stabilizes token-level optimization. However, DFT operates purely at the token level and implicitly assumes all demonstrations are equally suitable learning targets.
### 3.2 Dynamic Compatibility Assessment
We posit that demonstrations vary in their suitability for learning at different stages of training. We therefore introduce a *dynamic compatibility* signal that measures how well a demonstration aligns with the model’s current inductive bias.
#### Raw Compatibility Score\.
For a sample $(x,y)$, we define the raw compatibility score as the length-normalized negative log-likelihood:
$$c_{\text{raw}}(x,y;\theta) = \frac{1}{|y|}\sum_{t=1}^{|y|} -\log \pi_{\theta}(y_t \mid x, y_{<t}). \tag{2}$$
Lower values indicate higher compatibility. As training progresses, however, the absolute scale of $c_{\text{raw}}$ shifts, rendering static thresholds ineffective.
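As a minimal sketch of Eq. (2), the score can be computed from a list of per-token log-probabilities. The function name `raw_compatibility` and the example probabilities are illustrative assumptions, not taken from the paper; in practice the log-probabilities would come from a teacher-forced forward pass.

```python
import math


def raw_compatibility(token_logprobs):
    """Length-normalized negative log-likelihood of one response (Eq. 2)."""
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return -sum(token_logprobs) / len(token_logprobs)


# Hypothetical per-token probabilities for two responses to the same prompt.
familiar = [math.log(p) for p in (0.9, 0.8, 0.95)]
unfamiliar = [math.log(p) for p in (0.2, 0.05, 0.1, 0.3)]

c_familiar = raw_compatibility(familiar)      # small: high compatibility
c_unfamiliar = raw_compatibility(unfamiliar)  # large: low compatibility
```

Dividing by the response length keeps long and short demonstrations on a comparable scale.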
#### Adaptive Normalization\.
To obtain a scale-invariant signal, we normalize raw compatibility scores within each effective global mini-batch $\mathcal{B}$, where $\mathcal{B}$ aggregates all micro-batches across data-parallel workers. Let $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ denote the mean and standard deviation of $c_{\text{raw}}$ in the batch, computed via distributed all-reduce synchronization to ensure consistency across devices. The normalized score is:
$$\hat{c}_i = \frac{c_{\text{raw}}(x_i,y_i) - \mu_{\mathcal{B}}}{\sigma_{\mathcal{B}} + \epsilon}, \tag{3}$$
where $\epsilon$ ensures numerical stability. Importantly, $\hat{c}$ represents a *relative and model-dependent* notion of compatibility, reflecting alignment with the current model state rather than an absolute measure of difficulty.
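The normalization in Eq. (3) can be sketched in plain Python, with the all-reduce step simulated by summing per-worker partial statistics. The helper names and the use of the population standard deviation are assumptions; the text does not say which estimator is used.

```python
import math


def global_stats(shards):
    """Combine per-worker partial sums into a global mean and std.

    Summing (count, sum, sum of squares) over workers stands in for the
    all-reduce step. Population std is an assumption of this sketch.
    """
    n = sum(len(s) for s in shards)
    total = sum(sum(s) for s in shards)
    total_sq = sum(sum(c * c for c in s) for s in shards)
    mean = total / n
    var = max(total_sq / n - mean * mean, 0.0)  # guard against round-off
    return mean, math.sqrt(var)


def normalize(shards, eps=1e-6):
    """Z-score every raw score against the global batch statistics (Eq. 3)."""
    mean, std = global_stats(shards)
    return [[(c - mean) / (std + eps) for c in s] for s in shards]


# Hypothetical raw scores held by two data-parallel workers.
shards = [[0.2, 0.4], [1.0, 2.4]]
c_hat = normalize(shards)
```

Because the statistics are aggregated before normalizing, the result does not depend on how the batch is split across workers.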
### 3.3 Compatibility-Aware Objective
CADFT integrates the compatibility signal into Dynamic Fine-Tuning (DFT) through a sample-level weighting function $w(\hat{c})$ that modulates the strength of supervised updates. We employ a soft exponential decay:
$$w(\hat{c}_i) = \exp\!\left(-\beta \cdot \max(0, \hat{c}_i)\right), \tag{4}$$
where $\beta \geq 0$ controls the sensitivity of the weighting mechanism.
This design preserves the full contribution of samples whose normalized compatibility $\hat{c}_i$ is no worse than the batch average ($\hat{c}_i \leq 0$), while progressively down-weighting less compatible samples ($\hat{c}_i > 0$). As a result, demonstrations that are misaligned with the model’s current inductive bias exert reduced influence on optimization, mitigating high-variance updates without discarding potentially useful data.
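The weighting rule of Eq. (4) is short enough to check numerically. The sketch below uses a hypothetical helper `compatibility_weight` and illustrative normalized scores.

```python
import math


def compatibility_weight(c_hat, beta=1.0):
    """Soft exponential decay of Eq. (4); beta must be non-negative."""
    if beta < 0:
        raise ValueError("beta must be non-negative")
    return math.exp(-beta * max(0.0, c_hat))


# Illustrative normalized scores: two at or below the batch mean, two above.
weights = [compatibility_weight(c) for c in (-1.5, 0.0, 0.5, 2.0)]
```

Samples at or below the batch mean keep weight 1, and setting `beta=0` disables the reweighting, which recovers plain DFT.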
The resulting compatibility\-aware objective is defined as:
$$\mathcal{L}_{\text{CADFT}}(\mathcal{B}) = \frac{1}{|\mathcal{B}|}\sum_{(x,y)\in\mathcal{B}} w(\hat{c}(x,y)) \cdot \mathcal{L}_{\text{DFT}}(x,y) \tag{5}$$
From an optimization perspective, this formulation is conceptually related to self-paced or curriculum learning: samples exhibiting higher compatibility exert stronger influence on parameter updates, while harder or misaligned demonstrations are incorporated more conservatively as training progresses.
### 3.4 Delayed Compatibility-Guided Rewriting
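A small sketch of how Eq. (5) combines the pieces. The per-sample DFT losses are passed in as plain numbers because this section does not spell out $\mathcal{L}_{\text{DFT}}$; the function name and the example values are hypothetical.

```python
import math


def cadft_batch_loss(dft_losses, c_hats, beta=1.0):
    """Eq. (5): batch mean of w(c_hat) * L_DFT.

    The weights are plain floats here, mirroring the stop-gradient on the
    compatibility signal. Note the division by the batch size, not by the
    sum of the weights.
    """
    if not dft_losses or len(dft_losses) != len(c_hats):
        raise ValueError("need one normalized score per loss value")
    weights = [math.exp(-beta * max(0.0, c)) for c in c_hats]
    return sum(w * l for w, l in zip(weights, dft_losses)) / len(dft_losses)


# Hypothetical per-sample DFT losses and normalized compatibility scores.
loss = cadft_batch_loss([1.0, 2.0], [-1.0, 1.0], beta=1.0)
```

Since the sum is divided by $|\mathcal{B}|$ and not renormalized, down-weighting incompatible samples also shrinks the overall step size for that batch.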
While compatibility\-based reweighting mitigates the influence of incompatible samples, it may underutilize demonstrations that are correct but substantially misaligned with the model’s current capability\. To address this, we further explore a conservative delayed rewriting mechanism that optionally reformulates persistently incompatible samples into more learnable targets\.
#### Two\-Stage Training\.
We divide training into two stages to avoid premature self\-reinforcement:
1. Warm-up Stage ($t < T_{\text{warm}}$): Training proceeds solely with the compatibility-aware objective, allowing the model to acquire stable instruction-following behavior.
2. Rewriting Stage ($t \geq T_{\text{warm}}$): At periodic intervals, a small subset of samples with persistently high moving-average compatibility scores may be selected for optional rewriting. Their original targets can be replaced with model-generated alternatives that lie within the model’s current feasible region.
Specifically, rewritten targets are sampled as:
$$\hat{y} \sim \text{NucleusSampling}(\pi_{\theta}(\cdot \mid x);\, p=0.9,\, T=0.7). \tag{6}$$
This process can be viewed as projecting overly hard demonstrations onto the model’s current hypothesis class, converting high-variance supervision into stable but simplified learning signals.
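The scheduling and selection logic around Eq. (6) might be sketched as follows. Several details are assumptions of this sketch: the moving average is taken to be an exponential one with a hypothetical `momentum`, the trigger uses the strict `step > t_warm` test from Algorithm 1, and the replacement probability is applied independently per selected sample. Generating $\hat{y}$ itself requires a model and is left out.

```python
import random


def should_rewrite(step, t_warm, interval):
    """True only after warm-up and on every `interval`-th step."""
    return step > t_warm and step % interval == 0


def update_moving_average(ema, sample_id, c_hat, momentum=0.9):
    """Track a per-sample exponential moving average of the score.

    The EMA form and the momentum value are assumptions of this sketch.
    """
    prev = ema.get(sample_id, c_hat)
    ema[sample_id] = momentum * prev + (1.0 - momentum) * c_hat


def select_for_rewriting(ema, fraction, replace_prob, rng):
    """Pick the highest-scoring fraction, then keep each with replace_prob."""
    k = int(len(ema) * fraction)
    ranked = sorted(ema, key=ema.get, reverse=True)[:k]
    return [sid for sid in ranked if rng.random() < replace_prob]


# Hypothetical moving-average scores keyed by sample id.
scores = {"a": 2.0, "b": -1.0, "c": 0.5, "d": 1.0}
chosen = select_for_rewriting(scores, fraction=0.5, replace_prob=1.0,
                              rng=random.Random(0))
```

A high moving-average score means the sample has stayed incompatible over many steps, which is what makes it a rewriting candidate.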
Algorithm 1: Compatibility-Aware Dynamic Fine-Tuning
Require: Dataset $\mathcal{D}$, model $\pi_{\theta}$, batch size $B$, warm-up steps $T_{\text{warm}}$, rewrite interval $K$, compatibility sensitivity $\beta$
1. for training step $t = 1, \dots, T_{\text{max}}$ do
2. Sample mini-batch $\mathcal{B} = \{(x_i, y_i)\}_{i=1}^{B}$ from $\mathcal{D}$
3. // Compute compatibility (no gradient)
4. for each $(x_i, y_i) \in \mathcal{B}$ do
5. $c_i \leftarrow \frac{1}{|y_i|} \sum_{t} -\log \pi_{\theta}(y_{i,t} \mid x_i, y_{i,<t})$
6. end for
7. Compute batch mean $\mu_{\mathcal{B}}$ and std $\sigma_{\mathcal{B}}$
8. $\hat{c}_i \leftarrow \text{stop\_grad}\!\left(\frac{c_i - \mu_{\mathcal{B}}}{\sigma_{\mathcal{B}} + \epsilon}\right)$
9. $w_i \leftarrow \exp(-\beta \cdot \max(0, \hat{c}_i))$
10. // Compatibility-aware DFT update
11. $\mathcal{L} \leftarrow \frac{1}{B} \sum_{i} w_i \cdot \mathcal{L}_{\text{DFT}}(x_i, y_i)$
12. Update parameters $\theta \leftarrow \text{Optimizer}(\theta, \nabla \mathcal{L})$
13. // Optional delayed rewriting
14. if $t > T_{\text{warm}}$ and $t \bmod K = 0$ then
15. Identify small subset $\mathcal{S} \subset \mathcal{D}$ with highest moving-average compatibility scores
16. for each $(x, y) \in \mathcal{S}$ do
17. Generate $\hat{y} \sim \pi_{\theta}(\cdot \mid x)$ via nucleus sampling
18. Optionally replace $y$ with $\hat{y}$
19. end for
20. end if
21. end for
### 3.5 Theoretical Perspective: Variance Reduction
Let $g_i = \nabla \mathcal{L}_{\text{DFT}}(x_i, y_i)$ denote the stochastic gradient of sample $i$. Under the commonly observed assumption that incompatible or low-probability demonstrations induce disproportionately large gradient norms in SFT-style training, such samples tend to dominate the second moment of the gradient estimator without contributing proportionally to the mean update direction.
By applying the compatibility weight, CADFT uses $\tilde{g}_i = w(\hat{c}_i)\, g_i$. As $w(\hat{c}_i)$ decays for increasingly incompatible samples, the weighted second moment $\mathbb{E}[\|\tilde{g}\|^{2}]$ is reduced relative to standard DFT. Consequently, CADFT acts as a variance-controlled estimator that stabilizes optimization by modulating gradients based on semantic compatibility rather than arbitrary norm clipping.
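One way to spell out the second-moment claim, assuming the weights are treated as constants under the stop-gradient and using only the fact that $\beta \geq 0$ implies $0 < w(\hat{c}_i) \leq 1$:

```latex
\|\tilde{g}_i\|^{2} \;=\; w(\hat{c}_i)^{2}\,\|g_i\|^{2} \;\le\; \|g_i\|^{2}
\quad\Longrightarrow\quad
\mathbb{E}\bigl[\|\tilde{g}\|^{2}\bigr] \;\le\; \mathbb{E}\bigl[\|g\|^{2}\bigr]
```

The inequality is strict whenever $\beta > 0$ and samples with $\hat{c}_i > 0$ and $g_i \neq 0$ occur with positive probability. This bounds the second moment only. The weighting also changes the mean update, so the display is a sketch of the argument and not a full variance analysis.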
## 4 Experiments
We conduct comprehensive experiments to evaluate the effectiveness of Compatibility-Aware Dynamic Fine-Tuning (CADFT). Our experimental study is designed to answer the following questions: (i) whether CADFT consistently improves over SFT and DFT across tasks and model scales; (ii) how compatibility-aware reweighting affects optimization stability; (iii) whether CADFT provides a stronger initialization for downstream reinforcement learning; and (iv) which design choices are critical to its effectiveness.
We evaluate CADFT on mathematical reasoning, code generation, and multimodal reasoning tasks, under both supervised fine\-tuning and reinforcement learning settings\.
### 4.1 Experimental Setup
#### Models\.
We evaluate CADFT on a diverse set of open-source language and vision-language models, including LLaMA-3 series (Team, [2024](https://arxiv.org/html/2606.11206#bib.bib27)), DeepSeekMath (Shao et al., [2024](https://arxiv.org/html/2606.11206#bib.bib12)), Qwen2.5-Math (Yang et al., [2024](https://arxiv.org/html/2606.11206#bib.bib29)), Qwen2.5-Coder (Hui et al., [2024](https://arxiv.org/html/2606.11206#bib.bib30)), and Qwen2.5-VL (Bai et al., [2025](https://arxiv.org/html/2606.11206#bib.bib28)), covering multiple parameter scales. For fair comparison, all methods share identical model architectures and initial checkpoints.
#### Datasets, Benchmarks, and Evaluation\.
We evaluate CADFT on benchmarks spanning mathematical reasoning, code generation, and multimodal reasoning. Mathematical reasoning is evaluated on Math500 (Hendrycks et al., [2021](https://arxiv.org/html/2606.11206#bib.bib33)), Minerva Math (Lewkowycz et al., [2022](https://arxiv.org/html/2606.11206#bib.bib31)), OlympiadBench (He et al., [2024](https://arxiv.org/html/2606.11206#bib.bib38)), AIME 2024 (Hendrycks et al., [2021](https://arxiv.org/html/2606.11206#bib.bib33)), and AMC 2023 (Hendrycks et al., [2021](https://arxiv.org/html/2606.11206#bib.bib33)). Code generation is evaluated on HumanEval (Chen et al., [2021](https://arxiv.org/html/2606.11206#bib.bib36)), HumanEval+ (Liu et al., [2023](https://arxiv.org/html/2606.11206#bib.bib35)), and MultiPL-E (Cassano et al., [2023](https://arxiv.org/html/2606.11206#bib.bib34)) across nine programming languages. Multimodal reasoning is evaluated on MathVerse (Zhang et al., [2024](https://arxiv.org/html/2606.11206#bib.bib37)), MathVision (Wang et al., [2024](https://arxiv.org/html/2606.11206#bib.bib40)), and WeMath (Qiao et al., [2025](https://arxiv.org/html/2606.11206#bib.bib39)).
For all benchmarks, we strictly follow the official evaluation protocols and metrics provided by each dataset\. Mathematical reasoning performance is reported using Average@16 accuracy, code generation using pass@1 accuracy, and multimodal benchmarks using accuracy\-based metrics\.
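As an illustration of the reporting metric, the sketch below assumes that Average@16 means accuracy averaged over 16 sampled responses per problem and then over problems. The text does not define the metric, so this reading and the helper name `average_at_k` are assumptions.

```python
def average_at_k(correct):
    """Mean accuracy (%) over k sampled answers per problem.

    correct[i][j] is True when the j-th sampled answer to problem i is
    judged correct. Every problem must have at least one sample.
    """
    per_problem = [sum(row) / len(row) for row in correct]
    return 100.0 * sum(per_problem) / len(per_problem)


# Hypothetical grading results: two problems, two samples each.
score = average_at_k([[True, False], [True, True]])
```

With a single sample per problem the same function reduces to plain accuracy.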
#### Training Details\.
We follow the training protocol of DFT (Wu et al., [2025](https://arxiv.org/html/2606.11206#bib.bib21)). All models are trained using identical batch sizes, optimizers, learning rates, and training steps across SFT, DFT, and CADFT. The effective global batch size is 256, achieved via data-parallel synchronization and gradient accumulation. Compatibility scores are computed per mini-batch and normalized dynamically using $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ synchronized across all data-parallel workers via all-reduce, ensuring that normalization is performed on the effective global batch (rather than per-device micro-batches) and is invariant to sharding strategy. The compatibility statistics are detached from gradient computation. For CADFT-specific hyperparameters, we set the compatibility normalization to per-mini-batch z-score with $\epsilon = 10^{-6}$ and the weighting sensitivity to $\beta = 1.0$. When delayed rewriting is enabled, we use a warm-up of $T_{\text{warm}} = 3000$ steps, a rewriting interval of $K = 1000$ steps, and rewrite a fraction of $0.5\%$ of the dataset per interval with replacement probability $0.5$. Rewritten targets are generated via nucleus sampling with $p = 0.9$ and temperature $T = 0.7$.
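For reference, the reported hyperparameters can be gathered into a small configuration object. The class name, field names, and the two helper methods are hypothetical conveniences; only the default values are taken from the text above. The per-device batch size and worker count in the usage lines are made-up examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CADFTConfig:
    """Hyperparameters as reported in the training details (names are ours)."""
    global_batch_size: int = 256
    eps: float = 1e-6               # z-score stabilizer
    beta: float = 1.0               # weighting sensitivity
    warmup_steps: int = 3000        # T_warm
    rewrite_interval: int = 1000    # K
    rewrite_fraction: float = 0.005  # 0.5% of the dataset per interval
    replace_prob: float = 0.5
    top_p: float = 0.9
    temperature: float = 0.7

    def grad_accum_steps(self, per_device_batch, world_size):
        """Accumulation steps needed to reach the global batch size."""
        per_step = per_device_batch * world_size
        if self.global_batch_size % per_step != 0:
            raise ValueError("global batch size is not divisible")
        return self.global_batch_size // per_step

    def candidates_per_interval(self, dataset_size):
        """Number of samples considered for rewriting at each interval."""
        return round(dataset_size * self.rewrite_fraction)


cfg = CADFTConfig()
accum = cfg.grad_accum_steps(per_device_batch=8, world_size=4)
candidates = cfg.candidates_per_interval(dataset_size=100_000)
```

Under these example values, about 500 samples are considered at each rewriting interval, and roughly half of them would be replaced if the replacement probability is applied per sample.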
Table 1: Mathematical reasoning performance (Average@16). Accuracy (%) of five representative large language models on diverse mathematical reasoning benchmarks. For each backbone model, we report results under vanilla fine-tuning (SFT), Dynamic Fine-Tuning (DFT), and the proposed Compatibility-Aware DFT (CADFT).
Table 2: Code generation performance on HumanEval and MultiPL-E. We report pass@1 accuracy (%) on HumanEval (HE, HE+) and MultiPL-E across nine programming languages. HE and HE+ are subsets of the HumanEval benchmark, while all language-specific scores belong to MultiPL-E.
Table 3: Multi-modal mathematical reasoning performance. Comparison on MathVerse, MathVision, and WeMath benchmarks under different visual reasoning regimes. Scores reflect overall accuracy (%). The proposed CADFT consistently improves performance across vision-only and vision-intensive settings.
Table 4: Cold-start mathematical reasoning with GRPO. All models are first initialized via supervised fine-tuning (SFT), DFT, or CADFT, and subsequently optimized using GRPO. Results demonstrate that CADFT provides a stronger initialization for downstream reinforcement learning.
Table 5: Code generation with GRPO fine-tuning. All models are further optimized with GRPO after SFT or DFT initialization. Results are reported on HumanEval (HE, HE+) and MultiPL-E benchmarks. CADFT yields consistently stronger GRPO-aligned representations.
Table 6: Multi-modal reasoning with GRPO optimization. Comparison of SFT-, DFT-, and CADFT-initialized models after GRPO fine-tuning. CADFT consistently delivers stronger alignment between visual perception and reasoning under reinforcement learning.
### 4.2 Main Results
#### Mathematical Reasoning Performance
Table [1](https://arxiv.org/html/2606.11206#S4.T1) presents the main results on mathematical reasoning benchmarks. Across all evaluated model families and scales, CADFT consistently outperforms both SFT and DFT, with particularly large gains on the most challenging benchmarks. Compared to vanilla SFT, CADFT avoids the severe performance degradation observed on OlympiadBench, AIME 2024, and AMC 2023, where demonstration-policy mismatch is pronounced. While DFT mitigates token-level instability, it still treats all demonstrations as equally informative, allowing incompatible samples to induce noisy updates. By explicitly down-weighting such samples, CADFT further reduces optimization variance and enables more stable learning from heterogeneous supervision. Notably, the relative improvement of CADFT over DFT grows with model scale. This suggests that as model capacity increases, sample-level mismatch becomes a dominant source of optimization noise, making compatibility-aware control increasingly important.
#### Code Generation Performance
Table [2](https://arxiv.org/html/2606.11206#S4.T2) summarizes code generation results on HumanEval and MultiPL-E. CADFT consistently improves performance over both SFT and DFT across all evaluated models. Beyond aggregate gains, CADFT exhibits particularly strong improvements on lower-resource and syntactically diverse languages such as Bash and PHP within MultiPL-E. This indicates that compatibility-aware reweighting discourages overfitting to high-frequency patterns in dominant languages (e.g., Python), thereby improving cross-language generalization. These results support the hypothesis that sample-level heterogeneity is a major source of optimization variance in multilingual code generation.
#### Multimodal Mathematical Reasoning
We further evaluate CADFT on multimodal mathematical reasoning benchmarks. As shown in Table [3](https://arxiv.org/html/2606.11206#S4.T3), CADFT consistently improves performance across vision-only, vision-intensive, and vision-dominant regimes on MathVerse, MathVision, and WeMath. Multimodal settings introduce additional sources of demonstration-policy mismatch due to imperfect visual grounding and varying degrees of visual dependency. While DFT stabilizes token-level updates, it cannot distinguish between well-grounded and poorly grounded demonstrations. CADFT alleviates this issue by suppressing high-variance updates from incompatible multimodal samples, leading to more robust vision-language alignment.

Figure 2: Ablation of compatibility definition and dynamicity. Dynamic, normalized compatibility yields consistent gains over DFT, while static, unnormalized, or random reweighting degrades performance.

Table 7: Ablation of weighting functions $w(c)$ on mathematical reasoning. We compare a soft monotonic exponential decay, a linearly clipped weighting, a binary filter, and an inverse weighting scheme.

Table 8: Gradient norm variance across sample groups with different compatibility levels. Lower compatibility samples induce substantially higher gradient variance.

### 4.3 Cold-Start Reinforcement Learning Initialization
We investigate whether CADFT provides a stronger initialization for downstream reinforcement learning. Following prior work (Wu et al., [2025](https://arxiv.org/html/2606.11206#bib.bib21)), models are further optimized using GRPO (DeepSeek-AI, [2025](https://arxiv.org/html/2606.11206#bib.bib32)) after SFT, DFT, or CADFT initialization. Specifically, we adopt the same GRPO protocol and implementation choices as Wu et al. ([2025](https://arxiv.org/html/2606.11206#bib.bib21)). Correctness, as determined by `math_verify`, serves as the verifier-based reward signal. GRPO is trained in the `verl` framework with learning rate 1e-6, global batch size 256, warmup ratio 0.1, and $n=4$ sampled responses per prompt. All other GRPO hyperparameters follow the official DFT implementation scripts. As shown in Tables [4](https://arxiv.org/html/2606.11206#S4.T4)-[6](https://arxiv.org/html/2606.11206#S4.T6), CADFT-initialized models consistently outperform SFT+GRPO and DFT+GRPO across mathematical reasoning, code generation, and multimodal reasoning tasks. These results indicate that CADFT produces representations with lower gradient noise and better-aligned supervision, which facilitates subsequent policy optimization. Importantly, CADFT does not optimize for any reinforcement learning objective during the supervised initialization stage. The observed gains suggest that reducing supervised optimization variance is complementary to, rather than competing with, reinforcement learning.
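To make the cold-start setup concrete, the following minimal sketch collects the hyperparameters stated above and computes group-relative advantages for the responses sampled for one prompt. It is an illustration under assumptions, not the configuration of the `verl` framework: the dictionary keys are hypothetical names, the reward is assumed to be binary (1 for a verified-correct answer, 0 otherwise), and the population standard deviation is an assumed choice for the group normalization.

```python
from statistics import mean, pstdev

# Values are taken from the text; the key names are hypothetical, not verl option names.
GRPO_CONFIG = {
    "learning_rate": 1e-6,
    "global_batch_size": 256,
    "warmup_ratio": 0.1,
    "n_samples_per_prompt": 4,
}

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, four sampled responses, only the first verified as correct (assumed 0/1 reward).
rewards = [1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

When every response in a group receives the same reward, all advantages are zero and the prompt contributes no policy gradient, which is one reason the quality of the supervised initialization can matter for the subsequent GRPO stage.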
Figure 3: Gradient norm variance across compatibility groups. Low-compatibility samples induce significantly higher gradient variance, motivating compatibility-aware variance control.

### 4.4 Ablation Studies and Analysis
We conduct a series of ablation studies to isolate the contribution of each design component in CADFT, including compatibility definition, weighting function shape, gradient variance control, and delayed rewriting\. All ablations are evaluated under the same training and evaluation settings for fair comparison\.
Table 9: Overall gradient norm variance under different fine-tuning methods. CADFT achieves the lowest variance, indicating the most stable optimization.

Table 10: Ablation of compatibility-guided rewriting strategies. Rewriting too early or too frequently leads to premature self-reinforcement, while delayed rewriting after a warm-up phase achieves the best performance.

#### Effect of Compatibility Definition and Dynamicity.
We first study how the definition and dynamicity of compatibility affect performance. As shown in Figure [2](https://arxiv.org/html/2606.11206#S4.F2), dynamically normalized compatibility consistently outperforms all static or unnormalized variants. Static compatibility definitions fail to account for the evolving model policy, causing samples to be permanently over- or under-weighted as training progresses. Unnormalized compatibility is sensitive to scale drift in likelihood values, leading to unstable weighting behavior. In contrast, dynamic, batch-normalized compatibility provides a relative and policy-dependent signal, allowing CADFT to adaptively suppress incompatible samples throughout training. The random reweighting baseline further confirms that the observed gains do not arise from implicit regularization or noise injection, but from structured, compatibility-aware modulation.
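The contrast between unnormalized and batch-normalized compatibility can be illustrated with a small sketch. It assumes, purely for illustration, that the raw signal is the length-normalized log-likelihood of a demonstration under the current policy and that normalization is a z-score within the batch; the function names are hypothetical and the exact definition used by CADFT may differ.

```python
from statistics import mean, pstdev

def sequence_log_likelihood(token_logprobs):
    """Length-normalized log-likelihood of one demonstration (assumed raw signal)."""
    return sum(token_logprobs) / len(token_logprobs)

def batch_normalized_compatibility(batch_token_logprobs, eps=1e-8):
    """Z-score each sample's raw score against the current batch."""
    raw = [sequence_log_likelihood(lp) for lp in batch_token_logprobs]
    mu, sigma = mean(raw), pstdev(raw)
    return [(r - mu) / (sigma + eps) for r in raw]

# Toy per-token log-probabilities for three demonstrations (made-up numbers).
batch = [
    [-0.1, -0.2, -0.1],  # demonstration the policy already finds likely
    [-0.5, -0.4, -0.6],
    [-3.0, -2.5, -4.0],  # demonstration far from the current policy
]
scores = batch_normalized_compatibility(batch)
print(scores)
```

Because the score is relative to the batch, shifting every log-probability by the same constant leaves it unchanged, which is the kind of robustness to likelihood scale drift that the unnormalized variant lacks. Recomputing the scores at every step with the current policy is what makes the signal dynamic rather than static.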
#### Impact of Weighting Function Shape\.
Table [7](https://arxiv.org/html/2606.11206#S4.T7) examines the effect of different weighting function shapes $w(c)$ while keeping all other components fixed. The exponential decay function achieves the best overall performance, as it provides smooth and monotonic suppression of incompatible samples without introducing discontinuities. Hard filtering or binary weighting removes gradient contributions abruptly, leading to optimization instability and reduced sample efficiency. Inverse weighting overly amplifies low-compatibility samples, resulting in high-variance updates. These results indicate that soft, monotonic weighting is critical for balancing stability and information retention in compatibility-aware optimization.
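The four weighting shapes compared above can be sketched as follows. The parameterizations are assumptions made for illustration: $c$ is taken to be a normalized compatibility score where higher means more compatible, the "deficit" helper, the constants `beta`, `tau`, and `cap`, and the exact functional forms are hypothetical and are not taken from the paper.

```python
import math

def deficit(c):
    """Incompatibility of a sample: zero for c >= 0, growing as c falls below 0."""
    return max(0.0, -c)

def exponential_decay(c, beta=1.0):
    """Soft, monotonic suppression: the weight shrinks smoothly and never reaches zero."""
    return math.exp(-beta * deficit(c))

def linear_clipped(c, beta=0.5):
    """Linear suppression clipped to the interval [0, 1]."""
    return min(1.0, max(0.0, 1.0 - beta * deficit(c)))

def binary_filter(c, tau=-1.0):
    """Hard filter: drop every sample whose compatibility is below tau."""
    return 1.0 if c >= tau else 0.0

def inverse_weight(c, beta=1.0, cap=10.0):
    """Inverse scheme: up-weights incompatible samples (capped for safety)."""
    return min(cap, math.exp(beta * deficit(c)))

for c in (1.0, 0.0, -1.0, -3.0):
    print(c, exponential_decay(c), linear_clipped(c), binary_filter(c), inverse_weight(c))
```

In this sketch the exponential form is the only one that both decreases monotonically with incompatibility and keeps a nonzero gradient contribution from every sample, which matches the qualitative argument for soft weighting given above.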
#### Gradient Variance Across Compatibility Levels\.
To directly validate our theoretical motivation, we analyze gradient norm variance across samples grouped by compatibility. Table [8](https://arxiv.org/html/2606.11206#S4.T8) and Figure [3](https://arxiv.org/html/2606.11206#S4.F3) show that low-compatibility samples induce substantially higher gradient variance than high-compatibility ones. This observation provides empirical evidence that demonstration-policy mismatch is a major source of optimization noise in supervised fine-tuning. By down-weighting such samples, CADFT effectively suppresses high-variance updates at the sample level. Consistently, Table [9](https://arxiv.org/html/2606.11206#S4.T9) shows that CADFT achieves the lowest overall gradient variance among SFT, DFT, and CADFT, confirming its role as a variance-controlled estimator.
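A grouped variance analysis of this kind can be sketched in a few lines. The snippet below assumes that a compatibility score and a gradient norm have already been recorded for each sample; it sorts samples by compatibility, splits them into equal-sized groups, and reports the variance of gradient norms per group. The grouping rule and all numbers are synthetic placeholders for illustration, not results from the paper.

```python
from statistics import pvariance

def variance_by_compatibility_group(compat, grad_norms, n_groups=3):
    """Variance of per-sample gradient norms within compatibility-ranked groups.

    Returns one value per group, ordered from lowest to highest compatibility.
    """
    order = sorted(range(len(compat)), key=lambda i: compat[i])
    variances = []
    for g in range(n_groups):
        start = g * len(order) // n_groups
        end = (g + 1) * len(order) // n_groups
        variances.append(pvariance([grad_norms[i] for i in order[start:end]]))
    return variances

# Synthetic example: nine samples with made-up scores and gradient norms.
compat = [-2.0, -1.8, -1.5, -0.2, 0.0, 0.1, 1.2, 1.5, 1.9]
grad_norms = [9.0, 2.0, 14.0, 3.0, 4.5, 3.8, 1.1, 1.0, 1.2]
low, mid, high = variance_by_compatibility_group(compat, grad_norms)
print(low, mid, high)
```

In practice the per-sample gradient norms would come from the training loop (for example, by backpropagating one sample at a time), which this sketch deliberately leaves out.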
#### Effect of Delayed Compatibility\-Guided Rewriting\.
Finally, we study the impact of delayed demonstration rewriting. As shown in Table [10](https://arxiv.org/html/2606.11206#S4.T10), early or aggressive rewriting significantly degrades performance, indicating premature self-reinforcement when the model is not yet stable. In contrast, delayed rewriting after a warm-up phase consistently improves performance. This suggests that rewriting is beneficial only after the model has acquired a stable inductive bias, at which point projecting incompatible demonstrations into the model's feasible region reduces variance without reinforcing spurious solutions. These results show that delayed, conservative rewriting complements compatibility-aware reweighting, while aggressive rewriting undermines training stability.
## 5 Conclusion
We presented Compatibility-Aware Dynamic Fine-Tuning (CADFT), a principled extension of Dynamic Fine-Tuning that explicitly controls sample-level optimization variance in supervised fine-tuning. By introducing a dynamic, policy-dependent compatibility signal, CADFT suppresses high-variance updates from mismatched demonstrations while preserving informative supervision. A delayed and low-frequency rewriting strategy further enables conservative utilization of persistently incompatible data. Both theoretically and empirically, CADFT generalizes token-level stabilization in DFT to the sample level, yielding improved stability, generalization, and stronger initialization for downstream reinforcement learning, without introducing reward models or on-policy optimization.
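The scheduling logic behind delayed, low-frequency rewriting can be sketched as below. This is only an illustration of the gating described above: the warm-up length, rewrite interval, compatibility threshold, and patience are hypothetical parameters, and the rewriting step itself (producing a new demonstration) is intentionally omitted.

```python
def should_rewrite(step, warmup_steps, rewrite_every):
    """Allow rewriting only after warm-up, and then only at sparse intervals."""
    return step >= warmup_steps and (step - warmup_steps) % rewrite_every == 0

def select_persistently_incompatible(history, threshold, patience):
    """Pick samples whose last `patience` compatibility scores all fall below threshold.

    history maps a sample id to its compatibility scores over time, oldest first.
    """
    return [
        sample_id
        for sample_id, scores in history.items()
        if len(scores) >= patience and all(s < threshold for s in scores[-patience:])
    ]

# Toy compatibility histories (made-up values).
history = {
    "a": [-1.5, -1.8, -2.0],  # incompatible at every check
    "b": [-1.5, 0.3, -1.7],   # only intermittently incompatible
    "c": [0.5, 0.7, 0.6],     # compatible throughout
}
candidates = select_persistently_incompatible(history, threshold=-1.0, patience=3)
print(candidates)
```

Requiring several consecutive low scores before a sample becomes a rewriting candidate is one simple way to express "persistently incompatible"; an early or aggressive variant would correspond to a short warm-up, a small interval, or a patience of one.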
## Limitations
CADFT builds on signals derived from the model’s own likelihood estimates and therefore inherits the inductive biases and representational capacity of the underlying backbone\. As a result, the effectiveness of compatibility estimation is naturally bounded by the model’s current expressive power and pretraining quality\.
## References
- A. Abdolmaleki, B. Piot, B. Shahriari, J. T. Springenberg, T. Hertweck, M. Bloesch, R. Joshi, T. Lampe, J. Oh, N. Heess, J. Buchli, and M. A. Riedmiller (2025). Learning from negative feedback, or positive feedback or both. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. [Link](https://openreview.net/forum?id=4FVGowGzQb)
- A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. B. Brown, J. Clark, S. McCandlish, C. Olah, and J. Kaplan (2021). A general language assistant as a laboratory for alignment. CoRR abs/2112.00861. [Link](https://arxiv.org/abs/2112.00861)
- S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin (2025). Qwen2.5-vl technical report. CoRR abs/2502.13923. [Link](https://doi.org/10.48550/arXiv.2502.13923)
- Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. E. Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. B. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. CoRR abs/2204.05862. [Link](https://doi.org/10.48550/arXiv.2204.05862)
- Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, A. P. Danyluk, L. Bottou, and M. L. Littman (Eds.), ACM International Conference Proceeding Series, Vol. 382, pp. 41–48. [Link](https://doi.org/10.1145/1553374.1553380)
- F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin, D. Pinckney, M. Yee, Y. Zi, C. J. Anderson, M. Q. Feldman, A. Guha, M. Greenberg, and A. Jangda (2023). MultiPL-e: A scalable and polyglot approach to benchmarking neural code generation. IEEE Trans. Software Eng. 49(7), pp. 3675–3691. [Link](https://doi.org/10.1109/TSE.2023.3267446)
- H. Chen, K. Zheng, Q. Zhang, G. Cui, Y. Cui, H. Ye, T. Lin, M. Liu, J. Zhu, and H. Wang (2025). Bridging supervised learning and reinforcement learning in math reasoning. CoRR abs/2505.18116. [Link](https://doi.org/10.48550/arXiv.2505.18116)
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba (2021). Evaluating large language models trained on code. CoRR abs/2107.03374. [Link](https://arxiv.org/abs/2107.03374)
- P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei (2017). Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 4299–4307. [Link](https://proceedings.neurips.cc/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html)
- T. Chu, Y. Zhai, J. Yang, S. Tong, S. Xie, D. Schuurmans, Q. V. Le, S. Levine, and Y. Ma (2025). SFT memorizes, RL generalizes: A comparative study of foundation model post-training. In Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025. [Link](https://openreview.net/forum?id=dYur3yabMj)
- H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Y. Zhao, Y. Huang, A. M. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei (2024). Scaling instruction-finetuned language models. J. Mach. Learn. Res. 25, pp. 70:1–70:53. [Link](https://jmlr.org/papers/v25/23-0870.html)
- DeepSeek-AI (2025). DeepSeek-r1: incentivizing reasoning capability in llms via reinforcement learning. CoRR abs/2501.12948. [Link](https://doi.org/10.48550/arXiv.2501.12948)
- Y. Du, Z. Li, P. Cheng, Z. Chen, Y. Xie, X. Wan, and A. Gao (2025). Simplify RLHF as reward-weighted SFT: A variational method. CoRR abs/2502.11026. [Link](https://doi.org/10.48550/arXiv.2502.11026)
- C. He, R. Luo, Y. Bai, S. Hu, Z. L. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, J. Liu, L. Qi, Z. Liu, and M. Sun (2024). OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, L. Ku, A. Martins, and V. Srikumar (Eds.), pp. 3828–3850. [Link](https://doi.org/10.18653/v1/2024.acl-long.211)
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021). Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, J. Vanschoren and S. Yeung (Eds.). [Link](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html)
- H. Hu, Y. Zhou, Q. Wang, Y. Zou, C. Ma, J. Si, J. Liu, Z. Yu, L. Cui, and F. Ma (2025). From pattern recognizers to personalized companions: a survey of large language models in mental health.
- B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Dang, A. Yang, R. Men, F. Huang, X. Ren, X. Ren, J. Zhou, and J. Lin (2024). Qwen2.5-coder technical report. CoRR abs/2409.12186. [Link](https://doi.org/10.48550/arXiv.2409.12186)
- A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, G. Gur-Ari, and V. Misra (2022). Solving quantitative reasoning problems with language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.). [Link](http://papers.nips.cc/paper_files/paper/2022/hash/18abbeef8cfe9203fdf9053c9c4fe191-Abstract-Conference.html)
- T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár (2017). Focal loss for dense object detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2999–3007. [Link](https://doi.org/10.1109/ICCV.2017.324)
- J. Liu, C. S. Xia, Y. Wang, and L. Zhang (2023). Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.). [Link](http://papers.nips.cc/paper_files/paper/2023/hash/43e9d647ccd3e4b7b5baab53f0368686-Abstract-Conference.html)
- L. Liu and M. Zhang (2025). Less is more: selective reflection for compatible and efficient knowledge distillation in large language models. CoRR abs/2508.06135. [Link](https://doi.org/10.48550/arXiv.2508.06135)
- A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y. Zhu, and R. Martín-Martín (2021). What matters in learning from offline human demonstrations for robot manipulation. In Conference on Robot Learning, 8-11 November 2021, London, UK, A. Faust, D. Hsu, and G. Neumann (Eds.), Proceedings of Machine Learning Research, Vol. 164, pp. 1678–1690. [Link](https://proceedings.mlr.press/v164/mandlekar22a.html)
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe (2022). Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.). [Link](http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html)
- R. Qiao, Q. Tan, G. Dong, M. Wu, C. Sun, X. Song, J. Wang, Z. Gongque, S. Lei, Y. Zhang, Z. Wei, M. Zhang, R. Qiao, X. Zong, Y. Xu, P. Yang, Z. Bao, M. Diao, C. Li, and H. Zhang (2025). We-math: does your large multimodal model achieve human-like mathematical reasoning?. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), pp. 20023–20070. [Link](https://aclanthology.org/2025.acl-long.983/)
- C. Qin and J. T. Springenberg (2025). Supervised fine tuning on curated data is reinforcement learning (and can be improved). CoRR abs/2507.12856. [Link](https://doi.org/10.48550/arXiv.2507.12856)
- R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023). Direct preference optimization: your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.). [Link](http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html)
- Z. Rao, Y. Zhou, B. Zhou, Y. Huang, S. Escalera, and J. Wan (2025). RVLF: a reinforcing vision-language framework for gloss-free sign language translation. arXiv preprint arXiv:2512.07273.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. CoRR abs/1707.06347. [Link](http://arxiv.org/abs/1707.06347)
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. CoRR abs/2402.03300. [Link](https://doi.org/10.48550/arXiv.2402.03300)
- H. Song, Y. Zhou, J. Shen, and Y. Cheng (2026). From broad exploration to stable synthesis: entropy-guided optimization for autoregressive image generation. In The Fourteenth International Conference on Learning Representations.
- E. Strubell, A. Ganesh, and A. McCallum (2019). Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 3645–3650. [Link](https://doi.org/10.18653/v1/p19-1355)
- G. Swamy, S. Choudhury, W. Sun, Z. S. Wu, and J. A. Bagnell (2025). All roads lead to likelihood: the value of reinforcement learning in fine-tuning. CoRR abs/2503.01067. [Link](https://doi.org/10.48550/arXiv.2503.01067)
- L. Team (2024). The llama 3 herd of models. CoRR abs/2407.21783. [Link](https://doi.org/10.48550/arXiv.2407.21783)
- B. Wang, Q. Cheng, R. Peng, R. Bao, P. Li, Q. Guo, L. Li, Z. Zeng, Y. Zhou, and X. Qiu (2025). Implicit reward as the bridge: A unified view of SFT and DPO connections. CoRR abs/2507.00018. [Link](https://doi.org/10.48550/arXiv.2507.00018)
- K. Wang, J. Pan, W. Shi, Z. Lu, H. Ren, A. Zhou, M. Zhan, and H. Li (2024). Measuring multimodal mathematical reasoning with math-vision dataset. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang (Eds.). [Link](http://papers.nips.cc/paper_files/paper/2024/hash/ad0edc7d5fa1a783f063646968b7315b-Abstract-Datasets_and_Benchmarks_Track.html)
- J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le (2022). Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. [Link](https://openreview.net/forum?id=gEZrGCozdqR)
- Y. Wu, Y. Zhou, Z. Ziheng, Y. Peng, X. Ye, X. Hu, W. Zhu, L. Qi, M. Yang, and X. Yang (2025). On the generalization of SFT: A reinforcement learning perspective with reward rectification. CoRR abs/2508.05629. [Link](https://doi.org/10.48550/arXiv.2508.05629)
- A. Yang, B. Zhang, B. Hui, B. Gao, B. Yu, C. Li, D. Liu, J. Tu, J. Zhou, J. Lin, K. Lu, M. Xue, R. Lin, T. Liu, X. Ren, and Z. Zhang (2024). Qwen2.5-math technical report: toward mathematical expert model via self-improvement. CoRR abs/2409.12122. [Link](https://doi.org/10.48550/arXiv.2409.12122)
- H. Yang, Y. Zhou, W. Han, and J. Shen (2025). Self-rewarding large vision-language models for optimizing prompts in text-to-image generation. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 7332–7349.
- J. Yuan, Z. Cui, H. Wang, Y. Gao, Y. Zhou, and U. Naseem (2025). Kardia-r1: unleashing llms to reason toward understanding and empathy for emotional support via rubric-as-judge reinforcement learning. arXiv preprint arXiv:2512.01282.
- R. Zhang, D. Jiang, Y. Zhang, H. Lin, Z. Guo, P. Qiu, A. Zhou, P. Lu, K. Chang, Y. Qiao, P. Gao, and H. Li (2024). MATHVERSE: does your multi-modal LLM truly see the diagrams in visual math problems?. In ECCV 2024, A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, and G. Varol (Eds.), Vol. 15066, pp. 169–186. [Link](https://doi.org/10.1007/978-3-031-73242-3_10)
- S. Zhang, S. Wu, O. Irsoy, S. Lu, M. Bansal, M. Dredze, and D. S. Rosenberg (2023). MixCE: training autoregressive language models by mixing forward and reverse cross-entropies. In ACL 2023, pp. 9027–9050. [Link](https://doi.org/10.18653/v1/2023.acl-long.502)
- H. Zheng, Y. Zhou, T. Yan, D. Chen, H. Lu, W. Liao, T. He, P. Peng, and J. Shen (2026). Clinical cognition alignment for gastrointestinal diagnosis with multimodal llms. arXiv preprint arXiv:2603.20698.
- H. Zheng, Y. Zhou, T. Yan, J. Su, H. Chen, D. Chen, X. Gui, W. Han, R. Tao, Z. Qiu, et al. (2025). From human intention to action prediction: intention-driven end-to-end autonomous driving. arXiv preprint arXiv:2512.12302.
- C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy (2023). LIMA: less is more for alignment. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.). [Link](http://papers.nips.cc/paper_files/paper/2023/hash/ac662d74829e4407ce1d126477f4a03a-Abstract-Conference.html)
- Y. Zhou, X. Li, Q. Wang, and J. Shen (2024). Visual in-context learning for large vision-language models. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 15890–15902.
- Y. Zhou, J. Shen, and Y. Cheng (2025a). Weak to strong generalization for large language models with multi-capabilities. In The Thirteenth International Conference on Learning Representations.
- Y. Zhou, L. Song, and J. Shen (2025b). Improving medical large vision-language models with abnormal-aware feedback. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12994–13011.
- Y. Zhou, J. Zhang, G. Chen, J. Shen, and Y. Cheng (2026a). Less is more: vision representation compression for efficient video generation with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40, pp. 13826–13834.
- Y\. Zhou, H\. Zheng, D\. Chen, H\. Yang, W\. Han, and J\. Shen \(2026b\) From medical llms to versatile medical agents: a comprehensive survey\. Authorea Preprints\. Cited by: [§2\.1](https://arxiv.org/html/2606.11206#S2.SS1.p1.1)\.