GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification
Summary
GFT (Group Fine-Tuning) is a unified post-training framework for LLMs that addresses limitations of supervised fine-tuning by using Group Advantage Learning and Dynamic Coefficient Rectification to improve training stability and generalization. The paper shows SFT can be interpreted as a special case of policy gradient optimization with sparse implicit rewards, and GFT consistently outperforms SFT-based methods while integrating more smoothly with subsequent RL training.
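The claim that SFT is a special case of policy gradient optimization can be illustrated with a standard identity. The notation below is an assumption made for illustration (a single prompt x, a single reference response y*, and sequence-level probabilities); the paper's own formulation may differ, for example by working at the token level.

```latex
% Assumed notation: \pi_\theta is the policy, y^* the reference response for prompt x.
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\nabla_\theta \log \pi_\theta(y^* \mid x)
  = -\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \frac{\mathbb{1}[y = y^*]}{\pi_\theta(y \mid x)}
           \, \nabla_\theta \log \pi_\theta(y \mid x) \right]
```

Read this way, the indicator acts as a reward that is nonzero only on the single reference response (the sparse implicit reward), and the factor 1/π_θ(y | x) is an inverse-probability weight that grows without bound as the policy assigns low probability to that response.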
Cached at: 04/21/26, 07:20 AM
Source: https://huggingface.co/papers/2604.14258
Abstract
Group Fine-Tuning addresses limitations in supervised fine-tuning by using diverse response groups and adaptive weight bounding to improve training stability and efficiency.
Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
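The two mechanisms named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the group normalization shown is the common mean/standard-deviation standardization (the paper's "unbiased" estimator may differ), and the fixed cap `max_weight` stands in for whatever adaptive bound Dynamic Coefficient Rectification actually uses.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one response group.

    Assumption: mean/std normalization over the group, used here only to
    illustrate "normalized contrastive supervision".
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def bounded_inverse_prob_weight(prob, max_weight=5.0):
    """Inverse-probability weight 1/p, capped at max_weight.

    Assumption: a fixed cap, chosen for illustration; the paper describes
    an adaptive bound whose exact form is not given in the abstract.
    """
    return min(1.0 / max(prob, 1e-12), max_weight)


# A group of four responses to the same prompt, two rewarded and two not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)

# A likely response keeps its raw weight; an unlikely one is capped.
weights = [bounded_inverse_prob_weight(p) for p in [0.5, 0.01]]  # [2.0, 5.0]
```

In this sketch every response in the group receives a nonzero signal relative to its peers, which is one way to read "alleviating reward sparsity", and the cap prevents the weight from blowing up when the policy assigns a response very low probability.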
Datasets citing this paper: 1
OmniAI-ZJU/NuminaMath-Cot-Distillation-100K (updated about 18 hours ago)
Similar Articles
@maximelabonne: This is so neat! Dynamic Fine-Tuning (DFT) reweights the SFT loss by the model's own token probability, which creates a…
Dynamic Fine-Tuning (DFT) is introduced as a method that reweights the SFT loss using the model's own token probability, creating a feedback loop, and adds forward KL to penalize tokens the base model finds likely but the policy has pushed toward zero probability. The tweet expresses skepticism about SFT papers in practice but praises the attempt.
GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training
The paper proposes GAC, a noise-aware adaptive mixing controller for hybrid SFT-RL post-training of LLMs. It derives a closed-form mixing weight that balances gradient noise and SFT-RL disagreement, achieving consistent improvements across multiple benchmarks with minimal overhead.
Goal-Conditioned Supervised Learning for LLM Fine-Tuning
This paper proposes goal-conditioned supervised learning (GCSL) as an offline fine-tuning framework for LLMs, which treats feedback as an explicit goal and trains models via supervised learning with a novel goal formulation and natural-language goal representations. Evaluated on non-toxic generation, code generation, and recommendation, it outperforms standard offline baselines.
@LakshyAAAgrawal: Learning from rich textual feedback (errors, traces, partial reasoning) beats scalar reward alone for LLM optimization.…
Fast-Slow Training (FST) interleaves context optimization (via GEPA) with model weight updates via RL, achieving 3× sample efficiency over RL alone on math, code, and physics reasoning while preserving plasticity and enabling continual learning.
Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL
This paper introduces FADE (Focal Advantage with Dynamic Entropy), a self-adapting advantage function that dynamically schedules gradient weights during RL post-training of LLMs, achieving faster convergence and better accuracy-diversity trade-offs compared to static baselines.