GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

Hugging Face Daily Papers

Summary

GFT (Group Fine-Tuning) is a unified post-training framework for LLMs that addresses limitations of supervised fine-tuning by using Group Advantage Learning and Dynamic Coefficient Rectification to improve training stability and generalization. The paper shows SFT can be interpreted as a special case of policy gradient optimization with sparse implicit rewards, and GFT consistently outperforms SFT-based methods while integrating more smoothly with subsequent RL training.

Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
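The policy-gradient reading of SFT described above can be made concrete with a standard importance-sampling identity. The following is a sketch under assumed notation, not the paper's own derivation: $x$ is a prompt, $y^\ast$ its reference response from the dataset $\mathcal{D}$, and $\pi_\theta$ the policy being trained.

```latex
\begin{align}
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta)
  &= -\,\mathbb{E}_{(x, y^\ast) \sim \mathcal{D}}
     \big[\nabla_\theta \log \pi_\theta(y^\ast \mid x)\big] \\
  &= -\,\mathbb{E}_{(x, y^\ast) \sim \mathcal{D}}\;
     \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
     \left[\frac{\mathbb{1}[y = y^\ast]}{\pi_\theta(y \mid x)}\,
     \nabla_\theta \log \pi_\theta(y \mid x)\right]
\end{align}
```

Read this way, the indicator $\mathbb{1}[y = y^\ast]$ plays the role of the sparse implicit reward (only the single reference path is rewarded), and the factor $1 / \pi_\theta(y \mid x)$ is the inverse-probability weight, which grows without bound as the policy assigns lower probability to the reference response.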

Cached at: 04/21/26, 07:20 AM


Source: https://huggingface.co/papers/2604.14258

Abstract

Group Fine-Tuning addresses limitations in supervised fine-tuning by using diverse response groups and adaptive weight bounding to improve training stability and efficiency.

Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
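The two mechanisms named in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not give formulas, so the group normalization below follows the common "subtract the group mean, divide by the group standard deviation" pattern, and the weight bound is a fixed cap standing in for the paper's adaptive rule. The function names, the `eps` constant, and `max_weight` are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one response group.

    Assumption: mean/std normalization; the paper's exact estimator
    may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def bounded_inverse_prob_weight(prob, max_weight=10.0):
    """Return 1/prob capped at max_weight.

    Assumption: a fixed cap used as a stand-in for an adaptive bound.
    """
    return min(1.0 / max(prob, 1e-12), max_weight)


# One group of four sampled responses to the same prompt, scored 0 or 1.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)

# A likely response keeps its raw weight; an unlikely one is capped.
weights = [bounded_inverse_prob_weight(p) for p in (0.5, 0.01)]
```

In this toy group, correct responses receive positive advantages and incorrect ones negative, so every sample contributes a training signal rather than only a single reference path. The cap keeps the weight for the low-probability response at 10 instead of 100, which is the kind of bounding the abstract credits with stabilizing optimization.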




Datasets citing this paper: 1

OmniAI-ZJU/NuminaMath-Cot-Distillation-100K (updated about 18 hours ago)


Similar Articles

@maximelabonne: This is so neat! Dynamic Fine-Tuning (DFT) reweights the SFT loss by the model's own token probability, which creates a…

X AI KOLs Following

Dynamic Fine-Tuning (DFT) is introduced as a method that reweights the SFT loss using the model's own token probability, creating a feedback loop, and adds forward KL to penalize tokens the base model finds likely but the policy has pushed toward zero probability. The tweet expresses skepticism about SFT papers in practice but praises the attempt.
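The reweighting described in this summary can be sketched for a single token. This is a minimal illustration under stated assumptions, not the tweet's or the paper's implementation: it assumes the probability used as a weight is held constant when differentiating (a stop-gradient), and it omits the forward KL term entirely. The function names are hypothetical.

```python
import math


def sft_token_loss(p):
    """Standard SFT loss for one target token with probability p."""
    return -math.log(p)


def dft_token_loss(p):
    """SFT loss scaled by the model's own probability of the target token."""
    return -p * math.log(p)


def sft_grad_wrt_p(p):
    """d/dp of -log(p): magnitude grows as 1/p for unlikely tokens."""
    return -1.0 / p


def dft_grad_wrt_p(p):
    """Gradient of the reweighted loss with the weight p held constant.

    Assumption: the weight is treated as a constant, so the 1/p factor
    from the log cancels against it.
    """
    return -p * (1.0 / p)


for p in (0.9, 0.01):
    print(p, sft_grad_wrt_p(p), dft_grad_wrt_p(p))
```

Under that assumption, the reweighted gradient with respect to the token probability stays at magnitude 1 regardless of how unlikely the token is, whereas the plain SFT gradient scales as 1/p.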

GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

arXiv cs.LG

The paper proposes GAC, a noise-aware adaptive mixing controller for hybrid SFT-RL post-training of LLMs. It derives a closed-form mixing weight that balances gradient noise and SFT-RL disagreement, achieving consistent improvements across multiple benchmarks with minimal overhead.

Goal-Conditioned Supervised Learning for LLM Fine-Tuning

arXiv cs.LG

This paper proposes goal-conditioned supervised learning (GCSL) as an offline fine-tuning framework for LLMs, which treats feedback as an explicit goal and trains models via supervised learning with a novel goal formulation and natural-language goal representations. Evaluated on non-toxic generation, code generation, and recommendation, it outperforms standard offline baselines.

Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL

arXiv cs.LG

This paper introduces FADE (Focal Advantage with Dynamic Entropy), a self-adapting advantage function that dynamically schedules gradient weights during RL post-training of LLMs, achieving faster convergence and better accuracy-diversity trade-offs compared to static baselines.