GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

Hugging Face Daily Papers 04/15/26, 12:00 AM Papers

Summary

GFT (Group Fine-Tuning) is a unified post-training framework for LLMs that addresses limitations of supervised fine-tuning by using Group Advantage Learning and Dynamic Coefficient Rectification to improve training stability and generalization. The paper shows SFT can be interpreted as a special case of policy gradient optimization with sparse implicit rewards, and GFT consistently outperforms SFT-based methods while integrating more smoothly with subsequent RL training.

Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
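One standard way to see the "SFT as policy gradient" reading is the identity below. It is a sketch in notation assumed here ($\pi_\theta$ for the policy, $x$ for the prompt, $y^\star$ for the reference response), not taken from the paper, whose own derivation may differ in detail. The indicator plays the role of the sparse implicit reward, and the factor $1/\pi_\theta(y \mid x)$ is the inverse-probability weight, which grows without bound as the policy assigns little mass to the reference.

```latex
\nabla_\theta \log \pi_\theta(y^\star \mid x)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[
      \frac{\mathbb{1}[y = y^\star]}{\pi_\theta(y \mid x)}
      \, \nabla_\theta \log \pi_\theta(y \mid x)
    \right],
  \qquad \pi_\theta(y^\star \mid x) > 0 .
```

The expectation collapses to the single term $y = y^\star$, where $\pi_\theta(y^\star \mid x)$ cancels against the weight, which is why only one response path ever receives a nonzero reward.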


Cached at: 04/21/26, 07:20 AM

Source: https://huggingface.co/papers/2604.14258

Abstract

Group Fine-Tuning addresses limitations in supervised fine-tuning by using diverse response groups and adaptive weight bounding to improve training stability and efficiency.

Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
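The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the within-group standardization is borrowed from the common GRPO-style recipe, the fixed cap `max_weight` stands in for whatever adaptive bound the authors use, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: plain mean/std normalization; the paper's "unbiased"
    group advantage may be defined differently.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def bounded_inverse_prob_weight(logprob, max_weight=10.0):
    """Return 1 / pi(y | x), capped at max_weight.

    Assumption: a fixed cap is used here for simplicity; the abstract
    describes an adaptive bound whose form is not given on this page.
    """
    return min(math.exp(-logprob), max_weight)


# Four sampled responses to one prompt, scored 1 (correct) or 0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_advantages(rewards)

# A likely response (log-prob -0.1) and an unlikely one (log-prob -5.0).
weights = [bounded_inverse_prob_weight(lp) for lp in (-0.1, -5.0)]

print(advs)
print(weights)
```

Every response in the group receives a signed learning signal, in contrast to the single nonzero reward of plain SFT, and the cap keeps a low-probability response from producing an arbitrarily large weight.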

Get this paper in your agent:

hf papers read 2604.14258

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Datasets citing this paper: 1

#### OmniAI-ZJU/NuminaMath-Cot-Distillation-100K Updated about 18 hours ago

Similar Articles

A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

LiFT: Does Instruction Fine-Tuning Improve In-Context Learning for Longitudinal Modelling by Large Language Models?

Gradient Extrapolation-Based Policy Optimization

FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
