GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

Hugging Face Daily Papers

Summary

GFT (Group Fine-Tuning) is a unified post-training framework for LLMs that addresses limitations of supervised fine-tuning by using Group Advantage Learning and Dynamic Coefficient Rectification to improve training stability and generalization. The paper shows SFT can be interpreted as a special case of policy gradient optimization with sparse implicit rewards, and GFT consistently outperforms SFT-based methods while integrating more smoothly with subsequent RL training.


Source: https://huggingface.co/papers/2604.14258

Abstract

Group Fine-Tuning addresses limitations in supervised fine-tuning by using diverse response groups and adaptive weight bounding to improve training stability and efficiency.

Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
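The policy-gradient reading of SFT can be made concrete with a short derivation; the notation below is generic rather than the paper's own. For a prompt x with reference response y*, the SFT gradient is a log-likelihood gradient, and an importance-sampling identity rewrites it as an on-policy expectation:

\[
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}
= -\nabla_\theta \log \pi_\theta(y^{*} \mid x)
= -\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[ \frac{\mathbb{1}[y = y^{*}]}{\pi_\theta(y \mid x)} \, \nabla_\theta \log \pi_\theta(y \mid x) \right].
\]

Viewed this way, the implicit reward \(\mathbb{1}[y = y^{*}]\) is nonzero on exactly one sequence, which is the extreme sparsity and single-path dependency the abstract describes, and the inverse-probability coefficient \(1/\pi_\theta(y^{*} \mid x)\) grows without bound whenever the reference response is unlikely under the current policy, which is the source of the unstable weighting and gradient explosion.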

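This page gives no formulas for the two mechanisms, so the PyTorch sketch below is one plausible reading rather than the paper's implementation: group-wise mean/std normalization (GRPO-style) and a hard cap c_max on the inverse-probability weight are our assumptions, as are all function names.

import math
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Group Advantage Learning (sketch): rewards has shape (G,), one scalar
    # per response in a group of G >= 2 diverse samples. Normalizing within
    # the group turns a sparse reward into a dense, zero-mean contrastive
    # signal.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def rectified_coefficient(seq_logprobs: torch.Tensor, c_max: float = 10.0) -> torch.Tensor:
    # Dynamic Coefficient Rectification (sketch): seq_logprobs has shape (G,)
    # and holds log pi_theta(y | x) per response. The raw weight
    # exp(-logprob) explodes for unlikely responses; capping it in log space
    # avoids overflow before the bound is applied.
    return (-seq_logprobs).clamp(max=math.log(c_max)).exp()

def gft_surrogate_loss(seq_logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # Policy-gradient-style surrogate: each response's log-likelihood is
    # weighted by its group advantage and a bounded coefficient, with the
    # weights detached so only the log-prob term is differentiated.
    adv = group_advantages(rewards)
    coeff = rectified_coefficient(seq_logprobs).detach()
    return -(coeff * adv * seq_logprobs).mean()

How the response groups are sampled, how rewards are defined, and how this stage hands off to subsequent RL training are all outside what this page specifies.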

Get this paper in your agent:

hf papers read 2604.14258

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Datasets citing this paper: 1

OmniAI-ZJU/NuminaMath-Cot-Distillation-100K (updated about 18 hours ago)


Similar Articles

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

arXiv cs.CL

This paper proposes a Mixture of LoRA and Full (MoLF) fine-tuning framework that uses gradient-guided optimizer routing to adaptively switch between LoRA and full fine-tuning. It aims to overcome the structural limitations of relying solely on static adaptation methods by combining the plasticity of full tuning with the regularization of LoRA.

Gradient Extrapolation-Based Policy Optimization

arXiv cs.LG

The article introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It demonstrates improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.

FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

Hugging Face Daily Papers

The paper introduces FocuSFT, a bilevel optimization framework that enhances long-context language model performance by addressing attention dilution through parametric memory. It demonstrates significant improvements in accuracy and context engagement on benchmarks like BABILong and RULER.