GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification
Summary
GFT (Group Fine-Tuning) is a unified post-training framework for LLMs that addresses limitations of supervised fine-tuning by using Group Advantage Learning and Dynamic Coefficient Rectification to improve training stability and generalization. The paper shows SFT can be interpreted as a special case of policy gradient optimization with sparse implicit rewards, and GFT consistently outperforms SFT-based methods while integrating more smoothly with subsequent RL training.
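One way to read the claim that SFT is a special case of policy gradient optimization is the standard identity below; this is a generic derivation consistent with the abstract, not the paper's own formulation.

```latex
% SFT gradient on a single demonstration y* for prompt x:
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}
  = -\,\nabla_\theta \log \pi_\theta(y^\star \mid x)
  = -\,\mathbb{E}_{y \sim \pi_\theta}\!\left[
      \frac{\mathbb{1}[y = y^\star]}{\pi_\theta(y \mid x)}\,
      \nabla_\theta \log \pi_\theta(y \mid x)\right].
% Read as a policy gradient, the implicit reward r(y) = 1[y = y*] / pi_theta(y | x)
% is nonzero only on the demonstrated path (extreme sparsity) and carries an
% unbounded inverse-probability weight (the source of gradient instability).
```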
Source: https://huggingface.co/papers/2604.14258
Abstract
Group Fine-Tuning addresses limitations in supervised fine-tuning by using diverse response groups and adaptive weight bounding to improve training stability and efficiency.
Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
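Since this page does not include GFT's actual objective, the following is a minimal sketch of how the two mechanisms described in the abstract could fit together; the function name, the group normalization, and the `coeff_max` bound are assumptions for illustration, not the paper's specification.

```python
import torch

def gft_style_loss(token_logprobs, mask, rewards, coeff_max=10.0, eps=1e-6):
    """Sketch of a group-advantage objective with bounded inverse-probability weights.

    token_logprobs: (G, T) per-token log-probs of G sampled responses under the policy.
    mask:           (G, T) 1 for response tokens, 0 for padding.
    rewards:        (G,)   scalar rewards for the G responses.
    """
    # Group Advantage Learning (assumed form): normalize rewards within the group,
    # yielding contrastive supervision instead of a single sparse target.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)            # (G,)

    # Dynamic Coefficient Rectification (assumed form): per-token inverse-probability
    # weights 1 / pi_theta(y_t | x, y_<t), bounded so rare tokens cannot blow up the gradient.
    coeff = torch.exp(-token_logprobs.detach()).clamp(max=coeff_max)    # (G, T)

    # Advantage-weighted, coefficient-rectified log-likelihood (policy-gradient style).
    per_token = coeff * adv.unsqueeze(1) * token_logprobs
    return -(per_token * mask).sum() / mask.sum()
```

Under this reading, setting the group size to 1 and removing the clamp recovers an SFT-like update, which is consistent with the abstract's framing of SFT as a degenerate special case.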
Datasets citing this paper: 1
OmniAI-ZJU/NuminaMath-Cot-Distillation-100K (updated about 18 hours ago)
Similar Articles
A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
This paper introduces A^2TGPO, a reinforcement learning method for agentic LLMs that uses adaptive turn-level clipping and information gain normalization to improve process credit assignment in multi-turn interactions.
Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
This paper proposes a Mixture of LoRA and Full (MoLF) fine-tuning framework that uses gradient-guided optimizer routing to adaptively switch between LoRA and full fine-tuning. It aims to overcome the structural limitations of relying solely on static adaptation methods by combining the plasticity of full tuning with the regularization of LoRA.
LiFT: Does Instruction Fine-Tuning Improve In-Context Learning for Longitudinal Modelling by Large Language Models?
LiFT is a longitudinal instruction fine-tuning framework that unifies diverse temporal NLP tasks under a shared instruction schema with curriculum-based training. Evaluated across OLMo, LLaMA, and Qwen models, LiFT consistently outperforms base-model in-context learning, especially on out-of-distribution data and rare change events.
Gradient Extrapolation-Based Policy Optimization
The article introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It demonstrates improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.
FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
The paper introduces FocuSFT, a bilevel optimization framework that enhances long-context language model performance by addressing attention dilution through parametric memory. It demonstrates significant improvements in accuracy and context engagement on benchmarks like BABILong and RULER.