Tag
This paper proposes Correction-Oriented Policy Optimization (CIPO), an extension to reinforcement learning with verifiable rewards (RLVR) that converts failed trajectories into correction-oriented supervision, improving reasoning and correction performance in LLMs across math and code benchmarks.
This paper shows that missing observations in inverse reinforcement learning (IRL) can make expert actions appear suboptimal. It develops a practical algorithm to quantify the minimal perturbations needed for expert actions to appear optimal, validated on synthetic tasks, a cancer treatment simulation, and ICU data.
GFT (Group Fine-Tuning) is a unified post-training framework for LLMs that addresses limitations of supervised fine-tuning (SFT) by using Group Advantage Learning and Dynamic Coefficient Rectification to improve training stability and generalization. The paper shows that SFT can be interpreted as a special case of policy gradient optimization with sparse implicit rewards, and that GFT consistently outperforms SFT-based methods while integrating more smoothly with subsequent RL training.