reward-learning

Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

arXiv cs.CL · 2026-05-15

Proposes Correction-Oriented Policy Optimization (CIPO), an extension of reinforcement learning with verifiable rewards (RLVR) that converts failed trajectories into correction-oriented supervision, improving both the reasoning and the correction performance of LLMs on math and code benchmarks.

Quantifying Potential Observation Missingness in Inverse Reinforcement Learning

arXiv cs.LG · 2026-05-14

Identifies how missing observations in inverse reinforcement learning (IRL) can make expert actions appear suboptimal, and develops a practical algorithm that quantifies the minimal perturbations needed for those actions to appear optimal. The approach is validated on synthetic tasks, a cancer treatment simulation, and ICU data.

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

Hugging Face Daily Papers · 2026-04-15

GFT (Group Fine-Tuning) is a unified post-training framework for LLMs that addresses limitations of supervised fine-tuning (SFT) by using Group Advantage Learning and Dynamic Coefficient Rectification to improve training stability and generalization. The paper shows that SFT can be interpreted as a special case of policy gradient optimization with sparse implicit rewards, and reports that GFT consistently outperforms SFT-based methods while integrating more smoothly with subsequent RL training.
