LLM agents often misjudge their own performance after observing environment feedback, a problem known as the reflection gap. RefGRPO addresses this by augmenting reinforcement learning (RL) with a free calibration bonus and dynamic scheduling, which reduces underconfidence from 44.4% to 7.7% and improves task accuracy on text-to-SQL benchmarks.