This paper introduces POISE, a method for stable policy optimization in large reasoning models. POISE estimates the baseline from the model's own internal states, which reduces computational overhead compared to PPO and GRPO.