value-estimation

#value-estimation

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

Hugging Face Daily Papers ↗ · 2026-05-08 Cached

This paper introduces POISE, a method for stable policy optimization in large reasoning models by estimating baselines using the model's own internal states, reducing computational overhead compared to PPO and GRPO.

0 favorites 0 likes

value-estimation

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

Submit Feedback