Value-Gradient Hypothesis of RL for LLMs

arXiv cs.LG · 2026-05-22

This paper introduces the value-gradient hypothesis to explain why critic-free RL methods like PPO and GRPO work well for LLMs, showing that the actor backward pass carries a value-gradient-like signal. It derives a predictive criterion for when RL is most effective along the pretraining trajectory.
