value-gradient

Value-Gradient Hypothesis of RL for LLMs

arXiv cs.LG · 2026-05-22

This paper introduces the value-gradient hypothesis to explain why critic-free RL methods like PPO and GRPO work well for LLMs, showing that the actor backward pass carries a value-gradient-like signal. It derives a predictive criterion for when RL is most effective along the pretraining trajectory.
