position-weighted

Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

Hugging Face Daily Papers · 2026-06-09

This paper introduces CPPO, a method that improves reinforcement learning with verifiable rewards for LLMs by using position-weighted thresholds and cumulative prefix budgeting to address limitations of uniform token-level trust regions.
