@RyanBoldi: Your RL post-training may be sabotaging your LLM’s test-time scaling! Conventional RL pretends that you can collapse al…
Summary
Introduces Vector Policy Optimization (VPO), a new RL method that handles vector-valued rewards to improve test-time scaling for LLMs, outperforming conventional scalar reward approaches.
Cached at: 05/22/26, 03:50 PM
Your RL post-training may be sabotaging your LLM’s test-time scaling!
Conventional RL pretends that you can collapse all reward signals upfront into a single scalar reward. We introduce Vector Policy Optimization (VPO), which natively maximizes vector-valued rewards, boosting test-time search performance, even when measured on the original scalar reward.
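To make the "collapse upfront" point concrete, here is a minimal sketch of what a fixed scalarization throws away. It is not VPO itself, whose algorithm the post does not spell out. The candidate reward vectors, the weights, and the helper names (`scalarize`, `dominates`, `pareto_front`) are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def scalarize(reward: Sequence[float], weights: Sequence[float]) -> float:
    """Collapse a reward vector into one scalar using fixed weights."""
    return sum(r * w for r, w in zip(reward, weights))


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(rewards: List[Sequence[float]]) -> List[int]:
    """Indices of reward vectors that no other vector dominates."""
    return [
        i
        for i, r in enumerate(rewards)
        if not any(dominates(o, r) for j, o in enumerate(rewards) if j != i)
    ]


# Hypothetical two-dimensional rewards for four sampled responses
# (made-up numbers, not results from the post).
candidates = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.2, 0.2)]
weights = (0.5, 0.5)  # assumed fixed scalarization weights

scalars = [scalarize(r, weights) for r in candidates]
front = pareto_front(candidates)

print("scalar rewards:", scalars)
print("Pareto-optimal indices:", front)
```

Under these assumed numbers, the first three responses receive the same scalar reward even though they excel on different dimensions, so a scalar objective cannot tell them apart. The vector view keeps all three as distinct, non-dominated candidates, which is the kind of diversity a test-time search can exploit.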
Similar Articles
Vector Policy Optimization: Training for Diversity Improves Test-Time Search
This paper introduces Vector Policy Optimization (VPO), a reinforcement learning algorithm that trains LLMs to produce diverse solutions by optimizing across multiple reward dimensions, significantly improving test-time search performance compared to scalar RL baselines.
@ishapuri101: It's never made sense to me that RL collapses all reward signals to a single scalar. Today, we fix that! Introducing Ve…
Introduces Vector Policy Optimization (VPO) to train models with vector-valued rewards instead of scalar rewards, enabling diverse answer sets for test-time search.
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
This paper introduces Listwise Policy Optimization (LPO), a method for RLVR that explicitly handles target projection via divergence minimization on the response simplex to improve training stability and performance in LLMs.
Value-Gradient Hypothesis of RL for LLMs
This paper introduces the value-gradient hypothesis to explain why critic-free RL methods like PPO and GRPO work well for LLMs, showing that the actor backward pass carries a value-gradient-like signal. It derives a predictive criterion for when RL is most effective along the pretraining trajectory.
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
Proposes Correction-Oriented Policy Optimization (CIPO), an extension to RLVR that converts failed trajectories into correction-oriented supervision, improving reasoning and correction performance in LLMs across math and code benchmarks.