@RyanBoldi: Your RL post-training may be sabotaging your LLM’s test-time scaling! Conventional RL pretends that you can collapse al…

X AI KOLs Following Papers

Summary

Introduces Vector Policy Optimization (VPO), a new RL method that handles vector-valued rewards to improve test-time scaling for LLMs, outperforming conventional scalar reward approaches.

Your RL post-training may be sabotaging your LLM’s test-time scaling! Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*. We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test time search performance, even on the original scalar.
Original Article
View Cached Full Text

Cached at: 05/22/26, 03:50 PM

Your RL post-training may be sabotaging your LLM’s test-time scaling!

Conventional RL pretends that you can collapse all reward signals upfront into a single scalar reward. We introduce Vector Policy Optimization (VPO), which natively maximizes vector-valued rewards, boosting test time search performance, even on the original scalar.

Similar Articles

Value-Gradient Hypothesis of RL for LLMs

arXiv cs.LG

This paper introduces the value-gradient hypothesis to explain why critic-free RL methods like PPO and GRPO work well for LLMs, showing that the actor backward pass carries a value-gradient-like signal. It derives a predictive criterion for when RL is most effective along the pretraining trajectory.