Tag
A new feature called OpenResearch allows reproducing and experimenting on papers, with a one-click template to train Vector Policy Optimization (VPO) on ToolRL, enabling diverse answer generation and improved test-time search.
Introduces Vector Policy Optimization (VPO), a new RL method that handles vector-valued rewards to improve test-time scaling for LLMs, outperforming conventional scalar reward approaches.