This paper introduces Vector Policy Optimization (VPO), a reinforcement learning algorithm that trains LLMs to produce diverse solutions by optimizing across multiple reward dimensions. The resulting models significantly outperform scalar RL baselines in test-time search.
This paper investigates how 1D coarse-to-fine token structures in autoregressive models improve test-time search efficiency compared to classical 2D grid tokenization. The authors show that such ordered tokens enable better test-time scaling and even training-free text-to-image generation when guided by image-text verifiers.