This paper proposes PTD-PO, a privileged tutoring distillation framework that provides dense token-level supervision for reinforcement learning with verifiable rewards in multimodal reasoning tasks without exposing the answer. It uses structured hints and a Top-K Jensen-Shannon (JS) divergence objective to stabilize training, and it consistently outperforms existing methods on 2B-8B LVLMs.
A comprehensive guide to 15 policy optimization and preference optimization techniques important in 2026, including GRPO, DPO, REINFORCE++, and many newer variants, mapping the landscape of reasoning RL methods.
Harvard researchers challenge the standard LLM training pipeline by showing that RL can be applied effectively during pre-training rather than only after SFT. They find that data composition matters more than model scale, and they propose parallel averaging of RL and SFT objectives, which outperforms sequential approaches while preserving general capabilities.
This paper introduces One-to-Many Temporal Grounding (OMTG), a new task for localizing multiple disjoint video segments from a single text query, along with a benchmark, evaluation metrics, a 56k-sample dataset, and novel reward functions that achieve state-of-the-art results, outperforming Gemini 2.5 Pro and Seed-1.8.
This paper introduces Hint-Guided Diversified Policy Optimization (HDPO), a two-stage RL framework that encourages LLMs to first generate multiple candidate solution outlines (hints) and then select the most reliable one for detailed reasoning, improving reasoning diversity and reliability.
Fair Reinforcement Learning introduces Democratic Alignment to incorporate multiple competing value sets from different agents, overcoming traditional RLHF limitations, and achieves orders of magnitude faster optimization via a black-box policy wrapper.
This paper introduces Posterior Hybrid Bayesian Belief (PhyB), a framework that reformulates the expectation in Bayesian RL as a convex combination over dynamics models, enabling efficient regularized offline policy optimization with bounded objective discrepancy and state-of-the-art performance.
Moment Matching Q-Learning (MoMa QL) uses maximum mean discrepancy to match all moment statistics for distribution-level convergence in offline RL, achieving computational efficiency and strong performance on D4RL tasks.
This paper introduces Guidance Contrastive Policy Optimization (GCPO), a novel algorithm that enables per-token credit assignment in reinforcement learning by contrasting model predictions under positive and negative prompts, consistently outperforming GRPO and DAPO baselines on text-to-image generation and chain-of-thought reasoning benchmarks.
Introduces Belief Entropy and Metacognitive Memory Policy Optimization (MMPO) to improve memory quality in long-horizon LLM agents, outperforming existing methods and maintaining performance over long contexts.
Proposes Model-Based Diffusion Policy Optimization (MBDPO), a framework that unifies search and policy optimization in world models using diffusion policy representations, achieving consistent scaling behavior and superior performance across offline and online reinforcement learning tasks.
RICE-PO is a critic-free policy optimization framework that turns retrieval interactions into localized credit signals for training reasoning agents, outperforming prompt-based and group-based RL baselines on BRIGHT and BEIR benchmarks.
Introduces GORMPO, a density-regularized offline RL algorithm that uses generative density modeling to restrict policy updates to high-density areas, achieving a 17% improvement on a real-world medical dataset and outperforming state-of-the-art baselines.
Introduces temporal scheduling for credit allocation criteria in reinforcement learning with verifiable rewards, showing that scheduling when learning signals are applied improves policy evolution and stability.
This paper introduces Vector Policy Optimization (VPO), a reinforcement learning algorithm that trains LLMs to produce diverse solutions by optimizing across multiple reward dimensions, significantly improving test-time search performance compared to scalar RL baselines.
Introduces Vector Policy Optimization (VPO) to train models with vector-valued rewards instead of scalar rewards, enabling diverse answer sets for test-time search.
Proposes TEMPO, a policy optimization method that trains LLMs to reason exclusively from pre-cutoff information by using a two-mode reward and GRPO-based training, reducing knowledge leakage by 2–13% while improving task performance by 6–13%.
Introduces LambdaPO, a novel reinforcement learning framework that improves upon GRPO by decomposing advantage estimation into pairwise preference comparisons and adding a semantic density reward, achieving better performance on math reasoning tasks.
This paper identifies two weaknesses in existing reinforcement learning methods for diffusion language models: a lack of temporal credit assignment and biased likelihood estimates. It proposes DACA-GRPO, a plug-and-play enhancement that introduces denoising progress scores and stratified masking likelihood, achieving consistent improvements across reasoning, code generation, and constrained generation benchmarks.
Introduces Implicit Behavior Policy Optimization (IBPO), a counterfactual comparison-based credit assignment framework that improves training stability and performance in multi-step reasoning tasks for large language models by converting sparse terminal rewards into step-sensitive learning signals.
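Several of the entries above (GCPO, TEMPO, LambdaPO, DACA-GRPO) are described relative to GRPO. As background only, here is a minimal sketch of the group-relative advantage that GRPO is generally described as using: each sampled response's reward is normalized against the mean and standard deviation of its own sampling group, so no learned critic is needed. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration; none of the listed papers' actual implementations are reproduced here.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. `eps` (an assumed constant) avoids division by zero when every
    response in the group receives the same reward.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; implementations may differ
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses to one prompt with a binary verifiable reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

With a binary reward, correct responses receive a positive advantage and incorrect ones a negative advantage of equal magnitude. A group in which every response earns the same reward yields all-zero advantages, which is the sparse-signal case that several of the credit-assignment methods listed above aim to address.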