Tag
LongTraceRL introduces tiered distractor construction and a rubric reward design to improve long-context reasoning in language models trained with reinforcement learning. The method generates multi-hop questions via random walks over a knowledge graph and uses search agent trajectories to build challenging distractors, while the rubric reward provides entity-level process supervision.
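As a rough illustration of the random-walk idea in the LongTraceRL summary, the sketch below walks a toy knowledge graph and turns the visited path into a multi-hop question. It is not the paper's implementation: the graph `TOY_KG`, the helpers `random_walk` and `path_to_question`, and the question template are all hypothetical, and distractor construction and the rubric reward are not covered.

```python
import random

# Hypothetical toy knowledge graph: head entity -> list of (relation, tail entity).
TOY_KG = {
    "Marie Curie": [("birthplace", "Warsaw"), ("field", "Physics")],
    "Warsaw": [("country", "Poland")],
    "Poland": [("currency", "Zloty")],
    "Physics": [],
    "Zloty": [],
}


def random_walk(kg, start, hops, rng):
    """Follow up to `hops` random edges from `start`, never revisiting an entity."""
    path = []
    current = start
    visited = {start}
    for _ in range(hops):
        candidates = [(rel, tail) for rel, tail in kg.get(current, [])
                      if tail not in visited]
        if not candidates:
            break  # dead end: return the shorter path
        relation, tail = rng.choice(candidates)
        path.append((current, relation, tail))
        visited.add(tail)
        current = tail
    return path


def path_to_question(path):
    """Nest the relations along the path into one question (assumed template).

    Returns (question, answer), where the answer is the last entity reached,
    or None for an empty path.
    """
    if not path:
        return None
    phrase = path[0][0]  # start entity
    for _, relation, _ in path:
        phrase = f"the {relation} of {phrase}"
    return f"What is {phrase}?", path[-1][2]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the walk is reproducible
    walk = random_walk(TOY_KG, "Marie Curie", hops=3, rng=rng)
    print(walk)
    print(path_to_question(walk))
```

Each intermediate entity on the path is a natural target for entity-level checks, which is one plausible reading of how a rubric could supervise the reasoning process rather than only the final answer.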
Introduces Metacognition-as-Reward (MaR), a reinforcement learning framework that guides LLM reasoning via metacognitive knowledge and regulation signals, achieving up to 11% improvement over vanilla methods on reasoning benchmarks.
Introduces Vector Policy Optimization (VPO) to train models with vector-valued rewards instead of scalar rewards, enabling diverse answer sets for test-time search.
Introduces ClaimDiff-RL, a reinforcement learning framework for long-form image captioning that uses typed, verifiable claim differences as reward units to separately measure and balance hallucination and missing facts, improving faithfulness and coverage.
A new paper introduces an outcome-based reward that quantifies how self-generated world knowledge boosts task success, enabling agents to improve without external guidance at inference.