reward-design


LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Hugging Face Daily Papers · 2026-05-29

LongTraceRL introduces tiered distractor construction and rubric reward design to improve long-context reasoning in language models using reinforcement learning. The method generates multi-hop questions via knowledge graph random walks and uses search agent trajectories to build challenging distractors, with a rubric reward providing entity-level process supervision.


Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

arXiv cs.CL · 2026-05-25

Introduces Metacognition-as-Reward (MaR), a reinforcement learning framework that guides LLM reasoning via metacognitive knowledge and regulation signals, achieving up to 11% improvement over vanilla methods on reasoning benchmarks.


@ishapuri101: It's never made sense to me that RL collapses all reward signals to a single scalar. Today, we fix that! Introducing Ve…

X AI KOLs Timeline · 2026-05-22

Introduces Vector Policy Optimization (VPO) to train models with vector-valued rewards instead of scalar rewards, enabling diverse answer sets for test-time search.


ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

arXiv cs.LG · 2026-05-21

Introduces ClaimDiff-RL, a reinforcement learning framework for long-form image captioning that uses typed, verifiable claim differences as reward units to separately measure and balance hallucination and missing facts, improving faithfulness and coverage.


@dair_ai: How far are we from agents that can self-generate world knowledge? The work proposes an outcome-based reward that measu…

X AI KOLs Following · 2026-04-22

A new paper introduces an outcome-based reward that measures how much self-generated world knowledge improves task success, so that agents can improve without external guidance at inference time.
