Tag
This paper introduces a method that uses knowledge-graph paths as intermediate supervision to improve self-evolving search agents. It addresses bottlenecks in Search Self-Play by grounding question construction in relational context and by introducing a Waypoint Coverage Reward that gives graded partial credit.
This paper introduces a two-stage, inference-time budget-control method for LLM search agents that uses Value-of-Information scores to allocate tool calls and tokens during multi-hop question answering.
OThink-SRR1 introduces an iterative Search-Refine-Reason framework trained with GRPO-IR reinforcement learning to reduce retrieval noise and token costs while boosting multi-hop QA accuracy.