AGORA is a new benchmark for evaluating large language models on archive-grounded reasoning tasks across workplace documents, comprising 362 questions over 9,664 real documents. The strongest model achieves only 59.4% accuracy, highlighting substantial room for improvement.
SAG (SQL-Augmented Generation) is a SQL-based retrieval-augmented generation method that converts data chunks into events and entities, enabling multi-hop reasoning via SQL join queries. On the MuSiQue dataset, it raises recall from 65.13% to 80.04%. It supports online retrieval over approximately 500 million data entries with latency on the order of seconds, and it has been open-sourced.
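The idea of answering a multi-hop question with a SQL join over events and entities can be illustrated with a small sketch. This is not SAG's actual schema or query: the table names (`events`, `entities`, `event_entities`), the sample rows, and the helper `two_hop_events` are all hypothetical, chosen only to show how a join through a shared entity reaches a second event.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical event/entity schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events (id INTEGER PRIMARY KEY, text TEXT);
        CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE event_entities (event_id INTEGER, entity_id INTEGER);
        """
    )
    # Toy data (invented for illustration only).
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [
            (1, "Alice founded Acme"),
            (2, "Acme is headquartered in Oslo"),
            (3, "Bob lives in Paris"),
        ],
    )
    conn.executemany(
        "INSERT INTO entities VALUES (?, ?)",
        [(1, "Alice"), (2, "Acme"), (3, "Oslo"), (4, "Bob"), (5, "Paris")],
    )
    conn.executemany(
        "INSERT INTO event_entities VALUES (?, ?)",
        [(1, 1), (1, 2), (2, 2), (2, 3), (3, 4), (3, 5)],
    )
    return conn


def two_hop_events(conn, seed_entity):
    """Return events reachable from seed_entity through one shared entity."""
    sql = """
        SELECT DISTINCT e2.id, e2.text
        FROM entities AS s
        JOIN event_entities AS l1 ON l1.entity_id = s.id          -- events mentioning the seed
        JOIN event_entities AS bridge ON bridge.event_id = l1.event_id  -- their other entities
        JOIN event_entities AS l2 ON l2.entity_id = bridge.entity_id    -- events sharing those
        JOIN events AS e2 ON e2.id = l2.event_id
        WHERE s.name = ?
        ORDER BY e2.id
    """
    return conn.execute(sql, (seed_entity,)).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for row in two_hop_events(db, "Alice"):
        print(row)
```

Starting from "Alice", the join passes through the shared entity "Acme" and so also retrieves the event about Oslo, which never mentions Alice directly.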
This paper identifies an anchor collapse phenomenon in agentic search where parallel trajectories converge due to similar initial queries, and proposes DivInit, a training-free method that samples diverse initial queries to improve multi-hop question answering performance.
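As a rough illustration of what diversifying initial queries could look like, the sketch below picks a spread-out subset of candidate queries with greedy farthest-point selection over Jaccard token distance. This is a stand-in technique, not DivInit's actual sampling procedure, which the summary does not describe; the function names and example queries are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def pick_diverse(candidates, k):
    """Greedily choose up to k queries, each maximizing its minimum
    distance to the queries already chosen (farthest-point heuristic)."""
    if k <= 0 or not candidates:
        return []
    chosen = [candidates[0]]  # assumption: seed with the first candidate
    remaining = list(candidates[1:])
    while remaining and len(chosen) < k:
        best = max(
            remaining,
            key=lambda q: min(jaccard_distance(q, c) for c in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    queries = [
        "who directed Inception",
        "who directed Inception film",
        "Inception release year",
        "Christopher Nolan birthplace",
    ]
    print(pick_diverse(queries, 3))
```

Each parallel trajectory would then start from a different selected query, so near-duplicate openings such as the first two candidates are not both used.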
This paper proposes Telegraph English, a readable symbolic format for context compression that outperforms matched-budget baselines on multi-hop QA datasets, preserving entity content more densely.
This paper explores using visual graph mind maps as reasoning scaffolds for LLMs, finding that visual guidance remains effective even without direct answer hints, whereas flattening the graphs into text loses the benefit.
ARBOR introduces a reusable rubric buffer to provide online process rewards for LLM-based search agents, improving training efficiency when outcome-only rewards are insufficient. It outperforms GRPO and DAPO on multi-hop QA benchmarks, converting up to 42% of zero-gradient training groups into informative ones.
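The "zero-gradient group" issue can be made concrete with a small sketch of GRPO-style group-normalized advantages: when every rollout in a group receives the same outcome reward, every advantage is zero and the group contributes no learning signal. The process scores and the `weight` parameter below are hypothetical placeholders, not ARBOR's rubric rewards, and the population standard deviation is an assumption about the normalization.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def add_process_rewards(outcome, process, weight=0.5):
    """Blend outcome rewards with per-rollout process scores.
    `process` and `weight` are illustrative, not taken from the paper."""
    return [o + weight * p for o, p in zip(outcome, process)]


if __name__ == "__main__":
    outcome = [0.0, 0.0, 0.0, 0.0]  # every rollout answered incorrectly
    print(group_advantages(outcome))  # all advantages are zero

    process = [0.2, 0.5, 0.0, 0.3]  # hypothetical rubric scores
    print(group_advantages(add_process_rewards(outcome, process)))
```

With outcome-only rewards the group is uninformative; once the rollouts are distinguished by a process-level score, the advantages differ and the group can drive an update.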
StepGap is a hybrid NLI-LLM decision tree that detects step-level evidence gaps in multi-hop QA, labeling them as Contradicted Claim, Irrelevant Evidence, or Missing Bridge. It achieves competitive F1 while providing a decomposable structure that improves downstream QA performance when used as a process reward for reinforcement learning.
This paper introduces Stepwise Confidence Attribution (SCA), a framework for assigning step-level confidence to reasoning traces from black-box LLMs without internal access, using the Information Bottleneck principle to distinguish legitimate variability from errors. Experiments show that SCA reliably identifies low-confidence steps and improves self-correction success rates by up to 13.5% over answer-level feedback.
This paper introduces a method using knowledge-graph paths as intermediate supervision to improve self-evolving search agents. It addresses bottlenecks in Search Self-Play by grounding question construction in relational context and introducing a Waypoint Coverage Reward for graded partial credit.
This paper introduces a two-stage inference-time budget control method for LLM search agents, using Value-of-Information scores to optimize tool-call and token allocation during multi-hop question answering.
OThink-SRR1 introduces an iterative Search-Refine-Reason framework trained with GRPO-IR reinforcement learning to reduce retrieval noise and token costs while boosting multi-hop QA accuracy.