multi-hop-qa

AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

arXiv cs.CL ↗ · yesterday Cached

AGORA is a new benchmark for evaluating large language models on archive-grounded reasoning tasks across workplace documents, comprising 362 questions over 9,664 real documents. The strongest model achieves only 59.4% accuracy, highlighting substantial room for improvement.

@teach_fireworks: https://x.com/teach_fireworks/status/2067243590447952212

X AI KOLs Timeline ↗ · 2026-06-17 Cached

SAG (SQL-Augmented Generation) is a SQL-based retrieval-augmented generation method that converts data chunks into events and entities, enabling multi-hop reasoning via SQL join queries. On the MuSiQue dataset, recall increased from 65.13% to 80.04%. It supports online retrieval over approximately 500 million data entries with latency on the order of seconds, and it has been open-sourced.

Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search

arXiv cs.AI ↗ · 2026-06-17 Cached

This paper identifies an anchor collapse phenomenon in agentic search where parallel trajectories converge due to similar initial queries, and proposes DivInit, a training-free method that samples diverse initial queries to improve multi-hop question answering performance.

Context Compression Is Not One Thing: Readable Symbolic Re-expression vs. Coherent Summary at Matched Budget

arXiv cs.CL ↗ · 2026-06-16 Cached

This paper proposes Telegraph English, a readable symbolic format for context compression that outperforms matched-budget baselines on multi-hop QA datasets, preserving entity content more densely.

Visual Graph Scaffolds for Structural Reasoning in Large Language Models

arXiv cs.AI ↗ · 2026-06-03 Cached

This paper explores using visual graph mind maps as reasoning scaffolds for LLMs. It finds that visual guidance remains effective even without direct answer hints, whereas flattening the graphs into text loses the benefit.

ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents

arXiv cs.CL ↗ · 2026-06-03 Cached

ARBOR introduces a reusable rubric buffer to provide online process rewards for LLM-based search agents, improving training efficiency when outcome-only rewards are insufficient. It outperforms GRPO and DAPO on multi-hop QA benchmarks, converting up to 42% of zero-gradient training groups into informative ones.

StepGap: A Hybrid NLI-LLM Checker for Step-Level Evidence-Gap Detection in Multi-Hop Question Answering

arXiv cs.CL ↗ · 2026-05-26 Cached

StepGap is a hybrid NLI-LLM decision tree that detects step-level evidence gaps in multi-hop QA, labeling them as Contradicted Claim, Irrelevant Evidence, or Missing Bridge. It achieves competitive F1 while providing a decomposable structure that improves downstream QA performance when used as a process reward for reinforcement learning.

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

arXiv cs.CL ↗ · 2026-05-20 Cached

This paper introduces Stepwise Confidence Attribution (SCA), a framework that assigns step-level confidence to reasoning traces from black-box LLMs without internal access, using the Information Bottleneck principle to distinguish legitimate variability from errors. Experiments show that SCA reliably identifies low-confidence steps and improves self-correction success rates by up to 13.5% over answer-level feedback.

Knowledge-Graph Paths as Intermediate Supervision for Self-Evolving Search Agents

arXiv cs.AI ↗ · 2026-05-08 Cached

This paper introduces a method using knowledge-graph paths as intermediate supervision to improve self-evolving search agents. It addresses bottlenecks in Search Self-Play by grounding question construction in relational context and introducing a Waypoint Coverage Reward for graded partial credit.

Inference-Time Budget Control for LLM Search Agents

arXiv cs.AI ↗ · 2026-05-08 Cached

This paper introduces a two-stage inference-time budget control method for LLM search agents, using Value-of-Information scores to optimize tool-call and token allocation during multi-hop question answering.

OThink-SRR1: Search, Refine and Reasoning with Reinforced Learning for Large Language Models

arXiv cs.CL ↗ · 2026-04-23 Cached

OThink-SRR1 introduces an iterative Search-Refine-Reason framework trained with GRPO-IR reinforcement learning to reduce retrieval noise and token costs while boosting multi-hop QA accuracy.
