LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

arXiv cs.CL 06/12/26, 04:00 AM Papers
benchmark search-agents long-horizon knowledge-graph evaluation ai-agents difficulty-ceiling
Summary
LoHoSearch is a new benchmark for evaluating long-horizon search agents, built from a knowledge graph of 7 million Wikipedia entities. It introduces questions with large search spaces and structural complexity to exceed human-authored difficulty ceilings, and shows that the best model achieves only 34.74% accuracy.
arXiv:2606.12837v1 Announce Type: new Abstract: Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.
# LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling
Source: [https://arxiv.org/html/2606.12837](https://arxiv.org/html/2606.12837)
Jiarui Zhao\*†, Rongzhi Zhang\*, Lingchuan Liu†, Hao Yang, Xunliang Cai, Xi Su
Meituan
{zhaojiarui02, liulingchuan}@meituan.com

###### Abstract

Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.


\*These authors contributed equally to this work. †Corresponding authors.

## 1 Introduction

Since April 2025, challenging yet easily verifiable benchmarks, exemplified by BrowseComp (Wei et al., [2025](https://arxiv.org/html/2606.12837#bib.bib3)), have rapidly become the de facto standard for measuring search agent capabilities. Yet, as Figure [1](https://arxiv.org/html/2606.12837#S1.F1) illustrates, model performance on BrowseComp has soared from 30% to over 90% in barely ten months, and the benchmark is quickly losing its discriminating power (Anthropic, [2026a](https://arxiv.org/html/2606.12837#bib.bib23)). The root cause is that these benchmarks are predominantly human-authored: annotators tend to choose entities and relations they are familiar with, which typically have high popularity and direct connections, causing most questions to be answerable within only a few retrieval steps. This forms a difficulty ceiling that is hard to raise further, and as model capabilities continue to advance, this trend will only intensify.

![Refer to caption](https://arxiv.org/html/2606.12837v1/latex/fig/browsecomp_progression.png)

Figure 1: BrowseComp accuracy progression from August 2025 to May 2026 across major model families.

The difficulty of search problems is determined by two core factors: (1) the search space size per constraint, i.e., the number of candidate entities satisfying a single condition. A larger search space forces the agent to verify and eliminate more candidates. (2) The structural complexity, i.e., the number of constraints that must be jointly satisfied to uniquely identify the answer. Higher structural complexity means more constraints must be checked to rule out each candidate, substantially raising the overall solving difficulty. While reasoning depth (the number of knowledge hops required to reach the final answer) also contributes to difficulty, it is the easiest to control and has been well-addressed by existing benchmarks (Trivedi et al., [2022](https://arxiv.org/html/2606.12837#bib.bib4); Krishna et al., [2025](https://arxiv.org/html/2606.12837#bib.bib15)). When both search space size and structural complexity are large, the agentic search process becomes substantially longer, placing higher demands on both reasoning and context management. However, human annotators lack a global perspective on entity statistics and cannot systematically maximize difficulty along both dimensions.

To address this, we introduce LoHoSearch, a benchmark constructed through an automated pipeline grounded in a knowledge graph. Starting from Wikipedia, we build a large-scale knowledge graph spanning over 7 million entities, select relations with genuinely large search spaces under a global view, and assemble them into structurally complex subgraphs whose answers are verified to be unique within the KG. Each subgraph is subsequently verbalized into a natural-language question by a language model and undergoes multiple rounds of automated verification and human review to ensure correctness and answer uniqueness. The resulting benchmark comprises 544 human-verified questions across 11 topical domains. Our main contributions are as follows:

- We propose a knowledge-graph-based automated QA construction pipeline that systematically controls search space size and structural complexity, breaking through the difficulty ceiling of human authoring.
- We introduce the LoHoSearch benchmark. Even the strongest model achieves only 34.7%, with correct trajectories requiring 1.7× more tool calls than on BrowseComp, establishing a more discriminative evaluation standard for search agents.
- Our benchmark reveals the limitations of existing context management strategies in high-difficulty scenarios. The best strategy yields only a 6.8% improvement, far below gains on existing benchmarks, demonstrating that LoHoSearch serves as a more challenging testbed for future research.

## 2 Data Synthesis

![Refer to caption](https://arxiv.org/html/2606.12837v1/latex/fig/main_pipeline.png)

Figure 2: Overview of the LoHoSearch pipeline.

Our pipeline proceeds through four stages (illustrated in Figure [2](https://arxiv.org/html/2606.12837#S2.F2)): knowledge graph construction, subgraph sampling, QA generation and verification, and post filtering with human review.

### 2.1 Knowledge Graph Construction

We construct a knowledge graph from the full English Wikipedia dump: each page corresponds to an entity (node), and hyperlinks within the page body pointing to other Wikipedia pages serve as directed edges. We define each entity's type as its Wikidata (Vrandečić and Krötzsch, [2014](https://arxiv.org/html/2606.12837#bib.bib32)) P31 (instance\_of) class and its popularity as its in-degree, both for use in subsequent stages. The resulting knowledge graph contains approximately 7.62 million entities and 265 million directed edges.
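The construction above can be illustrated on a toy scale. The following is a minimal sketch, not the authors' implementation: `build_link_graph` and the page titles are hypothetical, and the handling of self-links, repeated links, and links to missing pages are assumptions that the text does not specify.

```python
from collections import defaultdict


def build_link_graph(pages):
    """Build a directed link graph from {page title: [linked titles]}.

    Returns (edges, in_degree): edges maps each entity to the set of
    entities it links to; in_degree is the popularity proxy.
    """
    edges = {title: set() for title in pages}
    in_degree = defaultdict(int)
    for source, links in pages.items():
        for target in links:
            # Assumption: skip self-links and links to pages outside the dump.
            if target == source or target not in pages:
                continue
            # Assumption: repeated links between the same pair count once.
            if target not in edges[source]:
                edges[source].add(target)
                in_degree[target] += 1
    return edges, dict(in_degree)


# Toy stand-in for a Wikipedia dump (hypothetical page titles).
toy_pages = {
    "Film A": ["Director X", "Studio Y", "Director X"],
    "Film B": ["Director X"],
    "Director X": ["Studio Y"],
    "Studio Y": [],
}
edges, popularity = build_link_graph(toy_pages)
```

In this toy graph, pages that nothing links to are simply absent from `popularity`, which corresponds to an in-degree of zero.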

### 2.2 Subgraph Sampling

We employ two complementary subgraph structures: tree\-structured and graph\-structured\. The difficulty of the tree structure mainly arises from the size of the search space, while the graph structure further increases structural complexity through cyclic dependencies and cross\-constraints among entities\. Both structures require all constituent entities to have low popularity and moderate page lengths, ensuring that entities cannot be easily inferred\. Additionally, we balance the type distribution of answer entities during sampling to ensure diversity across topical domains\.

We first define the search space of a relation. Given a directed edge from entity $A$ to entity $B$, its search space is defined as:

$$
\mathcal{S}(A \rightarrow B) = \{\, e \mid \text{type}(e) = \text{type}(A),\ (e \rightarrow B) \in \mathcal{G} \,\} \tag{1}
$$

That is, in the knowledge graph $\mathcal{G}$, the set of all entities that share the same type as $A$ and also have a directed edge to $B$. A larger search space means more candidate entities satisfy the given relational constraint, making it harder for the agent to identify $A$ via this relation.
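The definition of the search space translates directly into a set comprehension. The sketch below assumes the graph is stored as a dictionary of outgoing-edge sets plus a type dictionary; the function name `search_space` and the toy entities are illustrative, not taken from the paper.

```python
def search_space(edges, types, a, b):
    """Return S(a -> b): all entities of the same type as a that link to b."""
    if b not in edges.get(a, set()):
        raise ValueError(f"no edge {a!r} -> {b!r} in the graph")
    return {
        e for e, targets in edges.items()
        if types[e] == types[a] and b in targets
    }


# Toy graph (hypothetical entities): three pages link to "Director X",
# but only two of them share the type of "Film A".
toy_types = {
    "Film A": "film", "Film B": "film", "Film D": "film",
    "Album C": "album", "Director X": "human", "Director Y": "human",
}
toy_edges = {
    "Film A": {"Director X"},
    "Film B": {"Director X"},
    "Album C": {"Director X"},
    "Film D": {"Director Y"},
    "Director X": set(),
    "Director Y": set(),
}
space = search_space(toy_edges, toy_types, "Film A", "Director X")
```

Note that $A$ always belongs to its own search space, so a size of 1 means the relation alone pins down $A$.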

#### 2.2.1 Tree-Structured Subgraph Sampling

The tree structure uses a low\-popularity entity as the root node \(i\.e\., the answer\), which connects to multiple intermediate entities, each of which further connects to several leaf nodes\. Sampling proceeds layer by layer:

##### First\-layer expansion\.

From the root's relations, we select $N$ edges pointing to intermediate entities, subject to:

- The search space size of each relation satisfies $\lvert\mathcal{S}\rvert > \tau$;
- The intersection of candidate sets for any $N-1$ relations has size $> 1$. That is, the answer is no longer uniquely determined if we remove any single relation, ensuring every relation is necessary;
- The intersection of candidate sets across all $N$ relations equals exactly $\{\text{root}\}$, guaranteeing KG-level uniqueness of the answer.
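The three first-layer conditions can be checked with plain set operations. This is a minimal sketch under the same dictionary-based graph representation as assumed elsewhere; `candidate_set`, `valid_first_layer`, and the toy entities are hypothetical names introduced for illustration.

```python
def candidate_set(edges, types, root, target):
    """Entities with the root's type that have an edge to target."""
    return {
        e for e, targets in edges.items()
        if types[e] == types[root] and target in targets
    }


def valid_first_layer(edges, types, root, intermediates, tau=3):
    """Check the three first-layer conditions for a chosen set of edges."""
    sets = [candidate_set(edges, types, root, b) for b in intermediates]
    if not sets:
        return False
    # Condition 1: every relation has a search space larger than tau.
    if any(len(s) <= tau for s in sets):
        return False
    # Condition 2: dropping any single relation leaves more than one candidate.
    for skip in range(len(sets)):
        rest = [s for i, s in enumerate(sets) if i != skip]
        if rest and len(set.intersection(*rest)) <= 1:
            return False
    # Condition 3: all relations together identify exactly the root.
    return set.intersection(*sets) == {root}


# Toy graph: root "r" plus distractors of the same type.
layer_types = {name: "film" for name in ["r", "x1", "x2", "x3", "f1", "f2", "f3"]}
layer_types.update({"b1": "human", "b2": "human", "b3": "human"})
layer_edges = {
    "r": {"b1", "b2", "b3"},
    "x1": {"b1", "b2"},
    "x2": {"b1", "b3"},
    "x3": {"b2", "b3"},
    "f1": {"b1"},
    "f2": {"b2"},
    "f3": {"b3"},
}
ok = valid_first_layer(layer_edges, layer_types, "r", ["b1", "b2", "b3"])
too_few = valid_first_layer(layer_edges, layer_types, "r", ["b1", "b2"])
```

With all three relations the toy root is uniquely determined, while any two relations still leave a distractor, which is exactly the necessity property the second condition asks for.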

##### Second\-layer expansion\.

For each intermediate entity, we select 1 to $M$ edges pointing to leaf nodes, subject to:

- The search space size of each relation satisfies $\lvert\mathcal{S}\rvert > \tau$;
- The intersection of candidate sets across the $M$ relations has size $> 1$, ensuring the intermediate entity itself cannot be directly inferred;
- We refer to candidates in this intersection (other than the current intermediate entity) as pseudo-candidates. We further require that no pseudo-candidate, when combined with the remaining intermediate entities, uniquely determines the answer; this preserves answer uniqueness.

In practice, we set $N=3$, $M=2$, and $\tau=3$.

#### 2.2.2 Graph-Structured Subgraph Sampling

Unlike the hierarchical expansion of tree structures, graph\-structured subgraphs contain extensive cross\-edges among entities and may form cycles, making the problem constraints non\-decomposable into independent sub\-problems\. During sampling, we first select a low\-popularity entity as the seed \(i\.e\., the answer\), then greedily expand until the subgraph reaches a maximum of 10 entities: at each step, we prioritize the candidate with the most edges to the current subgraph and whose corresponding edges have the largest search space\. The subgraph must satisfy node type diversity, sufficient edge count, and connectivity\.

After construction, uniqueness is verified via exhaustive backtracking search: we search for another set of entities in the full graph that satisfies the same entity types and directed adjacency relations; if none exists, uniqueness is confirmed\. Additionally, we require the seed entity to have a sufficient number of confounding candidates of the same type that are connected to entities of all neighbor types in the subgraph, preventing brute\-force solving via type enumeration\.
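The uniqueness check can be sketched as a small backtracking matcher over typed pattern nodes. This is an illustrative sketch, not the authors' code: `find_matches` and `answer_is_unique` are hypothetical names, and treating "another set of entities" as any match that places a different entity in the seed position is an assumption.

```python
def find_matches(edges, types, pattern_types, pattern_edges):
    """Yield every injective assignment of pattern nodes to graph entities
    that preserves node types and all directed pattern edges."""
    order = list(pattern_types)
    by_type = {}
    for entity, t in types.items():
        by_type.setdefault(t, []).append(entity)

    def consistent(assign, node, entity):
        # Check every pattern edge between node and already-assigned nodes.
        for u, v in pattern_edges:
            if u == node and v in assign and assign[v] not in edges.get(entity, set()):
                return False
            if v == node and u in assign and entity not in edges.get(assign[u], set()):
                return False
        return True

    def backtrack(i, assign):
        if i == len(order):
            yield dict(assign)
            return
        node = order[i]
        for entity in by_type.get(pattern_types[node], []):
            if entity in assign.values() or not consistent(assign, node, entity):
                continue
            assign[node] = entity
            yield from backtrack(i + 1, assign)
            del assign[node]

    yield from backtrack(0, {})


def answer_is_unique(edges, types, pattern_types, pattern_edges, seed, answer):
    """True if at least one match exists and every match maps seed to answer."""
    matches = list(find_matches(edges, types, pattern_types, pattern_edges))
    return bool(matches) and all(m[seed] == answer for m in matches)


# Toy graph (hypothetical entities): two songs, two artists, one label.
kg_types = {"s1": "song", "s2": "song", "a1": "artist", "a2": "artist", "l1": "label"}
kg_edges = {"s1": {"a1", "l1"}, "s2": {"a2"}, "a1": {"l1"}, "a2": {"l1"}, "l1": set()}
pattern_types = {"S": "song", "A": "artist", "L": "label"}
triangle = [("S", "A"), ("A", "L"), ("S", "L")]  # with a cross-edge
chain = [("S", "A"), ("A", "L")]                 # without it
```

In the toy graph, the cross-edge from the song to the label is what rules out the second song; the plain chain matches both songs and therefore fails the check.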

### 2.3 QA Generation and Verification

This stage converts each sampled subgraph into a natural\-language question\. Specifically, for each edge in the subgraph, a language model extracts an obfuscated description of the relation from the source entity’s Wikipedia page; for leaf nodes in tree structures, 1–2 additional property descriptions are extracted\.

After extraction, we apply search\-based verification to ensure sufficient obfuscation: each relation must be neither directly locatable via search engines nor inferable by an LLM\. To prevent multiple relations from becoming easy to infer when combined, we perform joint verification on all relations of the same entity\.

Once verification passes, we assemble all relations and properties into a structured description with entity names hidden, and have an LLM convert it into a natural\-language question\. The generated question undergoes two rounds of automated validation:

- Subgraph coverage check: verifies that the question faithfully covers all relations and properties from the input subgraph, with no omissions, additions, or distortions.
- Answer satisfaction check: a search agent verifies that the ground-truth answer indeed satisfies all conditions stated in the question.

All LLM-based steps in this stage are performed using DeepSeek-V3.2 (DeepSeek-AI et al., [2025](https://arxiv.org/html/2606.12837#bib.bib33)).

### 2.4 Post Filtering and Human Review

After the QA generation stage, we subject all questions to multiple rounds of filtering:

##### Uniqueness verification\.

Although the subgraph sampling stage enforces structural uniqueness, the transformation from subgraph to natural language may introduce ambiguity\. We deploy multiple search agents of different capability levels to independently attempt each question, collect candidate answers, and automatically judge whether any candidate satisfies all conditions\. Questions for which an alternative valid answer is found are filtered out\.

##### Difficulty filtering\.

To calibrate the difficulty of the benchmark, we have a DeepSeek\-V3\.2\-powered search agent attempt each question multiple times independently\. Questions that are answered correctly across multiple trials are filtered out, retaining only those that pose genuine search difficulty\.

##### Human review\.

After all automated filtering, the remaining questions are submitted to professional annotators for manual verification\. Annotators evaluate each question along multiple dimensions such as answer correctness, answer uniqueness, logical coherence across conditions, language fluency, and information redundancy, ensuring that the final questions are both rigorous and natural\.

### 2.5 Data Statistics

Table [1](https://arxiv.org/html/2606.12837#S2.T1) summarizes the key statistics of LoHoSearch. The dataset comprises 544 human-verified questions. Graph-structured subgraphs are notably denser than tree-structured ones, with more nodes and nearly twice as many edges, reflecting their higher structural complexity. As shown in Figure [3](https://arxiv.org/html/2606.12837#S2.F3), the questions span 11 topical domains, including Music, Geography & Places, Film & Television, and Sports, ensuring broad coverage across knowledge areas.

Table 1: Dataset statistics of LoHoSearch.

![Refer to caption](https://arxiv.org/html/2606.12837v1/latex/fig/memosearch_domain_distribution.png)

Figure 3: Domain distribution of LoHoSearch. The dataset consists of 544 samples spanning 11 categories.

Table 2: Performance (%) on LoHoSearch. Score reports the average correct ratio on the 544-sample dataset. Calibration Error measures the confidence calibration of each model. Best in bold. † indicates models that encountered service instability or safety refusals during evaluation.

Regarding quality assurance, 75.5% of all automatically constructed questions passed human review directly, 22.3% were accepted after minor wording adjustments by annotators (e.g., correcting unnatural phrasing or removing redundant qualifiers), and only 2.2% were discarded due to critical issues such as logical inconsistencies, demonstrating the high generation quality of the automated pipeline. In terms of answer uniqueness, 70.8% of questions are confirmed by annotators to have a definitively unique answer. For the remaining 29.2%, annotators were unable to conclusively rule out alternatives, yet found no substitute candidates after thorough search. We note that as question difficulty increases, verifying uniqueness itself becomes extremely challenging even for human annotators, further attesting to the inherent complexity of LoHoSearch questions.

## 3 Experiments

### 3.1 Experimental Settings

From each major model family, we select the model with the best BrowseComp result as an evaluation target; the vast majority are the latest releases within their respective families. The full list of evaluated models is given in Table [2](https://arxiv.org/html/2606.12837#S2.T2).

Each model is equipped with two tools: (1) search, which performs keyword queries using traditional search engines such as Google; and (2) browse, which retrieves detailed content from one or more specified web pages given their URLs. We adopt the same system prompt as BrowseComp to instruct all models. For models that support both thinking and non-thinking modes or adjustable thinking effort, we adopt their official default settings. To ensure fair comparison across models with potentially different optimal temperature configurations, we uniformly set the temperature to 1.0, which also encourages diverse search strategies. The context window is uniformly set to 200K tokens, with 184K allocated for input and 16K for output.

##### Evaluation\.

We adopt an LLM-based automated evaluation approach. Specifically, we compute accuracy once using the BrowseComp grading prompt with GPT-4.1 (OpenAI, [2025](https://arxiv.org/html/2606.12837#bib.bib37)) as the judge, and a second time using the SimpleQA (Wei et al., [2024](https://arxiv.org/html/2606.12837#bib.bib36)) grading prompt with Qwen2.5-32B (Yang et al., [2025](https://arxiv.org/html/2606.12837#bib.bib38)) as the judge. The final score is the average of the two accuracy values. We find that a single grading combination may be either overly strict or overly lenient, and averaging across two complementary combinations best approximates the true performance of each model.
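The scoring rule amounts to averaging two per-judge accuracies. A minimal sketch follows, assuming each judge returns one boolean verdict per question; the function names and the toy verdicts are illustrative and not drawn from the paper's data.

```python
def accuracy(verdicts):
    """Percentage of questions a judge marked as correct."""
    return 100.0 * sum(verdicts) / len(verdicts)


def final_score(verdicts_judge_a, verdicts_judge_b):
    """Average of the two judges' accuracies, as described in the text."""
    return (accuracy(verdicts_judge_a) + accuracy(verdicts_judge_b)) / 2


# Toy verdicts for four questions (made-up values): the second judge is stricter.
judge_a = [True, False, True, False]
judge_b = [True, False, False, False]
score = final_score(judge_a, judge_b)
```

Because the average is taken over accuracies and not over per-question verdicts, a question on which the judges disagree contributes half a point in effect.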

### 3.2 Main Results

Table 3: Ablation study on various context management strategies based on DeepSeek-V4-Flash. Best in bold.

As shown in Table [2](https://arxiv.org/html/2606.12837#S2.T2), GPT-5.5 achieves the highest LoHoSearch Score of 34.74%, substantially outperforming all other models. DeepSeek-V4-Pro, Claude-Opus-4.6, and Kimi-K2.6 demonstrate comparable performance with scores ranging from 15.53% to 15.99%, while the remaining models all score below 14%. Overall, no model exceeds 35% accuracy, confirming that LoHoSearch poses a substantial challenge to current state-of-the-art search agents. During evaluation, we observed that Kimi-K2.6 and DeepSeek-V4-Pro encountered safety refusals and service instability issues, which may have affected their scores; results impacted by these issues are denoted with † in Table [2](https://arxiv.org/html/2606.12837#S2.T2).

Regarding confidence calibration, we adopt the calibration error metric following the methodology of BrowseComp-ZH (Zhou et al., [2025](https://arxiv.org/html/2606.12837#bib.bib5)). We note that some models did not strictly follow the system-prompt format (e.g., omitting the Confidence field), introducing noise into the calibration results. Despite this, models evaluated on our benchmark exhibit consistently high calibration errors, indicating that their stated confidence is poorly aligned with actual correctness on these challenging queries. This further corroborates the difficulty of LoHoSearch.

### 3.3 Context Management

To investigate how existing context management strategies perform under the extended reasoning demands of LoHoSearch, we conduct experiments on DeepSeek-V4-Flash with a standard ReAct framework (Yao et al., [2022](https://arxiv.org/html/2606.12837#bib.bib18)) as the baseline. We then evaluate two context management strategies triggered when token usage exceeds 80% of the context window: (1) Summary (Wu et al., [2025](https://arxiv.org/html/2606.12837#bib.bib19)), which compresses the current trajectory and re-initiates the search; and (2) Discard-all (DeepSeek-AI et al., [2025](https://arxiv.org/html/2606.12837#bib.bib33)), which discards all previous tool calls and restarts with only the original query. Additionally, we incorporate a Verify module that leverages the query and the most recent reasoning steps to judge whether all query conditions are satisfied before committing to a final answer.
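The trigger and the two strategies can be sketched as a single function over a message history. This is an illustrative sketch only: `manage_context`, the message representation, and the `summarize` callback are hypothetical; measuring the 80% threshold against the full 200K window and keeping the original query under Summary are assumptions.

```python
def manage_context(messages, used_tokens, strategy,
                   window=200_000, threshold=0.8, summarize=None):
    """Return the message list to continue with.

    messages[0] is assumed to be the original query; the rest is the
    trajectory (reasoning steps, tool calls, observations).
    """
    if used_tokens <= threshold * window:
        return list(messages)  # below the trigger: keep everything
    query = messages[0]
    if strategy == "discard_all":
        # Drop the whole trajectory and restart from the query alone.
        return [query]
    if strategy == "summary":
        if summarize is None:
            raise ValueError("the summary strategy needs a summarize callback")
        # Replace the trajectory with a compressed version of it.
        return [query, summarize(messages[1:])]
    raise ValueError(f"unknown strategy: {strategy!r}")


# Toy history and a placeholder summarizer (a real one would call an LLM).
history = ["Q", "tool call 1", "observation 1"]
below = manage_context(history, 100_000, "discard_all")
discarded = manage_context(history, 170_000, "discard_all")
summarized = manage_context(
    history, 170_000, "summary",
    summarize=lambda steps: f"summary of {len(steps)} steps",
)
```

The contrast between the two branches is what the ablation compares: Summary carries compressed state forward, while Discard-all keeps nothing but the query.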

As shown in Table [3](https://arxiv.org/html/2606.12837#S3.T3), the best-performing combination (Discard-all + Verify) achieves 16.82%, an absolute gain of only 6.8% over the baseline. Notably, the same strategies yield a 14.03% gain on BrowseComp, yet deliver only marginal improvements on LoHoSearch. This discrepancy suggests that the large search space and high structural complexity of our benchmark demand substantially longer reasoning chains, exposing fundamental limitations in current context management approaches. These results also suggest that our benchmark provides a more rigorous testbed for the development of next-generation context management techniques.

![Refer to caption](https://arxiv.org/html/2606.12837v1/latex/fig/rounds_distribution_comparison.png)

Figure 4: Distribution of the number of tool calls (correct answers) for BrowseComp and LoHoSearch. LoHoSearch requires significantly more tool calls to reach correct answers compared to BrowseComp.

### 3.4 Further Analyses

#### 3.4.1 Difficulty Analysis

We further analyze the difficulty gap between BrowseComp and LoHoSearch using DeepSeek-V4-Flash as the probe model. In terms of accuracy, DeepSeek-V4-Flash achieves 58.84% on BrowseComp but only 10.02% on LoHoSearch (Table [2](https://arxiv.org/html/2606.12837#S2.T2)). Regarding the distribution of tool calls in correct trajectories, as illustrated in Figure [4](https://arxiv.org/html/2606.12837#S3.F4), solving queries in our benchmark requires substantially more tool calls compared to BrowseComp. Specifically, the mean number of tool calls increases from 35 to 61 (a 74% relative increase), while the median rises from 26 to 59, confirming that LoHoSearch demands substantially more multi-step reasoning and retrieval.
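As a quick check of the arithmetic, the relative increase in the mean follows directly from the two reported values and matches the 1.7× factor quoted in the introduction. The corresponding figure for the median is not stated in the text and is computed here from the reported medians only.

$$
\frac{61 - 35}{35} \approx 0.743, \qquad \frac{61}{35} \approx 1.74, \qquad \frac{59 - 26}{26} \approx 1.27
$$

The median therefore grows by roughly 127%, a larger relative shift than the mean.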

Furthermore, the probe model reaches an accuracy of only 8.01% on graph-structured questions, compared to 11.89% on tree-structured ones. This gap is primarily attributable to the cyclic dependencies and cross-constraints in graph structures that prevent problem decomposition, indicating that structural complexity, as a difficulty factor independent of search space size, further increases solving difficulty.

![Refer to caption](https://arxiv.org/html/2606.12837v1/latex/fig/parallel_sampling_voting_v2.png)

Figure 5: Illustration of pass@N and three answer aggregation strategies (majority voting, weighted voting, and best-of-N) under parallel sampling.

#### 3.4.2 Parallel Sampling

To investigate the performance upper bound achievable through repeated sampling on our benchmark, we conduct a parallel sampling analysis following the scoring formulation of BrowseComp\. Specifically, we sample 16 independent responses from DeepSeek\-V4\-Flash and evaluate three answer aggregation strategies: majority voting, weighted voting, and best\-of\-N selection\.

![Refer to caption](https://arxiv.org/html/2606.12837v1/latex/fig/pattern_indegree_combined.png)

Figure 6: Analysis of hidden entity popularity and search space size. (a) In-degree distribution of hidden entities in BrowseComp and LoHoSearch, where higher in-degree indicates greater entity popularity and easier inferability. BrowseComp contains significantly more high-popularity hidden entities than LoHoSearch. (b) Under the same popularity constraint, LoHoSearch exhibits a substantially larger relational search space, demonstrating the necessity of a knowledge graph for systematically constructing high-difficulty questions.

As shown in Figure [5](https://arxiv.org/html/2606.12837#S3.F5), the pass@N metric improves substantially with increasing sample count, rising from 9.3% at $N=1$ to 38.3% at $N=16$, indicating considerable headroom for improvement through repeated sampling. Among the three aggregation strategies, best-of-N, which selects the answer with the highest model confidence score, achieves the strongest performance at 24.6%. This demonstrates that confidence-based selection is more effective than voting-based approaches and yields performance closer to the theoretical pass@N upper bound.
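The three aggregation strategies and the per-question pass@N indicator can be sketched as follows. This is a minimal sketch assuming each sample is an (answer, confidence) pair; using the summed confidence as the vote weight in weighted voting is an assumption, and the toy samples are made up to show that the strategies can disagree.

```python
from collections import Counter, defaultdict


def majority_vote(samples):
    """Most frequent answer among the samples."""
    return Counter(answer for answer, _ in samples).most_common(1)[0][0]


def weighted_vote(samples):
    """Answer with the largest summed confidence (assumed weighting)."""
    totals = defaultdict(float)
    for answer, confidence in samples:
        totals[answer] += confidence
    return max(totals, key=totals.get)


def best_of_n(samples):
    """Answer of the single sample with the highest confidence."""
    return max(samples, key=lambda s: s[1])[0]


def pass_at_n(samples, gold):
    """True if any of the N samples gives the gold answer."""
    return any(answer == gold for answer, _ in samples)


# Six toy samples for one question (made-up answers and confidences).
samples = [("A", 0.1), ("A", 0.1), ("A", 0.1), ("B", 0.5), ("B", 0.5), ("C", 0.9)]
picks = (majority_vote(samples), weighted_vote(samples), best_of_n(samples))
```

Any aggregation strategy can only return an answer that appears among the samples, which is why pass@N acts as the upper bound discussed in the text.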

#### 3.4.3 Necessity of the Knowledge Graph

We analyze the hidden entities in questions from BrowseComp and LoHoSearch, i.e., entities that the agent must infer on its own during the solving process, to demonstrate the necessity of a knowledge graph for constructing truly challenging questions. Specifically, we focus on the hidden entities directly associated with the answer in each question. For LoHoSearch, hidden entities are obtained directly from the subgraph structure data; since human-authored benchmarks inherently lack such structural annotations, we parse hidden entities from search agents' correct trajectories, and for questions that agents failed to answer correctly, we use DeepSeek-V3.2 to reverse-engineer them from the answer and question text. Both extraction methods share the same objective: identifying intermediate entities that require reasoning during the solving process. After obtaining the hidden entities, we further retrieve their associated information in the knowledge graph, including entity popularity and the search space size of relations.

The results are shown in Figure [6](https://arxiv.org/html/2606.12837#S3.F6). First, the popularity of hidden entities in BrowseComp is substantially higher than in LoHoSearch (Figure [6](https://arxiv.org/html/2606.12837#S3.F6)a). Although BrowseComp also aims to construct challenging questions, the lack of a global perspective in human annotation makes it difficult to precisely control entity popularity, resulting in hidden entities that tend to be relatively easy to infer. Second, even under the same popularity constraint, the relational search space size of entities in LoHoSearch is significantly larger than in BrowseComp (Figure [6](https://arxiv.org/html/2606.12837#S3.F6)b), meaning that inferring new entities from a given entity is considerably harder. The above analysis shows that human construction has clear limitations in both entity popularity and search space size, and that a knowledge graph is a necessary foundation for systematically constructing high-difficulty questions.

## 4 Related Work

### 4.1 Search Agent Benchmarks

Multi-hop question answering benchmarks established the foundation for evaluating complex reasoning over documents. HotpotQA (Yang et al., [2018](https://arxiv.org/html/2606.12837#bib.bib1)) contains 113K Wikipedia-based questions covering bridge and comparison reasoning. It was the first dataset to require sentence-level supporting evidence. Building on this, 2WikiMultiHopQA (Ho et al., [2020](https://arxiv.org/html/2606.12837#bib.bib2)) extended coverage to four reasoning types across two Wikipedia sources with chain-level evidence annotations. MuSiQue (Trivedi et al., [2022](https://arxiv.org/html/2606.12837#bib.bib4)) composed single-hop sub-questions bottom-up into 25K questions of 2–4 hops and used unanswerable variants to resist shortcuts. These benchmarks share two constraints: they operate in closed-domain settings that do not reflect open-web agent use, and they rely on manual construction that limits scale.

Tool-augmented language models shifted evaluation toward the open web. GAIA (Mialon et al., [2024](https://arxiv.org/html/2606.12837#bib.bib14)) contains 466 questions at three difficulty levels. FRAMES (Krishna et al., [2025](https://arxiv.org/html/2606.12837#bib.bib15)) provides 824 multi-hop questions for RAG pipelines. These benchmarks bring tool use into scope but still rely on fixed, manually curated question sets. They cannot scale beyond annotation budgets, and they cannot be refreshed once published.

BrowseComp (Wei et al., [2025](https://arxiv.org/html/2606.12837#bib.bib3)) serves as the most directly comparable benchmark for our study. It contains 1266 questions designed to be difficult to retrieve but easy to verify, crafted by domain experts through multiple rounds of verification and adversarial filtering. At the time of release, OpenAI Deep Research scored 51.5%, and human testers 33.3%. Despite its high quality, its reliance on manual construction introduces clear limitations: (i) construction costs grow linearly with scale; (ii) answer uniqueness relies on human verification without formal guarantees; (iii) difficulty calibration depends on expert judgment. BrowseComp-ZH (Zhou et al., [2025](https://arxiv.org/html/2606.12837#bib.bib5)), which extends the benchmark to Chinese with 289 questions, further illustrates the prohibitive cost of cross-lingual expansion under manual construction.

Recent benchmarks further push difficulty beyond BrowseComp: DeepSearchQA (Gupta et al., [2026](https://arxiv.org/html/2606.12837#bib.bib34)) introduces 900 causal-chain tasks requiring exhaustive answer lists, and WideSearch (Wong et al., [2026](https://arxiv.org/html/2606.12837#bib.bib35)) evaluates large-scale information collection where even the best system achieves only a 5% success rate. However, both rely on manual construction and frame difficulty through multi-subtask composition rather than maximizing the intrinsic difficulty of individual questions. In contrast, LoHoSearch systematically controls search space size and structural complexity at the single-question level.

In summary, existing benchmarks face three key limitations: (i) manual construction that cannot scale; (ii) answer uniqueness without formal guarantees; (iii) difficulty calibration dependent on human judgment. LoHoSearch addresses all three through a knowledge-graph-driven automated pipeline.

### 4.2 QA Generation for Search Agents

Synthetic data methods for search agents focus on training stronger agents rather than evaluating them. WebShaper (Tao et al., [2026](https://arxiv.org/html/2606.12837#bib.bib6)) formalizes information seeking into four stages (query formulation, search execution, page browsing, and information synthesis) and repurposes HotpotQA questions to construct agent trajectories. WebSailor (Li et al., [2025](https://arxiv.org/html/2606.12837#bib.bib7)) introduces a multi-round progressive search framework and achieves superhuman performance on BrowseComp. WebSailor-V2 (Li et al., [2026](https://arxiv.org/html/2606.12837#bib.bib8)) combines synthetic data with scalable reinforcement learning under a progressive difficulty curriculum, significantly narrowing the gap between open-source and proprietary agents.

These works focus on leveraging existing QA datasets as training signals to improve agent performance. In contrast, LoHoSearch takes a fundamentally different approach: we build a large-scale knowledge graph and sample structurally complex subgraphs for QA synthesis. Because the difficulty can be systematically escalated by adjusting search space size and structural complexity parameters, the benchmark can keep pace with the rapid improvement of modern agents rather than being left behind by it.

Knowledge\-graph\-based QA generation constitutes a closely related line of research\. KNIGHT \(Amanlou et al\., [2026](https://arxiv.org/html/2606.12837#bib.bib10)\) builds topic\-specific knowledge graphs as compressed, reusable representations and generates multiple\-choice questions at varying difficulty levels through graph traversal\. GraphGen \(Chen et al\., [2025](https://arxiv.org/html/2606.12837#bib.bib11)\) builds fine\-grained knowledge graphs from source text and generates QA pairs at multiple granularities via multi\-hop neighborhood sampling\.

LoHoSearch shares the knowledge\-graph\-driven generation philosophy with these works but differs in two key respects\. First, LoHoSearch uses the graph primarily to guarantee answer validity through subgraph structural uniqueness\. Second, rather than probing static knowledge, LoHoSearch prompts an LLM to extract obfuscated relation descriptions from Wikipedia and applies a search verification step to ensure that no question can be resolved through simple retrieval\. Together, these choices decouple the knowledge graph from question content and answer storage, extending KG\-driven evaluation from closed\-domain knowledge assessment to open\-web browsing agent evaluation\.

## 5 Conclusion

We present LoHoSearch, a benchmark constructed via a knowledge\-graph\-driven pipeline that systematically controls search space size and structural complexity\. The resulting questions surpass the difficulty ceiling of human annotation, and experiments confirm that all evaluated models struggle while current context management strategies yield limited gains\. LoHoSearch provides a discriminative testbed for advancing search agent capabilities\.

## Limitations

LoHoSearch currently covers only English\. Since our pipeline is language\-agnostic by design, we plan to release multilingual variants in future work\. Additionally, the released evaluation set is a static snapshot that may become susceptible to contamination or temporal drift over time\. On the evaluation side, all models are assessed under the same search\-and\-browse toolset with a fixed context window, and answer correctness relies on LLM\-based judges that may introduce noise on ambiguous edge cases\.

Beyond these aspects, two further limitations stem from the construction pipeline itself\. First, answer uniqueness is verified within our knowledge graph; for the 29\.2% of questions that human annotators could not conclusively confirm as uniquely answerable \(§2\.4\), alternative answers outside the KG cannot be ruled out, though no substitute candidate was found after thorough search\. Second, difficulty filtering relies on a single LLM \(DeepSeek\-V3\.2\) as the calibration agent, which may introduce family\-specific bias in which questions survive filtering\. We plan to address these aspects through periodic regeneration, broader tool settings, more diverse calibration agents, and improved judging in subsequent iterations\.

## Ethical considerations

This research is conducted in full accordance with the EMNLP Code of Ethics\. All datasets used in this work were acquired by strictly following their respective licensing agreements and usage guidelines, with no violation of privacy or data protection standards\. We have made deliberate efforts throughout the study to examine and mitigate any potential biases or discriminatory effects that may arise in both the data collection and model development processes\. No personally identifiable information was included in any part of our experiments, thereby ensuring the protection of individual privacy and security\. We are committed to maintaining transparency, reproducibility, and academic integrity across all stages of this research\.

## References

- Amanlou et al\. \(2026\) KNIGHT: knowledge graph\-driven multiple\-choice question generation with adaptive hardness calibration\. In Proceedings of the Conference on Parsing and Linguistic Theories \(CPAL\), External Links: [Link](https://openreview.net/forum?id=8kA9oO5gEc) Cited by: [§4\.2](https://arxiv.org/html/2606.12837#S4.SS2.p3.1)\.
- Anthropic \(2026a\) System card: Claude Mythos Preview\. External Links: [Link](https://www-cdn.anthropic.com/8b8380204f74670be75e81c820ca8dda846ab289.pdf) Cited by: [§1](https://arxiv.org/html/2606.12837#S1.p1.1)\.
- Anthropic \(2026b\) System card: Claude Opus 4\.6\. External Links: [Link](https://www-cdn.anthropic.com/6a5fa276ac68b9aeb0c8b6af5fa36326e0e166dd.pdf) Cited by: [Table 2](https://arxiv.org/html/2606.12837#S2.T2.4.2.2.2.2.2.2.5.2.1)\.
- Anthropic \(2026c\) System card: Claude Opus 4\.7\. External Links: [Link](https://cdn.sanity.io/files/4zrzovbb/website/037f06850df7fbe871e206dad004c3db5fd50340.pdf) Cited by: [Table 2](https://arxiv.org/html/2606.12837#S2.T2.4.2.2.2.2.2.2.8.5.1)\.
- Z\. Chen, W\. Jiang, J\. Li, Z\. Yuan, H\. Kong, W\. Ouyang, and N\. Dong \(2025\) GraphGen: enhancing supervised fine\-tuning for LLMs with knowledge\-driven synthetic data generation\. External Links: 2505\.20416, [Link](https://arxiv.org/abs/2505.20416) Cited by: [§4\.2](https://arxiv.org/html/2606.12837#S4.SS2.p3.1)\.
- G\. DeepMind \(2026\) Model card: Gemini 3\.1 Pro\. External Links: [Link](https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf) Cited by: [Table 2](https://arxiv.org/html/2606.12837#S2.T2.4.2.2.2.2.2.2.6.3.1)\.
- DeepSeek\-AI, A\. Liu, A\. Mei, B\. Lin, B\. Xue, B\. Wang, B\. Xu, B\. Wu, B\. Zhang, C\. Lin, C\. Dong, C\. Lu, C\. Zhao, C\. Deng, C\. Xu, C\. Ruan, D\. Dai, D\. Guo, D\. Yang, D\. Chen, E\. Li, F\. Zhou, F\. Lin, F\. Dai, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Xu, H\. Li, H\. Liang, H\. Wei, H\. Zhang, H\. Luo, H\. Ji, H\. Ding, H\. Tang, H\. Cao, H\. Gao, H\. Qu, H\. Zeng, J\. Huang, J\. Li, J\. Xu, J\. Hu, J\. Chen, J\. Xiang, J\. Yuan, J\. Cheng, J\. Zhu, J\. Ran, J\. Jiang, J\. Qiu, J\. Li, J\. Song, K\. Dong, K\. Gao, K\. Guan, K\. Huang, K\. Zhou, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Wang, L\. Zhao, L\. Yin, L\. Guo, L\. Luo, L\. Ma, L\. Wang, L\. Zhang, M\. S\. Di, M\. Y\. Xu, M\. Zhang, M\. Zhang, M\. Tang, M\. Zhou, P\. Huang, P\. Cong, P\. Wang, Q\. Wang, Q\. Zhu, Q\. Li, Q\. Chen, Q\. Du, R\. Xu, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. Yin, R\. Xu, R\. Shen, R\. Zhang, S\. H\. Liu, S\. Lu, S\. Zhou, S\. Chen, S\. Cai, S\. Chen, S\. Hu, S\. Liu, S\. Hu, S\. Ma, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, S\. Zhou, T\. Ni, T\. Yun, T\. Pei, T\. Ye, T\. Yue, W\. Zeng, W\. Liu, W\. Liang, W\. Pang, W\. Luo, W\. Gao, W\. Zhang, X\. Gao, X\. Wang, X\. Bi, X\. Liu, X\. Wang, X\. Chen, X\. Zhang, X\. Nie, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yu, X\. Li, X\. Yang, X\. Li, X\. Chen, X\. Su, X\. Pan, X\. Lin, X\. Fu, Y\. Q\. Wang, Y\. Zhang, Y\. Xu, Y\. Ma, Y\. Li, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Wang, Y\. Qian, Y\. Yu, Y\. Zhang, Y\. Ding, Y\. Shi, Y\. Xiong, Y\. He, Y\. Zhou, Y\. Zhong, Y\. Piao, Y\. Wang, Y\. Chen, Y\. Tan, Y\. Wei, Y\. Ma, Y\. Liu, Y\. Yang, Y\. Guo, Y\. Wu, Y\. Wu, Y\. Cheng, Y\. Ou, Y\. Xu, Y\. Wang, Y\. Gong, Y\. Wu, Y\. Zou, Y\. Li, Y\. Xiong, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Z\. F\. Wu, Z\. Z\. Ren, Z\. Zhao, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Gou, Z\. Ma, Z\. Yan, Z\. Shao, Z\. Huang, Z\. Wu, Z\. Li, Z\. Zhang, Z\. Xu, Z\. Wang, Z\. Gu, Z\. Zhu, Z\. Li, Z\. Zhang, Z\. Xie, Z\. Gao, Z\. 
Pan, Z\. Yao, B\. Feng, H\. Li, J\. L\. Cai, J\. Ni, L\. Xu, M\. Li, N\. Tian, R\. J\. Chen, R\. L\. Jin, S\. S\. Li, S\. Zhou, T\. Sun, X\. Q\. Li, X\. Jin, X\. Shen, X\. Chen, X\. Song, X\. Zhou, Y\. X\. Zhu, Y\. Huang, Y\. Li, Y\. Zheng, Y\. Zhu, Y\. Ma, Z\. Huang, Z\. Xu, Z\. Zhang, D\. Ji, J\. Liang, J\. Guo, J\. Chen, L\. Xia, M\. Wang, M\. Li, P\. Zhang, R\. Chen, S\. Sun, S\. Wu, S\. Ye, T\. Wang, W\. L\. Xiao, W\. An, X\. Wang, X\. Sun, X\. Wang, Y\. Tang, Y\. Zha, Z\. Zhang, Z\. Ju, Z\. Zhang, and Z\. Qu \(2025\) DeepSeek\-V3\.2: pushing the frontier of open large language models\. External Links: 2512\.02556, [Link](https://arxiv.org/abs/2512.02556) Cited by: [§2\.3](https://arxiv.org/html/2606.12837#S2.SS3.p4.1), [§3\.3](https://arxiv.org/html/2606.12837#S3.SS3.p1.1)\.
- DeepSeek\-AI \(2026\) DeepSeek\-V4: towards highly efficient million\-token context intelligence\. External Links: [Link](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) Cited by: [Table 2](https://arxiv.org/html/2606.12837#S2.T2.3.1.1.1.1.1.1.1.1), [Table 2](https://arxiv.org/html/2606.12837#S2.T2.4.2.2.2.2.2.2.9.6.1)\.
- N\. Gupta, R\. Chatterjee, L\. Haas, C\. Tao, A\. Wang, C\. Liu, H\. Oiwa, E\. Gribovskaya, J\. Ackermann, J\. Blitzer, S\. Goldshtein, and D\. Das \(2026\) DeepSearchQA: bridging the comprehensiveness gap for deep research agents\. External Links: 2601\.20975, [Link](https://arxiv.org/abs/2601.20975) Cited by: [§4\.1](https://arxiv.org/html/2606.12837#S4.SS1.p4.1)\.
- X\. Ho, A\. Duong Nguyen, S\. Sugawara, and A\. Aizawa \(2020\) Constructing a multi\-hop QA dataset for comprehensive evaluation of reasoning steps\. In Proceedings of the 28th International Conference on Computational Linguistics, D\. Scott, N\. Bel, and C\. Zong \(Eds\.\), Barcelona, Spain \(Online\), pp\. 6609–6625\. External Links: [Link](https://aclanthology.org/2020.coling-main.580/), [Document](https://dx.doi.org/10.18653/v1/2020.coling-main.580) Cited by: [§4\.1](https://arxiv.org/html/2606.12837#S4.SS1.p1.1)\.
- S\. Krishna, K\. Krishna, A\. Mohananey, S\. Schwarcz, A\. Stambler, S\. Upadhyay, and M\. Faruqui \(2025\) Fact, fetch, and reason: a unified evaluation of retrieval\-augmented generation\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\), Albuquerque, New Mexico, pp\. 4745–4759\. External Links: [Link](https://aclanthology.org/2025.naacl-long.243/), [Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.243), ISBN 979\-8\-89176\-189\-6 Cited by: [§1](https://arxiv.org/html/2606.12837#S1.p2.1), [§4\.1](https://arxiv.org/html/2606.12837#S4.SS1.p2.1)\.
- K\. Li, Z\. Zhang, H\. Yin, R\. Ye, Y\. Zhao, L\. Zhang, L\. Ou, D\. Zhang, X\. Wu, X\. Yu, J\. Wu, X\. Wang, Z\. Qiao, Z\. Zhang, Y\. Jiang, P\. Xie, F\. Huang, Z\. J\. Xu, S\. Wang, M\. Cheng, and J\. Zhou \(2026\) WebSailor\-V2: bridging the chasm to proprietary agents via synthetic data and scalable reinforcement learning\. In The Fourteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=HuP16O5SJf) Cited by: [§4\.2](https://arxiv.org/html/2606.12837#S4.SS2.p1.1)\.
- K\. Li, Z\. Zhang, H\. Yin, L\. Zhang, L\. Ou, J\. Wu, W\. Yin, B\. Li, Z\. Tao, X\. Wang, W\. Shen, J\. Zhang, D\. Zhang, X\. Wu, Y\. Jiang, M\. Yan, P\. Xie, F\. Huang, and J\. Zhou \(2025\) WebSailor: navigating super\-human reasoning for web agent\. External Links: 2507\.02592, [Link](https://arxiv.org/abs/2507.02592) Cited by: [§4\.2](https://arxiv.org/html/2606.12837#S4.SS2.p1.1)\.
- LongCat\-Team, A\. Gui, B\. Li, B\. Tao, B\. Zhou, B\. Chen, C\. Zhang, C\. Gao, C\. Zhang, C\. Han, et al\. \(2026\) LongCat\-Flash\-Thinking\-2601 technical report\. arXiv preprint arXiv:2601\.16725\. Cited by: [Table 2](https://arxiv.org/html/2606.12837#S2.T2.4.2.2.2.2.2.2.10.7.1)\.
- G\. Mialon, C\. Fourrier, T\. Wolf, Y\. LeCun, and T\. Scialom \(2024\) GAIA: a benchmark for general AI assistants\. In The Twelfth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=fibxvahvs3) Cited by: [§4\.1](https://arxiv.org/html/2606.12837#S4.SS1.p2.1)\.
- MiniMax \(2026a\) MiniMax M2\.5: built for real\-world productivity\. External Links: [Link](https://www.minimaxi.com/news/minimax-m25) Cited by: [Table 2](https://arxiv.org/html/2606.12837#S2.T2.4.2.2.2.2.2.2.12.9.1)\.
- MiniMax \(2026b\) MiniMax M2\.7: early echoes of self\-evolution\. External Links: [Link](https://www.minimax.io/news/minimax-m27-en) Cited by: [Table 2](https://arxiv.org/html/2606.12837#S2.T2.4.2.2.2.2.2.2.11.8.1)\.
- Moonshot\-AI \(2026\) Kimi K2\.6: advancing open\-source coding\. External Links: [Link](https://www.kimi.com/blog/kimi-k2-6) Cited by: [Table 2](https://arxiv.org/html/2606.12837#S2.T2.4.2.2.2.2.2.2.2.1)\.
- OpenAI \(2025\) Introducing GPT‑4\.1 in the API\. External Links: [Link](https://openai.com/index/gpt-4-1/) Cited by: [§3\.1](https://arxiv.org/html/2606.12837#S3.SS1.SSS0.Px1.p1.1)\.
- OpenAI \(2026\) Introducing GPT\-5\.5\. External Links: [Link](https://openai.com/index/introducing-gpt-5-5/) Cited by: [Table 2](https://arxiv.org/html/2606.12837#S2.T2.4.2.2.2.2.2.2.4.1.1)\.
- Z\. Tao, J\. Wu, W\. Yin, P\. Wu, J\. Zhang, B\. Li, H\. SHEN, K\. Li, L\. Zhang, X\. Wang, W\. Zhang, Y\. Jiang, P\. Xie, F\. Huang, and J\. Zhou \(2026\) WebShaper: agentically data synthesizing via information\-seeking formalization\. In The Fourteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=hld4TzJsnD) Cited by: [§4\.2](https://arxiv.org/html/2606.12837#S4.SS2.p1.1)\.
- H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal \(2022\) MuSiQue: multihop questions via single\-hop question composition\. Transactions of the Association for Computational Linguistics 10, pp\. 539–554\. External Links: [Link](https://aclanthology.org/2022.tacl-1.31/), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00475) Cited by: [§1](https://arxiv.org/html/2606.12837#S1.p2.1), [§4\.1](https://arxiv.org/html/2606.12837#S4.SS1.p1.1)\.
- D\. Vrandečić and M\. Krötzsch \(2014\) Wikidata: a free collaborative knowledge base\. Communications of the ACM 57, pp\. 78–85\. External Links: [Link](http://cacm.acm.org/magazines/2014/10/178785-wikidata/fulltext) Cited by: [§2\.1](https://arxiv.org/html/2606.12837#S2.SS1.p1.1)\.
- J\. Wei, N\. Karina, H\. W\. Chung, Y\. J\. Jiao, S\. Papay, A\. Glaese, J\. Schulman, and W\. Fedus \(2024\) Measuring short\-form factuality in large language models\. arXiv preprint arXiv:2411\.04368\. Cited by: [§3\.1](https://arxiv.org/html/2606.12837#S3.SS1.SSS0.Px1.p1.1)\.
- J\. Wei, Z\. Sun, S\. Papay, S\. McKinney, J\. Han, I\. Fulford, H\. W\. Chung, A\. T\. Passos, W\. Fedus, and A\. Glaese \(2025\) BrowseComp: a simple yet challenging benchmark for browsing agents\. arXiv preprint arXiv:2504\.12516\. Cited by: [§1](https://arxiv.org/html/2606.12837#S1.p1.1), [§4\.1](https://arxiv.org/html/2606.12837#S4.SS1.p3.1)\.
- R\. Wong, J\. Wang, J\. Zhao, L\. Chen, Y\. Gao, L\. Zhang, X\. Zhou, Z\. Wang, K\. Xiang, G\. Zhang, W\. Huang, Y\. Wang, and K\. Wang \(2026\) WideSearch: benchmarking agentic broad info\-seeking\. In The Fourteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=Q7YUY7zGkZ) Cited by: [§4\.1](https://arxiv.org/html/2606.12837#S4.SS1.p4.1)\.
- X\. Wu, K\. Li, Y\. Zhao, L\. Zhang, L\. Ou, H\. Yin, Z\. Zhang, X\. Yu, D\. Zhang, Y\. Jiang, et al\. \(2025\) ReSum: unlocking long\-horizon search intelligence via context summarization\. arXiv preprint arXiv:2509\.13313\. Cited by: [§3\.3](https://arxiv.org/html/2606.12837#S3.SS3.p1.1)\.
- Q\. A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\) Qwen2\.5 technical report\. External Links: 2412\.15115, [Link](https://arxiv.org/abs/2412.15115) Cited by: [§3\.1](https://arxiv.org/html/2606.12837#S3.SS1.SSS0.Px1.p1.1)\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\) HotpotQA: a dataset for diverse, explainable multi\-hop question answering\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\), Brussels, Belgium, pp\. 2369–2380\. External Links: [Link](https://aclanthology.org/D18-1259/), [Document](https://dx.doi.org/10.18653/v1/D18-1259) Cited by: [§4\.1](https://arxiv.org/html/2606.12837#S4.SS1.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2022\) ReAct: synergizing reasoning and acting in language models\. arXiv preprint arXiv:2210\.03629\. Cited by: [§3\.3](https://arxiv.org/html/2606.12837#S3.SS3.p1.1)\.
- Zhipu\-AI \(2026\) GLM\-5\.1: towards long\-horizon tasks\. External Links: [Link](https://z.ai/blog/glm-5.1) Cited by: [Table 2](https://arxiv.org/html/2606.12837#S2.T2.4.2.2.2.2.2.2.7.4.1)\.
- P\. Zhou, B\. Leon, X\. Ying, C\. Zhang, Y\. Shao, Q\. Ye, D\. Chong, Z\. Jin, C\. Xie, M\. Cao, Y\. Gu, S\. Hong, J\. Ren, J\. Chen, C\. Liu, and Y\. Hua \(2025\) BrowseComp\-ZH: benchmarking web browsing ability of large language models in Chinese\. External Links: 2504\.19314, [Link](https://arxiv.org/abs/2504.19314) Cited by: [§3\.2](https://arxiv.org/html/2606.12837#S3.SS2.p2.1), [§4\.1](https://arxiv.org/html/2606.12837#S4.SS1.p3.1)\.

## Appendix A Calibration Error

We adopt Expected Calibration Error \(ECE\) to measure the alignment between model confidence and actual accuracy\. Predicted probabilities are partitioned into five equally spaced bins: $[0, 0.2)$, $[0.2, 0.4)$, $[0.4, 0.6)$, $[0.6, 0.8)$, and $[0.8, 1.0]$\. ECE is defined as:

$$\text{ECE} = \sum_{i=1}^{K} \frac{n_i}{N} \left| \text{acc}(i) - \text{conf}(i) \right| \tag{2}$$

where $K$ is the number of bins, $N$ is the total number of samples, $n_i$ is the number of samples in the $i$\-th bin, $\text{acc}(i)$ is the empirical accuracy of the $i$\-th bin, and $\text{conf}(i)$ is the average predicted confidence in the $i$\-th bin\. A lower ECE indicates better calibration between model confidence and actual performance\.
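
The ECE definition above can be sketched in a few lines of Python\. This is a minimal illustration rather than the evaluation code used for the benchmark: the function name `expected_calibration_error` and its arguments are hypothetical, confidences are assumed to be fractions in \[0, 1\] \(a percentage would first be divided by 100\), and empty bins are assumed to contribute zero\.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 5,
) -> float:
    """ECE over equally spaced confidence bins; the last bin is closed at 1.0."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("at least one sample is required")

    counts = [0] * num_bins        # n_i: samples per bin
    hits = [0] * num_bins          # correct answers per bin
    conf_sums = [0.0] * num_bins   # summed confidence per bin

    for conf, is_correct in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence out of range: {conf}")
        # conf == 1.0 would index one past the end, so clamp it into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        counts[idx] += 1
        hits[idx] += int(is_correct)
        conf_sums[idx] += conf

    ece = 0.0
    for n_i, hit_i, conf_sum_i in zip(counts, hits, conf_sums):
        if n_i == 0:
            continue  # assumption: empty bins add nothing to the sum
        acc_i = hit_i / n_i
        conf_i = conf_sum_i / n_i
        ece += (n_i / total) * abs(acc_i - conf_i)
    return ece
```

As a worked check, four answers with confidences 0\.9, 0\.9, 0\.1, 0\.5 and outcomes correct, wrong, wrong, correct fall into three bins with gaps 0\.4, 0\.1, and 0\.5 and weights 0\.5, 0\.25, and 0\.25, giving an ECE of 0\.35\.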

## Appendix B System Prompt

We adopt the same system prompt as BrowseComp to instruct all models\. The prompt content is as follows:

Your response should be in the following format: Explanation: \{your explanation for your final answer\} Exact Answer: \{your succinct, final answer\} Confidence: \{your confidence score between 0% and 100% for your answer\}
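
To score a response in this format, the three fields first have to be pulled out of the model output\. The sketch below shows one way to do that with a regular expression\. It is an illustration only: `parse_response` and `ParsedResponse` are hypothetical names, the field order is assumed to match the prompt, and the paper itself relies on an LLM\-based judge for answer correctness rather than on string parsing\.

```python
import re
from typing import NamedTuple, Optional


class ParsedResponse(NamedTuple):
    explanation: str
    exact_answer: str
    confidence: float  # fraction in [0, 1], converted from the percentage


# Assumes the three fields appear once each, in the order given by the prompt.
_RESPONSE_PATTERN = re.compile(
    r"Explanation:\s*(?P<explanation>.*?)\s*"
    r"Exact Answer:\s*(?P<answer>.*?)\s*"
    r"Confidence:\s*(?P<confidence>\d+(?:\.\d+)?)\s*%?",
    re.DOTALL,
)


def parse_response(text: str) -> Optional[ParsedResponse]:
    """Return the parsed fields, or None if the format is not followed."""
    match = _RESPONSE_PATTERN.search(text)
    if match is None:
        return None
    # Clamp to [0, 1] in case the model reports more than 100%.
    confidence = min(max(float(match.group("confidence")) / 100.0, 0.0), 1.0)
    return ParsedResponse(
        explanation=match.group("explanation"),
        exact_answer=match.group("answer"),
        confidence=confidence,
    )
```

The confidence is returned as a fraction so that it can be binned directly when computing a calibration metric such as ECE\.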

## Appendix C Tool Definitions

All models are equipped with two tools: search and browse\. Their definitions are as follows:

```
{
  "type": "function",
  "function": {
    "name": "search",
    "description": "Web search, using traditional search engines like google, complex questions need to be broken down into simple queries",
    "parameters": {
      "type": "object",
      "properties": {
        "query": {
          "type": "array",
          "items": {"type": "string"},
          "description": "List of search queries. You can search up to 5 queries at the same time."
        }
      },
      "required": ["query"]
    }
  }
}
```

```
{
  "type": "function",
  "function": {
    "name": "browse",
    "description": "Retrieve detailed content from one or more specified web pages by providing their URLs.",
    "parameters": {
      "type": "object",
      "required": ["url"],
      "properties": {
        "url": {
          "type": "array",
          "description": "List of urls. You can browse up to 3 webpages at the same time.",
          "items": {"type": "string"}
        }
      }
    }
  }
}
```
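
Note that the per\-call limits \(up to 5 queries, up to 3 URLs\) appear only in the description strings; the schemas themselves declare no maximum length\. A harness that wants to enforce them would have to check the arguments itself\. The following is a minimal sketch of such a check, assuming the limits are meant as hard caps; `TOOL_LIMITS` and `validate_tool_call` are hypothetical names and not part of the paper's tooling\.

```python
from typing import Any, Dict, List, Tuple

# Argument name and per-call cap for each tool, taken from the description
# strings of the two tool definitions (assumed here to be hard limits).
TOOL_LIMITS: Dict[str, Tuple[str, int]] = {
    "search": ("query", 5),
    "browse": ("url", 3),
}


def validate_tool_call(name: str, arguments: Dict[str, Any]) -> List[str]:
    """Return a list of problems with a tool call; an empty list means it passes."""
    if name not in TOOL_LIMITS:
        return [f"unknown tool: {name}"]

    key, max_items = TOOL_LIMITS[name]
    if key not in arguments:
        return [f"missing required argument: {key}"]

    values = arguments[key]
    if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
        return [f"{key} must be a list of strings"]

    errors: List[str] = []
    if len(values) > max_items:
        errors.append(f"{key} has {len(values)} items; at most {max_items} allowed")
    return errors
```

For example, `validate_tool_call("browse", {"url": ["a", "b", "c", "d"]})` reports one problem because four URLs exceed the stated cap of three\.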

## Appendix D Case Studies

We present one tree\-structured and one graph\-structured example from LoHoSearch\.

Case 1 \(Tree\-structured\): Consider the following facts\. They all pertain to one album\. Which album is it? 1\. It includes singles from another album \(Album A\), which was created by a group formed on a reality competition television series that originally aired from the early 2000s to the mid\-2010s; additionally, Album A includes a solo song by a person \(Person D\) who has one child, works as a pop singer and television host, and was born in the late 1970s\. 2\. It includes a track written by a person \(Person B\) who shares the same name as a person \(Person E\) who earned an award in the early 2000s; this award has been presented annually since the year 2000 and has an alternative name that includes an acronym for a community service initiative\. 3\. It includes tracks produced by a person \(Person C\) who studied an academic discipline historically defined by roughly three main methods—individual casework, social group work, and community intervention work—and from the late 20th century, many different methods derived from these three classic methods\. Answer: The Best of No Angels

Case 2 \(Graph\-structured\): Who is described by the following conditions? 1\. This person toured with a musical group \(Group C\)\. Group C performs in a music genre \(Genre D\) that is featured at a music festival \(Festival E\), where Group C was a performer in the early 2010s\. The group also released an album \(Album H\) in the early 2010s, which includes a musical composition \(Composition G\) by them\. Additionally, Group C performs at a recurring event \(Event F\), and this person has performed at Event F as well\. 2\. They recorded a comedy album at a recurring event \(Event I\)\. Event I featured Group C as a guest of honor in the late 2010s\. 3\. Furthermore, this person appeared as a guest on a podcast \(Podcast B\) and wrote and performed shows at a film festival \(Festival J\)\. Answer: Joseph Scrimshaw
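
The structural difference between the two cases can be made concrete by writing each question as an undirected graph over the entities it mentions\. The sketch below is our own reading of the two questions, not the paper's internal representation: the edge lists are transcribed by hand from the text above, labels such as "TV series", "award", and "discipline" are placeholders for unnamed entities, and "tree\-structured" is assumed to mean a connected graph with no cycles\.

```python
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]

# Case 1: every entity hangs off the target album along a single path.
CASE_1_EDGES = [
    ("album", "Album A"), ("Album A", "group"), ("group", "TV series"),
    ("Album A", "Person D"), ("album", "Person B"), ("Person B", "Person E"),
    ("Person E", "award"), ("album", "Person C"), ("Person C", "discipline"),
]

# Case 2: Group C is linked to the target person through several routes.
CASE_2_EDGES = [
    ("person", "Group C"), ("Group C", "Genre D"), ("Genre D", "Festival E"),
    ("Group C", "Festival E"), ("Group C", "Album H"),
    ("Album H", "Composition G"), ("Composition G", "Group C"),
    ("Group C", "Event F"), ("person", "Event F"), ("person", "Event I"),
    ("Event I", "Group C"), ("person", "Podcast B"), ("person", "Festival J"),
]


def is_tree(edges: Iterable[Edge]) -> bool:
    """True if the undirected graph is connected and acyclic (union-find)."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    edge_count = 0
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            return False  # this edge closes a cycle
        parent[root_u] = root_v
        edge_count += 1
    # An acyclic graph is connected exactly when it has (nodes - 1) edges.
    return edge_count == len(parent) - 1
```

Under this reading, Case 1 has 10 entities and 9 relations with no cycle, while Case 2 has 10 entities and 13 relations, so constraints such as the shared Event F and Event I cross\-check each other instead of forming independent chains\.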