EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

arXiv cs.CL Papers

Summary

This paper introduces EvoBrowseComp, a dynamic benchmark of 400 English and 400 Chinese complex questions that are synthesized via live-web traversal to evaluate search agents without test-set contamination, ensuring robustness against parametric memorization.

arXiv:2606.13120v1 Announce Type: new Abstract: Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to test-set contamination and parametric memorization. Consequently, models can achieve high scores through fact recall rather than genuine retrieval, obscuring true browsing competence via reasoning shortcuts. In this paper, we introduce EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions synthesized via live-web traversal. To collect these questions, we design a three-agent collaborative framework: (1) a QA synthesis agent that retrieves fresh knowledge from the live web to synthesize QA pairs; (2) an information filtering agent that filters retrieved knowledge in terms of credibility and popularity to block parametric shortcuts; and (3) a high-level guidance agent that formalizes questions into reasoning graphs to reduce logical redundancy and shortcuts in synthesized QA pairs. Because the framework supports fully automated synthesis, EvoBrowseComp can be regularly updated to prevent data contamination and maintain temporal freshness. Extensive experiments confirm its great difficulty, requiring broad horizontal search. It establishes a scalable paradigm for auto-updatable, high-difficulty benchmarking that keeps pace with both evolving world knowledge and advancing agent capabilities.

Cached at: 06/12/26, 08:51 AM

# EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge
Source: [https://arxiv.org/html/2606.13120](https://arxiv.org/html/2606.13120)
Yunhan Wang♣♠\*, Jiaan Wang♠†, Lianzhe Huang♠, Xianfeng Zeng♠ and Fandong Meng♠†

♣ Northeastern University, China; ♠ Weixin AI, Tencent Inc, China

yunhannnan@gmail.com, {torchwang,fandongmeng}@tencent.com

\* Work was done when Yunhan Wang was interning at Weixin AI, Tencent Inc, China. † Corresponding authors.

###### Abstract

Search Agents—large language models augmented with search tools—have intensified the need for future\-proof evaluation benchmarks\. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to test\-set contamination and parametric memorization\. Consequently, models can achieve high scores through fact recall rather than genuine retrieval, obscuring true browsing competence via reasoning shortcuts\.

In this paper, we introduce EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions synthesized via live-web traversal. To collect these questions, we design a three-agent collaborative framework: (1) a QA synthesis agent that retrieves *fresh* knowledge from the live web to synthesize QA pairs; (2) an information filtering agent that filters retrieved knowledge in terms of credibility and popularity to block parametric shortcuts; and (3) a high-level guidance agent that formalizes questions into reasoning graphs to reduce logical redundancy and shortcuts in synthesized QA pairs. Because the framework supports fully automated synthesis, EvoBrowseComp can be regularly updated to prevent data contamination and maintain temporal freshness. Extensive experiments confirm its great difficulty, requiring broad horizontal search. It establishes a scalable paradigm for auto-updatable, high-difficulty benchmarking that keeps pace with both evolving world knowledge and advancing agent capabilities. We have released the data at [https://hf.co/datasets/Krystalan/EvoBrowseComp](https://hf.co/datasets/Krystalan/EvoBrowseComp).


## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.13120v1/x1.png)

Figure 1: An illustrative example question from EvoBrowseComp. Orange highlights the fresh knowledge (post-2026) in its reasoning paths, while red denotes its final answer.

Large Language Models (LLMs) augmented with web search tools, known as search agents Wei et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib3)); Chen et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib1)); Zhou et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib2)), have demonstrated remarkable performance on information-seeking tasks. These agents exercise web browsing ability: persistently navigating the open web, working through multi-hop questions, and gathering fragmented evidence across different sources Wu et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib7)); Gupta et al. ([2026](https://arxiv.org/html/2606.13120#bib.bib4)). To measure this ability, many benchmark datasets have been proposed in succession. BrowseComp Wei et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib3)) and BrowseComp-ZH Zhou et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib2)) focus on horizontal search, evaluating persistence and creativity in locating hard-to-find facts. GAIA Mialon et al. ([2024](https://arxiv.org/html/2606.13120#bib.bib5)) tests general assistant competence through real-world, multi-step tool use. BFCL Patil et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib8)) assesses search orchestration via function calling, while WebWalker Wu et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib7)) isolates vertical traversal within structured websites. More recently, specialized benchmarks have targeted higher-order retrieval competencies: SealQA Pham et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib6)) probes robustness under noisy and conflicting retrieval conditions, while DeepSearchQA Gupta et al. ([2026](https://arxiv.org/html/2606.13120#bib.bib4)) raises the bar by requiring exhaustive collation of answer sets across multiple sources. These efforts have established a rich landscape for benchmarking LLMs’ web browsing ability.

However, existing benchmarks are typically anchored to static knowledge. For example, BrowseComp Wei et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib3)) and BrowseComp-ZH Zhou et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib2)) rely on question-answer pairs manually curated at a fixed point in time; BrowseComp-Plus Chen et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib1)) freezes a curated document snapshot to ensure reproducibility; GAIA Mialon et al. ([2024](https://arxiv.org/html/2606.13120#bib.bib5)) grounds its tasks in specific and immutable versions of web pages or attached files; and DeepSearchQA Gupta et al. ([2026](https://arxiv.org/html/2606.13120#bib.bib4)), though time-anchored, comprises a static prompt set evaluated against a fixed answer key. This static nature renders them acutely vulnerable to test-set contamination: as pre-training corpora expand, benchmark content inevitably leaks into model parameters, enabling models to solve questions via parametric memorization rather than genuine browsing and reasoning. As pointed out by Anthropic ([2026a](https://arxiv.org/html/2606.13120#bib.bib10)), the explicit leakage of BrowseComp answers into public data confirms that this benchmark has been compromised by data contamination.

To address these limitations, we introduce *EvoBrowseComp*, an evolving benchmark comprising 400 English and 400 Chinese complex questions automatically synthesized from live-web traversal. Our construction pipeline actively discovers and validates fresh knowledge and automatically constructs QA pairs through a three-agent collaborative framework. First, a QA Synthesis Agent retrieves fresh knowledge via web tools and proposes candidate QA pairs based on that knowledge. Second, an Information Filtering Agent filters the retrieved knowledge in terms of credibility (verifying source credibility and cross-source consistency) and popularity (blocking parametric shortcuts through over-exposed knowledge). Third, a High-level Guidance Agent structures each question as a reasoning graph using three basic operations: projection, intersection, and complement. It identifies both structural redundancies and shortcuts, and directs the QA synthesis agent toward specific synthesis directions. Moreover, we adopt several strategies to ensure data quality, including the verification of textual quality, answer uniqueness, and question difficulty. In this manner, high-quality, challenging questions involving fresh knowledge can be collected automatically (cf. Figure [1](https://arxiv.org/html/2606.13120#S1.F1)). In contrast, prior benchmarks Mialon et al. ([2024](https://arxiv.org/html/2606.13120#bib.bib5)); Wei et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib3)); Zhou et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib2)); Chen et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib1)); Pham et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib6)); Gupta et al. ([2026](https://arxiv.org/html/2606.13120#bib.bib4)) typically rely on labor-intensive human curation that makes regular updates prohibitively expensive. EvoBrowseComp removes this barrier: its synthesis pipeline is fully automated, requiring no costly manual annotation. This enables the benchmark to be refreshed continuously at minimal cost, swapping in newly emerged facts while retiring over-exposed ones.

Based on EvoBrowseComp, we evaluate various LLMs under both tool-based and tool-free settings. The results reveal two critical phenomena. First, even Claude-Opus-4.6 Anthropic ([2026b](https://arxiv.org/html/2606.13120#bib.bib16)), a cutting-edge reasoning LLM, achieves only 44.8% accuracy when equipped with tools, indicating that the answers to our temporally fresh, structurally complex questions are not easily retrieved. Second, when tool access is removed, Claude-Opus-4.6’s performance drops to 6.0%, confirming that answering these questions demands genuine retrieval and multi-hop reasoning over fresh knowledge, rather than static recall. We believe this establishes a sustainable, contamination-resistant paradigm for future-proof evaluation of search agents.

In summary, our contributions are as follows:

- We introduce EvoBrowseComp, a search agent benchmark comprising 400 English and 400 Chinese complex questions. Grounding questions in fresh knowledge, it prevents models from exploiting parametric memorization.
- We propose a fully automated three-agent synthesis framework. It requires no costly human annotation, enabling continuous, low-cost regeneration that retires over-exposed questions and incorporates newly emerged facts and knowledge.
- Extensive evaluations demonstrate that even frontier LLMs achieve only modest accuracy (<45%) with web tools, and their performance collapses sharply when tool access is removed (<11%). This confirms that EvoBrowseComp effectively isolates genuine web browsing and multi-hop reasoning from static parametric recall.

## 2 EvoBrowseComp

![Refer to caption](https://arxiv.org/html/2606.13120v1/x2.png)

Figure 2: An illustration of the three-agent collaborative framework. (a) The QA synthesis agent retrieves knowledge from the live web and generates candidate QA pairs; (b) the information filtering agent judges each piece of retrieved knowledge in terms of credibility and popularity (popular/over-covered or non-credible knowledge is discarded); (c) the high-level guidance agent detects logical redundancy and shortcuts in the candidate QA pairs based on the constructed reasoning graphs, and gives suggestions to the QA synthesis agent for the next iteration.

EvoBrowseComp is built on two foundational principles. First, *questions should involve fresh knowledge*. By synthesizing questions from knowledge that emerges after training cutoffs, we prevent models from answering via parametric memorization. Second, *the construction pipeline should be fully automated and continuously evolvable*. This enables periodic regeneration in which over-exposed questions are retired and replaced by questions built on newly surfaced knowledge, supporting long-term benchmark validity without expensive human curation.

### 2.1 Data Collection

The data collection pipeline operates as an iterative feedback loop among three specialized agents (cf. Figure [2](https://arxiv.org/html/2606.13120#S2.F2)). Beginning with seed entities, a QA synthesis agent searches the live web to propose candidate QA pairs together with the retrieved knowledge. Each piece of retrieved knowledge is evaluated by an information filtering agent in terms of credibility and popularity. A high-level guidance agent formalizes the underlying reasoning structure of a candidate question generated in the $i$-th iteration, detects its logical redundancy and shortcuts, and guides the QA synthesis agent in the next iteration. In this way, the three agents collaborate automatically to synthesize highly complex, high-quality QA pairs.

#### Seed Entity\.

Synthesizing temporally fresh and logically complex QA pairs requires seed entities that tend to involve fresh knowledge. Rather than harvesting entities from a static knowledge graph, which risks stale facts, we collect seed entities through live-web retrieval. Specifically, we pre-define 9 core domains (*e.g.*, science, economy, and geography) and 50 fine-grained sub-domains. For each sub-domain, we equip an advanced LLM, *i.e.*, DeepSeek-V3.2 Liu et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib9)), with search tools to aggregate recently surfaced entities mentioned in high-coverage news or official websites related to the sub-domain. This process yields about 50K seed entities, denoted as $\mathrm{E}$. Illustrative examples of seed entity collection are provided in Appendix [A.1](https://arxiv.org/html/2606.13120#A1.SS1).

#### QA Synthesis Agent\.

For a given seed entity $e \in \mathrm{E}$, the QA synthesis agent iteratively mines information from the live web to construct a QA pair $\langle q, a \rangle$. The overall synthesis process can be formulated as an $m$-step iterative chain:

$$e \rightarrow \langle q_e^{(1)}, a_e^{(1)} \rangle \rightarrow \langle q_e^{(2)}, a_e^{(2)} \rangle \rightarrow \dots \rightarrow \langle q_e^{(m)}, a_e^{(m)} \rangle \tag{1}$$

where $q_e^{(t)}$ and $a_e^{(t)}$ denote the question and its answer generated in the $t$-th iteration, respectively. In detail, the agent involves two sub-steps in each iteration:

(1) *Web Information Gathering*: The agent collects information through multi-turn interactions with web tools: a *search* tool uses the Google search engine to retrieve information, and a *visit* tool extracts targeted information from specific web pages. During this multi-turn interaction, we encourage the agent to gather *fresh* knowledge, defined as information that becomes available after a specified timestamp $t$. (In this paper, we set $t$ to January 1, 2026; it can be trivially adjusted to other timestamps, *e.g.*, the training cutoffs of specific LLMs.) The agent then refines the gathered knowledge into an evidence list, denoted as $\mathcal{E} = \{\epsilon_1, \epsilon_2, \ldots, \epsilon_n\}$, where each $\epsilon_i$ is a concise knowledge statement (*e.g.*, entity $e_i$ has some specific attributes).

(2) *QA Pair Construction*: Leveraging the evidence list $\mathcal{E}$, the agent combines these pieces of evidence to synthesize a complex QA pair $\langle q_e^{(t)}, a_e^{(t)} \rangle$. Ideally, all evidence in $\mathcal{E}$ is fresh knowledge, so that the synthesized $q_e^{(t)}$ is free from data contamination because it falls completely outside the search agents’ parametric memory. However, fresh knowledge appears much less frequently on the live web than non-fresh knowledge (*i.e.*, information already available before the timestamp $t$). Consequently, although we encourage the agent to gather fresh knowledge, $\mathcal{E}$ inevitably contains non-fresh knowledge. If we strictly required all evidence to be fresh, $\mathcal{E}$ would be too small to synthesize a complex question. Therefore, we allow some non-fresh knowledge in $\mathcal{E}$ and require the agent to classify each $\epsilon_i$ as either fresh or non-fresh. To avoid over-covered answers induced by the non-fresh knowledge in $\mathcal{E}$, we require the final answer to be based on fresh knowledge. In this way, a preliminary question $\hat{q}_e^{(t)}$ together with its answer $a_e^{(t)}$ is generated. To further increase the difficulty of $\hat{q}_e^{(t)}$, we follow Li et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib12)); Lu et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib14)) and obfuscate features and relationships within $\hat{q}_e^{(t)}$ (*e.g.*, vague time references and non-specific descriptors) to obtain the final question $q_e^{(t)}$. The prompts employed by the QA synthesis agent in these two sub-steps are presented in Appendix [A.2](https://arxiv.org/html/2606.13120#A1.SS2).

#### Information Filtering Agent\.

Directly using the gathered evidence list $\mathcal{E}$ to synthesize QA pairs might involve the following issues: (1) The evidence might suffer from rumor propagation, making the synthesized QA pairs unreliable. This issue is particularly pronounced for fresh knowledge compared to non-fresh knowledge Alkhodair et al. ([2020](https://arxiv.org/html/2606.13120#bib.bib27)). (2) Overly popular or over-covered non-fresh knowledge in $\mathcal{E}$ will make the relevant reasoning in the QA pairs too predictable for search agents Lu et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib14)).

To deal with the above issues, we introduce an information filtering agent to filter out flawed $\epsilon_i \in \mathcal{E}$: (1) For each fresh $\epsilon_i$, the agent is equipped with web tools (*i.e.*, search and visit) to cross-validate the reliability of $\epsilon_i$ on the live web, and finally outputs a reliability label, *i.e.*, “credible”, “not credible”, or “unclear”. (2) For each non-fresh $\epsilon_j$, the agent directly judges whether it is too popular or over-covered, without using any tools, and outputs a popularity label, *i.e.*, “popular” or “non-popular”. Only credible fresh evidence and non-popular non-fresh evidence are retained in $\mathcal{E}$. If the length of $\mathcal{E}$ falls below a pre-defined threshold $k$, the QA synthesis agent retries the web information gathering process to collect additional evidence. The prompts used to obtain the reliability and popularity labels are provided in Appendix [A.3](https://arxiv.org/html/2606.13120#A1.SS3).
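The retention rule described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the `Evidence` class, its field names, and both helper functions are hypothetical, and the labels are assumed to have already been produced by the filtering agent. The default `k=5` follows the threshold reported in the implementation details.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Evidence:
    """One knowledge statement (hypothetical container, not from the paper)."""
    statement: str
    fresh: bool                         # fresh / non-fresh label from the QA synthesis agent
    reliability: Optional[str] = None   # "credible" | "not credible" | "unclear" (fresh only)
    popularity: Optional[str] = None    # "popular" | "non-popular" (non-fresh only)


def filter_evidence(evidence: List[Evidence]) -> List[Evidence]:
    """Keep credible fresh evidence and non-popular non-fresh evidence."""
    kept = []
    for ev in evidence:
        if ev.fresh and ev.reliability == "credible":
            kept.append(ev)
        elif not ev.fresh and ev.popularity == "non-popular":
            kept.append(ev)
    return kept


def needs_more_evidence(evidence: List[Evidence], k: int = 5) -> bool:
    """True when the filtered list is shorter than the threshold k,
    i.e. the synthesis agent should gather more evidence."""
    return len(filter_evidence(evidence)) < k
```

Note that evidence labeled “unclear” is dropped along with “not credible” evidence, since only the “credible” label leads to retention.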

![Refer to caption](https://arxiv.org/html/2606.13120v1/x3.png)

Figure 3: An illustration of a reasoning graph.

#### High\-Level Guidance Agent\.

As pointed out by Tao et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib24)), information-driven text-only synthesis paradigms struggle to capture underlying complex topologies (*e.g.*, entity-relation structures) and lack systematic control. As a result, the synthesized QA pairs tend to contain *redundant* or *shortcut-prone* reasoning paths. In contrast, graphs offer a structured, semantically rich environment for multi-hop reasoning, enabling explicit control over reasoning paths Lu et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib14)).

To mitigate the reasoning redundancy and shortcuts inherent in text-only synthesis paradigms and to provide high-level, explicit control over QA synthesis, we introduce a high-level guidance agent. This agent structures each synthesized question $q_e^{(t)}$ into a reasoning graph $\mathcal{G}_e^t = \{\mathcal{V}_e^t, \mathcal{R}_e^t\}$, where $\mathcal{V}_e^t$ and $\mathcal{R}_e^t$ denote the node set and the edge set of $\mathcal{G}_e^t$, respectively. Figure [3](https://arxiv.org/html/2606.13120#S2.F3) shows an example of $\mathcal{G}_e^t$ for the example question in Figure [1](https://arxiv.org/html/2606.13120#S1.F1). Each node $v_i \in \mathcal{V}_e^t$ indicates an object, *i.e.*, an entity (set) or an attribute (set). As for the edges $\mathcal{R}_e^t$, we employ logical operations to capture the complex topologies within $\mathcal{V}_e^t$: (1) Intersection returns the intersection of $v_i$ and $v_j$, denoted as $v_i \cap v_j$. For example, the intersection of Turing Award laureates and females is the set of female Turing Award laureates. (2) Complement returns the complement of $v_i$, denoted as $\overline{v_i}$. For example, the complement of Turing Award laureates (with respect to all people) is the set of people who have not received the Turing Award. These two operations form a complete basis for expressing any entity set Enderton ([2001](https://arxiv.org/html/2606.13120#bib.bib28)). To further support relations and attributes, we introduce: (3) Projection, which projects $v_i$ to $v_j$ via a specific relation $r$, denoted as $v_j = \pi_r(v_i)$. For example, $\pi_{\text{winner}}(\text{Turing Award})$ denotes Turing Award laureates, while $\pi_{\text{gender}}(\text{Jackie Chan})$ denotes Jackie Chan’s gender. Based on the above definitions, each edge in the graph is labeled with one of the three operations.
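To make the three operations concrete, the sketch below models each node as a Python set and each relation as a dictionary from an object to the set of objects it maps to. This representation, the function names, and the toy data (`award_x`, `p1` to `p4`) are illustrative placeholders, not part of the paper's pipeline.

```python
from typing import Dict, Set


def intersection(v_i: Set[str], v_j: Set[str]) -> Set[str]:
    """v_i ∩ v_j: objects that belong to both nodes."""
    return v_i & v_j


def complement(v_i: Set[str], universe: Set[str]) -> Set[str]:
    """Complement of v_i with respect to an explicit universe."""
    return universe - v_i


def projection(v_i: Set[str], relation: Dict[str, Set[str]]) -> Set[str]:
    """pi_r(v_i): everything reachable from v_i through relation r."""
    result: Set[str] = set()
    for obj in v_i:
        result |= relation.get(obj, set())
    return result


# Toy data with placeholder identifiers.
people = {"p1", "p2", "p3", "p4"}
winner = {"award_x": {"p1", "p2"}}      # relation: award -> laureates
females = {"p1", "p3"}

laureates = projection({"award_x"}, winner)           # pi_winner(award_x)
female_laureates = intersection(laureates, females)   # laureates ∩ females
non_laureates = complement(laureates, people)         # complement of laureates
```

Here `complement` takes the universe explicitly, mirroring the "with respect to all people" qualifier in the text.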

After parsing $q_e^{(t)}$ into $\mathcal{G}_e^t$, we (1) detect reasoning redundancy by checking for isolated nodes or subgraphs in $\mathcal{G}_e^t$, and (2) detect reasoning shortcuts by examining whether there exist structural bypasses leading to the answer node. These detections can be performed with an off-the-shelf toolkit, *i.e.*, NetworkX ([https://github.com/networkx/networkx](https://github.com/networkx/networkx)). Further, using $\mathcal{G}_e^t$ together with the detected reasoning redundancies and shortcuts, the high-level guidance agent generates a textual instruction (denoted as $\mathcal{I}_e^t$) that specifies the synthesis direction (*e.g.*, which logical operations to add) and any redundancy or shortcuts that should be avoided in the subsequent iteration. The instructions are fed into the QA synthesis agent and guide its behavior in the next iteration:

$$\big(\mathcal{I}_e^t, \langle q_e^{(t)}, a_e^{(t)} \rangle\big) \xrightarrow{\text{QA Synthesis Agent}} \langle q_e^{(t+1)}, a_e^{(t+1)} \rangle \tag{2}$$

The prompts for graph parsing and instruction generation are provided in Appendix [A.4](https://arxiv.org/html/2606.13120#A1.SS4).
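The two structural checks on the reasoning graph can be illustrated with a small standard-library sketch. The paper performs these checks with NetworkX; the version below avoids that dependency only to stay self-contained. Two interpretations here are assumptions rather than details given in the paper: redundancy is taken to mean nodes with no (undirected) connection to the answer node, and a shortcut is taken to mean an edge whose target is also reachable from its source through a longer path in an acyclic graph. The function names and the toy graph are hypothetical.

```python
from collections import defaultdict, deque
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def redundant_nodes(nodes: Iterable[str], edges: List[Edge], answer: str) -> Set[str]:
    """Nodes that are not connected to the answer node (edge direction ignored)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = {answer}
    queue = deque([answer])
    while queue:                      # breadth-first search from the answer node
        cur = queue.popleft()
        for nxt in adj[cur]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return set(nodes) - seen


def bypass_edges(edges: List[Edge]) -> Set[Edge]:
    """Edges (u, v) where v is also reachable from u without the direct edge.
    Assumes the graph is acyclic."""
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)

    def reachable_without_direct(u: str, v: str) -> bool:
        stack = [w for w in succ[u] if w != v]   # skip the direct edge u -> v
        seen = set(stack)
        while stack:
            cur = stack.pop()
            if cur == v:
                return True
            for nxt in succ[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    return {(u, v) for u, v in edges if reachable_without_direct(u, v)}


# Toy graph: a three-hop chain, one direct bypass, and one stray node.
toy_nodes = {"seed", "a", "b", "answer", "stray"}
toy_edges = [("seed", "a"), ("a", "b"), ("b", "answer"), ("seed", "answer")]
```

On the toy graph, `stray` is flagged as redundant and the direct `seed -> answer` edge is flagged as a bypass of the three-hop chain.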

#### Iteration Termination\.

The iteration of the three-agent collaborative framework is repeated until all of the following conditions are satisfied: (1) the synthesized question $q_e^{(t)}$ contains no redundancy or shortcuts; (2) at least five iterations have been executed; (3) the reasoning graph contains at least five edges, *i.e.*, $|\mathcal{R}_e^t| \geq 5$.
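The three termination conditions combine into a single conjunction, sketched below. The function and parameter names are hypothetical; the two thresholds of five come from the conditions stated above.

```python
def should_stop(
    iteration: int,
    num_edges: int,
    has_redundancy: bool,
    has_shortcut: bool,
    min_iterations: int = 5,
    min_edges: int = 5,
) -> bool:
    """Return True only when all three termination conditions hold."""
    structurally_clean = not has_redundancy and not has_shortcut   # condition (1)
    enough_iterations = iteration >= min_iterations                # condition (2)
    enough_edges = num_edges >= min_edges                          # condition (3)
    return structurally_clean and enough_iterations and enough_edges
```

Because the conditions are conjunctive, a structurally clean question produced in an early iteration still goes through further rounds of synthesis.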

### 2.2 Data Quality

In the previous section, the high-level guidance agent serves as the main quality control, *i.e.*, it avoids reasoning redundancy and shortcuts in the synthesized questions. To further ensure the textual quality, answer uniqueness, and difficulty of the collected QA pairs, we employ the following strategies:

#### Textual Quality\.

We evaluate whether each synthesized QA pair is fluent, clear, and unambiguous by using DeepSeek-V3.2 DeepSeek-AI ([2025](https://arxiv.org/html/2606.13120#bib.bib20)) as the judge model. Then, we filter out the low-quality QA pairs. The prompt is shown in Appendix [B](https://arxiv.org/html/2606.13120#A2).

#### Uniqueness and Difficulty\.

To avoid alternative answers in the synthesized questions, we adopt a cross-validation method inspired by xbench-DeepSearch Xbench-Team ([2025](https://arxiv.org/html/2606.13120#bib.bib25)). In detail, for each question, we employ six cutting-edge LLMs (DeepSeek-V4, DeepSeek-V3.2, GLM-5, Kimi-K2.6, Qwen3.5-397B-A17B, and Qwen3.5-122B-A10B) as search agents, each answering the question three times independently (temperature 1.0, top_p 0.95), resulting in 18 solutions. If more than 80% of the solutions converge on the same incorrect answer, the question is treated as a multiple-answer question and is discarded. As for difficulty, if more than five (out of six) LLMs correctly answer the question, it is discarded.
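The two discard rules can be sketched as follows. Several choices here are assumptions made only for illustration: answers are compared by exact string equality (a real pipeline would need some form of answer matching), a model counts as answering correctly if any of its three runs is correct, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def is_multi_answer(solutions: List[str], gold: str, threshold: float = 0.8) -> bool:
    """True if more than `threshold` of all solutions agree on one incorrect answer."""
    wrong = Counter(s for s in solutions if s != gold)
    if not wrong:
        return False
    _, top_count = wrong.most_common(1)[0]   # most frequent incorrect answer
    return top_count / len(solutions) > threshold


def is_too_easy(runs_per_model: Dict[str, List[str]], gold: str,
                max_correct_models: int = 5) -> bool:
    """True if more than `max_correct_models` models answer correctly.
    Assumption: one correct run out of three is enough for a model to count."""
    correct_models = sum(
        1 for runs in runs_per_model.values() if any(r == gold for r in runs)
    )
    return correct_models > max_correct_models
```

With 18 solutions, the 80% threshold corresponds to 14.4, so at least 15 solutions must agree on the same incorrect answer before a question is treated as multiple-answer.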

After quality filtering, we balance the data distribution across domains and ultimately obtain 400 English and 400 Chinese QA pairs, which constitute EvoBrowseComp. To further assess data quality, we conduct a human analysis of 100 randomly selected QA samples (50 English and 50 Chinese). Since directly asking humans to answer the questions in a web environment is overly challenging, we instead use the evidence list $\mathcal{E}$ as an anchor and ask human evaluators to verify (1) whether each $\epsilon_i \in \mathcal{E}$ is correct; (2) whether each synthesized question is consistent with its corresponding evidence list $\mathcal{E}$ and is unambiguous; and (3) whether each answer can be inferred from the evidence list $\mathcal{E}$ (for more details, please refer to Appendix [C](https://arxiv.org/html/2606.13120#A3)). The results indicate that 93.0% of evidence lists are entirely correct, 90.0% of questions are both consistent with their corresponding evidence lists and unambiguous (the remaining 7% of evidence lists may involve hallucinations, and the remaining 10% of questions may exhibit minor ambiguity), and 100.0% of answers can be inferred from $\mathcal{E}$. Overall, *87%* of QA pairs pass all three verifications simultaneously, which supports the quality of our synthesis framework.

### 2.3 Data Statistics

![Refer to caption](https://arxiv.org/html/2606.13120v1/x4.png)

Figure 4: Distribution across nine domains in EvoBrowseComp.

Table 1: Data statistics of EvoBrowseComp compared with previous benchmark datasets (Lang.: Language). “Length” denotes the average question length, and “Node” denotes the average number of nodes in the reasoning graphs.

EvoBrowseComp contains 800 high-quality QA pairs (400 in English and 400 in Chinese). We analyze the data statistics from the following aspects:

Domain Distribution. As shown in Figure [4](https://arxiv.org/html/2606.13120#S2.F4), EvoBrowseComp is evenly distributed across nine predefined domains, ensuring broad coverage of knowledge areas.

Length. We calculate the average question length of EvoBrowseComp and previous benchmark datasets. As shown in Table [1](https://arxiv.org/html/2606.13120#S2.T1), the average question lengths in EvoBrowseComp are 142.48 and 162.33 tokens for English and Chinese, respectively, which are generally longer than those in previous datasets, particularly for Chinese.

Complexity. We use reasoning graphs to structure complex questions in Section [2.1](https://arxiv.org/html/2606.13120#S2.SS1). In addition to EvoBrowseComp, we also extract reasoning graphs for the complex questions of previous datasets. The average number of nodes in the reasoning graphs, which reflects question complexity, is also reported in Table [1](https://arxiv.org/html/2606.13120#S2.T1). The average number of nodes in EvoBrowseComp-EN is similar to that in BrowseComp (8.62 vs. 8.85). On Chinese data, the average number of nodes in EvoBrowseComp-ZH is considerably greater than in the other datasets, *i.e.*, 8.07 vs. 6.63/4.63, indicating the complexity of our data.

![Refer to caption](https://arxiv.org/html/2606.13120v1/x5.png)

Figure 5: Distribution of the number of distinct root domains per question in EvoBrowseComp.

Source Diversity. While collecting the evidence list $\mathcal{E}$ during data synthesis, we also record the source URL of each piece of evidence $\epsilon_i \in \mathcal{E}$. To characterize the source diversity of the synthesized questions, we calculate, for each question, the number of distinct root domains present in the corresponding $\mathcal{E}$. For example, if the evidence for a given question is drawn from “a.com” and “b.com”, there are two distinct root domains. In principle, the greater the number of distinct root domains, the more complex and difficult the question becomes. As shown in Figure [5](https://arxiv.org/html/2606.13120#S2.F5), the number of distinct root domains per question in EvoBrowseComp exhibits a bell-shaped distribution. On average, each question involves 4.2 distinct root domains, and over 90% of questions require reasoning across at least three independent sources. This makes cross-site evidence aggregation and verification a necessary prerequisite for answering.
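The per-question count of distinct root domains can be computed along the following lines. The paper does not say how root domains are extracted; this sketch assumes a naive rule that keeps the last two labels of the hostname, which mishandles multi-part public suffixes such as `co.uk`. The function names and example URLs are hypothetical.

```python
from typing import Iterable
from urllib.parse import urlparse


def root_domain(url: str) -> str:
    """Naive root domain: the last two labels of the hostname.
    Assumption: no multi-part public suffixes (e.g. 'co.uk')."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host


def distinct_root_domains(urls: Iterable[str]) -> int:
    """Number of distinct, non-empty root domains among the evidence URLs."""
    return len({d for d in map(root_domain, urls) if d})


evidence_urls = [
    "https://news.a.com/story",   # same root domain as the next URL
    "https://a.com/about",
    "https://b.com/report",
]
```

For `evidence_urls`, the subdomain `news.a.com` collapses into `a.com`, so the question draws on two distinct root domains, matching the “a.com” and “b.com” example in the text.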

### 2.4 Evaluation Protocol

In model evaluation, search agents are required to answer the given complex questions through multi-turn interactions with the following two web tools: (1) *Search* uses the Google search engine to retrieve information. (2) *Visit* extracts targeted information from specific web pages. The definition and implementation of these tools are derived from Team et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib26)) ([https://github.com/Alibaba-NLP/DeepResearch](https://github.com/Alibaba-NLP/DeepResearch)). Following previous benchmark datasets Wei et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib3)); Zhou et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib2)), we also employ the LLM-as-a-judge paradigm in model evaluation. Specifically, a judge model verifies whether each model prediction is correct. The judge prompt is derived from Team et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib26)); Xbench-Team ([2025](https://arxiv.org/html/2606.13120#bib.bib25)), and is provided in Appendix [D](https://arxiv.org/html/2606.13120#A4). We use GLM-5-Chat Zeng et al. ([2026](https://arxiv.org/html/2606.13120#bib.bib22)) as the judge model since it shows a strong correlation with human judgments (please refer to Appendix [E](https://arxiv.org/html/2606.13120#A5)). The evaluation metric is accuracy based on the judge model’s decisions.

Table 2: Experimental results on EvoBrowseComp. Bold and underline denote the best and second-best performances, respectively.

## 3 Experiments

### 3.1 Experimental Setup

LLMs. Based on EvoBrowseComp, we evaluate the following cutting-edge LLMs: Claude-Opus-4.6 Anthropic ([2026b](https://arxiv.org/html/2606.13120#bib.bib16)); [Qwen3.5-397B-A17B-FP8](https://huggingface.co/Qwen/Qwen3.5-397B-A17B-FP8), [Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B), [Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B), and [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) Qwen ([2026](https://arxiv.org/html/2606.13120#bib.bib21)); [Qwen3-235B-A22B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507) Yang et al. ([2025](https://arxiv.org/html/2606.13120#bib.bib23)); [DeepSeek-V4-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) and [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) DeepSeek-AI ([2026](https://arxiv.org/html/2606.13120#bib.bib18)); [DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) DeepSeek-AI ([2025](https://arxiv.org/html/2606.13120#bib.bib20)); [GLM-5](https://huggingface.co/zai-org/GLM-5) Zeng et al. ([2026](https://arxiv.org/html/2606.13120#bib.bib22)); and [Kimi-K2.6](https://huggingface.co/moonshotai/Kimi-K2.6) Kimi ([2026](https://arxiv.org/html/2606.13120#bib.bib19)).

Implementation Details\. In data collection, all three agents are built on DeepSeek\-V3\.2 DeepSeek\-AI \([2025](https://arxiv.org/html/2606.13120#bib.bib20)\)\. The pre\-defined threshold $k$ is set to 5 in the information filtering agent\. In model evaluation, all open\-source LLMs are deployed on NVIDIA H20 GPUs \(96G\) using SGLang \([https://github\.com/sgl\-project/sglang](https://github.com/sgl-project/sglang)\)\. Most models are run on 8 GPUs, except DeepSeek\-V4\-Pro and GLM\-5, which each use 32 GPUs, and DeepSeek\-V3\.2, which uses 16 GPUs\. We adopt sampling\-based decoding with a temperature of 0\.6 and a top\_p of 0\.95\. We set the maximum context length to 128K for all LLMs to ensure a fair comparison, and limit the maximum number of tool calls to 40\. All LLMs are evaluated with the \(maximum\) thinking mode\. After model prediction, we use GLM\-5\-Chat \(zero temperature\) as the judge model\. For each LLM, we run three independent evaluations and report the average\.
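The evaluation settings above can be collected in one place, as in the following sketch\. The class `EvalConfig` and the function `average_accuracy` are hypothetical names, the reading of “128K” as 128 × 1024 tokens is an assumption, and the three run accuracies are made\-up numbers, not results from the paper\.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class EvalConfig:
    temperature: float = 0.6               # sampling temperature for the evaluated LLM
    top_p: float = 0.95                    # nucleus sampling threshold
    max_context_tokens: int = 128 * 1024   # assumption: "128K" read as 128 * 1024
    max_tool_calls: int = 40               # tool-call budget per question
    judge_temperature: float = 0.0         # the judge model is run at zero temperature
    num_runs: int = 3                      # independent evaluations per LLM


def average_accuracy(run_accuracies: Sequence[float], config: EvalConfig) -> float:
    """Average the accuracies of the independent runs required by the config."""
    if len(run_accuracies) != config.num_runs:
        raise ValueError(
            f"expected {config.num_runs} runs, got {len(run_accuracies)}"
        )
    return mean(run_accuracies)


CONFIG = EvalConfig()
made_up_runs = [0.40, 0.45, 0.41]  # illustrative values only
print(round(average_accuracy(made_up_runs, CONFIG), 4))
```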

Table 3: The performance of three example LLMs on BrowseComp \(BC\), BrowseComp\-ZH \(BC\-ZH\) and EvoBrowseComp\-EN/ZH \(EvoBC\.\-EN/ZH\)\.

### 3\.2Results & Analyses

Table[2](https://arxiv.org/html/2606.13120#S2.T2)presents the model performances on EvoBrowseComp, and we analyze the results from the following aspects:

Tool\-Free Setting\. Without tools, most LLMs’ accuracy is below 5% \(*e\.g\.*, Kimi\-K2\.6 achieves only 0% and 1\.6% in English and Chinese, respectively\), and even the best\-performing LLM \(DeepSeek\-V3\.2\) achieves only 6\.3% in English and 10\.3% in Chinese\. The limited tool\-free performance suggests that EvoBrowseComp effectively prevents reliance on parametric memory by introducing fresh knowledge into the complex questions\.

Tool\-Based Setting\. With access to web tools, in English, Claude\-Opus\-4\.6 ranks first with 44\.8%, followed by Qwen3\.5\-397B \(42\.0%\) and GLM\-5 \(39\.2%\)\. The results of the Chinese evaluation also show similar trends\. Taking Qwen3\.5\-397B, GLM\-5 and DeepSeek\-V3\.2 as example LLMs, we also compare their performance on EvoBrowseComp and previous BrowseComp\(\-ZH\) Wei et al\. \([2025](https://arxiv.org/html/2606.13120#bib.bib3)\); Zhou et al\. \([2025](https://arxiv.org/html/2606.13120#bib.bib2)\)\. \(Footnote 19: The results on BrowseComp and BrowseComp\-ZH are directly borrowed from Zeng et al\. \([2026](https://arxiv.org/html/2606.13120#bib.bib22)\); DeepSeek\-AI \([2025](https://arxiv.org/html/2606.13120#bib.bib20)\); Qwen \([2026](https://arxiv.org/html/2606.13120#bib.bib21)\)\.\) As shown in Table [3](https://arxiv.org/html/2606.13120#S3.T3), we find that the model performances on our data are significantly lower than those on BrowseComp / BrowseComp\-ZH, indicating the difficulty of our complex questions\.

Table 4: The performance of DeepSeek\-V4\-Flash on EvoBrowseComp using different reasoning efforts\. DS\.: DeepSeek; ACC\.: “accuracy”; ER\.: “Exceed Ratio” indicates the proportion of evaluation samples that exceed the maximum allowed number of tool calls\.

The Effect of Reasoning Effort\. We also observe that DeepSeek\-V4\-Pro/Flash underperform DeepSeek\-V3\.2\. We manually inspect the predictions of DeepSeek\-V4 and find that many evaluation samples exceed the maximum allowed number of tool calls \(40; cf\. §[3\.1](https://arxiv.org/html/2606.13120#S3.SS1)\), which leads to its unsatisfactory performance\. To examine the effect of reasoning effort, we use DeepSeek\-V4\-Flash as an example and evaluate its performance under three configurations: the original max\-reasoning setting, a high\-reasoning setting \(DeepSeek\-V4\-High\), and a non\-reasoning setting \(DeepSeek\-V4\-Chat\)\. In addition to accuracy, we also report the proportion of evaluation samples that exceed the maximum allowed number of tool calls \(abbr\. ER\)\. As shown in Table [4](https://arxiv.org/html/2606.13120#S3.T4), we find that DeepSeek\-V4\-Flash\-High achieves the best performance among the three configurations, while DeepSeek\-V4\-Flash\-Max even underperforms DeepSeek\-V4\-Flash\-Chat\. In terms of ER, we find that DeepSeek\-V4 has much higher ER scores than the best\-performing LLM \(Claude\-Opus\-4\.6\)\. For example, DeepSeek\-V4\-Max reaches ER scores of 75\.5% and 82\.5% in English and Chinese, respectively\. This phenomenon raises concerns about reasoning efficiency\. Although state\-of\-the\-art LLMs exhibit strong reasoning capabilities, enhancing their reasoning efficiency remains crucial for practical applications\.
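The Exceed Ratio \(ER\) can be illustrated with a short sketch\. It assumes that each evaluation sample records how many tool calls the agent attempted, and that a sample counts as exceeding the budget when this number is larger than the limit of 40; how the paper detects an exceeded budget is not specified here, so both the record format and the function name `exceed_ratio` are assumptions\.

```python
from typing import Sequence

MAX_TOOL_CALLS = 40  # tool-call budget stated in the experimental setup


def exceed_ratio(attempted_tool_calls: Sequence[int],
                 limit: int = MAX_TOOL_CALLS) -> float:
    """Proportion of samples whose attempted tool calls exceed the limit.

    Assumption: a sample "exceeds" the budget when it attempted strictly
    more than `limit` tool calls.
    """
    if not attempted_tool_calls:
        raise ValueError("no samples to evaluate")
    exceeded = sum(1 for calls in attempted_tool_calls if calls > limit)
    return exceeded / len(attempted_tool_calls)


# Made-up per-sample counts for illustration only.
counts = [12, 41, 40, 55, 8]
er = exceed_ratio(counts)
print(f"ER = {er:.1%}")
```

Reporting ER next to accuracy separates two failure modes: answering incorrectly within the budget, and running out of tool calls before producing an answer\.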

## 4Related Work

A growing body of work introduces benchmarks to evaluate the browsing, reasoning, and retrieval capabilities of LLMs\. Early datasets such as NaturalQuestions Kwiatkowski et al\. \([2019](https://arxiv.org/html/2606.13120#bib.bib34)\), TriviaQA Joshi et al\. \([2017](https://arxiv.org/html/2606.13120#bib.bib35)\), and HotpotQA Yang et al\. \([2018](https://arxiv.org/html/2606.13120#bib.bib36)\) focus on single\-hop or multi\-hop fact retrieval; however, many of these datasets are effectively handled by cutting\-edge LLMs Mialon et al\. \([2024](https://arxiv.org/html/2606.13120#bib.bib5)\)\. To raise the difficulty ceiling, BrowseComp Wei et al\. \([2025](https://arxiv.org/html/2606.13120#bib.bib3)\) introduces reverse\-engineered questions requiring persistence and creativity in information seeking, while BrowseComp\-Plus Chen et al\. \([2025](https://arxiv.org/html/2606.13120#bib.bib1)\) provides a fixed, human\-verified corpus to disentangle retriever performance from search agent reasoning\. Parallel efforts such as BrowseComp\-ZH Zhou et al\. \([2025](https://arxiv.org/html/2606.13120#bib.bib2)\) extend BrowseComp to Chinese, and WebWalkerQA Wu et al\. \([2025](https://arxiv.org/html/2606.13120#bib.bib7)\) emphasizes vertical web traversal through structured official websites\. GAIA Mialon et al\. \([2024](https://arxiv.org/html/2606.13120#bib.bib5)\) proposes conceptually simple yet execution\-heavy tasks for general AI assistants, requiring diverse tool use and multi\-step planning\. DeepSearchQA Gupta et al\. \([2026](https://arxiv.org/html/2606.13120#bib.bib4)\) shifts the evaluation focus from single\-answer retrieval to exhaustive set generation, stressing systematic collation, entity resolution, and stopping criteria\. More recently, SealQA Pham et al\. \([2025](https://arxiv.org/html/2606.13120#bib.bib6)\) highlights the brittleness of search\-augmented models under conflicting, noisy, or misleading retrieval results\.
Despite their respective strengths, these benchmarks largely rely on static or fixed corpora\. As noted by Pham et al\. \([2025](https://arxiv.org/html/2606.13120#bib.bib6)\), this static nature renders them susceptible to progressive data contamination\. Different from previous datasets, we design a three\-agent framework to automatically discover *fresh* knowledge and construct contamination\-free complex questions\. Without costly manual annotation, our data can be regularly updated to prevent data contamination and ensure temporal freshness\.

## 5Conclusion

In this paper, we introduce EvoBrowseComp, an evolving search agent benchmark, which contains 400 English and 400 Chinese contamination\-free complex QA pairs\. We design a three\-agent collaborative framework to discover fresh knowledge from the live web and synthesize the QA data\. Multiple strategies are implemented to ensure data quality, including the avoidance of reasoning redundancy and shortcuts, as well as the verification of textual quality, answer uniqueness, and question difficulty\. Human analyses further indicate that the synthesized data achieves a high level of quality\. Since the framework operates entirely automatically and does not require costly manual annotation, it can be regularly updated to prevent data contamination and ensure temporal freshness\. Furthermore, experimental results on cutting\-edge LLMs underscore the challenges posed by this data\. Thus, it establishes a sustainable paradigm for the future\-proof evaluation of search agents\.

## Limitations

While we show the effectiveness of the three\-agent synthesis framework, there are some limitations worth noting: \(1\) We employ DeepSeek\-V3\.2 DeepSeek\-AI \([2025](https://arxiv.org/html/2606.13120#bib.bib20)\) as the backbone of the three agents, and thus, the synthesized data might involve the same biases and toxic behaviors exhibited by the model\. \(2\) In the model evaluation, we use the judge model to assess only the final answers rather than the entire reasoning trajectory\. Consequently, it becomes difficult to distinguish an agent that reasoned correctly from one that obtained the correct answer through inefficient or accidental means \(*e\.g\.*, lucky guessing\)\.

## Ethical Considerations

We discuss the main ethical considerations of EvoBrowseComp as follows: \(1\) Licenses\. We will release our synthesized data under the CC\-BY\-NC\-SA 4\.0 license\. \(2\) Privacy Information\. We extract knowledge from publicly available web pages, and we filter out potential privacy information via LLMs\.

## References

- Detecting breaking news rumors of emerging topics in social media\.Information Processing & Management57\(2\),pp\. 102018\.Cited by:[§2\.1](https://arxiv.org/html/2606.13120#S2.SS1.SSS0.Px3.p1.2)\.
- Anthropic \(2026a\)External Links:[Link](https://www.anthropic.com/engineering/eval-awareness-browsecomp)Cited by:[§1](https://arxiv.org/html/2606.13120#S1.p2.1)\.
- Anthropic \(2026b\)System card: claude opus 4\.6\.External Links:[Link](https://www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf)Cited by:[§1](https://arxiv.org/html/2606.13120#S1.p4.1),[§3\.1](https://arxiv.org/html/2606.13120#S3.SS1.p1.1)\.
- Z\. Chen, X\. Ma, S\. Zhuang, P\. Nie, K\. Zou, A\. Liu, J\. Green, K\. Patel, R\. Meng, M\. Su,et al\.\(2025\)Browsecomp\-plus: a more fair and transparent evaluation benchmark of deep\-research agent\.arXiv preprint arXiv:2508\.06600\.Cited by:[§1](https://arxiv.org/html/2606.13120#S1.p1.1),[§1](https://arxiv.org/html/2606.13120#S1.p2.1),[§1](https://arxiv.org/html/2606.13120#S1.p3.1),[§4](https://arxiv.org/html/2606.13120#S4.p1.1)\.
- DeepSeek\-AI \(2025\)DeepSeek\-v3\.2: pushing the frontier of open large language models\.CoRRabs/2512\.02556\.External Links:[Link](https://doi.org/10.48550/arXiv.2512.02556),[Document](https://dx.doi.org/10.48550/ARXIV.2512.02556),2512\.02556Cited by:[Appendix E](https://arxiv.org/html/2606.13120#A5.p1.1),[§2\.2](https://arxiv.org/html/2606.13120#S2.SS2.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2606.13120#S3.SS1.p1.1),[§3\.1](https://arxiv.org/html/2606.13120#S3.SS1.p2.1),[Limitations](https://arxiv.org/html/2606.13120#Sx1.p1.1),[footnote 19](https://arxiv.org/html/2606.13120#footnote19)\.
- DeepSeek\-AI \(2026\)DeepSeek\-v4: towards highly efficient million\-token context intelligence\.Cited by:[Appendix E](https://arxiv.org/html/2606.13120#A5.p1.1),[§3\.1](https://arxiv.org/html/2606.13120#S3.SS1.p1.1)\.
- H\. B\. Enderton \(2001\)A mathematical introduction to logic\.Elsevier\.Cited by:[§2\.1](https://arxiv.org/html/2606.13120#S2.SS1.SSS0.Px4.p2.20)\.
- N\. Gupta, R\. Chatterjee, L\. Haas, C\. Tao, A\. Wang, C\. Liu, H\. Oiwa, E\. Gribovskaya, J\. Ackermann, J\. Blitzer,et al\.\(2026\)DeepSearchQA: bridging the comprehensiveness gap for deep research agents\.arXiv preprint arXiv:2601\.20975\.Cited by:[§1](https://arxiv.org/html/2606.13120#S1.p1.1),[§1](https://arxiv.org/html/2606.13120#S1.p2.1),[§1](https://arxiv.org/html/2606.13120#S1.p3.1),[Table 1](https://arxiv.org/html/2606.13120#S2.T1.1.1.3.2.1),[§4](https://arxiv.org/html/2606.13120#S4.p1.1)\.
- M\. Joshi, E\. Choi, D\. Weld, and L\. Zettlemoyer \(2017\)TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension\.InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),R\. Barzilay and M\. Kan \(Eds\.\),Vancouver, Canada,pp\. 1601–1611\.External Links:[Link](https://aclanthology.org/P17-1147/),[Document](https://dx.doi.org/10.18653/v1/P17-1147)Cited by:[§4](https://arxiv.org/html/2606.13120#S4.p1.1)\.
- Kimi \(2026\)Kimi k2\.6: advancing open\-source coding\.External Links:[Link](https://www.kimi.com/blog/kimi-k2-6)Cited by:[Appendix E](https://arxiv.org/html/2606.13120#A5.p1.1),[§3\.1](https://arxiv.org/html/2606.13120#S3.SS1.p1.1)\.
- T\. Kwiatkowski, J\. Palomaki, O\. Redfield, M\. Collins, A\. Parikh, C\. Alberti, D\. Epstein, I\. Polosukhin, J\. Devlin, K\. Lee, K\. Toutanova, L\. Jones, M\. Kelcey, M\. Chang, A\. M\. Dai, J\. Uszkoreit, Q\. Le, and S\. Petrov \(2019\)Natural questions: a benchmark for question answering research\.Transactions of the Association for Computational Linguistics7,pp\. 452–466\.External Links:[Link](https://aclanthology.org/Q19-1026/),[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00276)Cited by:[§4](https://arxiv.org/html/2606.13120#S4.p1.1)\.
- K\. Li, Z\. Zhang, H\. Yin,et al\.\(2025\)Websailor: navigating super\-human reasoning for web agent\.arXiv preprint arXiv:2507\.02592\.Cited by:[§2\.1](https://arxiv.org/html/2606.13120#S2.SS1.SSS0.Px2.p3.15)\.
- A\. Liu, A\. Mei, B\. Lin, B\. Xue, B\. Wang, B\. Xu, B\. Wu, B\. Zhang, C\. Lin, C\. Dong,et al\.\(2025\)Deepseek\-v3\. 2: pushing the frontier of open large language models\.arXiv preprint arXiv:2512\.02556\.Cited by:[§2\.1](https://arxiv.org/html/2606.13120#S2.SS1.SSS0.Px1.p1.1)\.
- R\. Lu, Z\. Hou, Z\. Wang, H\. Zhang, X\. Liu, Y\. Li, S\. Feng, J\. Tang, and Y\. Dong \(2025\)Deepdive: advancing deep search agents with knowledge graphs and multi\-turn rl\.arXiv preprint arXiv:2509\.10446\.Cited by:[§2\.1](https://arxiv.org/html/2606.13120#S2.SS1.SSS0.Px2.p3.15),[§2\.1](https://arxiv.org/html/2606.13120#S2.SS1.SSS0.Px3.p1.2),[§2\.1](https://arxiv.org/html/2606.13120#S2.SS1.SSS0.Px4.p1.1)\.
- G\. Mialon, C\. Fourrier, T\. Wolf, Y\. LeCun, and T\. Scialom \(2024\)GAIA: a benchmark for general AI assistants\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=fibxvahvs3)Cited by:[§1](https://arxiv.org/html/2606.13120#S1.p1.1),[§1](https://arxiv.org/html/2606.13120#S1.p2.1),[§1](https://arxiv.org/html/2606.13120#S1.p3.1),[§4](https://arxiv.org/html/2606.13120#S4.p1.1)\.
- OpenAI Team \(2025\)OpenAI o3 and o4\-mini system card\.External Links:[Link](https://cdn.openai.com/o3-mini-system-card-feb10.pdf)Cited by:[Appendix E](https://arxiv.org/html/2606.13120#A5.p1.1)\.
- S\. G\. Patil, H\. Mao, F\. Yan, C\. C\. Ji, V\. Suresh, I\. Stoica, and J\. E\. Gonzalez \(2025\)The berkeley function calling leaderboard \(BFCL\): from tool use to agentic evaluation of large language models\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=2GmDdhBdDk)Cited by:[§1](https://arxiv.org/html/2606.13120#S1.p1.1)\.
- T\. Pham, N\. Nguyen, P\. Zunjare, W\. Chen, Y\. Tseng, and T\. Vu \(2025\)SealQA: raising the bar for reasoning in search\-augmented language models\.arXiv preprint arXiv:2506\.01062\.Cited by:[§1](https://arxiv.org/html/2606.13120#S1.p1.1),[§1](https://arxiv.org/html/2606.13120#S1.p3.1),[§4](https://arxiv.org/html/2606.13120#S4.p1.1)\.
- Qwen \(2026\)Qwen3\.5: towards native multimodal agents\.External Links:[Link](https://qwen.ai/blog?id=qwen3.5)Cited by:[§3\.1](https://arxiv.org/html/2606.13120#S3.SS1.p1.1),[footnote 19](https://arxiv.org/html/2606.13120#footnote19)\.
- Z\. Tao, J\. Wu, W\. Yin,et al\.\(2025\)Webshaper: agentically data synthesizing via information\-seeking formalization\.arXiv preprint arXiv:2507\.15061\.Cited by:[§2\.1](https://arxiv.org/html/2606.13120#S2.SS1.SSS0.Px4.p1.1)\.
- T\. D\. Team, B\. Li, B\. Zhang, D\. Zhang, F\. Huang, G\. Li, G\. Chen, H\. Yin, J\. Wu, J\. Zhou,et al\.\(2025\)Tongyi deepresearch technical report\.arXiv preprint arXiv:2510\.24701\.Cited by:[§2\.4](https://arxiv.org/html/2606.13120#S2.SS4.p1.1)\.
- J\. Wei, Z\. Sun, S\. Papay, S\. McKinney, J\. Han, I\. Fulford, H\. W\. Chung, A\. T\. Passos, W\. Fedus, and A\. Glaese \(2025\)Browsecomp: a simple yet challenging benchmark for browsing agents\.arXiv preprint arXiv:2504\.12516\.Cited by:[§1](https://arxiv.org/html/2606.13120#S1.p1.1),[§1](https://arxiv.org/html/2606.13120#S1.p2.1),[§1](https://arxiv.org/html/2606.13120#S1.p3.1),[§2\.4](https://arxiv.org/html/2606.13120#S2.SS4.p1.1),[Table 1](https://arxiv.org/html/2606.13120#S2.T1.1.1.2.1.1),[§3\.2](https://arxiv.org/html/2606.13120#S3.SS2.p3.1),[§4](https://arxiv.org/html/2606.13120#S4.p1.1)\.
- J\. Wu, W\. Yin, Y\. Jiang, Z\. Wang, Z\. Xi, R\. Fang, L\. Zhang, Y\. He, D\. Zhou, P\. Xie, and F\. Huang \(2025\)WebWalker: benchmarking LLMs in web traversal\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 10290–10305\.External Links:[Link](https://aclanthology.org/2025.acl-long.508/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.508),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2606.13120#S1.p1.1),[Table 1](https://arxiv.org/html/2606.13120#S2.T1.1.1.6.5.1),[§4](https://arxiv.org/html/2606.13120#S4.p1.1)\.
- Xbench\-Team \(2025\)Xbench\-deepsearch\.External Links:[Link](https://xbench.org/agi/aisearch)Cited by:[§2\.2](https://arxiv.org/html/2606.13120#S2.SS2.SSS0.Px2.p1.1),[§2\.4](https://arxiv.org/html/2606.13120#S2.SS4.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§3\.1](https://arxiv.org/html/2606.13120#S3.SS1.p1.1)\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\)HotpotQA: a dataset for diverse, explainable multi\-hop question answering\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),Brussels, Belgium,pp\. 2369–2380\.External Links:[Link](https://aclanthology.org/D18-1259/),[Document](https://dx.doi.org/10.18653/v1/D18-1259)Cited by:[§4](https://arxiv.org/html/2606.13120#S4.p1.1)\.
- J\. H\. Zar \(2005\)Spearman rank correlation\.Encyclopedia of biostatistics7\.Cited by:[Appendix E](https://arxiv.org/html/2606.13120#A5.p1.1)\.
- A\. Zeng, X\. Lv, Z\. Hou, Z\. Du, Q\. Zheng, B\. Chen, D\. Yin, C\. Ge, C\. Huang, C\. Xie,et al\.\(2026\)Glm\-5: from vibe coding to agentic engineering\.arXiv preprint arXiv:2602\.15763\.Cited by:[Appendix E](https://arxiv.org/html/2606.13120#A5.p1.1),[§2\.4](https://arxiv.org/html/2606.13120#S2.SS4.p1.1),[§3\.1](https://arxiv.org/html/2606.13120#S3.SS1.p1.1),[footnote 19](https://arxiv.org/html/2606.13120#footnote19)\.
- P\. Zhou, B\. Leon, X\. Ying, C\. Zhang, Y\. Shao, Q\. Ye, D\. Chong, Z\. Jin, C\. Xie, M\. Cao,et al\.\(2025\)Browsecomp\-zh: benchmarking web browsing ability of large language models in chinese\.arXiv preprint arXiv:2504\.19314\.Cited by:[§1](https://arxiv.org/html/2606.13120#S1.p1.1),[§1](https://arxiv.org/html/2606.13120#S1.p2.1),[§1](https://arxiv.org/html/2606.13120#S1.p3.1),[§2\.4](https://arxiv.org/html/2606.13120#S2.SS4.p1.1),[Table 1](https://arxiv.org/html/2606.13120#S2.T1.1.1.5.4.1),[§3\.2](https://arxiv.org/html/2606.13120#S3.SS2.p3.1),[§4](https://arxiv.org/html/2606.13120#S4.p1.1)\.

## Appendix AConstruction Details

### A\.1Examples of Seed Entity Construction

**Prompt for Seed Entity Construction**

You are an expert in seed entity generation\. Given a “primary domain : sub\-domain” pair, your task is to aggregate entities that have recently surfaced \(in 2026\) under the specified sub\-domain, as reported by authoritative news outlets or official websites\.

**Input**

- Primary domain: \{primary\_domain\}
- Sub\-domain: \{secondary\_domain\}
- Number of seeds to generate: \{num\_seeds\}

**Tool**

- search\(query\): performs a live web search and returns a list of webpage snippets\.

**Requirements**

1. Freshness: Each entity must have either newly emerged in 2026 or undergone a significant attribute change in 2026 \(e\.g\., newly founded organizations, newly released products, newly appointed individuals, newly occurred events\)\.
2. Source reliability: Each entity must be reported by authoritative news sources \(e\.g\., Reuters, BBC, Xinhua\) or official websites \(e\.g\., \.gov sites, official homepages, reputable encyclopedias\)\. Avoid personal blogs, self\-media, and content farms\.
3. Diversity: Cover a broad range of entity types, including people, organizations, events, products, concepts, and locations\.
4. Multi\-round search: Issue multiple search\(\) calls with diversified queries to comprehensively cover the different facets of the sub\-domain\.

**Output Format**

```
{
  "domain": { "primary": "<primary domain>", "secondary": "<sub-domain>" },
  "total_count": {num_seeds},
  "seeds": [
    { "id": 1, "entity_name": "<entity name>", "is_fresh": true }
  ]
}
```

Now begin: first plan your search queries, then perform multiple rounds of search\(\), and finally output \{num\_seeds\} seed entities\.

### A\.2Examples of the QA Synthesis Agent

**Prompt for Evidence Collection**

Role: Knowledge Collection Agent

Starting from the given seed, perform multiple rounds of search and visit to collect structured evidence that will support downstream QA construction\.

**Definition of Fresh Knowledge**

Events, relations, attributes updates that occurred on or after January 1, 2026\.

**Tools**

- search\(query\): live web search\.
- visit\(url\): fetch the full content of a specific page\.

**Strategy**

1. Iterative exploration: Starting from the seed, iteratively invoke search/visit, following hyperlinks to expand the entity network\.
2. Prioritize fresh knowledge: Augment queries with time\-aware keywords such as “2026”, “latest”, or “current”\.
3. Mandatory verification: Every piece of evidence must come from a page you have actually visited; do not record evidence based on search snippets alone\.
4. Chainability: Evidence items should share entities with one another so that they can later be linked into evidence chains\.

**Output Format**

\[\{"evidence\_id": "e\_001", "head\_entity": "Entity A", "relation": "relation or attribute description", "tail\_entity": "Entity B or attribute value", "evidence\_text": "verbatim snippet from the source page", "source\_url": "URL actually visited", "is\_fresh\_knowledge": true\}\]

**Quantity Requirement**

Collect at least \{num\_evidence\} evidence items, with no less than 50% marked as fresh knowledge\.

<seed\>\{seed\}</seed\>
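The quantity and chainability requirements of the evidence\-collection prompt can be checked mechanically\. The sketch below is one possible reading, not the authors’ implementation: `meets_quota` checks the item count and the 50% fresh\-knowledge share, and `is_chainable` interprets “share entities with one another” as requiring that all items be connected through shared head or tail entities\. Both function names and the toy evidence are hypothetical\.

```python
from collections import defaultdict
from typing import Dict, List

Evidence = Dict[str, object]


def meets_quota(evidence: List[Evidence], num_evidence: int) -> bool:
    """At least `num_evidence` items, with no less than 50% marked fresh."""
    fresh = sum(1 for item in evidence if item["is_fresh_knowledge"])
    return len(evidence) >= num_evidence and fresh * 2 >= len(evidence)


def is_chainable(evidence: List[Evidence]) -> bool:
    """True if every item is reachable from the first via shared entities.

    Assumption: chainability is read as connectivity of the graph whose
    nodes are evidence items and whose edges link items sharing an entity.
    """
    if not evidence:
        return False
    by_entity = defaultdict(list)
    for idx, item in enumerate(evidence):
        by_entity[item["head_entity"]].append(idx)
        by_entity[item["tail_entity"]].append(idx)
    seen = {0}
    stack = [0]
    while stack:
        current = evidence[stack.pop()]
        for entity in (current["head_entity"], current["tail_entity"]):
            for neighbor in by_entity[entity]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return len(seen) == len(evidence)


# Toy evidence with placeholder entities (illustrative only).
chain = [
    {"head_entity": "A", "tail_entity": "B", "is_fresh_knowledge": True},
    {"head_entity": "B", "tail_entity": "C", "is_fresh_knowledge": True},
    {"head_entity": "C", "tail_entity": "D", "is_fresh_knowledge": False},
]
isolated = {"head_entity": "X", "tail_entity": "Y", "is_fresh_knowledge": False}

print(meets_quota(chain, 3), is_chainable(chain))
print(is_chainable(chain + [isolated]))
```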

**Prompt for QA Construction**

Role: QA Construction Agent

Based on the provided evidence list, construct \{num\} reasoning QA pairs\.

<evidence\_list\> \{evidence\_list\} </evidence\_list\>

**Hard Requirements**

1\. Reasoning Chain

- At least 5 hops, with each hop introducing a fresh entity or relation \(no looping in place\)\.
- The final hop must be fresh knowledge\. \(is\_fresh\_knowledge = true\)

2\. Answer

The answer must fall into one of the following types to ensure uniqueness and evaluability:

- Deterministic attributes \(time, number, or proper nouns with unique referents\)\.
- The intersection, difference, or ranking result derived from multiple entities along the chain\.

3\. Obfuscation \(Core\)

Goal: prevent the model from identifying entities via memorization or a single search; the full reasoning chain must be traversed\.

- Non\-fresh knowledge: use descriptions whose individual words are common but whose combination points to a rare attribute\.
- Fresh knowledge: anchor the description on a non\-core but retrievable attribute, and phrase it with moderate vagueness\.
- Time / location: replace explicit values with event anchors or rare attribute combinations\.
- The answer entity must not be directly or indirectly hinted at within the question\.

4\. Anti\-Shortcut

The final interrogative clause must be tightly bound to the last unresolved entity in the reasoning chain, and must not pivot to a public event or globally known attribute\.

**Output Format**

\[\{"id": "qa\_001", "question": "obfuscated question", "answer": "final answer", "hop\_count": 5, "knowledge\_order": "nonfresh\-nonfresh\-fresh\-fresh\-fresh", "reasoning\_chain": \[\{"hop": 1, "evidence\_id": "e\_xxx", "entity": "\.\.\.", "relation": "\.\.\.", "is\_fresh\_knowledge": false\}\]\}\]

Output the JSON array directly\.
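The structural part of the hard requirements \(hop count, fresh final hop, no looping\) lends itself to an automatic check on the output JSON\. The following sketch assumes the output schema shown in the prompt; the function name `check_reasoning_chain`, the use of a repeated \(entity, relation\) pair as a proxy for “looping in place”, and the consistency check on `knowledge_order` are illustrative assumptions\. The obfuscation and anti\-shortcut requirements are semantic and are not covered\.

```python
from typing import Dict, List


def check_reasoning_chain(qa: Dict, min_hops: int = 5) -> List[str]:
    """Return a list of violated structural requirements (empty if none)."""
    chain = qa["reasoning_chain"]
    problems = []
    if len(chain) < min_hops:
        problems.append(f"fewer than {min_hops} hops")
    if qa["hop_count"] != len(chain):
        problems.append("hop_count does not match the reasoning chain")
    if chain and not chain[-1]["is_fresh_knowledge"]:
        problems.append("final hop is not fresh knowledge")
    # Assumption: a repeated (entity, relation) pair signals looping in place.
    seen = set()
    for hop in chain:
        key = (hop["entity"], hop["relation"])
        if key in seen:
            problems.append(f"hop {hop['hop']} repeats an earlier hop")
        seen.add(key)
    expected_order = "-".join(
        "fresh" if hop["is_fresh_knowledge"] else "nonfresh" for hop in chain
    )
    if qa["knowledge_order"] != expected_order:
        problems.append("knowledge_order does not match the hop flags")
    return problems


# Placeholder QA pair following the schema in the prompt (illustrative only).
flags = [False, False, True, True, True]
good_qa = {
    "id": "qa_001",
    "question": "obfuscated question",
    "answer": "final answer",
    "hop_count": 5,
    "knowledge_order": "nonfresh-nonfresh-fresh-fresh-fresh",
    "reasoning_chain": [
        {"hop": i + 1, "evidence_id": f"e_{i + 1:03d}", "entity": f"E{i + 1}",
         "relation": f"r{i + 1}", "is_fresh_knowledge": fresh}
        for i, fresh in enumerate(flags)
    ],
}
bad_qa = dict(good_qa, reasoning_chain=good_qa["reasoning_chain"][:2])

print(check_reasoning_chain(good_qa))
print(check_reasoning_chain(bad_qa))
```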

### A\.3Examples of Information Filtering Agent

**Prompt for Fresh Knowledge Reliability Assessment**

Role: Knowledge Reliability Verification Assistant

Use the search and visit tools to cross\-validate the given evidence list and assess its reliability\.

**Evidence List to Verify**

Each item is a triple \(head\_entity, relation, tail\_entity\) supported by evidence\_text from source\_url, and flagged as fresh knowledge via is\_fresh\_knowledge\.

\{evidence\_list\}

**Tools**

- search\(query\): live web search\.
- visit\(url\): fetch the full content of a specific page\.

**Verification Procedure**

1. For each evidence item, extract the key fact represented by the triple \(head\_entity, relation, tail\_entity\)\.
2. Issue search queries to retrieve relevant results, and visit authoritative sources \(official websites, mainstream media, reputable encyclopedias\) when further confirmation is needed\. The provided source\_url may be visited as one reference, but must not be counted as an independent corroboration\.
3. Cross\-validation: for each triple, find at least two independent sources that corroborate it\.
4. Assign an overall reliability label for the entire evidence list:
   - credible: every triple is consistently supported by multiple independent and trustworthy sources\.
   - not credible: at least one triple is contradicted by sources, or supported only by untrustworthy sources\.
   - unclear: available information is insufficient to make a definitive judgment on one or more triples\.

**Output Format**

Output only the following JSON\. Do not add any additional text or code\-block markers:

\{"reliability": "credible \| not credible \| unclear"\}
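The labeling rule in step 4 of the prompt can be written as a small decision function\. In the paper this judgment is made by an LLM agent, so the sketch below only illustrates the rule itself\. It assumes that each triple has already been summarized as counts of trustworthy and untrustworthy corroborating sources plus a contradiction flag, and that “not credible” takes precedence over “unclear” when both could apply; the prompt does not state this precedence\. `TripleCheck` and `overall_reliability` are hypothetical names\.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TripleCheck:
    trusted_sources: int    # independent trustworthy sources, excluding source_url
    untrusted_sources: int  # corroborating sources judged untrustworthy
    contradicted: bool      # True if any source contradicts the triple


def overall_reliability(checks: Sequence[TripleCheck], min_sources: int = 2) -> str:
    """Map per-triple checks to one label for the whole evidence list."""
    if not checks:
        return "unclear"
    for check in checks:
        only_untrusted = check.trusted_sources == 0 and check.untrusted_sources > 0
        if check.contradicted or only_untrusted:
            return "not credible"  # assumption: takes precedence over "unclear"
    if all(check.trusted_sources >= min_sources for check in checks):
        return "credible"
    return "unclear"


print(overall_reliability([TripleCheck(2, 0, False), TripleCheck(3, 1, False)]))
print(overall_reliability([TripleCheck(2, 0, False), TripleCheck(1, 0, False)]))
print(overall_reliability([TripleCheck(2, 0, False), TripleCheck(0, 2, False)]))
```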

**Prompt for Non\-fresh Knowledge Popularity Assessment**

Role: Knowledge Assessment Assistant

Based solely on your internal knowledge, judge whether the given evidence list is too popular or overly covered\.

**Evidence List to Assess**

Each item is a triple \(head\_entity, relation, tail\_entity\) supported by evidence\_text from source\_url, and flagged as non\-fresh knowledge via is\_fresh\_knowledge\.

\{evidence\_list\}

**Criteria**

- popular: the entities or facts described by the triples have prominent, widely known features, allowing you to identify the referents directly and unambiguously\.
- non\-popular: the descriptions are relatively obscure or lack distinctive features, requiring additional clues to identify the referents\.

**Instructions**

- Assess every triple in the list, then assign one overall label:
  - Label as popular only if the majority of triples clearly describe widely known entities or facts\.
  - Otherwise, label as non\-popular\.
- If the content lies beyond your knowledge scope \(e\.g\., involves future information or entirely unfamiliar concepts\), label it as non\-popular\.

**Output Format**

Output only the following JSON\. Do not add any additional text or code\-block markers:

\{"popularity": "popular \| non\-popular"\}
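The aggregation rule in the instructions amounts to a majority vote over per\-triple judgments\. The sketch below assumes per\-triple labels are available as strings and reads “majority” as a strict majority, so that ties and empty lists fall back to non\-popular; the function name `overall_popularity` and the extra `"unknown"` label for out\-of\-scope content are assumptions\.

```python
from typing import Sequence


def overall_popularity(triple_labels: Sequence[str]) -> str:
    """Return "popular" only if a strict majority of triples is popular.

    Any label other than "popular" (e.g. "non-popular" or "unknown")
    counts against popularity, mirroring the fallback in the prompt.
    """
    popular = sum(1 for label in triple_labels if label == "popular")
    return "popular" if popular * 2 > len(triple_labels) else "non-popular"


print(overall_popularity(["popular", "popular", "non-popular"]))
print(overall_popularity(["popular", "unknown"]))
```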

### A\.4Examples of High\-level Guidance Agent

**Prompt for Question Graph Parsing**

Role: Question Graph Parsing Expert

Parse the given \(question, answer, reasoning\_chain\) into a strict Question Graph\.

**Graph Definition**

- Node: a set of entities or attribute values\.
- Edge: one of exactly three operations\.
  - projection \($\pi_r$\): traverse one step along relation $r$, either forward or backward\.
  - intersection \($\cap$\): take the intersection of multiple entity sets\.
  - complement \($\neg$\): take the complement of an entity set\.

**Core Principles**

1. Faithful to the literal semantics of the question: the reasoning\_chain serves only as a reference for entity and relation names; do not use it to forcibly connect subgraphs that have no logical dependency in the question itself\.
2. Local connectivity: even when an isolated subgraph functions as mere “background filler,” its internal edges must still be built according to the literal logic expressed in the question\.

**Node Fields**

- node\_id: starts from n0\.
- label: semantic description of the node\.
- known\_entities: entities explicitly given in the question; otherwise null\.
- reference\_entities: real entity names not mentioned in the question but provided by the reasoning\_chain; otherwise null\.
- is\_root: whether this node is the answer node\.

**Input**

- question: \{question\}
- answer: \{answer\}
- reasoning\_chain: \{reasoning\_chain\}

**Output Format**

Output only the following JSON, with no additional text:

\{"question": "\.\.\.", "answer": "\.\.\.", "root\_node\_id": "n\_x", "nodes": \[\{"node\_id": "n0", "label": "\.\.\.", "known\_entities": \[\.\.\.\] or null, "reference\_entities": \[\.\.\.\] or null, "is\_root": false\}\], "edges": \[\{"edge\_id": "e0", "source\_ids": \["n0"\], "target\_id": "n1", "op": "projection \| intersection \| complement", "relation": "\.\.\.", "direction": "forward \| backward"\}\]\}
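To make the three edge operations concrete, the sketch below evaluates them as set operations over a toy knowledge base of \(head, relation, tail\) triples\. The triples, entity names, and the example question are entirely made up, and the function signatures are assumptions rather than part of the framework; in particular, the complement is taken relative to an explicitly supplied universe of candidate entities\.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def projection(entities: Set[str], relation: str, triples: Set[Triple],
               direction: str = "forward") -> Set[str]:
    """Traverse one step along `relation`, forward (head -> tail) or backward."""
    if direction == "forward":
        return {t for (h, r, t) in triples if r == relation and h in entities}
    if direction == "backward":
        return {h for (h, r, t) in triples if r == relation and t in entities}
    raise ValueError(f"unknown direction: {direction}")


def intersection(*entity_sets: Set[str]) -> Set[str]:
    """Entities that satisfy all conditions at once."""
    return set.intersection(*entity_sets)


def complement(entities: Set[str], universe: Set[str]) -> Set[str]:
    """Entities of the universe that are not in the given set."""
    return universe - entities


# Toy knowledge base (illustrative only).
triples = {
    ("lab_a", "founded_by", "alice"),
    ("lab_b", "founded_by", "bob"),
    ("alice", "born_in", "city_x"),
    ("bob", "born_in", "city_y"),
    ("lab_a", "located_in", "city_y"),
    ("lab_b", "located_in", "city_y"),
}
people = {"alice", "bob"}

# "Which lab located in city_y was founded by someone NOT born in city_y?"
labs_in_y = projection({"city_y"}, "located_in", triples, "backward")
born_in_y = projection({"city_y"}, "born_in", triples, "backward")
not_born_in_y = complement(born_in_y, people)
labs_by_outsiders = projection(not_born_in_y, "founded_by", triples, "backward")
answer = intersection(labs_in_y, labs_by_outsiders)
print(answer)
```

In this toy question, each intermediate set corresponds to one node of a Question Graph and each function call to one edge, with the final intersection producing the root \(answer\) node\.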

**Prompt for QA Synthesis Direction**

Role: QA Synthesis Direction

Given the parsed Question Graph, propose one concrete expansion suggestion that makes the question more complex and adversarial\.

**Input**

- question: \{question\}
- answer: \{answer\}
- graph\_json: \{graph\_json\}

**Allowed Operations** \(only the following three edge types may be added\)

- projection: traverse one step along a relation\.
- intersection: add an additional condition that must be jointly satisfied\.
- complement: add an exclusionary constraint\.

**Candidate Strategies**

- Strategy A \(Complement\): attach a “never / excluding …” constraint to a node, producing a hard negative that retrieval systems struggle to handle\.
- Strategy B \(Inverse Projection\): flip a forward projection into its inverse \(effect $\rightarrow$ cause\), which is inherently harder to reason about than forward inference\.
- Strategy C \(Intersection Expansion\): add an intersection branch on an intermediate node\. Placing it on an early node \(left\-deep tree\) or on a node close to the answer \(right\-deep tree\) modulates how the reasoning difficulty is distributed along the path\.

**Output Format**

Output only the following JSON, with no additional text:

\{"expansion\_advice": \{"target\_text\_span": "the exact text span of the target entity copied verbatim from the original question", "suggested\_operation": "complement \| projection \| intersection", "selected\_strategy": "Complement \| Inverse Projection \| Intersection Expansion", "reasoning": "why this operation and strategy are chosen, and how they increase the graph’s complexity and reasoning difficulty", "semantic\_suggestion": "a natural\-language description of the concrete modification to apply"\}\}

## Appendix B Low-Quality QA Pair Filtering

**Prompt for QA Data Quality Filtering**

Role: QA Data Quality Filtering

Inspect the given QA pair and output only “pass” or “fail”.

**Input**

- question: {question}
- answer: {answer}

**Inspection Criteria** (failing any single item results in “fail”)

1. Completeness
   - Both question and answer are non-empty, non-whitespace, and free of truncation or garbled text.
2. Question Language Quality
   - Fluent and natural, consistent with how a human would naturally ask, with no grammatical errors or signs of machine-generated patchwork.
   - Sentence structure is clear, avoiding excessive nested modifiers.
   - Semantically unique and unambiguous, with no unclear references, vague scope, or multiple valid interpretations.
   - Forms a complete interrogative sentence with an explicit target of inquiry.
3. Answer Quality
   - Concise and definite, easy to evaluate automatically (e.g., a specific entity, number, or date).
   - Semantically unique, with no equally valid alternative answers.
   - Free of ambiguous expressions or vague qualifiers.

**Output Format**

Output only the following JSON, with no additional text or code-block markers:

    {
      "thinking": "brief reasoning",
      "result": "pass | fail"
    }
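As an illustration of how the filter's verdict might be consumed, the sketch below parses the JSON reply and combines it with a cheap local check for the non-empty part of the Completeness criterion. The function names are hypothetical, and treating any value other than "pass" or "fail" as an error is an assumption rather than something the paper specifies.

```python
import json


def is_well_formed(question: str, answer: str) -> bool:
    """Local pre-check: both fields must be non-empty and not only whitespace."""
    return bool(question.strip()) and bool(answer.strip())


def parse_filter_verdict(raw_reply: str) -> bool:
    """Return True for "pass" and False for "fail"; raise on any other value."""
    verdict = json.loads(raw_reply)
    result = str(verdict.get("result", "")).strip().lower()
    if result not in {"pass", "fail"}:
        raise ValueError(f"unexpected result: {result!r}")
    return result == "pass"


def keep_pair(question: str, answer: str, raw_reply: str) -> bool:
    """Keep a QA pair only if it is well formed and the filter says "pass"."""
    return is_well_formed(question, answer) and parse_filter_verdict(raw_reply)
```

Running the local check first avoids parsing a model reply for pairs that are trivially incomplete.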

## Appendix C Human Analyses on Data Quality

Two data experts from the author team, both proficient in English and Chinese, carry out the human analyses. For (1), whether each $\epsilon_i \in \mathcal{E}$ is correct, each expert verifies whether $\epsilon_i$ is consistent with, or can be inferred from, its source web pages. For (2), whether each synthesized question is consistent with its corresponding $\mathcal{E}$ and is unambiguous, and (3), whether each answer can be inferred from the corresponding $\mathcal{E}$, the experts likewise give their judgments independently. When the two experts disagree on a specific item, the final decision is reached through discussion.

## Appendix D Prompt of LLM-as-a-Judge

**Prompt for LLM-as-a-Judge**

You are a general artificial intelligence assistant. Based on the \[Correct Answer\] provided below, judge whether the \[Response\] to the \[Original Question\] below is correct.

- \[Original Question\]: {question}
- \[Correct Answer\]: {correct_answer}
- \[Response\]: {response}

Your judgment must follow the following format and standards:

Final Answer: The final accurate answer extracted from the \[Response\]. If there is no clear final answer in the \[Response\], fill in 'None'.

Explanation: Explain why the \[Final Answer\] is correct or incorrect based on the \[Correct Answer\]. Only focus on whether there is a substantial difference between the \[Final Answer\] and the \[Correct Answer\]; do not comment on the background of the question, do not attempt to re-solve the problem, do not defend any answer different from the \[Correct Answer\], and only focus on judging whether the answers are consistent.

Conclusion: If the \[Final Answer\] is consistent with the \[Correct Answer\] provided above, or within an acceptable small error range for numerical questions, fill in 'Correct'; otherwise (i.e., in case of any inconsistency, ambiguity, inequivalence, or incorrect extracted answer), fill in 'Incorrect'.
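Because the judge is asked to end with a "Conclusion" field, its free-text output can be reduced to a binary label with a small parser. The sketch below is one possible way to do this and is not taken from the paper; the function names are hypothetical, and counting an output with no recognizable conclusion as incorrect is an assumption.

```python
import re

# Matches e.g. "Conclusion: Correct", "Conclusion: 'Incorrect'", "**Conclusion**: correct".
CONCLUSION_RE = re.compile(r"Conclusion\W*(Correct|Incorrect)\b", re.IGNORECASE)


def parse_judge_conclusion(judge_output: str) -> bool:
    """Return True only if the last 'Conclusion' field says Correct."""
    matches = CONCLUSION_RE.findall(judge_output)
    if not matches:
        return False  # assumption: an unparseable judgment counts as incorrect
    return matches[-1].lower() == "correct"


def accuracy(judge_outputs: list[str]) -> float:
    """Fraction of judge outputs whose conclusion is Correct."""
    if not judge_outputs:
        raise ValueError("no judge outputs given")
    return sum(parse_judge_conclusion(o) for o in judge_outputs) / len(judge_outputs)
```

Taking the last match guards against the word "conclusion" appearing earlier inside the explanation.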

Table 5: The Spearman correlation between the LLM judge and the human judge.

## Appendix E Selection of the Judge Model

To select a reliable judge model for evaluation, we randomly sample 800 model predictions from the main experiments (Section [3.1](https://arxiv.org/html/2606.13120#S3.SS1)) and use each of the following cutting-edge LLMs as the judge: GPT-4.1 OpenAI Team ([2025](https://arxiv.org/html/2606.13120#bib.bib39)), DeepSeek-V4-Flash-Chat, DeepSeek-V4-Flash-Max DeepSeek-AI ([2026](https://arxiv.org/html/2606.13120#bib.bib18)), DeepSeek-V3.2-Chat, DeepSeek-V3.2-Think DeepSeek-AI ([2025](https://arxiv.org/html/2606.13120#bib.bib20)), Kimi-K2.6-Chat, Kimi-K2.6-Think Kimi ([2026](https://arxiv.org/html/2606.13120#bib.bib19)), GLM-5-Chat, and GLM-5-Think Zeng et al. ([2026](https://arxiv.org/html/2606.13120#bib.bib22)). To measure how well each judge agrees with humans, we also manually annotate whether each model prediction is correct. As in the human analyses (Appendix [C](https://arxiv.org/html/2606.13120#A3)), two data experts annotate independently and then resolve any discrepancies through discussion. As shown in Table [5](https://arxiv.org/html/2606.13120#A4.T5), GLM-5-Chat achieves the highest Spearman correlation Zar ([2005](https://arxiv.org/html/2606.13120#bib.bib37)) with human judgments.
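For readers who want to reproduce this kind of comparison, the sketch below computes the Spearman correlation between binary human labels and binary judge labels as the Pearson correlation of tie-averaged ranks. It uses only the standard library; the label vectors are made-up toy data, not results from the paper, and the paper does not state which implementation the authors used.

```python
from statistics import mean


def average_ranks(values):
    """Rank values starting from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Toy labels (1 = correct, 0 = incorrect); illustrative only.
human = [1, 1, 0, 0, 1, 0, 1, 0]
judge = [1, 1, 0, 1, 1, 0, 1, 0]
print(round(spearman(human, judge), 3))  # 0.775
```

With binary labels the ranks are an affine transform of the labels, so this value coincides with the phi coefficient of the 2x2 agreement table.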
