DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks
Summary
DailyReport is an open-ended benchmark for evaluating search agents on daily search tasks, featuring 150 tasks and 3,546 rubrics for interpretable, user-centric evaluation.
# DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks
Source: [https://arxiv.org/html/2606.12871](https://arxiv.org/html/2606.12871)
Jingxuan Han¹,∗,‡, Wei Liu²,∗, Mingyang Zhu²,∗, Youpeng Wang¹,‡, Ziwen Wang², Lin Qiu²,†, Xuezhi Cao², Xunliang Cai², Zheren Fu¹, Licheng Zhang¹, Zhendong Mao¹,§

¹University of Science and Technology of China, ²Meituan

{hjx999222, wyp220517}@mail.ustc.edu.cn, {liuwei304, zhumingyang09}@meituan.com

∗Equal contribution. †Project leader. §Corresponding author. ‡Work was done during their internship.
###### Abstract
Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing information into comprehensive responses. For SA evaluation, prior benchmarks mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits evaluation interpretability. To bridge this gap, we introduce DailyReport, an open-ended benchmark to evaluate SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users’ expectations. To facilitate future research, our dataset and code are made publicly available at [https://github.com/AGI-Eval-Official/DailyReport](https://github.com/AGI-Eval-Official/DailyReport).
## 1 Introduction
With the rise of open-domain web agents, information seeking is moving from traditional keyword retrieval to agentic research. Search Agents (SAs) have therefore emerged to address users’ information needs through extensive web exploration and long-horizon reasoning Huang et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib26)). These agents can explore hundreds of web sources and synthesize heterogeneous information into comprehensive responses Wang et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib1)). As these agentic systems become increasingly capable, it is essential to evaluate their ability to conduct large-scale information gathering and reasoning.
Recently, several benchmarks have been introduced to evaluate SAs Fan et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib6)). For task construction, most works Wei et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib2)); Du et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib5)); Abaskohi et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib8)); Sharma et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib9)) rely on domain experts to construct specialized research tasks. These tasks mainly assess agents on overprocessed or professional questions within specific fields, which are unlikely to arise in real-world scenarios. Moreover, their static design fails to capture evolving real-world information needs and raises concerns about potential data contamination. For evaluation, existing studies Xu et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib7)); Li et al. ([2026](https://arxiv.org/html/2606.12871#bib.bib4)) generally define task-level rubrics over coarse-grained dimensions and aggregate their scores linearly. This often undermines evaluation interpretability and fails to quantify performance from the user perspective.
Figure 1: DailyReport structure. We construct daily search tasks and cascade rubrics for evaluating search agents.

In this work, we propose DailyReport, an open-ended benchmark to evaluate SAs on daily search tasks. Unlike previous benchmarks centered on specialized domain problems, DailyReport primarily evaluates whether agents can reliably satisfy everyday users’ timely and practical information needs. It derives its tasks from trending topics and user comments on popular platforms (e.g., Weibo, Facebook), capturing widely discussed information needs from authentic daily user contexts. DailyReport comprises 150 tasks across two types and 3,546 associated rubrics. These tasks span 10 high-level domains and 35 fine-grained categories, reflecting broad user interests through a multi-level taxonomy. Built on time-sensitive trending topics, DailyReport also supports continuous updates to reflect evolving user needs in real-world scenarios.
We develop a user-centric cascade evaluation pipeline for SAs on these tasks. Consider an authentic user query such as "List the Chinese universities in the 2026 QS Top 100 rankings, and analyze their respective strengths and weaknesses." If the agent fails to correctly identify the universities, any subsequent analysis becomes meaningless to users. This suggests that rubrics should not be treated independently across dimensions, and different task components have hierarchical priorities from the user perspective. In our pipeline, we decompose each task into subtasks and design cascade rubrics along three disentangled dimensions. We first assess the subtask on the instruction following dimension, and then evaluate factuality and rationality accordingly. Finally, we apply cascade performance attribution to derive interpretable dimensional scores, and further incorporate subtask importance into user-centric performance aggregation to explicitly quantify user preference.
We evaluate 17 agentic systems across three groups using DailyReport. The results show that existing agents perform well in instruction following, but still struggle with factuality and rationality. Notably, their user preference scores remain particularly limited, revealing a clear gap between current SA outputs and users’ perceived expectations. We conduct detailed solving-trace analysis to help diagnose underlying failure patterns and provide valuable guidance for future SA advances.
The structure of DailyReport is shown in Figure [1](https://arxiv.org/html/2606.12871#S1.F1). In summary, our contributions are as follows:
- We propose DailyReport, a benchmark for evaluating SAs on daily search tasks. These tasks are grounded in real-world scenarios to reflect authentic user needs. Consisting of 150 tasks and 3,546 rubrics, DailyReport is supported by over 500 hours of human annotation.
- We introduce a user-centric cascade evaluation pipeline. It computes the subtask performance using cascade rubrics along disentangled dimensions, and then enables interpretable dimensional evaluation and explicit user preference quantification accordingly.
- We conduct a thorough empirical assessment of 17 frontier agentic systems across three groups. The results reveal key strengths and limitations of current search agents.
## 2 Related Work
### 2.1 Benchmarks for Search Agents
As SAs evolve, several benchmarks have emerged to evaluate their capabilities. The first group Chen et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib12)); Li et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib13)); Song et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib16)); Wu et al. ([2026](https://arxiv.org/html/2606.12871#bib.bib14)) targets fixed-answer tasks that assess information retrieval and multi-step reasoning. BrowseComp Wei et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib2)) serves as a foundational effort evaluating web browsing capabilities. WideSearch Wong et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib3)) focuses on wide-context information aggregation requiring the collection of large volumes of atomic facts. The second group Bigeard et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib17)); Lyu et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib18)); Huang et al. ([2026](https://arxiv.org/html/2606.12871#bib.bib15)) evaluates agents through comprehensive report generation. DeepResearch Bench Du et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib5)) proposes two complementary frameworks assessing report quality and retrieval ability, respectively. DeepResearch Bench II Li et al. ([2026](https://arxiv.org/html/2606.12871#bib.bib4)) collects expert-written investigative reports from reputable open-access venues and constructs research-style tasks following a similar domain distribution.
LiveResearchBench Wang et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib1)) attempts to align tasks with daily user demands, but remains largely U.S.-centric with limited regional coverage. As shown in Table [1](https://arxiv.org/html/2606.12871#S2.T1), compared with prior works, our benchmark adopts up-to-date daily search tasks that are aligned with real-world user demands. It employs cascade rubrics along disentangled dimensions, enabling interpretable performance attribution and user preference quantification for SA evaluation.

| Method | Open-Ended Task Formats | Daily User Demands | Up-to-date & Dynamic Evolving | Disentangled Eval Dimension | Cascade Eval Rubrics | Quantify User Preference |
| --- | --- | --- | --- | --- | --- | --- |
| BrowseComp | × | × | × | ✓ | × | × |
| WideSearch | × | × | × | ✓ | × | × |
| DeepResearch Bench | ✓ | × | × | × | × | × |
| DeepResearch Bench II | ✓ | × | × | × | × | × |
| LiveResearchBench | ✓ | ✓ | ✓ | × | × | × |
| ResearchRubrics | ✓ | × | × | × | × | × |
| DailyReport (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

Table 1: Comparison of representative benchmarks across task-oriented dimensions (first three columns) and evaluation-oriented dimensions (last three columns).
### 2.2 Search Agents
The remarkable progress of LLMs has accelerated the development of SAs Zhou et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib24)); Xi et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib25)), particularly Deep Research Agents (DRAs) for challenging report-generation tasks. LangChain’s Deep Researcher LearningCircuit ([2025](https://arxiv.org/html/2606.12871#bib.bib19)) performs multi-step web search and synthesizes information locally for response generation. DeepResearcher Zheng et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib21)) scales reinforcement learning with authentic web search interactions for agent training. Tongyi DeepResearch Team et al. ([2025](https://arxiv.org/html/2606.12871#bib.bib23)) combines agentic mid-training and post-training, enabling scalable reasoning and information seeking across complex tasks. Meanwhile, recent production-grade agents, including Gemini Google ([2025](https://arxiv.org/html/2606.12871#bib.bib29)), Grok xAI ([2025](https://arxiv.org/html/2606.12871#bib.bib31)), and Qwen Deep Research Team ([2025](https://arxiv.org/html/2606.12871#bib.bib30)), have shown the capability to perform multi-step web exploration and synthesize comprehensive research reports. Based on these works, DailyReport systematically analyzes the capabilities and limitations of current SAs to further advance this field.
## 3 DailyReport Benchmark

Figure 2: Detailed characteristics of daily search tasks in DailyReport. The benchmark comprises 150 expert-curated tasks with 3,546 detailed rubrics across 10 high-level domains and 35 fine-grained categories. It evaluates search agents in daily user scenarios and aligns closely with predominant real-world user demands.

### 3.1 Task Characteristic
Figure [2](https://arxiv.org/html/2606.12871#S3.F2) provides the detailed task characteristics of DailyReport. Compared with existing research, it has the following distinctive features:
**Tasks are rooted in real-world scenarios and better capture users’ daily search needs.** For example, the search task on QS rankings in Figure [2](https://arxiv.org/html/2606.12871#S3.F2) is derived from authentic trending topics during the admissions season. It directly reflects practical user interests in university selection and academic planning. In addition, these tasks are framed as broad queries that cover multiple related sub-questions for report generation, which better aligns with how typical users search for information in the real world.
**Tasks are grounded in up-to-date trending topics and continuously evolving.** As illustrated in Figure [2](https://arxiv.org/html/2606.12871#S3.F2), these tasks are consistently grounded in recent real-world events and are regularly updated. This requires agents to iteratively search for information on user-relevant trending topics, rather than relying solely on an LLM’s internal knowledge.
### 3.2 Task Construction
Task construction is primarily conducted by recruited human experts in three stages: (1) Trending Topic Collection, (2) Expert-crafted Task Formulation, and (3) Hybrid Task Annotation.
##### Trending Topic Collection
To root our tasks in real\-world scenarios, we primarily select the trending topics from major Western platforms \(e\.g\., Facebook, Reddit, and Twitter\) and Chinese platforms \(e\.g\., Weibo, Xiaohongshu, and Zhihu\)\. The collected topic information consists of trending event posts and corresponding user comments, ensuring diverse and regionally representative coverage of authentic user information demands\.
##### Expert\-crafted Task Formulation
We recruit human experts to formulate daily search tasks from each topic report and its user comments. This process yields 150 open-ended tasks that evaluate whether agents can reliably satisfy real users’ timely and practical information needs. We set the following requirements for task formulation:
- Principle: (1) Authenticity: Tasks must be realistic and reflect the genuine information needs of specific user demographics. (2) Clarity: Task descriptions strictly avoid ambiguous phrasing to ensure precise instructions. (3) Safety: Tasks are benign so that they are not rejected by safety mechanisms.
- Type: (1) 100 retrieval-centric tasks, which focus on retrieving and integrating objective information about specified entities, with only lightweight analysis. (2) 50 analysis-centric tasks, which focus on broader subjective topics and require SAs to autonomously identify relevant information for deeper analysis.
##### Hybrid Task Annotation
Considering the diversity of daily domains, annotators conduct hybrid task annotation. They first classify each task into one of 35 fine-grained categories and then consolidate these categories into 10 high-level domains. Fine-grained categories represent specific user interests (like education), while high-level domains represent broader fields (like Social Livelihood).
### 3.3 Rubric Generation
We decompose each task into subtasks and generate cascade rubrics for each subtask across disentangled dimensions\. This process combines LLM\-based generation with extensive human refinement, while also supporting full LLM\-based automation\.
##### Subtask Decomposition
We further categorize the constraints commonly emphasized by users into several groups. For instance, Scope Constraints dictate that the response must adhere to the boundaries defined in the requirement (e.g., temporal, spatial). Detailed definitions are provided in Appendix [A](https://arxiv.org/html/2606.12871#A1). Our subtasks are then formulated with different combinations of these constraints and generally follow three principles: (1) Atomicity: Subtasks must be atomic, and a single subtask typically corresponds to one constraint type (excluding Scope and Completeness constraints, which cannot exist independently). (2) Coverage: Subtask aggregation must cover every requirement of the original task. (3) Traceability: Subtasks must be strictly grounded in the original task to avoid hallucination.
##### Rubric Formulation
As shown in Table [2](https://arxiv.org/html/2606.12871#S3.T2), we define three disentangled dimensions to evaluate SAs on our daily search tasks. LLMs then formulate cascade rubrics across the three dimensions with human expert assistance for subtask assessment. Compared with traditional macro-level rubrics, our cascade rubrics support more interpretable performance attribution and allow subtask importance to be incorporated during aggregation.

| Dimensions | Description |
| --- | --- |
| Instruction Following | Evaluates the agent’s ability to accurately understand and fully execute user instructions. (Objective) |
| Factuality | Evaluates the agent’s ability to generate factually accurate content. The verification process requires external search tools. (Objective) |
| Rationality | Evaluates the ability to produce logically coherent reasoning and analysis. The process can be judged solely by cross-referencing the context. (Subjective) |

Table 2: Three disentangled evaluation dimensions, which are designed to be strictly orthogonal.
## 4 User-centric Cascade Evaluation
### 4.1 Rubric Assessment
We employ cascade rubrics across the three dimensions for subtask evaluation. Let $T_i$ denote the $i$-th subtask, and $\mathrm{Res} = \mathrm{SA}(T_i)$ denote the agent’s response to $T_i$, where $i \in [1, n]$ and $n$ is the total number of subtasks. The dimensional score $\mathrm{dim}_i$ is calculated as in Eq. [1](https://arxiv.org/html/2606.12871#S4.E1), where $\mathrm{dim} \in \{\mathrm{ins}, \mathrm{fac}, \mathrm{rat}\}$. For each dimension, the judge model $\operatorname{Judge}_{\mathrm{dim}}$ evaluates $\mathrm{Res}$ against the corresponding rubric $r_{\mathrm{dim}}(T_i)$. More judgment details are provided in Appendix [B](https://arxiv.org/html/2606.12871#A2). This produces three subtask scores, $\mathrm{ins}_i$, $\mathrm{fac}_i$, and $\mathrm{rat}_i$, which represent the dimensional performance of the $i$-th subtask.

$$\mathrm{dim}_i = \operatorname{Judge}_{\mathrm{dim}}\Big(T_i, \mathrm{Res}, r_{\mathrm{dim}}(T_i)\Big) \tag{1}$$
### 4.2 Cascade Performance Attribution
We derive an interpretable overall score for each dimension by accounting for subtask performance dependencies among the three dimensions.
##### Instruction Following
For a given SA system, we directly obtain its subtask performance $\mathrm{ins}_i$ on the instruction following dimension using Eq. [1](https://arxiv.org/html/2606.12871#S4.E1). The score $\mathrm{ins}_i \in \{0, 0.5, 1\}$ reflects whether $\mathrm{Res}$ fails to satisfy, partially satisfies, or fully satisfies the corresponding rubric. The overall score $\mathrm{Ins}$ is defined as the average performance across all subtasks, as shown in Eq. [2](https://arxiv.org/html/2606.12871#S4.E2).

$$\mathrm{Ins} = \frac{\sum_{k=1}^{n} \mathrm{ins}_k}{n} \tag{2}$$

**Algorithm 1** User-Centric Aggregation

1. Input: $p_k \in \{P0, P1, P2(a), P2\}$, scores $o_k \in [0, 1]$
2. $\mathcal{S}_0 \leftarrow \{k : p_k = P0\}$, $\mathcal{S}_1 \leftarrow \{k : p_k = P1\}$
3. $\mathcal{S}_2^{(a)} \leftarrow \{k : p_k = P2(a)\}$ for each group $a \in \{1, \dots, N\}$
4. $\mathcal{G} \leftarrow \text{Mean}\{o_k : k \in \mathcal{S}_2^{(a)}\}$
5. $c_0 \leftarrow \text{Mean}\{o_k : k \in \mathcal{S}_0\}$ if $\mathcal{S}_0 \neq \emptyset$ else $1$
6. $c_1 \leftarrow \text{Mean}\{o_k : k \in \mathcal{S}_1\} \cup \mathcal{G}$
7. **if** $\forall\, k : o_k = 1$ **then return** $\mathrm{UserPref} = 4$
8. **end if**
9. **if** $c_0 = 0 \lor c_1 < 0.3 \lor (c_0 < 0.5 \land c_1 < 0.5)$ **then return** $\mathrm{UserPref} = 1$
10. **end if**
11. $v_1 \leftarrow \forall\, k \in \mathcal{S}_1 : o_k > 0$
12. $v_2 \leftarrow \exists\, k \in \bigcup_{a} \mathcal{S}_2^{(a)} : o_k > 0$
13. **if** $c_0 \geq 0.5 \land v_1 \land v_2 \land c_1 \geq 0.7$ **then**
14. **return** $\mathrm{UserPref} = 3$
15. **else**
16. **return** $\mathrm{UserPref} = 2$
17. **end if**

##### Factuality
We perform cascade performance attribution to obtain reliable factuality performance, where the factuality dimension is considered only if the response satisfies the corresponding instruction following requirement. Otherwise, the required target content is absent, and evaluating its factuality is no longer meaningful. Based on this, we define the overall factuality score as in Eq. [3](https://arxiv.org/html/2606.12871#S4.E3).

$$\mathrm{Fac} = \frac{\sum_{k=1}^{n} \delta_k \cdot \mathrm{ins}_k \cdot \mathrm{fac}_k}{\sum_{k=1}^{n} \delta_k \cdot \mathrm{ins}_k}, \tag{3}$$

where $\delta_k = 1$ if the subtask includes the factuality rubric, and $\delta_k = 0$ otherwise. We first extract the objective claims in $\mathrm{Res}$, along with their supporting references if available. The judge model then verifies each claim using web search and assigns the factuality score $\mathrm{fac}_k \in [0, 1]$ accordingly.
##### Rationality
Similarly, we formulate the overall rationality score in Eq. [4](https://arxiv.org/html/2606.12871#S4.E4), where $\varphi_k$ indicates whether the $k$-th subtask contains the rationality rubric. The score $\mathrm{rat}_k \in \{0, 0.5, 1\}$ is assigned by the judge model based on whether $\mathrm{Res}$ is logically reasonable. To reduce its coupling with factuality, the judge model primarily focuses on the subjective reasoning and analytical part of $\mathrm{Res}$.

$$\mathrm{Rat} = \frac{\sum_{k=1}^{n} \varphi_k \cdot \mathrm{ins}_k \cdot \mathrm{rat}_k}{\sum_{k=1}^{n} \varphi_k \cdot \mathrm{ins}_k} \tag{4}$$
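The cascade attribution in Eqs. 2 to 4 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: the `SubtaskScore` container and both function names are hypothetical, a missing rubric ($\delta_k = 0$ or $\varphi_k = 0$) is encoded as `None`, and returning `None` when the denominator is zero is an assumption, since the text does not say how that case is handled.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SubtaskScore:
    """Judge scores for one subtask (hypothetical container)."""
    ins: float                   # instruction following, in {0, 0.5, 1}
    fac: Optional[float] = None  # factuality in [0, 1]; None if no factuality rubric
    rat: Optional[float] = None  # rationality in {0, 0.5, 1}; None if no rationality rubric


def overall_ins(subtasks: Sequence[SubtaskScore]) -> float:
    """Eq. 2: plain average of instruction-following scores."""
    return sum(s.ins for s in subtasks) / len(subtasks)


def cascade_score(subtasks: Sequence[SubtaskScore], dim: str) -> Optional[float]:
    """Eqs. 3 and 4: average of `dim` ("fac" or "rat") weighted by ins_k.

    Subtasks without the rubric are skipped, and subtasks with ins_k = 0
    contribute nothing to either the numerator or the denominator.
    """
    num = den = 0.0
    for s in subtasks:
        value = getattr(s, dim)
        if value is None:        # rubric absent for this subtask
            continue
        num += s.ins * value
        den += s.ins
    # Assumption: the score is undefined (None) if no subtask carries weight.
    return num / den if den > 0 else None


example = [
    SubtaskScore(ins=1.0, fac=0.8, rat=1.0),
    SubtaskScore(ins=0.5, fac=0.4),           # no rationality rubric
    SubtaskScore(ins=0.0, fac=1.0, rat=0.5),  # instruction not followed
]
print(overall_ins(example), cascade_score(example, "fac"), cascade_score(example, "rat"))
```

In the example, the third subtask has a perfect factuality score but fails instruction following, so it is excluded from both `Fac` and `Rat`, which is the cascade effect the section describes.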
### 4.3 User-centric Performance Aggregation
We apply user-centric aggregation to subtask performance to obtain the user preference score. First, we define four user preference levels according to real users’ perceived helpfulness:

- 1 (Unhelpful): The response entirely misses the user’s core needs and is almost unusable for users.
- 2 (Deficient): The response satisfies some user requirements, but contains significant flaws that negatively impact the user experience.
- 3 (Acceptable): The response satisfies the primary user needs, with only minor flaws that do not significantly affect the overall experience.
- 4 (Perfect): The response fully satisfies the user’s needs with almost no errors.

Then, we recruit the task creators to conduct an ablation study for each subtask to obtain its user-perceived importance. Specifically, they estimate the user preference level when only the target subtask is left unsatisfied, and assign its importance according to the following mapping: P0 if the resulting response is rated as 1 (Unhelpful), P1 if rated as 2 (Deficient), and P2 if rated as 3 (Acceptable). P2(a) denotes subtasks that still yield 3 (Acceptable) when missed alone but can cause 2 (Deficient) when multiple such subtasks are missed.
Moreover, we calculate the overall subtask performance as in Eq. [5](https://arxiv.org/html/2606.12871#S4.E5) according to user experience, where factuality and rationality are meaningful only when the response follows the instructions.

$$o_k = \frac{1}{2} \cdot \mathrm{ins}_k \cdot (\mathrm{fac}_k + \mathrm{rat}_k) \tag{5}$$

Finally, the user-centric aggregation algorithm is developed as in Alg. [1](https://arxiv.org/html/2606.12871#alg1), which aggregates subtask performance $o_k$ based on its importance and computes the overall user preference score $\mathrm{UserPref}$.
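To make the control flow of Alg. 1 concrete, here is a hedged Python sketch of one reading of it. Several details are assumptions rather than statements from the paper: importance labels are encoded as the strings `"P0"`, `"P1"`, `"P2"`, and `"P2(a)"` with `a` a group label; $c_1$ is read as the mean over the P1 scores pooled with the per-group means in $\mathcal{G}$; $c_1$ defaults to 1 when that pool is empty, mirroring the stated default for $c_0$; and Eq. 5 is applied only when both `fac` and `rat` exist, since the text does not specify the missing-rubric case.

```python
from collections import defaultdict
from statistics import mean
from typing import Sequence


def subtask_performance(ins: float, fac: float, rat: float) -> float:
    """Eq. 5: o_k = 1/2 * ins_k * (fac_k + rat_k)."""
    return 0.5 * ins * (fac + rat)


def user_pref(priorities: Sequence[str], scores: Sequence[float]) -> int:
    """One reading of Alg. 1; returns a preference level in {1, 2, 3, 4}."""
    s0 = [o for p, o in zip(priorities, scores) if p == "P0"]
    s1 = [o for p, o in zip(priorities, scores) if p == "P1"]
    groups = defaultdict(list)            # P2(a) subtasks, keyed by group label
    for p, o in zip(priorities, scores):
        if p.startswith("P2("):
            groups[p].append(o)

    group_means = [mean(g) for g in groups.values()]   # the set G
    c0 = mean(s0) if s0 else 1.0
    pooled = s1 + group_means
    c1 = mean(pooled) if pooled else 1.0  # assumption: default mirrors c0

    if all(o == 1 for o in scores):
        return 4
    if c0 == 0 or c1 < 0.3 or (c0 < 0.5 and c1 < 0.5):
        return 1

    v1 = all(o > 0 for o in s1)
    # Literal reading: with no P2(a) subtasks this is False, so level 3 is unreachable.
    v2 = any(o > 0 for g in groups.values() for o in g)
    if c0 >= 0.5 and v1 and v2 and c1 >= 0.7:
        return 3
    return 2


labels = ["P0", "P1", "P2(1)", "P2(1)", "P2"]
print(user_pref(labels, [1.0, 0.8, 1.0, 0.5, 0.0]))
```

Plain `P2` subtasks only matter for the all-ones check at level 4 in this reading, which matches their definition as subtasks whose omission alone leaves the response acceptable.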

| Model | UserPref | SubTask Pass | InstFollow | Factuality | Rationality |
| --- | --- | --- | --- | --- | --- |
| *Deep Research Agents* | | | | | |
| OpenAI o3 Deep Research OpenAI ([2025a](https://arxiv.org/html/2606.12871#bib.bib27)) | 2.42 | 0.228 | 0.967 | 0.616 | 0.856 |
| OpenAI o4-mini Deep Research OpenAI ([2025b](https://arxiv.org/html/2606.12871#bib.bib28)) | 2.40 | 0.241 | 0.961 | 0.663 | 0.778 |
| Gemini Deep Research Google ([2025](https://arxiv.org/html/2606.12871#bib.bib29)) | 2.41 | 0.184 | 0.973 | 0.635 | 0.765 |
| Qwen Deep Research Team ([2025](https://arxiv.org/html/2606.12871#bib.bib30)) | 2.17 | 0.119 | 0.934 | 0.612 | 0.662 |
| Grok 3 Deep Research xAI ([2025](https://arxiv.org/html/2606.12871#bib.bib31)) | 2.48 | 0.301 | 0.917 | 0.731 | 0.909 |
| *LLMs with Search Tools* | | | | | |
| Claude Opus 4.6 Anthropic ([2026](https://arxiv.org/html/2606.12871#bib.bib33)) | 2.79 | 0.261 | 0.976 | 0.796 | 0.820 |
| GPT 5.4 OpenAI ([2026](https://arxiv.org/html/2606.12871#bib.bib34)) | **2.89** | **0.484** | <u>0.982</u> | **0.835** | <u>0.930</u> |
| Gemini 3.1 Pro DeepMind ([2026](https://arxiv.org/html/2606.12871#bib.bib35)) | 2.63 | 0.291 | 0.976 | 0.730 | 0.802 |
| GLM 5 GLM-5-Team ([2026](https://arxiv.org/html/2606.12871#bib.bib36)) | 2.68 | 0.250 | 0.972 | 0.784 | 0.775 |
| Kimi K2.5 Team ([2026a](https://arxiv.org/html/2606.12871#bib.bib37)) | 2.60 | 0.215 | 0.970 | 0.728 | 0.786 |
| Qwen 3.5 Team ([2026b](https://arxiv.org/html/2606.12871#bib.bib38)) | 2.67 | 0.208 | 0.960 | 0.776 | 0.757 |
| *LLMs with Claude Code* | | | | | |
| CC-Claude Opus 4.6 Anthropic ([2026](https://arxiv.org/html/2606.12871#bib.bib33)) | 2.65 | 0.206 | 0.971 | 0.756 | 0.809 |
| CC-GPT 5.4 OpenAI ([2026](https://arxiv.org/html/2606.12871#bib.bib34)) | <u>2.87</u> | <u>0.478</u> | **0.989** | <u>0.813</u> | **0.933** |
| CC-Gemini 3.1 Pro DeepMind ([2026](https://arxiv.org/html/2606.12871#bib.bib35)) | 2.58 | 0.262 | 0.971 | 0.684 | 0.821 |
| CC-GLM 5 GLM-5-Team ([2026](https://arxiv.org/html/2606.12871#bib.bib36)) | 2.65 | 0.265 | 0.965 | 0.767 | 0.809 |
| CC-Kimi K2.5 Team ([2026a](https://arxiv.org/html/2606.12871#bib.bib37)) | 2.61 | 0.223 | 0.964 | 0.718 | 0.796 |
| CC-Qwen 3.5 Team ([2026b](https://arxiv.org/html/2606.12871#bib.bib38)) | 2.51 | 0.199 | 0.967 | 0.718 | 0.782 |

Table 3: Evaluation results of 17 system settings on DailyReport across three categories. Bold values indicate the highest score in each column, while underlined values denote the second highest.
## 5 Experiment
### 5.1 Experiment Setup
We conduct a comprehensive evaluation of 17 agentic systems in three groups: native DRAs, search-augmented LLMs, and LLMs with Claude Code. We select Gemini-3-Flash as the judge model and enable reasoning mode for all evaluated models.
### 5.2 Main Results
Table [3](https://arxiv.org/html/2606.12871#S4.T3) reports the main results of frontier agentic systems on DailyReport. Overall, LLMs with search tools achieve the best performance, followed by LLMs equipped with Claude Code, while native DRAs obtain relatively lower scores. Among all systems, GPT 5.4-based configurations perform best. This suggests that daily search tasks benefit from the combination of direct web search and strong general-purpose LLMs. In contrast, Claude Code is optimized for code-oriented workflows, which may introduce redundant context and lead to suboptimal results on search-intensive tasks. Native DRAs may rely on specialized internal models to balance cost, latency, and stability, making them less effective than stronger general models.
Current systems are particularly weak on UserPref: even the highest score (2.89, GPT 5.4 with search tools) remains below the acceptable level of 3, showing that they still struggle to produce consistently satisfactory responses. For further comparison, we report SubTask Pass, the proportion of subtasks satisfying all rubric criteria, which remains low across systems. A system may achieve higher UserPref despite lower SubTask Pass when it satisfies more high-importance subtasks while missing less critical ones.



Figure 3: Task type effect across three dimensions. For each model, we report the difference $\Delta = \mathrm{Avg}_{\mathrm{analysis}} - \mathrm{Avg}_{\mathrm{retrieval}}$ between its average scores on 50 analysis-centric and 100 retrieval-centric tasks. Blue bars indicate $\Delta > 0$ and stronger analysis-centric task performance, while yellow bars indicate the opposite.

Figure 4: Trace Analysis. Avg_Search_Calls measures the total number of search-tool calls. Reference_Ratio measures the proportion of claims that are supported by references, Reference_Support measures the factual accuracy of claims with references, and No_Reference_Support measures the factual accuracy of claims without references.

The dimensional scores reveal different capability bottlenecks of current agentic systems. First, all systems achieve relatively high InstFollow scores, suggesting that frontier models generally possess strong instruction-following abilities and can cover most explicit user requirements. In contrast, Factuality remains the weakest dimension, indicating that systems still struggle to acquire accurate and timely evidence to avoid hallucinated claims. Rationality is still far from perfect, possibly because reasoning over trending topics often involves incomplete, time-sensitive, or conflicting information.
### 5.3 Task Type Analysis
Task type effects across three dimensions are shown in Figure [3](https://arxiv.org/html/2606.12871#S5.F3). Overall, analysis-centric tasks show slightly better instruction following and rationality, but lower factuality. Specifically, analysis-centric tasks are more open-ended and usually provide broader analytical requirements, making it easier for models to cover the requested aspects and obtain higher InstFollow scores. However, open-ended analysis also leads to more divergent search paths. The retrieved evidence is often scattered across heterogeneous sources, so claims are harder to triangulate through cross-source verification than in retrieval-centric tasks. As a result, models are more likely to introduce unsupported factual claims and suffer from lower factuality. The stronger rationality performance on analysis-centric tasks can be explained by their focus on topic-level summarization and subjective analysis, which better match models’ strengths in open-ended analytical writing. In addition, the task formulations usually provide explicit analytical directions that help models organize coherent explanations and arguments.
### 5.4 Trace Analysis
We analyze the solving traces of each system in Figure [4](https://arxiv.org/html/2606.12871#S5.F4). Search-tool usage, which directly reflects the extent of retrieval and iterative reasoning, shows the strongest association with overall performance. This suggests that future SAs should incorporate mechanisms to ensure sufficient retrieval before generation. Compared to search-augmented LLMs, LLMs with Claude Code invoke search tools less frequently, possibly because the code-oriented framework encourages context reuse and avoids unnecessary tool calls for efficiency.
We additionally examine the weaker factuality dimension through three reference-related metrics. Most systems achieve a high Reference_Ratio, indicating that they tend to support generated claims with references. This improves factual accuracy over unsupported claims to some extent. However, Reference_Support remains lower, showing that citing references does not always guarantee factual correctness. This highlights that future SAs need to improve reference quality and reference-claim alignment, which are still inadequate in current systems, as further analyzed in Appendix [B](https://arxiv.org/html/2606.12871#A2).
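The three reference-related metrics can be computed from per-claim annotations roughly as follows. This is only an illustrative sketch based on the definitions in the Figure 4 caption: the `Claim` record and `reference_metrics` function are hypothetical names, and treating each claim's correctness as a binary flag is an assumption, since the paper's claim-level scoring is not detailed in this section.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class Claim:
    """One extracted objective claim (hypothetical record)."""
    has_reference: bool  # the response cites a source for this claim
    is_correct: bool     # assumption: the judge's verdict is binary


def _accuracy(claims: Sequence[Claim]) -> Optional[float]:
    """Share of correct claims; None when the group is empty."""
    if not claims:
        return None
    return sum(c.is_correct for c in claims) / len(claims)


def reference_metrics(claims: Sequence[Claim]) -> Dict[str, Optional[float]]:
    with_ref = [c for c in claims if c.has_reference]
    without_ref = [c for c in claims if not c.has_reference]
    return {
        # proportion of claims that come with a reference
        "Reference_Ratio": len(with_ref) / len(claims) if claims else None,
        # factual accuracy among referenced claims
        "Reference_Support": _accuracy(with_ref),
        # factual accuracy among unreferenced claims
        "No_Reference_Support": _accuracy(without_ref),
    }


claims = [
    Claim(True, True), Claim(True, True), Claim(True, True),
    Claim(True, False),   # cited, but the claim is still wrong
    Claim(False, False),  # uncited and wrong
]
print(reference_metrics(claims))
```

The fourth claim illustrates the gap discussed above: a high `Reference_Ratio` does not by itself imply a high `Reference_Support`.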
### 5.5 Meta Evaluation

| Models | UserP | Ins (%) | Fac (%) | Rat (%) |
| --- | --- | --- | --- | --- |
| GPT-search | 2.90 ± 0.007 | 98.3 ± 0.3 | 83.6 ± 0.2 | 93.4 ± 0.4 |
| Claude-search | 2.78 ± 0.010 | 97.8 ± 0.2 | 78.5 ± 1.0 | 81.5 ± 0.5 |
| Gemini-search | 2.64 ± 0.010 | 97.7 ± 0.1 | 69.9 ± 2.7 | 80.5 ± 0.9 |

Table 4: Robustness Analyses.

##### Robustness
Evaluation stability reflects the reproducibility and practical usability of a benchmark, yet it is often ignored in existing open\-ended SA benchmarks\. We conduct a robustness analysis on DailyReport by selecting three representative models and repeating the evaluation three times\. We use the standard deviation across runs to measure evaluation stability\. As shown in Table [4](https://arxiv.org/html/2606.12871#S5.T4), the results exhibit low variance, demonstrating that DailyReport provides stable results\. This supports its practical value as a reliable benchmark for SAs\.
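As a minimal sketch of this stability measure, the snippet below computes the mean and standard deviation of one metric over repeated evaluation runs\. The function name `summarize_runs` and the example scores are hypothetical, and the choice of the sample \(n − 1\) standard deviation is an assumption, since the text does not state which estimator is used\.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of repeated evaluation scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)


# Hypothetical factuality scores (%) from three repeated evaluations of one system.
runs = [83.4, 83.6, 83.8]
avg, sd = summarize_runs(runs)
print(f"{avg:.1f} ± {sd:.1f}")
```

A small spread relative to the gaps between systems is what makes rankings from a single run trustworthy\.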
##### Judge Model Selection
| Models | Ins \(%\) | Fac \(%\) | Rea \(%\) | Avg\. Cost \($\) |
| --- | --- | --- | --- | --- |
| GPT\-5\.2 | 92\.1 | 91\.7 | 91\.4 | 2\.04 |
| Gemini\-2\.5\-Pro | 94\.5 | 93\.1 | 93\.8 | 1\.58 |
| Claude\-4\.5\-Sonnet | 95\.1 | 94\.5 | 95\.7 | 2\.53 |
| Gemini\-3\-Flash | 96\.5 | 94\.2 | 95\.3 | 0\.45 |

Table 5: Accuracy and cost of different judge LLMs\.

We conduct a meta\-evaluation to compare different LLMs as evaluators\. Each LLM evaluates the same set of reports, and we compute its accuracy against human expert annotations, as reported in Table [5](https://arxiv.org/html/2606.12871#S5.T5)\. Gemini\-3\-Flash follows our criteria more accurately than GPT\-5\.2 and Gemini\-2\.5\-Pro, while achieving comparable agreement to Claude\-4\.5\-Sonnet\. Considering both evaluation accuracy and cost, we select Gemini\-3\-Flash as the judge model for all experiments\.
##### Metric Validation
To validate our metrics, we conduct a meta\-evaluation on 300 randomly sampled subtasks\. For instruction\-following, human annotators and the judge model independently evaluate these samples\. Final labels are determined through adjudication, where experts review both results to make more informed decisions that approximate the ground truth\. Our evaluation achieves 96\.5% accuracy, substantially exceeding human annotation accuracy of 88\.4%\. Since factuality and rationality are difficult for humans to annotate in long reports, we instead assess these metrics through manual spot checks\. The accuracy reaches 94\.2% for factuality and 95\.3% for rationality, meeting the expected requirements\. For user preference, users are given the generated reports and subtask results, and assign an overall score from 1 to 4 to indicate their preference\. The agreement heatmap is in Appendix [B](https://arxiv.org/html/2606.12871#A2)\. UserPref achieves high agreement with real user ratings, with a Weighted Cohen’s Kappa score of 0\.859\. This suggests that it effectively reflects real users’ perceived experience\.
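For illustration, a weighted Cohen’s Kappa between aggregated UserPref scores and real user ratings on the 1 to 4 scale could be computed as sketched below\. The text does not specify the weighting scheme, so the quadratic default, the function name `weighted_kappa`, and the sample ratings are all assumptions\.

```python
def weighted_kappa(a, b, categories, weights="quadratic"):
    """Weighted Cohen's kappa for two equal-length lists of ordinal ratings."""
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    k, n = len(categories), len(a)
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions of (rater A category, rater B category).
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    power = 2 if weights == "quadratic" else 1
    disagreement = expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # penalty grows with distance
            disagreement += w * observed[i][j]
            expected += w * row[i] * col[j]  # chance-level disagreement
    if expected == 0.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1.0 - disagreement / expected


# Hypothetical data: aggregated scores vs. real user ratings.
model_scores = [1, 2, 3, 4, 4, 2]
user_ratings = [1, 2, 3, 4, 3, 2]
kappa = weighted_kappa(model_scores, user_ratings, categories=[1, 2, 3, 4])
print(f"weighted kappa = {kappa:.3f}")
```

Weighting matters for an ordinal scale: confusing a 1 with a 4 is penalized more heavily than confusing a 3 with a 4, which an unweighted kappa would treat identically\.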
Figure 5: Domain distribution\. The heatmap reports the average UserPref scores of different systems on analysis\-centric and retrieval\-centric tasks across 10 high\-level domains\.
### 5\.6 Domain Distribution
UserPref across 10 high\-level domains is shown in Figure [5](https://arxiv.org/html/2606.12871#S5.F5)\. Systems generally achieve higher user preference in domains such as Politics & Law and Industrial Economies, where information is more structured and can be verified through authoritative sources, such as official announcements, institutional reports, or mainstream news coverage\. In contrast, domains such as Sports & Entertainment tend to receive lower scores, as they often involve rapidly changing events and subjective user opinions, making it harder to retrieve comprehensive evidence and produce reliable analysis\. This domain\-level variation suggests that current search agents perform better on topics with stable and well\-documented evidence, but still struggle with highly dynamic or subjective information needs\.
## 6 Conclusion
In this work, we present an open\-ended benchmark \(DailyReport\) to evaluate search agents on daily search tasks\. It contains 150 tasks with 3,546 associated rubrics, capturing widely discussed and timely information needs of real\-world users\. We decompose each task into subtasks and design cascade rubrics along disentangled dimensions for subtask evaluation\. Through cascade performance attribution and user\-centric aggregation, DailyReport produces interpretable dimensional scores and an additional user preference score\. Finally, we conduct an empirical assessment of 17 agentic systems to characterize current search agents and offer insights for future research in this area\.
## Acknowledgments
We would like to thank Ruyu Ruan, Yinglong Deng, Yi Shi, Jianfei Zhao, Jiayi Guo, Hao Zheng, Zhiqiang Li, Mingyue Yuan, Danni Li, Ting Zeng, Xin Tang, Luju Gao, Zixi Yuan, and Tingting Liang for their valuable contributions to the benchmark construction process\.
## References
- A\. Abaskohi, T\. Chen, M\. Muñoz\-Mármol, C\. Fox, A\. V\. Ramesh, É\. Marcotte, X\. H\. Lù, N\. Chapados, S\. Gella, P\. West, et al\. \(2025\) Drbench: a realistic benchmark for enterprise deep research\. arXiv preprint arXiv:2510\.00172\. Cited by: [§1](https://arxiv.org/html/2606.12871#S1.p2.1)\.
- Claude 4\.6 opus system card\. Technical report, Anthropic\. External Links: [Link](https://www.anthropic.com/claude-opus-4-6-system-card) Cited by: [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.16.1), [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.9.1)\.
- A\. Bigeard, L\. Nashold, R\. Krishnan, and S\. Wu \(2025\) Finance agent benchmark: benchmarking llms on real\-world financial research tasks\. arXiv preprint arXiv:2508\.00828\. Cited by: [§2\.1](https://arxiv.org/html/2606.12871#S2.SS1.p1.1)\.
- K\. Chen, Y\. Ren, Y\. Liu, X\. Hu, H\. Tian, T\. Xie, F\. Liu, H\. Zhang, H\. Liu, Y\. Gong, et al\. \(2025\) Xbench: tracking agents productivity scaling with profession\-aligned real\-world evaluations\. arXiv preprint arXiv:2506\.13651\. Cited by: [§2\.1](https://arxiv.org/html/2606.12871#S2.SS1.p1.1)\.
- G\. DeepMind \(2026\) Gemini 3\.1 pro model card\. Technical report, Google DeepMind\. External Links: [Link](https://deepmind.google/models/model-cards/gemini-3-1-pro/) Cited by: [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.11.1), [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.18.1)\.
- M\. Du, B\. Xu, C\. Zhu, X\. Wang, and Z\. Mao \(2025\) Deepresearch bench: a comprehensive benchmark for deep research agents\. arXiv preprint arXiv:2506\.11763\. Cited by: [§1](https://arxiv.org/html/2606.12871#S1.p2.1), [§2\.1](https://arxiv.org/html/2606.12871#S2.SS1.p1.1)\.
- T\. Fan, X\. Niu, Y\. Zheng, F\. Zhang, C\. Huang, B\. Chen, J\. Lin, and C\. Huang \(2025\) Understanding deepresearch via reports\. arXiv preprint arXiv:2510\.07861\. Cited by: [§1](https://arxiv.org/html/2606.12871#S1.p2.1)\.
- GLM\-5\-Team \(2026\) GLM\-5: from vibe coding to agentic engineering\. arXiv preprint arXiv:2602\.15763\. Cited by: [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.12.1), [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.19.1)\.
- Google \(2025\) Google gemini deep research\. Note: [https://blog\.google/innovation\-and\-ai/technology/developers\-tools/deep\-research\-agent\-gemini\-api/](https://blog.google/innovation-and-ai/technology/developers-tools/deep-research-agent-gemini-api/) Accessed: 2026\-04\-20\. Cited by: [§2\.2](https://arxiv.org/html/2606.12871#S2.SS2.p1.1), [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.5.1)\.
- P\. Huang, Z\. Zhong, Z\. Wan, D\. Zhou, S\. Alam, X\. Wang, Z\. Li, Z\. Dou, L\. Zhu, J\. Xiong, et al\. \(2026\) MMDeepResearch\-bench: a benchmark for multimodal deep research agents\. arXiv preprint arXiv:2601\.12346\. Cited by: [§2\.1](https://arxiv.org/html/2606.12871#S2.SS1.p1.1)\.
- Y\. Huang, Y\. Chen, H\. Zhang, K\. Li, H\. Zhou, M\. Fang, L\. Yang, X\. Li, L\. Shang, S\. Xu, et al\. \(2025\) Deep research agents: a systematic examination and roadmap\. arXiv preprint arXiv:2506\.18096\. Cited by: [§1](https://arxiv.org/html/2606.12871#S1.p1.1)\.
- Jina AI \(2024\) Jina reader: convert any url to markdown for llms\. Note: [https://jina\.ai/reader/](https://jina.ai/reader/) Accessed: 2026\-05\-20\. Cited by: [§B\.3\.2](https://arxiv.org/html/2606.12871#A2.SS3.SSS2.p1.4)\.
- LearningCircuit \(2025\) Local deep research\. GitHub\. Note: [https://github\.com/LearningCircuit/local\-deep\-research](https://github.com/LearningCircuit/local-deep-research) Cited by: [§2\.2](https://arxiv.org/html/2606.12871#S2.SS2.p1.1)\.
- R\. Li, M\. Du, B\. Xu, C\. Zhu, X\. Wang, and Z\. Mao \(2026\) DeepResearch bench ii: diagnosing deep research agents via rubrics from expert report\. arXiv preprint arXiv:2601\.08536\. Cited by: [§1](https://arxiv.org/html/2606.12871#S1.p2.1), [§2\.1](https://arxiv.org/html/2606.12871#S2.SS1.p1.1)\.
- S\. Li, X\. Bu, W\. Wang, J\. Liu, J\. Dong, H\. He, H\. Lu, H\. Zhang, C\. Jing, Z\. Li, et al\. \(2025\) Mm\-browsecomp: a comprehensive benchmark for multimodal browsing agents\. arXiv preprint arXiv:2508\.13186\. Cited by: [§2\.1](https://arxiv.org/html/2606.12871#S2.SS1.p1.1)\.
- Y\. Lyu, X\. Zhang, L\. Yan, M\. de Rijke, Z\. Ren, and X\. Chen \(2025\) Deepshop: a benchmark for deep research shopping agents\. arXiv preprint arXiv:2506\.02839\. Cited by: [§2\.1](https://arxiv.org/html/2606.12871#S2.SS1.p1.1)\.
- OpenAI \(2025a\) OpenAI o3 deep research\. Note: [https://developers\.openai\.com/api/docs/models/o3\-deep\-research](https://developers.openai.com/api/docs/models/o3-deep-research) Accessed: 2026\-04\-20\. Cited by: [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.3.1)\.
- OpenAI \(2025b\) OpenAI o4\-mini deep research\. Note: [https://developers\.openai\.com/api/docs/models/o4\-mini\-deep\-research](https://developers.openai.com/api/docs/models/o4-mini-deep-research) Accessed: 2026\-04\-20\. Cited by: [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.4.1)\.
- OpenAI \(2026\) GPT\-5\.4\. Note: [https://openai\.com/index/introducing\-gpt\-5\-4/](https://openai.com/index/introducing-gpt-5-4/) Accessed: 2026\-04\-20\. Cited by: [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.10.1), [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.17.1)\.
- Serper Dev \(2023\) Serper: the google search api\. Note: [https://serper\.dev](https://serper.dev/) Accessed: 2026\-05\-20\. Cited by: [§B\.3\.2](https://arxiv.org/html/2606.12871#A2.SS3.SSS2.p1.4)\.
- M\. Sharma, C\. B\. C\. Zhang, C\. Bandi, C\. Wang, A\. Aich, H\. Nghiem, T\. Rabbani, Y\. Htet, B\. Jang, S\. Basu, et al\. \(2025\) Researchrubrics: a benchmark of prompts and rubrics for evaluating deep research agents\. arXiv preprint arXiv:2511\.07685\. Cited by: [§1](https://arxiv.org/html/2606.12871#S1.p2.1)\.
- Y\. Song, K\. Thai, C\. M\. Pham, Y\. Chang, M\. Nadaf, and M\. Iyyer \(2025\) Bearcubs: a benchmark for computer\-using web agents\. arXiv preprint arXiv:2503\.07919\. Cited by: [§2\.1](https://arxiv.org/html/2606.12871#S2.SS1.p1.1)\.
- K\. Team \(2026a\) Kimi k2\.5: visual agentic intelligence\. arXiv preprint arXiv:2602\.02276\. Cited by: [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.13.1), [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.20.1)\.
- Q\. Team \(2025\) Qwen deepresearch\. Note: [https://qwen\.ai/blog?id=qwen\-deepresearch](https://qwen.ai/blog?id=qwen-deepresearch) Accessed: 2026\-04\-20\. Cited by: [§2\.2](https://arxiv.org/html/2606.12871#S2.SS2.p1.1), [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.6.1)\.
- Q\. Team \(2026b\) Qwen3\.5: towards native multimodal agents\. Note: [https://qwen\.ai/blog?id=qwen3\.5](https://qwen.ai/blog?id=qwen3.5) Accessed: 2026\-04\-20\. Cited by: [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.14.1), [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.21.1)\.
- T\. D\. Team, B\. Li, B\. Zhang, D\. Zhang, F\. Huang, G\. Li, G\. Chen, H\. Yin, J\. Wu, J\. Zhou, et al\. \(2025\) Tongyi deepresearch technical report\. arXiv preprint arXiv:2510\.24701\. Cited by: [§2\.2](https://arxiv.org/html/2606.12871#S2.SS2.p1.1)\.
- J\. Wang, Y\. Ming, R\. Dulepet, Q\. Chen, A\. Xu, Z\. Ke, F\. Sala, A\. Albarghouthi, C\. Xiong, and S\. Joty \(2025\) Liveresearchbench: a live benchmark for user\-centric deep research in the wild\. arXiv preprint arXiv:2510\.14240\. Cited by: [§1](https://arxiv.org/html/2606.12871#S1.p1.1), [§2\.1](https://arxiv.org/html/2606.12871#S2.SS1.p2.1)\.
- J\. Wei, Z\. Sun, S\. Papay, S\. McKinney, J\. Han, I\. Fulford, H\. W\. Chung, A\. T\. Passos, W\. Fedus, and A\. Glaese \(2025\) Browsecomp: a simple yet challenging benchmark for browsing agents\. arXiv preprint arXiv:2504\.12516\. Cited by: [§1](https://arxiv.org/html/2606.12871#S1.p2.1), [§2\.1](https://arxiv.org/html/2606.12871#S2.SS1.p1.1)\.
- R\. Wong, J\. Wang, J\. Zhao, L\. Chen, Y\. Gao, L\. Zhang, X\. Zhou, Z\. Wang, K\. Xiang, G\. Zhang, et al\. \(2025\) Widesearch: benchmarking agentic broad info\-seeking\. arXiv preprint arXiv:2508\.07999\. Cited by: [§2\.1](https://arxiv.org/html/2606.12871#S2.SS1.p1.1)\.
- T\. Wu, Y\. Wang, X\. Ma, X\. He, S\. Wang, D\. Yin, and X\. Zhao \(2026\) DeepResearch\-9k: a challenging benchmark dataset of deep\-research agent\. arXiv preprint arXiv:2603\.01152\. Cited by: [§2\.1](https://arxiv.org/html/2606.12871#S2.SS1.p1.1)\.
- xAI \(2025\) Grok\-3\-deepsearch\. Note: [https://x\.ai/news/grok\-3/](https://x.ai/news/grok-3/) Accessed: 2026\-04\-20\. Cited by: [§2\.2](https://arxiv.org/html/2606.12871#S2.SS2.p1.1), [Table 3](https://arxiv.org/html/2606.12871#S4.T3.1.1.7.1)\.
- Y\. Xi, J\. Lin, Y\. Xiao, Z\. Zhou, R\. Shan, T\. Gao, J\. Zhu, W\. Liu, Y\. Yu, and W\. Zhang \(2025\) A survey of llm\-based deep search agents: paradigm, optimization, evaluation, and challenges\. arXiv preprint arXiv:2508\.05668\. Cited by: [§2\.2](https://arxiv.org/html/2606.12871#S2.SS2.p1.1)\.
- T\. Xu, P\. Lu, L\. Ye, X\. Hu, and P\. Liu \(2025\) Researcherbench: evaluating deep ai research systems on the frontiers of scientific inquiry\. arXiv preprint arXiv:2507\.16280\. Cited by: [§1](https://arxiv.org/html/2606.12871#S1.p2.1)\.
- Y\. Zheng, D\. Fu, X\. Hu, X\. Cai, L\. Ye, P\. Lu, and P\. Liu \(2025\) Deepresearcher: scaling deep research via reinforcement learning in real\-world environments\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp\. 414–431\. Cited by: [§2\.2](https://arxiv.org/html/2606.12871#S2.SS2.p1.1)\.
- H\. Zhou, Y\. Chen, S\. Guo, X\. Yan, K\. H\. Lee, Z\. Wang, K\. Y\. Lee, G\. Zhang, K\. Shao, L\. Yang, et al\. \(2025\) Memento: fine\-tuning llm agents without fine\-tuning llms\. arXiv preprint arXiv:2508\.16153\. Cited by: [§2\.2](https://arxiv.org/html/2606.12871#S2.SS2.p1.1)\.
## Appendix A Construction Appendix
### A\.1 Human Annotation
DailyReport involved substantial human participation across the entire task construction pipeline\. This process involved over 500 hours of human annotation and review, with all contributors compensated at approximately USD 56–70 per day for their work\. We recruited contributors with diverse backgrounds, including different regions, educational experiences, online platform habits, and domain familiarity\. All contributors were familiar with both Western and Chinese media ecosystems, enabling them to better identify users’ daily information needs from trending topic contexts and user comments\. Before annotation, they were given detailed guidelines on authenticity, clarity, safety, tool dependency, unimodality, and disentanglement, and completed pilot examples to ensure a consistent understanding of the construction criteria\.
During task construction, annotators first reviewed public trending posts and user comments to identify common information needs, while filtering out topics that were unsafe, overly narrow, ambiguous, or not suitable for generating daily reports\. Task writers then transformed selected topics into realistic search tasks with clear scopes, factual requirements, and analytical components\. Additional reviewers checked each task for clarity, realism, safety, search dependency, and category consistency\. The original task creators also participated in estimating subtask importance for user\-centric aggregation, helping the final benchmark reflect not only whether a system satisfies individual requirements, but also how much each requirement matters to user experience\.
Figure 6: Agreement heatmap\. Each cell shows the number of sampled instances with the corresponding score pair, and the diagonal concentration indicates strong consistency with real users’ perceived experience\.
### A\.2 Constraints Elaboration
We define the constraint categories as follows, which are utilized to decompose the constructed tasks and derive the corresponding subtasks\. Specifically, the categories include: \(1\) Content Constraints, which concern the core information elements to be output; \(2\) Scope Constraints, which require the generated content to strictly remain within the boundaries specified in the requirement prompt, such as temporal, spatial, domain, source, or policy restrictions; \(3\) Completeness Constraints, which require the output to satisfy specific standards of quantity, exhaustive coverage, and informational completeness; \(4\) Quantity Constraints, which define exact measurable targets for the output, including word counts, item quantities, and overall length; \(5\) Format Constraints, which specify the structural layout, styling, and formatting of the generated response; \(6\) Setting Constraints, which require the agent to operate strictly within the given settings, without violating designated backgrounds, character personas, scenarios, prerequisites, or provided data; \(7\) Attribute Constraints, which specify the stylistic and perspectival properties of the output; \(8\) Action & Rule Constraints, which define the exact actions, execution paths, methodologies, or logical rules the agent must follow to generate the output; and \(9\) Function Constraints, which require the output to serve a specific practical function, achieve a targeted effect, or solve a defined problem\.

| Model | Reference Accuracy | Refer\-Claim Consistency | Web Search | Web Content Mining |
| --- | --- | --- | --- | --- |
| *Deep Research Agents* | | | | |
| OpenAI o3 Deep Research | 0\.550 | 0\.858 | \- | \- |
| OpenAI o4\-mini Deep Research | 0\.585 | 0\.784 | \- | \- |
| *LLMs with Search Tools* | | | | |
| Claude Opus 4\.6 | 0\.739 | 0\.845 | 15\.7 | 11\.9 |
| GPT 5\.4 | 0\.814 | 0\.906 | 31\.6 | 18\.8 |
| Gemini 3\.1 Pro | 0\.705 | 0\.824 | 9\.4 | 2\.8 |
| GLM 5 | 0\.710 | 0\.792 | 17\.7 | 15\.0 |
| Kimi K2\.5 | 0\.762 | 0\.787 | 13\.7 | 4\.6 |
| Qwen 3\.5 | 0\.654 | 0\.769 | 9\.6 | 11\.2 |
| *LLMs with Claude Code* | | | | |
| CC\-Claude Opus 4\.6 | 0\.701 | 0\.850 | 17\.1 | 5\.1 |
| CC\-GPT 5\.4 | 0\.741 | 0\.897 | 22\.5 | 15\.0 |
| CC\-Gemini 3\.1 Pro | 0\.768 | 0\.834 | 11\.5 | 1\.9 |
| CC\-GLM 5 | 0\.773 | 0\.846 | 14\.6 | 7\.7 |
| CC\-Kimi K2\.5 | 0\.772 | 0\.807 | 12\.7 | 4\.1 |
| CC\-Qwen 3\.5 | 0\.679 | 0\.739 | 11\.5 | 6\.3 |

Table 6: Detailed analysis of solving traces\. Reference Accuracy evaluates the factual reliability of cited references\. Refer\-Claim Consistency measures whether generated claims are accurately supported by their cited references\. Web Search counts calls to web search tools such as Serper for retrieving search results and snippets, while Web Content Mining counts calls to webpage\-fetching tools such as Jina for accessing full webpage content\. Both types of calls are treated as important search\-tool operations for evidence gathering\. Results for some closed\-source Deep Research Agents are partially omitted, as they do not expose the internal traces \(e\.g\., search queries, visited URLs\) required for reliable measurement of certain metrics\.
## Appendix B Evaluation Appendix
### B\.1 Meta Evaluation
To validate that the user preference score corresponds to real users’ perceived experience, we conduct a meta\-evaluation of user preference\. The generated reports and subtask results for 300 randomly sampled subtasks are provided to diverse users, who are asked to assign an overall task score from 1 to 4 to indicate their preference\. Figure [6](https://arxiv.org/html/2606.12871#A1.F6) shows that for tasks with real user preferences of 1 and 4, the user preference scores aggregated by our method achieve high alignment, and this high consistency is also maintained for scores of 2 and 3\. The Weighted Cohen’s Kappa score of 0\.859 for our meta\-evaluation verifies the high degree of alignment between the user preference score and the real user rating\. This suggests that the user preference score effectively reflects real users’ preference\.
### B\.2 Search Analysis
Overall, the two reference\-related metrics remain far from ideal, with Reference Accuracy being particularly limited\. This suggests that current search agent systems may still rely on inaccurate, unreliable, or inappropriate citations\. Meanwhile, the imperfect Refer\-Claim Consistency scores indicate that even when relevant references are retrieved, models may not always use them faithfully to support the generated claims\. Together, these results reveal a critical weakness: such systems can produce seemingly well\-supported answers while relying on questionable evidence or misaligning claims with their cited sources\. Therefore, future Search Agent systems should incorporate explicit citation verification mechanisms, such as source credibility assessment, cross\-reference validation, and factual reliability checking, to ensure the quality and accuracy of cited evidence\. In addition, citation\-claim consistency verification mechanisms are needed to determine whether each generated claim is genuinely entailed by its corresponding sources, thereby ensuring that references are not only accurate but also used appropriately\.
### B\.3 Judgment Process
#### B\.3\.1 Instruction Following
Instruction following evaluates whether the response correctly executes each decomposed subtask according to its instruction\-following rubric\. The judge model checks whether the response understands the required action, covers the requested content, and satisfies key constraints such as scope, quantity, format, and completeness\. For example, if a subtask asks for a list of entities within a specified scope, the judge checks whether such a list is provided and whether the scope is respected\. The judge model assigns a score from $\{0, 0.5, 1\}$:
- 1 \(Fully satisfied\): The subtask is fully satisfied, with all essential requirements and constraints correctly followed\.
- 0\.5 \(Partially satisfied\): The subtask is partially satisfied, but some non\-critical requirements are missing or imperfectly handled\. For example, when the user requests the top\-10 movies of the year with their directors, the agent returns all ten titles but omits director information for some entries\.
- 0 \(Not satisfied\): The subtask is not satisfied, such as when the response omits the required content, answers irrelevantly, refuses without reason, or fails to perform the required action\.
#### B\.3\.2 Factuality
Factuality evaluates whether the objective claims in the response are factually correct\. For each response report, we extract factual claims based on the factual rubrics\. Specifically, the extracted claims must be objective, specific statements verifiable through factual sources\. We then construct search queries for each extracted claim and employ an orchestrated workflow equipped with web search \(Serper Search; Serper Dev \[[2023](https://arxiv.org/html/2606.12871#bib.bib41)\]\) and web fetch \(Jina Reader; Jina AI \[[2024](https://arxiv.org/html/2606.12871#bib.bib40)\]\) to verify their correctness\. The factuality score of each subtask is quantified as the proportion of verified correct claims among all extracted claims:

$$\mathrm{fac}_i = \frac{|\mathcal{C}^{\mathrm{correct}}_i|}{|\mathcal{C}_i|}, \tag{6}$$

where $\mathcal{C}_i$ denotes the set of factual claims extracted for the $i$\-th subtask, and $\mathcal{C}^{\mathrm{correct}}_i$ denotes the subset of claims verified as correct\. Furthermore, if the report provides references for factual claims, the corresponding web pages serve as key sources and are jointly considered with other retrieved sources to determine claim correctness\. In this process, we additionally measure the information consistency between claims and their cited references, reflecting the tested Search Agent’s ability to synthesize information from retrieved web pages\.
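Equation \(6\) can be sketched in a few lines of Python\. The `Claim` record and the `factuality_score` helper are hypothetical names introduced for illustration, and returning `None` when no claims are extracted is an assumption, since the text does not define that case\.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    text: str
    verified_correct: bool  # outcome of the search-based verification step


def factuality_score(claims: List[Claim]) -> Optional[float]:
    """Proportion of extracted claims verified as correct for one subtask."""
    if not claims:
        return None  # assumption: the score is undefined without claims
    correct = sum(1 for c in claims if c.verified_correct)
    return correct / len(claims)


# Hypothetical claims extracted for a single subtask.
claims = [
    Claim("Platform A launched the service in March.", True),
    Claim("The subsidy totaled 80 billion yuan.", True),
    Claim("Company B holds a 60% market share.", False),
    Claim("The regulation took effect on 1 July.", True),
]
score = factuality_score(claims)
print(score)
```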
#### B\.3\.3 Rationality
Rationality evaluates whether the analytical parts of the response are logically sound and well supported\. For each response report, we extract the parts related to the rationality rubrics, which typically involve explanations, comparisons, causal analysis, trade\-off evaluation, or recommendations, while excluding factual claims already used for factuality verification to ensure independence between evaluation dimensions\. The judge model then assesses whether each extracted part presents a coherent and reasonable line of reasoning, such as whether the conclusion follows from the stated evidence, whether the comparison criteria are appropriate, and whether the analysis avoids obvious logical gaps or unsupported leaps\. The judge model assigns a rationality score from $\{0, 0.5, 1\}$: a score of 1 indicates that the analysis is coherent, well justified, and directly supports the subtask requirement; a score of 0\.5 indicates that the analysis is partially reasonable but contains minor logical gaps, insufficient support, or incomplete discussion; and a score of 0 indicates that the analysis is largely unreasonable, unsupported, irrelevant, or logically flawed\.
### B\.4 LLM Configuration
#### B\.4\.1 Deep Research Agents
For native deep research models, specialized configurations were implemented to accommodate their unique operational characteristics:
- Autonomous Research Execution: These models possess fully integrated web search and content synthesis capabilities that operate independently without requiring external tool definitions\. The models autonomously determine search strategies, execute queries, retrieve and analyze web content, and synthesize findings into coherent reports\.
- Processing Duration Allowance: Given the substantially longer execution times inherent to deep research operations, which involve multiple rounds of autonomous web exploration and content synthesis, timeout thresholds were extended to 1,800 seconds\.
#### B\.4\.2 LLMs with Web Search Tools
For standard LLMs with external web search tools, the following unified configurations were applied to ensure a standardized evaluation environment:
- External Tools: Two external tools were provided to facilitate web\-based information retrieval\. The `google_search` tool enables models to query search engines with custom keywords and retrieve structured organic results containing titles, URLs, and snippets\. The `fetch_webpage` tool allows models to extract full\-text content from any specified URL, primarily utilizing the Jina Reader API for Markdown conversion\.
- Extended Reasoning Activation: To ensure sufficient analytical depth, we enabled the corresponding extended thinking or reasoning features for all models when available\. For models supporting the `thinking` parameter, the thinking budget was set to 8,000 tokens to provide enough capacity for complex multi\-step reasoning\. For GPT 5\.4, the `reasoning_effort` parameter was set to `"medium"`\. Kimi K2\.5 has thinking mode enabled by default and thus requires no additional configuration\.
- Response Generation Limits: The maximum output length was set to 32,768 tokens for all models, ensuring sufficient capacity for generating comprehensive research reports\. The temperature was fixed at 1\.0 to balance response diversity and reproducibility across repeated evaluations\.
- Citation Formatting Protocol: To support consistent downstream factual verification, all models were instructed to use standardized bracketed numerical citations, such as \[1\]\[2\], placed at the end of sentences\. Each report was also required to include a unified "References" section at the end, listing all cited sources with their titles and URLs\.
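The citation protocol lends itself to a simple consistency check before factual verification\. The sketch below is not the authors’ pipeline: the helper name `check_citations` is hypothetical, and it assumes the reference list starts at a `## References` heading \(as requested in the report\-generation prompt\) with one `[n]` entry per line\.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")
REFERENCE_ENTRY = re.compile(r"^\[(\d+)\]", flags=re.MULTILINE)


def check_citations(report: str) -> dict:
    """Compare citation numbers used in the body with the listed references."""
    # Assumption: everything after the first "## References" heading is the list.
    body, _, references = report.partition("## References")
    cited = {int(n) for n in CITATION.findall(body)}
    listed = {int(n) for n in REFERENCE_ENTRY.findall(references)}
    return {
        "missing": sorted(cited - listed),  # cited in the body but never listed
        "unused": sorted(listed - cited),   # listed but never cited in the body
    }


# Hypothetical report fragment.
report = (
    "Sales rose sharply this quarter [1][2].\n"
    "Costs fell over the same period [3].\n\n"
    "## References\n"
    "[1] Quarterly sales report_Example News http://example.com/a\n"
    "[2] Market overview_Example Daily http://example.com/b\n"
)
result = check_citations(report)
print(result)
```

Flagging dangling citation numbers early keeps malformed reports from being mistaken for unsupported claims during verification\.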
#### B\.4\.3 LLMs with Claude Code
For experiments employing Claude Code as the orchestrating agentic framework with various backend LLMs, the following parameters were established to ensure consistent evaluation:
- Extended Reasoning Activation: All backend models integrated within the Claude Code framework were configured with their extended thinking features enabled, following identical parameter settings as described for LLMs with search tools\. This ensured that the reasoning capabilities of backend models were fully utilized during the agentic research process, regardless of the orchestration layer\.
- Tool Ecosystem: MCP \(Model Context Protocol\) integrations were enabled to provide the systems with comprehensive web research capabilities\. Serper was configured as the primary web search provider, offering structured search results with titles, URLs, and snippets\. Jina Reader was integrated for webpage content extraction, converting HTML pages to clean Markdown format suitable for LLM consumption\. These search tools operated in conjunction with Claude Code’s native file system and code execution capabilities\.
- Independence: Session persistence was disabled via the `--no-session-persistence` flag, ensuring that each question was evaluated independently without contextual carryover from prior tasks\. This configuration prevents performance from benefiting from accumulated session knowledge\.
- Citation Formatting Protocol: The same citation normalization procedure as described for standalone LLMs with tool\-calling capabilities was applied to all reports generated through the Claude Code framework\. This ensured consistent citation structure across different experimental configurations and enabled uniform downstream factual verification using standardized evaluation frameworks\.
## Appendix C Prompt Templates
System Prompt for Report Generation

You are a search assistant with web search and webpage reading capabilities who can generate daily report\.

CRITICAL RULES \(you MUST follow all of them\):

1. NEVER use your own parametric knowledge\. Every factual claim, data point, statistic, name, date, or opinion in your report MUST come from information retrieved via the tools\. If you cannot find information through the tools, say so; do NOT fill in from memory\.
2. Research strategy: fully autonomous, multi\-angle verification\. You have complete freedom to decide your research strategy\. Use as many search and fetch rounds as needed to produce the most comprehensive, in\-depth, and well\-verified report possible\.
3. Output format: produce a Markdown research report\.
   - Use clear Markdown headings \(\#\#, \#\#\#\) to organize by topic\.
   - The report should be thorough and at least 2000 words\.
   - Write in the same language as the research question\.
4. Citations: numbered parenthetical references\.
   - Assign each source a sequential number starting from 1\.
   - In the report body, cite sources using parenthetical numbers: \[1\], \[2\], \[3\]\. For example: “The three major platforms invested a cumulative 80\-100 billion yuan in subsidies \[63\]”\.
   - Each major claim must be backed by at least one citation\.
   - End with a “\#\# References” section listing all cited sources as: \[n\] Source title\_Site name URL\. For example:
     \[1\] Farewell to the cash\-burning era\! Food delivery platforms simultaneously halt zero\-dollar purchases\_Financial News http://example\.com/article1
     \[2\] China’s fitness industry report 2025\_Reuters https://example\.com/article2

When you have gathered sufficient information and are ready, output the final report as your response \(without any tool calls\)\. That signals the end of the research process\.
###### Instruction Follow Score Prompt

**1. Role & Goal**

Background time: The current date is {cur_date}. If the question explicitly includes time constraints, please follow the question’s requirements.

You are an expert evaluating the ability of an intelligent agent to handle specified tasks. Your focus is on the agent’s instruction-following capability. Your scoring must be objective and fair.

**2. Input Format**

You will receive the user question (Question), the agent’s processing result (Document), and detailed scoring criteria (Criteria) for this evaluation:

- Question (str): `<User question>`
- Document (str): `<Agent’s processing result>`
- Criteria (list): `<Scoring criteria>`

**3. Workflow**

Please strictly follow the workflow below to complete the task:

1. Carefully read the Question, Document, and Criteria content. Clearly understand the meaning of each scoring criterion. Do not omit or alter any content in the Document.
2. Iterate through Criteria and score each individual criterion. Do not add, remove, or modify any scoring criteria. Follow the scoring process below:
   - If the Document content strictly satisfies the "criterion" content, the score for this criterion is 1.0.
   - If the constraints in the "criterion" include multiple subjects, objects, or methods, and only part of them are satisfied in the Document, the score for this criterion is 0.5.
   - If the Document contains content related to the "criterion", but all content conflicts with the requirements (does not match), the score for this criterion is 0.0.
3. Carefully verify your scoring result for each criterion to ensure accuracy.

**4. Caution**

1. Do not judge the accuracy of Document content based on your existing knowledge. You need to judge based on what the Document claims, even if the content may be incorrect. You do not need to verify its accuracy.
   - If the scoring criteria require explicitly providing a certain metric, and the delivery document explicitly states "this metric cannot be obtained", consider the scoring criteria as satisfied.
2. Do not judge based on your time system. When Criteria contains scoring criteria involving time requirements, please judge based on the time information claimed in the Document.
3. Do not engage in open-ended thinking. Strictly score according to the standards in Criteria against the Document. Do not add, remove, or alter any standards in Criteria.
4. Your scoring should be very strict, reflected in the following aspects:
   (a) All subjects and objects required in the scoring criteria, as well as any actions or conditions related to subjects and objects, must be checked.
   (b) Scoring cannot rely solely on section titles in the Document. Verify whether the body text actually contains relevant content that satisfies the scoring criteria.
   (c) Body content must explicitly satisfy scoring criteria requirements. Self-inference is prohibited. For example, if the scoring criterion is "Does the delivery document analyze future development based on existing policies?", the body text must explicitly satisfy "analyze future development based on existing policies". Describing "existing policies" and "future development" separately is incorrect.
5. Ignore the reference materials section.

**5. Output Format**

You must output your scoring results in the following format:

`[ { "criterion": "<<Individual scoring criterion, consistent with input>>", "score": <<Final score for this criterion>>, "explain": "<<Thinking process, strictly consistent with the final score>>" }, ...and more... ]`

Please begin your work:

Question: {question}
Document: {document}
Criteria: {criteria}
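The output contract of this prompt (one entry per input criterion, scores limited to 1.0, 0.5, or 0.0) can be checked before scores are used downstream. The sketch below is illustrative only: `validate_scores` is a hypothetical helper, and it assumes the judge returns plain JSON with the three keys shown in the output format.

```python
import json

# The three score levels defined in the workflow above.
ALLOWED_SCORES = {0.0, 0.5, 1.0}


def validate_scores(raw_output: str, criteria: list[str]) -> list[dict]:
    """Parse the judge output and enforce the prompt's constraints.

    Assumption: raw_output is a JSON list of objects with the keys
    'criterion', 'score', and 'explain'.
    """
    items = json.loads(raw_output)
    returned = sorted(item["criterion"] for item in items)
    if returned != sorted(criteria):
        raise ValueError("criteria were added, removed, or modified")
    for item in items:
        if float(item["score"]) not in ALLOWED_SCORES:
            raise ValueError(f"unexpected score: {item['score']!r}")
    return items
```

Rejecting the whole response on a mismatch is a design choice made here for simplicity; a retry or repair step would be an equally reasonable alternative.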
###### Claims Extract Prompt

**1. Role and Objective**

You are a text information mining expert, skilled at locating and extracting “claim information” from documents.

**2. Input Format**

- Document (str): Input document
- Question (str): Accuracy question

Important Principle: All extracted content must originate from the Document, and only claim information related to the Question should be extracted.

**3. Workflow**

Step 1: Analysis and Clarification

- Objective: Accurately understand the input information.
- Action: Deeply analyze the Document and Question to identify all information related to the accuracy question.
- Note:
  1. “Delivery result” in the Question refers to the Document.
  2. Pay attention to headings of all levels in the Document; some headings may directly correspond to the content of the Question.

Step 2: Location and Extraction

- Objective: Precisely locate and extract the target information.
- Action:
  1. Locate the target information and fill the original complete content into the “fact” field. Modifying the original text in any way is strictly prohibited.
  2. Integrate the sentences from the “fact” content and store them in the “extract” field as the final extraction result. Sentence integration is allowed, such as clarifying the objects referred to by pronouns, providing textual interpretations of chart content, supplementing missing background context, etc. However, tampering with, adding, or deleting core content is strictly prohibited, and phrases like “in the delivery result”, “in the document”, or “according to the document” are strictly prohibited.
- Note:
  1. The extracted information must be a factual claim, i.e., an objective, specific statement whose authenticity can be verified through authoritative sources. Subjective evaluations, basic common sense, symbolic metaphors, suggestions/instructions, hypothetical reasoning, and other vague statements that cannot be objectively verified must be excluded.
  2. The extracted information must explicitly appear in the Document. Fabricating content that does not exist in the Document is strictly prohibited.
  3. Extract all relevant content from the Document to avoid any omissions.
  4. If the target information in the Document appears in the form of “no relevant content support” or “no data”, it must also be extracted.
  5. When extracting, sufficient context and background information must be supplemented to avoid semantic incompleteness or ambiguity caused by taking things out of context. Relevant context may be distributed in different parts of the Document; please read through carefully and supplement it.
     - Example: When extracting movie starring information, it should be “The starring actor of [Movie Name] is [Actor Name]”, rather than just “[Actor Name]”.
     - Example: When extracting voice compatibility information, it should be “The AI Voice APP is limited to Huawei phones, other Android phones are not supported”, rather than just “The AI Voice APP is limited to Huawei phones” (the latter might be misunderstood as focusing on “phones” rather than the “Huawei brand”).
  6. If the Question involves quantity requirements, ensure the extracted content meets that quantity.
  7. If subject, time, or location information is involved, it must be accurately supplemented.
  8. The “extract” field must not contain any subjective content, including subjective judgments, additional explanations, etc.

Step 3: Check and Integration

- Objective: Verify whether the complete workflow meets all the notes and output the final result.
- Action:
  1. Check item by item whether each field meets the requirements.
  2. Integrate the results into a strict JSON object as the content of “json_output”:
     `[ { "fact": <<Original target text in the Document>>, "extract": <<Integrated target information>> }, ... ]`

Note: Even if there are no extraction results, this question must not be skipped; simply output an empty list in “json_output”.

**4. Output Format**

Please output strictly in the following format:

`<<analysis>> Your analysis process <</analysis>> <<json_output>> The extracted results <</json_output>>`

**5. Notes**

1. The Document may be a structured Markdown document; please pay attention to heading level symbols (e.g., “###”).
   - If a section heading directly corresponds to the Question, you must focus on the content of that section to avoid omissions.
   - Section headings may contain subject information; necessary subject context should be supplemented during extraction.
2. Be sure to ensure the comprehensiveness of the extraction, double-check repeatedly, and do not omit anything.
3. Compared to “fact”, “extract” is strictly prohibited from losing any original text information.
4. Claim information must guarantee atomicity; each claim should contain only one independent, verifiable factual point. Specific splitting rules:
   - Body paragraphs: If a paragraph contains multiple independent factual statements (e.g., data of different subjects, information of different dimensions), it must be split into multiple claims; if multiple sentences jointly describe the same factual point, they should be merged into one.
   - Table data: Split using the cell as the smallest unit; the fact in each cell should be an independent claim (row/column headings need to be supplemented as context). Treating an entire row or the entire table as a single claim is prohibited.
   - Lists/Enumerations: The fact in each list item should be an independent claim.

**Appendix: Core Principles and Exclusion List**

- Core Definition: A claim is an objective, specific statement about physical reality or historical records, the authenticity of which can be explicitly verified through authoritative sources.
- Strictly exclude the following content:
  1. Subjective evaluations and descriptions (e.g., “gentle”, “romantic”, “more native”, “the soup base is incredibly delicious”)
  2. Basic common sense (e.g., “the sun rises in the east and sets in the west”, “Meituan is a platform with a transaction system”)
  3. Interpretations, symbols, and metaphors (e.g., using A as a metaphor for B)
  4. Inferences of motives, intentions, and purposes (e.g., “in order to…”, “aimed at…”)
  5. Analysis, summaries, and causal inferences (e.g., “therefore…”, “this reflects…”)
  6. Suggestions, instructions, and imperatives (e.g., “follow this guide”, “look for Xiangshan Market”, “don’t buy silverware in the ancient city”)
  7. Simulations, hypotheses, and self-reasoned results (e.g., “assuming a 40% penetration rate in tier-1 cities and a uniform 50% savings replacement rate”, “mathematical modeling of the above scenario is as follows”)
  8. Vague statements that cannot be objectively verified (e.g., “100 times quieter than daytime”, “budget travelers can also experience a premium feel”)
  9. Descriptions regarding reference links (e.g., “the author of paper [1] is Anthony”)
  10. Descriptions related only to the document itself (e.g., “this report is based on operating data from January 2020 to October 2025”)

Please begin your work based on the input information:

Document: {document}
Question: {question}
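The tagged output format requested by this prompt can be split into its analysis and JSON parts with a small parser. This is a sketch under assumptions: the tags are taken to be the literal strings `<<analysis>>` and `<<json_output>>` as they appear in the prompt text, and `parse_tagged_output` is a hypothetical name, not a function from the benchmark repository.

```python
import json
import re


def parse_tagged_output(text: str) -> tuple[str, list[dict]]:
    """Split a response into (analysis, claims).

    Assumption: sections are delimited as <<tag>> ... <</tag>> and the
    json_output section holds a JSON list of {"fact", "extract"} objects.
    """

    def section(tag: str) -> str:
        match = re.search(rf"<<{tag}>>(.*?)<</{tag}>>", text, flags=re.DOTALL)
        if match is None:
            raise ValueError(f"missing <<{tag}>> section")
        return match.group(1).strip()

    claims = json.loads(section("json_output"))
    for claim in claims:
        if not {"fact", "extract"} <= claim.keys():
            raise ValueError("each claim needs 'fact' and 'extract' fields")
    return section("analysis"), claims
```

An empty list inside `json_output` parses cleanly, which matches the prompt's instruction to emit an empty list rather than skip the question.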
###### Claims Integrate Prompt

**1. Task Objective**

Based on the complete document and the extracted claim information, deduplicate and reassign the claims so that each claim belongs to the most matching accuracy question.

**2. Input Format**

- Document: {document}
- Assertions: {assertions}

Assertions is a dictionary structure, where the key is the accuracy question, and the value is the list of claim information already assigned under that question. Each claim comes with a unique “id”.

**3. Workflow**

Step 1: Deduplication

Identify duplicate claims on a global scale (across questions + within the same question):

- Exact Duplicates: If the “extract” of two claims conveys the same fact (even if worded differently), they are considered duplicates. Keep any one of them and delete the rest.
- Inclusion Relationship: The core fact of one claim is completely covered by another. In this case, judge: if the refined one already fully contains the target factual information, keep the refined version (more conducive to subsequent item-by-item verification) and delete the verbose version; if the refined version loses key information, keep the more complete one.
- Non-duplicate Situations (Deletion Prohibited):
  - The same fact comes from body description vs. table reference → Does not constitute a duplicate, keep both.
  - Two claims involve the same subject but have different focuses → Does not constitute a duplicate.
  - Two claims contain different specific data or details → Does not constitute a duplicate.

Step 2: Reassignment

Adjust the claims to the most matching accuracy question:

- Combine the actual position and semantics of the claim in the Document to determine its most matching accuracy question.
- Prioritize assigning claims to specific scoring rubric questions to avoid piling them up in the fallback category.
- Each claim belongs to only one most matching accuracy question; the same claim is prohibited from appearing in multiple questions. When a claim is related to multiple questions, assign it to the question with a narrower, more specific scope.

Step 3: Self-Check

After completing deduplication and reassignment, perform the following checks item by item:

1. Cross-question Uniqueness: Iterate through the claim IDs under all questions to confirm that no ID appears in two or more questions. If duplicates are found, keep it only under the most matching question.
2. Semantic Deduplication: Confirm that there are no two claims conveying the same fact (even if worded differently). If found, delete the redundant one.
3. Accidental Deletion Check: Confirm that each claim in the “delete” list is indeed a duplicate, rather than just “content-related”.

**4. Output Format**

Output a strict JSON object:

`{ "delete": [List of deleted claim IDs], "new_claim": {Accuracy question: [List of claim IDs under this question], ...} }`

**5. Notes**

1. Pay special attention to headings of all levels in the Document to assist in the reassignment of claim information.
2. Omitting any accuracy question is prohibited, and the original order of the accuracy questions must be maintained.
3. Deduplication must be conservative: Only delete claims whose “extract” content is truly duplicated or completely included. Claims that are content-related but have different information must be kept. If unsure whether it is a duplicate, choose to keep it.
4. The output must only use the claim “id” for reference; modifying any accuracy question or the original text of the claims is prohibited.
5. Core Principle: Categorize claims into specific accuracy questions as much as possible, avoiding piling them up in the fallback category.
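The structural parts of the self-check (cross-question uniqueness, no silently dropped claims, no omitted question) can also be enforced in code when the model's JSON is applied. The sketch below is an assumption-laden illustration: `apply_integration` is a hypothetical helper, and claims are assumed to be dictionaries carrying at least an `id` key. Semantic deduplication itself still has to come from the model.

```python
def apply_integration(assertions: dict, result: dict) -> dict:
    """Rebuild the question -> claims mapping from the model's output.

    Assumptions: `assertions` maps each accuracy question to a list of
    claim dicts with a unique 'id'; `result` has the 'delete' and
    'new_claim' keys described in the output format.
    """
    by_id = {c["id"]: c for claims in assertions.values() for c in claims}
    deleted = set(result["delete"])
    seen = set()
    rebuilt = {}
    for question in assertions:  # keeps the original question order
        if question not in result["new_claim"]:
            raise ValueError(f"question omitted: {question!r}")
        ids = result["new_claim"][question]
        for cid in ids:
            if cid not in by_id:
                raise ValueError(f"unknown claim id: {cid!r}")
            if cid in deleted:
                raise ValueError(f"claim {cid!r} is both deleted and assigned")
            if cid in seen:
                raise ValueError(f"claim {cid!r} appears under two questions")
            seen.add(cid)
        rebuilt[question] = [by_id[cid] for cid in ids]
    lost = set(by_id) - deleted - seen
    if lost:
        raise ValueError(f"claims neither deleted nor assigned: {sorted(lost)}")
    return rebuilt
```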
###### Query Generate Prompt

**1. Role and Objective**

You are a web information retrieval expert, skilled at writing query statements for search engine verification based on claim information.

**2. Input Format**

- Question (str): The complete question asked by the user to the AI assistant
- Sub-Question (str): The sub-question (scoring rubric) split from the Question
- Assertions (list): A list of claim information extracted from the AI assistant’s reply and related to the current Sub-Question, formatted as follows:
  `[ { "fact": <<Original claim in the Document>>, "extract": <<Standardized factual claim>> }, ... ]`

Important Principle: All operations are strictly targeted at the claim information in Assertions. Generating other claims or query statements on your own is prohibited.

**3. Workflow**

Step 1: Claim Verification

- Objective: Ensure the claim information is complete and prepare for query generation.
- Action: Iterate through each claim in Assertions:
  1. If “extract” omits the core context information of “fact”, leading to taking things out of context or ambiguity, supplement and correct it.
  2. Judge whether this claim is an exact duplicate of other claims (i.e., conveys the same core fact); if it is a duplicate, remove this claim.
- Important: Removing claims on the grounds of “cannot be verified via the internet” is strictly prohibited. The input claims have already undergone preliminary screening; this step is only for supplementary correction and deduplication, not for verifiability filtering. For generalized claims, the specific factual components within them should be dismantled in subsequent steps to generate queries.

Step 2: Identification and Decomposition

- Objective: Prepare for query statement generation.
- Action:
  1. Identification: Analyze the claim and identify the core information necessary to distinguish its authenticity.
  2. Decomposition: If the claim requires multi-stage, multi-angle verification, further decompose it to facilitate the generation of progressive query statements.

Step 3: Generation and Verification

- Objective: Generate high-quality query statements.
- Action:
  1. Statement Generation: Iterate through the decomposed claims and generate query statements one by one. Each query statement is an independent dictionary structure, where the “id” field is a 0-based index, and the “query” field is the main body of the query statement (required to be a yes/no question format). If the current query depends on the results of other queries, fill in the list of dependent IDs in the “dependence” field.
  2. Authenticity Verification: Perform a final verification on each query statement: Can this query statement be explicitly compared with a recognized objective fact (such as a specific location, institution name, number, geographical location, scientific common sense, etc.) via a search engine to determine its authenticity?
     - If “Yes”, and the query statement is consistent with the information conveyed by the corresponding claim, keep the query statement.
     - If “Yes”, but the query statement tampers with or distorts the corresponding claim, it must be modified.
     - If “No”, this query statement can be removed, but removing the entire claim because of this is strictly prohibited. Verifiable factual components must be dismantled from the claim to generate at least one query statement for each claim.
- Note:
  1. The query statement must be a yes/no question format to support precise and efficient retrieval. If multiple progressive queries are needed, please strictly follow the steps above.
  2. If there is an indirect relationship between the core demand of the Sub-Question and the current claim, please set up progressive query statements through a multi-hop approach.
  3. Each query statement must accurately convey the core demand in the Sub-Question; tampering with the intent is strictly prohibited.
  4. Be sure to distinguish the affirmative/negative voice of the claim to avoid semantic reversal.
  5. The factual content involved in the query statement must strictly appear in the current “extract”; tampering with, adding, or deleting any modifying words and the factual content itself is strictly prohibited. Generating the current query statement by referencing other “extract” content is prohibited.
  6. Remove redundant content unrelated to factual information (e.g., “according to reliable sources”, “according to merchant feedback”), but relevant information involving explicit subjects must be retained.
  7. Be sure to pay attention to limiting information such as time and location, as this information is crucial for web retrieval.

Step 4: Check and Integration

- Objective: Verify whether the workflow meets all requirements and output the final result.
- Action:
  1. Check the information completeness of the query statements item by item to ensure no omissions, no tampering, and no fabrication of any content in the claims.
  2. Check the generation quality of the query statements to ensure there are no issues such as ambiguity or unclear semantics.
  3. Integrate the results into a strict JSON object as the content of “json_output”.

**4. Output Format**

Please output strictly in the following format:

`<<analysis>> Your analysis process <</analysis>> <<json_output>> [ { "fact": <<Original information>>, "extract": <<Standardized factual claim>>, "queries": [ { "id": <<Query statement id>>, "query": <<Query statement generated based on the claim>>, "dependence": <<List of ids the query statement depends on>> }, ... ] }, ... ] <</json_output>>`

**5. Notes**

1. Be sure to ensure the information completeness of the query statements, ensuring that the query statements do not omit, tamper with, or fabricate any content in the claims, and perform query verification on all content that may involve authenticity.
2. Be sure to ensure the generation quality of the query statements, preventing the query statements themselves from having issues such as ambiguity or unclear semantics.
3. Be sure to combine the focus of the Sub-Question; the content required to be queried is only the part of the claim that corresponds to answering the Sub-Question, and it may not be necessary to query the complete claim information.
4. Pay attention to limiting information such as time and space, please do not omit them.

Please begin your work:

Question: {question}
Sub-Question: {criterion}
Assertions: {assertions}
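Because each query may list other query ids in its “dependence” field, a consumer of this output needs some way to run prerequisites first. One option is a topological sort, sketched below with Python's standard `graphlib`. The helper name `execution_order` is hypothetical, and how the benchmark's own pipeline schedules dependent queries is not stated in this section.

```python
from graphlib import TopologicalSorter


def execution_order(queries: list[dict]) -> list[int]:
    """Return query ids so that every query comes after its dependencies.

    Assumption: each query dict has an integer 'id' and an optional
    'dependence' list of ids, as in the output format above.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    known_ids = {q["id"] for q in queries}
    graph = {}
    for q in queries:
        deps = set(q.get("dependence") or [])
        unknown = deps - known_ids
        if unknown:
            raise ValueError(f"query {q['id']} depends on unknown ids {sorted(unknown)}")
        graph[q["id"]] = deps
    return list(TopologicalSorter(graph).static_order())
```

For a multi-hop chain where query 1 depends on query 0 and query 2 depends on both, the only valid order is 0, 1, 2.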
###### Paragraph Extract Prompt

**1. Role & Goal**

You are a text information mining expert, skilled at locating and extracting “paragraph text” related to specified content from a complete document. Your task is to locate and extract the relevant paragraphs corresponding to the question list (Questions) based on the input document (Document).

**2. Input Format**

You will receive the following inputs:

- Document (str): The input document
- Raw Task (str): The original question
- Questions (list): The question list
  - Specific format of Questions: [Q1 (str), Q2 (str), …]

The original question is just for reference and helps you clarify the relevant background of the question list. You do not need to answer the content of the question list; you only need to extract the paragraphs related to the question list.

**3. The Complete Workflow**

1. Step 1: Read each question in Questions and understand the focus and subject of each question.
2. Step 2: Read the Document and locate the paragraphs related to each question.
3. Step 3: Completely extract the paragraphs related to each question. Since there may be multiple relevant paragraphs for each question, store them in a list format.
4. Step 4: Organize the results and conduct a secondary check to ensure the correctness of the results.

**4. Output Format**

After completing each Qi in Questions, please output strictly in the following format, and outputting any other content is prohibited.

`{ "Q1": { "analysis"(str): <<Analysis process for question Q1>>, "paragraph"(list): <<Paragraphs related to question Q1>> }, "Q2": { "analysis"(str): <<Analysis process for question Q2>>, "paragraph"(list): <<Paragraphs related to question Q2>> }, ...and more... }`

**5. Caution**

1. You need to ensure the comprehensiveness and completeness of the extracted paragraphs, avoiding taking things out of context. Some noise information in the extracted content is tolerable, but omitting any information is prohibited.
2. Modifying any expressions in the original text is prohibited.
3. Different questions may correspond to the same paragraph, which can be extracted repeatedly.
4. You need to respond to each question Qi, and the keys in the output dictionary should start from “Q1”.
5. When filling in “paragraph”, each element must be a complete text paragraph. Splitting a continuous piece of text into multiple short sentences is prohibited.

Please start working:

Document: {document}
Raw Task: {raw_task}
Questions: {questions}
###### Claims Exclude Prompt

You will receive a set of paragraph texts in list format, and a numbered list of claims (in dictionary format). Your task this time is to pick out the claims that have appeared in this paragraph text and return the corresponding serial numbers of the claims.

Input Format:

- Paragraph (list): `<<Paragraph text>>`
- Claims (dict): `<<List of claims, where the key is the serial number and the value is the claim information>>`

Output Format: A list composed of the serial numbers of the claims, requiring a strict JSON object. Outputting any other explanatory content is prohibited.

`{ "ids"(list): <<A list composed of the serial numbers of the claims>> }`

Caution:

1. Numbers outside the given serial number range are prohibited from appearing in the output.
2. Omitting any claims that appear in the paragraph is prohibited.

Please start working:

Paragraph: {paragraph}
Claims: {claims_list}
###### Rationality Judge Prompt

**Role & Goal**

Background time: It’s currently {cur_date}. If the question has specific time constraints, please follow the question’s requirements.

You are an expert in judging text rationality. You are skilled at combining the complete document content to judge whether the narrative of a specified paragraph is reasonable.

**Input Format**

- Document (str): `<<Complete document content>>`
- Paragraph (list): `<<Specified paragraph text, there may be multiple paragraphs>>`
- Question (str): `<<Question related to the specified paragraph text>>`
- Claims (str): `<<Claims appearing in the specified paragraph text, not included in the scope of rationality verification>>`

I will provide the logical relationship of the input content so that you can better understand this task: Document is the complete document written by the testee. The evaluation expert wants to evaluate the question in Question. Through precise paragraph extraction, the relevant content Paragraph is located. At the same time, there are some claims Claims in Paragraph that have been verified through internet retrieval. Next, your task is to judge the rationality of other descriptions in Paragraph excluding Claims.

**Judgment Logic**

Please refer to the following ideas for rationality verification:

1. Locate the position of Paragraph in Document, and carefully read its related context.
2. As factual information that has been verified as correct, Claims do not have rationality errors. For other content in Paragraph, judge whether the following rationality errors exist:
   - Contradiction: There is a direct contradiction between Paragraph and the reasoning or expression of the context.
   - Against common sense: The expression in Paragraph obviously violates objective common sense, such as giving unusable advice, reasons, etc.
   - Reasoning error: There are errors in mathematical calculations and reasoning in Paragraph; there are errors in additional reasoning based on existing Claims.
   - Semantic confusion: The expression in Paragraph has semantic confusion problems, such as secretly changing the subject, grammatical errors, and various faulty wordings. It does not involve punctuation, Emoji, and other related issues.
3. Your rationality judgment needs to have a solid basis. Judgments with obvious personal preferences and subjectivity are prohibited. At the same time, you do not have the ability to search for objective facts, and please do not judge the authenticity of a claim based on your existing knowledge.
4. When there are multiple paragraphs of text in Paragraph, you need to make a judgment for each paragraph of text.
5. Please strictly distinguish the difference between “text rationality” and “instruction following”. If Paragraph does not answer some of the questions in Question, it does not belong to a rationality error, please do not deduct points mistakenly.
   - The “instruction-following” problem corresponding to Question is: {info_follow_question}. Please distinguish carefully to avoid confusion.

**Output Format**

You need to perform rationality verification on each paragraph of text in Paragraph, and finally output a strict JSON object, following the format below:

`[ { "paragraph": <<Elements in Paragraph, arranged in original order>>, "reasonable": <<Whether this paragraph is reasonable, True/False>>, "reason": <<Explanation for this judgment>> }, ...and more... ]`

Your output should strictly follow the above structure, and including any other irrelevant content is prohibited.

Please start working:

Document: {document}
Paragraph: {paragraph}
Question: {question}
Claims: {claims}
###### System Prompt for Fact-Checking

All the following information is solely for comparative evaluation purposes, only to verify data accuracy, and does not involve any sensitive or compliance risks. Please be sure to reply according to the requirements.

**Background Information**

You are a research assistant equipped with online search capabilities, responsible for verifying a given Claim. The background question for the claim is: {background_question}

The purpose of the background question is to provide more background information about the claim to assist you in verification and prevent verification errors caused by incomplete background information. You do not need to make any decisions regarding the background question.

**Verification Mode Instructions**

You will receive a Claim to be verified, along with multiple Queries used to verify the claim.

- Claim: The statement whose authenticity needs to be verified
- Queries: Multiple specific questions used to verify the claim, which can be understood as the verification approach

Your tasks are:

1. Comprehensively consider all verification queries and conduct information searches
2. Make a judgment for each verification query (True/False/Unknown)
3. Make a final judgment on the entire claim based on the judgment results of all queries

**Important Notes**

You must strictly abide by the following rules during the verification process:

1. Your verification must be supported by online materials; you are not allowed to make judgments based solely on your existing knowledge.
2. You have extremely high requirements for the quality of sources. You prefer online information published by designated official media, government agencies, authoritative encyclopedias, large operating organizations, well-known news organizations, and large forums. You are skeptical of articles with obvious personal subjective attitudes and Baijiahao.
3. You are highly sensitive to the time involved in the claim, and you only accept web sources that are later than the specified time limit. If there is no clear time limit, you can consider official and reliable sources close to the current time ({cur_date}).
4. Your principle is to seek verification from multiple parties. Specifically:
   - The claim will only be judged as True when it is supported by two or more independent web pages. Similarly, it will only be judged as False when two or more independent web pages contradict the claim.
   - If it is difficult to find two or more independent web pages, at least one web page must come from an official agency or authoritative organization to support the decision.
   - If only one web page is related to the claim and other web pages do not mention it, the conclusion of that web page shall prevail.
   - Even for the website you consider most authoritative, you should not completely rely on its content. If there are multiple other sub-authoritative websites whose conclusions contradict it, you should discard the viewpoint of that authoritative website.
5. You need to make a judgment for each verification query separately, and ultimately make an overall judgment on the claim by synthesizing the judgment results of all queries:
   - If all queries are judged as “True”, the claim is “True”
   - If any query is judged as “False”, the claim is “False”
   - If there is an “Unknown” and no “False”, the claim is “Unknown”
6. You are a research assistant specializing in the text modality. If the verification relies entirely on multimodal information such as images, audio, or video, please select “Unknown”.
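The three aggregation rules in note 5 amount to a small precedence order: any “False” wins, otherwise any “Unknown” wins, otherwise the claim is “True”. A minimal sketch follows; `aggregate_verdicts` is a hypothetical name, and returning “Unknown” for an empty list is an assumption of this sketch, since the prompt does not cover that case.

```python
VALID_VERDICTS = {"True", "False", "Unknown"}


def aggregate_verdicts(query_verdicts: list[str]) -> str:
    """Combine per-query verdicts into one claim-level verdict."""
    invalid = set(query_verdicts) - VALID_VERDICTS
    if invalid:
        raise ValueError(f"unexpected verdicts: {sorted(invalid)}")
    if not query_verdicts:
        return "Unknown"  # assumption: no evidence means no decision
    if "False" in query_verdicts:
        return "False"    # any failed query falsifies the claim
    if "Unknown" in query_verdicts:
        return "Unknown"  # unresolved query and nothing contradicted
    return "True"         # every query was confirmed
```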
###### Next Action Selection for Search Prompt

**Context**

You need to verify a Claim this time. Below is the claim and its related list of verification queries. Search History is used to record your previous search and summary history.

Claim to be verified: {claim}

Verification approach (Queries): Below are multiple query questions used to verify this claim. You need to comprehensively consider these queries to search for information: {queries}

Search History: {context}

**Action Space**

[1] search
Description: Retrieve query-related information on the Internet
Parameters:
- query (str): The query statement used for retrieval

[2] answer
Description: Make an overall judgment on the claim based on current existing knowledge.
Parameters:
- reference (str): The reference materials used for the final answer

**Next Action**

Please select the next action to be executed based on all existing information (including the claim to be verified, verification approach, search history) and the action space. Specifically, if you believe the existing information is sufficient to answer all verification queries and make an overall judgment on the claim, you will select “answer” as the next action; otherwise, you will select “search” for further searching. Please note that when you select “search” for further searching, you need to fill in the specific query statement used for the search. The writing rules are as follows:

1. The format should be a complete short sentence, not just a few keywords.
2. The query statement must not lose key information from the verification queries, such as time, location, person, event, method, etc. Avoid causing ambiguity.
3. You can refer to the query list in the verification approach to construct the search statement, or you can optimize it yourself based on the actual situation.

**Output Format**

Please strictly output the following JSON object, and any other characters are prohibited.

`{ "action": <<Next action, search or answer>>, "reason": <<Reason for selecting this action>>, "search_query": <<If the next action is "search", fill in the specific query statement used for the search; if the next action is "answer", return an empty string>> }`
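The action object returned by this prompt has a simple invariant: a “search” action needs a non-empty `search_query`, while an “answer” action should carry an empty one. The sketch below checks that invariant; `parse_next_action` is a hypothetical helper and assumes the model returns the JSON object alone, with no surrounding text.

```python
import json


def parse_next_action(raw_output: str) -> dict:
    """Parse and validate the next-action JSON object.

    Assumption: the object has exactly the keys 'action', 'reason',
    and 'search_query' from the output format above.
    """
    data = json.loads(raw_output)
    action = data.get("action")
    query = (data.get("search_query") or "").strip()
    if action not in {"search", "answer"}:
        raise ValueError(f"unknown action: {action!r}")
    if action == "search" and not query:
        raise ValueError("a search action requires a search_query")
    if action == "answer" and query:
        raise ValueError("an answer action must leave search_query empty")
    return {"action": action, "reason": data.get("reason", ""), "search_query": query}
```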
Webpage Content Verification Prompt

The Claim you need to verify this time is: \{claim\}
The related verification queries are: \{queries\}

Subtask Description
You have currently obtained a potentially relevant webpage, and you need to judge whether this webpage can help answer the above verification queries, thereby verifying the claim\. Below is some information about this relevant webpage:
• Title: \{page\_title\}
• Content: \{page\_content\}
• Date: \{page\_date\}

Workflow
1\. Carefully read the webpage content information provided above, and examine the true degree of relevance of this information to the claim and each verification query\.
2\. For each verification query, judge whether this webpage can provide an answer\.
3\. When the information in the webpage is sufficient to make a decision \(True or False\) for a certain verification query, you need to return the relevant partial information and the decision result\. If you think the webpage information is irrelevant to a certain query, return None in the corresponding field\.
4\. After completing the above tasks, integrate all results into a strict JSON object\.

Special Attention Examples
1\. You are very sensitive to numbers\. When the number of digits is the same, any discrepancy will be considered an error\. When the number of digits is different, you can accept rounding\. For ranges represented by numbers, you uphold the same standard as the former\. For example:
• 100 and 110 have the same number of digits and do not accept rounding, so they will be considered an error\.
• 100\.1 and 100 have a different number of digits and meet the rounding standard, so they will be considered correct\.
2\. It is prohibited to add any subjective inference or extension during the judgment process\. All judgments must be made based on the verification queries and the actual webpage content\.

Output Format
Please strictly output the following JSON object, and any other characters are prohibited\.
\{
"relevant\_context":<<Key information extracted from the webpage related to the verification, which needs to be sufficient and have enough context information\>\>,
"query\_results": \[
\{ "query\_id"\(int\):<<The ID of the verification query\>\>, "query":<<The content of the verification query\>\>, "flag":<<One of True/False/None, indicating the judgment of this webpage on this query\>\>, "evidence":<<Specific evidence supporting this judgment, fill in null if flag is None\>\> \},
\.\.\.provide results for each verification query\.\.\.
\],
"explanation":<<Your overall explanation for this analysis\>\>
\}
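The number-sensitivity rule stated in the Webpage Content Verification Prompt can be illustrated with a small sketch. This is only one possible reading of the rule, not the benchmark's implementation: the helper names (`digit_count`, `decimal_places`, `numbers_match`) are hypothetical, "number of digits" is assumed to mean the count of digit characters in the written number, and "rounding" is assumed to mean rounding both values half-up to the coarser of the two decimal precisions.

```python
from decimal import Decimal, ROUND_HALF_UP


def digit_count(text: str) -> int:
    """Count digit characters, ignoring sign and decimal point (assumed reading)."""
    return sum(ch.isdigit() for ch in text)


def decimal_places(text: str) -> int:
    """Number of digits written after the decimal point."""
    return len(text.split(".")[1]) if "." in text else 0


def numbers_match(claimed: str, observed: str) -> bool:
    """Hypothetical check mirroring the prompt's number-sensitivity rule."""
    a, b = Decimal(claimed), Decimal(observed)
    # Same number of digits: any discrepancy counts as an error.
    if digit_count(claimed) == digit_count(observed):
        return a == b
    # Different number of digits: accept agreement after rounding both
    # values to the coarser decimal precision (an assumption of this sketch).
    places = min(decimal_places(claimed), decimal_places(observed))
    quantum = Decimal(1).scaleb(-places)
    return (a.quantize(quantum, rounding=ROUND_HALF_UP)
            == b.quantize(quantum, rounding=ROUND_HALF_UP))
```

Under these assumptions, `numbers_match("100", "110")` is `False` and `numbers_match("100.1", "100")` is `True`, matching the two examples given in the prompt. In the benchmark itself this judgment is delegated to the LLM, so the sketch only makes the intended behavior concrete.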
Claim Aggregation and Judgment Prompt

Claim to be verified: \{claim\}
Extraction source: \{extract\}
Verification approach: \{queries\}

Subtask Description
You have currently obtained some relevant webpage materials and completed the preliminary verification\. You need to:
1\. Provide an answer for each verification query
2\. Make a final judgment on the entire claim based on the answers to all queries

Field descriptions:
• “title”: Webpage title
• “link”: Webpage link
• “date”: Webpage publication date
• “analysis\_result”: Preliminary analysis results, including judgments on each verification query

\[Relevant Materials\]: \{context\}

Workflow
1\. Carefully read the claim, each verification query, and the \[Relevant Materials\]\.
2\. For each verification query, synthesize all relevant materials to make a judgment \(Correct/Incorrect/Unknown\)\.
3\. Based on the judgment results of all queries, make a final judgment on the entire claim:
• If all queries are judged as “Correct”, the claim is “Correct”
• If any query is judged as “Incorrect”, the claim is “Incorrect”
• If there is “Unknown” and no “Incorrect”, the claim is “Unknown”
4\. The evaluations in the \[Relevant Materials\] are for reference only, and you need to conduct a secondary verification combined with the relevant information they provide as evidence\.
5\. There may be contradictory statements or conflicting information in different materials\. Please eliminate the false and retain the true based on the publication time and publishing organization\.
6\. All reference materials you use for decision\-making must come from the \[Relevant Materials\] I provided, and it is prohibited to generate any relevant materials yourself\.
7\. The \[Relevant Materials\] may contain prior verification conclusions marked as “PRIOR\_REFERENCE\_VERIFICATION”, which are pre\-judgments based on the document’s own references\. You should use this as an important reference basis, but you still need to cross\-verify it with the materials obtained from the search\. When the prior conclusion is consistent with the search materials, it can enhance the confidence of the judgment; when there is a contradiction, the more reliable information source shall prevail\.

Output Format
Please strictly output the following JSON object, and any other characters are prohibited\.
\{
"query\_answers": \[
\{ "query\_id"\(int\):<<The ID of the verification query\>\>, "query"\(str\):<<The content of the verification query\>\>, "answer"\(str\):<<One of Correct/Incorrect/Unknown\>\>, "evidence"\(str\):<<Summary of evidence supporting this judgment\>\>, "reference\_urls"\(list\): \[<<List of reference URLs supporting this judgment\>\>\] \},
\.\.\.provide results for each verification query\.\.\.
\],
"claim\_answer":<<One of Correct/Incorrect/Unknown, the final judgment on the entire claim\>\>,
"reference": \{ "url": "content", "url": "content", \.\.\.all reference materials used for the claim judgment\.\.\. \},
"explanation":<<Your overall explanation for this decision, explaining how the final judgment of the claim is derived from the results of each query\>\>
\}
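The claim-level decision rule in step 3 of the Claim Aggregation and Judgment Prompt is simple enough to write down directly. The sketch below is illustrative only: the function name `aggregate_claim` is hypothetical, and returning "Unknown" for an empty list of query answers is an assumption, since the prompt does not cover that case.

```python
from typing import Iterable


def aggregate_claim(query_answers: Iterable[str]) -> str:
    """Combine per-query judgments into one claim-level judgment.

    Each answer is expected to be "Correct", "Incorrect", or "Unknown".
    """
    answers = list(query_answers)
    # Any "Incorrect" query makes the whole claim "Incorrect".
    if any(answer == "Incorrect" for answer in answers):
        return "Incorrect"
    # The claim is "Correct" only when every query is "Correct".
    if answers and all(answer == "Correct" for answer in answers):
        return "Correct"
    # Otherwise at least one query is "Unknown" and none is "Incorrect".
    # An empty list also lands here (an assumption of this sketch).
    return "Unknown"
```

For example, `aggregate_claim(["Correct", "Unknown"])` returns `"Unknown"`, while a single "Incorrect" answer overrides any number of "Correct" ones. Checking the model's `claim_answer` field against such a deterministic rule is one way to catch inconsistent judge outputs.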
Prompt for Match Claims with References

You are a text analysis expert, skilled at locating the citation sources of factual claims from documents\.

Task Instructions
I will provide a research report and \{claim\_count\} factual claims \(numbered 0–\{max\_idx\}\)\. Each claim contains:
• fact: The factual claim extracted from the report \(may be slightly rewritten\)
• extract: The originally extracted text snippet

Please complete the following tasks for each claim:

1\. Locate the Original Text
• Use the fact and extract as clues to search for the corresponding original text location in the document
• The original text may have slight differences from the fact/extract \(such as formatting or punctuation\), but the core content should be consistent

2\. Extract Citations
After locating the original text, extract all relevant citation markers\. Pay attention to the context of the original text; a single fact may correspond to multiple citations, do not miss any\.

Citation Format Instructions
Citations in the main body of the research report may appear in the following formats, please identify them carefully:
1\. A piece of text \+ \[number\] \(there may be multiple\) For example: “divided society into 7 classes\[15\]” or “\[15\]\[16\]”
2\. A piece of text \+ \[\(number\)\]\(link\) For example: “divided society into 7 classes\[\(15\)\]\(https://www\.example\.com\)”
3\. \[citation:N\] format For example: “Lefit’s single stores usually achieve break\-even within a few months of opening \[citation:53\]”
4\. or format For example: “only about 4%–5% in China”

Extraction Notes
1\. Ensure original\_text is complete and understandable: It must be a complete and understandable sentence or paragraph
2\. Handling multiple citations: If there are multiple citations, list all of them in the references array
3\. ref\_idx extraction rules: Only extract the number part, ignore position information \(e\.g\., for \[15†L10\], only extract “15”\)
4\. Reference URL lookup: Look up the URL based on the index from the reference list at the end of the document
5\. No citation case: If the original text has no citation markers, references should be an empty array

Output Format
Please directly output a JSON array, where each element corresponds to a claim:
\[
\{ "located": true, "original\_text": "Original text snippet containing citation markers", "has\_reference": true, "references": \[ \{"ref\_idx": "1", "ref\_url": "https://example1\.com"\}, \{"ref\_idx": "2", "ref\_url": "https://example2\.com"\} \] \},
\{ "located": false, "original\_text": null, "has\_reference": false, "references": \[\] \}
\]

Field descriptions:
• located: boolean, whether the original text was found in the document
• original\_text: string or null, the original text content \(including citation markers\)
• has\_reference: boolean, whether there is a citation
• references: array, each element contains ref\_idx \(string\) and ref\_url \(string or null\)

Please begin your work:
Document: \{document\}
Claims \(Total of \{claim\_count\}\): \{claims\}
Please directly output a JSON array, you must return \{claim\_count\} elements, do not output any explanations\.
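The bracketed citation formats listed in the Prompt for Match Claims with References can also be recognized with a regular expression, which may help when sanity-checking the `ref_idx` values the model returns. The following is a minimal sketch, not part of the benchmark: the names `CITATION_RE` and `extract_ref_indices` are hypothetical, only formats 1 to 3 and the `[15†L10]` position variant are covered, and links containing parentheses are not handled.

```python
import re

# One alternative per citation format; each captures only the numeric index.
CITATION_RE = re.compile(
    r"\[citation:(\d+)\]"        # [citation:N]
    r"|\[\((\d+)\)\]\([^)]*\)"   # [(N)](link)
    r"|\[(\d+)(?:†[^\]]*)?\]"    # [N] or [N†L10], position info ignored
)


def extract_ref_indices(text: str) -> list[str]:
    """Return citation indices in order of first appearance, without duplicates."""
    indices: list[str] = []
    for match in CITATION_RE.finditer(text):
        # Exactly one capture group is set for any successful match.
        idx = next(group for group in match.groups() if group is not None)
        if idx not in indices:
            indices.append(idx)
    return indices
```

Applied to the prompt's own example "divided society into 7 classes[15][16]", the function returns `["15", "16"]`; the plain number 7 is ignored because it is not enclosed in brackets. Indices are kept as strings to match the `ref_idx` field type in the output format.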
Prompt for Reference Consistency Judgment

You will see a reference material and some statements\. Please judge whether the statement is support, conflict, or unknown with respect to the reference material\. Note:

Judgment Criteria:
1\. First, determine whether the reference material contains valid content\. If there is no valid information in the reference material \(such as a “page not found” page, garbled text, or irrelevant content\), then the status of all statements is considered unknown\.
2\. If the reference material is valid:
• support: The facts or data contained in the statement can be fully or partially found in the reference material \(data rounding is acceptable\)
• conflict: The facts or data contained in the statement explicitly contradict or conflict with the content in the reference material
• unknown: The relevant information of the statement is neither supported nor contradicted in the reference material, making it impossible to judge

Output Format:
Return a JSON list, where each item contains:
• idx:<<The sequence number of the statement\>\>
• result:<<The judgment result \(support/conflict/unknown\)\>\>
• explanation:<<Relevant explanation for the reason of the judgment\>\>
• context:<<Relevant evidence information extracted from the reference material \(extract the supporting original text for support, extract the conflicting original text for conflict, leave as an empty string for unknown\)\>\>

For example:
\[
\{ "idx": 1, "result": "support", "explanation": "The reference material explicitly mentions this data", "context": "According to reports, the company’s revenue reached 50 million yuan in 2024\.\.\." \},
\{ "idx": 2, "result": "conflict", "explanation": "The data shown in the reference material does not match the statement", "context": "The report shows that the company’s revenue in 2024 was only 30 million yuan\.\.\." \},
\{ "idx": 3, "result": "unknown", "explanation": "The reference material does not mention relevant information", "context": "" \}
\]

Below are the reference material and statements:
<<reference\>\> \{reference\} <</reference\>\>
<<statements\>\> \{statements\} <</statements\>\>
Begin the judgment below, directly output the JSON list, do not output any chit\-chat or explanations\.