Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
Summary
This paper introduces REFLECT, a meta-evaluation benchmark for assessing the reliability of LLM judges in evaluating deep research agents. Experiments show current LLM judges remain unreliable, with overall accuracies below 55% across reasoning, tool-use, and report-quality failures.
# Can We Trust LLM Judges for Evidence-based Research Agents?
Source: [https://arxiv.org/html/2605.19196](https://arxiv.org/html/2605.19196)
Leyao Wang1,♡,†, Yanan He1,♡,†, Peng Chen1,†, Asaf Yehudai2,†, Yixin Liu1, Rex Ying1, Michal Shmueli-Scheuer2, Arman Cohan1,†

1 Yale University, 2 IBM Research

{leyao.wang.lw855, yanan.he, peng.chen.pc838, yixin.liu, rex.ying, arman.cohan}@yale.edu, Asaf.Yehudai@ibm.com, shmueli@il.ibm.com

♡ Joint first authors. † Core contributors.
###### Abstract
Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce Reflect (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. Reflect defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.

Footnote: See full author contributions [here](https://arxiv.org/html/2605.19196#S6).

## 1 Introduction
Deep research agents are increasingly important for automating complex information\-seeking tasks\. They can investigate open\-ended questions through browser interaction, reasoning, and synthesis, ultimately producing evidence\-grounded long\-form reports\[[34](https://arxiv.org/html/2605.19196#bib.bib34),[23](https://arxiv.org/html/2605.19196#bib.bib23),[52](https://arxiv.org/html/2605.19196#bib.bib52),[43](https://arxiv.org/html/2605.19196#bib.bib43)\]\. As these agents are increasingly used in realistic research workflows, rigorous evaluation becomes essential, motivating recent benchmarks that assess long\-form report generation, research\-tools integration and research\-process quality\[[6](https://arxiv.org/html/2605.19196#bib.bib6),[17](https://arxiv.org/html/2605.19196#bib.bib17),[5](https://arxiv.org/html/2605.19196#bib.bib5),[49](https://arxiv.org/html/2605.19196#bib.bib49),[54](https://arxiv.org/html/2605.19196#bib.bib54)\]\. However, evaluation remains challenging: the final report is long\-form and knowledge\-intensive, making cited sources difficult to verify; and the execution trajectory is multi\-step, open\-ended, and difficult to audit, making it hard to assess whether a fluent report truly reflects sound retrieval and well\-supported claims\.
Figure 1: Data distribution of Reflect across reasoning-process ($N=140$), tool-use ($N=132$), and outcome-level ($N=200$) error types. The outer rings represent the high-level failure dimensions of deep research agents and their corresponding proportions, while the inner rings break each dimension down into fine-grained error types defined by our taxonomy, which is summarized from prior work (see Table [4](https://arxiv.org/html/2605.19196#A1.T4)) and further verified through case studies of natural rollouts (see Appendix [D](https://arxiv.org/html/2605.19196#A4)).

Such challenges make human evaluation over full research trajectories costly and infeasible at scale, motivating LLM-as-judge as a scalable supervision paradigm for assessing report quality, tool integration, and intermediate reasoning processes \[[63](https://arxiv.org/html/2605.19196#bib.bib63),[26](https://arxiv.org/html/2605.19196#bib.bib26),[7](https://arxiv.org/html/2605.19196#bib.bib7),[6](https://arxiv.org/html/2605.19196#bib.bib6),[5](https://arxiv.org/html/2605.19196#bib.bib5),[17](https://arxiv.org/html/2605.19196#bib.bib17),[54](https://arxiv.org/html/2605.19196#bib.bib54)\]. Related work further uses LLM judges or reward models to supervise search behavior, step-level reasoning, and citation-aware training signals \[[58](https://arxiv.org/html/2605.19196#bib.bib58),[44](https://arxiv.org/html/2605.19196#bib.bib44),[59](https://arxiv.org/html/2605.19196#bib.bib59),[18](https://arxiv.org/html/2605.19196#bib.bib18),[61](https://arxiv.org/html/2605.19196#bib.bib61),[45](https://arxiv.org/html/2605.19196#bib.bib45)\]. Yet the reliability of these judges when evaluating deep research agents remains poorly understood, posing a critical meta-evaluation problem \[[27](https://arxiv.org/html/2605.19196#bib.bib27)\]: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves.
However, existing meta-evaluation protocols are ill-suited for assessing judge reliability in deep research agent settings. Prior work validates automated judges by measuring agreement with human ratings, rankings, or pairwise preferences over model outputs \[[20](https://arxiv.org/html/2605.19196#bib.bib20),[10](https://arxiv.org/html/2605.19196#bib.bib10),[5](https://arxiv.org/html/2605.19196#bib.bib5),[55](https://arxiv.org/html/2605.19196#bib.bib55)\]. This paradigm leaves three critical gaps for evidence-based research agents: (1) Coarse and subjective labels. Overall preferences indicate which output humans favor, but shed little light on which specific failures a judge detects or misses. (2) Absence of ground truth in open-ended tasks. Prior meta-evaluation targets settings with verifiable answers, such as mathematics, coding, or factual QA. Deep research agents instead operate in open-ended settings with no single correct answer or canonical trajectory, making reliable labels difficult to construct for retrieval, tool use, reasoning, and synthesis. (3) Insufficient coverage of process-level execution. Existing protocols assess judges against coarse human judgments over final outputs, offering limited insight into whether LLM judges can detect process-level failures such as poor evidence gathering or tool misuse.
To address these gaps, we introduce Reflect (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection of LLM judges for non-verifiable agentic execution. Reflect offers three key advantages: (1) Verifiable ground-truth labels: instead of relying on subjective human preferences, we make controlled, localized interventions on quality-screened agent trajectories and reports, making labels objective and directly verifiable by construction. (2) Comprehensive and realistic failure coverage: perturbations are drawn from a taxonomy of realistic failures spanning both process- and outcome-level errors in reasoning, tool use, evidence gathering, and synthesis. (3) Fine-grained diagnostic signal: by reframing meta-evaluation as failure detection with known failure types and locations, Reflect enables precise identification of judge blind spots and systematic comparison between fine-grained and holistic evaluation paradigms.
Using Reflect, we evaluate a range of LLM judges under both holistic and fine-grained, step-level evaluation. Our experiments reveal major reliability gaps in current LLM judges: judges fail in different ways, and no single aggregate score captures overall reliability. Fine-grained evaluation is more effective than holistic scoring, particularly for macro-level structural failures that require cross-stage tracing. Overall, Reflect exposes overlooked failure types and vulnerable components, offering guidance for improving judge prompts and protocols for more reliable agentic research systems. Our contributions are threefold:
1. We introduce Reflect, the first comprehensive and fine-grained meta-evaluation benchmark for assessing LLM judges on deep research agent execution traces and reports, converting judge evaluation from subjective, coarse preference matching into targeted failure detection.
2. We construct the benchmark using controlled, localized interventions based on a comprehensive error taxonomy of deep research agents, producing instances with specific failure types and verifiable ground-truth labels.
3. We systematically study judge reliability and cost across models, failure categories, and evaluation protocols. Our findings reveal major reliability gaps in current judges and point to fine-grained judging protocols as a potential enhancement for robust evaluation pipelines for deep research agents.
## 2 Reflect
### 2.1 Benchmark Task Formulation
Reflect frames judge meta-evaluation as an accuracy-based preference task over research-agent executions. Each instance pairs a reference execution with a controlled failure-bearing alternative. A reliable judge should assign higher quality to the reference, thereby showing sensitivity to the targeted failure. This formulation supports both process-level evaluation of trajectories and outcome-level evaluation of final reports, while retaining verifiable labels for open-ended research tasks that lack a single canonical answer. We first formalize the benchmark task and failure space in §[2.1](https://arxiv.org/html/2605.19196#S2.SS1), then describe the four-stage construction pipeline used to build verified clean-perturbed pairs in §[2.2](https://arxiv.org/html/2605.19196#S2.SS2).
Agent executions. An evidence-based deep research agent $\mathcal{A}$ maps an input query $q$ to an execution $\xi=(q,\tau,y)$, where $\tau$ is the research trajectory and $y$ is the final long-form answer. Following ReAct \[[52](https://arxiv.org/html/2605.19196#bib.bib52)\], the trajectory is a sequence of reasoning, tool-call, and tool-response triples, $\tau=\big((r_t,c_t,s_t)\big)_{t=1}^{T}$, with history $h_{<t}=(q,r_{<t},c_{<t},s_{<t})$. At each step, the agent generates $r_t=\mathcal{A}_{\mathrm{reason}}(h_{<t})$ and selects a tool call $c_t=\mathcal{A}_{\mathrm{tool}}(h_{<t},r_t)$. Each tool call $c_t=(u_t,\theta_t)$ specifies a tool $u_t$ from the available tool set $\mathcal{U}$ and its arguments $\theta_t$. The tool returns a response $s_t=\mathcal{E}(c_t)$. After completing the trajectory, the final answer is produced as $y=\mathcal{A}_{\mathrm{ans}}(q,\tau)$.
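The execution loop above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `Step`, `rollout`, and the four callables are hypothetical stand-ins for $\mathcal{A}_{\mathrm{reason}}$, $\mathcal{A}_{\mathrm{tool}}$, $\mathcal{E}$, and $\mathcal{A}_{\mathrm{ans}}$, and a fixed step budget is assumed in place of an agent-chosen stopping rule.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

ToolCall = Tuple[str, Dict[str, Any]]  # c_t = (u_t, theta_t)


@dataclass(frozen=True)
class Step:
    reasoning: str       # r_t
    tool_call: ToolCall  # c_t
    response: str        # s_t


def rollout(
    query: str,
    reason: Callable[[str, List[Step]], str],                # hypothetical A_reason
    choose_tool: Callable[[str, List[Step], str], ToolCall],  # hypothetical A_tool
    env: Callable[[ToolCall], str],                           # hypothetical E
    answer: Callable[[str, List[Step]], str],                 # hypothetical A_ans
    num_steps: int,                                           # assumed fixed T
) -> Tuple[str, List[Step], str]:
    """Return the execution xi = (q, tau, y)."""
    trajectory: List[Step] = []
    for _ in range(num_steps):
        r_t = reason(query, trajectory)            # r_t depends on the history h_{<t}
        c_t = choose_tool(query, trajectory, r_t)  # c_t depends on h_{<t} and r_t
        s_t = env(c_t)                             # s_t = E(c_t)
        trajectory.append(Step(r_t, c_t, s_t))
    y = answer(query, trajectory)                  # y = A_ans(q, tau)
    return query, trajectory, y
```

Passing the history as an explicit list keeps each step a pure function of $(q, r_{<t}, c_{<t}, s_{<t})$, which mirrors the formal definition.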
Failure space. We partition the failure space as $\mathcal{F}=\mathcal{F}_{\mathrm{proc}}\cup\mathcal{F}_{\mathrm{out}}$. Process-level failures $\mathcal{F}_{\mathrm{proc}}$ arise within the trajectory $\tau$, including errors in reasoning, tool calls, and the use or interpretation of tool responses. Outcome-level failures $\mathcal{F}_{\mathrm{out}}$ arise in the final answer $y$. We derived and adapted the full error taxonomy from prior work on long-form QA, deep research agents, and agent evaluation \[[66](https://arxiv.org/html/2605.19196#bib.bib66),[58](https://arxiv.org/html/2605.19196#bib.bib58),[44](https://arxiv.org/html/2605.19196#bib.bib44),[59](https://arxiv.org/html/2605.19196#bib.bib59)\]. The distribution of our error taxonomy is shown in Figure [1](https://arxiv.org/html/2605.19196#S1.F1), and its relation to existing schemes is given in Appendix [A.2](https://arxiv.org/html/2605.19196#A1.SS2).
Figure 2: Overview of the benchmark construction pipeline of Reflect, which collects agent trajectories, applies controlled perturbations to reasoning, tool use, and answers, and validates the resulting samples through automated filtering and human review.

Benchmark instances. Given a quality-screened agent execution $\xi^{\star}=(q,\tau^{\star},y^{\star})$ and a target failure type $f\in\mathcal{F}$, a perturbation operator $\Pi_f$ produces a corrupted execution $\tilde{\xi}=\Pi_f(\xi^{\star})$ that contains $f$ and differs from $\xi^{\star}$ only at a designated edit site. Each benchmark instance $b_i\in\mathcal{B}$ consists of a verified reference-corrupted execution pair, a failure label, and an edit site $\ell_i$:

$$\mathcal{B}=\big\{b_i=(\xi_i^{\star},\tilde{\xi}_i,f_i,\ell_i)\big\}_{i=1}^{M},\qquad \tilde{\xi}_i=\Pi_{f_i}(\xi_i^{\star}).$$

Here $\ell_i$ is a trajectory step $t\in\{1,\dots,T_i\}$ for process-level perturbations and a contiguous answer chunk for outcome-level perturbations. Since $\xi_i^{\star}$ is verified to be free of $f_i$ and $\tilde{\xi}_i$ to contain it, each instance provides ground truth for judge meta-evaluation. The edit-site metadata also supports localization analysis, scored by step-level exact match for trajectories and chunk-level overlap for answers; see Section [3](https://arxiv.org/html/2605.19196#S3).
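To make the instance structure concrete, the following minimal sketch represents a process-level instance and checks the "differs only at the edit site" property, together with step-level exact-match localization. The class and function names are hypothetical, and representing each trajectory step as a plain string is a simplifying assumption rather than the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ProcessInstance:
    reference: Sequence[str]  # steps of the reference trajectory (simplified to strings)
    corrupted: Sequence[str]  # steps of the perturbed trajectory
    failure_type: str         # label f_i (the label strings here are illustrative)
    edit_site: int            # l_i, a 1-indexed trajectory step


def differs_only_at_edit_site(inst: ProcessInstance) -> bool:
    """True if the two trajectories have equal length and differ at exactly the edit site."""
    if len(inst.reference) != len(inst.corrupted):
        return False
    changed = [
        t
        for t, (ref_step, cor_step) in enumerate(
            zip(inst.reference, inst.corrupted), start=1
        )
        if ref_step != cor_step
    ]
    return changed == [inst.edit_site]


def localization_exact_match(predicted_step: int, inst: ProcessInstance) -> bool:
    """Step-level exact match between a judge's predicted step and the edit site."""
    return predicted_step == inst.edit_site
```

A check of this kind is a cheap automated guard, but it cannot confirm that the edited step actually contains the intended failure; that still requires the validation described in Section 2.2.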
Judge interfaces. A judge $\mathcal{J}$ is evaluated by whether it identifies or prefers the reference execution over its failure-bearing counterpart. We support three interfaces commonly used in evaluation and training.

Scalar judging. A scalar (or pointwise) judge assigns a quality score $S_{\mathcal{J}}(\xi)\in\mathbb{R}$, as in reward modeling or score-based filtering. For a reference-corrupted pair, we define the score gap and success indicator as

$$\Delta_{\mathcal{J}}(\xi^{\star},\tilde{\xi})=S_{\mathcal{J}}(\xi^{\star})-S_{\mathcal{J}}(\tilde{\xi}),\qquad z_{\mathcal{J}}(\xi^{\star},\tilde{\xi})=\mathbb{I}\!\left[\Delta_{\mathcal{J}}(\xi^{\star},\tilde{\xi})>\epsilon\right].$$

We use $\epsilon=0$ as the default margin throughout the paper.
Pairwise judging. A pairwise judge directly compares two executions, matching preference-learning settings such as DPO-style training \[[40](https://arxiv.org/html/2605.19196#bib.bib40)\], and returns $P_{\mathcal{J}}(\xi^{\star},\tilde{\xi})\in\{\xi^{\star},\tilde{\xi},\mathrm{tie}\}$. It succeeds when

$$z_{\mathcal{J}}(\xi^{\star},\tilde{\xi})=\mathbb{I}\!\left[P_{\mathcal{J}}(\xi^{\star},\tilde{\xi})=\xi^{\star}\right].$$
Ranking judging. A ranking judge selects the best execution from a candidate set, corresponding to Best-of-$N$ inference-time scaling or reranking. Let $\mathcal{P}$ denote a set of perturbation types, each producing a candidate $\tilde{\xi}_a$ for $a\in\mathcal{P}$. The judge sees $\mathcal{C}=\{\xi^{\star}\}\cup\{\tilde{\xi}_a:a\in\mathcal{P}\}$, selects $T_{\mathcal{J}}(\mathcal{C})\in\mathcal{C}$, and succeeds when

$$z_{\mathcal{J}}(\mathcal{C})=\mathbb{I}\!\left[T_{\mathcal{J}}(\mathcal{C})=\xi^{\star}\right].$$
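The three success indicators can be written as small predicates. The sketch below is illustrative only: the function names and the string verdict labels (`"reference"`, `"corrupted"`, `"tie"`) are assumptions for this example, not identifiers from the benchmark.

```python
def scalar_success(score_ref: float, score_cor: float, eps: float = 0.0) -> bool:
    """z = I[S(ref) - S(cor) > eps]; a tie in scores counts as a failure when eps = 0."""
    return (score_ref - score_cor) > eps


def pairwise_success(verdict: str) -> bool:
    """z = I[P(ref, cor) = ref]; verdict is one of 'reference', 'corrupted', 'tie'."""
    if verdict not in {"reference", "corrupted", "tie"}:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return verdict == "reference"


def ranking_success(selected_index: int, reference_index: int) -> bool:
    """z = I[T(C) = ref]; the judge returns the index of its chosen candidate."""
    return selected_index == reference_index
```

Under all three interfaces, a judge that cannot separate the two executions (equal scores or a `tie` verdict) is counted as unsuccessful, which follows from the strict inequality and the equality test in the definitions above.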
### 2.2 Benchmark Construction Pipeline
We instantiate the perturbation operators $\Pi_f$ through a four-stage pipeline: taxonomy construction, reference screening, controlled intervention, and combined automated filtering and human validation.
Taxonomy construction. We construct a failure space $\mathcal{F}$ by synthesizing categories from prior work on long-form QA, deep research benchmarks, and agentic trajectory supervision \[[66](https://arxiv.org/html/2605.19196#bib.bib66),[64](https://arxiv.org/html/2605.19196#bib.bib64),[59](https://arxiv.org/html/2605.19196#bib.bib59),[44](https://arxiv.org/html/2605.19196#bib.bib44)\]. Existing taxonomies typically emphasize either final-answer quality or trajectory behavior in isolation; ours unifies both views and is the basis for the process/outcome partition above. To verify that the taxonomy reflects real agent behavior rather than an a priori list, we sample natural rollouts on held-out queries and, under model-assisted and human review, map each observed failure either to a category in $\mathcal{F}$ or to an out-of-taxonomy bucket. Case studies can be found in Appendix [D](https://arxiv.org/html/2605.19196#A4).
Reference screening. We draw candidate reference executions from strong agent rollouts. For each target failure type $f$, we only require that the selected reference does not already contain $f$ at the chosen edit site (i.e., the step or chunk to edit). Candidate references are screened using automatic checks for schema validity, English language content, and usable trajectory or answer structure, followed by targeted validation for the absence of $f$.
Controlled intervention. For each failure type $f\in\mathcal{F}$, we define a perturbation operator $\Pi_f$ implemented as an LLM-based editor. Starting from clean seeds $\xi^{\star}$ obtained from strong agent rollouts, we pre-filter with an LLM under human supervision to ensure that $f$ is absent from the seed, then apply

$$\tilde{\xi}=\Pi_f(\xi^{\star})=\mathrm{Edit}_{\theta}\!\big(\xi^{\star},\,f,\,d_f,\,\ell;\,p_f\big),$$

where $\mathrm{Edit}_{\theta}$ denotes an LLM editor with parameters $\theta$, $d_f$ is a natural-language definition of the failure type, $\ell$ is the target edit site sampled from candidate sites in $\xi^{\star}$, and $p_f$ is a type-specific perturbation prompt. Each operator targets either the trajectory $\tau^{\star}$ (for $f\in\mathcal{F}_{\mathrm{proc}}$) or the answer $y^{\star}$ (for $f\in\mathcal{F}_{\mathrm{out}}$); concrete worked examples are listed in Appendix [D](https://arxiv.org/html/2605.19196#A4).
Following adversarial comparison benchmarks such as LLMBar \[[57](https://arxiv.org/html/2605.19196#bib.bib57)\], we constrain edits to be localized, plausible, and minimal: a perturbation should introduce $f$ at $\ell$ while preserving fluency, coherence, and all content outside $\ell$. For trajectories, this means that surrounding steps and their observations are left unchanged and the edited step remains syntactically well-formed; for answers, it means that only the targeted chunk is rewritten. This discipline ensures that judge success depends on detecting $f$ rather than exploiting superficial artifacts such as length, formatting, or stylistic drift.
Automated filtering and human validation. Every original-perturbed pair is screened by automated filters that remove pairs with no substantive change, malformed outputs, non-English text, formatting artifacts, or invalid input-output structure for the target judge interface. Pairs that pass the filter proceed to a human validation step. Two annotators with graduate-level expertise in NLP independently verify three conditions for every pair: $\tilde{\xi}$ contains the target failure $f$, $\xi^{\star}$ does not, and the perturbation introduces no major unintended failures. Annotators completed a calibration round on a held-out development sample before the main study, and disagreements on the main study were resolved through adjudicated discussion. We obtain an inter-annotator agreement of $\kappa=0.86$, indicating substantial agreement. Final dataset statistics are illustrated in Figure [1](https://arxiv.org/html/2605.19196#S1.F1) and reported in Table [6](https://arxiv.org/html/2605.19196#A2.T6) in Appendix [B.1](https://arxiv.org/html/2605.19196#A2.SS1).
## 3 Experiments
We design our experiments to answer the following key research questions:
RQ1: Model capability. How do different judge models perform in detecting various fine-grained failure modes, and how do open-weight models compare with frontier closed-source models?

RQ2: Evaluation protocol. How do judging protocols, including holistic versus fine-grained evaluation, rubric guidance, and explicit reasoning, affect the judge's reliability?

RQ3: Judge blind spots. Which process-level and outcome-level failure types are systematically missed by LLM judges, and how do these blind spots depend on evaluation granularity?

RQ4: Best-of-$N$ and cost-performance trade-off. Can LLM judges identify the verified original execution among multiple failure-bearing alternatives (a useful setup for best-of-$N$ inference-time scaling), and which protocol choices provide the best reliability-cost trade-off?
### 3.1 Experimental Setup
Evaluation Protocols. We formulate judge reliability as an accuracy-based preference task: given a verified reference execution and failure-bearing alternatives, the judge should prefer the reference. We evaluate two targets: the agent's execution process and its final output. Process-level evaluation assesses trajectories, distinguishing *reasoning behavior* (e.g., planning, reflection, and evidence use) from *tool-use behavior* (e.g., tool selection, argument construction, and response interpretation), while outcome-level evaluation assesses the final report.
We vary three protocol axes: (i) *judging granularity*, comparing holistic judgments over full trajectories or reports with fine-grained judgments over localized steps or chunks; (ii) *comparison format*, comparing pointwise independent scoring with pairwise direct comparison; and (iii) *prompting format*, comparing rubric-based judgments with non-rubric overall judgments. For pairwise evaluation, we use a swapped-order design to mitigate position bias \[[27](https://arxiv.org/html/2605.19196#bib.bib27)\]. Additional prompt details are provided in Appendix [C](https://arxiv.org/html/2605.19196#A3).
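One common way to realize a swapped-order design is to query the judge twice, once per presentation order, and credit the judge only when it prefers the reference in both. The sketch below illustrates that idea; the aggregation rule, the function name, and the `"first"`/`"second"`/`"tie"` verdict labels are all assumptions made for illustration, since the text above does not specify how the two orders are combined.

```python
from typing import Callable

# Hypothetical judge signature: judge(shown_first, shown_second) returns
# "first", "second", or "tie".
PairwiseJudge = Callable[[str, str], str]


def prefers_reference_both_orders(
    judge: PairwiseJudge, reference: str, corrupted: str
) -> bool:
    """Assumed aggregation: success only if the reference wins in both orders."""
    verdict_ref_first = judge(reference, corrupted)   # reference shown first
    verdict_ref_second = judge(corrupted, reference)  # reference shown second
    return verdict_ref_first == "first" and verdict_ref_second == "second"
```

With this rule, a judge that always picks whichever execution appears first wins one order and loses the other, so it receives no credit for the pair.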
Judge Models. We evaluate a wide array of LLM judges covering both open-weight and proprietary models. Such judges are increasingly used beyond offline evaluation, for example in best-of-$N$ selection and as RL-style training signals \[[65](https://arxiv.org/html/2605.19196#bib.bib65),[30](https://arxiv.org/html/2605.19196#bib.bib30)\]. The open-weight judges include Qwen3-8B, Qwen3-32B, and Qwen3-235B-A22B \[[51](https://arxiv.org/html/2605.19196#bib.bib51)\], Llama-3.1-70B \[[16](https://arxiv.org/html/2605.19196#bib.bib16)\], Gemma3-27B \[[12](https://arxiv.org/html/2605.19196#bib.bib12)\], and GPT-OSS-120B \[[1](https://arxiv.org/html/2605.19196#bib.bib1)\]. The proprietary judges include Gemini-2.0-Flash \[[13](https://arxiv.org/html/2605.19196#bib.bib13)\], Gemini-2.5-Flash \[[14](https://arxiv.org/html/2605.19196#bib.bib14)\], Gemini-3.1-Pro \[[15](https://arxiv.org/html/2605.19196#bib.bib15)\], GPT-5.3-Codex \[[37](https://arxiv.org/html/2605.19196#bib.bib37)\], GPT-5.4 \[[38](https://arxiv.org/html/2605.19196#bib.bib38)\], GPT-5-mini \[[36](https://arxiv.org/html/2605.19196#bib.bib36)\], Claude-Haiku-4.5 \[[2](https://arxiv.org/html/2605.19196#bib.bib2)\], and Claude-Opus-4.7 \[[3](https://arxiv.org/html/2605.19196#bib.bib3)\].
Benchmark Instances. Our benchmark draws on different sources for process-level and outcome-level perturbations. For process-level evaluation, we use clean agent trajectories from two trace sources: cleaned DR.TULU \[[45](https://arxiv.org/html/2605.19196#bib.bib45)\] and Tongyi DeepResearch \[[48](https://arxiv.org/html/2605.19196#bib.bib48)\] (see Footnote 1). These traces provide the reasoning and tool-use steps used to construct process-level perturbations. For outcome-level evaluation, we use final reports from cleaned DR.TULU \[[45](https://arxiv.org/html/2605.19196#bib.bib45)\] and English final answers sampled from DeepResearch Bench \[[6](https://arxiv.org/html/2605.19196#bib.bib6)\]. All instances are normalized into a shared format containing the user question, final answer, and trajectory steps when available.

Footnote 1: We do not assume that source trajectories are globally error-free. They are used as reference executions after screening and validation for the target failure type: the reference must not contain the target failure at the selected edit site, while the edited alternative must contain that failure and preserve the surrounding trajectory. This paired counterfactual design controls for residual imperfections shared by both executions and tests whether judges are sensitive to the controlled localized degradation.
Metrics. We use accuracy as the primary metric, consistent with reward-model and judge meta-evaluation benchmarks \[[30](https://arxiv.org/html/2605.19196#bib.bib30),[65](https://arxiv.org/html/2605.19196#bib.bib65)\], and following the scalar-judging success criterion defined in Section [2.1](https://arxiv.org/html/2605.19196#S2.SS1). A judge is correct on a pair $(\xi_i^{\star},\tilde{\xi}_i)$ if the original execution receives a strictly higher final score than the perturbed execution. That is,

$$\mathrm{Acc}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\!\left[S_{\mathcal{J}}(\xi_i^{\star})>S_{\mathcal{J}}(\tilde{\xi}_i)\right].$$

For non-rubric scoring, $S_{\mathcal{J}}$ is the judge's direct overall score. For rubric scoring, $S_{\mathcal{J}}(x)=\frac{1}{K}\sum_{k=1}^{K}s_{\mathcal{J},k}(x)$, where $s_{\mathcal{J},k}(x)\in\{1,\ldots,n\}$ is the score for rubric dimension $k$. We report accuracy overall and by failure type.
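As a worked illustration of the metric, the sketch below averages rubric dimensions into a final score and computes pairwise accuracy with the strict inequality. The function names and the toy scores are hypothetical and are not taken from the benchmark.

```python
from statistics import mean
from typing import Sequence, Tuple


def rubric_score(dimension_scores: Sequence[int]) -> float:
    """S(x) = (1/K) * sum_k s_k(x): the mean over K rubric dimensions."""
    return mean(dimension_scores)


def accuracy(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (reference, perturbed) score pairs where the reference is strictly higher."""
    if not score_pairs:
        raise ValueError("need at least one pair")
    correct = sum(ref > cor for ref, cor in score_pairs)
    return correct / len(score_pairs)


# Toy example with made-up scores: two correct pairs, one tie, one inversion.
toy_pairs = [(4.0, 3.0), (3.0, 3.0), (2.0, 5.0), (5.0, 1.0)]
toy_accuracy = accuracy(toy_pairs)  # 2 correct out of 4 pairs
```

Because the comparison is strict, a judge that gives identical scores to both executions is counted as wrong. A judge that outputs the same score for every execution therefore gets 0% rather than a chance-level 50%, which is worth keeping in mind when reading the near-zero entries in Table 1.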
Table 1: Detection accuracy for process- and outcome-level evaluations with pointwise judges. Values are percentages with % omitted. Column groups: Reasoning and Tool Use are process-level; Report is outcome-level report quality. Abbreviations: AN = Analysis, ST = Structure, OV = Overall, FI = Faithfulness, GR = Groundedness, RE = Relevance, EX = Expression, SY = Synthesis.

| Model | Reasoning AN | Reasoning ST | Reasoning FI | Reasoning OV | Tool Use ST | Tool Use FI | Tool Use GR | Tool Use OV | Report RE | Report FI | Report EX | Report SY | Report OV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Open-source Models* | | | | | | | | | | | | | |
| Qwen3-8B | 0.0 | 3.4 | 0.0 | 0.7 | 0.0 | 7.5 | 0.0 | 3.8 | 5.2 | 9.6 | 26.7 | 34.5 | 14.5 |
| Qwen3-32B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.7 | 1.5 | 34.5 | 34.9 | 50.0 | 51.7 | 39.5 |
| Gemma3-27B | 5.0 | 10.3 | 7.0 | 7.1 | 0.0 | 1.5 | 1.9 | 1.5 | 20.7 | 19.3 | 6.7 | 34.5 | 20.0 |
| Llama3.1-70B | 0.0 | 0.0 | 0.0 | 0.0 | 9.1 | 1.5 | 1.9 | 2.3 | 6.9 | 7.2 | 20.0 | 10.3 | 9.5 |
| Qwen3-235B-a22B | 30.0 | 24.1 | 22.5 | 25.0 | 18.2 | 19.4 | 22.2 | 20.5 | 17.2 | 32.5 | 30.0 | 27.6 | 27.0 |
| GPT-OSS-120B | 57.5 | 48.3 | 38.0 | 45.7 | 36.4 | 28.4 | 24.1 | 27.3 | 43.1 | 43.4 | 43.3 | 65.5 | 46.5 |
| *Closed-source Models* | | | | | | | | | | | | | |
| Gemini-2.0-Flash | 2.6 | 3.4 | 6.0 | 4.5 | 0.0 | 4.5 | 1.9 | 3.0 | 3.4 | 6.0 | 3.3 | 3.4 | 4.5 |
| Gemini-2.5-Flash | 33.3 | 31.0 | 32.8 | 32.6 | 27.3 | 19.4 | 20.4 | 20.5 | 22.4 | 18.1 | 30.0 | 41.4 | 24.5 |
| Gemini-3.1-Pro | 20.0 | 31.0 | 23.5 | 24.1 | 54.5 | 28.4 | 22.2 | 28.0 | 41.4 | 30.1 | 23.3 | 31.0 | 32.5 |
| Claude-Haiku-4.5 | 15.0 | 20.7 | 12.7 | 15.0 | 63.6 | 41.8 | 40.7 | 43.2 | 32.8 | 31.3 | 30.0 | 37.9 | 32.5 |
| Claude-Opus-4.7 | 15.0 | 20.7 | 21.1 | 19.3 | 81.8 | 49.3 | 51.9 | 53.0 | 29.3 | 37.3 | 26.7 | 48.3 | 35.0 |
| GPT-5.4 | 40.0 | 41.4 | 33.8 | 37.1 | 90.9 | 55.2 | 37.0 | 50.8 | 39.7 | 33.7 | 36.7 | 48.3 | 38.0 |
| GPT-5-mini | 30.0 | 48.3 | 36.6 | 37.1 | 36.4 | 28.4 | 42.6 | 34.8 | 34.5 | 44.6 | 43.3 | 58.6 | 43.5 |
| GPT-5.3-codex | 30.0 | 37.9 | 19.7 | 26.4 | 63.6 | 59.7 | 46.3 | 54.5 | 39.7 | 51.8 | 43.3 | 55.2 | 47.5 |
### 3.2 Model Capability (RQ1)
Table[1](https://arxiv.org/html/2605.19196#S3.T1)shows that judge performance remains low across process\-level reasoning, tool\-use accuracy, and outcome\-level evaluations\. The results reveal the following main findings\.
Existing judges remain unreliable. Even the best overall scores are far from reliable: 45.7% for reasoning, 54.5% for tool use, and 47.5% for report quality. Performance also varies widely across model families and evaluation targets. Smaller open-weight judges perform poorly in most settings, while larger open-weight and proprietary models are more competitive but still unreliable.
Judge reliability is failure-type dependent. Tool-use "structure" errors are generally easier for several frontier models to detect, whereas "groundedness" and "faithfulness" failures remain substantially more challenging. At the outcome level, models also differ in whether they are more sensitive to relevance, faithfulness, expression, or synthesis failures. This heterogeneity indicates that aggregate accuracy alone can obscure important differences in what judges can and cannot detect.
Agent-oriented judges are strongest overall. The strongest overall results come from GPT-5.3-codex, which achieves the best process-level tool-use accuracy of 54.5% and the best outcome-level report accuracy of 47.5%. This suggests that models optimized for agentic coding and tool-oriented tasks may be well suited to evidence-based judge evaluation, though this advantage comes with higher inference cost and requires further controlled study.
### 3.3 Evaluation Protocol Comparison (RQ2)
We next study how protocol choices affect judge reliability. Because exhaustive fine-grained protocol sweeps are expensive, we run these comparisons on a representative subset of judges spanning open-weight and proprietary models. For evaluation granularity, we compare holistic judging over the full trajectory or report with fine-grained judging over localized trajectory steps or answer chunks. Table [2](https://arxiv.org/html/2605.19196#S3.T2) reports $\Delta_{\text{scale}}$, the accuracy difference between fine-grained and holistic judging, across process- and outcome-level settings.
Fine-grained evaluation improves over holistic judging. Fine-grained judging consistently improves detection accuracy across models, evaluation levels, and rubric settings. The gains are substantial in both process-level and outcome-level evaluation, with $\Delta_{\text{scale}}$ reaching over 30 points in several settings. This suggests that localized evaluation helps judges identify errors that may be diluted under holistic scoring, whether they appear in intermediate reasoning trajectories or in final reports. Overall, the results show that granularity is a robust protocol effect, and we next ask whether explicit rubric dimensions provide an additional source of judge reliability.
Table 2: Effect of evaluation granularity across process\- and outcome\-level settings\. $\Delta_{\text{scale}}$ denotes the difference between fine\-grained and holistic detection accuracy, measured in percentage points\.

| Model | Process, Rubric: Hol\. | Process, Rubric: FG | Process, Rubric: $\Delta_{\text{scale}}$ | Process, No\-Rubric: Hol\. | Process, No\-Rubric: FG | Process, No\-Rubric: $\Delta_{\text{scale}}$ | Outcome, Rubric: Hol\. | Outcome, Rubric: FG | Outcome, Rubric: $\Delta_{\text{scale}}$ | Outcome, No\-Rubric: Hol\. | Outcome, No\-Rubric: FG | Outcome, No\-Rubric: $\Delta_{\text{scale}}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3\-8B | 0\.7 | 20\.0 | \+19\.3 | 3\.6 | 25\.7 | \+22\.1 | 14\.5 | 34\.0 | \+19\.5 | 10\.5 | 25\.4 | \+14\.9 |
| Qwen3\-32B | 0\.0 | 21\.4 | \+21\.4 | 0\.0 | 34\.3 | \+34\.3 | 39\.5 | 45\.5 | \+6\.0 | 14\.5 | 37\.6 | \+23\.1 |
| GPT\-5\.4 | 37\.1 | 55\.7 | \+18\.6 | 22\.9 | 56\.4 | \+33\.5 | 38\.0 | 55\.3 | \+17\.3 | 32\.0 | 34\.8 | \+2\.8 |
| Gemini\-3\.1 Pro | 24\.1 | 55\.4 | \+31\.3 | 25\.0 | 55\.7 | \+30\.7 | 32\.5 | 56\.9 | \+24\.4 | 12\.5 | 23\.2 | \+10\.7 |

Values are detection accuracy percentages\. Hol\. = holistic; FG = fine\-grained\.
Rubric effects are context\-dependent, motivating dynamic instance\-based rubrics\. Figure [3\(a\)](https://arxiv.org/html/2605.19196#S3.F3.sf1) shows that dimension\-wise rubric scoring is not a uniform improvement\. At the outcome level, rubrics consistently improve detection accuracy across all selected models, with the largest gain reaching $+33.7$ points for Gemini\-3\.1 Pro under fine\-grained judging\. This suggests that final\-report failures align relatively well with explicit scoring dimensions, allowing rubrics to expose localized factuality, evidence\-use, or citation errors that overall scores may overlook\. In contrast, process\-level effects are mixed and sometimes negative, especially for weaker judges\. For process evaluation, rubrics turn a single overall decision into a more demanding task: reading long trajectories, locating cross\-step evidence, separating nearby error dimensions, and calibrating multiple scores\. When the judge lacks sufficient long\-context reasoning or scoring stability, this extra structure can become noise rather than guidance\. Overall, rubric scoring is most useful when the judge is strong enough to apply it reliably, motivating more adaptive, instance\-specific rubrics for process\-level evaluation\.
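The rubric benefit discussed above \(Rubric minus No\-Rubric accuracy\) can be recomputed from the Table 2 accuracies\. The following is a minimal sketch, assuming the rubric gains plotted in Figure 3\(a\) derive from the same numbers as Table 2; the dictionary layout and the function name `rubric_benefit` are illustrative choices, not part of the benchmark code\.

```python
# Detection accuracies (%) transcribed from Table 2.
# Key: (model, level, granularity) -> (rubric accuracy, no-rubric accuracy).
ACCURACY = {
    ("Qwen3-8B", "process", "holistic"): (0.7, 3.6),
    ("Qwen3-8B", "process", "fine-grained"): (20.0, 25.7),
    ("Qwen3-8B", "outcome", "holistic"): (14.5, 10.5),
    ("Qwen3-8B", "outcome", "fine-grained"): (34.0, 25.4),
    ("Qwen3-32B", "process", "holistic"): (0.0, 0.0),
    ("Qwen3-32B", "process", "fine-grained"): (21.4, 34.3),
    ("Qwen3-32B", "outcome", "holistic"): (39.5, 14.5),
    ("Qwen3-32B", "outcome", "fine-grained"): (45.5, 37.6),
    ("GPT-5.4", "process", "holistic"): (37.1, 22.9),
    ("GPT-5.4", "process", "fine-grained"): (55.7, 56.4),
    ("GPT-5.4", "outcome", "holistic"): (38.0, 32.0),
    ("GPT-5.4", "outcome", "fine-grained"): (55.3, 34.8),
    ("Gemini-3.1 Pro", "process", "holistic"): (24.1, 25.0),
    ("Gemini-3.1 Pro", "process", "fine-grained"): (55.4, 55.7),
    ("Gemini-3.1 Pro", "outcome", "holistic"): (32.5, 12.5),
    ("Gemini-3.1 Pro", "outcome", "fine-grained"): (56.9, 23.2),
}


def rubric_benefit(rubric_acc: float, no_rubric_acc: float) -> float:
    """Rubric minus No-Rubric accuracy, in percentage points."""
    return round(rubric_acc - no_rubric_acc, 1)


benefits = {key: rubric_benefit(*accs) for key, accs in ACCURACY.items()}

for (model, level, granularity), delta in benefits.items():
    print(f"{model:15s} {level:8s} {granularity:13s} {delta:+.1f}")

# Setting with the largest rubric gain.
largest = max(benefits, key=benefits.get)
print("largest gain:", largest, f"{benefits[largest]:+.1f}")
```

Under these numbers every outcome\-level difference is positive, while most process\-level differences are zero or negative, which matches the mixed process\-level picture described in the text\.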
\(a\) Rubric benefit across models and granularities\.

\(b\) CoT reasoning effect with rubrics\.

Figure 3: Effects of rubric\-guided evaluation and chain\-of\-thought reasoning on perturbation detection accuracy\. $\Delta$ denotes Rubric minus No\-Rubric accuracy in percentage points\.

CoT helps only when judges can effectively leverage rubrics\. We next examine whether chain\-of\-thought \(CoT\) prompting further improves rubric\-guided judging\. Figure [3\(b\)](https://arxiv.org/html/2605.19196#S3.F3.sf2) compares rubric gains with and without CoT: point height represents the rubric benefit, and slope indicates how this benefit changes after adding CoT\.
The results show that CoT amplifies rubric gains only selectively\. It is most helpful for stronger judges in outcome\-level evaluation, where final\-report errors align well with rubric dimensions such as relevance, factuality, expression, and synthesis\. For process\-level evaluation, however, the effect is more mixed, as judges must track reasoning, tool use, and evidence flow across multiple steps\. Thus, CoT is better characterized as a capability\-dependent complement to rubric\-based evaluation, rather than a uniformly effective intervention\.
### 3\.4 Blind Spots across Error Taxonomy \(RQ3\)
Figure 4: Failure detection accuracy across process\-level and outcome\-level perturbation types\. Results are shown for GPT\-5\.4 and Gemini\-3\.1 Pro under fine\-grained and holistic judging\.

Fine\-grained judging surfaces local errors, whereas holistic judging captures context\-dependent global failures\. Figure [4](https://arxiv.org/html/2605.19196#S3.F4) compares GPT\-5\.4 and Gemini\-3\.1 Pro across perturbation types under fine\-grained and holistic judging\. The results reveal granularity\-dependent blind spots: failures that are salient at the step or span level may be obscured in a full trajectory, while failures that depend on broader context may only emerge when the entire response is evaluated\.
Fine\-grained judging is most effective for localized failures because it makes the perturbed step or answer span directly visible\. This helps identify local process failures such as execution stagnation, as well as local outcome failures such as evidence omission, expression quality, and incomplete coverage\. Under holistic judging, these signals can be diluted as they are embedded within a longer reasoning trajectory or report\. Holistic judging, in contrast, is better suited to failures that require global context, accumulated evidence, or overall task intent to detect\. These include shallow reflection and topical misalignment, which may not be obvious from any single step but become clearer when the response is evaluated as a whole\.
### 3\.5 Best\-of\-N Selection and Cost Trade\-offs \(RQ4\)
##### Best\-of\-$N$ Metric\.
Beyond single\-pair discrimination, many evaluation and deployment pipelines use judges for best\-of\-$N$ inference\-time selection: the system generates multiple candidate executions and selects the candidate with the highest judge score\. We model this setting by grouping each verified execution $\xi_i^{\star}$ with its failure\-bearing alternatives, $\mathcal{C}_i = \{\xi_i^{\star}\} \cup \{\tilde{\xi}_{i,f} : f \in \mathcal{F}_i\}$, where $\mathcal{F}_i$ denotes the failure types instantiated for that execution\. A group is correct only if the verified reference receives the highest judge score:

$$S_{\mathcal{J}}(\xi_i^{\star}) > \max_{f \in \mathcal{F}_i} S_{\mathcal{J}}(\tilde{\xi}_{i,f}).$$

Best\-of\-$N$ accuracy is the fraction of groups satisfying this condition\.
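The metric above can be sketched in a few lines of Python\. This is an illustrative sketch, not the benchmark's implementation: the `CandidateGroup` container, the failure\-type labels, and all scores are made up, and `single_pair_accuracy` assumes that single\-pair scoring is decided by comparing the two judge scores, which the text does not specify\.

```python
from dataclasses import dataclass


@dataclass
class CandidateGroup:
    """Judge scores for one verified execution and its failure-bearing alternatives."""
    reference_score: float
    alternative_scores: dict[str, float]  # failure type -> judge score


def group_is_correct(group: CandidateGroup) -> bool:
    # Strict inequality, as in the condition above: a tie counts as a miss.
    return group.reference_score > max(group.alternative_scores.values())


def best_of_n_accuracy(groups: list[CandidateGroup]) -> float:
    """Fraction of groups whose verified reference gets the strictly highest score."""
    if not groups:
        raise ValueError("at least one candidate group is required")
    return sum(group_is_correct(g) for g in groups) / len(groups)


def single_pair_accuracy(groups: list[CandidateGroup]) -> float:
    """Fraction of reference-alternative pairs where the reference scores higher."""
    outcomes = [
        g.reference_score > alt
        for g in groups
        for alt in g.alternative_scores.values()
    ]
    if not outcomes:
        raise ValueError("at least one reference-alternative pair is required")
    return sum(outcomes) / len(outcomes)


# Made-up judge scores, for illustration only.
groups = [
    CandidateGroup(8.0, {"stagnation": 5.0, "fabrication": 7.0,
                         "constraint_drop": 6.0, "wrong_tool": 4.0}),
    CandidateGroup(6.0, {"stagnation": 6.0, "fabrication": 3.0,
                         "constraint_drop": 5.0, "wrong_tool": 2.0}),
    CandidateGroup(7.0, {"stagnation": 9.0, "fabrication": 4.0,
                         "constraint_drop": 5.0, "wrong_tool": 6.0}),
]

print(f"best-of-N accuracy:   {best_of_n_accuracy(groups):.3f}")
print(f"single-pair accuracy: {single_pair_accuracy(groups):.3f}")
```

In this toy data the judge wins 10 of 12 individual pairs but selects the reference in only 1 of 3 groups, because a single lost or tied pair is enough to fail a whole group\.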
Best\-of\-$N$ Selection Gap\. Figure [5\(a\)](https://arxiv.org/html/2605.19196#S3.F5.sf1) shows a clear *best\-of\-$N$ selection gap*: accuracy drops when judges must select the verified reference from a candidate set instead of scoring a single reference\-alternative pair\. The drop is largest at the *process level*, where selection requires comparing multiple long trajectories and tracking distributed reasoning, tool\-use, and evidence\-flow failures across candidates\. The *outcome\-level* setting is less affected, likely because final reports provide a more compact and directly comparable evaluation target\. These results indicate that judge scores are less reliable for best\-of\-$N$ selection than for isolated pairwise discrimination, especially when candidate quality differs in trajectory\-level behavior\.
\(a\) Best\-of\-$N$ selection accuracy

\(b\) Cost–performance trade\-off

Figure 5: Judge reliability across evaluation settings\. \(a\) Best\-of\-$N$ selection accuracy\. Single\-pair scoring evaluates each reference–alternative pair independently, while best\-of\-$N$ selection requires the judge to select the verified reference among 4–7 failure\-bearing alternatives\. \(b\) Estimated total evaluation cost versus detection accuracy for closed\-source judge settings, computed over the full benchmark by multiplying the measured input/output token counts by each model’s API pricing\.

Cost\-Performance Trade\-off\. Finally, we analyze the trade\-off between judge reliability and evaluation cost across closed\-source judge settings\. Figure [5\(b\)](https://arxiv.org/html/2605.19196#S3.F5.sf2) shows a general *positive cost\-performance trend*: higher\-cost settings usually obtain higher detection accuracy\. The strongest accuracies come from more expensive pairwise CoT configurations, although the gains are not determined by cost alone\. *Process\-level* evaluation is less cost\-effective, as long trajectories increase token cost while remaining harder to judge\. Taken together, the results indicate that reliable judge evaluation requires balancing model strength and protocol design rather than simply choosing the most expensive setting\.
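The cost estimate described in the Figure 5\(b\) caption \(token counts multiplied by API pricing\) can be sketched as follows\. The prices and token counts below are placeholders chosen for easy arithmetic, not real model prices or measured benchmark totals, and the `Pricing` class and `evaluation_cost` function are hypothetical names introduced for this sketch\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """API price in USD per one million tokens (placeholder values, not real prices)."""
    input_per_million: float
    output_per_million: float


def evaluation_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Total cost in USD: token counts multiplied by the per-token prices."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


# Hypothetical judge setting: 50M input tokens and 2M output tokens.
placeholder_pricing = Pricing(input_per_million=2.0, output_per_million=8.0)
cost = evaluation_cost(50_000_000, 2_000_000, placeholder_pricing)
print(f"estimated cost: ${cost:.2f}")
```

In a formula of this shape, long process\-level trajectories mainly inflate the input\-token term, which is one way to read the observation that process\-level evaluation is less cost\-effective\.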
## 4 Related Works
Evidence\-Based Research Agent Evaluation\. Recent benchmarks evaluate deep research agents that perform multi\-step information seeking and synthesize evidence\-grounded reports \[[6](https://arxiv.org/html/2605.19196#bib.bib6),[10](https://arxiv.org/html/2605.19196#bib.bib10),[5](https://arxiv.org/html/2605.19196#bib.bib5),[17](https://arxiv.org/html/2605.19196#bib.bib17),[24](https://arxiv.org/html/2605.19196#bib.bib24)\]\. They assess report quality \(relevance, factuality, citation groundedness, coverage, and evidence use\) \[[6](https://arxiv.org/html/2605.19196#bib.bib6),[17](https://arxiv.org/html/2605.19196#bib.bib17),[5](https://arxiv.org/html/2605.19196#bib.bib5),[49](https://arxiv.org/html/2605.19196#bib.bib49),[46](https://arxiv.org/html/2605.19196#bib.bib46)\] as well as process behavior, including search decisions, source selection, trajectory validity, and step\-level reasoning \[[44](https://arxiv.org/html/2605.19196#bib.bib44),[54](https://arxiv.org/html/2605.19196#bib.bib54),[59](https://arxiv.org/html/2605.19196#bib.bib59),[58](https://arxiv.org/html/2605.19196#bib.bib58),[18](https://arxiv.org/html/2605.19196#bib.bib18),[61](https://arxiv.org/html/2605.19196#bib.bib61),[45](https://arxiv.org/html/2605.19196#bib.bib45)\]\. To scale beyond expert review, they increasingly rely on LLM judges for reports, citations, and evidence traces \[[63](https://arxiv.org/html/2605.19196#bib.bib63),[26](https://arxiv.org/html/2605.19196#bib.bib26),[7](https://arxiv.org/html/2605.19196#bib.bib7),[5](https://arxiv.org/html/2605.19196#bib.bib5),[17](https://arxiv.org/html/2605.19196#bib.bib17),[67](https://arxiv.org/html/2605.19196#bib.bib67)\]\. Our work complements this setting by meta\-evaluating such judges under controlled process\- and outcome\-level failures\.
Meta\-Evaluation for LLM Judges\. Another line of work meta\-evaluates LLM judges and reward models using preference pairs, ranking tasks, verification settings, or trajectory\-level annotations, including RewardBench2 \[[30](https://arxiv.org/html/2605.19196#bib.bib30)\], JudgeBench \[[47](https://arxiv.org/html/2605.19196#bib.bib47)\], JETTS \[[65](https://arxiv.org/html/2605.19196#bib.bib65)\], VerifyBench \[[25](https://arxiv.org/html/2605.19196#bib.bib25)\], AgentRewardBench \[[29](https://arxiv.org/html/2605.19196#bib.bib29)\], and Sage \[[9](https://arxiv.org/html/2605.19196#bib.bib9)\]\. These benchmarks are informative but usually evaluate complete responses rather than localized failures in extended agent executions \[[47](https://arxiv.org/html/2605.19196#bib.bib47),[27](https://arxiv.org/html/2605.19196#bib.bib27)\]\. LLMBar \[[57](https://arxiv.org/html/2605.19196#bib.bib57)\] and ReIFE \[[27](https://arxiv.org/html/2605.19196#bib.bib27)\] are closest, using clean–flawed adversarial pairs to isolate evaluation errors across models and protocols\. However, they mainly target response\-level instruction deviations, while research agents can fail during search, tool use, evidence selection, and synthesis \[[58](https://arxiv.org/html/2605.19196#bib.bib58),[59](https://arxiv.org/html/2605.19196#bib.bib59),[44](https://arxiv.org/html/2605.19196#bib.bib44),[22](https://arxiv.org/html/2605.19196#bib.bib22)\]\. Our Reflect investigates open\-ended agent executions in non\-verifiable settings\.
## 5 Conclusion
We introduced Reflect, a meta\-evaluation benchmark for assessing whether LLM judges can reliably evaluate evidence\-based research agents\. By constructing verified reference executions and controlled failure\-bearing alternatives, Reflect provides fine\-grained labels over both process\-level trajectory and outcome\-level report failures\. Our experiments show that current judges remain limited across reasoning, tool use, and final\-report evaluation\. They also reveal substantial variation across failure types, evaluation granularity, prompting formats, and best\-of\-$N$ selection settings\. Overall, these results suggest that judge reliability should be evaluated as a first\-class property of research\-agent evaluation pipelines\. Fine\-grained protocols improve failure sensitivity, but robust judge evaluation still requires careful choices about model capability, cost, scoring interface, and evaluation unit\.
##### Limitations\.
Reflect is a meta\-evaluation benchmark, so its scope is intentionally controlled\. The taxonomy covers common failures from current research\-agent traces and prior evaluation work, but it cannot exhaust every domain\-specific, naturally occurring, or interactive failure mode\. The controlled degradations isolate target failures to make judge behavior measurable, complementing human audit studies of naturally occurring agent errors\. Like many modern benchmarks, Reflect requires updating for longer\-term reliability: as judge models and research\-agent systems evolve, the benchmark should also be updated with new traces, failure types, and evaluator families\.
## 6 Author Contributions
We summarize each author’s primary contributions to the project below\. Authors shown in bold took the lead role in the corresponding category\.
- Project leadership: Leyao Wang, Yanan He
- Core contributions: Leyao Wang, Yanan He, Peng Chen, Asaf Yehudai, Arman Cohan
- Reflect development \(Process\-Reasoning\): Leyao Wang
- Reflect development \(Process\-Tool Use\): Peng Chen
- Reflect development \(Outcome\): Yanan He
- Evaluations and baselines: Leyao Wang, Yanan He, Peng Chen
- Paper writing: Leyao Wang, Yanan He, Arman Cohan
- Administration and policy review: Leyao Wang, Yixin Liu
- Advising and mentorship: Arman Cohan, Asaf Yehudai, Yixin Liu, Rex Ying, Michal Shmueli\-Scheuer
Core contributors made sustained and significant contributions throughout the project\. All authors contributed to project discussions, experiment planning, and manuscript reviewing\.
## References
- Agarwal et al\. \[2025\] S\. Agarwal et al\. gpt\-oss\-120b & gpt\-oss\-20b model card\. *arXiv preprint arXiv:2508\.10925*, 2025\.
- Anthropic \[2025\] Anthropic\. Introducing Claude Haiku 4\.5\. Anthropic release announcement, 2025\.
- Anthropic \[2026\] Anthropic\. Introducing Claude Opus 4\.7\. Anthropic release announcement, 2026\.
- Chen et al\. \[2024\] Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun\. Benchmarking large language models in retrieval\-augmented generation\. In *Proceedings of the AAAI Conference on Artificial Intelligence*, 2024\.
- Coelho et al\. \[2025\] João Coelho, Jingjie Ning, Jingyuan He, Kangrui Mao, Abhijay Paladugu, Pranav Setlur, Jiahe Jin, Jamie Callan, João Magalhães, Bruno Martins, and Chenyan Xiong\. Deepresearchgym: A free, transparent, and reproducible evaluation sandbox for deep research\. *arXiv preprint arXiv:2505\.19253*, 2025\.
- Du et al\. \[2025\] Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao\. Deepresearch bench: A comprehensive benchmark for deep research agents\. *arXiv preprint arXiv:2506\.11763*, 2025\.
- Dubois et al\. \[2023\] Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori Hashimoto\. Alpacafarm: A simulation framework for methods that learn from human feedback\. In *Thirty\-seventh Conference on Neural Information Processing Systems*, 2023\. URL [https://openreview\.net/forum?id=4hturzLcKX](https://openreview.net/forum?id=4hturzLcKX)\.
- Es et al\. \[2024\] Shahul Es, Jithin James, Luis Espinosa\-Anke, and Steven Schockaert\. RAGAS: Automated evaluation of retrieval augmented generation\. In *Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics*, 2024\.
- Feng et al\. \[2025\] Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, and Dongping Chen\. Are we on the right way to assessing LLM\-as\-a\-judge? *arXiv preprint arXiv:2512\.16041*, 2025\. URL [https://arxiv\.org/abs/2512\.16041](https://arxiv.org/abs/2512.16041)\.
- FutureSearch et al\. \[2025\] FutureSearch, Nikos I\. Bosse, Jon Evans, Robert G\. Gambee, Daniel Hnyk, Peter Mühlbacher, Lawrence Phillips, Dan Schwarz, and Jack Wildman\. Deep research bench: Evaluating ai web research agents, 2025\. URL [https://arxiv\.org/abs/2506\.06287](https://arxiv.org/abs/2506.06287)\.
- Gao et al\. \[2023\] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen\. Enabling large language models to generate text with citations\. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, 2023\.
- Gemma Team et al\. \[2025\] Gemma Team et al\. Gemma 3 technical report\. *arXiv preprint arXiv:2503\.19786*, 2025\.
- Google \[2025a\] Google\. Gemini 2\.0 is now available to everyone\. Google Blog, February 2025a\. URL [https://blog\.google/innovation\-and\-ai/models\-and\-research/google\-deepmind/gemini\-model\-updates\-february\-2025/](https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-model-updates-february-2025/)\. Accessed: 2026\-05\-06\.
- Google \[2025b\] Google\. Gemini 2\.5 flash is now in preview\. [https://blog\.google/products\-and\-platforms/products/gemini/gemini\-2\-5\-flash\-preview/](https://blog.google/products-and-platforms/products/gemini/gemini-2-5-flash-preview/), April 2025b\. Accessed: 2026\-05\-06\.
- Google DeepMind \[2026\] Google DeepMind\. Gemini 3\.1 pro model card\. [https://deepmind\.google/models/model\-cards/gemini\-3\-1\-pro/](https://deepmind.google/models/model-cards/gemini-3-1-pro/), February 2026\. Accessed: 2026\-05\-06\.
- Grattafiori et al\. \[2024\] Aaron Grattafiori et al\. The llama 3 herd of models\. *arXiv preprint arXiv:2407\.21783*, 2024\.
- Han et al\. \[2025\] Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song, Stanley Jungkyu Choi, Moontae Lee, and Honglak Lee\. DEER: A benchmark for evaluating deep research agents on expert report generation\. *arXiv preprint arXiv:2512\.17776*, 2025\.
- Hu et al\. \[2025\] Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen, Peng Liu, Ruihang Miao, Tianchi Yue, Wang You, Wei Ji, Wei Yuan, Wenjin Deng, Xiaojian Yuan, Xiaoyun Zhang, Xiangyu Liu, Xikai Liu, Yanming Xu, Yicheng Cao, Yifei Zhang, Yongyao Wang, et al\. Step\-DeepResearch technical report\. *arXiv preprint arXiv:2512\.20491*, 2025\.
- Huang et al\. \[2024\] Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, and Lichao Sun\. MetaTool benchmark for large language models: Deciding whether to use tools and which to use\. In *The Twelfth International Conference on Learning Representations*, 2024\.
- Hwang et al\. \[2026\] Jena D\. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada, Jonathan Bragg, Mike D’Arcy, Daniel S\. Weld, Lucy Lu Wang, Doug Downey, and Sergey Feldman\. Deep research, shallow evaluation: A case study in meta\-evaluation for long\-form qa benchmarks, 2026\. URL [https://arxiv\.org/abs/2603\.06942](https://arxiv.org/abs/2603.06942)\.
- Kokane et al\. \[2025\] Shirley Kokane, Ming Zhu, Tulika Manoj Awalgaonkar, Jianguo Zhang, Akshara Prabhakar, Thai Quoc Hoang, Zuxin Liu, Rithesh R N, Liangwei Yang, Weiran Yao, Juntao Tan, Zhiwei Liu, Shelby Heinecke, Huan Wang, Juan Carlos Niebles, Caiming Xiong, and Silvio Savarese\. Toolscan: A benchmark for characterizing errors in tool\-use LLMs, 2025\. URL [https://openreview\.net/forum?id=09tnQgqKuZ](https://openreview.net/forum?id=09tnQgqKuZ)\.
- Lan et al\. \[2025\] Tian Lan, Bin Zhu, Qianghuai Jia, Junyang Ren, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, and Kaifu Zhang\. Deepwidesearch: Benchmarking depth and width in agentic information seeking\. *arXiv preprint arXiv:2510\.20168*, 2025\.
- Lewis et al\. \[2020\] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen\-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela\. Retrieval\-augmented generation for knowledge\-intensive nlp tasks\. In *Advances in Neural Information Processing Systems*, 2020\.
- Li et al\. \[2025a\] Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, and Kai Jia\. ReportBench: Evaluating deep research agents via academic survey tasks\. *arXiv preprint arXiv:2508\.15804*, 2025a\.
- Li et al\. \[2025b\] Xuzhao Li, Xuchen Li, Shiyu Hu, Yongzhen Guo, and Wentao Zhang\. VerifyBench: A systematic benchmark for evaluating reasoning verifiers across domains\. *arXiv preprint arXiv:2507\.09884*, 2025b\.
- Liu et al\. \[2023\] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu\. G\-Eval: NLG evaluation using GPT\-4 with better human alignment\. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 2511–2522\. Association for Computational Linguistics, 2023\.
- Liu et al\. \[2025\] Yixin Liu, Kejian Shi, Alexander Fabbri, Yilun Zhao, PeiFeng Wang, Chien\-Sheng Wu, Shafiq Joty, and Arman Cohan\. ReIFE: Re\-evaluating instruction\-following evaluation\. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, *Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, pages 12247–12287, Albuquerque, New Mexico, April 2025\. Association for Computational Linguistics\. ISBN 979\-8\-89176\-189\-6\. doi:10\.18653/v1/2025\.naacl\-long\.610\. URL [https://aclanthology\.org/2025\.naacl\-long\.610/](https://aclanthology.org/2025.naacl-long.610/)\.
- Lu et al\. \[2025\] Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, and Ruoming Pang\. ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities\. In *Findings of the Association for Computational Linguistics: NAACL 2025*, 2025\.
- Lù et al\. \[2025\] Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J\. Pal, and Siva Reddy\. AgentRewardBench: Evaluating automatic evaluations of web agent trajectories\. *arXiv preprint arXiv:2504\.08942*, 2025\. URL [https://arxiv\.org/abs/2504\.08942](https://arxiv.org/abs/2504.08942)\.
- Malik et al\. \[2026\] Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A\. Smith, Hannaneh Hajishirzi, and Nathan Lambert\. Rewardbench 2: Advancing reward model evaluation\. In *The Fourteenth International Conference on Learning Representations*, 2026\.
- Martin\-Boyle et al\. \[2026\] Anna Martin\-Boyle, William Humphreys, Martha Brown, Cara Leckey, and Harmanpreet Kaur\. An expert schema for evaluating large language model errors in scholarly question\-answering systems\. In *Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems*, 2026\.
- Min et al\. \[2023\] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen\-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi\. FActScore: Fine\-grained atomic evaluation of factual precision in long form text generation\. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, 2023\.
- Ming et al\. \[2025\] Yifei Ming, Senthil Purushwalkam, Shrey Pandit, Zixuan Ke, Xuan\-Phi Nguyen, Caiming Xiong, and Shafiq Joty\. FaithEval: Can your language model stay faithful to context, even if “the moon is made of marshmallows”\. In *The Thirteenth International Conference on Learning Representations*, 2025\.
- Nakano et al\. \[2021\] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, et al\. Webgpt: Browser\-assisted question\-answering with human feedback\. *arXiv preprint arXiv:2112\.09332*, 2021\.
- Niu et al\. \[2024\] Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, and Tong Zhang\. RAGTruth: A hallucination corpus for developing trustworthy retrieval\-augmented language models\. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics*, 2024\.
- OpenAI \[2025\] OpenAI\. GPT\-5 mini\. [https://developers\.openai\.com/api/docs/models/gpt\-5\-mini](https://developers.openai.com/api/docs/models/gpt-5-mini), August 2025\. Model version: gpt\-5\-mini\-2025\-08\-07\.
- OpenAI \[2026a\] OpenAI\. Introducing GPT\-5\.3\-codex\. OpenAI release and API documentation, 2026a\.
- OpenAI \[2026b\] OpenAI\. GPT\-5\.4 model\. OpenAI API documentation, 2026b\.
- Patil et al\. \[2025\] Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng\-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E\. Gonzalez\. The berkeley function calling leaderboard \(BFCL\): From tool use to agentic evaluation of large language models, 2025\. URL [https://openreview\.net/forum?id=2GmDdhBdDk](https://openreview.net/forum?id=2GmDdhBdDk)\.
- Rafailov et al\. \[2023\] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn\. Direct preference optimization: Your language model is secretly a reward model, 2023\. URL [https://openreview\.net/forum?id=HPuSIXJaa9](https://openreview.net/forum?id=HPuSIXJaa9)\.
- Saad\-Falcon et al\. \[2024\] Jon Saad\-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia\. ARES: An automated evaluation framework for retrieval\-augmented generation systems\. In *Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, 2024\.
- Sachdeva et al\. \[2025\] Rachneet Singh Sachdeva, Yixiao Song, Mohit Iyyer, and Iryna Gurevych\. Localizing and mitigating errors in long\-form question answering\. In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 20437–20469, 2025\.
- Schick et al\. \[2023\] Timo Schick, Jane Dwivedi\-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom\. Toolformer: Language models can teach themselves to use tools\. In *Thirty\-seventh Conference on Neural Information Processing Systems*, 2023\. URL [https://openreview\.net/forum?id=Yacmpz84TH](https://openreview.net/forum?id=Yacmpz84TH)\.
- Shao et al\. \[2025a\] Jiaqi Shao, Yuxiang Lin, Munish Prasad Lohani, Yufeng Miao, and Bing Luo\. Do LLM agents know how to ground, recover, and assess? a benchmark for epistemic competence in information\-seeking agents\. *arXiv preprint arXiv:2509\.22391*, 2025a\.
- Shao et al\. \[2025b\] Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G\. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen\-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, and Pang Wei Koh\. Dr tulu: Reinforcement learning with evolving rubrics for deep research, 2025b\. URL [https://arxiv\.org/abs/2511\.19399](https://arxiv.org/abs/2511.19399)\.
- Sharma et al\. \[2025\] Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, et al\. Researchrubrics: A benchmark of prompts and rubrics for evaluating deep research agents\. *arXiv preprint arXiv:2511\.07685*, 2025\.
- Tan et al\. \[2025\] Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Yuan Tang, Alejandro Cuadron, Chenguang Wang, Raluca Popa, and Ion Stoica\. Judgebench: A benchmark for evaluating LLM\-based judges\. In *The Thirteenth International Conference on Learning Representations*, 2025\.
- Team et al\. \[2025\] Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al\. Tongyi deepresearch technical report\. *arXiv preprint arXiv:2510\.24701*, 2025\.
- Wang et al\. \[2026\] Yibo Wang, Lei Wang, Yue Deng, Keming Wu, Yao Xiao, Huanjin Yao, Liwei Kang, Hai Ye, Yongcheng Jing, and Lidong Bing\. DeepResearchEval: An automated framework for deep research task construction and agentic evaluation\. *arXiv preprint arXiv:2601\.09688*, 2026\.
- Wei et al\. \[2024\] Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Zixia Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, and Quoc V Le\. Long\-form factuality in large language models\. In *The Thirty\-eighth Annual Conference on Neural Information Processing Systems*, 2024\. URL [https://openreview\.net/forum?id=4M9f8VMt2C](https://openreview.net/forum?id=4M9f8VMt2C)\.
- Yang et al\. \[2025\] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Hao, Tianyi Li, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu\. Qwen3 technical report\. *arXiv preprint arXiv:2505\.09388*, 2025\.
- Yao et al\. \[2023\] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao\. React: Synergizing reasoning and acting in language models\. In *International Conference on Learning Representations*, 2023\.
- Yao et al\. \[2025\] Yang Yao, Yixu Wang, Yuxuan Zhang, Yi Lu, Tianle Gu, Lingyu Li, Dingyi Zhao, Keming Wu, Haozhe Wang, Ping Nie, Yan Teng, and Yingchun Wang\. A rigorous benchmark with multidimensional evaluation for deep research agents: From answers to reports\. *arXiv preprint arXiv:2510\.02190*, 2025\.
- Ye et al\. \[2026\] Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei Wang, Zhen Zhang, Lu Wang, Yue Deng, Bin Wang, Yifan Zhang, Liangcai Su, Xinyu Wang, He Zhao, Chen Wei, Qiang Ren, Bryan Hooi, An Bo, Shuicheng Yan, and Lidong Bing\. MiroEval: Benchmarking multimodal deep research agents in process and outcome\. *arXiv preprint arXiv:2603\.28407*, 2026\.
- Yifei et al\. \[2025\] Li S\. Yifei, Allen Chang, Chaitanya Malaviya, and Mark Yatskar\. Researchqa: Evaluating scholarly question answering at scale across 75 fields with survey\-mined questions and rubrics, 2025\. URL [https://arxiv\.org/abs/2509\.00496](https://arxiv.org/abs/2509.00496)\.
- Yue et al\. \[2024\] Xiang Yue, Boshi Wang, Ziru Chen, Kai Zhang, Yu Su, and Huan Sun\. Automatic evaluation of attribution by large language models\. In *Findings of the Association for Computational Linguistics: EMNLP 2024*, 2024\.
- Zeng et al\. \[2024\] Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, and Danqi Chen\. Evaluating large language models at evaluating instruction following, 2024\. URL [https://arxiv\.org/abs/2310\.07641](https://arxiv.org/abs/2310.07641)\.
- Zhan et al\. \[2026\] Yuhao Zhan, Tianyu Fan, Linxuan Huang, Zirui Guo, and Chao Huang\. Why your deep research agent fails? on hallucination evaluation in full research trajectory\. *arXiv preprint arXiv:2601\.22984*, 2026\.
- Zhang et al\. \[2026a\] Chen Zhang, Kuicai Dong, Dexun Li, Wenjun Li, Qu Yang, Wei Han, and Yong Liu\. SRR\-Judge: Step\-level rating and refinement for enhancing search\-integrated reasoning in search agents\. *arXiv preprint arXiv:2602\.07773*, 2026a\.
- Zhang et al\. \[2024a\] Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, and Juanzi Li\. LongCite: Enabling LLMs to generate fine\-grained citations in long\-context qa\. *arXiv preprint arXiv:2409\.02897*, 2024a\.
- Zhang et al\. \[2026b\] Jiajie Zhang, Xin Lv, Ling Feng, Lei Hou, and Juanzi Li\. Chaining the evidence: Robust reinforcement learning for deep search agents with citation\-aware rubric rewards\. *arXiv preprint arXiv:2601\.06021*, 2026b\.
- Zhang et al\. \[2024b\] Yuxiang Zhang, Jing Chen, Junjie Wang, Yaxin Liu, Cheng Yang, Chufan Shi, Xinyu Zhu, Zihao Lin, Hanwen Wan, Yujiu Yang, Tetsuya Sakai, Tian Feng, and Hayato Yamana\. ToolBeHonest: A multi\-level hallucination diagnostic benchmark for tool\-augmented large language models\. In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, 2024b\.
- Zheng et al\. \[2023\] Lianmin Zheng, Wei\-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al\. Judging LLM\-as\-a\-judge with MT\-Bench and chatbot arena\. In *Advances in Neural Information Processing Systems*, 2023\.
- Zhong et al\. \[2025\] Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, and Jie Tang\. Complexfuncbench: Exploring multi\-step and constrained function calling under long\-context scenario, 2025\. URL [https://arxiv\.org/abs/2501\.10132](https://arxiv.org/abs/2501.10132)\.
- Zhou et al\. \[2025\] Yilun Zhou, Austin Xu, PeiFeng Wang, Caiming Xiong, and Shafiq Joty\. Evaluating judges as evaluators: The JETTS benchmark of LLM\-as\-judges as test\-time scaling evaluators\. In *Forty\-second International Conference on Machine Learning*, 2025\.
- Zhu et al\. \[2025\] Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, Xiaoteng Ma, Xiaodong Yu, Gowtham Ramesh, Jialian Wu, Zicheng Liu, Pan Lu, James Zou, and Jiaxuan You\. Where llm agents fail and how they can learn from failures\. *arXiv preprint arXiv:2509\.25370*, 2025\.
- Zhuge et al\. \[2025\] Mingchen Zhuge, Changsheng Zhao, Dylan R\. Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, and Jürgen Schmidhuber\. Agent\-as\-a\-judge: Evaluate agents with agents\. In *Proceedings of the 42nd International Conference on Machine Learning*, volume 267 of *Proceedings of Machine Learning Research*, pages 80569–80611\. PMLR, 2025\.
## Appendix A Taxonomy Details and Related Work
### A\.1 Full Perturbation Taxonomy
Table [3](https://arxiv.org/html/2605.19196#A1.T3) provides the complete definitions of all process\-level and outcome\-level perturbation types used in our benchmark\.

Table 3: Full definitions for all perturbation types in our taxonomy\.

| Module | Category | Error Type | Definition |
| --- | --- | --- | --- |
| Process\-level errors | | | |
| Reasoning | Structure | Execution Stagnation | Consecutive search rounds repeat the same terms or fail to build on prior findings, causing the search process to loop without expanding coverage\. |
| Reasoning | Analysis | Shallow Reflection | Summarizes prior results without identifying knowledge gaps or adjusting the subsequent search direction, adding little analytical value\. |
| Reasoning | Faithfulness | Evidence Omission | Relevant evidence is available in the collected sources but is not incorporated into the final answer, resulting in incomplete synthesis\. |
| Reasoning | Faithfulness | Evidence Fabrication | Fabricates citations, findings, or author positions that are not present in any retrieved source\. |
| Tool | Structure | Wrong Tool Selection | The agent invokes a tool whose capability does not match the user’s information need\. |
| Tool | Faithfulness | Constraint Drop | The tool call omits one or more constraints implied by the user’s information need, causing the retrieved content to fall outside the user’s requested scope\. |
| Tool | Faithfulness | Argument Corruption | A tool\-call argument contains an incorrect value while preserving the argument structure\. Includes named\-entity errors and numeric or temporal values that deviate from the user’s intended specification\. |
| Tool | Faithfulness | Result Irrelevance | The content returned by the tool falls outside the scope defined by the call’s arguments\. |
| Tool | Groundedness | Wrong Source Citation | A claim in the response is attributed to a specific retrieved source, but the source’s actual content does not support the paired claim\. |
| Tool | Groundedness | Tool Response Hallucination | The response contains a fact or entity\-claim binding that is not grounded in any retrieved source\. |
| Outcome\-level errors | | | |
| Output | Relevance | Incomplete Coverage | The response does not adequately cover the key aspects of the user’s question\. Some aspects may be missing entirely, while others may be mentioned only briefly or without enough detail\. |
| Output | Relevance | Topical Misalignment | The response includes content that is not directly relevant to the user’s question, or gradually drifts away from the requested topic\. |
| Output | Faithfulness | Citation Groundedness | The response uses a citation that is incorrect or unsupported, such as a fake citation, wrong citation number, misattributed source, or a citation that does not actually support the claim\. |
| Output | Faithfulness | Evidence Omission | The response states a conclusion or important claim without providing sufficient evidence, examples, citations, or supporting details\. |
| Output | Faithfulness | Fabrication | The response presents facts, findings, conclusions, examples, or relationships that are demonstrably false, invented, or attributed to the wrong entity/source\. |
| Output | Expression | Expression Quality | The response has problems in readability, clarity, or language quality, such as awkward phrasing, repetition, or unnecessary verbosity\. |
| Output | Synthesis | Incoherence | The response does not form a coherent whole due to contradictions, unclear transitions, reasoning gaps, or poor structural organization\. |
### A\.2 Related Work Coverage
Tables [4](https://arxiv.org/html/2605.19196#A1.T4) and [5](https://arxiv.org/html/2605.19196#A1.T5) summarize how prior work motivates and overlaps with our process\-level and outcome\-level error taxonomy\.

Table 4: Related works for process\-level error taxonomy\.

| Category | Error Type | Papers Mentioning the Error Type |
| --- | --- | --- |
| Structure | Execution Stagnation | SRR\-Judge \(Coverage & Improvement Potential; Query Appropriateness; Logical Structure\) \[[59](https://arxiv.org/html/2605.19196#bib.bib59)\] |
| Analysis | Shallow Reflection | AgentErrorTaxonomy \(Over\-simplification / Incomplete Summary\) \[[66](https://arxiv.org/html/2605.19196#bib.bib66)\]; SRR\-Judge \(Coverage & Improvement Potential; Logical Structure; Clarity & Conciseness\) \[[59](https://arxiv.org/html/2605.19196#bib.bib59)\]; DeepWideSearch \(lack of reflection\) \[[22](https://arxiv.org/html/2605.19196#bib.bib22)\] |
| Faithfulness | Evidence Omission | AgentErrorTaxonomy \(Memory: Over\-simplification / Incomplete Summary\) \[[66](https://arxiv.org/html/2605.19196#bib.bib66)\]; DeepHalluBench \(Hallucination\) \[[58](https://arxiv.org/html/2605.19196#bib.bib58)\]; SeekBench \(Groundedness\) \[[44](https://arxiv.org/html/2605.19196#bib.bib44)\] |
| Faithfulness | Evidence Fabrication | AgentErrorTaxonomy \(Hallucination\) \[[66](https://arxiv.org/html/2605.19196#bib.bib66)\]; DeepHalluBench \(Hallucination\) \[[58](https://arxiv.org/html/2605.19196#bib.bib58)\]; SeekBench \(Groundedness\) \[[44](https://arxiv.org/html/2605.19196#bib.bib44)\] |
| Structure | Wrong Tool Selection | AgentErrorTaxonomy \(Unnecessary Tool; Missing Tool\) \[[66](https://arxiv.org/html/2605.19196#bib.bib66)\]; MetaTool \(Tool Selection\) \[[19](https://arxiv.org/html/2605.19196#bib.bib19)\]; ToolBeHonest \(Tool\-Selection Hallucination\) \[[62](https://arxiv.org/html/2605.19196#bib.bib62)\]; BFCL \(Function Selection; Relevance Detection\) \[[39](https://arxiv.org/html/2605.19196#bib.bib39)\] |
| Faithfulness | Constraint Drop | AgentErrorTaxonomy \(Constraint Ignorance\) \[[66](https://arxiv.org/html/2605.19196#bib.bib66)\]; ComplexFuncBench \(Implicit Parameter Reasoning\) \[[64](https://arxiv.org/html/2605.19196#bib.bib64)\]; SRR\-Judge \(Query Appropriateness\) \[[59](https://arxiv.org/html/2605.19196#bib.bib59)\]; ToolSandbox \(Insufficient Information\) \[[28](https://arxiv.org/html/2605.19196#bib.bib28)\] |
| Faithfulness | Argument Corruption | AgentErrorTaxonomy \(Incorrect Argument\) \[[66](https://arxiv.org/html/2605.19196#bib.bib66)\]; SpecTool \(Incorrect Argument Value; Name; Type\) \[[21](https://arxiv.org/html/2605.19196#bib.bib21)\]; BFCL \(Parameter\-Value Correctness\) \[[39](https://arxiv.org/html/2605.19196#bib.bib39)\]; ToolBeHonest \(Tool Format Hallucination; Tool Content Hallucination\) \[[62](https://arxiv.org/html/2605.19196#bib.bib62)\]; ToolSandbox \(Time\-related Argument Hallucinations; Named\-Entity Errors\) \[[28](https://arxiv.org/html/2605.19196#bib.bib28)\] |
| Faithfulness | Result Irrelevance | RAGAs \(Context Relevance\) \[[8](https://arxiv.org/html/2605.19196#bib.bib8)\]; ARES \(Context Relevance\) \[[41](https://arxiv.org/html/2605.19196#bib.bib41)\]; RGB \(Noise Robustness; Negative Rejection\) \[[4](https://arxiv.org/html/2605.19196#bib.bib4)\]; SeekBench \(Recovery from Low\-Quality Evidence\) \[[44](https://arxiv.org/html/2605.19196#bib.bib44)\] |
| Groundedness | Wrong Source Citation | ALCE \(Citation Precision; Citation Recall\) \[[11](https://arxiv.org/html/2605.19196#bib.bib11)\]; AttrScore \(Attributable; Extrapolatory; Contradictory\) \[[56](https://arxiv.org/html/2605.19196#bib.bib56)\]; LongCite \(Citation F1\) \[[60](https://arxiv.org/html/2605.19196#bib.bib60)\]; DeepResearch Bench \(Citation Accuracy\) \[[6](https://arxiv.org/html/2605.19196#bib.bib6)\]; DEER \(Cited Claim Verification\) \[[17](https://arxiv.org/html/2605.19196#bib.bib17)\] |
| Groundedness | Tool Response Hallucination | RAGTruth \(Baseless Info; Conflict\-with\-Context\) \[[35](https://arxiv.org/html/2605.19196#bib.bib35)\]; FActScore \(Atomic Fact Support\) \[[32](https://arxiv.org/html/2605.19196#bib.bib32)\]; DeepHalluBench \(PIES Taxonomy\) \[[58](https://arxiv.org/html/2605.19196#bib.bib58)\]; FaithEval \(Contextual Faithfulness\) \[[33](https://arxiv.org/html/2605.19196#bib.bib33)\]; SAFE \(Long\-form Factuality\) \[[50](https://arxiv.org/html/2605.19196#bib.bib50)\] |

Table 5: Related works for outcome\-level error taxonomy\.

| Category | Error Type | Papers Mentioning the Error Type |
| --- | --- | --- |
| Relevance | Incomplete Coverage | HaluQuestQA \(Completeness; incomplete information\) \[[42](https://arxiv.org/html/2605.19196#bib.bib42)\]; Expert Schema \(Incomplete Answer; Major omissions; Lacking details\) \[[31](https://arxiv.org/html/2605.19196#bib.bib31)\]; Dr\. Bench \(Information coverage; Informational coverage & content depth\) \[[53](https://arxiv.org/html/2605.19196#bib.bib53)\]; ResearchRubrics \(Completeness; rubric\-item coverage\) \[[46](https://arxiv.org/html/2605.19196#bib.bib46)\] |
| Relevance | Topical Misalignment | Dr\. Bench \(Topical Focus; Semantic Drift\) \[[53](https://arxiv.org/html/2605.19196#bib.bib53)\]; DRSE \(Answer Relevance\) \[[20](https://arxiv.org/html/2605.19196#bib.bib20)\]; HaluQuestQA \(Relevance\) \[[42](https://arxiv.org/html/2605.19196#bib.bib42)\]; Expert Schema \(Question redirection; Question misinterpretation\) \[[31](https://arxiv.org/html/2605.19196#bib.bib31)\] |
| Faithfulness | Citation Groundedness | DRSE \(Citation Precision; Citation Recall\) \[[20](https://arxiv.org/html/2605.19196#bib.bib20)\]; DeepResearch Bench \(Effective Citation Count; Overall Citation Accuracy\) \[[6](https://arxiv.org/html/2605.19196#bib.bib6)\]; Dr\. Bench \(Retrieval Trustworthiness; Trustworthy\-Source Links\) \[[53](https://arxiv.org/html/2605.19196#bib.bib53)\]; Expert Schema \(Citation information; Source confusion; Incomplete references; Inconsistent referencing\) \[[31](https://arxiv.org/html/2605.19196#bib.bib31)\] |
| Faithfulness | Evidence Omission | HaluQuestQA \(Completeness; References; incomplete information\) \[[42](https://arxiv.org/html/2605.19196#bib.bib42)\]; Expert Schema \(Incomplete references; Lacking details; Incomplete Answer\) \[[31](https://arxiv.org/html/2605.19196#bib.bib31)\]; Dr\. Bench \(Citation quality & source credibility; source verification; evidence organization\) \[[53](https://arxiv.org/html/2605.19196#bib.bib53)\]; DEER \(Evidence Coverage; Information Sufficiency\) \[[17](https://arxiv.org/html/2605.19196#bib.bib17)\] |
| Faithfulness | Fabrication | Expert Schema \(Contains hallucinations; Basic accuracy issues\) \[[31](https://arxiv.org/html/2605.19196#bib.bib31)\]; Dr\. Bench \(factual accuracy; source verification\) \[[53](https://arxiv.org/html/2605.19196#bib.bib53)\]; HaluQuestQA \(Factuality; factual inconsistencies\) \[[42](https://arxiv.org/html/2605.19196#bib.bib42)\] |
| Expression | Expression Quality | Expert Schema \(Verbosity; Language issues; Notation errors\) \[[31](https://arxiv.org/html/2605.19196#bib.bib31)\]; Dr\. Bench \(logical clarity & expression; formatting consistency\) \[[53](https://arxiv.org/html/2605.19196#bib.bib53)\]; ResearchRubrics \(Clarity\) \[[46](https://arxiv.org/html/2605.19196#bib.bib46)\] |
| Synthesis | Incoherence | Expert Schema \(Self\-contradiction; Disjointed response\) \[[31](https://arxiv.org/html/2605.19196#bib.bib31)\]; Dr\. Bench \(structural organization; information integration\) \[[53](https://arxiv.org/html/2605.19196#bib.bib53)\]; DRSE \(Organization\) \[[20](https://arxiv.org/html/2605.19196#bib.bib20)\]; ResearchRubrics \(cross\-document synthesis; reasoning soundness; clarity\) \[[46](https://arxiv.org/html/2605.19196#bib.bib46)\] |
## Appendix B Benchmark Construction and Validation
This section provides additional details on how benchmark instances are organized and validated\. We report the distribution of perturbation types to make clear how many paired examples are available for each failure mode, and we describe the validation process used to ensure that each perturbed instance reflects the intended error type\.
### B\.1 Dataset Statistics
Table [6](https://arxiv.org/html/2605.19196#A2.T6) reports the number of perturbation pairs for each failure type in our benchmark\. The distribution is approximately balanced across the process\-level \(reasoning and tool\-use\) and outcome\-level error types\.

Table 6: Dataset statistics for Reflect perturbation types\.

| Target | Category | Perturbation Type | \# Pairs | % |
| --- | --- | --- | --- | --- |
| Reasoning process perturbations | | | | |
| Reasoning | Faithfulness | Evidence Fabrication | 35 | 25\.00 |
| Reasoning | Faithfulness | Evidence Omission | 36 | 25\.71 |
| Reasoning | Analysis | Shallow Reflection | 40 | 28\.57 |
| Reasoning | Structure | Execution Stagnation | 29 | 20\.71 |
| Reasoning | Total | – | 140 | 100\.00 |
| Tool\-use perturbations | | | | |
| Tool Use | Structure | Wrong Tool Selection | 11 | 8\.33 |
| Tool Use | Faithfulness | Constraint Drop | 11 | 8\.33 |
| Tool Use | Faithfulness | Argument Corruption | 28 | 21\.21 |
| Tool Use | Faithfulness | Result Irrelevance | 28 | 21\.21 |
| Tool Use | Groundedness | Wrong Source Citation | 28 | 21\.21 |
| Tool Use | Groundedness | Tool Response Hallucination | 26 | 19\.70 |
| Tool Use | Total | – | 132 | 100\.00 |
| Outcome\-level perturbations | | | | |
| Outcome | Faithfulness | Citation Groundedness | 26 | 13\.00 |
| Outcome | Faithfulness | Evidence Omission | 28 | 14\.00 |
| Outcome | Faithfulness | Fabrication | 29 | 14\.50 |
| Outcome | Expression | Expression Quality | 30 | 15\.00 |
| Outcome | Relevance | Incomplete Coverage | 28 | 14\.00 |
| Outcome | Relevance | Topical Misalignment | 30 | 15\.00 |
| Outcome | Synthesis | Incoherence | 29 | 14\.50 |
| Outcome | Total | – | 200 | 100\.00 |
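The percentages in Table 6 appear to be shares within each target group rather than shares of the whole benchmark. A minimal sketch that recomputes them from the pair counts is given below; the counts are copied from the table, while the dictionary layout and the function name `shares` are illustrative assumptions, not part of the released benchmark.

```python
# Pair counts copied from Table 6; the nested-dict layout is illustrative.
PAIR_COUNTS = {
    "Reasoning": {
        "Evidence Fabrication": 35,
        "Evidence Omission": 36,
        "Shallow Reflection": 40,
        "Execution Stagnation": 29,
    },
    "Tool Use": {
        "Wrong Tool Selection": 11,
        "Constraint Drop": 11,
        "Argument Corruption": 28,
        "Result Irrelevance": 28,
        "Wrong Source Citation": 28,
        "Tool Response Hallucination": 26,
    },
    "Outcome": {
        "Citation Groundedness": 26,
        "Evidence Omission": 28,
        "Fabrication": 29,
        "Expression Quality": 30,
        "Incomplete Coverage": 28,
        "Topical Misalignment": 30,
        "Incoherence": 29,
    },
}


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Percentage of each perturbation type within its own target group."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Per-target totals and the overall number of perturbation pairs.
totals = {target: sum(c.values()) for target, c in PAIR_COUNTS.items()}
overall = sum(totals.values())

for target, counts in PAIR_COUNTS.items():
    print(target, totals[target], shares(counts))
print("all targets:", overall)
```

Under this reading, each reported percentage matches count divided by the group total (for example, 11 of 132 tool-use pairs is 8.33%), and the three groups sum to 472 pairs.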
### B\.2 Human\-in\-the\-Loop Calibration and Curation
We validate perturbations to ensure that each edited instance introduces the intended failure type while preserving the surrounding execution or report context\. This validation step checks that the reference instance does not already contain the target failure at the selected edit site, and that the perturbed instance reflects the controlled degradation rather than an unrelated change\. The following examples illustrate representative perturbations and validation cases across our taxonomy\.
##### Human Annotation Interface\.
Figure [6](https://arxiv.org/html/2605.19196#A2.F6) shows the annotation interface used in our validation process\. The top panel displays metadata for the instance, including the user query, source dataset, trace identifier, perturbation type, and expected metric drops\. The middle panel shows a side\-by\-side comparison of the original and perturbed text, with deleted spans highlighted in green and inserted spans highlighted in red\. The bottom panel allows annotators to assign one of three labels: valid, invalid, or ambiguous\. This design helps annotators judge whether the perturbation is aligned with the target error definition and whether the local edit produces the intended degradation in relevance, factuality, coherence, coverage, or expression quality\.
Figure 6: Human annotation interface for perturbation validation\. Annotators review the user query, target perturbation type, error definition, expected metric drops, and side\-by\-side diff between the original and perturbed content\. They then label each perturbation as valid, invalid, or ambiguous\.
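The side\-by\-side diff described above could be approximated with Python's standard `difflib` module. The sketch below is not the interface's actual implementation, which the paper does not specify: the function names `diff_spans` and `is_localized`, the whitespace tokenization, and the one\-region threshold are all assumptions made for illustration.

```python
import difflib


def diff_spans(original: str, perturbed: str) -> tuple[list[str], list[str]]:
    """Return (deleted, inserted) word spans between two texts."""
    a, b = original.split(), perturbed.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    deleted, inserted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.append(" ".join(a[i1:i2]))   # words only in the original
        if tag in ("insert", "replace"):
            inserted.append(" ".join(b[j1:j2]))  # words only in the perturbed text
    return deleted, inserted


def is_localized(original: str, perturbed: str, max_regions: int = 1) -> bool:
    """True if the texts differ in at least one and at most `max_regions` regions."""
    matcher = difflib.SequenceMatcher(
        a=original.split(), b=perturbed.split(), autojunk=False
    )
    regions = sum(1 for tag, *_ in matcher.get_opcodes() if tag != "equal")
    return 1 <= regions <= max_regions


# Hypothetical toy pair, not taken from the benchmark.
original = "the market grew 5% in 2024"
perturbed = "the market grew 9% in 2024"
print(diff_spans(original, perturbed))
print(is_localized(original, perturbed))
```

A check like `is_localized` only confirms that an edit is confined to a small region; whether the edit realizes the intended error type still requires the human labels described above.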
## Appendix C Implementation Details and Prompts
### C\.1 Pointwise Judge Prompts
HOLISTIC\_RUBRIC\_NO\_COT\_PROMPT

```
You are an expert evaluator of AI research agent traces. Given a research query and the agent’s full reasoning trace (including planning, search, reflection, and synthesis steps), rate the trace on the following dimensions. Use a 1-5 scale where 1=very poor and 5=excellent.

Dimensions:
- coherence: Does the reasoning flow logically? Do later steps build on earlier findings?
- factual_alignment: Are claims accurate and consistent with the retrieved evidence?
- info_density: Does the trace contain specific facts, numbers, and concrete details?
- synthesis_quality: Does the final answer integrate information across sources with comparative analysis?

Query: {query}

Agent Trace:
{trace}

Output ONLY a JSON object with four integer scores, no other text:
{{"coherence": ..., "factual_alignment": ..., "info_density": ..., "synthesis_quality": ...}}
```
FINE\-GRAINED\_RUBRIC\_NO\_COT\_PROMPT

```
You are an expert evaluator of AI research agent reasoning steps. You are given a research query, the prefix of an agent’s reasoning trace (all steps completed so far), and a single candidate next step. Your task is to evaluate how good that candidate step is as the immediate next step.

Rate the candidate step on three dimensions using a 1-5 scale (1=very poor, 5=excellent):
- relevance: Does the step follow logically from the prefix and remain directly on-topic for the query?
- quality: Is the step well-executed for its type? (e.g., a search step issues a precise, focused query; a planning step is clearly structured; a synthesis step integrates evidence coherently)
- contribution: Does the step add meaningful new information or meaningfully advance the reasoning toward answering the query?

Query: {query}

=== Trace Prefix (completed steps) ===
{prefix}

=== Candidate Next Step ===
{step}

Output ONLY a JSON object with three integer scores, no other text:
{{"relevance": ..., "quality": ..., "contribution": ...}}
```
HOLISTIC\_RUBRIC\_NO\_RUBRIC\_COT\_PROMPT

```
You are an expert evaluator of AI research agent traces. Given a research query and the agent’s full reasoning trace, assess the overall quality holistically. Use a 1-10 scale where 1=very poor and 10=excellent.

Consider all aspects of quality together: logical coherence, factual accuracy, information richness, and how well the final answer synthesizes the evidence.

Query: {query}

Agent Trace:
{trace}

First, briefly analyze the trace’s strengths and weaknesses in 2-3 sentences. Then output your final score as a JSON object with a single integer score:
{{"overall": ...}}

Analysis:
```
HOLISTIC\_RUBRIC\_NO\_RUBRIC\_NO\_COT\_PROMPT

```
You are an expert evaluator of AI research agent traces. Given a research query and the agent’s full reasoning trace (including planning, search, reflection, and synthesis steps), rate the overall quality of the trace holistically. Use a 1-10 scale where 1=very poor and 10=excellent.

Consider all aspects of quality together: logical coherence, factual accuracy, information richness, and how well the final answer synthesizes the evidence.

Query: {query}

Agent Trace:
{trace}

Output ONLY a JSON object with a single integer score, no other text:
{{"overall": ...}}
```
FINE\-GRAINED\_NO\_RUBRIC\_NO\_COT\_PROMPT

```
You are an expert evaluator of AI research agent reasoning steps. You are given a research query, the prefix of an agent’s reasoning trace (all steps completed so far), and a single candidate next step. Rate the overall quality of that candidate step holistically. Use a 1-10 scale where 1=very poor and 10=excellent.

Consider all aspects together: whether the step follows logically from the prefix, how well it is executed for its type, and how meaningfully it advances reasoning toward the answer.

Query: {query}

=== Trace Prefix (completed steps) ===
{prefix}

=== Candidate Next Step ===
{step}

Output ONLY a JSON object with a single integer score, no other text:
{{"overall": ...}}
```
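The doubled braces in the prompts above suggest that they are filled with Python's `str.format`, although the paper does not state this. Under that assumption, the sketch below shows one way to fill a template and to recover validated integer scores from a judge response, including a chain\-of\-thought response where free\-text analysis precedes the JSON object. `TEMPLATE` is an abbreviated stand\-in rather than one of the full prompts, and `build_prompt` and `extract_scores` are hypothetical helper names.

```python
import json

# Abbreviated stand-in for a judge prompt (assumption: not the full wording).
# Doubled braces survive str.format as literal braces.
TEMPLATE = (
    "Query: {query}\n"
    "Agent Trace: {trace}\n"
    "Output ONLY a JSON object with a single integer score, no other text: "
    '{{"overall": ...}}'
)


def build_prompt(template: str, **fields: str) -> str:
    """Fill the named placeholders of a judge prompt template."""
    return template.format(**fields)


def extract_scores(response: str, keys: list[str], low: int, high: int) -> dict[str, int]:
    """Return the last JSON object in `response` after validating its scores."""
    decoder = json.JSONDecoder()
    found = None
    pos = 0
    while True:
        start = response.find("{", pos)
        if start == -1:
            break
        try:
            obj, end = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON at this brace; keep scanning
            continue
        found = obj  # a later object overrides an earlier one
        pos = end
    if found is None:
        raise ValueError("no JSON object found in judge response")
    if set(found) != set(keys):
        raise ValueError(f"unexpected score keys: {sorted(found)}")
    for name, value in found.items():
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} is not an integer")
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return found


prompt = build_prompt(TEMPLATE, query="example query", trace="example trace")
cot_reply = 'The trace is coherent but thin on evidence. {"overall": 7}'
print(extract_scores(cot_reply, ["overall"], 1, 10))
```

Taking the last parseable object is a design choice for chain\-of\-thought replies, where the score follows the analysis; range checks use 1 to 5 for the rubric prompts and 1 to 10 for the holistic ones.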
## Appendix D Taxonomy Validation: Case Studies and Perturbation Examples
### D\.1 Outcome\-level Analysis
#### D\.1\.1 Error Case Studies
The following cases are drawn from Deep Research Bench \[[6](https://arxiv.org/html/2605.19196#bib.bib6)\] and Tongyi DeepResearch \[[48](https://arxiv.org/html/2605.19196#bib.bib48)\]\. Each example includes the user query, selected response excerpts, and annotations based on our output\-level error taxonomy\.
Case 1: Citation Groundedness and Evidence Omission\.
User Input: Please prepare a market research analysis of the global video editing and creation software/tool market\. Include major products like those from Adobe \(Premiere Pro, After Effects\), CapCut, DaVinci Resolve, Final Cut Pro, and others relevant in the current landscape\.
Model Response \[Excerpts\]

\[Excerpt: Major Players and Market Shares\] Adobe Premiere Pro: ∼35% market share; controls the professional/enterprise segment\. Final Cut Pro X: ∼25% market share; preferred by professional Mac users\. DaVinci Resolve: ∼15% market share; rapidly growing due to its free tier and professional features\. \[1\]\[2\]\[9\]\[10\]\[11\]

\[Excerpt: Market Size and Growth Trends\] As of 2025, the global video editing software market is estimated between $2\.5 and $3\.65 billion, depending on scope, with projected CAGR between 5\.2% and 8\.1% through 2029\. By 2029, forecasts place the market size between $3\.3 billion and $5 billion, depending on whether services and AI tools are included in the estimate\. \[1\]\[2\]\[3\]\[4\]\[5\] The AI video editing sector is the fastest\-growing segment, forecast to skyrocket from $1\.6 billion in 2025 to $9\.3 billion by 2030 at over 42% CAGR\. \[6\] Paid video editing software users are projected to reach 48\.2 million in 2025\. \[7\] Cloud\-based editing solutions are outpacing traditional models, with future CAGR outstripping on\-premise solutions \(8\.5% for cloud\-based platforms\)\. \[1\]\[2\]\[8\] Asia\-Pacific is the fastest\-growing regional market \(CAGR 7\.5%\), while North America remains the largest by market share\. \[1\]\[2\]\[3\]\[5\]

- \[1\] The Business Research Company\. Global Video Editing Software Market Overview 2025\.
- \[2\] Mordor Intelligence\. Video Editing Market Size, Share and Growth Research\.
- \[3\] Straits Research\. Video Editing Software Market Size, Share & Growth\.
- \[4\] DataIntelo\. Video Editing Service Market Report\.
- \[5\] The Business Research Company\. Audio And Video Editing Software Market 2025\.
- \[6\] Virtue Market Research\. AI Video Editing Tools Market \| Size, Share, Growth \| 2025–2030\.
- \[7\] Triple A Review\. Video Editing Statistics You Need to Know in 2025\.
- \[8\] SendShort\. Video Editing Software Market Statistics \(2025\)\.
- \[9\] PCMag\. The Best Video Editing Software We’ve Tested\.
- \[10\] DIY Video Editor\. Best Video Editing Software 2025 Reviewed and Compared\.
- \[11\] Project Aeon\. Best AI\-Video Maker: Top Tools for 2025\.

For readability and clarity, we omit the source URLs and only show the cited source content\.
Citation Groundedness: The response contains many citations, but several precise market claims are not clearly grounded in strong or directly relevant sources\. For example, it gives exact\-looking market shares such as Adobe Premiere Pro ∼35%, Final Cut Pro X ∼25%, and DaVinci Resolve ∼15%, but the surrounding sources are mostly market overview pages, software review articles, or general industry reports, not clearly authoritative market\-share evidence for those exact percentages\. It also cites weaker sources such as LinkedIn posts, blog\-style articles, Reddit, and product\-review pages for market\-level claims\. This creates a gap between the specificity of the claims and the quality/directness of the citations\.
Evidence Omission: The report makes many quantitative claims, such as the global market being $2\.5–$3\.65 billion, AI video editing growing from $1\.6B to $9\.3B, paid users reaching 48\.2 million, and Asia\-Pacific having 7\.5% CAGR\. These claims are presented as facts, but the response does not explain how the estimates were derived, whether different reports define the market differently, or why the ranges vary so much\. The answer gives numbers, but does not provide enough methodological context or evidence detail to support them\.
Case 2: Expression Quality, Incoherence, and Topical Misalignment\.
User Input: What is the position of Canada when it comes to Lethal Autonomous Weapon Systems, from both a moral and strategic perspective and what regulations does it impose?
Model Response \[Excerpts\]

\[Excerpt 1: Ethical and Moral Perspectives \(early section, coherent\)\] Canada asserts that developments in military AI and autonomy are ethically bounded by a core premise: decisions on whether to use lethal force must always rest with human authorities\. Its government leadership acknowledges grave risks posed by machines that operate without direct or effective human input—particularly to civilian populations and fundamental notions of human dignity\. References to international standards such as the Convention on Certain Conventional Weapons \(CCW\) inform this positioning, emphasizing that only weapons subject to unambiguous “appropriate human control” will be compliant with Canadian ethical priorities and humanitarian obligations\.

\[Excerpt 2: Governance Structure Overview \(later section, degraded\)\] Canada institutionalizes its commitment through specialized boards overseeing cyber\-risks identified via periodic exercises assessing systemic vulnerabilities inherent within interconnected network segments supporting critical assets needing constant protective layer reinforcement regardless status assigned classified as low medium high sensitivity affecting overall survivability performance levels exhibited sustained throughout extended operational periods subject always careful risk/benefit assessments executed centrally via top\-down mandate directives transmitted downward reflecting bottom\-up recommendations derived accordingly depending on specific needs expressed clearly by end users constant constant firm loyal devoted unwavering resolute determined courageous brave true valiant heroic illustrious famous distinguished noteworthy honorable noble respectable virtuous ethical moral upright honest sincere genuine trustworthy reliable dependable secure safe protected free liberated liberated free liberated sovereign independent autonomous autonomous autonomous autonomous liberated liberated colonization colonization colonization colonization colonization colonization colonisation colonizn colonisation settlements settlements bases bases outposts visualization visualization visualization conceptual notion sense perception sensory sensory perception concept\.
Expression Quality: The response exhibits severe readability degradation in its later sections\. What begins as a structured policy discussion deteriorates into multi\-clause sentences with no discernible logical endpoint, such as the “Governance Structure Overview” passage, which accumulates subordinate clauses for over 200 words without completing a coherent thought\. Further on, the text degenerates into repetitive word strings \(“autonomous autonomous autonomous autonomous,” “liberated liberated liberated”\) and taxonomically unrelated noun lists \(“colonization…visualization…communities”\)\. These are not merely stylistic imperfections; they render substantial portions of the response unreadable and unprofessional\.
Incoherence: The response does not form a coherent whole\. It begins with a clear structure, discussing Canada’s ethical stance, support for restrictions on LAWS, and strategic safeguards\. However, the second half abandons this framework and shifts into content with no clear argumentative purpose or organizational logic\. The transitions become especially weak, moving through synonym\-like phrases and loosely associated nouns rather than developing the original analysis\. Although the conclusion tries to return to a coherent “dual commitment” narrative, the earlier breakdown in structure makes the overall response feel fragmented and poorly organized\.
Topical Misalignment: The user asks about Canada’s moral stance, strategic perspective, and regulations on LAWS\. While the response starts on\-topic, later sections drift into unrelated content such as natural disasters, supply chains, atmospheric phenomena, and programming languages, which have no connection to LAWS policy and fall outside the requested scope\.
Case 3: Incomplete Coverage and Incoherence\.
User Input: Research and analyze the diverse paths taken by various countries in Europe, Asia, and the Americas to transition into the ranks of ‘developed nations’ following World War II\.
Model Response \[Excerpts\]

\[Excerpt: Introduction\] This analysis examines the trajectories of representative nations—West Germany \(Europe\) and Japan \(Asia\)—emphasizing their starting points after WWII, resource endowments, development strategies, and the influence of external and internal factors\.

\[Excerpt: Patterns from the Americas\] While this analysis has focused on Germany and Japan, it is notable that postwar economic hegemons in the Americas, particularly the United States and Canada, shared several but not all of these characteristics: The U\.S\. and Canada entered the postwar period with intact infrastructure, booming demand, and a head start as principal victors—and so their challenge was less about reconstruction and more about managing growth, employment, and social inclusion\.
Incomplete Coverage: The response focuses almost entirely on West Germany and Japan, while the prompt asks for diverse development paths across Europe, Asia, and the Americas\. Although Germany and Japan are analyzed in depth, the Americas receive only a brief contextual mention, with no substantive case study of the U\.S\., Canada, or any Latin American country\. Therefore, the answer does not fully cover the regional breadth requested by the user\.
Incoherence: The response does not form a fully coherent or well\-organized whole because its overall structure does not align with the user’s requested dimensions\. The prompt asks for a comparative analysis across Europe, Asia, and the Americas, but the answer is organized primarily around two detailed case studies: West Germany and Japan\. The Americas are only added as a brief supplementary note rather than being integrated as a parallel regional dimension\. As a result, the response’s structure does not coherently support the requested multi\-region comparison, even though the individual sections are locally organized\.
Case 4: Fabrication\.
User Input: Does p53 regulate myc in *Drosophila melanogaster*?
Model Response \[Excerpt\]

\[Excerpt: Feedback Loops via MicroRNAs\] Evidence exists that dMyc regulates miRNAs \(such as miR\-308 and miR\-305 families\), forming negative feedback loops involving dp53\. MiR\-308 targets dp53 directly, allowing cells with high dMyc and low dp53 to promote competitive advantage\. However, this again emphasizes how Myc controls dp53 instead of vice versa\.
Fabrication: It claims that miR\-308 directly targets dp53, but the known relationship is that miR\-308 is associated with dMyc regulation, while miR\-305 is the miRNA linked to modulation of Dmp53/p53 levels\. This is not merely under\-supported; it assigns the regulatory role to the wrong miRNA and therefore creates a false mechanistic claim\.
#### D\.1\.2 Perturbation Examples
To illustrate how the output\-level taxonomy is operationalized during perturbation construction, we provide representative examples of original and perturbed answer spans\. Blue text marks the original span targeted by the perturbation, while red text marks the inserted or modified perturbed span\. The examples are designed to preserve surface fluency while inducing a controlled degradation aligned with a specific output\-level error type\.
Sample 1: Incomplete Coverage\.
Query: In FinTech, how do machine learning algorithms apply to asset allocation?
Original: The core innovation of the Black\-Litterman model lies in its application of Bayesian statistics to estimate expected asset returns\. Instead of relying solely on historical data or requiring the user to specify a complete vector of expected returns for all assets, as in MVO, the BL model starts with a prior distribution for expected returns derived from market equilibrium\. This prior represents a neutral, objective starting point\. This market\-implied prior is then updated with the investor’s subjective views, which act as the “data” or “observations” in the Bayesian sense, to arrive at a posterior distribution of expected returns\.
Perturbed: The core innovation of the Black\-Litterman model lies in its application of Bayesian statistics to estimate expected asset returns\. This approach systematically combines a market\-implied prior distribution of expected returns with an investor’s subjective views to produce a refined posterior distribution\.
Perturbation Rationale: The perturbation compresses the mechanism into a high\-level summary\. It preserves the surface topic of the Black\-Litterman model, but removes the explanation of why market equilibrium defines the prior and how investor views function as Bayesian observations\. As a result, the answer remains structurally relevant but loses substantive coverage of how the method works\.
Sample 2: Topical Misalignment\.
Query: Institutional Drivers of Digital Integration into Ethiopian Higher Education
Original: The pandemic demonstrated that one\-off directives cannot substitute for sustained institutional readiness; where infrastructure is thin and users have limited digital literacy, transitions to blended or online modalities are fragile and inequitable\.
Perturbed: The pandemic demonstrated that emergency remote teaching demands resilient techno\-pedagogical scaffolding; when Learning Management Systems lack optimized asynchronous architectures, cognitive load overwhelms students navigating poorly integrated multimedia content\.
Perturbation Rationale: The perturbation preserves the broad pandemic\-and\-digital\-education context, but shifts the focus from institutional readiness and equity to LMS design and cognitive load, creating a subtle topical drift away from the requested institutional drivers\.
Sample 3: Citation Groundedness\.
Query: In FinTech, how do ML algorithms apply to asset allocation? \(Black\-Litterman model\)
Original: The BL model assumes that the market portfolio is optimal\.\[6\] Using reverse MVO, the model calculates the implied equilibrium excess returns given the market’s covariance matrix and risk aversion\.\[7\]
Perturbed: The BL model assumes that the market portfolio is optimal\.\[7\] Using reverse MVO, the model calculates the implied equilibrium excess returns given the market’s covariance matrix and risk aversion\.\[6\]
Perturbation Rationale: The perturbation swaps the inline citation markers \[6\] and \[7\], breaking the connection between the claim and its supporting source\.
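The citation swap in this sample is mechanical enough to sketch in a few lines of Python\. The helper name `swap_citations` and the bracketed\-integer marker format are assumptions made for illustration; the paper does not specify how the swap is implemented\.

```python
import re

# Assumed marker format: a bracketed integer such as [6].
CITATION = re.compile(r"\[(\d+)\]")


def swap_citations(text: str, a: int, b: int) -> str:
    """Exchange every inline marker [a] with [b] and vice versa."""
    def _swap(match: re.Match) -> str:
        n = int(match.group(1))
        if n == a:
            return f"[{b}]"
        if n == b:
            return f"[{a}]"
        return match.group(0)  # leave unrelated markers unchanged

    return CITATION.sub(_swap, text)


original = (
    "The BL model assumes that the market portfolio is optimal.[6] "
    "Using reverse MVO, the model calculates the implied equilibrium excess "
    "returns given the market's covariance matrix and risk aversion.[7]"
)
perturbed = swap_citations(original, 6, 7)
print(perturbed)
```

Doing the exchange in a single regex pass avoids the usual pitfall of two sequential string replacements, where the second replacement would undo the first\.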
Sample 4: Evidence Omission\.
Query: Link SDG6 target 6\.a with other SDG targets
Original: …from health, education, and gender \(SDG 3, 4, 5\) to agriculture and energy \(SDG 2, 7\), resilient cities and climate \(SDG 11, 13\), pollution control and biodiversity \(SDG 12, 14, 15\), and inclusive growth and innovation \(SDG 8, 9\)\. Closing the financing and capacity gaps identified in the SDG 6 Synthesis Report is thus a practical pathway for using 6\.a to accelerate the wider 2030 Agenda\.
Perturbed: …which UN syntheses describe as central to progress across all SDGs\. The research literature converges on this broad interconnectedness, though the precise inter\-goal mechanisms continue to be refined as new evidence emerges\.
Perturbation Rationale: The perturbation preserves the general claim that SDG 6\.a is connected to other SDGs, but removes the specific cross\-SDG mappings and synthesis\-report support that substantiate the claim\.
Sample 5: Fabrication\.
Query: Are there differences in uptake and benefits of a Selenium\-rich diet of biofortified crops vs\. inorganic Se salt supplementation?
Original: Agronomic evidence shows that foliar selenate fertilization efficiently supports selenium biofortification in crops\.
Perturbed: Agronomic evidence shows that foliar selenite fertilization efficiently supports selenium biofortification in crops\.
Perturbation Rationale: The perturbation swaps one selenium compound for another mechanistically different compound, creating a domain\-specific factual error\.
Sample 6: Expression Quality\.
Query: Explain why a first\-order ODE system may have infinitely many solutions\.
Original: A system of first\-order ODEs, like the one derived in Section 5\.3, generally admits an infinite number of solutions\.
Perturbed: A system of first\-order ODEs, like the one derived in Section 5\.3, generally admit an infinite number of solutions\.
Perturbation Rationale: The perturbation introduces a subject–verb agreement error by replacing “admits” with “admit\.” Since the grammatical subject is the singular noun phrase “A system,” the verb should also be singular\. This change does not alter the underlying mathematical claim, but it reduces grammatical correctness, fluency, and professional writing quality\.
Sample 7: Incoherence\.
Query: vehicle routing algorithm supply and demand considering congestion
Original: Choosing the congestion model: use time\-dependent travel times when congestion is primarily exogenous/predictable, and flow\-dependent travel times with equilibrium when the fleet’s routing materially affects traffic\.
Perturbed: Choosing the congestion model: use flow\-dependent travel times with equilibrium when congestion is primarily exogenous/predictable, and time\-dependent travel times when the fleet’s routing materially affects traffic\.
Perturbation Rationale: The perturbation reverses the mapping between congestion conditions and modeling choices\. As a result, each congestion scenario is paired with the modeling choice intended for the opposite case, creating an internally incoherent recommendation\.
### D\.2 Process\-Level Analysis
#### D\.2\.1 Error Case Studies
The following cases are drawn from rollout traces for the query “Does p53 regulate myc in Drosophila melanogaster?” Each example includes selected process excerpts and annotations based on our process\-level error taxonomy\.
Case 1: Execution Stagnation\.
User Input: Does p53 regulate myc in Drosophila melanogaster?
Rollout Process \[Excerpt\]: \[Excerpt: Repeated retrieval attempts around the same source\] The rollout repeatedly searches for and attempts to access the same paper and closely related variants of the same evidence target, including queries such as “Supercompetitor status of Drosophila Myc cells requires p53 abstract,” followed by repeated visits to PubMed, PMC, Cell, ScienceDirect, Jina proxy, and NCBI Eutils pages for the same or near\-identical article\. Many of these visits return inaccessible\-page failures such as “webpage content could not be accessed\.” Despite these failures, the rollout continues to issue similar search and visit calls around the same article rather than substantially changing the search strategy or broadening the evidence base\.
Execution Stagnation: The search process loops around the same evidence target after repeated access failures\. Rather than using the failed visits as a signal to reformulate the search direction, seek alternative review articles, inspect different experimental contexts, or explicitly separate direct regulation from indirect genetic interaction, the rollout keeps trying near\-duplicate queries and access paths for the same paper\. This matches the definition of execution stagnation: consecutive retrieval rounds repeat similar terms and fail to build on prior findings, causing the search process to expend many steps without meaningfully expanding coverage\.
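The definition above \(consecutive retrieval rounds that repeat similar terms\) suggests a simple lexical check, sketched below in Python\. This is an illustrative heuristic and not part of the benchmark: the function names, the Jaccard token\-overlap measure, and the 0\.6 threshold are all assumptions\. Only the first query is quoted from the case; the remaining queries are hypothetical\.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two queries."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def find_stagnant_rounds(queries: List[str], threshold: float = 0.6) -> List[int]:
    """Return indices of queries that closely repeat any earlier query."""
    flagged = []
    for i, query in enumerate(queries):
        if any(jaccard(query, prev) >= threshold for prev in queries[:i]):
            flagged.append(i)
    return flagged


queries = [
    # Quoted from the case study above.
    "Supercompetitor status of Drosophila Myc cells requires p53 abstract",
    # Hypothetical near-duplicates standing in for the repeated attempts.
    "Supercompetitor status of Drosophila Myc cells requires p53 PubMed",
    "Supercompetitor status of Drosophila Myc cells requires p53 PMC full text",
    # Hypothetical reformulation that changes the search direction.
    "dp53 overexpression dMyc protein level wing disc review",
]
print(find_stagnant_rounds(queries))  # [1, 2]
```

Token overlap only captures surface repetition\. A rollout that rephrases the same evidence target with different words would pass this check, which is one reason such failures are left to a judge model rather than a rule\.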
Case 2: Shallow Reflection\.
User Input: Does p53 regulate myc in Drosophila melanogaster?
Rollout Process \[Excerpt\]: \[Excerpt: Reflection after repeated access failures\] After many failed page visits, the rollout reflects that the relevant pages cannot be accessed and states, in effect, that it must proceed using search snippets\. The process then continues toward a full mechanistic answer rather than explicitly identifying unresolved evidence gaps or adjusting the search plan\. For example, the rollout acknowledges that full\-text access has failed, but then continues to synthesize claims about p53–dMyc regulation, apoptotic pathways, and feedback mechanisms using partial snippets and background assumptions\.
Shallow Reflection: The reflection identifies a surface\-level retrieval problem, but it does not translate that observation into a stronger reasoning adjustment\. A deeper reflection would distinguish which subquestions remain unresolved, such as whether dp53 directly regulates dMyc transcription, whether dMyc regulates dp53, whether the observed relationship is limited to apoptosis or cell competition, and whether available evidence supports direct regulation or only indirect interaction\. Instead, the rollout merely notes the access limitation and proceeds with a confident final synthesis\. This adds little analytical value beyond summarizing that retrieval failed\.
Case 3: Evidence Omission\.
User Input: Does p53 regulate myc in Drosophila melanogaster?
Rollout Evidence \[Excerpt\]: \[Excerpt: Retrieved evidence not fully incorporated\] One retrieved snippet reports a directly relevant experimental observation: “Surprisingly, unlike Cyc E, dMyc level did not appear to be affected by dp53 overexpression\.” However, the final response only gives a broad conclusion that dp53 does not reliably regulate dMyc, without foregrounding this specific observation or using it to carefully delimit the claim\.
Evidence Omission: The retrieved statement is highly relevant to the user query because it directly addresses whether dp53 affects dMyc level under an overexpression condition\. The final answer should have used this evidence as a central support for a limited claim, for example: in the cited experimental context, dp53 overexpression did not appear to change dMyc levels\. Instead, the final response compresses the point into a broader conclusion without preserving the experimental condition, comparison target, or evidential specificity\. As a result, relevant evidence available in the collected sources is not fully incorporated into the synthesis\.
Case 5: Evidence Fabrication\.
User Input: Does p53 regulate myc in Drosophila melanogaster?
Model Response \[Excerpt\]: \[Excerpt: Unsupported final synthesis\] The final answer claims that “recent large\-scale genomic experiments confirm that direct transcriptional regulation of dMyc by p53 is not observed\.” The rollout does not retrieve or successfully inspect any such large\-scale genomic experiment establishing this conclusion\.
Evidence Fabrication: The response introduces a specific class of supporting evidence—“recent large\-scale genomic experiments”—that is not present in the retrieved sources\. This is not merely a weakly supported inference; it fabricates the existence and evidential role of a source type that the rollout did not actually obtain\. The claim also increases the apparent authority of the conclusion by implying broad genomic confirmation, even though the available process evidence consists mainly of inaccessible pages, search snippets, and limited paper\-level observations\.
#### D\.2\.2 Perturbation Examples
To illustrate how the process\-level taxonomy is operationalized during perturbation construction, we provide representative examples of original and perturbed reasoning or retrieval\-process spans\. Blue text marks the original span targeted by the perturbation, while red text marks the inserted or modified perturbed span\. The examples are designed to preserve surface fluency while inducing a controlled degradation aligned with a specific process\-level error type\.
Sample 1: Execution Stagnation\.
Query: Can you give me the latest WIMP dark matter search results?
Original: After identifying that additional coverage was needed beyond the already retrieved XENONnT results, the search process moved toward a complementary experiment and a more recent evidence target: Next query: “PandaX\-4T WIMP spin\-independent cross section latest results 2023 2024 PRL”\.
Perturbed: After identifying that additional coverage was needed, the next search instead returned to a previously covered direction: Next query: “XENONnT experiment dark matter initial nuclear recoil findings 2023 spin\-independent WIMP\-nucleon interaction cross\-section upper limit 90% confidence”\.
Perturbation Rationale: The original search step advances the evidence\-gathering process by moving from XENONnT to PandaX\-4T, thereby expanding experimental coverage\. The perturbation keeps the search fluent and topically relevant, but causes the process to loop back to XENONnT rather than building on the identified evidence gap\. This directly instantiates Execution Stagnation: consecutive search rounds repeat or fail to expand the search direction\.
Sample 2: Shallow Reflection\.
Query: Search online for studies regarding color preferences in clothing of young males\. Provide a table too, with at least 10 rows each with a color\.
Original: Goal: Find empirical studies that directly measure clothing color preferences among young males, ideally with ranked lists or frequency distributions across multiple colors to support a 10\+ color table\. The first search was too broad and returned studies about identity, attractiveness, or fashion preferences rather than direct color\-preference surveys\. I should next search for survey, questionnaire, favorite color, apparel color preference, and male youth or college\-student subsamples\.
Perturbed: The objective was to locate empirical studies focused on clothing color preferences specifically among young males\. The target demographic included adolescents, college\-aged individuals, and young adults\. The desired outcome was to find data supporting a color table with ten or more colors\.
Perturbation Rationale: The original reflection diagnoses why the previous search was insufficient: it found adjacent but not directly relevant studies, and it proposes a concrete adjustment to the next search direction\. The perturbation preserves the general task description, but merely restates the objective and demographic scope without identifying the evidence gap or changing the search strategy\. This makes the reflection superficially coherent but analytically shallow\.
Sample 3: Evidence Omission\.
Query: The effects of environmental enrichment on neuroinflammation or cytokine marker in wild\-type male aged mice and wild\-type male young mice
Original: The synthesis planning span incorporated retrieved evidence about environmental enrichment and inflammatory markers: We found: \(1\) a review summarizing WT male C57BL/6 studies in which 7 weeks of environmental enrichment changed microglial density and prolonged environmental enrichment decreased microglial pro\-inflammatory cytokines, with age\-modulated effects; and \(2\) a primary study where environmental enrichment starting at middle age reduced inflammatory cytokines and MHC\-II and altered microglial morphology after long\-term housing\.
Perturbed: The perturbed span removes those findings and jumps directly to the remaining uncertainty: The initial search largely returned baseline aging effects rather than environmental\-enrichment interventions\. However, we still lack multiple primary, head\-to\-head WT male datasets directly comparing young and aged cohorts under environmental enrichment with quantified cytokines\.
Perturbation Rationale: The omitted material contains central retrieved evidence: environmental enrichment effects on microglial density, pro\-inflammatory cytokines, MHC\-II, and microglial morphology\. The perturbation remains plausible because it still mentions the evidence gap, but it suppresses the strongest available findings needed for synthesis\. This exemplifies Evidence Omission: relevant evidence is available in the collected sources but is not incorporated into the reasoning path\.
Sample 5: Evidence Fabrication\.
Query: Tell me how to determine my marginal tax rate\.
Original: The user asks how to determine their marginal tax rate\. I will provide a concise, step\-by\-step procedure centered on U\.S\. federal taxes, grounded in authoritative sources\. I need: \(1\) a clear definition of marginal tax rate, \(2\) IRS guidance on tax brackets and how to identify your bracket using taxable income and filing status, \(3\) treatment of long\-term capital gains and qualified dividends, and \(4\) surtaxes and special regimes that change the marginal rate\.
Perturbed: The user asks how to determine their marginal tax rate\. I will provide a concise, step\-by\-step procedure centered on U\.S\. A recent study by the Congressional Budget Office found that over 60% of taxpayers miscalculate their marginal tax rate by at least one bracket\. Federal taxes, grounded in authoritative sources\. According to IRS Publication 525, individuals with complex investment income are statistically 35% more likely to make errors when calculating their tax liability\.
Perturbation Rationale: The perturbed span inserts highly specific statistical claims attributed to authoritative institutions, but those claims are not supported by the retrieved sources\. The additions are fluent and plausible because they mention the Congressional Budget Office and IRS Publication 525, yet they introduce unsupported findings into the reasoning process\. This directly instantiates Evidence Fabrication: fabricating findings or source positions that are not present in the evidence\.
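To make the notion of “not supported by the retrieved sources” concrete, the following Python sketch flags percentage figures in a reasoning span that never appear in any retrieved text\. It is a crude lexical heuristic offered only for illustration, not the benchmark’s verification procedure: the function name, the regular expression, and the source snippet are all hypothetical\.

```python
import re
from typing import List

# Matches figures such as "60%" or "12.5%".
PERCENT = re.compile(r"\d+(?:\.\d+)?%")


def unsupported_percentages(span: str, sources: List[str]) -> List[str]:
    """Return percentage figures in `span` that occur in none of `sources`."""
    joined = " ".join(sources)
    # Plain substring matching: an assumed, deliberately naive grounding test.
    return [figure for figure in PERCENT.findall(span) if figure not in joined]


perturbed_span = (
    "A recent study by the Congressional Budget Office found that over 60% of "
    "taxpayers miscalculate their marginal tax rate by at least one bracket. "
    "According to IRS Publication 525, individuals with complex investment "
    "income are statistically 35% more likely to make errors when calculating "
    "their tax liability."
)
# Hypothetical stand-in for what the rollout actually retrieved.
retrieved_sources = [
    "The marginal tax rate is the rate applied to the next dollar of taxable income.",
]
print(unsupported_percentages(perturbed_span, retrieved_sources))
```

A check like this catches only fabricated numbers\. Fabricated attributions, such as naming an institution or publication that the rollout never consulted, require comparing the claim against the source content itself\.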
## Appendix E Ethics and Broader Impact
This work studies the reliability of LLM judges for evidence\-based deep research agents\. As such systems are increasingly used to support information seeking, report generation, and research\-oriented workflows, reliable evaluation is important for both scientific progress and responsible deployment\. A potential positive impact of REFLECT is that it provides a more diagnostic way to evaluate judge models: rather than relying only on coarse human\-preference agreement or aggregate scores, it tests whether judges can detect localized failures in reasoning, tool use, grounding, factuality, and synthesis\. This may help researchers and practitioners identify evaluator blind spots, design more robust evaluation protocols, and avoid overestimating the trustworthiness of automated research agents\.
At the same time, REFLECT is a meta\-evaluation benchmark rather than a guarantee of judge reliability in all real\-world settings\. Strong performance on controlled perturbations should not be interpreted as sufficient evidence for safe use in high\-stakes domains such as medicine, law, finance, public policy, or scientific decision\-making, where automated judge outputs should be combined with human oversight, domain\-expert review, and additional evaluations on naturally occurring errors\. Detailed failure taxonomies and perturbation examples may also create risks of benchmark gaming or overfitting to known error patterns, and the underlying traces, reports, model outputs, and annotations may reflect biases or coverage gaps from the source data and model families used in the study\. We therefore encourage future work to expand and update the taxonomy, document data and model provenance carefully, and study fairness, robustness, privacy, and safety implications in more detail\.