Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

arXiv cs.AI Papers

Summary

This paper presents QuestBench, a benchmark built by students to evaluate deep research systems across humanities and social science domains. Results show that even advanced systems like GPT-5.5 pass only 57.58% of questions, highlighting failures in trustworthiness.

arXiv:2605.21413v2 Announce Type: new Abstract: As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review one another's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%, and the best-performing system, GPT-5.5, reaches a 57.58% pass rate. The failures are educationally useful because they show how fluent, source-backed answers can still miss the right query, source, term, or evidence standard. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work. The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.

Cached at: 05/22/26, 08:50 AM

# QuestBench as a Course-Based Practice for Accountable Knowledge Work
Source: [https://arxiv.org/html/2605.21413](https://arxiv.org/html/2605.21413)
## Teaching AI Through Benchmark Construction: QuestBench as a Course\-Based Practice for Accountable Knowledge Work

Haiyang Shen1,∗,†, Jiuzheng Wang1,∗,†, Taian Guo1, Mugeng Liu1, Wenchun Jing1, Chongyang Pan1, Siqi Zhong1, Zhiyang Chen1, Weichen Bi1, Yudong Han1, Xiaoying Bai2, Yun Ma1,†

1Peking University, 2Advanced Institute of Big Data

hyshen@stu\.pku\.edu\.cn, jiuzhengwang@stu\.pku\.edu\.cn, mayun@pku\.edu\.cn

∗Equal contribution, †Corresponding authors

###### Abstract

As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently\. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine\-produced knowledge\. To this end, we introduce a course\-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI\-era knowledge work\. Students turn disciplinary knowledge into verifiable expert\-level questions, review one another’s designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks\. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require\. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social\-science domains\. Evaluation on QuestBench shows that student\-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question\-level pass rate is only 16\.85%, and the best\-performing system, GPT\-5\.5, reaches a 57\.58% pass rate\. The failures are educationally useful because they show how fluent, source\-backed answers can still miss the right query, source, term, or evidence standard\. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs\. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work\. The dataset is available at [https://huggingface\.co/datasets/PKUAIWeb/QuestBench/tree/main](https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main)\.

## 1Introduction

AI is becoming part of how students do knowledge work\. Current systems can search, read, write, code, call tools, and help complete tasks that once required many separate human steps\. This shift does not remove students from the work\. It changes what they must be able to do within it\. Much AI instruction begins with tool use: how to prompt a system, obtain useful output, and work more efficiently\. That first step is necessary, but it is incomplete\. If students only learn to receive AI outputs, they may not learn how to define the task, inspect the process, check the evidence, or decide whether the result can be trusted\. A central challenge for AI education is therefore to teach students how to remain accountable when AI participates in producing knowledge\.

We propose benchmark construction as a course\-based way to teach this responsibility\. Building a benchmark does not ask students to stand outside AI work as passive evaluators\. It asks them to build the conditions under which AI work can be judged\. Students must decide what is worth asking, what counts as a valid answer, which sources can support it, which shortcuts would make the task meaningless, and which field\-specific distinctions the grading criteria must preserve\. These decisions are part of responsible AI\-mediated work in professional settings\. Benchmark construction makes them explicit enough to practice in a classroom\.

This paper uses deep research systems as the teaching case\. They are not the endpoint of the argument; they are a concrete example of AI\-mediated knowledge work\. Such systems search the web, visit documents, synthesize evidence, and return answers that often appear well grounded\. They are useful enough for students to take seriously and unreliable enough to teach from\. Their fluent and source\-backed outputs can hide failures that only domain\-aware users can recognize, such as a wrong query, a shortcut, a confused source, an outdated term, or partial evidence presented as a complete answer\.

![Refer to caption](https://arxiv.org/html/2605.21413v2/figures/questbench_framework_image2.png)

Figure 1: Conceptual framework of QuestBench as course\-based benchmark construction for teaching accountable AI\-mediated knowledge work\. Students first encounter deep research systems as a practical tool, then use benchmark construction to design expert\-level questions, test shortcuts, validate answers, evaluate models, and analyze failures\. The course links tool exposure with question design, disciplinary standards, and responsibility for judging AI\-produced work\.

This approach gives disciplinary knowledge a specific role in AI education\. A narrow view treats such knowledge as content that AI may retrieve or summarize\. In QuestBench, students use it to define the standards under which AI\-produced work can be accepted\. Law students know why statutory wording and versioning matter\. History and international\-relations students know why provenance and document identity matter\. Language and literature students know why translation, editions, and phrasing matter\. These details are operational, but they also carry the standards through which information becomes trustworthy knowledge\.

The educational aim is to help students remain responsible knowledge actors when AI participates in the work\. AI systems can search and synthesize, but they do not decide which questions deserve attention, what evidence a field should accept, or when an answer is good enough to rely on\. As AI becomes more capable, education has to teach both use and accountability\. The goal is not to replace making with judging\. Rather, students need to learn how question design, evidence selection, model use, and answer acceptance form one chain of responsibility\. Benchmark construction makes that chain concrete by asking students to define, test, and defend the standards by which AI outputs will be judged\.

We instantiate this approach in QuestBench, a course\-based benchmark construction project for teaching accountable use of deep research systems\. Students from humanities and social\-science disciplines at Peking University designed, reviewed, and validated challenging information\-seeking questions from their own fields\. The final benchmark contains 256 curated questions spanning 14 normalized domains, including law, history, international relations, literature, social sciences, arts, and foreign languages\. Each question is publicly answerable but difficult for current systems because success requires domain\-aware query formulation, specialized source navigation, evidential judgment, and precise answer extraction\. The result is both a benchmark and a record of a course practice: it evaluates deep research systems while showing how students can learn to define and inspect AI\-mediated knowledge work\.

Evaluation on QuestBench shows that student\-designed tasks reveal clear limitations in current deep research systems\. We evaluate Kimi K2\.5 \[[16](https://arxiv.org/html/2605.21413#bib.bib9)\] and DeepSeek\-V3\.2 \[[7](https://arxiv.org/html/2605.21413#bib.bib12)\], alongside Seed\-2 Pro \[[4](https://arxiv.org/html/2605.21413#bib.bib10)\] and Seed\-1\.8 Pro \[[3](https://arxiv.org/html/2605.21413#bib.bib11)\], together with nine frontier deep\-search systems: GPT\-5\.5 \[[24](https://arxiv.org/html/2605.21413#bib.bib48)\], Claude Opus 4\.7 \[[2](https://arxiv.org/html/2605.21413#bib.bib49)\], Gemini 3\.1 Pro \[[10](https://arxiv.org/html/2605.21413#bib.bib50)\], GLM 5\.1 \[[35](https://arxiv.org/html/2605.21413#bib.bib51)\], DeepSeek\-V4 Pro \[[6](https://arxiv.org/html/2605.21413#bib.bib52)\], Kimi K2\.6 \[[23](https://arxiv.org/html/2605.21413#bib.bib53)\], MiMo\-V2\.5 Pro \[[32](https://arxiv.org/html/2605.21413#bib.bib54)\], Qwen 3\.6 Plus \[[1](https://arxiv.org/html/2605.21413#bib.bib55)\], and MiniMax M2\.7 \[[22](https://arxiv.org/html/2605.21413#bib.bib56)\]\. Mean scores range from 14\.58 to 67\.12 out of 100, and pass rates range from 7\.81% to 57\.58%\. Failure analysis shows that the main bottleneck is not information availability alone, but the interaction of query formulation, source navigation, and answer extraction under discipline\-specific standards\. For the course, these failures are more than model errors\. They are cases through which students can inspect where the AI\-mediated work broke down and what human responsibility remains\.

The main contributions are:

- We introduce benchmark construction as a course\-based approach to teaching accountable AI\-mediated knowledge work\. The activity begins with direct exposure to an AI\-era productivity tool, then asks students to design verifiable tasks, defend evidence, and inspect model failures\.
- We present QuestBench, a benchmark and course artifact containing 256 expert\-level deep research questions across 14 humanities and social\-science domains\. The construction process combines student disciplinary expertise, adversarial peer review, anti\-shortcut validation, and multi\-round quality control\.
- We evaluate thirteen state\-of\-the\-art deep search systems on QuestBench and identify recurring failure patterns including retrieval failure, unsupported inference, entity confusion, and answer\-extraction errors\. These results show that student\-designed tasks can make hidden AI failures observable\.
- We use reflections from five student contributors to discuss how students understand tools, define problems, judge evidence, and locate their responsibility in AI\-mediated knowledge work\. We treat this as part of a continuing inquiry into AI’s educational impact\.

## 2Background and Positioning

This paper treats benchmark construction as both an evaluation practice and a teaching practice for accountable AI\-mediated knowledge work\. We therefore organize related work around three questions: what kind of AI work students need to learn to inspect, why such work requires expert evaluation, and how benchmark construction can become part of AI education\.

### 2\.1AI Education Beyond Tool Use

AI education often begins with use\. Students learn how to prompt systems, obtain useful outputs, and apply AI to reading, writing, programming, or research tasks\. This first contact matters because students need to work with tools that are entering knowledge work\. Tool use alone, however, does not teach students how to define a task, inspect the process that produced an answer, check whether the evidence is sufficient, or decide whether the result can be used responsibly\.

QuestBench addresses this gap by treating AI as both a tool and an object of study\. The course begins with a concrete productivity tool, then asks students to define evaluation tasks, specify answer standards, test shortcuts, and analyze failures\. Benchmark construction is therefore the teaching activity rather than only the data collection procedure\. It turns accountability from an abstract norm into a set of operations students can practice\.

### 2\.2Deep Research as a Concrete AI Teaching Case

Deep research systems are a useful case for this educational question because they perform several parts of knowledge work in one workflow\. Earlier benchmarks such as Natural Questions \[[18](https://arxiv.org/html/2605.21413#bib.bib33)\], TriviaQA \[[14](https://arxiv.org/html/2605.21413#bib.bib34)\], and HotpotQA \[[33](https://arxiv.org/html/2605.21413#bib.bib44)\] evaluate retrieval\-grounded and multi\-hop answering\. Search\-augmented evaluations such as FreshLLMs \[[29](https://arxiv.org/html/2605.21413#bib.bib23)\] and GAIA \[[21](https://arxiv.org/html/2605.21413#bib.bib18)\] move toward dynamic tool\-using assistants\. More recent work studies agentic search: BrowseComp \[[31](https://arxiv.org/html/2605.21413#bib.bib5),[36](https://arxiv.org/html/2605.21413#bib.bib8)\] and DeepSearchQA \[[12](https://arxiv.org/html/2605.21413#bib.bib6)\] assess search\-intensive QA, while ResearchArena \[[15](https://arxiv.org/html/2605.21413#bib.bib16)\] and DeepResearch Bench \[[9](https://arxiv.org/html/2605.21413#bib.bib7)\] evaluate research agents\. RAG\-focused work such as FRAMES \[[17](https://arxiv.org/html/2605.21413#bib.bib24)\] and RAGChecker \[[27](https://arxiv.org/html/2605.21413#bib.bib25)\] provides finer\-grained retrieval diagnosis\.

These systems are no longer simple answer engines\. They search, read, synthesize, cite, and decide when a response appears complete\. That makes them powerful productivity tools, and it also makes their failures harder for students to notice\. A source\-backed answer may look like knowledge even when it rests on a wrong query, a confused document, a missing source, or an imprecise term\. QuestBench uses deep research as a representative case for teaching students how to inspect the AI\-mediated work process behind a final answer\.

### 2\.3Expert Evaluation as a Source of Educational Standards

Expert\-level benchmarks show why general tool use is not enough\. GPQA \[[26](https://arxiv.org/html/2605.21413#bib.bib47)\] targets graduate\-level questions validated through an expert\-vs\-non\-expert gap\. Humanity’s Last Exam \[[25](https://arxiv.org/html/2605.21413#bib.bib27)\] curates questions intended to be unsolvable by current AI\. MMLU\-Pro \[[30](https://arxiv.org/html/2605.21413#bib.bib29)\] and SuperGPQA \[[20](https://arxiv.org/html/2605.21413#bib.bib30)\] scale closed\-ended academic difficulty\. PaperBench \[[28](https://arxiv.org/html/2605.21413#bib.bib19)\] tests research artifact replication, while AgentBench \[[19](https://arxiv.org/html/2605.21413#bib.bib20)\] and $\tau$\-bench \[[34](https://arxiv.org/html/2605.21413#bib.bib21)\] broaden tool\-use evaluation\. Domain\-specific benchmarks such as LegalBench \[[11](https://arxiv.org/html/2605.21413#bib.bib28)\], PubMedQA \[[13](https://arxiv.org/html/2605.21413#bib.bib31)\], and FinQA \[[5](https://arxiv.org/html/2605.21413#bib.bib32)\] evaluate specialized reasoning within individual fields, while WebArena \[[37](https://arxiv.org/html/2605.21413#bib.bib37)\] and Mind2Web \[[8](https://arxiv.org/html/2605.21413#bib.bib38)\] study task completion in grounded web environments\.

QuestBench draws from this evaluation tradition, but uses expert\-level difficulty for a course purpose\. Students have to specify what makes a question meaningful, what makes an answer verifiable, what evidence counts, and what distinctions cannot be collapsed\. The same properties that make a question hard for AI systems, including specialized sources, precise terminology, anti\-shortcut structure, and explicit grading criteria, also make it useful for teaching students how to hold AI\-mediated knowledge work accountable\.

### 2\.4Benchmark Construction as AI Pedagogy

Most benchmark work treats construction as a research methodology: experts or annotators create tasks so that models can be evaluated\. QuestBench keeps this research function and adds a course function\. Students build the conditions under which AI outputs can be judged: they define tasks, defend answers, audit shortcuts, and score model outputs under explicit standards\.

This framing differs from tool\-oriented AI literacy\. Prompting and efficient use matter, but they do not by themselves teach students how knowledge becomes trustworthy\. Benchmark construction asks students to specify the conditions under which an AI answer should count as correct\. That is where the educational value sits: problem definition, evidence discipline, peer scrutiny, and responsibility for judgment become part of the assignment\.

Table [1](https://arxiv.org/html/2605.21413#S2.T1) summarizes high\-level benchmark properties\. We include it to show both the evaluation setting and the research substrate that the course adapts for teaching\.

Table 1: Comparison of QuestBench with related benchmarks across key dimensions\. H&SS = Humanities & Social Sciences\.

| Benchmark | Size | Domains | Web Search | Anti\-Shortcut | Authors | Validation |
| --- | --- | --- | --- | --- | --- | --- |
| BrowseComp | 1,266 | General | ✓ | ✓ | Researcher | Internal |
| DeepSearchQA | 900 | 17 fields | ✓ | ✗ | Researcher | Internal |
| GAIA | 466 | General | ✓ | ✗ | Researcher | Internal |
| GPQA | 448 | STEM | ✗ | ✓ | PhD student | Expert gap |
| HLE | 3,000 | Broad | ✗ | ✗ | Expert | Peer review |
| QuestBench | 256 | 14 \(H&SS\) | ✓ | ✓ | Univ\. students | Multi\-round |

The comparison clarifies the dual nature of QuestBench\. As a benchmark, it evaluates open\-web deep research across expert\-level professional domains\. As a course practice, it asks students to learn AI by defining, testing, and judging the tasks that expose where AI\-mediated work can fail\.

## 3Course Design: Teaching AI Through Benchmark Construction

QuestBench is designed as a course practice first and a benchmark artifact second\. The course asks a practical educational question: if students are already learning and working with AI tools, how can they remain responsible for the knowledge work those tools help produce? We make benchmark construction the main learning activity\. Students encounter deep research as a concrete AI productivity tool, transform disciplinary knowledge into verifiable evaluation tasks, review one another’s tasks for ambiguity and shortcuts, and use model failures to discuss where AI\-mediated work can and cannot be trusted\.

### 3\.1Pedagogical Design

The course is organized around a progression from exposure to accountable inspection\. Students first encounter AI as a productivity tool\. Deep research systems provide the concrete case because they search, read, synthesize, and answer in ways that resemble future knowledge work\. Students then study the tool by designing questions that it must answer under explicit standards\. This moves them from asking AI for answers to specifying a valuable, bounded, and verifiable task\. Finally, students analyze failures and ask which part of the work required disciplinary knowledge: the query, the source, the term, the answer boundary, or the decision to keep searching\. The progression keeps the first layer of AI education, direct contact with the tool, while making room for the larger question of how students remain responsible knowledge actors when AI participates in their work\.

Benchmark construction fits this design because it asks students to build the standards by which AI work can be inspected\. Instead of receiving answers from a system, students decide what would count as a rigorous test of the system\. They must externalize tacit disciplinary knowledge: what sources their field trusts, which distinctions matter, what would make an answer incomplete, and where a model might exploit a shortcut\. The course treats this work as a bridge between learning to use AI and learning to remain accountable for AI\-mediated knowledge work\.

### 3\.2Deep Research as the Teaching Case

The course could use many AI tools as teaching objects\. We use deep research because it is a consequential example of AI\-era knowledge work\. These systems do more than generate text; they search, visit documents, synthesize evidence, and present answers that often appear well grounded\. This makes them useful for students and, at the same time, educationally risky\. A fluent answer with citations may encourage students to treat output as knowledge before asking whether the question was well posed, whether the sources were authoritative, or whether the answer satisfies the standards of a field\.

Deep research therefore functions as a case for studying a larger issue: how AI tools should be situated within human standards of inquiry\. When students build questions that expose hidden failures, they also learn why those failures matter\. An incorrect legal version, a confused archival document, a missing source, or an imprecise translation is a model error with professional consequences\. It matters because a discipline has standards for what counts as knowledge, and those standards remain part of the student’s responsibility when AI helps produce an answer\.

### 3\.3Course\-Based Construction Activity

![Refer to caption](https://arxiv.org/html/2605.21413v2/figures/questbench_pipeline_image2.png)

Figure 2: Course and technical pipeline for QuestBench\. Students transform disciplinary knowledge into expert\-level question packages, then filter them through preliminary screening, answer verification, grading\-criteria audit, anti\-shortcut validation, and domain normalization\. The same artifacts are then used for model evaluation, scoring, and failure analysis, turning task design into practice in accountable AI\-mediated knowledge work\.

We collaborate with course instructors at Peking University to recruit undergraduate students across academic disciplines\. QuestBench contains questions from 85 student question authors, covering 37 self\-reported field labels that we normalize into 14 domain groups, including law, history, international relations, literature, foreign languages, social sciences, journalism, and archaeology\. All participants have completed advanced coursework in their respective fields\. This disciplinary spread matters for the course design because each field gives students a different way to examine how AI tools depend on human standards of inquiry\.

The course project asks students to construct questions from their own domains rather than generic trivia questions\. This choice improves benchmark quality because students know the sources, terminology, and reasoning patterns of their fields\. It also gives the course its educational force\. Students must articulate how their training determines standards for reliable answers in AI\-mediated work\. A law question may require exact statutory wording and version awareness; a history question may require archival provenance; a literature or language question may require attention to editions, translation, and phrasing\. These requirements show that professional knowledge is more than content for AI to retrieve\. It is a structure for deciding whether an AI\-produced result can be accepted\.

### 3\.4Question Design as an Educational Protocol

Without systematic guidelines, crowdsourced questions often have predictable problems: answers available through simple searches, ambiguous grading criteria, or artificial complexity that does not reflect genuine difficulty\. We therefore require question designers to satisfy five requirements: \(1\) domain expertise, so that questions require specialized knowledge inaccessible to educated non\-experts; \(2\) long\-tail information targeting, so that answers are embedded in specialized sources rather than obtainable through general knowledge; \(3\) answer uniqueness and verifiability, with unambiguous grading criteria and documented source materials; \(4\) complexity evolution documentation, which records how simple questions were refined into expert\-level queries; and \(5\) anti\-shortcut validation, which requires each question to pass three tests: no pre\-existing solutions searchable online, no trivial identifiers that directly determine answers, and no bypassable reasoning steps\. The complete protocol with examples appears in Appendix [A](https://arxiv.org/html/2605.21413#A1)\.
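As a minimal sketch of how the five requirements could be recorded for one question, the following Python fragment models a question package and lists the gaps that can be checked mechanically\. The class name, field names, and messages are hypothetical and are not taken from the QuestBench release\. Requirements \(1\) and \(2\) depend on human disciplinary judgment and are deliberately left out of the automated check\.

```python
from dataclasses import dataclass, field


@dataclass
class QuestionPackage:
    """Hypothetical container for one student-authored question (not the released schema)."""
    question: str
    reference_answer: str
    grading_criteria: str
    # Requirement (3): documented source materials supporting the answer.
    source_urls: list[str] = field(default_factory=list)
    # Requirement (4): notes on how a simple question became an expert-level one.
    evolution_notes: list[str] = field(default_factory=list)
    # Requirement (5): outcomes of the three anti-shortcut tests, set by a human reviewer.
    no_existing_solution_online: bool = False
    no_trivial_identifier: bool = False
    no_bypassable_step: bool = False


def protocol_violations(pkg: QuestionPackage) -> list[str]:
    """Return the protocol gaps that can be detected without domain judgment."""
    problems = []
    if not pkg.reference_answer.strip() or not pkg.grading_criteria.strip():
        problems.append("missing reference answer or grading criteria")
    if not pkg.source_urls:
        problems.append("no documented source material")
    if not pkg.evolution_notes:
        problems.append("no complexity-evolution documentation")
    if not (pkg.no_existing_solution_online
            and pkg.no_trivial_identifier
            and pkg.no_bypassable_step):
        problems.append("anti-shortcut validation incomplete")
    return problems
```

A package with an empty violation list has only cleared the mechanical part of the protocol; whether the question needs domain expertise or targets long\-tail sources still has to be judged by reviewers\.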

These requirements are annotation rules, but they also do educational work\. A verifiable question ties knowledge claims to evidence\. A grading criterion makes answer quality depend on explicit standards rather than fluency\. Anti\-shortcut design teaches students that apparent competence may come from accidental cues rather than genuine problem solving\. Construction documentation makes problem formulation visible as intellectual work\. Through this protocol, students learn that responsible AI use begins before a model answers: it begins with how the task and its standards are made\.

### 3\.5Peer Review and Quality Control

Peer review is central to both the benchmark and the course\. Students first construct candidate questions, then test one another’s designs by searching for ambiguities, hidden shortcuts, unverifiable answers, and alternative interpretations\. This review structure is modeled on scholarly evaluation: claims become stronger when they survive attempts to falsify or bypass them\. For students, the process makes clear that accountability is social as well as individual\. A trustworthy answer has to withstand examination by others\.

To ensure benchmark quality, we systematically filter questions through a three\-stage pipeline combining preliminary screening, iterative expert validation, and domain normalization\. Algorithm [1](https://arxiv.org/html/2605.21413#alg1) presents the complete workflow\.

Algorithm 1: Iterative Question Filtering and Quality Control

Input: Candidate question pool $Q = \{q_1, \ldots, q_n\}$
Output: Curated benchmark $B$

// Stage 1: Preliminary Screening
$Q_{\mathrm{screened}} \leftarrow$ RemoveInvalidOrTrivial$(Q)$

// Stage 2: Expert Validation \(Iterative\)
for $q \in Q_{\mathrm{screened}}$ do
  $\mathrm{reviewers}_1 \leftarrow$ AssignReviewers$(q, 2)$ // Round 1: Answer Correctness
  if $\neg$VerifyAnswerCorrectness$(\mathrm{reviewers}_1, q)$ then
    Remove $q$; continue
  end if
  $\mathrm{reviewers}_2 \leftarrow$ AssignReviewers$(q, 2)$ // Round 2: Criteria Audit
  if $\neg$VerifyCriteriaClarity$(\mathrm{reviewers}_2, q)$ then
    $q \leftarrow$ ReviseGradingCriteria$(q)$; restart Round 1
  end if
  $\mathrm{reviewers}_3 \leftarrow$ AssignReviewers$(q, 2)$ // Round 3: Anti\-Shortcut
  if $\neg$VerifyAntiShortcut$(\mathrm{reviewers}_3, q)$ then
    Remove $q$
  end if
end for

// Stage 3: Domain Normalization and Final Assembly
$B \leftarrow$ AssembleBenchmark$($NormalizeDomains$(Q_{\mathrm{screened}}))$
return $B$

Stage 1 \(Preliminary screening\) removes invalid, ambiguous, or trivially answerable questions, and applies difficulty\-based filtering using a baseline search\-enabled model\. Stage 2 \(Expert validation\) subjects each remaining question to three independent human review rounds: answer correctness verification, grading criteria audit, and anti\-shortcut testing\. Questions failing any round are revised or permanently removed\. Stage 3 \(Domain normalization\) merges 37 original field labels into 14 domain groups while preserving meaningful disciplinary distinctions\. The final benchmark comprises 256 questions, with the largest domain \(Literature, 43 questions\) representing less than 17% of the total\. Full details of each stage appear in Appendix [B](https://arxiv.org/html/2605.21413#A2)\.
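The control flow of the three stages can be sketched in Python as follows\. This is an illustrative reading of Algorithm 1, not the authors' implementation: the human review rounds are passed in as callables, the question representation \(a dictionary with a `field` key\) is hypothetical, and the `max_revisions` cap is an added assumption, since the algorithm as stated does not bound how often Round 1 can be restarted\.

```python
from typing import Callable, Iterable

Question = dict  # hypothetical: any mapping that carries a "field" label
Check = Callable[[Question], bool]


def filter_questions(
    candidates: Iterable[Question],
    is_valid: Check,            # Stage 1: preliminary screening
    answer_correct: Check,      # Stage 2, Round 1 (human reviewers)
    criteria_clear: Check,      # Stage 2, Round 2 (human reviewers)
    revise_criteria: Callable[[Question], Question],
    shortcut_free: Check,       # Stage 2, Round 3 (human reviewers)
    domain_of: dict,            # Stage 3: field label -> normalized domain
    max_revisions: int = 3,     # assumption: not specified in Algorithm 1
) -> list:
    benchmark = []
    for q in candidates:
        if not is_valid(q):
            continue
        keep = False
        for _ in range(max_revisions + 1):
            if not answer_correct(q):
                break                   # removed in Round 1
            if not criteria_clear(q):
                q = revise_criteria(q)
                continue                # restart from Round 1
            keep = shortcut_free(q)     # removed if a shortcut exists
            break
        if keep:
            # Unknown field labels fall back to themselves (assumption).
            domain = domain_of.get(q["field"], q["field"])
            benchmark.append({**q, "domain": domain})
    return benchmark
```

A question whose criteria never become clear within the revision cap is dropped, which is one possible policy for the "revised or permanently removed" outcome described above\.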

### 3\.6Technical Evaluation Task

The course activity yields a technical evaluation task for expert\-level deep research\. Given a question $q$ requiring specialized domain knowledge and sophisticated retrieval strategies, a model $\mathcal{M}$ equipped with retrieval tools \(web search, document visit\) must produce an answer $\hat{a}$\. The answer is evaluated against a reference answer $a$ using predefined grading criteria $\mathcal{G}$, yielding a score $s = \mathcal{G}(\hat{a}, a) \in [0, 100]$\. All questions, reference answers, and grading criteria are in Chinese, and models must search and retrieve information from the open web in any language\.
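To make the scoring definition concrete, the sketch below aggregates per\-question scores $s \in [0, 100]$ for one system into a mean score and a pass rate\. The function name is hypothetical, and the pass threshold is left as a required argument because this section does not state which score counts as a pass\.

```python
from statistics import mean


def summarize_system(scores: list[float], pass_threshold: float) -> dict:
    """Aggregate one system's per-question scores.

    `pass_threshold` is an assumption supplied by the caller; the text above
    defines the score range but not the score that counts as a pass.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if any(not 0 <= s <= 100 for s in scores):
        raise ValueError("each score must lie in [0, 100]")
    passed = sum(s >= pass_threshold for s in scores)
    return {
        "mean_score": mean(scores),
        "pass_rate": 100 * passed / len(scores),  # percentage of questions passed
    }
```

For example, `summarize_system([100, 50, 0, 100], pass_threshold=100)` gives a mean score of 62\.5 and a pass rate of 50\.0, which shows how partial credit can lift the mean score above the pass rate\.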

Design Principles\. QuestBench is designed around two principles that serve both evaluation and AI education:

- Cross\-Domain Coverage: Questions span 14 normalized professional domains and 37 original field labels, including law, history, international relations, literature, foreign languages, social sciences, arts, archaeology, and political science\. This diversity prevents models from relying on a single domain heuristic and lets students see that AI limitations are shaped by different knowledge structures\.
- Expert\-Level Difficulty: Each question represents a long\-tail query requiring sophisticated retrieval strategies, deep domain knowledge, and careful interpretation of specialized sources\. For students, this principle makes expertise visible: a hard question is bounded, verifiable, and resistant to shortcuts, rather than merely obscure\.

These definitions and design principles connect the course activity to the model evaluation that follows\. Section [4](https://arxiv.org/html/2605.21413#S4) first summarizes the produced benchmark, then evaluates current deep research systems on it\.

## 4Evaluation: Making AI Failures Observable

The evaluation is part of the educational design\. It measures model performance, but it also tests whether student\-designed tasks can surface failures that ordinary tool use may hide\. A deep research system may search broadly, cite plausible sources, and produce a fluent answer while still asking the wrong query, trusting the wrong document, missing a disciplinary distinction, or extracting an imprecise answer\. QuestBench uses evaluation to make those failures inspectable for students and instructors\. We first summarize the produced benchmark, then evaluate current systems and analyze where their work breaks down\.

### 4\.1 Benchmark Overview

The course activity produces a benchmark with enough scale and diversity to support systematic evaluation\. We report scale, domain coverage, difficulty, and answer characteristics because each dimension serves two purposes: it supports model evaluation and shows what kinds of disciplinary standards students turned into testable tasks for inspecting AI\-mediated work\.

Scale and domain coverage\. QuestBench comprises 256 questions spanning 14 normalized domains across the humanities and social sciences\. Figure [3](https://arxiv.org/html/2605.21413#S4.F3) \(left\) visualizes the major domain groups, with the full normalized distribution reported in Appendix [F](https://arxiv.org/html/2605.21413#A6)\. Literature \(43 questions\), Law \(41\), History \(33\), Foreign Languages \(27\), and Social Sciences \(19\) form the largest categories, ensuring robust evaluation within these fields while maintaining broad cross\-domain coverage\.

Difficulty distribution\. Questions are difficult enough to discriminate among frontier systems\. Figure [3](https://arxiv.org/html/2605.21413#S4.F3) \(right\) shows the empirical cross\-model pass\-rate distribution after evaluation\. Of the 256 questions, 128 have a 0% cross\-model pass rate among the four primary baselines, 58 have a pass rate above 0% and at most 25%, 58 have a pass rate above 25% and at most 50%, and 12 exceed a 50% cross\-model pass rate\. The average cross\-model pass rate is 16\.85% with a median of 3\.85%, and the average cross\-model score is 24\.15\. The distribution supports the educational premise that students can construct tasks whose difficulty comes from disciplinary standards rather than hidden data or arbitrary trickiness\.
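
The binning behind this distribution can be sketched in a few lines\. This is a minimal illustration, not the authors’ evaluation code: the function names and the toy outcomes are hypothetical, and only the bin edges and the four reported bin sizes are taken from the paragraph above\.

```python
from collections import Counter


def pass_rate(pass_flags):
    """Percentage of evaluated models that passed a single question."""
    return 100.0 * sum(pass_flags) / len(pass_flags)


def difficulty_bucket(rate):
    """Map a cross-model pass rate (in percent) to one of the four reported bins."""
    if rate == 0:
        return "0%"
    if rate <= 25:
        return "(0%, 25%]"
    if rate <= 50:
        return "(25%, 50%]"
    return ">50%"


# Hypothetical toy outcomes: question id -> one pass/fail flag per model.
toy_outcomes = {
    "q1": [False, False, False, False],
    "q2": [True, False, False, False],
    "q3": [True, True, False, False],
    "q4": [True, True, True, False],
}

histogram = Counter(
    difficulty_bucket(pass_rate(flags)) for flags in toy_outcomes.values()
)

# Bin sizes reported in the text; together they account for all 256 questions.
REPORTED_BINS = {"0%": 128, "(0%, 25%]": 58, "(25%, 50%]": 58, ">50%": 12}
total_reported = sum(REPORTED_BINS.values())

print(dict(histogram), total_reported)
```

The bins are closed on the right, matching the wording "above 0% and at most 25%", so a question passed by exactly one of four models falls in the second bin rather than the third\.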

Question and answer characteristics\. Questions average 141\.7 characters \(range: 42–389\)\. Answers average 29\.3 characters \(range: 1–455\), indicating concise verifiable responses\. Grading criteria average 73\.6 characters and specify exact\-match, partial\-credit, or structured\-answer requirements depending on the task\. Detailed statistics including answer format distributions appear in Appendix [F](https://arxiv.org/html/2605.21413#A6)\.

![Refer to caption](https://arxiv.org/html/2605.21413v2/x1.png)

![Refer to caption](https://arxiv.org/html/2605.21413v2/x2.png)

Figure 3: Left: Distribution across normalized domain groups\. Right: Empirical cross\-model question pass\-rate distribution after evaluation\.

### 4\.2 Evaluation Setup

Evaluated models\. We evaluate thirteen frontier search\-enabled systems representing different model families and generations\. Four systems serve as primary baselines: Kimi K2\.5 \[[16](https://arxiv.org/html/2605.21413#bib.bib9)\], DeepSeek\-V3\.2 \[[7](https://arxiv.org/html/2605.21413#bib.bib12)\], Seed\-2 Pro \[[4](https://arxiv.org/html/2605.21413#bib.bib10)\], and Seed\-1\.8 Pro \[[3](https://arxiv.org/html/2605.21413#bib.bib11)\]\. The remaining nine systems are GPT\-5\.5 \[[24](https://arxiv.org/html/2605.21413#bib.bib48)\], Claude Opus 4\.7 \[[2](https://arxiv.org/html/2605.21413#bib.bib49)\], Gemini 3\.1 Pro \[[10](https://arxiv.org/html/2605.21413#bib.bib50)\], GLM 5\.1 \[[35](https://arxiv.org/html/2605.21413#bib.bib51)\], DeepSeek\-V4 Pro \[[6](https://arxiv.org/html/2605.21413#bib.bib52)\], Kimi K2\.6 \[[23](https://arxiv.org/html/2605.21413#bib.bib53)\], MiMo\-V2\.5 Pro \[[32](https://arxiv.org/html/2605.21413#bib.bib54)\], Qwen 3\.6 Plus \[[1](https://arxiv.org/html/2605.21413#bib.bib55)\], and MiniMax M2\.7 \[[22](https://arxiv.org/html/2605.21413#bib.bib56)\]\. All models access identical search and visit tools with a nominal 50\-call budget per question\. Kimi K2\.5 is evaluated with three runs per question \(scores averaged within each question\); all other models use one run\. Scores are normalized to 0–100, with 60 as the passing threshold\. Full model configurations, prompt templates, and scoring details appear in Appendix [E](https://arxiv.org/html/2605.21413#A5)\.
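
The scoring rule described here can be illustrated with a short sketch\. It is not the authors’ scoring code: the function names and the toy scores are hypothetical, and it assumes that a score of exactly 60 counts as a pass and that the threshold is applied to the per\-question average when a model has several runs\. The exact aggregation is specified in the appendix and may differ\.

```python
from statistics import mean

PASS_THRESHOLD = 60.0  # passing threshold on the 0-100 scale, as stated in the setup


def question_score(run_scores):
    """Average the 0-100 scores of all runs for one question."""
    return mean(run_scores)


def summarize(per_question_runs):
    """Return the mean score and pass rate (%) over all questions.

    Assumption: the threshold is applied to the averaged per-question score.
    """
    scores = [question_score(runs) for runs in per_question_runs.values()]
    passed = [score >= PASS_THRESHOLD for score in scores]
    return {
        "mean_score": mean(scores),
        "pass_rate": 100.0 * sum(passed) / len(passed),
    }


# Hypothetical toy scores: q1 has three runs, the other questions have one.
toy_runs = {
    "q1": [100, 50, 30],  # averages to 60, which passes under the assumption above
    "q2": [0],
    "q3": [60],
}

summary = summarize(toy_runs)
print(summary)
```

Averaging within a question before thresholding means a single strong run cannot pass a question on its own, which is one reason multi\-run and single\-run pass rates are not directly comparable\.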

### 4\.3 Do Student\-Designed Tasks Make AI\-Mediated Work Inspectable?

The overall scores show that student\-designed tasks make expert\-level reliability visible as a problem, even for strong deep research systems\. Table [2](https://arxiv.org/html/2605.21413#S4.T2) presents overall results\. GPT\-5\.5 leads with the highest mean score \(67\.12\) and pass rate \(57\.58%\), followed by Claude Opus 4\.7 \(57\.79, 52\.94%\) and Gemini 3\.1 Pro \(43\.07, 31\.82%\)\. Among the four primary baselines, Kimi K2\.5 \[[16](https://arxiv.org/html/2605.21413#bib.bib9)\] achieves the highest mean score \(30\.48\) and pass rate \(25\.26%\), followed by Seed\-2 Pro \[[4](https://arxiv.org/html/2605.21413#bib.bib10)\] \(22\.89, 15\.23%\), Seed\-1\.8 Pro \[[3](https://arxiv.org/html/2605.21413#bib.bib11)\] \(22\.07, 12\.50%\), and DeepSeek\-V3\.2 \[[7](https://arxiv.org/html/2605.21413#bib.bib12)\] \(14\.58, 7\.81%\)\. The median score is 0 for all primary baseline models except Kimi K2\.5, whose median question\-level score is 16\.67\. Current systems can solve some questions, but they cannot be treated as reliable workers for expert\-level cross\-domain retrieval without further inspection\.

For the paper’s educational argument, difficulty matters because it identifies which parts of the work require accountable judgment\. Reliable answers require discipline\-specific ways of asking, locating, interpreting, and verifying\. The questions are publicly answerable, but they cannot be solved by fluent synthesis alone\. In the course, the score table becomes a starting point for asking where the work broke down: the task definition, the query, the source path, the evidence, or the final acceptance of an answer\.

Table 2: Overall model performance on QuestBench\. Pass rate uses the 60\-point threshold\. Kimi K2\.5 scores are averaged over three runs per question\.

| Model | Mean Score \(0–100\) | Pass Rate \(%\) | Avg\. Tool Calls |
| --- | --- | --- | --- |
| GPT\-5\.5 | 67\.12 | 57\.58 | 41\.12 |
| Claude Opus 4\.7 | 57\.79 | 52\.94 | 41\.94 |
| Gemini 3\.1 Pro | 43\.07 | 31\.82 | 18\.34 |
| GLM 5\.1 | 42\.30 | 35\.14 | 61\.70 |
| DeepSeek\-V4 Pro | 38\.24 | 32\.43 | 23\.92 |
| Kimi K2\.6 | 37\.74 | 32\.26 | 30\.81 |
| Qwen 3\.6 Plus | 36\.05 | 30\.95 | 45\.83 |
| MiMo\-V2\.5 Pro | 32\.05 | 22\.73 | 40\.68 |
| Kimi K2\.5 | 30\.48 | 25\.26 | 33\.41 |
| MiniMax M2\.7 | 29\.29 | 16\.67 | 32\.38 |
| Seed\-2 Pro | 22\.89 | 15\.23 | 14\.33 |
| Seed\-1\.8 Pro | 22\.07 | 12\.50 | 39\.71 |
| DeepSeek\-V3\.2 | 14\.58 | 7\.81 | 24\.76 |

![Refer to caption](https://arxiv.org/html/2605.21413v2/x3.png)

![Refer to caption](https://arxiv.org/html/2605.21413v2/x4.png)

Figure 4: Left: score distributions across models\. Right: average tool calls and fraction of runs exceeding 50 calls\. Among the thirteen evaluated models, higher tool use does not consistently correspond to higher accuracy: GLM 5\.1 averages 61\.70 calls but ranks fourth, while Gemini 3\.1 Pro reaches third place with only 18\.34 calls\.

Domain\-wise performance shows that AI reliability depends on disciplinary context\. Performance remains low across domains, so the benchmark tests broad capabilities rather than one narrow knowledge area\. Averaged across thirteen models, Social Sciences \(36\.83 mean, 19 questions\) and Foreign Languages \(32\.70, 27 questions\) are the highest\-scoring sizeable domains, while Sports \(12\.28, 9 questions\) and International Relations \(16\.37, 15 questions\) are the most difficult\. Rankings for small domains should be interpreted cautiously\. The cross\-domain pattern shows that reliability is not a single global property of a model\. It changes with source ecosystems, disciplinary terminology, and answer standards\. Students therefore need more than a general habit of checking AI\. They need standards from the field in which the answer will be used\.

Question difficulty also supplies material for practicing inspection\. Half of the questions \(128 of 256\) have a 0% cross\-model pass rate, and only 12 exceed a 50% cross\-model pass rate\. The mean question\-level pass rate is 16\.85% \(median 3\.85%\), rising to 23\.84% after excluding all\-zero\-score questions\. These numbers show that QuestBench discriminates among model capabilities while leaving substantial headroom\. In a course setting, they provide concrete cases for analyzing which task was defined, which source was trusted, which evidence was missing, and what a responsible user would have had to decide before using the output\.

### 4\.4 Where Does the Work Break Down?

Failure analysis explains what the scores mean\. We conduct a systematic error analysis on all non\-passing responses across the four primary baseline models \(Kimi K2\.5, DeepSeek\-V3\.2, Seed\-2 Pro, Seed\-1\.8 Pro\)\. The error\-pattern statistics below cover these four baselines; failure analysis for the remaining nine systems is left to future work\. Each response that scores below 60 is classified into one of seven failure categories by an annotator who reads the model’s search trace and final answer\. The analysis covers 1,457 responses in total \(723 from Kimi K2\.5 across three runs, and approximately 248 each from the other three models\)\. Figure [5\(b\)](https://arxiv.org/html/2605.21413#S4.F5.sf2) shows the distribution\.

Naming failure types turns the vague impression that “AI is wrong” into an inspection of the work process\. Retrieval failure is the most frequent category \(478 responses, 32\.8%\): models cannot locate relevant information despite its public availability, especially when information is embedded in specialized archives, government documents, library metadata, or non\-English sources\. Unsupported inference is the second major pattern \(362, 24\.8%\): when search evidence is incomplete, models infer a plausible answer from partial matches rather than continuing verification\. Incomplete or off\-target answering accounts for 17\.2% \(250 responses\), where models retrieve useful context but fail to output the exact entity, number, or legal term required by the grading criteria\. Entity confusion \(135, 9\.3%\) and question misunderstanding \(77, 5\.3%\) are common when multiple similar documents, artworks, statutes, or historical figures satisfy nearby constraints\. Calculation or rule errors are rare \(16, 1\.1%\)\. The remaining 139 responses \(9\.5%\) involve timeout or empty output, where models exhausted their tool budget or failed to produce a final answer\.

These failure patterns map onto the work students have to learn to supervise\. Query formulation failures \(retrieval failure and question misunderstanding, 38\.1% combined\) indicate that models cannot construct domain\-appropriate searches when general\-purpose query strategies fail\. Source navigation failures \(entity confusion, 9\.3%\) show difficulty distinguishing closely related documents or entities within structured professional archives\. Answer extraction failures \(unsupported inference, incomplete answering, and calculation errors, 43\.1% combined\) reveal that even when relevant sources are found, models struggle to produce the precise output that expert\-level grading criteria demand\. These are also the points where students must exercise judgment: define the right query, know which source identity matters, and decide whether an answer satisfies the field’s standard\.
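
The combined shares follow directly from the category counts reported above\. The sketch below reproduces that arithmetic; the dictionary keys are illustrative labels rather than the annotators’ actual category names, and the counts are copied from the text\.

```python
# Non-passing responses per failure category, as reported in the text.
FAILURE_COUNTS = {
    "retrieval_failure": 478,
    "unsupported_inference": 362,
    "incomplete_or_off_target": 250,
    "entity_confusion": 135,
    "question_misunderstanding": 77,
    "calculation_or_rule_error": 16,
    "timeout_or_empty_output": 139,
}

# Grouping of categories into the three stages of work named in the text.
# Timeout or empty output is not assigned to any stage.
STAGES = {
    "query_formulation": ["retrieval_failure", "question_misunderstanding"],
    "source_navigation": ["entity_confusion"],
    "answer_extraction": [
        "unsupported_inference",
        "incomplete_or_off_target",
        "calculation_or_rule_error",
    ],
}

total = sum(FAILURE_COUNTS.values())

stage_share = {
    stage: round(100.0 * sum(FAILURE_COUNTS[name] for name in names) / total, 1)
    for stage, names in STAGES.items()
}

print(total, stage_share)
```

The three stage shares sum to 90\.5%; the remaining 9\.5% is the timeout or empty\-output category, which has no stage because no final answer was produced\.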

![Refer to caption](https://arxiv.org/html/2605.21413v2/x5.png) \(a\) Domain\-wise mean scores\.
![Refer to caption](https://arxiv.org/html/2605.21413v2/x6.png) \(b\) Failure pattern counts\.

Figure 5: Domain and error analysis\. Left: mean scores for the largest normalized domains across the thirteen evaluated models\. Right: failure category distribution across 1,457 non\-passing responses from the four primary baseline models\.

Representative examples\. We present three illustrative cases:

Case 1 \(Answer extraction – Law, terminology precision\): A question asks for a civil\-law institution that transforms a three\-party legal relationship into a relationship involving four or more parties, then asks for the crime in the Criminal Code with the same article number as the corresponding Civil Code provision\. The correct answer is the crime of low\-price share discounting and sale of state\-owned assets for personal gain\. Models often identify the civil\-law institution and the relevant article number, but fail because they output a nearby or updated crime name rather than the exact statutory formulation required by the grading criteria\. Legal search in this case requires precise tracking of specialized terminology, not just conceptual understanding\.

Case 2 \(Source navigation – International Relations, archival failure\): A question requires identifying the FRUS document number for ABC reporter John Scali’s October 26, 1962 meeting with Soviet KGB officer Aleksandr Fomin at the Occidental Restaurant\. Models often navigate to the correct FRUS volume and locate nearby documents involving Scali and Fomin, but select a related memorandum rather than the exact document number\. The error is a structured archival navigation failure: relevant evidence is available, but precise metadata and document\-level distinctions remain difficult\.

Case 3 \(Query formulation – Arts, last\-hop trap\): A question requires a three\-hop search: identify a monograph on Spanish visual culture from a quoted passage, find its dedicatee, then determine which film is analyzed in Chapter 4 of the dedicatee’s 2002 Spanish\-language book\. All models complete the first two hops \(identifying the book and dedicatee\) but fail at the final hop, where the target is a specialized academic monograph with minimal web indexing\. Multi\-hop search difficulty is not linear: it increases sharply when the chain enters low\-indexing specialized sources, and models lack the ability to judge when a search path has moved beyond standard web coverage\.

These examples show that failure rarely comes from information unavailability alone\. It comes from the interaction of specialized query formulation, precise source navigation, and strict answer extraction\. In the course setting, such failures are useful because they make AI\-mediated work inspectable\. Students can see that a model may reach the right source family but miss the exact document, identify the right legal concept but output the wrong statutory version, or complete early search hops while failing at the least\-indexed disciplinary source\. These cases show professional judgment at work in the evaluation process\. Additional detailed case studies with full model outputs are provided in Appendix [C](https://arxiv.org/html/2605.21413#A3)\.

Computational costs\. More search does not reliably improve performance\. GLM 5\.1 averages 61\.70 tool calls per question but ranks fourth, while Gemini 3\.1 Pro reaches third place with only 18\.34 calls\. The two top performers, GPT\-5\.5 and Claude Opus 4\.7, use moderate budgets \(41\.12 and 41\.94 calls\)\. Effective expert search requires query quality and source judgment, not just search depth \(see Appendix [E](https://arxiv.org/html/2605.21413#A5) for full tool\-use analysis\)\. This result supports the paper’s educational claim in a concrete way: using a stronger or more active tool is not enough unless the tool’s activity is guided and evaluated by meaningful human standards\.
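
One way to see the mismatch between search volume and accuracy is to rank the models on each column of Table 2 separately\. The sketch below does only that; the helper name `rank_by` is hypothetical, the values are copied from Table 2, and the comparison is descriptive rather than a statistical test\.

```python
# (mean score, average tool calls) per model, copied from Table 2.
TABLE_2 = {
    "GPT-5.5": (67.12, 41.12),
    "Claude Opus 4.7": (57.79, 41.94),
    "Gemini 3.1 Pro": (43.07, 18.34),
    "GLM 5.1": (42.30, 61.70),
    "DeepSeek-V4 Pro": (38.24, 23.92),
    "Kimi K2.6": (37.74, 30.81),
    "Qwen 3.6 Plus": (36.05, 45.83),
    "MiMo-V2.5 Pro": (32.05, 40.68),
    "Kimi K2.5": (30.48, 33.41),
    "MiniMax M2.7": (29.29, 32.38),
    "Seed-2 Pro": (22.89, 14.33),
    "Seed-1.8 Pro": (22.07, 39.71),
    "DeepSeek-V3.2": (14.58, 24.76),
}


def rank_by(column):
    """Return {model: rank}, where rank 1 is the largest value in the column."""
    ordered = sorted(TABLE_2, key=lambda model: TABLE_2[model][column], reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


score_rank = rank_by(0)  # rank by mean score
call_rank = rank_by(1)   # rank by average tool calls

for model in ("GLM 5.1", "Gemini 3.1 Pro", "GPT-5.5"):
    print(f"{model}: score rank {score_rank[model]}, tool-call rank {call_rank[model]}")
```

Because the nine additional systems and the four primary baselines were not all run under identical conditions \(Kimi K2\.5 uses three runs per question\), the ranks are best read as a quick consistency check on the table rather than as a controlled comparison\.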

## 5 Discussion: What Benchmark Construction Teaches About AI Education

QuestBench uses benchmark construction to connect tool exposure with responsibility for AI\-mediated knowledge work\. Students first encounter deep research as a real AI productivity tool\. They then turn the tool into something they can inspect by writing questions, grading criteria, source justifications, and anti\-shortcut checks\. Deep research remains the concrete case, but the educational issue is broader: students need to learn how to define, guide, and judge work in which AI systems search, read, write, and use tools\. The activity does not ask students to stop making and become detached judges\. It asks them to make the task, the standard, and the inspection procedure that determine whether an AI\-produced answer can count as knowledge\.

### 5\.1 Tool Exposure Still Matters

Direct exposure matters\. Students cannot learn AI only through abstract discussion; they need to see what current systems can actually do\. Deep research systems are useful for this purpose because they feel like real productivity tools\. They search, synthesize, and produce answers with sources\. This first layer puts students inside the kind of AI\-mediated knowledge environment they are likely to inhabit\.

The course then changes the student’s relation to the tool\. A productive tool is not automatically a reliable authority\. A fluent answer with sources may appear to have transformed information into knowledge, but the evaluation results show why that appearance can be unstable\. Knowledge still depends on the value of the question, the authority of the source, the meaning of terms within a discipline, and the person who decides whether the answer is ready to use\.

### 5\.2 Task Design as Accountable Work

Much AI education teaches students how to ask systems better questions\. Benchmark construction asks them to design questions that can test a system\. That difference matters\. A prompt can be improved by changing wording\. A benchmark question has to define a problem space: the intended reasoning path, the evidence needed to verify the answer, the shortcuts that must be blocked, and the grading standard that separates a correct answer from a plausible one\.

This is where the activity adds a second layer to ordinary tool use\. Students still use AI, but they also construct the conditions under which its work can be judged\. They cannot outsource the structure of the task to the model, because the task is what they are building\. They have to decide what is worth asking and how an answer should be checked\. In AI\-mediated knowledge work, that kind of problem formulation is not a preliminary step before real work begins\. It is part of the work\.

### 5\.3 Disciplinary Knowledge as Evaluative Authority

QuestBench also gives disciplinary knowledge a concrete role\. In the presence of strong AI systems, students may wonder whether professional knowledge matters less because models can retrieve and summarize information on demand\. The construction process points the other way\. Professional knowledge tells students what to ask, which sources matter, what distinctions cannot be collapsed, and why a fluent answer may still be wrong\.

Attention to evidence, sources, terminology, and precision should therefore not be treated as a set of mechanical behaviors\. They are ways in which a field makes knowledge accountable\. Evidence links a claim to something that can be checked\. Source identity connects an answer to a knowledge community\. Terminology preserves conceptual boundaries\. Precision matters because professional work often fails at the level of the exact article number, document identifier, edition, translation, or statutory name\. Benchmark construction makes these standards visible because students must write them down and defend them before models are evaluated\.

### 5\.4 Student Reflections and the Shift in Role

Reflections from five student contributors help ground the educational claim\. Their role here is not to settle the long\-term impact of the course, but to show what kinds of change become visible in this setting\. Students described deep research systems less as neutral answer providers and more as tools whose outputs had to be situated and tested\. Several reflections returned to the same practical discovery: designing a good question could be harder than obtaining an answer, because the question had to make the evidence path, answer boundary, and grading standard explicit\.

That discovery changes what AI use means for the student\. If the model can search, read, write, code, and call tools, the student’s role does not disappear\. It expands across the whole chain of work\. Students still make things: questions, evidence paths, standards, rubrics, and judgments about use\. The change is that making and judging become harder to separate\. Benchmark construction can therefore cultivate epistemic agency in a precise sense\. Students learn to remain responsible knowledge actors even when machines participate in producing answers\.

### 5\.5 Reusable Practice and Long\-Term Educational Inquiry

The course activity behind QuestBench can be reused beyond this benchmark\. A course with students from different majors can ask them to contribute candidate questions from their own fields, review each other’s designs, validate answers, and evaluate AI systems on the resulting tasks\. The goal need not be a public benchmark in every setting\. Even when the primary goal is educational, the same structure can help students examine the relationship among AI tools, disciplinary knowledge, evidence standards, and human responsibility\.

The larger question is how education should respond as AI becomes ordinary in learning and professional work\. The value of practices such as benchmark construction is unlikely to be captured by short\-term performance measures alone\. What matters is whether students develop the habits needed to use AI without surrendering the work of judgment\. It also matters whether they see professional knowledge as a source of responsibility rather than a body of content to be automated away\. Answering that question requires longer\-term observation across courses, cohorts, disciplines, and forms of evidence, including reflective interviews, classroom observations, student artifacts, and follow\-up studies of how students use AI in later work\.

QuestBench is one setting in which that inquiry can be pursued\. It makes AI\-mediated knowledge work observable in the classroom\. Students confront the gap between fluent output and trustworthy knowledge, between using a tool and accepting responsibility for an answer, and between professional knowledge as content and professional knowledge as a standard for evaluating AI\. How these experiences accumulate into durable intellectual habits remains open\. The contribution of QuestBench is to make that question concrete enough to study and refine\.

## 6 Conclusion

This paper presents QuestBench as a course\-based practice for teaching accountable AI\-mediated knowledge work through benchmark construction\. The paper contributes a set of difficult questions for deep research systems and a course structure in which students learn AI by building the tests themselves\. Students first encounter deep research as an AI\-era productivity tool\. They then construct expert\-level tasks, review those tasks under peer scrutiny, evaluate models, and use failures as evidence for discussing what trustworthy knowledge requires\.

The benchmark artifact remains important\. QuestBench contains 256 expert\-level questions spanning 14 professional domains, constructed through course\-based collaboration with 85 student question authors\. Evaluation of thirteen frontier deep search systems shows clear limitations: the best\-performing system achieves only a 57\.58% pass rate, and many questions remain unsolved by every evaluated system\. Failure analysis identifies three interacting bottlenecks: query formulation failures \(38\.1%\), source navigation failures \(9\.3%\), and answer extraction failures \(43\.1%\)\. These results show that student\-designed tasks can reveal where AI\-mediated work breaks down\. For the paper’s educational argument, those failures become material for a larger lesson: reliable knowledge depends on more than fluent answer generation\.

QuestBench suggests that AI education should develop more than efficient tool users\. As AI systems become more capable, students also need practice in the parts of work that cannot be treated as automatic: framing meaningful questions, setting evidence standards, inspecting process, and deciding when an answer is reliable enough to use\. Benchmark construction makes this lesson concrete\. It helps students encounter AI as a knowledge\-work tool, understand it as a system with reliability limits, use model failures as material for reflection, and define their own role in AI\-mediated work\. The benchmark is therefore more than an evaluation artifact\. It is a way to ask what kind of judgment education should cultivate when AI participates in knowledge production\.

Responsible use\. The benchmark should be used to examine AI\-mediated knowledge work, not as a fixed ranking of systems or a replacement for course design\. Model scores will change as tools, search APIs, and web coverage change\. The more durable value is the activity of making failures inspectable: asking which question was defined, which source was trusted, which evidence was used, and who remains responsible for the final judgment\. In educational settings, the activity should protect student privacy and avoid using benchmark performance as a proxy for student ability\. The goal is to help students develop judgment, not to turn their disciplinary work into a narrow contest of model scores\.

Limitations and future inquiry\. The limitations of QuestBench follow from its dual role as a research artifact and a course\-based activity\. First, the benchmark currently covers humanities and social science domains and does not include natural sciences, engineering, or medical fields, where expert search may involve formulas, experimental data, instruments, or safety\-critical evidence\. Second, the construction process depends on students with advanced disciplinary training and on instructors who can support multi\-round review, so the activity may need adaptation in other institutions, course formats, or student populations\. Third, the educational impact of AI is itself a long\-term and still\-unfolding question\. Reflections from five student contributors help identify what is worth observing, but the development of judgment, responsibility, and epistemic agency should be studied across courses, cohorts, disciplines, and later uses of AI\. Fourth, the model evaluation reflects a specific snapshot of deep search systems, tool interfaces, and open\-web availability\. Most systems are evaluated with one run per question, so the reported scores should be read as descriptive evidence of current failure patterns rather than as statistically stable estimates of permanent model rankings\. These limits define the scope of the paper’s claim\. QuestBench proposes and studies a course\-based benchmark construction activity through which students can encounter AI tools, construct verifiable problems, inspect model failures, and practice the forms of judgment that AI\-mediated knowledge work requires\.

## Acknowledgments and Disclosure of Funding

We thank the Peking University students who participated in the course\-based construction of QuestBench\. Their work on question design, adversarial review, grading criteria, and answer validation made the benchmark possible and shaped the educational practice studied in this paper\. To reflect both the final benchmark and the collaborative construction process, we acknowledge the contributors below\.

Benchmark construction contributors \(101\)\. Dama Ba, Yang Bai, Jiyue Cao, Lijun Chen, Sifan Chen, Siman Chen, Siqi Chen, Yaoyao Chen, Yujie Chen, Yuqi Chen, Qi Cui, Tianyu Dai, Wenyi Dai, Yuanyuan Deng, Haotian Ding, Sihan Du, Chengyu Fan, Dexin Fei, Zichang Gao, Jiayun Gu, Xinqing Hao, Ruoyi Hu, Weiyu Hu, Yajie Hu, Dingyi Huang, Rensong Huang, Shucheng Huang, Yining Huang, Yutong Jiang, Xiaotong Jiao, Congyu Jie, Xiaozhen Jin, Yedi Jin, Menghan Li, Xinyuan Li, Ruikang Lin, Yihang Lin, Kewei Liu, Luoyirui Liu, Shijing Liu, Sijia Liu, Siyu Liu, Tianqi Liu, Xuanyu Liu, Yawen Liu, Linyong Long, Yuqi Lu, Xianchen Meng, Yufan Niu, Shuyan Pu, Yifan Shen, Yihe Shi, Wei Song, Yuting Sun, Yitong Tan, Yujiao Tao, Dijing Wang, Jiayi Wang, Yanzhi Wang, Yuhao Wang, Jiayi Wei, Peiru Wei, Qiulin Wei, Yuxin Wei, Xinyi Weng, Gulijiangbali Wu, Yawen Xiao, Bingxin Xie, Shurui Xie, Zhi Xing, Li Ya, Yichen Yan, Hanqing Yang, Qingyue Yang, Guanzhi Yu, Tongze Yuan, Siyu Zhang, Yihan Zhang, Yumeng Zhang, Yuxuan Zhang, Jingwen Zhao, Jiachen Zheng, Shengnan Zheng, Xiaoyu Zheng, Youzhen Zheng, Chaoran Guo, Jiayan Guo, Xinyue He, Binzhou Li, Rongcai Li, Shuoren Liu, Zhengda Lu, Xin Shu, Karen Uchibori, Zhengyi Wang, Zhichu Wu, Mu Xiong, Wenxiao Xiong, Zhizheng Yang, Zhizhi Zhang, Bozhong Zhao\.

We also thank the course instructors and teaching staff who facilitated the collaboration and supported the multi\-round review process\. We acknowledge the computational and construction support used to run model evaluations and organize the resulting analyses\.

## References

- \[1\] \(2026\) Qwen3\.6\-plus: towards real world agents\. Note: Accessed: 2026\-05\-06\. External Links: [Link](https://qwen.ai/blog?id=qwen3.6)\.
- \[2\] Anthropic \(2026\) What’s new in claude opus 4\.7\. Note: Accessed: 2026\-05\-06\. External Links: [Link](https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-7)\.
- \[3\] ByteDance Seed \(2026\) Seed1\.8 Model Card: towards generalized real\-world agency\. arXiv preprint arXiv:2603\.20633\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2603.20633), [Link](https://arxiv.org/abs/2603.20633)\.
- \[4\] ByteDance Seed \(2026\) Seed2\.0 Model Card: towards intelligence frontier for real\-world complexity\. Note: Model card\. External Links: [Link](https://seed.bytedance.com/en/seed2)\.
- \[5\] Z\. Chen, W\. Chen, C\. Smiley, S\. Shah, I\. Borova, D\. Langdon, R\. Moussa, M\. Beane, T\. Huang, B\. Routledge, and W\. Y\. Wang \(2021\) FinQA: a dataset of numerical reasoning over financial data\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp\. 3697–3711\. External Links: [Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.300), [Link](https://aclanthology.org/2021.emnlp-main.300/)\.
- \[6\] DeepSeek AI \(2026\) DeepSeek V4 technical documentation\. Note: Accessed: 2026\-05\-06\. External Links: [Link](https://fe-static.deepseek.com/chat/transparency/deepseek-V4-model-card-EN.pdf)\.
- \[7\] DeepSeek\-AI, A\. Liu, A\. Mei, B\. Lin, B\. Xue, B\. Wang, B\. Xu, B\. Wu, B\. Zhang, C\. Lin, et al\. \(2025\) DeepSeek\-V3\.2: pushing the frontier of open large language models\. arXiv preprint arXiv:2512\.02556\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2512.02556), [Link](https://arxiv.org/abs/2512.02556)\.
- \[8\] X\. Deng, Y\. Gu, B\. Zheng, S\. Chen, S\. Stevens, B\. Wang, H\. Sun, and Y\. Su \(2023\) Mind2Web: towards a generalist agent for the web\. In Advances in Neural Information Processing Systems, Vol\. 36, pp\. 28091–28114\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/5950bf290a1570ea401bf98882128160-Paper-Datasets_and_Benchmarks.pdf)\.
- \[9\] M\. Du, B\. Xu, C\. Zhu, X\. Wang, and Z\. Mao \(2025\) DeepResearch Bench: a comprehensive benchmark for deep research agents\. arXiv preprint arXiv:2506\.11763\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2506.11763), [Link](https://arxiv.org/abs/2506.11763)\.
- \[10\] Google DeepMind \(2026\) Gemini 3\.1 pro model card\. Note: Accessed: 2026\-05\-06\. External Links: [Link](https://deepmind.google/models/model-cards/gemini-3-1-pro/)\.
- \[11\] N\. Guha, J\. Nyarko, D\. Ho, C\. Ré, A\. Chilton, A\. K, A\. Chohlas\-Wood, A\. Peters, B\. Waldon, D\. Rockmore, et al\. \(2023\) LegalBench: a collaboratively built benchmark for measuring legal reasoning in large language models\. In Advances in Neural Information Processing Systems, Vol\. 36, pp\. 44123–44279\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/89e44582fd28ddfea1ea4dcb0ebbf4b0-Paper-Datasets_and_Benchmarks.pdf)\.
- \[12\] N\. Gupta, R\. Chatterjee, L\. Haas, C\. Tao, A\. Wang, C\. Liu, H\. Oiwa, E\. Gribovskaya, J\. Ackermann, J\. Blitzer, S\. Goldshtein, and D\. Das \(2026\) DeepSearchQA: bridging the comprehensiveness gap for deep research agents\. arXiv preprint arXiv:2601\.20975\. External Links: [Link](https://arxiv.org/abs/2601.20975)\.
- \[13\] Q\. Jin, B\. Dhingra, Z\. Liu, W\. Cohen, and X\. Lu \(2019\) PubMedQA: a dataset for biomedical research question answering\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp\. 2567–2577\. External Links: [Document](https://dx.doi.org/10.18653/v1/D19-1259), [Link](https://aclanthology.org/D19-1259/)\.
- \[14\]M\. Joshi, E\. Choi, D\. Weld, and L\. Zettlemoyer\(2017\)TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension\.InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics,pp\. 1601–1611\.External Links:[Document](https://dx.doi.org/10.18653/v1/P17-1147),[Link](https://aclanthology.org/P17-1147/)Cited by:[§2\.2](https://arxiv.org/html/2605.21413#S2.SS2.p1.1)\.
- \[15\]H\. Kang and C\. Xiong\(2025\)ResearchArena: benchmarking large language models’ ability to collect and organize information as research agents\.InFindings of the Association for Computational Linguistics: EMNLP 2025,pp\. 5653–5671\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.303),[Link](https://aclanthology.org/2025.findings-emnlp.303/)Cited by:[§2\.2](https://arxiv.org/html/2605.21413#S2.SS2.p1.1)\.
- \[16\]Kimi Team, T\. Bai, Y\. Bai, Y\. Bao, S\. H\. Cai, Y\. Cao, Y\. Charles, H\. S\. Che, C\. Chen, G\. Chen,et al\.\(2026\)Kimi K2\.5: visual agentic intelligence\.arXiv preprint arXiv:2602\.02276\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2602.02276),[Link](https://arxiv.org/abs/2602.02276)Cited by:[1st item](https://arxiv.org/html/2605.21413#A5.I1.i1.p1.1),[§1](https://arxiv.org/html/2605.21413#S1.p7.1),[§4\.2](https://arxiv.org/html/2605.21413#S4.SS2.p1.1),[§4\.3](https://arxiv.org/html/2605.21413#S4.SS3.p1.1)\.
- \[17\]S\. Krishna, K\. Krishna, A\. Mohananey, S\. Schwarcz, A\. Stambler, S\. Upadhyay, and M\. Faruqui\(2025\)Fact, fetch, and reason: a unified evaluation of retrieval\-augmented generation\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 4745–4759\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.243),[Link](https://aclanthology.org/2025.naacl-long.243/)Cited by:[§2\.2](https://arxiv.org/html/2605.21413#S2.SS2.p1.1)\.
- \[18\]T\. Kwiatkowski, J\. Palomaki, O\. Redfield, M\. Collins, A\. Parikh, C\. Alberti, D\. Epstein, I\. Polosukhin, J\. Devlin, K\. Lee,et al\.\(2019\)Natural questions: a benchmark for question answering research\.Transactions of the Association for Computational Linguistics7,pp\. 452–466\.External Links:[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00276),[Link](https://aclanthology.org/Q19-1026/)Cited by:[§2\.2](https://arxiv.org/html/2605.21413#S2.SS2.p1.1)\.
- \[19\]X\. Liu, H\. Yu, H\. Zhang, Y\. Xu, X\. Lei, H\. Lai, Y\. Gu, H\. Ding, K\. Men, K\. Yang, S\. Zhang, X\. Deng, A\. Zeng, Z\. Du, C\. Zhang, S\. Shen, T\. Zhang, Y\. Su, H\. Sun, M\. Huang,et al\.\(2024\)AgentBench: evaluating LLMs as agents\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=zAdUB0aCTQ)Cited by:[§2\.3](https://arxiv.org/html/2605.21413#S2.SS3.p1.1)\.
- \[20\]M\-A\-P Team, X\. Du, Y\. Yao, K\. Ma, B\. Wang, T\. Zheng, K\. Zhu, M\. Liu, Y\. Liang, X\. Jin,et al\.\(2025\)SuperGPQA: scaling LLM evaluation across 285 graduate disciplines\.arXiv preprint arXiv:2502\.14739\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2502.14739),[Link](https://arxiv.org/abs/2502.14739)Cited by:[§2\.3](https://arxiv.org/html/2605.21413#S2.SS3.p1.1)\.
- \[21\]G\. Mialon, C\. Fourrier, T\. Wolf, Y\. LeCun, and T\. Scialom\(2024\)GAIA: a benchmark for general AI assistants\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=fibxvahvs3)Cited by:[§2\.2](https://arxiv.org/html/2605.21413#S2.SS2.p1.1)\.
- \[22\]MiniMax\(2026\)MiniMax M2\.7: early echoes of self\-evolution\.Note:Accessed: 2026\-05\-06External Links:[Link](https://www.minimax.io/news/minimax-m27-en)Cited by:[13th item](https://arxiv.org/html/2605.21413#A5.I1.i13.p1.1),[§1](https://arxiv.org/html/2605.21413#S1.p7.1),[§4\.2](https://arxiv.org/html/2605.21413#S4.SS2.p1.1)\.
- \[23\]Moonshot AI\(2026\)Kimi K2\.6 tech blog: advancing open\-source coding\.Note:Accessed: 2026\-05\-06External Links:[Link](https://www.kimi.com/blog/kimi-k2-6)Cited by:[10th item](https://arxiv.org/html/2605.21413#A5.I1.i10.p1.1),[§1](https://arxiv.org/html/2605.21413#S1.p7.1),[§4\.2](https://arxiv.org/html/2605.21413#S4.SS2.p1.1)\.
- \[24\]OpenAI\(2026\)GPT\-5\.5 system card\.Note:Accessed: 2026\-05\-06External Links:[Link](https://openai.com/index/gpt-5-5-system-card/)Cited by:[5th item](https://arxiv.org/html/2605.21413#A5.I1.i5.p1.1),[§1](https://arxiv.org/html/2605.21413#S1.p7.1),[§4\.2](https://arxiv.org/html/2605.21413#S4.SS2.p1.1)\.
- \[25\]L\. Phan, A\. Gatti, N\. Li, A\. Khoja, R\. Kim, R\. Ren, J\. Hausenloy, O\. Zhang, M\. Mazeika, D\. Hendrycks,et al\.\(2026\)A benchmark of expert\-level academic questions to assess AI capabilities\.Nature649\(8099\),pp\. 1139–1146\.External Links:[Document](https://dx.doi.org/10.1038/s41586-025-09962-4),[Link](https://doi.org/10.1038/s41586-025-09962-4)Cited by:[§2\.3](https://arxiv.org/html/2605.21413#S2.SS3.p1.1)\.
- \[26\]D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman\(2024\)GPQA: a graduate\-level google\-proof q&a benchmark\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=TGnEMgh4PB)Cited by:[§2\.3](https://arxiv.org/html/2605.21413#S2.SS3.p1.1)\.
- \[27\]D\. Ru, L\. Qiu, X\. Hu, T\. Zhang, P\. Shi, S\. Chang, C\. Jiayang, C\. Wang, S\. Sun, H\. Li, Z\. Zhang, B\. Wang, J\. Jiang, T\. He, Z\. Wang, P\. Liu, Y\. Zhang, and Z\. Zhang\(2024\)RAGChecker: a fine\-grained framework for diagnosing retrieval\-augmented generation\.InAdvances in Neural Information Processing Systems,Vol\.37,pp\. 21999–22027\.External Links:[Document](https://dx.doi.org/10.52202/079017-0692),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/27245589131d17368cccdfa990cbf16e-Paper-Datasets_and_Benchmarks_Track.pdf)Cited by:[§2\.2](https://arxiv.org/html/2605.21413#S2.SS2.p1.1)\.
- \[28\]G\. Starace, O\. Jaffe, D\. Sherburn, J\. Aung, J\. S\. Chan, L\. Maksin, R\. Dias, E\. Mays, B\. Kinsella, W\. Thompson, J\. Heidecke, A\. Glaese, and T\. Patwardhan\(2025\)PaperBench: evaluating AI’s ability to replicate AI research\.arXiv preprint arXiv:2504\.01848\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2504.01848),[Link](https://arxiv.org/abs/2504.01848)Cited by:[§2\.3](https://arxiv.org/html/2605.21413#S2.SS3.p1.1)\.
- \[29\]T\. Vu, M\. Iyyer, X\. Wang, N\. Constant, J\. Wei, J\. Wei, C\. Tar, Y\. Sung, D\. Zhou, Q\. Le, and T\. Luong\(2023\)FreshLLMs: refreshing large language models with search engine augmentation\.arXiv preprint arXiv:2310\.03214\.External Links:[Link](https://arxiv.org/abs/2310.03214)Cited by:[§2\.2](https://arxiv.org/html/2605.21413#S2.SS2.p1.1)\.
- \[30\]Y\. Wang, X\. Ma, G\. Zhang, Y\. Ni, A\. Chandra, S\. Guo, W\. Ren, A\. Arulraj, X\. He, Z\. Jiang, T\. Li, M\. Ku, K\. Wang, A\. Zhuang, R\. Fan, X\. Yue, and W\. Chen\(2024\)MMLU\-Pro: a more robust and challenging multi\-task language understanding benchmark\.InAdvances in Neural Information Processing Systems,Vol\.37,pp\. 95266–95290\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/ad236edc564f3e3156e1b2feafb99a24-Paper-Datasets_and_Benchmarks_Track.pdf)Cited by:[§2\.3](https://arxiv.org/html/2605.21413#S2.SS3.p1.1)\.
- \[31\]J\. Wei, Z\. Sun, S\. Papay, S\. McKinney, J\. Han, I\. Fulford, H\. W\. Chung, A\. T\. Passos, W\. Fedus, and A\. Glaese\(2025\)BrowseComp: a simple yet challenging benchmark for browsing agents\.arXiv preprint arXiv:2504\.12516\.External Links:[Link](https://arxiv.org/abs/2504.12516)Cited by:[§2\.2](https://arxiv.org/html/2605.21413#S2.SS2.p1.1)\.
- \[32\]Xiaomi\(2026\)MiMo\-V2\.5\-Pro\.Note:Accessed: 2026\-05\-06External Links:[Link](https://mimo.xiaomi.com/mimo-v2-5-pro)Cited by:[11th item](https://arxiv.org/html/2605.21413#A5.I1.i11.p1.1),[§1](https://arxiv.org/html/2605.21413#S1.p7.1),[§4\.2](https://arxiv.org/html/2605.21413#S4.SS2.p1.1)\.
- \[33\]Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning\(2018\)HotpotQA: a dataset for diverse, explainable multi\-hop question answering\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,pp\. 2369–2380\.External Links:[Document](https://dx.doi.org/10.18653/v1/D18-1259),[Link](https://aclanthology.org/D18-1259/)Cited by:[§2\.2](https://arxiv.org/html/2605.21413#S2.SS2.p1.1)\.
- \[34\]S\. Yao, N\. Shinn, P\. Razavi, and K\. Narasimhan\(2025\)τ\-bench: a benchmark for tool\-agent\-user interaction in real\-world domains\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2025/hash/1b126cc38b8638e07bef37e7b2bb72bf-Abstract-Conference.html)Cited by:[§2\.3](https://arxiv.org/html/2605.21413#S2.SS3.p1.1)\.
- \[35\]Zhipu AI\(2026\)GLM\-5\.1 model card\.Note:Accessed: 2026\-05\-06External Links:[Link](https://www.modelscope.cn/models/ZhipuAI/GLM-5.1)Cited by:[8th item](https://arxiv.org/html/2605.21413#A5.I1.i8.p1.1),[§1](https://arxiv.org/html/2605.21413#S1.p7.1),[§4\.2](https://arxiv.org/html/2605.21413#S4.SS2.p1.1)\.
- \[36\]P\. Zhou, B\. Leon, X\. Ying, C\. Zhang, Y\. Shao, Q\. Ye, D\. Chong, Z\. Jin, C\. Xie, M\. Cao, Y\. Gu, S\. Hong, J\. Ren, J\. Chen, C\. Liu, and Y\. Hua\(2025\)BrowseComp\-ZH: benchmarking web browsing ability of large language models in chinese\.arXiv preprint arXiv:2504\.19314\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2504.19314),[Link](https://arxiv.org/abs/2504.19314)Cited by:[§2\.2](https://arxiv.org/html/2605.21413#S2.SS2.p1.1)\.
- \[37\]S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, U\. Alon, and G\. Neubig\(2024\)WebArena: a realistic web environment for building autonomous agents\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=oKn9c6ytLx)Cited by:[§2\.3](https://arxiv.org/html/2605.21413#S2.SS3.p1.1)\.

## Appendix A Course Protocol for Benchmark Construction

This appendix gives the question design protocol used in the course\-based benchmark construction activity\. The protocol keeps questions expert\-level, verifiable, and resistant to common shortcuts\. It also shows how the course asks students to define problems, defend evidence, specify what would make an answer trustworthy, and inspect the AI\-mediated work that produces it\.

### A\.1 Core Requirements

Each question must satisfy the following requirements:

Domain Expertise Requirement\. Questions must require specialized knowledge from the designer’s professional field that would not be accessible to educated non\-experts\. The question should resemble a research scenario or professional task that domain practitioners would recognize as challenging\.

Long\-tail Information Targeting\. Questions should target information that is publicly available but not widely known, embedded in sources such as academic publications, professional archives, or domain\-specific databases\. Effective searching requires domain knowledge to formulate appropriate queries and interpret specialized terminology\.

Answer Uniqueness and Verifiability\. Questions must have a unique, objectively correct answer \(or a well\-defined set of correct answers\) with a clear verification method through authoritative sources\. Acceptable formats include single entities, entity lists, structured objects, ordered sequences, tables, or numerical results\. All possible correct answers must be enumerable and specified\.

Detailed Grading Criteria\. Designers specify the point allocation for each component of the answer and include rules for partial credit, alternative phrasings, and additional or missing information\. Criteria must support objective evaluation, appropriate partial credit, and edge\-case handling\.

Construction Documentation\. Designers document the evolution from a simple question to an expert\-level query, explaining what each iteration adds and why the final version requires domain expertise\. This record helps students see that problem formulation is part of accountable AI\-mediated knowledge work, not a preliminary step before it\.
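The requirements above can be made partly checkable by storing each question as a structured record. The following is a minimal sketch, not the course's actual submission format: the `QuestionSpec` and `Criterion` names, the fields, the 100-point total (which matches the examples in Appendices C and D but is not stated as a rule), and the two-iteration minimum are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    """One gradable component of an answer (hypothetical structure)."""
    description: str
    points: int


@dataclass
class QuestionSpec:
    """Illustrative record for one student-designed question."""
    question: str
    reference_answer: str
    sources: list[str]            # authoritative sources used for verification
    criteria: list[Criterion]     # point allocation per answer component
    iterations: list[str] = field(default_factory=list)  # construction record

    def problems(self, total: int = 100) -> list[str]:
        """Return protocol violations found; an empty list means none."""
        issues = []
        if not self.sources:
            issues.append("no authoritative source listed for verification")
        if sum(c.points for c in self.criteria) != total:
            issues.append(f"criterion points do not sum to {total}")
        if len(self.iterations) < 2:  # assumed minimum, not from the protocol
            issues.append("construction record shows fewer than two iterations")
        return issues


# Example using the rubric of Case 3 in Appendix C (40 + 30 + 30 points).
spec = QuestionSpec(
    question="Which director's film is analyzed in Chapter 4 of the monograph?",
    reference_answer="Iván Zulueta, Arrebato",
    sources=["(hypothetical placeholder) catalogue record for the monograph"],
    criteria=[
        Criterion("path verification", 40),
        Criterion("director", 30),
        Criterion("film", 30),
    ],
    iterations=["single-book lookup", "two-book chain via the dedication"],
)
print(spec.problems())
```

A check like this only covers the formal parts of the protocol. Whether a question truly requires domain expertise still has to be judged by reviewers.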

### A\.2 Anti\-Shortcut Checklist

Before submission, designers verify that questions avoid three common shortcut traps:

Trap: Pre\-existing Solutions\. Search combinations of question terms in multiple search engines to verify that no webpage directly provides the answer with the same reasoning path\. If such a page exists, reformulate the question to require different information integration or add constraints that require independent synthesis\.

Trap: Trivial Identifiers\. Ensure that no single constraint acts as a unique identifier \(ISBN, DOI, atomic number, or a specific date paired with a name\) that determines the answer without reasoning\. Replace unique identifiers with combinations of non\-unique descriptive properties\.

Trap: Bypassable Multi\-step Design\. For questions that appear to require multiple steps, verify that those steps are necessary and cannot be bypassed through direct search\. Test whether combinations of start and end constraints yield the answer without intermediate reasoning\. If they do, strengthen the dependencies between steps or add constraints that require information synthesis\.
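Part of the trivial-identifier check can be automated before human review. The sketch below flags question text that contains an obvious unique identifier. The patterns and the `find_trivial_identifiers` name are hypothetical and deliberately rough; they are not part of the course protocol, and the "date paired with a name" case is left to reviewers because it cannot be caught reliably by pattern matching.

```python
import re

# Hypothetical patterns for identifiers that can pin down an answer by themselves.
IDENTIFIER_PATTERNS = {
    "DOI": re.compile(r"\b10\.\d{4,9}/\S+"),
    "ISBN": re.compile(r"\bISBN(?:-1[03])?:?\s*[\d\-Xx ]{10,17}"),
    "atomic number": re.compile(r"\batomic number\s+\d+", re.IGNORECASE),
}


def find_trivial_identifiers(question: str) -> list[str]:
    """Return the identifier types that appear in the question text."""
    return [
        name
        for name, pattern in IDENTIFIER_PATTERNS.items()
        if pattern.search(question)
    ]


# A made-up ISBN is used here on purpose; it does not refer to a real book.
print(find_trivial_identifiers("Who wrote the book with ISBN 000-0-0000-0000-0?"))
print(find_trivial_identifiers("Which film is analyzed in Chapter 4?"))
```

A non-empty result is a prompt to rewrite the constraint as a combination of non-unique descriptive properties, as the checklist recommends.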

## Appendix B Peer Review and Filtering Details

Algorithm [1](https://arxiv.org/html/2605.21413#alg1) in Section [3\.5](https://arxiv.org/html/2605.21413#S3.SS5) gives the filtering workflow\. We add details for each stage here\.

Stage 1 details\. Preliminary screening removes questions with the following issues: \(a\) questions requiring access to private or paywalled content that cannot be verified through open\-web search; \(b\) ambiguous wording where multiple reasonable interpretations lead to different answers; \(c\) questions directly answerable by the baseline model with high confidence, indicating insufficient difficulty; and \(d\) questions where no evaluated model produces a meaningful search attempt, suggesting potentially ill\-defined requirements\.

Stage 2 details\. In the iterative validation rounds, each reviewer works independently without seeing other reviewers’ assessments\. For Round 1 \(answer correctness\), reviewers must independently locate authoritative sources confirming the reference answer and verify that no alternative correct answers exist\. For Round 2 \(grading criteria audit\), reviewers simulate scoring sample responses to verify that criteria produce consistent scores\. For Round 3 \(anti\-shortcut verification\), reviewers test at least three different search query formulations to confirm that no straightforward search path yields the answer directly\. These validation steps also serve the educational goal: students learn that accountable AI\-mediated work depends on standards that can be inspected and defended by others\.

Disagreement resolution\. When reviewers disagree, the question is returned to the original author with specific feedback\. If disagreement persists after revision, the question is removed from the candidate pool\. This conservative rule prioritizes benchmark quality over quantity\. It also models an important norm for AI education: knowledge claims should survive scrutiny rather than rely on private confidence or fluent presentation\.
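The disagreement rule can be written as a small decision function. This is a sketch under stated assumptions rather than the workflow of Algorithm 1: the `Outcome` and `resolve` names are hypothetical, and the code treats any non-unanimous approval as a disagreement, which the text above does not spell out.

```python
from enum import Enum


class Outcome(Enum):
    ACCEPT = "accept"
    RETURN_TO_AUTHOR = "return_to_author"
    REMOVE = "remove"


def resolve(reviewer_votes: list[bool], already_revised: bool) -> Outcome:
    """Decide what happens to a question after one independent review round.

    reviewer_votes holds one approval flag per reviewer.
    already_revised is True if the author has already revised the question once.
    """
    if not reviewer_votes:
        raise ValueError("at least one reviewer vote is required")
    if all(reviewer_votes):
        return Outcome.ACCEPT
    # Assumption: any dissent counts as disagreement. The question goes back
    # to the author once; dissent that persists after revision removes it.
    return Outcome.REMOVE if already_revised else Outcome.RETURN_TO_AUTHOR


print(resolve([True, False, True], already_revised=False))
print(resolve([True, False, True], already_revised=True))
```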

## Appendix C AI Failure Cases for Classroom Analysis

This section presents detailed cases from QuestBench evaluation, with question specifications, model responses, and analysis\. They document where AI\-mediated knowledge work breaks down and provide classroom material for discussing why such work requires domain standards and human judgment\. These cases correspond to the representative examples summarized in Section [4\.4](https://arxiv.org/html/2605.21413#S4.SS4)\.

### C\.1 Case 1: Legal Domain – Answer Extraction Failure

Domain: Law

Challenge dimension: Answer extraction \(terminology precision\)

Question \(translated from Chinese\): There is a civil\-law institution that transforms a legal relationship originally occurring among three parties into one involving four or more parties\. This type of legal act often involves contracts but does not appear in the Contract Section of the Civil Code\. State the crime in the Criminal Code whose article number matches that of the Civil Code provision governing this institution\.

Reference answer: The crime of personal\-gain\-motivated low\-price share discounting and sale of state\-owned assets \(translated from Chinese\), Article 169\.

Grading criteria: 100 points for an exact character\-by\-character match of the crime name; 0 otherwise\.

Reasoning path: The civil\-law institution is sub\-agency or entrusted sub\-agency \(translated from Chinese\), governed by Article 169 of the Civil Code \(General Provisions, Chapter 7: Agency\)\. Sub\-agency extends the original three\-party relationship \(principal, agent, counterparty\) to four or more parties by introducing a sub\-agent\. The corresponding crime in Article 169 of the Criminal Code is the target answer\.

Model responses:

- Seed\-2 Pro \(0/100\): Incorrectly identified the civil\-law institution as “counter\-guarantee” \(Article 387\), leading to the wrong Criminal Code article and the answer “crime of unit bribery\.”
- Seed\-1\.8 Pro \(0/100\): Correctly identified sub\-agency and Article 169, but answered “crime of personal\-gain\-motivated low\-price share discounting and sale of company or enterprise assets” \(translated from Chinese\), which is the post\-2024\-amendment name, not the original statutory formulation\.
- Kimi K2\.5 \(run 1: 0/100, run 2: 100/100, run 3: 100/100\): In the failing run, Kimi produced the post\-amendment name \(same error as Seed\-1\.8 Pro\)\. In two successful runs, it provided the exact pre\-amendment statutory formulation\.
- DeepSeek\-V3\.2 \(0/100\): Produced no answer \(empty response\)\.

Analysis: This case contains two distinct failures\. Seed\-2 Pro failed at the initial reasoning step by misidentifying the civil\-law institution\. Seed\-1\.8 Pro and one Kimi run completed the reasoning path but failed at final answer extraction because of a terminology\-evolution trap: the 2024 Criminal Code amendment expanded the crime’s scope from “state\-owned assets” to “company or enterprise assets,” changing the official name\. Models that retrieved post\-amendment sources produced the updated name, which does not match the grading criteria requiring the original statutory formulation\. Expert\-level legal tasks require precise tracking of legislative versioning, not just conceptual understanding\.

### C\.2 Case 2: International Relations – Source Navigation Failure

Domain: International Relations

Challenge dimension: Source navigation \(archival metadata precision\)

Question \(translated from Chinese\): Consult the Foreign Relations of the United States \(FRUS\) volume on the Cuban Missile Crisis \(Volume XI, 1961–1963\)\. Locate the document number of the memorandum in which ABC reporter John Scali reported to the State Department about his October 26, 1962 meeting with Soviet KGB officer Aleksandr Fomin at the Occidental Restaurant in Washington, DC\.

Reference answer: Document 64\.

Grading criteria: 100 points for “Document 64” or “64”; 0 for any other document number\.

Model responses:

- Seed\-2 Pro \(0/100\): Answered “Document 85,” described as an editorial note\.
- Seed\-1\.8 Pro \(0/100\): Answered “Document 80,” identified a memorandum titled “Memorandum From ABC Correspondent John Scali to the Director of the Bureau of Intelligence and Research\.”
- Kimi K2\.5 \(run 1: 0/100, run 2: 0/100, run 3: 0/100\): Answered “Document 195,” “Document 80,” and “Document 195” across three runs\. None matched\.
- DeepSeek\-V3\.2 \(0/100\): Answered “Document 80\.”

Analysis: All models failed this question across all runs, making it one of the universally unsolved questions\. Three of the four models converged on Document 80 in at least one run \(Seed\-1\.8 Pro, DeepSeek\-V3\.2, and one of three Kimi K2\.5 runs\)\. That is a real FRUS document involving Scali, but it records a different communication in the sequence\. FRUS Volume XI contains dozens of documents from overlapping dates with the same participants \(Scali, Fomin, Hilsman\)\. Document 64 records the initial restaurant meeting, while Documents 80 and 195 record later follow\-up communications\. Models consistently located the correct volume, date, and participants, but could not distinguish between closely related documents in a structured archive\. Professional archival navigation requires content\-level analysis of individual documents rather than metadata\-level matching\.

### C\.3 Case 3: Cross\-Domain Scholarly Search – Query Formulation Failure

Domain: Arts

Challenge dimension: Query formulation \(multi\-hop search\)

Question \(translated from Chinese\): A scholarly monograph on contemporary Spanish visual culture analyzes Bleda y Rosa’s photographic work “Hacia Valeria\.” The author notes that “Hacia” \(Towards\) suggests we are placed “in medias nusquam,” never truly reaching history through images\. The author dedicated this book to a female scholar who deeply influenced her\. Find this dedicatee, then identify: in that scholar’s 2002 Spanish\-language monograph on “wounded culture” \(cultura herida\), which director’s which film is primarily analyzed in Chapter 4?

Reference answer: Director Iván Zulueta’s film Arrebato\. The reasoning chain: Book A = Patricia M\. Keller, Ghostly Landscapes → dedicatee = Cristina Moreiras\-Menor → Book B = Cultura herida \(2002\) → Chapter 4 analyzes Arrebato\.

Grading criteria: Path verification \(40 points\): correctly identify Keller’s book and Moreiras\-Menor as dedicatee\. Final answer \(60 points\): Iván Zulueta \(30 points\) \+ Arrebato \(30 points\)\.
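The rubric above can be expressed as a short scoring function. The sketch below is illustrative only: `score_case3` and `normalize` are hypothetical names, the accent-insensitive comparison is an assumption about acceptable phrasings, and the even 20/20 split of the path-verification points is also an assumption, since the criteria state only the 40-point total.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, trim, and strip accents so 'Ivan Zulueta' equals 'Iván Zulueta'."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


def score_case3(found_book: bool, found_dedicatee: bool,
                director: str, film: str) -> int:
    """Score one response under the 40/30/30 rubric."""
    score = 0
    # Path verification (40 points), split evenly here by assumption.
    score += 20 if found_book else 0
    score += 20 if found_dedicatee else 0
    # Final answer (60 points): director and film are graded separately.
    if normalize(director) == normalize("Iván Zulueta"):
        score += 30
    if normalize(film) == normalize("Arrebato"):
        score += 30
    return score


# Correct path but no correct final answer earns path credit only.
print(score_case3(True, True, director="", film=""))
```

Grading the path separately from the final answer is what makes the partial successes in this case visible in the scores.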

Model responses:

- Seed\-2 Pro \(40/100\): Correctly identified Keller’s book and Moreiras\-Menor, but gave the wrong film for Chapter 4\.
- Seed\-1\.8 Pro \(20/100\): Identified Moreiras\-Menor but attributed the wrong chapter content\.
- Kimi K2\.5 \(run 1: 40, run 2: 40, run 3: 40\): All three runs correctly completed the first two hops \(Keller → Moreiras\-Menor\) but failed to identify the correct Chapter 4 content in Cultura herida\.
- DeepSeek\-V3\.2 \(40/100\): Same pattern: correct path verification, incorrect final answer\.

Analysis: This question requires a three\-hop search chain across two books and two scholars\. All models reached Moreiras\-Menor as the dedicatee, and all but Seed\-1\.8 Pro \(20/100\) earned full path\-verification credit for the first two hops: identifying Keller’s monograph from the Bleda y Rosa analysis, then finding the dedicatee\. The failure occurs at the final hop, identifying the content of Chapter 4 in Moreiras\-Menor’s 2002 Spanish\-language monograph\. The book is a specialized academic work with limited online presence, and its table of contents is not readily available through standard web search\. The consistent partial success \(40/100 across most models\) followed by final\-hop failure shows that multi\-hop expert search becomes harder as the chain enters less\-indexed specialized sources\.

## Appendix D Example Student\-Constructed Questions

This section gives representative questions from different domains, including grading criteria and construction rationale\. These examples show what it means for students to turn disciplinary knowledge into a verifiable AI evaluation task\.

### D\.1 Example from Foreign Languages and Cultural Studies

Question: Please help me find a fairy tale written between the year when World War II reached its largest scale and the year when the second front of World War II was opened\. The author’s country participated in both World War I and World War II, and the author had a profession other than writer\. The fairy tale was adapted into both a film and a musical\. Find the most common public Italian version of this fairy tale, identify the most frequent word in Chapter 6, and report its frequency\. \(translated from Chinese\)

Reference Answer: The Little Prince, “il”, 8 occurrences \(translated from Chinese\)\.

Grading Criteria:

- 45 points: correctly identify The Little Prince \(translated from Chinese\)\.
- 25 points: correctly identify “il” as the most frequent word in Chapter 6 of the common Italian public version\.
- 30 points: correctly give the frequency as 8\.

This question combines historical filtering, author metadata, adaptation history, language\-version selection, chapter localization, and exact token counting\. Systems typically identify The Little Prince \(translated from Chinese\), but lose points on the Italian text selection or token count\.
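The final step of this question, exact token counting, is simple to state but sensitive to tokenization choices. The sketch below shows one possible counting rule on a toy sentence; it is not the procedure the question designer used, `most_frequent_word` is a hypothetical name, and the sample text is invented rather than taken from the book.

```python
import re
from collections import Counter


def most_frequent_word(chapter_text: str) -> tuple[str, int]:
    """Return the most common lowercase word and its count.

    Words are maximal runs of letters, so accented Italian letters stay
    inside a word while digits, punctuation, and apostrophes split words.
    Ties are broken by first occurrence in the text.
    """
    tokens = re.findall(r"[^\W\d_]+", chapter_text.lower())
    if not tokens:
        raise ValueError("chapter_text contains no word tokens")
    return Counter(tokens).most_common(1)[0]


# Toy text for illustration only; it is not from the Italian edition.
sample = "Il gatto e il cane. Il sole!"
print(most_frequent_word(sample))
```

Under this rule an elided form such as "l'uomo" counts as two tokens. A different rule could change both the winning word and its frequency, which is one reason the grading criteria fix a specific text version.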

### D\.2 Example from Law

Question: In civil law, find a right A that has corresponding legal effect\. In general, A is subject to a limitation\-period system\. For this limitation\-period system, exactly five categories of claims are excluded from its application, and these five categories share a common superordinate concept B\. Please answer in the format “A → B → \[five categories of claims\]\.” \(translated from Chinese\)

Reference Answer: claim right in debt relations → limitation of action → \[“claim for payment of deposit principal and interest”, “claim for redemption of principal and interest on government bonds, financial bonds, and corporate bonds issued to unspecified parties”, “claim for capital contribution arising from an investment relationship”, “claim for payment of child support, elder support, or spousal support”, “where personality rights have been infringed, the victim’s claim for cessation of infringement, removal of obstruction, elimination of danger, elimination of adverse effects, restoration of reputation, and apology”\] \(translated from Chinese\)\.

Grading Criteria: The answer is scored by identifying A, B, and each of the five statutory exception categories, with penalties for missing categories, additional incorrect categories, or failure to preserve the required structured format\.

This question requires locating the doctrinal relation between claim rights and limitation periods, then extracting a closed statutory list\. The main challenge is not naming the limitation system alone, but preserving the exact five\-category legal enumeration without adding adjacent exceptions\.

## Appendix E Technical Evaluation Details

This section gives evaluation methodology details for the technical benchmark artifact produced by the course practice\.

### E\.1 Model Configurations

All models evaluated on QuestBench use a standardized interface and tool access:

Evaluated Models:

- Kimi K2\.5 \[[16](https://arxiv.org/html/2605.21413#bib.bib9)\]: Long\-context model with enhanced search\. We evaluate three runs per question; mean score is averaged within each question, and pass rate is computed as the per\-question fraction of passing runs before cross\-question averaging\.
- Seed\-2 Pro \[[4](https://arxiv.org/html/2605.21413#bib.bib10)\]: Search\-enabled Seed model evaluated with one run per question\.
- Seed\-1\.8 Pro \[[3](https://arxiv.org/html/2605.21413#bib.bib11)\]: Earlier search\-enabled Seed model evaluated with one run per question\.
- DeepSeek\-V3\.2 \[[7](https://arxiv.org/html/2605.21413#bib.bib12)\]: Search\-enabled DeepSeek model evaluated with one run per question\.
- GPT\-5\.5 \[[24](https://arxiv.org/html/2605.21413#bib.bib48)\]: OpenAI frontier model evaluated with one run per question\.
- Claude Opus 4\.7 \[[2](https://arxiv.org/html/2605.21413#bib.bib49)\]: Anthropic reasoning model evaluated with one run per question\.
- Gemini 3\.1 Pro \[[10](https://arxiv.org/html/2605.21413#bib.bib50)\]: Google DeepMind model evaluated with one run per question\.
- GLM 5\.1 \[[35](https://arxiv.org/html/2605.21413#bib.bib51)\]: Zhipu AI model evaluated with one run per question\.
- DeepSeek\-V4 Pro \[[6](https://arxiv.org/html/2605.21413#bib.bib52)\]: DeepSeek next\-generation model evaluated with one run per question\.
- Kimi K2\.6 \[[23](https://arxiv.org/html/2605.21413#bib.bib53)\]: Moonshot AI updated model evaluated with one run per question\.
- MiMo\-V2\.5 Pro \[[32](https://arxiv.org/html/2605.21413#bib.bib54)\]: Xiaomi reasoning model evaluated with one run per question\.
- Qwen 3\.6 Plus \[[1](https://arxiv.org/html/2605.21413#bib.bib55)\]: Alibaba Cloud model evaluated with one run per question\.
- MiniMax M2\.7 \[[22](https://arxiv.org/html/2605.21413#bib.bib56)\]: MiniMax model evaluated with one run per question\.
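The aggregation described for the multi-run model (average within each question first, then across questions) can be sketched as follows. The function names are hypothetical, and the pass threshold of 100 points is an assumption made for this example; this appendix does not state the threshold that defines a passing run.

```python
from statistics import mean

PASS_THRESHOLD = 100  # assumed: a run passes only with full marks


def question_pass_rate(run_scores: list[float],
                       threshold: float = PASS_THRESHOLD) -> float:
    """Fraction of runs on one question that reach the pass threshold."""
    return sum(score >= threshold for score in run_scores) / len(run_scores)


def aggregate(scores_by_question: dict[str, list[float]]) -> tuple[float, float]:
    """Return (mean score, pass rate), each averaged within questions first."""
    per_question_mean = [mean(runs) for runs in scores_by_question.values()]
    per_question_pass = [
        question_pass_rate(runs) for runs in scores_by_question.values()
    ]
    return mean(per_question_mean), mean(per_question_pass)


# Kimi K2.5 run scores on the three cases reported in Appendix C.
kimi_runs = {
    "case1_law": [0, 100, 100],
    "case2_international_relations": [0, 0, 0],
    "case3_arts": [40, 40, 40],
}
mean_score, pass_rate = aggregate(kimi_runs)
print(f"mean score {mean_score:.2f}, pass rate {pass_rate:.2%}")
```

Averaging within a question first keeps a three-run model comparable with single-run models, because every question contributes equal weight regardless of how many runs it received.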

Tool Interface: All models receive identical tool access through a standardized API:

- `search(query: str) -> List[SearchResult]`: Performs a web search using the Google Search API\. Returns the top 10 results with title, URL, and snippet\.
- `visit(url: str) -> str`: Retrieves webpage content with JavaScript rendering\. Returns cleaned text with a 50,000\-character limit\.
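The two tool signatures can be sketched as thin wrappers that enforce the stated limits regardless of the backend. This is an illustration, not the evaluation harness: `make_tools`, `search_backend`, and `fetch_backend` are hypothetical names, and the real search and rendering services are replaced by injected callables.

```python
from dataclasses import dataclass
from typing import Callable

TOP_K = 10               # number of search results returned, as stated above
MAX_PAGE_CHARS = 50_000  # character limit on visited pages, as stated above


@dataclass(frozen=True)
class SearchResult:
    title: str
    url: str
    snippet: str


def make_tools(
    search_backend: Callable[[str], list[SearchResult]],
    fetch_backend: Callable[[str], str],
) -> tuple[Callable[[str], list[SearchResult]], Callable[[str], str]]:
    """Wrap hypothetical backends so every model sees identical limits."""

    def search(query: str) -> list[SearchResult]:
        # Keep only the top results, whatever the backend returns.
        return search_backend(query)[:TOP_K]

    def visit(url: str) -> str:
        # Truncate cleaned page text to the character limit.
        return fetch_backend(url)[:MAX_PAGE_CHARS]

    return search, visit


# Stub backends for demonstration; no network access is performed.
def fake_search(query: str) -> list[SearchResult]:
    return [SearchResult(f"Result {i}", f"https://example.com/{i}", query)
            for i in range(15)]


def fake_fetch(url: str) -> str:
    return "x" * 60_000


search, visit = make_tools(fake_search, fake_fetch)
print(len(search("sub-agency Civil Code")), len(visit("https://example.com/0")))
```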

Resource Constraints:

- Nominal 50 tool\-call budget per question \(search \+ visit combined\), with actual over\-budget runs tracked in analysis
- 30\-minute timeout per question
- No external knowledge base access beyond web search and visit tools
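Because the tool-call budget is nominal and over-budget runs are tracked rather than excluded, the bookkeeping can be sketched as a counter that records calls without blocking them. The `RunBudget` class is a hypothetical illustration; how the actual harness counts calls and handles the timeout is not specified here.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

NOMINAL_BUDGET = 50        # tool calls per question (search + visit combined)
TIMEOUT_SECONDS = 30 * 60  # 30-minute limit per question


@dataclass
class RunBudget:
    """Tracks tool use for one question; over-budget runs are flagged, not stopped."""
    started_at: float = field(default_factory=time.monotonic)
    calls: int = 0

    def record_call(self) -> None:
        """Count one search or visit call."""
        self.calls += 1

    @property
    def over_budget(self) -> bool:
        return self.calls > NOMINAL_BUDGET

    def timed_out(self, now: Optional[float] = None) -> bool:
        """True once more than TIMEOUT_SECONDS have elapsed since the start."""
        current = time.monotonic() if now is None else now
        return current - self.started_at > TIMEOUT_SECONDS


budget = RunBudget()
for _ in range(51):
    budget.record_call()
print(budget.calls, budget.over_budget)
```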

### E\.2 Prompt Template

We use the following standardized prompt for all models:

```
You are an expert research assistant with deep search
capabilities. Your task is to answer the following question
by conducting thorough web-based research.

Question: {question}

Tools available:
- search(query): Search the web and return top results
- visit(url): Visit a webpage and extract its content

Instructions:
1. Conduct systematic searches to locate relevant sources
2. Visit and analyze authoritative sources carefully
3. Synthesize information from multiple sources as needed
4. Provide your final answer in <answer></answer> tags

Final answer format: <answer>Your answer here</answer>
```

This prompt asks for systematic research without giving domain\-specific hints\. It matches expert search settings where users have questions but no pre\-identified sources\.

### E\.3 Scoring Methodology

Each question specifies detailed grading criteria with point allocations\. Scoring follows a two\-stage process:

Stage 1 \- Automatic Extraction and Matching:

- Extract the model answer from the <answer\> tags
- For deterministic criteria \(exact string match, numerical values\), apply automated scoring
- Flag questions requiring manual review \(complex structured answers, multiple acceptable phrasings\)
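Stage 1 can be illustrated with a short sketch. The function names, the normalization rule (case folding and whitespace collapsing), the choice to use the last answer block, and the three outcome labels are all assumptions made for illustration; the actual matching rules are defined per question in the grading criteria.

```python
import re
from typing import Optional

# Matches the content of <answer>...</answer>, including multi-line answers.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def extract_answer(response: str) -> Optional[str]:
    """Return the text of the last <answer> block, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].strip() if matches else None


def normalize(text: str) -> str:
    """Assumed normalization: case-fold and collapse whitespace."""
    return " ".join(text.casefold().split())


def stage1_score(response: str, gold: str) -> str:
    """Return 'pass', 'no_answer', or 'needs_review' (hypothetical labels)."""
    answer = extract_answer(response)
    if answer is None:
        return "no_answer"
    if normalize(answer) == normalize(gold):
        return "pass"
    # Anything that is not an exact match is left to Stage 2 reviewers.
    return "needs_review"
```

In this sketch, only exact matches are decided automatically; every other extracted answer is routed to manual review rather than marked wrong.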

Stage 2 \- Expert Manual Review:

- Domain experts review flagged questions following the provided grading criteria
- Apply partial credit rules specified in question metadata
- Resolve edge cases \(alternative phrasings, additional information\)
- Record scoring rationale for audit

Inter\-Annotator Agreement: For questions requiring manual scoring, two independent annotators score each response\. Cohen’s kappa for scoring agreement is $\kappa = 0.92$ \(substantial agreement\)\. Disagreements were resolved through discussion with the original question designer\.
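For reference, Cohen's kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance from their label frequencies, as (p_o - p_e) / (1 - p_e). The sketch below computes it for two label sequences; the function name and the toy labels are illustrative and are not the paper's annotation data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy example: 1 = pass, 0 = fail.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 1, 0, 1]
kappa = cohens_kappa(annotator_1, annotator_2)  # p_o = 0.75, p_e = 0.5
```

In the toy example the annotators agree on three of four items, but half of that agreement is expected by chance, so kappa is 0.5 rather than 0.75.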

### E\.4 Computational Costs

Tool use differs across models\. GLM 5\.1 uses the most tools on average \(61\.70 calls per question\) and exceeds 50 calls in 45\.95% of runs, but its pass rate is only 35\.14%\. The two top performers, GPT\-5\.5 and Claude Opus 4\.7, use similar numbers of calls to each other \(41\.12 and 41\.94 on average\), fewer than GLM 5\.1, yet achieve the highest pass rates \(57\.58% and 52\.94%\)\. At the other extreme, Gemini 3\.1 Pro reaches a 31\.82% pass rate using only 18\.34 tool calls per question on average, while Seed\-2 Pro uses 14\.33 calls and reaches a 15\.23% pass rate\. These results suggest that increasing search depth is insufficient on its own; models must also formulate effective queries, select authoritative sources, and stop only after verifying the exact answer\.

## Appendix F Detailed Course Artifact Statistics

This section gives additional dataset statistics beyond the main text\. We report them as properties of the benchmark and as traces of the course activity that produced it\.

### F\.1 Complete Domain Distribution

After domain normalization \(merging similar field names like “Law”, “Legal Studies”, and “Law and Sociology” into “Law”\), QuestBench comprises 256 questions across 14 normalized professional domains\. Table [3](https://arxiv.org/html/2605.21413#A6.T3) presents the complete distribution\.

Table 3: Complete domain distribution after normalization for QuestBench\.

| Domain \(Normalized\) | Count | Domain \(Normalized\) | Count |
| --- | --- | --- | --- |
| Literature | 43 | Political Science | 15 |
| Law | 41 | Journalism | 14 |
| History | 33 | Sports | 9 |
| Foreign Languages | 27 | Cultural Geography | 7 |
| Social Sciences | 19 | Information Science | 6 |
| Arts | 15 | Archaeology | 6 |
| International Relations | 15 | Games and Animation | 6 |

The distribution shows broad cross\-domain coverage\. Literature, Law, and History are the largest groups because these fields yielded many difficult questions, but no single domain exceeds 17% of the benchmark\.
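The totals and the 17% bound can be checked directly from the counts in Table 3. The snippet below is a small arithmetic check with the counts transcribed from the table; the variable names are illustrative.

```python
# Domain counts transcribed from Table 3.
DOMAIN_COUNTS = {
    "Literature": 43,
    "Law": 41,
    "History": 33,
    "Foreign Languages": 27,
    "Social Sciences": 19,
    "Arts": 15,
    "International Relations": 15,
    "Political Science": 15,
    "Journalism": 14,
    "Sports": 9,
    "Cultural Geography": 7,
    "Information Science": 6,
    "Archaeology": 6,
    "Games and Animation": 6,
}

total = sum(DOMAIN_COUNTS.values())
shares = {domain: count / total for domain, count in DOMAIN_COUNTS.items()}
largest = max(DOMAIN_COUNTS, key=DOMAIN_COUNTS.get)

print(f"{len(DOMAIN_COUNTS)} domains, {total} questions")
print(f"largest domain: {largest} ({shares[largest]:.1%})")
```

The counts sum to 256 across 14 domains, and the largest domain, Literature, accounts for 43 of 256 questions, which is just under 17%.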

### F\.2 Question and Answer Characteristics

Table [4](https://arxiv.org/html/2605.21413#A6.T4) summarizes question, answer, and grading criteria properties\.

Table 4: Question and answer characteristics\.

| Characteristic | Mean | Median | Std Dev |
| --- | --- | --- | --- |
| Question length \(chars\) | 141\.7 | 126 | 69\.3 |
| Answer length \(chars\) | 29\.3 | 11 | 55\.5 |
| Grading criteria length \(chars\) | 73\.6 | 54 | 64\.6 |

Question length: Questions average 141\.7 characters \(median: 126, std: 69\.3\)\. The longest question has 389 characters and the shortest has 42\.

Answer format distribution:

- Single entity names: 45% \(e\.g\., person name, place name, technical term\)
- Numerical values: 15% \(e\.g\., dates, counts, article numbers\)
- Entity lists: 25% \(e\.g\., multiple documents, sets of related items\)
- Structured information: 15% \(e\.g\., paired data, hierarchical answers\)

Answer length: Answers average 29\.3 characters \(median: 11, std: 55\.5\), indicating concise, verifiable responses\. Longer answers \(up to 455 characters\) occur for questions requiring structured lists or detailed specifications\.

Grading criteria complexity: Average grading criteria length is 73\.6 characters per question \(median: 54, std: 64\.6\), with specifications for exact match requirements, partial credit allocation, acceptable alternative phrasings, and edge\-case handling\. These criteria support consistent evaluation\.
