Design and Report Benchmarks for Knowledge Work

arXiv cs.AI Papers

Summary

This paper proposes a three-step framework for designing and reporting benchmarks for knowledge work AI, emphasizing alignment between benchmark tasks and real-world work activities. It derives 18 work activities from the O*NET database and analyzes three existing benchmarks (GDPval, OfficeQA Pro, APEX-SWE) to demonstrate gaps between benchmark scores and actual work capability.

arXiv:2605.23262v1 Announce Type: new Abstract: The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.

# Design and Report Benchmarks for Knowledge Work
Source: [https://arxiv.org/html/2605.23262](https://arxiv.org/html/2605.23262)
Yining Hua (1), Hongbin Na (2), Cyrus Ayubcha (1), Levi Lian (3, 4)

(1) Harvard University; (2) University of Technology Sydney; (3) Stanford University; (4) Raycaster AI

yininghua@g\.harvard\.edu, hongbin\.na@student\.uts\.edu\.au, cyrusayubcha@hms\.harvard\.edu, levilian@raycaster\.ai

###### Abstract

The development of LLM agents has led to a growing body of work on knowledge\-work AI, including coding, research, and healthcare\. However, current knowledge\-work evaluation and benchmark design still largely follow the evaluation logic of traditional NLP tasks\. As a result, higher benchmark performance does not reliably indicate that a system is better able to carry out knowledge work in real\-world deployment settings\. To help design better benchmarks, this paper contributes a three\-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product\. The paper first reviews work studies that motivate these reporting decisions by showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must be usable in downstream workflows\. We then translate these concerns into benchmark design and reporting guidance, covering how benchmark tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system\. To help name the work activity being evaluated and distinguish it from the benchmark tasks that are commonly tested, the paper derives an inventory of 18 work activities from the O\*NET occupational task database\. We demonstrate the approach through three benchmark case analyses: GDPval, a non\-code occupational deliverable benchmark; OfficeQA Pro, a grounded document\-analysis benchmark scored by final answers; and APEX\-SWE, a software\-engineering benchmark with executable scored products\. These case analyses show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, the tested setting, the scored product, and the broader work claim\.

## 1 Introduction

Knowledge work is a broad category of labor in which knowledge serves as both the source material and the output of work\(Drucker,[1999](https://arxiv.org/html/2605.23262#bib.bib24); Davenport,[2005](https://arxiv.org/html/2605.23262#bib.bib20)\)\. It covers a large share of the labor market\(Porat,[1977](https://arxiv.org/html/2605.23262#bib.bib110)\), including many occupations whose tasks center on information processing, judgment, communication, documentation, and coordination, as reflected in O\*NET task statements\(National Center for O\*NET Development,[2026](https://arxiv.org/html/2605.23262#bib.bib107)\)\. Because much computer\-based work falls within this category, LLM systems are increasingly studied in relation to work\-facing tasks\(Eloundouet al\.,[2023](https://arxiv.org/html/2605.23262#bib.bib25); Brynjolfssonet al\.,[2025](https://arxiv.org/html/2605.23262#bib.bib13)\)\. Recent studies evaluate agents that reason through intermediate steps\(Yaoet al\.,[2023](https://arxiv.org/html/2605.23262#bib.bib125)\), use external tools\(Schicket al\.,[2023](https://arxiv.org/html/2605.23262#bib.bib111)\), coordinate across multiple agents\(Wuet al\.,[2023](https://arxiv.org/html/2605.23262#bib.bib123)\), operate in web environments\(Zhouet al\.,[2023](https://arxiv.org/html/2605.23262#bib.bib126)\), control desktop systems\(Xieet al\.,[2024](https://arxiv.org/html/2605.23262#bib.bib124)\), and complete enterprise\(Drouinet al\.,[2024](https://arxiv.org/html/2605.23262#bib.bib22)\)or office tasks\(Wanget al\.,[2024b](https://arxiv.org/html/2605.23262#bib.bib118)\)\.

Knowledge work differs from standard natural language processing \(NLP\) tasks such as information retrieval\(Thakuret al\.,[2021](https://arxiv.org/html/2605.23262#bib.bib131)\), summarization\(Nallapatiet al\.,[2016](https://arxiv.org/html/2605.23262#bib.bib68); Narayanet al\.,[2018](https://arxiv.org/html/2605.23262#bib.bib69)\), and function\-level code generation\(Chenet al\.,[2021](https://arxiv.org/html/2605.23262#bib.bib18)\), which usually evaluate a bounded input\-output behavior rather than the production of a work artifact within a situated workflow\. Evaluation design and reporting guidance for this broader class of work remains underdeveloped; much LLM and agent evaluation is still organized around scenarios, component tasks, and metrics rather than work\-product claims\(Wanget al\.,[2024a](https://arxiv.org/html/2605.23262#bib.bib119); Lianget al\.,[2023](https://arxiv.org/html/2605.23262#bib.bib63)\)\. In the context of knowledge\-work agents, however, these scores are often inaccurately used to represent a system’s capability across a broader class of work, such as research synthesis, document revision, clinical triage, or administrative coordination\.

This inference is fragile because the output of knowledge work cannot be understood only through its visible content\. An answer, reply, or patch may have different meanings depending on the role, materials, setting, and receiving workflow in which it is produced\. A benchmark that reports only the final output produced by a system, therefore, cannot show whether that output can support downstream coordination and continuation\(Carlile,[2002](https://arxiv.org/html/2605.23262#bib.bib16),[2004](https://arxiv.org/html/2605.23262#bib.bib17); Malone and Crowston,[1994](https://arxiv.org/html/2605.23262#bib.bib101)\), such as being checked, revised, filed, executed, or used in a later workflow step\. For instance, a patch may pass a benchmark oracle while failing developer expectations\(Wanget al\.,[2025](https://arxiv.org/html/2605.23262#bib.bib132)\), and AI assistance can improve some knowledge tasks while worsening performance on others in field settings\(Dell’Acquaet al\.,[2023](https://arxiv.org/html/2605.23262#bib.bib21)\)\.

To prevent a mismatch between what a benchmark score measures and the knowledge\-work capability it is used to claim, this paper contributes a design and reporting approach for tying benchmark scores to broader work claims through three steps: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product\. For each step, benchmark reports should state what is represented by the task, which setting simplifications are made for evaluation, and which parts of the broader work remain outside the score\. Section [2](https://arxiv.org/html/2605.23262#S2) explains why work activity, tested setting, and downstream use matter for benchmark interpretation\. Section [3](https://arxiv.org/html/2605.23262#S3) develops the three\-step approach as a reporting structure: \(1\) identify the work activity, \(2\) specify the tested setting, and \(3\) score the proper work product\. It also derives an O\*NET\-based inventory for identifying work activities \(National Center for O\*NET Development, [2026](https://arxiv.org/html/2605.23262#bib.bib107)\)\. Section [4](https://arxiv.org/html/2605.23262#S4) demonstrates the approach through three benchmark case analyses: GDPval, OfficeQA Pro, and APEX\-SWE\. Section [5](https://arxiv.org/html/2605.23262#S5) discusses limitations, alternative interpretations, and future directions\. Table [1](https://arxiv.org/html/2605.23262#S1.T1) gives the definitions used throughout the paper\.

Table 1: Definitions used in this paper\.

## 2 What Knowledge Work Requires

Knowledge work is commonly defined as labor in which knowledge is a primary input and output of work\(Drucker,[1999](https://arxiv.org/html/2605.23262#bib.bib24); Davenport,[2005](https://arxiv.org/html/2605.23262#bib.bib20)\)\. For benchmark design, the main issue is that scores are often attached to work\-capability claims broader than the task, setting, or scored object actually tested\. This paper therefore focuses on three aspects of representation that are often left implicit when NLP\-style task scores are used to support claims about knowledge\-work capability: what work the task represents, the conditions under which that work is tested, and the object evaluated as evidence of success\. This focus is consistent with validity theory’s broader concern with score interpretation and use\(Messick,[1995](https://arxiv.org/html/2605.23262#bib.bib2); Kane,[2013](https://arxiv.org/html/2605.23262#bib.bib94)\), but the contribution here is a benchmark\-design and reporting account rather than a full psychometric account\.

Research on professional jurisdiction shows that expert work is organized through roles, authority, problem areas, and boundaries of responsibility \(Abbott, [1988](https://arxiv.org/html/2605.23262#bib.bib5)\)\. Freidson similarly describes professionalism as a form of specialized work organized around occupational control and responsibility \(Freidson, [2001](https://arxiv.org/html/2605.23262#bib.bib29)\)\. These accounts show that a visible output does not, by itself, identify the work being evaluated\. Similar outputs can carry different responsibilities depending on whether they function as advice, analysis, documentation, review, or decision support\.

Research on situated action shows that performance depends on the conditions in which action occurs\. Situated\-action theory argues that action is organized through local materials, tools, instructions, and social circumstances, rather than through an abstract task description alone \(Suchman, [1987](https://arxiv.org/html/2605.23262#bib.bib112)\)\. Distributed\-cognition accounts similarly treat cognition as distributed across people, artifacts, and environments \(Hutchins, [1995](https://arxiv.org/html/2605.23262#bib.bib3)\)\. These accounts show that the same work activity can support different claims depending on the materials provided, tools available, role assigned, and workflow constraints imposed by the benchmark\.

Research on boundary objects, knowledge transfer, and coordination shows that knowledge\-work outputs often need to move across actors, systems, and dependent activities\. Boundary\-object work explains how artifacts support coordination across communities while remaining usable in different local contexts \(Star and Griesemer, [1989](https://arxiv.org/html/2605.23262#bib.bib4)\)\. Carlile’s work on knowledge boundaries shows that knowledge crossing organizational boundaries often requires representation, translation, and transformation \(Carlile, [2002](https://arxiv.org/html/2605.23262#bib.bib16), [2004](https://arxiv.org/html/2605.23262#bib.bib17)\)\. Coordination theory treats work as the management of dependencies among activities \(Malone and Crowston, [1994](https://arxiv.org/html/2605.23262#bib.bib101)\)\. These accounts direct attention to the object left for review, filing, execution, or continuation\.

Work activity, tested setting, and work product do not provide an exhaustive theory of knowledge work or a complete account of benchmark quality\. Other issues remain important, including task sampling, rubric design, grader reliability, metric aggregation, robustness, fairness, and consequences of use\. Our narrower claim is that work activity, tested setting, and work product provide a minimum reporting structure for keeping knowledge\-work benchmark scores tied to the work they actually represent and score\.

## 3 A Three\-Step Approach for Benchmark Design and Reporting

### 3\.1 Define the work activity the benchmark is meant to represent

The first design question is what work activity the benchmark is meant to represent\. This paper uses *work activity* as the reporting unit because common alternatives are either too narrow or too broad\. A *component task* such as retrieval, summarization, classification, tool use, or answer generation is often too small for a knowledge\-work claim\. An *occupation* or *domain* such as medicine, law, finance, or software engineering is usually too broad, because it contains many activities with different materials, roles, products, and downstream uses\. Work activity is the middle level needed here: it names the work being claimed while still allowing the benchmark report to explain how the task proxy, tested setting, and scored work product represent it\.

Current knowledge\-work benchmarks often define their scope through three types of labels\. One is a domain or occupation label, such as “healthcare”\(Aroraet al\.,[2025](https://arxiv.org/html/2605.23262#bib.bib8)\), “legal”\(Guhaet al\.,[2023](https://arxiv.org/html/2605.23262#bib.bib81)\), “enterprise documents”\(Opsahl\-Onget al\.,[2026](https://arxiv.org/html/2605.23262#bib.bib108)\), “office work”\(Wanget al\.,[2024b](https://arxiv.org/html/2605.23262#bib.bib118)\), or “software engineering”\(Kottamasuet al\.,[2026](https://arxiv.org/html/2605.23262#bib.bib96)\)\. These labels are useful for locating a benchmark in an application area, but they are usually too broad to define the construct being evaluated\. A domain or occupation contains many work activities with different roles, materials, decision boundaries, and standards of adequacy\. This is a familiar problem in occupational analysis: occupational categories group heterogeneous tasks\(Handel,[2016](https://arxiv.org/html/2605.23262#bib.bib84)\), while task statements provide a more direct description of what workers actually do\(National Center for O\*NET Development,[2026](https://arxiv.org/html/2605.23262#bib.bib107)\)\. For benchmark design, the same issue appears when a score for “healthcare,” “legal,” or “software engineering” is read as evidence for the whole domain\. Such a score may reflect strong performance on a small subset of activities while leaving other activities unsampled, weakly tested, or outside the benchmark entirely\.
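The coverage problem described above can be made concrete with a small audit. The following is a minimal sketch, assuming each benchmark task has already been hand-labeled with the work activities it exercises. The task identifiers, the labels, and the choice of target activities are hypothetical and are not taken from any benchmark discussed in this paper.

```python
from collections import Counter

# Hypothetical target activities and task labels, for illustration only.
TARGET_ACTIVITIES = ["investigation", "analysis", "inspection", "record-keeping"]

task_labels = {
    "task-001": ["analysis"],
    "task-002": ["analysis", "investigation"],
    "task-003": ["analysis"],
}


def activity_coverage(labels, targets):
    """Count how many benchmark tasks exercise each target activity."""
    counts = Counter(a for activities in labels.values() for a in activities)
    return {activity: counts[activity] for activity in targets}


def unsampled(coverage):
    """Return the target activities that no benchmark task exercises."""
    return [activity for activity, n in coverage.items() if n == 0]


coverage = activity_coverage(task_labels, TARGET_ACTIVITIES)
print(coverage)
print("unsampled:", unsampled(coverage))
```

In this toy example a single domain-level score would be driven almost entirely by one activity, while two target activities are never sampled. Reporting the per-activity counts alongside the aggregate score makes that visible.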

A second type is a component\-task label, such as “question answering”\(Rajpurkaret al\.,[2016](https://arxiv.org/html/2605.23262#bib.bib70)\), “summarization”\(Nallapatiet al\.,[2016](https://arxiv.org/html/2605.23262#bib.bib68); Narayanet al\.,[2018](https://arxiv.org/html/2605.23262#bib.bib69)\), “retrieval”\(Thakuret al\.,[2021](https://arxiv.org/html/2605.23262#bib.bib131)\), or “tool use”\(Schicket al\.,[2023](https://arxiv.org/html/2605.23262#bib.bib111)\)\. These labels create the opposite problem: under\-coverage\. They identify useful component capabilities, but they do not by themselves define the larger work activity that the component is meant to support\. Retrieval, embedding, summarization, or classification scores are useful evidence for knowledge\-work agents, but they do not, by themselves, show that a system can perform*investigation*,*analysis*,*inspection*, or*record\-keeping*\. A retrieval score becomes evidence for a work activity only when the benchmark also tests how retrieved materials are selected, compared, interpreted, and incorporated into the scored work product\.

A third type is an agent\-task label, such as browser task\(Zhouet al\.,[2023](https://arxiv.org/html/2605.23262#bib.bib126)\), desktop task\(Xieet al\.,[2024](https://arxiv.org/html/2605.23262#bib.bib124)\), office task\(Wanget al\.,[2024b](https://arxiv.org/html/2605.23262#bib.bib118)\), user\-interaction task\(Yaoet al\.,[2025](https://arxiv.org/html/2605.23262#bib.bib137)\), or software\-engineering task\(Kottamasuet al\.,[2026](https://arxiv.org/html/2605.23262#bib.bib96)\)\. These labels create a different problem: opaque coverage\. Because an agent task may include search, tool use, file editing, calculation, and submission in a single episode, a single task can map to multiple work activities\. This makes agent evaluation closer to real work, but it also makes the score harder to interpret\. An aggregate task score may hide which activities were required, which were only incidental to the task context, and which activity caused success or failure\. A benchmark should therefore treat task\-to\-activity mapping as many\-to\-many, while distinguishing activities required for the scored product from those that appear only in the background context\. At minimum, the report should define the target work activity, explain how the benchmark task approximates it, and state which parts of the activity are omitted or only weakly tested\.
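One possible way to record a many-to-many task-to-activity mapping is sketched below. It assumes a simple in-memory representation in which each task lists the activities required for the scored product separately from those that appear only in the background context. The class name, field names, episode identifiers, and activity assignments are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TaskActivityMap:
    """Activities tied to one benchmark task (all values hypothetical)."""
    task_id: str
    required: set = field(default_factory=set)    # needed for the scored product
    background: set = field(default_factory=set)  # appear only in task context


def tasks_by_activity(maps, include_background=False):
    """Invert the mapping: activity -> set of task ids that involve it."""
    index = {}
    for m in maps:
        activities = (m.required | m.background) if include_background else m.required
        for activity in activities:
            index.setdefault(activity, set()).add(m.task_id)
    return index


maps = [
    TaskActivityMap("episode-1", required={"analysis", "record-keeping"},
                    background={"coordination"}),
    TaskActivityMap("episode-2", required={"analysis"}),
]
required_index = tasks_by_activity(maps)
print(required_index)
```

Keeping the two sets apart means an aggregate score can be broken down by the activities that actually determined success, without crediting the benchmark for activities that were merely present in the episode.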

Since there is currently no cross\-occupation work\-activity inventory for this purpose, this paper derives a preliminary work\-activity reference set from O\*NET task statements\(National Center for O\*NET Development,[2026](https://arxiv.org/html/2605.23262#bib.bib107)\)for future benchmark design, coverage audits, and the illustrative case analyses in Section[4](https://arxiv.org/html/2605.23262#S4)\. Figure[1](https://arxiv.org/html/2605.23262#S3.F1)shows the construction process of the inventory\. The derivation starts from O\*NET 30\.2 task statements in Job Zones 3–5, which cover technical, analytical, managerial, and other non\-routine work\(National Center for O\*NET Development,[2026](https://arxiv.org/html/2605.23262#bib.bib107)\)\. After restricting to knowledge\-work occupations and screening out direct manual, performative, and routine clerical task statements, the reporting corpus contains 12,464 task statements\. A stricter atlas\-inclusion screen retains 8,372 statements for profession\-neutral rewriting, embedding, clustering, and visualization\. The retained statements are rewritten into profession\-neutral work\-activity phrases, clustered, and consolidated through expert\-panel review\. The resulting 18 work\-activity labels are then applied back to the 12,464\-task reporting corpus\. Construction details are reported in Appendix[A](https://arxiv.org/html/2605.23262#A1)\.
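As a quick arithmetic check on the screening funnel, the sketch below computes the share of statements retained between consecutive counts reported in the text and in Figure 1. The stage names are shorthand introduced here, and the helper function is illustrative only. It does not reproduce the screening, embedding, or clustering steps themselves.

```python
# Statement counts as reported in the text and in Figure 1.
stages = [
    ("O*NET 30.2 task statements", 18_796),
    ("screened reporting corpus", 12_464),
    ("atlas-included statements", 8_372),
]


def retention(stage_list):
    """Share of statements kept at each stage relative to the previous one."""
    return [
        (later_name, later_n / earlier_n)
        for (_, earlier_n), (later_name, later_n) in zip(stage_list, stage_list[1:])
    ]


shares = retention(stages)
for name, share in shares:
    print(f"{name}: {share:.1%} of previous stage")
```

Roughly two thirds of the statements survive each of the two screens, so the 18 labels are derived from a substantially narrower set than the full O\*NET task list.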

\[Figure 1 diagram\. Stages: O\*NET 30\.2 task statements \(18,796\) → 12,464 screened statements → 8,372 atlas\-included statements → 108 dense task groups → 18 cross\-occupation work activities\. Step labels, in order: Job Zones 3–5, knowledge\-work filter; task\-level screen; profession\-neutral rewrite, embed, UMAP \+ HDBSCAN clustering; LLM summary, expert\-panel review\.\]

Figure 1: Construction pipeline for the 18\-work\-activity inventory\.

Figure [2](https://arxiv.org/html/2605.23262#S3.F2) summarizes the clustering result used to derive the work\-activity inventory\. The layout shows that the final labels are not arbitrary post hoc categories: many activities occupy coherent neighborhoods, while several also appear in separated regions that correspond to different occupational contexts\. Interactive maps of the mapping result with detailed task descriptions can be found at [knowledge\-work\-position/index\.html](https://ningkko.github.io/knowledge-work-position/index.html)\.

![Refer to caption](https://arxiv.org/html/2605.23262v1/operation_atlas.png)

Figure 2: Work\-activity atlas of O\*NET knowledge work\.

Table [2](https://arxiv.org/html/2605.23262#S3.T2) reports 18 recurrent work\-activity groups across knowledge\-work occupations\. Many current NLP and agent benchmarks are organized around component tasks, task environments, or end\-state success rather than cross\-occupation work activities \(Wang et al\., [2024a](https://arxiv.org/html/2605.23262#bib.bib119)\)\. As a result, many work activities in Table [2](https://arxiv.org/html/2605.23262#S3.T2) are evaluated only through partial proxies\. For example, advice\-like behavior may appear as a helpful\-response task \(Bai et al\., [2022](https://arxiv.org/html/2605.23262#bib.bib133)\) rather than as a case\-specific recommendation with role, evidence, and handoff constraints\. *Coordination* may appear as scheduling, task routing \(Wang et al\., [2024b](https://arxiv.org/html/2605.23262#bib.bib118)\), or tool\-mediated state completion \(Drouin et al\., [2024](https://arxiv.org/html/2605.23262#bib.bib22)\) in office and enterprise environments\. *Investigation* may be proxied by retrieval \(Thakur et al\., [2021](https://arxiv.org/html/2605.23262#bib.bib131)\), knowledge\-intensive question answering \(Petroni et al\., [2021](https://arxiv.org/html/2605.23262#bib.bib109)\), or summarization \(Nallapati et al\., [2016](https://arxiv.org/html/2605.23262#bib.bib68); Narayan et al\., [2018](https://arxiv.org/html/2605.23262#bib.bib69)\), even though investigation also requires selecting evidence, reconstructing events or patterns, and preserving an interpretable basis for the finding\. *Record\-keeping* may be proxied by form understanding \(Jaume et al\., [2019](https://arxiv.org/html/2605.23262#bib.bib134)\), document question answering \(Mathew et al\., [2021](https://arxiv.org/html/2605.23262#bib.bib136)\), extraction, or form filling \(Wang et al\., [2024b](https://arxiv.org/html/2605.23262#bib.bib118)\)\. *Inspection* may be proxied by classification \(Muennighoff et al\., [2023](https://arxiv.org/html/2605.23262#bib.bib129)\) or legal\-reasoning tasks \(Guha et al\., [2023](https://arxiv.org/html/2605.23262#bib.bib81)\), depending on the benchmark design\.

A broad claim, such as software\-engineering, healthcare, or office\-work capability, can be mapped to one or more work activities\. Each work activity can then be operationalized through smaller task modules at different levels of granularity\. The benchmark design should consider and explain how these lower\-level tasks represent the target work activity and the work claim\.

Table 2: The 18 cross\-occupation work activities derived from O\*NET task statements\. Counts are assigned task statements, not labor\-market weights or AI deployment estimates\. Occupations/settings and benchmark proxies are illustrative; the interactive atlas provides supplementary examples and adjacent\-label contrasts\.

### 3\.2 Specify the tested setting

The second design question is to specify the setting in which a work activity is evaluated\. When a benchmark asks whether a system can complete a task, it also defines the materials, tools, role, and workflow state in which that task is completed\. These choices condition the interpretation of the resulting score\. For example, if the benchmark already provides selected sources, the score says less about the system’s ability to identify relevant sources\. If the benchmark already specifies the output form, the score says less about whether the system can decide what product the workflow needs\. These fixed elements may be reasonable simplifications, but they should be included in score interpretation\. The benchmark report should state whether the tested setting is the intended target of the claim or a simplified setting used to stand in for a broader class of work settings\.

SWE\-bench, a software\-engineering benchmark built from real GitHub issues \(Jimenez et al\., [2024](https://arxiv.org/html/2605.23262#bib.bib92)\), illustrates how tested settings bound score interpretation\. Each score in SWE\-bench is produced in a setting defined by a GitHub issue, a repository state, and a test oracle, meaning the correctness checks used to determine whether the submitted patch solves the issue\. A higher score therefore supports a specific claim: the agent can generate a patch that passes the benchmark tests under the given issue description, repository state, and validation procedure\.

A broader software\-engineering claim requires additional justification that the benchmark represents the relevant activities and settings\. The score does not separately observe many other parts of software\-engineering work, such as requirement clarification, code review, deployment planning, rollback, long\-term maintenance, or coordination with maintainers\. Accordingly, the supported claim should be limited to the activities, settings, and products that the benchmark explicitly represents and checks\. To make tested setting choices explicit, a benchmark should at least report four dimensions derived from the knowledge\-work attributes described in Section[2](https://arxiv.org/html/2605.23262#S2):

1. Materials: the artifacts available to the system, such as documents, tables, policies, web pages, and databases \(Suchman, [1987](https://arxiv.org/html/2605.23262#bib.bib112)\); whether these materials include irrelevant items, conflicting sources, outdated versions, or missing information; and whether the benchmark supplies selected materials or requires the system to find them \(Carlile, [2002](https://arxiv.org/html/2605.23262#bib.bib16)\)\.
2. Tools: whether the system can search, read, edit, calculate, write files, call APIs, operate applications, or only produce text; and whether the tool state is persistent across steps\. WebArena evaluates web interaction \(Zhou et al\., [2023](https://arxiv.org/html/2605.23262#bib.bib126)\)\. OSWorld evaluates desktop work activity \(Xie et al\., [2024](https://arxiv.org/html/2605.23262#bib.bib124)\)\. WorkArena evaluates enterprise work environments \(Drouin et al\., [2024](https://arxiv.org/html/2605.23262#bib.bib22)\)\.
3. Role and scope: what role the system is asked to take, what it is allowed to decide, what it must defer, and when it should escalate or refuse\. Jurisdictional role boundaries are central to the sociology of professions \(Abbott, [1988](https://arxiv.org/html/2605.23262#bib.bib5)\)\. Professional authority and responsibility are also central to expert work \(Freidson, [2001](https://arxiv.org/html/2605.23262#bib.bib29)\)\.
4. Workflow state: whether the task occurs at initial drafting, review, revision, verification, routing, filing, handoff, or downstream execution \(Carlile, [2004](https://arxiv.org/html/2605.23262#bib.bib17); Malone and Crowston, [1994](https://arxiv.org/html/2605.23262#bib.bib101)\); and what state should remain after completion, such as an updated file, a submitted form, a reviewable diff, or a routed record\.

A benchmark can reasonably evaluate only one workflow stage, use cleaned source materials, restrict tools, or specify the required output form\. The benchmark design should still state that these choices were made and explain how they bound the claim supported by the score, including whether the tested setting should be read as a narrow setting claim or as a justified representation of a broader class of work settings\.
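The four dimensions could be captured in a small structured record so that unstated choices are easy to spot. The sketch below is one hypothetical representation. The class and field names are assumptions made for illustration, and the example entry is a partial paraphrase of the SWE\-bench description above, not an official specification of that benchmark.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SettingReport:
    """Four tested-setting dimensions; None means the report is silent."""
    materials: Optional[str] = None
    tools: Optional[str] = None
    role_and_scope: Optional[str] = None
    workflow_state: Optional[str] = None

    def unreported(self):
        """Names of dimensions the benchmark report leaves unstated."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Illustrative, deliberately partial entry (hypothetical wording).
example = SettingReport(
    materials="GitHub issue and repository state supplied by the benchmark",
    workflow_state="submitted patch is checked by a test oracle",
)
print(example.unreported())
```

A record like this does not judge whether a simplification is acceptable. It only forces each dimension to be either stated or visibly left blank.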

### 3\.3 Score the proper work product

Once the work activity and tested setting have been specified, the benchmark should define the expected work product\. This is the artifact or state the system is supposed to leave after completing the activity in that setting, such as a final response, revised document, database update, evidence trace, handoff note, or other object for review or downstream use\. The scoring design should then state which components of this product are evaluated\. A benchmark score provides evidence for the object it checks\(Messick,[1989](https://arxiv.org/html/2605.23262#bib.bib105); Kane,[2013](https://arxiv.org/html/2605.23262#bib.bib94)\)\. If the benchmark checks only a final answer, chat response, rewritten paragraph, or simulator success state, then the supported claim should be limited to what these scored objects can show\(Jacobs and Wallach,[2021](https://arxiv.org/html/2605.23262#bib.bib90); Kane,[2013](https://arxiv.org/html/2605.23262#bib.bib94)\)\. A broader claim about knowledge\-work performance requires the report to justify how the scored object represents the expected work product, or to state which parts of that product remain untested\.

For example, a benchmark that aims to evaluate source\-grounded document revision should specify whether it scores only the rewritten text or the revised artifact as a work product\. If the scored object is only a rewritten paragraph, the score can show that the system produced revised text, but it gives limited evidence about whether the system completed document revision work\. Artifact\-level scoring would also check which locations were changed, whether source\-supported content was preserved, whether unsupported or outdated claims were corrected, whether citations, comments, or version structure remain traceable, and whether a reviewer can see the basis for the change\. A score designed around these components can support a stronger claim about source\-grounded document revision than a score based only on fluency or local textual improvement, because the scored object better represents the artifact that revision work is expected to leave\.
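To illustrate the difference between text-level and artifact-level scoring, the sketch below checks two of the components listed above: which locations changed, and whether each change carries a traceable basis. It assumes a simplified artifact in which the original and revised documents are paragraph-aligned lists and the reviewer-facing notes map a paragraph index to a source identifier. These structures and names are hypothetical.

```python
def changed_locations(original, revised):
    """Indices of paragraphs whose text differs between the two versions."""
    if len(original) != len(revised):
        raise ValueError("sketch assumes paragraph-aligned versions")
    return [i for i, (old, new) in enumerate(zip(original, revised)) if old != new]


def artifact_checks(original, revised, change_notes):
    """Compare edits against notes mapping paragraph index -> source id."""
    changed = changed_locations(original, revised)
    return {
        "changed_locations": changed,
        # Edits a reviewer cannot trace back to a source.
        "changes_without_basis": [i for i in changed if i not in change_notes],
        # Notes that claim an edit where the text is unchanged.
        "notes_without_change": sorted(set(change_notes) - set(changed)),
    }


# Hypothetical three-paragraph document with one documented edit.
original = ["Intro text.", "Claim from 2019 data.", "Closing text."]
revised = ["Intro text.", "Claim from 2023 data.", "Closing text, reworded."]
report = artifact_checks(original, revised, {1: "source-3"})
print(report)
```

A fluency-only score would treat both edits alike. The artifact-level check separates the edit with a stated basis from the one a reviewer cannot trace.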

Rubric design should also be derived from the work product\. A rubric should move beyond a general quality judgment over the visible output, such as helpfulness, clarity, correctness, or completeness\. These criteria can be useful, but they may collapse different failures, such as unsupported evidence use, role violation, missing handoff information, and incorrect content, into a single quality judgment\. For a knowledge\-work benchmark, the rubric should turn the work activity and tested setting already defined in the first two steps into checkable questions:

1. whether the result used the materials that should have been used in the setting;
2. whether the result stayed within the assigned role, scope, and workflow step;
3. whether the result left information that a downstream actor or system needs to perform its tasks, such as checking, modifying, approving, executing, or continuing the work\.

These questions correspond to the source, boundary, and destination of the work product\. Situated\-action work motivates attention to source materials and local constraints \(Suchman, [1987](https://arxiv.org/html/2605.23262#bib.bib112)\)\. Knowledge\-transfer work motivates attention to boundary crossing and downstream use \(Carlile, [2002](https://arxiv.org/html/2605.23262#bib.bib16), [2004](https://arxiv.org/html/2605.23262#bib.bib17)\)\. If these questions do not enter the rubric, then expert scoring may still reward a response that appears fluent, coherent, and complete without checking whether it satisfies the work\-product requirements\.
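A rubric built on the source, boundary, and destination questions might keep the three verdicts separate instead of folding them into one quality score. The following is a minimal sketch of that idea. The class and field names are hypothetical, and a real rubric would need graded criteria and grader instructions that this example leaves out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricResult:
    """Separate verdicts for the three work-product questions."""
    source_ok: bool       # used the materials the setting required
    boundary_ok: bool     # stayed within role, scope, and workflow step
    destination_ok: bool  # left what a downstream actor or system needs

    def failures(self):
        """Name the questions that failed, without collapsing them."""
        names = ("source", "boundary", "destination")
        flags = (self.source_ok, self.boundary_ok, self.destination_ok)
        return [name for name, ok in zip(names, flags) if not ok]


# A fluent response that exceeded its role and omitted handoff information.
result = RubricResult(source_ok=True, boundary_ok=False, destination_ok=False)
print(result.failures())
```

Reporting the named failures lets a reader distinguish a role violation from missing handoff information, which a single helpfulness or completeness rating would merge.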

When a rubric is used, the benchmark report should explain how each criterion follows from the expected work product and how the grader applies it\.
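One way to keep the three questions separate, instead of folding them into a single quality judgment, is to record each rubric criterion together with the dimension it belongs to and the reason it follows from the work product\. The following is a minimal Python sketch under that assumption; the `Criterion` class, its field names, and the example criteria are hypothetical illustrations and are not taken from any benchmark discussed in this paper\.

```python
from dataclasses import dataclass

# The three dimensions named in the text: source, boundary, destination.
DIMENSIONS = ("source", "boundary", "destination")


@dataclass(frozen=True)
class Criterion:
    """One checkable rubric question (hypothetical structure)."""
    dimension: str   # which of DIMENSIONS the question belongs to
    question: str    # what the grader checks
    rationale: str   # how the question follows from the expected work product
    passed: bool     # the grader's judgment for one submission


def summarize(criteria):
    """Return pass counts per dimension instead of a single quality score."""
    summary = {d: {"passed": 0, "total": 0} for d in DIMENSIONS}
    for c in criteria:
        if c.dimension not in summary:
            raise ValueError(f"unknown dimension: {c.dimension}")
        summary[c.dimension]["total"] += 1
        summary[c.dimension]["passed"] += int(c.passed)
    return summary


# Illustrative criteria only; a real rubric would derive them from the
# work activity and tested setting defined for the benchmark.
example = [
    Criterion("source", "Cites only the provided reference files?",
              "The setting fixes which materials may be used.", True),
    Criterion("boundary", "Stays within the assigned reviewer role?",
              "The role defines what the result may decide.", True),
    Criterion("destination", "Lists open items for the approver?",
              "The next actor needs them to continue the work.", False),
]

result = summarize(example)
print(result)
```

Reporting the per\-dimension counts keeps a missing handoff visible even when the response looks fluent and complete, and the `rationale` field gives the report a place to state how each criterion follows from the expected work product\.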

## 4 Benchmark Case Analyses Through the Three\-Step Approach

This section applies the three\-step approach in Section [3](https://arxiv.org/html/2605.23262#S3) to three released benchmark cases\. The cases were selected to illustrate three common scoring designs: expert\-graded occupational deliverables, answer\-scored grounded document analysis, and executable software\-state changes\. Because each benchmark suite contains heterogeneous instances, the case analyses below are case\-level demonstrations of the approach\. They should not be read as complete characterizations of all tasks in any benchmark\. We use each case to identify the supported claim and its limits\.

GDPval is introduced as a benchmark for evaluating model capabilities on real\-world, economically valuable knowledge\-work tasks, drawn from 44 occupations and 9 sectors, with an open gold subset of 5 cases per occupation\. The benchmark provides task context and reference files, asks for deliverables such as documents, slides, diagrams, spreadsheets, and multimedia artifacts, and scores outputs through head\-to\-head expert comparison between the model deliverable and a human expert deliverable \(Patwardhan et al\., [2025](https://arxiv.org/html/2605.23262#bib.bib31)\)\.

In a released compliance case \(task id 7bbfcfe9\-132d\-4194\-82bb\-d6f29d001b01\), the model is placed in a federal grants management role and tasked with creating a Federal Applicant Risk Assessment Tool to assess applicant risk in a grant review context\. Using the three\-step approach, this case is not evidence for grant administration as a whole\. It should be read conservatively as primarily *inspection* and *design*, with *record\-keeping* only insofar as the resulting assessment tool can support later review rather than because the benchmark observes an update to a canonical record\. The provided setting specifies the role, task context, reference materials, and deliverable form\. The scored product is the submitted compliance artifact, evaluated against the corresponding human expert artifact\. The score therefore supports a claim about producing a specified compliance deliverable under a fixed occupational prompt\.

GDPval makes the proxy relatively explicit: a self\-contained digital deliverable stands in for part of occupational work\. It also states important setting simplifications, including the focus on computer\-based tasks, the exclusion of manual and physical work, and the use of precisely specified one\-shot tasks rather than interactive task discovery\. The remaining gap is the relation between deliverable quality and downstream occupational use\. The scored artifact is not linked to a grant\-review workflow, approval record, filing action, revision history, or audit trail\. The public gold subset also has only five cases per occupation, so occupation\-level claims require additional task\-to\-activity reporting\.

OfficeQA Pro is introduced as a benchmark for grounded, multi\-document reasoning over a large corpus of U\.S\. Treasury Bulletins\. Its tasks require document parsing, retrieval, and analytical reasoning across text and tabular data, and the Pro split uses hard questions with verifiable answers \(Opsahl\-Ong et al\., [2026](https://arxiv.org/html/2605.23262#bib.bib108)\)\.

The national\-defense expenditure case \(task id UID0005\) asks the model to use reported monthly values from 1953 and 1940, incorporate an external annual\-average CPI\-U value, apply inflation correction, and round the result\. The benchmark scores the final answer using exact match or numerical tolerance rather than scoring a separate reasoning trace or analyst artifact\. Using the three\-step approach, the case represents *record\-keeping* and *analysis*\. It involves *record\-keeping* because the model must locate and use the correct Treasury Bulletin records, and *analysis* because it must combine retrieved values, an external value, and a specified calculation into a result\. The provided setting is a fixed document corpus with a closed analytical question\. The scored product is the final numerical answer\. The score therefore supports a claim about grounded numerical analysis over Treasury Bulletin records\.
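To make concrete what a final\-answer score observes in a case of this kind, the following minimal Python sketch combines a generic CPI\-ratio inflation adjustment with an exact\-match\-or\-tolerance check\. It is an illustration under stated assumptions, not the benchmark's scoring code: the function names, the 1% relative tolerance, and all numeric inputs are hypothetical placeholders, and the actual case may specify a different calculation\.

```python
import math


def inflation_adjust(nominal, cpi_base, cpi_target):
    """Express a base-period amount in target-period prices via a CPI ratio."""
    return nominal * (cpi_target / cpi_base)


def score_answer(predicted, reference, rel_tol=0.01):
    """Return True on an exact match or a match within a relative tolerance.

    rel_tol=0.01 is an assumed value, not the benchmark's setting.
    """
    if predicted == reference:
        return True
    return math.isclose(predicted, reference, rel_tol=rel_tol)


# Placeholder inputs, not actual Treasury Bulletin or CPI-U figures.
adjusted = round(inflation_adjust(nominal=100.0, cpi_base=20.0, cpi_target=30.0), 2)
print(adjusted)
print(score_answer(adjusted, 150.0))
print(score_answer(adjusted, 160.0))
```

Nothing in `score_answer` inspects which records were retrieved or how the values were combined: the check sees only the final number\.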

Closed questions make the evaluation reproducible and avoid the need for human\-expert grading\. The remaining gap is the absence of a reviewable analyst work product\. The benchmark paper does not report a scored analyst memo, evidence table, citation package, formula trace, or downstream filing object\.

APEX\-SWE is introduced as a benchmark for assessing AI agents on economically valuable software\-engineering tasks, with Integration tasks that require end\-to\-end systems across cloud primitives, business applications, and infrastructure services, and Observability tasks that require debugging production\-style failures using telemetry and unstructured context\. The released tasks provide containerized environments, issue context, service documentation, connection information, pytest tests, and rubric files \(Kottamasu et al\., [2026](https://arxiv.org/html/2605.23262#bib.bib96)\)\.

In the 1\-aws\-s3\-snapshots case, the model must create a Python script that uploads a users\.csv file to a date\-based LocalStack S3 path\. The scored object is both an artifact and a state change: the script must exist and run, and the expected S3 object must be created with the required properties\. The visible work activities are primarily *procedure\-execution* and *record\-keeping*, with *troubleshooting* only insofar as the model must resolve execution failures inside the benchmark environment\. The tested setting includes service state, credentials, issue context, documentation, and executable tests\. The scored product is a working script and a service\-state change, checked by tests and rubric criteria\. The score therefore supports a claim about completing a specified integration task in the benchmarked software environment\.
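The difference between scoring an artifact and scoring a state change can be sketched in a few lines of Python\. This is a hedged illustration, not the released task's script or tests: the key layout `snapshots/YYYY/MM/DD/users.csv`, the bucket name, and the `FakeS3` stand\-in are all hypothetical\. The stand\-in mimics only the two S3 client methods used here \(`upload_file` and `list_objects_v2`\) so that the example runs without a LocalStack instance\.

```python
from datetime import date


def snapshot_key(day, filename="users.csv", prefix="snapshots"):
    """Build a date-based object key (hypothetical layout)."""
    return f"{prefix}/{day:%Y/%m/%d}/{filename}"


def upload_snapshot(client, bucket, local_path, day):
    """Upload the file and return the key it was written to."""
    key = snapshot_key(day)
    client.upload_file(local_path, bucket, key)
    return key


def object_exists(client, bucket, key):
    """State check: is there an object with exactly this key in the bucket?"""
    response = client.list_objects_v2(Bucket=bucket, Prefix=key)
    return any(obj["Key"] == key for obj in response.get("Contents", []))


class FakeS3:
    """In-memory stand-in exposing the two S3 client methods used above."""

    def __init__(self):
        self.objects = {}

    def upload_file(self, Filename, Bucket, Key):
        self.objects[(Bucket, Key)] = Filename

    def list_objects_v2(self, Bucket, Prefix=""):
        keys = [k for (b, k) in self.objects if b == Bucket and k.startswith(Prefix)]
        return {"Contents": [{"Key": k} for k in keys]} if keys else {}


client = FakeS3()
key = upload_snapshot(client, "demo-bucket", "users.csv", date(2024, 1, 31))
print(key)
print(object_exists(client, "demo-bucket", key))
```

With boto3 installed and LocalStack running, an S3 client created with an `endpoint_url` pointing at the LocalStack endpoint could be passed in place of `FakeS3`; that setup is an assumption about the environment, not something the case description specifies\. The point of the sketch is that `object_exists` checks what the script left behind in the service, not what the script printed or returned\.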

APEX\-SWE also shows that even a more directly scored product still has boundaries\. The benchmarked environment exposes more work structure than a reply or final answer, but the released task does not report production deployment, access\-control review, rollback procedure, long\-term monitoring, stakeholder approval, or maintainer review\. These omissions do not make the task invalid\. They locate the score: it is evidence of benchmarked integration or observability work, rather than full production software engineering responsibility\.

## 5 Discussion

The case analyses show that benchmark scores support work\-capability claims only through the work activity, tested setting, and scored product they actually represent\. This framing does not make benchmarks a substitute for deployment evidence, nor does it make work\-activity labels final or sufficient by themselves\. It instead clarifies what kind of evidence a benchmark score provides before the system is used in an organization\.

Deployment evidence remains necessary for deployment claims\. An agent’s real capability is realized in deployment, where performance depends on adoption, workflow fit, infrastructure, governance, user behavior, and local constraints \(Dell’Acqua et al\., [2023](https://arxiv.org/html/2605.23262#bib.bib21); Brynjolfsson et al\., [2025](https://arxiv.org/html/2605.23262#bib.bib13); Jaffe et al\., [2024](https://arxiv.org/html/2605.23262#bib.bib106)\)\. This matters especially when AI systems change knowledge\-work workflows themselves: a benchmark organized around predefined activities, settings, and products may be too tied to existing workflows\. Deployment evidence is therefore needed for claims about productivity, local safety, and organizational value\. At the same time, deployment studies are local, costly, often private, and difficult to compare across systems\. They also reflect capabilities beyond the agent itself, including workflow redesign, user training, governance, and human\-AI coordination\. Benchmarks remain useful as public, repeatable, and comparable pre\-deployment evidence \(Liang et al\., [2023](https://arxiv.org/html/2605.23262#bib.bib63)\)\. The relevant question is how such evidence should be interpreted: if benchmarks are used for capability claims, their reports should state which work activity, tested setting, and scored product the score actually supports \(Raji et al\., [2021](https://arxiv.org/html/2605.23262#bib.bib89)\)\.

Component tasks and rubrics remain useful, but they need claim boundaries\. Component\-task evaluations \(Thakur et al\., [2021](https://arxiv.org/html/2605.23262#bib.bib131); Muennighoff et al\., [2023](https://arxiv.org/html/2605.23262#bib.bib129); Liang et al\., [2023](https://arxiv.org/html/2605.23262#bib.bib63)\) isolate narrower behaviors and make failures easier to diagnose\. Rubrics can also grade response quality when exact\-match scoring is too narrow\. These instruments are useful for understanding model behavior, but they become less informative when they are used to imply that a system can perform analysis, advising, coordination, record\-keeping, or other work activities without showing how the component behavior functions in the work setting \(Jacobs and Wallach, [2021](https://arxiv.org/html/2605.23262#bib.bib90); Raji et al\., [2021](https://arxiv.org/html/2605.23262#bib.bib89)\)\. A retrieval score, helpfulness score, or rubric score can contribute evidence for a work\-capability claim, but only when the benchmark report explains how the component task represents the target work activity and what part of the expected work product was actually scored\.

The work\-activity inventory should be treated as a starting point\. Work practices are local and institution\-specific \(Suchman, [1987](https://arxiv.org/html/2605.23262#bib.bib112)\), and they change over time \(Kellogg et al\., [2020](https://arxiv.org/html/2605.23262#bib.bib95)\)\. O\*NET task statements \(National Center for O\*NET Development, [2026](https://arxiv.org/html/2605.23262#bib.bib107)\) also describe occupational tasks at a general level rather than local procedures, tools, documents, and responsibilities\. The O\*NET\-derived inventory in this paper should therefore not be read as a final classification of knowledge work\. Its role is to make the work claim explicit enough to inspect, compare, and revise\. A benchmark task may be narrower than the claim attached to its score, while an occupation or domain may contain many different activities\. The inventory gives benchmark reports a starting vocabulary for this middle level\. It can be revised, expanded, or replaced when better occupational or workflow evidence is available\.

These qualifications suggest two directions for future benchmark research\. First, knowledge\-work benchmarks should move toward scoring work products more directly\. Coding\-agent evaluation already shows this direction: recent benchmarks move from isolated code generation toward repository state, issue context, patches, tests \(Jimenez et al\., [2024](https://arxiv.org/html/2605.23262#bib.bib92)\), pull\-request review \(Guo et al\., [2025](https://arxiv.org/html/2605.23262#bib.bib82); Kumar, [2026](https://arxiv.org/html/2605.23262#bib.bib97)\), integration tasks, and executable state changes \(Kottamasu et al\., [2026](https://arxiv.org/html/2605.23262#bib.bib96)\)\. Knowledge\-work benchmarks in other domains should follow the same design logic without copying the software\-engineering oracle directly\. Web \(Zhou et al\., [2023](https://arxiv.org/html/2605.23262#bib.bib126)\), desktop \(Xie et al\., [2024](https://arxiv.org/html/2605.23262#bib.bib124)\), office \(Wang et al\., [2024b](https://arxiv.org/html/2605.23262#bib.bib118)\), and enterprise\-agent benchmarks \(Drouin et al\., [2024](https://arxiv.org/html/2605.23262#bib.bib22)\) already expose different materials, tools, application states, and workflow constraints\. They also differ in what they score, ranging from final answers and simulator success states to edited artifacts or persistent state changes\. For document, clinical, research, administrative, and policy work, benchmark designers should identify the work activity being evaluated, specify the setting in which the activity is performed, and score the product that another actor or system would actually receive\. Benchmark reports should also state which task proxies are used, which parts of the work they cover, which setting simplifications were made, and what remains untested\.
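The reporting items listed above can be captured in a small structured record attached to each reported score\. The following Python sketch is one possible shape, offered as an assumption: the `BenchmarkClaimReport` class and its field names are hypothetical, and the example values paraphrase the GDPval compliance case analysis in Section 4 and are not an official description of that benchmark\.

```python
from dataclasses import asdict, dataclass, field
import json


@dataclass
class BenchmarkClaimReport:
    """Per-case reporting record (hypothetical field names)."""
    benchmark: str
    case: str
    work_activities: list
    tested_setting: list
    scored_product: str
    setting_simplifications: list = field(default_factory=list)
    untested: list = field(default_factory=list)

    def missing_fields(self):
        """Names of fields left empty, so gaps in the report are visible."""
        return [name for name, value in asdict(self).items() if not value]


report = BenchmarkClaimReport(
    benchmark="GDPval",
    case="Federal Applicant Risk Assessment Tool",
    work_activities=["inspection", "design"],
    tested_setting=["role", "task context", "reference materials", "deliverable form"],
    scored_product="compliance deliverable compared with a human expert deliverable",
    setting_simplifications=["computer-based tasks only",
                             "one-shot, precisely specified task"],
    untested=["grant-review workflow", "approval record", "audit trail"],
)

print(json.dumps(asdict(report), indent=2))
print(report.missing_fields())
```

A record like this does not validate the claim by itself, but an empty `untested` or `setting_simplifications` field becomes something a reader can notice and question\.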

Second, future work should improve and validate the work\-activity inventory\. The O\*NET\-derived inventory should be compared with other occupational ontologies, observed workflows, professional standards, organizational records, real job postings, and workplace\-use traces\. O\*NET and ESCO \(National Center for O\*NET Development, [2026](https://arxiv.org/html/2605.23262#bib.bib107); European Commission, Directorate\-General for Employment, Social Affairs and Inclusion, [2022](https://arxiv.org/html/2605.23262#bib.bib27)\) provide useful ontology\-level comparisons, while workplace\-use traces \(Handa et al\., [2025](https://arxiv.org/html/2605.23262#bib.bib1)\) can help identify emerging task combinations and AI\-mediated work practices\. Job postings may provide a complementary source for current employer expectations, but they should be analyzed separately from observed use\. Future studies should also connect work activities back to occupations, so that benchmark coverage can be reported at both levels: which activities are tested, and which occupational claims the activities can plausibly support\.

## 6 Conclusion

AI\-agent benchmarks are increasingly used to support claims about workplace\-facing capability, but the meaning of a benchmark score depends on the work evidence the benchmark actually collects\. This paper develops a reporting structure for making knowledge\-work benchmark evidence explicit by identifying the work activity being represented, specifying the tested setting, and scoring the work product left by the system\. The benchmark case analyses show that different scoring designs expose different parts of work: occupational deliverables, grounded answers, and executable state changes each support different claims and leave different gaps\. The central implication is that benchmark scores should be read as evidence for the work activity, setting, and product that were actually tested\. Keeping this connection explicit would make knowledge\-work evaluation more interpretable, reduce overbroad capability claims, and give future benchmarks a clearer basis for testing agents in work\-shaped settings\.

## References

- A\. Abbott \(1988\) The system of professions: an essay on the division of expert labor\. University of Chicago Press\.
- R\. K\. Arora, J\. Wei, R\. Soskin Hicks, P\. Bowman, J\. Quiñonero\-Candela, F\. Tsimpourlas, M\. Sharman, M\. Shah, A\. Vallone, A\. Beutel, J\. Heidecke, and K\. Singhal \(2025\) HealthBench: evaluating large language models towards improved human health\. External Links: 2505\.08775, [Document](https://dx.doi.org/10.48550/arXiv.2505.08775)\.
- Y\. Bai, A\. Jones, K\. Ndousse, A\. Askell, A\. Chen, N\. DasSarma, D\. Drain, S\. Fort, D\. Ganguli, T\. Henighan, N\. Joseph, S\. Kadavath, J\. Kernion, T\. Conerly, S\. El\-Showk, N\. Elhage, Z\. Hatfield\-Dodds, D\. Hernandez, T\. Hume, S\. Johnston, S\. Kravec, L\. Lovitt, N\. Nanda, C\. Olsson, D\. Amodei, T\. Brown, J\. Clark, S\. McCandlish, C\. Olah, B\. Mann, and J\. Kaplan \(2022\) Training a helpful and harmless assistant with reinforcement learning from human feedback\. Vol\. abs/2204\.05862\. External Links: 2204\.05862, [Document](https://dx.doi.org/10.48550/arXiv.2204.05862), [Link](https://arxiv.org/abs/2204.05862)\.
- E\. Brynjolfsson, D\. Li, and L\. R\. Raymond \(2025\) Generative AI at work\. The Quarterly Journal of Economics 140\(2\), pp\. 889–942\. Note: Earlier version circulated as NBER Working Paper 31161\. External Links: [Document](https://dx.doi.org/10.1093/qje/qjae044)\.
- R\. J\. G\. B\. Campello, D\. Moulavi, and J\. Sander \(2013\) Density\-based clustering based on hierarchical density estimates\. In Advances in Knowledge Discovery and Data Mining, pp\. 160–172\. External Links: [Document](https://dx.doi.org/10.1007/978-3-642-37456-2%5F14)\.
- P\. R\. Carlile \(2002\) A pragmatic view of knowledge and boundaries: boundary objects in new product development\. Organization Science 13\(4\), pp\. 442–455\.
- P\. R\. Carlile \(2004\) Transferring, translating, and transforming: an integrative framework for managing knowledge across boundaries\. Organization Science 15\(5\), pp\. 555–568\. External Links: [Document](https://dx.doi.org/10.1287/orsc.1040.0094)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. d\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, et al\. \(2021\) Evaluating large language models trained on code\. External Links: 2107\.03374, [Document](https://dx.doi.org/10.48550/arXiv.2107.03374), [Link](https://arxiv.org/abs/2107.03374)\.
- T\. H\. Davenport \(2005\) Thinking for a living: how to get better performance and results from knowledge workers\. Harvard Business School Press, Boston, MA\.
- F\. Dell’Acqua, E\. McFowland, E\. R\. Mollick, H\. Lifshitz\-Assaf, K\. C\. Kellogg, S\. Rajendran, L\. Krayer, F\. Candelon, and K\. R\. Lakhani \(2023\) Navigating the jagged technological frontier: field experimental evidence of the effects of ai on knowledge worker productivity and quality\. Technical report, Harvard Business School\.
- A\. Drouin, M\. Gasse, M\. Caccia, I\. H\. Laradji, M\. Del Verme, T\. Marty, D\. Vazquez, N\. Chapados, and A\. Lacoste \(2024\) WorkArena: how capable are web agents at solving common knowledge work tasks? In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 235, pp\. 11642–11662\. External Links: [Link](https://proceedings.mlr.press/v235/drouin24a.html)\.
- P\. F\. Drucker \(1999\) Knowledge\-worker productivity: the biggest challenge\. California Management Review 41\(2\), pp\. 79–94\.
- T\. Eloundou, S\. Manning, P\. Mishkin, and D\. Rock \(2023\) GPTs are GPTs: an early look at the labor market impact potential of large language models\. External Links: 2303\.10130, [Document](https://dx.doi.org/10.48550/arXiv.2303.10130), [Link](https://arxiv.org/abs/2303.10130)\.
- European Commission, Directorate\-General for Employment, Social Affairs and Inclusion \(2022\) ESCO Dataset v1\.1\.1: European Skills, Competences, Qualifications and Occupations\. Note: [https://esco\.ec\.europa\.eu/en/use\-esco/download](https://esco.ec.europa.eu/en/use-esco/download) English CSV classification release\.
- E\. Freidson \(2001\) Professionalism: the third logic\. University of Chicago Press\.
- N\. Guha, J\. Nyarko, D\. E\. Ho, C\. Ré, A\. Chilton, A\. Narayana, A\. Chohlas\-Wood, A\. Peters, B\. Waldon, D\. N\. Rockmore, D\. Zambrano, D\. Talisman, E\. Hoque, F\. Surani, F\. Fagan, G\. Sarfaty, G\. M\. Dickinson, H\. Porat, J\. Hegland, J\. Wu, J\. Nudell, J\. Niklaus, J\. Nay, J\. H\. Choi, K\. Tobia, M\. Hagan, M\. Ma, M\. Livermore, N\. Rasumov\-Rahe, N\. Holzenberger, N\. Kolt, P\. Henderson, S\. Rehaag, S\. Goel, S\. Gao, S\. Williams, S\. Gandhi, T\. Zur, V\. Iyer, and Z\. Li \(2023\) LegalBench: a collaboratively built benchmark for measuring legal reasoning in large language models\. External Links: 2308\.11462, [Link](https://arxiv.org/abs/2308.11462)\.
- H\. Guo, X\. Zheng, Z\. Liao, H\. Yu, P\. Di, Z\. Zhang, and H\. Dai \(2025\) CodeFuse\-CR\-Bench: a comprehensiveness\-aware benchmark for end\-to\-end code review evaluation in python projects\. External Links: 2509\.14856, [Document](https://dx.doi.org/10.48550/arXiv.2509.14856), [Link](https://arxiv.org/abs/2509.14856)\.
- K\. Handa, A\. Tamkin, M\. McCain, S\. Huang, E\. Durmus, S\. Heck, J\. Mueller, J\. Hong, S\. Ritchie, T\. Belonax, K\. K\. Troy, D\. Amodei, J\. Kaplan, J\. Clark, and D\. Ganguli \(2025\) Which economic tasks are performed with AI? evidence from millions of Claude conversations\. External Links: 2503\.04761, [Link](https://arxiv.org/abs/2503.04761)\.
- M\. J\. Handel \(2016\) The O\*NET content model: strengths and limitations\. Journal for Labour Market Research 49\(2\), pp\. 157–176\. External Links: [Document](https://dx.doi.org/10.1007/s12651-016-0199-8)\.
- E\. Hutchins \(1995\) Cognition in the wild\. MIT Press, Cambridge, MA\. External Links: ISBN 9780262082310\.
- International Labour Office \(2012\) International standard classification of occupations 2008 \(ISCO\-08\): structure, group definitions and correspondence tables\. Technical report, International Labour Office, Geneva\.
- A\. Z\. Jacobs and H\. Wallach \(2021\) Measurement and fairness\. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pp\. 375–385\. External Links: [Document](https://dx.doi.org/10.1145/3442188.3445901)\.
- S\. Jaffe, N\. P\. Shah, J\. Butler, A\. Farach, A\. Cambon, B\. Hecht, M\. Schwarz, and J\. Teevan \(2024\) Generative AI in real\-world workplaces\. Technical Report MSR\-TR\-2024\-29, Microsoft Research\. External Links: [Link](https://www.microsoft.com/en-us/research/publication/generative-ai-in-real-world-workplaces/)\.
- G\. Jaume, H\. K\. Ekenel, and J\. Thiran \(2019\) FUNSD: a dataset for form understanding in noisy scanned documents\. In 2019 International Conference on Document Analysis and Recognition Workshops, pp\. 1–6\. External Links: [Document](https://dx.doi.org/10.1109/ICDARW.2019.10029)\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. R\. Narasimhan \(2024\) SWE\-bench: can language models resolve real\-world github issues? In International Conference on Learning Representations\. External Links: [Link](https://arxiv.org/abs/2310.06770)\.
- M\. T\. Kane \(2013\) Validating the interpretations and uses of test scores\. Journal of Educational Measurement 50\(1\), pp\. 1–73\. External Links: [Document](https://dx.doi.org/10.1111/jedm.12000)\.
- K\. C\. Kellogg, M\. A\. Valentine, and A\. Christin \(2020\) Algorithms at work: the new contested terrain of control\. Academy of Management Annals\.
- A\. Kottamasu, C\. Mahapatra, S\. Lee, B\. Pan, A\. Barthwal, A\. Datta, A\. Gupta, P\. Mehta, A\. Arun, S\. Alberti, A\. Hiremath, B\. Foody, and B\. Vidgen \(2026\) APEX\-SWE\. External Links: 2601\.08806, [Document](https://dx.doi.org/10.48550/arXiv.2601.08806), [Link](https://arxiv.org/abs/2601.08806)\.
- D\. Kumar \(2026\) SWE\-PRBench: benchmarking ai code review quality against pull request feedback\. External Links: 2603\.26130, [Document](https://dx.doi.org/10.48550/arXiv.2603.26130), [Link](https://arxiv.org/abs/2603.26130)\.
- P\. Liang, R\. Bommasani, T\. Lee, D\. Tsipras, D\. Soylu, M\. Yasunaga, Y\. Zhang, D\. Narayanan, Y\. Wu, A\. Kumar, B\. Newman, B\. Yuan, B\. Yan, C\. Zhang, C\. Cosgrove, C\. D\. Manning, C\. Ré, D\. Acosta\-Navas, D\. A\. Hudson, E\. Zelikman, E\. Durmus, F\. Ladhak, F\. Rong, H\. Ren, H\. Yao, J\. Wang, K\. Santhanam, L\. Orr, L\. Zheng, M\. Yuksekgonul, M\. Suzgun, N\. Kim, N\. Guha, N\. Chatterji, O\. Khattab, P\. Henderson, Q\. Huang, S\. M\. X\. Chi, S\. Santurkar, S\. Ganguli, T\. Hashimoto, T\. Icard, T\. Zhang, V\. Chaudhary, W\. Wang, X\. Li, Y\. Mai, Y\. Zhang, and Y\. Koreeda \(2023\) Holistic evaluation of language models\. Transactions on Machine Learning Research\.
- T\. W\. Malone and K\. Crowston \(1994\) The interdisciplinary study of coordination\. ACM Computing Surveys 26\(1\), pp\. 87–119\.
- M\. Mathew, D\. Karatzas, and C\. V\. Jawahar \(2021\) DocVQA: a dataset for VQA on document images\. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp\. 2200–2209\.
- L\. McInnes, J\. Healy, N\. Saul, and L\. Grossberger \(2018\) UMAP: uniform manifold approximation and projection\. Journal of Open Source Software 3\(29\), pp\. 861\. External Links: [Document](https://dx.doi.org/10.21105/joss.00861)\.
- S\. Messick \(1989\) Validity\. In Educational Measurement, R\. L\. Linn \(Ed\.\), pp\. 13–103\.
- S\. Messick \(1995\) Validity of psychological assessment: validation of inferences from persons’ responses and performances as scientific inquiry into score meaning\. American Psychologist 50\(9\), pp\. 741–749\. External Links: [Document](https://dx.doi.org/10.1037/0003-066X.50.9.741)\.
- N\. Muennighoff, N\. Tazi, L\. Magne, and N\. Reimers \(2023\) MTEB: massive text embedding benchmark\. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics\.
- R\. Nallapati, B\. Zhou, C\. dos Santos, Ç\. Gulçehre, and B\. Xiang \(2016\) Abstractive text summarization using sequence\-to\-sequence RNNs and beyond\. Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pp\. 280–290\.
- S\. Narayan, S\. B\. Cohen, and M\. Lapata \(2018\) Don’t give me the details, just the summary\! topic\-aware convolutional neural networks for extreme summarization\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp\. 1797–1807\.
- National Center for O\*NET Development \(2026\) O\*NET 30\.2 Database: Task Statements\. Note: [https://www\.onetcenter\.org/dictionary/30\.2/mysql/task\_statements\.html](https://www.onetcenter.org/dictionary/30.2/mysql/task_statements.html) Accessed 2026\-04\-29\.
- K\. Opsahl\-Ong, A\. Singhvi, J\. Collins, I\. Zhou, C\. Wang, A\. Baheti, O\. Oertell, J\. Portes, S\. Havens, E\. Elsen, M\. Bendersky, M\. Zaharia, and X\. Chen \(2026\) OfficeQA Pro: an enterprise benchmark for end\-to\-end grounded reasoning\. External Links: 2603\.08655, [Document](https://dx.doi.org/10.48550/arXiv.2603.08655), [Link](https://arxiv.org/abs/2603.08655)\.
- T\. Patwardhan, R\. Dias, E\. Proehl, G\. Kim, M\. Wang, O\. Watkins, S\. P\. Fishman, M\. Aljubeh, P\. Thacker, L\. Fauconnet, N\. S\. Kim, P\. Chao, S\. Miserendino, G\. Chabot, D\. Li, M\. Sharman, A\. Barr, A\. Glaese, and J\. Tworek \(2025\) GDPval: evaluating AI model performance on real\-world economically valuable tasks\. External Links: 2510\.04374, [Document](https://dx.doi.org/10.48550/arXiv.2510.04374), [Link](https://arxiv.org/abs/2510.04374)\.
- F\. Petroni, A\. Piktus, A\. Fan, P\. Lewis, M\. Yazdani, N\. De Cao, J\. Thorne, Y\. Jernite, V\. Karpukhin, J\. Maillard, V\. Plachouras, T\. Rocktäschel, and S\. Riedel \(2021\) KILT: a benchmark for knowledge intensive language tasks\. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp\. 2523–2544\. External Links: [Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.200)\.
- M\. U\. Porat \(1977\) The information economy: definition and measurement\. OT Special Publication, Technical Report 77\-12, U\.S\. Department of Commerce, Office of Telecommunications, Washington, DC\.
- I\. D\. Raji, E\. M\. Bender, A\. Paullada, E\. Denton, and A\. Hanna \(2021\) AI and the everything in the whole wide world benchmark\. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks\. External Links: [Link](https://openreview.net/forum?id=j6NxpQbREA1)\.
- P\. Rajpurkar, J\. Zhang, K\. Lopyrev, and P\. Liang \(2016\) SQuAD: 100,000\+ questions for machine comprehension of text\. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp\. 2383–2392\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessi, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\) Toolformer: language models can teach themselves to use tools\. In Thirty\-seventh Conference on Neural Information Processing Systems\. External Links: [Link](https://openreview.net/forum?id=Yacmpz84TH)\.
- S\. L\. Star and J\. R\. Griesemer \(1989\) Institutional ecology, ‘translations’ and boundary objects: amateurs and professionals in berkeley’s museum of vertebrate zoology, 1907–39\. Social Studies of Science 19\(3\), pp\. 387–420\. External Links: [Document](https://dx.doi.org/10.1177/030631289019003001)\.
- L\. A\. Suchman \(1987\) Plans and situated actions: the problem of human\-machine communication\. Cambridge University Press\.
- N\. Thakur, N\. Reimers, A\. Rücklé, A\. Srivastava, and I\. Gurevych \(2021\) BEIR: a heterogeneous benchmark for zero\-shot evaluation of information retrieval models\. Advances in Neural Information Processing Systems\.
- L\. Wang, C\. Ma, X\. Feng, Z\. Zhang, H\. Yang, J\. Zhang, Z\. Chen, J\. Tang, X\. Chen, Y\. Lin, W\. X\. Zhao, Z\. Wei, and J\. Wen \(2024a\) A survey on large language model based autonomous agents\. Frontiers of Computer Science 18\(6\), pp\. 186345\. External Links: [Document](https://dx.doi.org/10.1007/s11704-024-40231-1)\.
- Y\. Wang, M\. Pradel, and Z\. Liu \(2025\)Are "solved issues" in swe\-bench really solved correctly? an empirical study\.ArXivabs/2503\.15223\.External Links:2503\.15223,[Document](https://dx.doi.org/10.48550/arXiv.2503.15223),[Link](https://arxiv.org/abs/2503.15223)Cited by:[§1](https://arxiv.org/html/2605.23262#S1.p3.1)\.
- Z\. Wang, Y\. Cui, L\. Zhong, Z\. Zhang, D\. Yin, B\. Y\. Lin, and J\. Shang \(2024b\)OfficeBench: benchmarking language agents across multiple applications for office automation\.External Links:2407\.19056,[Document](https://dx.doi.org/10.48550/arXiv.2407.19056)Cited by:[§1](https://arxiv.org/html/2605.23262#S1.p1.1),[§3\.1](https://arxiv.org/html/2605.23262#S3.SS1.p2.1),[§3\.1](https://arxiv.org/html/2605.23262#S3.SS1.p4.1),[§3\.1](https://arxiv.org/html/2605.23262#S3.SS1.p7.1),[§5](https://arxiv.org/html/2605.23262#S5.p5.1)\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, X\. Zhang, S\. Zhang, J\. Liu, A\. H\. Awadallah, R\. W\. White, D\. Burger, and C\. Wang \(2023\)AutoGen: enabling next\-gen llm applications via multi\-agent conversation\.arXiv preprint arXiv:2308\.08155\.External Links:2308\.08155,[Document](https://dx.doi.org/10.48550/arXiv.2308.08155),[Link](https://arxiv.org/abs/2308.08155)Cited by:[§1](https://arxiv.org/html/2605.23262#S1.p1.1)\.
- T\. Xie, D\. Zhang, J\. Chen, X\. Li, S\. Zhao, R\. Cao, T\. J\. Hua, Z\. Cheng, D\. Shin, F\. Lei, Y\. Liu, Y\. Xu, S\. Zhou, S\. Savarese, C\. Xiong, V\. Zhong, and T\. Yu \(2024\)OSWorld: benchmarking multimodal agents for open\-ended tasks in real computer environments\.External Links:2404\.07972,[Document](https://dx.doi.org/10.48550/arXiv.2404.07972)Cited by:[§1](https://arxiv.org/html/2605.23262#S1.p1.1),[item ii\)](https://arxiv.org/html/2605.23262#S3.I1.i2.p1.1),[§3\.1](https://arxiv.org/html/2605.23262#S3.SS1.p4.1),[§5](https://arxiv.org/html/2605.23262#S5.p5.1)\.
- S\. Yao, N\. Shinn, P\. Razavi, and K\. R\. Narasimhan \(2025\)τ\-Bench: a benchmark for tool\-agent\-user interaction in real\-world domains\.External Links:[Link](https://openreview.net/forum?id=roNSXZpUDN)Cited by:[§3\.1](https://arxiv.org/html/2605.23262#S3.SS1.p4.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.23262#S1.p1.1)\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, U\. Alon, and G\. Neubig \(2023\)WebArena: a realistic web environment for building autonomous agents\.External Links:2307\.13854,[Link](https://arxiv.org/abs/2307.13854)Cited by:[§1](https://arxiv.org/html/2605.23262#S1.p1.1),[item ii\)](https://arxiv.org/html/2605.23262#S3.I1.i2.p1.1),[§3\.1](https://arxiv.org/html/2605.23262#S3.SS1.p4.1),[§5](https://arxiv.org/html/2605.23262#S5.p5.1)\.

## Appendix A Inventory construction details

This appendix documents how the 18\-work\-activity inventory used in Section [3\.1](https://arxiv.org/html/2605.23262#S3.SS1) is built from O\*NET task text \[National Center for O\*NET Development, [2026](https://arxiv.org/html/2605.23262#bib.bib107)\]\. Figure [1](https://arxiv.org/html/2605.23262#S3.F1) diagrams the full pipeline, including the embedding, UMAP, and HDBSCAN clustering stage that yields 108 dense task groups before expert\-panel consolidation \[McInnes et al\., [2018](https://arxiv.org/html/2605.23262#bib.bib103), Campello et al\., [2013](https://arxiv.org/html/2605.23262#bib.bib15)\]\.

#### Corpus\.

O\*NET 30\.2 task statements are first filtered to Job Zones 3–5 and knowledge\-work occupations \[National Center for O\*NET Development, [2026](https://arxiv.org/html/2605.23262#bib.bib107)\]\. After several GPT\-5\.5\-assisted screening and adjudication rounds removed task statements whose primary content was direct manual, performative, or routine clerical work, the reporting corpus contains 12,464 task statements\. Each row carries the O\*NET\-SOC code, occupation title, task identifier, cleaned task text, importance and relevance scores, and Job Zone\.

#### Screens\.

The screening procedure uses iterative GPT\-5\.5 classification with author adjudication\. The exclusion categories are developed through three independent 100\-task samples drawn without replacement from the candidate corpus; each sample is adjudicated against a growing list of candidate exclusion categories until the list stabilizes at twelve categories, including direct\-manual, performative, routine\-clerical, and their sub\-variants\. A stricter atlas\-inclusion screen is then applied to the 12,464\-task reporting corpus\. It removes 4,092 statements and retains 8,372 for profession\-neutral rewriting, embedding, clustering, and consolidation; the retained set includes 34 tasks from an availability\-scheduling cluster that were re\-included for analysis\. The final 18 work\-activity labels are applied back to the 12,464\-task reporting corpus for Table [2](https://arxiv.org/html/2605.23262#S3.T2), so the table counts conserve the reporting corpus while the atlas visualizes the stricter 8,372\-task clustering subset\.

#### Embedding and clustering\.

Tasks are rewritten into profession\-neutral work\-activity phrases under a fixed prompt to remove domain vocabulary, then embedded into a 1,536\-dimensional space and $L_2$\-normalized\. UMAP with 15 dimensions, 30 neighbors, and cosine metric, followed by HDBSCAN with minimum cluster size 30, minimum samples 10, and excess\-of\-mass extraction, yields 108 clusters on the task\-only geometry; noise is soft\-assigned via membership vectors \[McInnes et al\., [2018](https://arxiv.org/html/2605.23262#bib.bib103), Campello et al\., [2013](https://arxiv.org/html/2605.23262#bib.bib15)\]\.
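The clustering stage can be sketched as follows. This is a minimal illustration, not the authors' code: the numeric settings come from the paragraph above, but their mapping onto `umap-learn` and `hdbscan` keyword names, the function name `cluster_tasks`, and the choice to give each noise point its highest\-membership cluster are assumptions. The embedding model and random seed are not stated in the text and are left out.

```python
# Values are the ones stated in the text; the keyword names are our
# assumed mapping onto the umap-learn and hdbscan packages.
UMAP_PARAMS = {"n_components": 15, "n_neighbors": 30, "metric": "cosine"}
HDBSCAN_PARAMS = {
    "min_cluster_size": 30,
    "min_samples": 10,
    "cluster_selection_method": "eom",  # excess-of-mass extraction
    "prediction_data": True,            # required for membership vectors
}


def cluster_tasks(embeddings):
    """Return one cluster label per row of `embeddings` (n_tasks x 1536)."""
    # Imported lazily so the parameter block above can be inspected
    # without the third-party packages installed.
    import numpy as np
    import umap      # provided by the umap-learn package
    import hdbscan

    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # L2-normalize each row

    reduced = umap.UMAP(**UMAP_PARAMS).fit_transform(x)
    clusterer = hdbscan.HDBSCAN(**HDBSCAN_PARAMS).fit(reduced)

    labels = clusterer.labels_.copy()
    noise = labels == -1
    if noise.any():
        # Assumption: a noise point takes the cluster with the largest
        # soft-membership score.
        membership = hdbscan.all_points_membership_vectors(clusterer)
        labels[noise] = membership[noise].argmax(axis=1)
    return labels
```

Because UMAP is stochastic, a rerun of a sketch like this would not be expected to reproduce exactly 108 clusters unless the seed and embedding model match the original run.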

#### Consolidation\.

Four rounds of LLM summarization interleaved with expert\-panel review produce a domain\-stripped sub\-work\-activity label per cluster and a top\-level work\-activity set\. Rounds 1–3 produced 38, 39, and 16 top\-level work activities respectively and were rejected by the panel; round 4 produced the 18 plain\-English, single\-concept work activities used in Figures [1](https://arxiv.org/html/2605.23262#S3.F1) and [2](https://arxiv.org/html/2605.23262#S3.F2) and Table [2](https://arxiv.org/html/2605.23262#S3.T2), including the split of an earlier compliance\-analysis work activity into *inspection* and *investigation* and the folding of a small clinical / surgical / nursing\-support cluster with 93 tasks into *procedure\-execution*\.

#### Invariants\.

Every atlas cluster has exactly one top\-level assignment, and every top\-level work activity has $\geq 1$ member cluster and several distinct occupation families in its cross\-occupation example set\. For reporting, the final labels are propagated back to all 12,464 screened task statements, so Table [2](https://arxiv.org/html/2605.23262#S3.T2) conserves the reporting corpus\. Every work activity carries an explicit set of contrasts against adjacent work activities\.
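The first two invariants and the conservation property lend themselves to a mechanical check. The sketch below shows one way to write it; the function names, the dictionary layout, and the toy data are hypothetical and are not taken from the paper's pipeline.

```python
from collections import defaultdict


def check_invariants(cluster_to_activities, activities):
    """Return a list of violation messages; an empty list means all hold.

    cluster_to_activities: dict mapping cluster id -> list of top-level labels
    activities: iterable of top-level work-activity names
    """
    problems = []
    members = defaultdict(list)
    for cluster, assigned in cluster_to_activities.items():
        # Invariant 1: exactly one top-level assignment per cluster.
        if len(assigned) != 1:
            problems.append(f"cluster {cluster} has {len(assigned)} assignments")
        for activity in assigned:
            members[activity].append(cluster)
    for activity in activities:
        # Invariant 2: every work activity has at least one member cluster.
        if not members[activity]:
            problems.append(f"activity {activity!r} has no member cluster")
    return problems


def conserves_corpus(label_counts, corpus_size=12_464):
    """True when per-activity task counts sum to the reporting corpus size."""
    return sum(label_counts.values()) == corpus_size


# Toy data (hypothetical): two clusters, two activities, all invariants hold.
ok = check_invariants(
    {0: ["inspection"], 1: ["investigation"]},
    ["inspection", "investigation"],
)
```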

## Appendix B External ontology check: ESCO

This appendix contains the ESCO legibility experiment: scope, design, pipeline, per\-work\-activity mapping results and shares, and a reading of the cases where the O\*NET and ESCO per\-work\-activity distributions diverge\.

#### Source and scope\.

The ESCO inputs are taken from the v1\.1\.1 English CSV release \[European Commission, Directorate\-General for Employment, Social Affairs and Inclusion, [2022](https://arxiv.org/html/2605.23262#bib.bib27)\]: the Skills pillar, the Occupations pillar with the ISCO groups table, and the occupation–skill relations\. The scope filter keeps rows whose skill type is skill or competence and retains only skills that are essential\-linked to at least one occupation in ISCO\-08 groups 1, 2, or 3, giving 5,826 items \[European Commission, Directorate\-General for Employment, Social Affairs and Inclusion, [2022](https://arxiv.org/html/2605.23262#bib.bib27)\]\. A supplementary set of 1,977 knowledge\-type items is excluded from the primary analysis because such ESCO rows describe subject\-matter domains rather than work activities\.

#### Design\.

O\*NET remains the derivation dataset and the inventory is held frozen\. Pooling is avoided because the units differ: ESCO’s primary unit is the skill or competence, whose text describes an ability a worker must hold, while O\*NET tasks describe work performed\. The two units sit at different levels, and mixing them in a single clustering run would blur the primary claim\. The scope filter, meaning skill or competence type essential\-linked to at least one ISCO\-08 group 1–3 occupation \[International Labour Office, [2012](https://arxiv.org/html/2605.23262#bib.bib87)\], is the ESCO analog of the O\*NET Job Zones 3–5 scope\. The resulting 5,826 items are assigned to one of the 18 work activities by an LLM prompted with the frozen per\-work\-activity one\-line definition, input, output, and distinct\-from fields, plus an explicit “none of the above” option for items the inventory does not cover\. The two signals we report are the mapping result under this rubric assignment and the per\-work\-activity distribution of ESCO items against the O\*NET distribution in Table [2](https://arxiv.org/html/2605.23262#S3.T2)\. The divergences localize the places where the inventory is most O\*NET\-shaped, and their direction is interpretable from how the two ontologies are built\.

#### LLM assignment\.

An LLM classifier is called once per item at temperature 0 with a system prompt that contains the 18 work\-activity ids and, for each, the one\-line definition, input, output, and up to three distinct\-from clauses\. The prompt requires the model to return a work\-activity id, a confidence rating of high, medium, or low, and a one\-sentence rationale, and explicitly permits a “none of the above” label for items that do not describe a work activity\. Classification calls are cached so the pass is resumable\. Of the 5,826 scoped items, 3,893 return high confidence \(66\.8%\), 1,642 medium \(28\.2%\), and 291 low \(5\.0%\)\. A supplementary centroid assignment, based on cosine similarity of each ESCO item to the $L_2$\-normalized mean of O\*NET task embeddings per work activity, is computed for cross\-reference but is not used for any number reported in this appendix\.
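The supplementary centroid assignment can be illustrated in a few lines of dependency\-free Python. This is a sketch under assumptions: the helper names and the two\-dimensional toy vectors are hypothetical, and we read "$L_2$\-normalized mean" as normalizing the mean vector after averaging, which the text does not spell out.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def centroid(vectors):
    """L2-normalized mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return l2_normalize(mean)


def assign_by_centroid(item, centroids):
    """Return (activity, cosine similarity) for the closest centroid."""
    unit = l2_normalize(item)
    # Both sides have unit length, so the dot product is the cosine similarity.
    scores = {
        name: sum(a * b for a, b in zip(unit, c))
        for name, c in centroids.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy 2-d embeddings (hypothetical; the real vectors are 1,536-dimensional).
centroids = {
    "advising": centroid([[1.0, 0.0], [0.8, 0.2]]),
    "teaching": centroid([[0.0, 1.0], [0.1, 0.9]]),
}
best, score = assign_by_centroid([0.9, 0.1], centroids)
```

In the toy data the item points in the same direction as the *advising* centroid, so it is assigned there with a cosine similarity of 1 up to floating\-point error.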

#### Mapping result\.

5,730 of the 5,826 ESCO items \(98\.3%\) are placed into one of the 18 work activities; 96 items \(1\.7%\) receive a “none of the above” label \[European Commission, Directorate\-General for Employment, Social Affairs and Inclusion, [2022](https://arxiv.org/html/2605.23262#bib.bib27)\]\. All 18 work activities receive ESCO items; 17 of 18 receive $\geq 40$ items, with *investigation* the single exception at 39\. The “none of the above” set is dominated by ESCO entries that encode personal traits, subject\-matter knowledge, perceptual abilities, or language\-proficiency labels\. These rejected entries describe attributes rather than work activities, consistent with the inventory’s work\-activity\-not\-worker design rule\. The atlas therefore absorbs the ESCO material at the categorical level: no work activity empties, no large residue accumulates outside the inventory\. Where the two corpora diverge is in how they re\-weight the 18 work activities, which we read next\.

#### Where O\*NET and ESCO diverge\.

Figure [B\.1](https://arxiv.org/html/2605.23262#A2.F1) and Table [B\.1](https://arxiv.org/html/2605.23262#A2.T1) give the per\-work\-activity shares in each corpus\. Every work activity is populated in both corpora, but the two ontologies re\-weight the activities because their source units differ\. O\*NET is a task inventory, so it tends to multiply discrete stewardship acts such as troubleshooting, administration, record\-keeping, investigation, and teaching across occupations\. ESCO is a skill and competence catalog, so it tends to multiply transferable competences such as advising, fabrication, coordination, design, and rule\-enforcement\. These differences do not change the main paper’s conclusion\. They show that the 18\-work\-activity inventory is legible across a separate occupational ontology while remaining shaped by its O\*NET derivation source\. The ESCO check is therefore used only as a legibility and corpus\-shift check, not as validation that the inventory is final\.

![Refer to caption](https://arxiv.org/html/2605.23262v1/esco_onet_op_share.png)

Figure B\.1: Per\-work\-activity share in the O\*NET task corpus \(12,464 knowledge\-work tasks\) versus the scoped ESCO skill/competence corpus \(5,730 items mapped to a work activity\)\. Both corpora populate all 18 work activities\. The two catalogs re\-weight the work activities substantially: ESCO emphasizes *advising*, *fabrication*, *coordination*, *design*, and *rule\-enforcement* more than O\*NET; O\*NET emphasizes *troubleshooting*, *administration*, *record\-keeping*, *investigation*, and *teaching* more than ESCO\. The direction of the divergence tracks the two ontologies’ different source units \(domain\-bound tasks vs profession\-abstract competences\), as read in Appendix [B](https://arxiv.org/html/2605.23262#A2)\. Underlying counts are in Table [B\.1](https://arxiv.org/html/2605.23262#A2.T1)\.

Table B\.1: Per\-work\-activity shares in the O\*NET task corpus versus the scoped ESCO skill/competence corpus\. O\*NET $n$ is the task count from Table [2](https://arxiv.org/html/2605.23262#S3.T2)\. ESCO $n$ is the number of scoped ESCO items assigned to the work activity by the LLM classifier\. $\Delta$ is ESCO percent minus O\*NET percent in work\-activity\-share percentage points\. Sorted by $\Delta$\.
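The $\Delta$ column defined in the Table B\.1 caption is a difference of within\-corpus shares. The sketch below computes it; the function name and the toy counts are hypothetical and do not reproduce the table's values, and the ascending sort is an assumption because the caption does not state a direction. Each share is taken over the items mapped to a work activity, matching the 5,730\-item ESCO denominator in Figure B\.1.

```python
def share_deltas(onet_counts, esco_counts):
    """Return rows (activity, onet_pct, esco_pct, delta) sorted by delta.

    delta = ESCO percent minus O*NET percent, in percentage points.
    """
    onet_total = sum(onet_counts.values())
    esco_total = sum(esco_counts.values())
    rows = []
    for activity, onet_n in onet_counts.items():
        onet_pct = 100.0 * onet_n / onet_total
        esco_pct = 100.0 * esco_counts.get(activity, 0) / esco_total
        rows.append((activity, onet_pct, esco_pct, esco_pct - onet_pct))
    # Assumption: ascending order, most O*NET-heavy activity first.
    return sorted(rows, key=lambda row: row[3])


# Toy counts (hypothetical, not the paper's data).
rows = share_deltas(
    {"advising": 20, "teaching": 80},
    {"advising": 60, "teaching": 40},
)
```

When both corpora cover the same activity set, the deltas sum to zero, so a positive shift for one activity is always offset by negative shifts elsewhere.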

## Appendix C Design notes for each work activity

Table [C\.1](https://arxiv.org/html/2605.23262#A3.T1) translates the 18 work activities into benchmark\-design notes\. The table does not introduce separate evaluation logic\. For each activity, it states a plausible benchmark proxy and the scored work product that would be needed if the benchmark intends to support a work\-activity claim\.

Table C\.1: Design notes for applying the main three\-step approach to each work activity\.
