Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

arXiv cs.AI Papers

Summary

This paper introduces OfficeEval, a benchmark based on China's National Computer Rank Examination (NCRE) that evaluates LLM agents on complex Office automation tasks. Frontier models achieve at best 36.6% in the single-turn setting and 68.8% with agentic systems, well below the 95.5% community-reference score.

arXiv:2606.10956v1 Announce Type: new Abstract: The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-automation capability, as it requires long-horizon planning and reasoning, precise parameter configuration, and multi-application integration. To quantify this capability, we introduce an evaluation based on China's National Computer Rank Examination (NCRE), featuring 200 comprehensive practical-operation tasks across Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric scale using 7,118 machine-gradable criteria, and Score Rate (SR) denotes the mean percentage of rubric points earned across these tasks. We benchmark 7 frontier LLMs and observe stark limitations: single-turn models score a maximum of 36.6%. A stronger agentic system with execution feedback, iterative repair, and broader Office automation access reaches 68.8%, but remains below the 95.5% community-reference score used as a scoring sanity check. Ultimately, our experiments demonstrate that despite recent advancements in code generation, achieving reliable fine-grained Office document automation remains a significant challenge for current code-generating LLM and agent systems.

Cached at: 06/10/26, 06:18 AM

# Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?
Source: [https://arxiv.org/html/2606.10956](https://arxiv.org/html/2606.10956)
###### Abstract

The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-automation capability, as it requires long-horizon planning and reasoning, precise parameter configuration, and multi-application integration. To quantify this capability, we introduce an evaluation based on China’s National Computer Rank Examination (NCRE), featuring 200 comprehensive practical-operation tasks across Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric scale using 7,118 machine-gradable criteria, and Score Rate (SR) denotes the mean percentage of rubric points earned across these tasks. We benchmark 7 frontier LLMs and observe stark limitations: single-turn models score a maximum of 36.6%. A stronger agentic system with execution feedback, iterative repair, and broader Office automation access reaches 68.8%, but remains below the 95.5% community-reference score used as a scoring sanity check. Ultimately, our experiments demonstrate that despite recent advancements in code generation, achieving reliable fine-grained Office document automation remains a significant challenge for current code-generating LLM and agent systems.

## 1 Introduction

The transition toward autonomous Large Language Model (LLM) agents has driven rapid progress in domains ranging from software engineering to web navigation (Jimenez et al., [2024](https://arxiv.org/html/2606.10956#bib.bib3); Zhou et al., [2024](https://arxiv.org/html/2606.10956#bib.bib1); Xie et al., [2024](https://arxiv.org/html/2606.10956#bib.bib2); Mialon et al., [2024](https://arxiv.org/html/2606.10956#bib.bib11)). However, if these agents are to serve as viable “digital workers”, they must master the environments where human work actually takes place. Office suites remain among the most widely used environments for knowledge work, and proficiency across applications like Word, Excel, and PowerPoint remains a foundational workplace skill. Yet, despite its real-world importance, agent evaluation in Office automation remains surprisingly underdeveloped. Current studies rely heavily on synthetic environments, narrow single-application slices, or subjective LLM-as-a-judge scoring, none of which adequately capture the complexity of genuine Office workflows.

To establish a rigorous standard for Office-based agents, we propose evaluating them against the same practical examinations used to certify human professionals. In this paper, we introduce a novel benchmarking framework based on China’s National Computer Rank Examination (NCRE), a massive-scale standardized testing system that has evaluated over 110 million candidates (National Education Examinations Authority, [2024a](https://arxiv.org/html/2606.10956#bib.bib17)). Leveraging an established human certification resolves several major benchmarking bottlenecks simultaneously. The NCRE provides a suite of foundational and advanced tasks that are designed by domain experts, carefully calibrated for difficulty, and, crucially, evaluated using objective, machine-readable scoring rubrics. Furthermore, unlike traditional QA or isolated code-completion tasks, a single NCRE problem requires an agent to execute dozens of distinct, interdependent operations on real-world documents. This transforms a human professional exam into an ideal, highly challenging testbed for evaluating long-horizon sequential decision-making.

![Refer to caption](https://arxiv.org/html/2606.10956v1/figures/hero_figure.png)

Figure 1: End-to-end illustration of a Word task in OfficeEval. The original document (left) is transformed according to the task instructions (center) into a styled brochure with header image, heading styles, and mail-merge labels (right). Only page 1 of the 2-page document is shown; several steps (e.g., 3-column layout, watermark) apply to page 2. The task is scored by 30 deterministic criteria across 6 skill categories. Instructions are translated from the original Chinese; additional examples across Word, Excel, and PowerPoint appear in the Appendix.

Specifically, we introduce OfficeEval, a comprehensive benchmark constructed from 200 NCRE tasks and evaluated against 7,118 machine-gradable criteria across Word, Excel, and PowerPoint. Using this benchmark, we systematically evaluate 7 frontier LLMs under two paradigms: single-turn code generation and autonomous coding-agent systems. Our primary contributions and key findings are summarized as follows:

- A benchmark grounded in real-world professional certification. By transforming the NCRE into OfficeEval, we provide an evaluation framework anchored by a 60-point reference threshold for the extracted practical subset and a 95.5% community-reference score used as a scoring sanity check. This establishes a highly realistic and quantifiable baseline for long-horizon Office automation under a standardized practical-task rubric.
- Systematic evaluation revealing a critical performance gap. Our evaluation demonstrates that no single-turn model reaches the 60-point reference threshold on the extracted practical subset, with the best model (Claude Opus 4.7) achieving only a 36.6% Score Rate (SR). Autonomous coding agents improve substantially (Claude Code reaches 53.0% and Codex reaches 68.8%), but all systems remain below the community-reference score. Furthermore, single-turn performance is highly polarized (dropping to a 2.8% SR for Grok-4.1-fast), indicating that frontier API status alone does not guarantee strong Office automation performance.
- Fine-grained diagnosis and error taxonomy. To understand *why* models fail, we introduce a criterion-level taxonomy that isolates execution crashes from logic errors. This reveals a key insight: in the stronger coding-agent setting, code execution success rises from under 50% to 98–99%, but operation accuracy remains low. Current agents can successfully write code that runs, but still struggle to implement the correct Office-specific semantic operations.

## 2 Related Work

LLM agent benchmarks now span web navigation (Zhou et al., [2024](https://arxiv.org/html/2606.10956#bib.bib1); Deng et al., [2023](https://arxiv.org/html/2606.10956#bib.bib12)), software engineering (Jimenez et al., [2024](https://arxiv.org/html/2606.10956#bib.bib3)), multi-environment reasoning (Liu et al., [2024](https://arxiv.org/html/2606.10956#bib.bib5); Mialon et al., [2024](https://arxiv.org/html/2606.10956#bib.bib11)), and desktop automation (Xie et al., [2024](https://arxiv.org/html/2606.10956#bib.bib2); Bonatti et al., [2024](https://arxiv.org/html/2606.10956#bib.bib27); Xu et al., [2026](https://arxiv.org/html/2606.10956#bib.bib28)). To our knowledge, no prior benchmark combines real standardized Office-exam tasks with deterministic criterion-level grading across Word, Excel, and PowerPoint.

Within Office automation, existing work covers individual applications or narrow scopes. Word tasks appear as minor components in broader suites (Xie et al., [2024](https://arxiv.org/html/2606.10956#bib.bib2); Wang et al., [2024b](https://arxiv.org/html/2606.10956#bib.bib23); Mu et al., [2025](https://arxiv.org/html/2606.10956#bib.bib31)) without dedicated formatting evaluation. Spreadsheet benchmarks (Li et al., [2023](https://arxiv.org/html/2606.10956#bib.bib4); Ma et al., [2024](https://arxiv.org/html/2606.10956#bib.bib25); Chen et al., [2025](https://arxiv.org/html/2606.10956#bib.bib29)) focus on formula and data manipulation, under-representing chart customization, pivot tables, and conditional formatting. Presentation benchmarks (Guo et al., [2024](https://arxiv.org/html/2606.10956#bib.bib24); Huang et al., [2025](https://arxiv.org/html/2606.10956#bib.bib26)) cover layout and editing but omit animations, transitions, and cross-application skills. The most comparable multi-application efforts, OfficeBench (Wang et al., [2024b](https://arxiv.org/html/2606.10956#bib.bib23)) and OdysseyBench (Wang et al., [2025](https://arxiv.org/html/2606.10956#bib.bib30)), study Office workflows across applications, but their evaluation targets workflow-level task completion rather than NCRE-style deterministic, criterion-level grading of fine-grained document properties.

OfficeEval differs from prior work in two respects: (1) tasks come from NCRE, a nationally administered certification exam designed by domain-expert committees, providing externally validated difficulty and broad skill coverage that synthetic or crowdsourced tasks cannot match; and (2) all 7,118 scoring criteria are machine-gradable, enabling deterministic, fine-grained evaluation across Word, Excel, and PowerPoint without LLM or human judging variance. A further consequence of (1) is that scores on OfficeEval carry an externally defined meaning. The per-criterion point allocations are taken directly from the NCRE task rubrics, so a model’s SR is reported on the same per-task 100-point rubric scale used for the extracted practical-operation tasks. It is the share of allocated points the model earned on this subset, giving the score an external rubric anchor rather than only a benchmark-internal ranking among systems.

## 3 OfficeEval

### 3.1 Data source

OfficeEval is derived from the practical operation component of China’s National Computer Rank Examination (NCRE), specifically the Level 1 and Level 2 MS Office modules (National Education Examinations Authority, [2024c](https://arxiv.org/html/2606.10956#bib.bib19), [d](https://arxiv.org/html/2606.10956#bib.bib20)). NCRE is a nationally standardized proficiency test administered by China’s National Education Examinations Authority and designed to assess practical computing skills (National Education Examinations Authority, [2024a](https://arxiv.org/html/2606.10956#bib.bib17)). Level 1 evaluates foundational Office skills such as basic formatting, simple formulas, and standard presentation creation, while Level 2 covers more advanced operations including mail merge, pivot tables, chart customization, and complex animations (National Education Examinations Authority, [2024c](https://arxiv.org/html/2606.10956#bib.bib19), [d](https://arxiv.org/html/2606.10956#bib.bib20)). The full NCRE exam also includes multiple-choice questions on computer fundamentals and, at Level 1, basic OS and internet tasks; we extract only the *practical Office operation* sections (Word, Excel, and PowerPoint), which account for the majority of the exam score (60% at Level 1, 80% at Level 2) and constitute the primary assessment of hands-on Office proficiency (National Education Examinations Authority, [2024c](https://arxiv.org/html/2606.10956#bib.bib19), [d](https://arxiv.org/html/2606.10956#bib.bib20)). This two-level structure provides a natural difficulty gradient for analyzing how model performance varies with task complexity.

The NCRE certificate is a nationally recognized credential. The full NCRE exam combines multiple-choice questions on computer fundamentals with the practical-operation tasks studied in this work, and a candidate passes (and receives a certificate from the Ministry of Education) by scoring at least 60 out of 100 *on the full exam* (National Education Examinations Authority, [2024b](https://arxiv.org/html/2606.10956#bib.bib18)). Our OfficeEval benchmark extracts the practical-operation subset. Every scoring criterion and its point allocation is taken from the task rubric, so SR is the share of allocated points the model earned and is reported on the same per-task 100-point rubric scale for these practical tasks, a property that benchmarks built from synthetic or crowdsourced tasks cannot offer. Throughout this paper, the 60-point value serves as a reference threshold for subset-level score interpretation.

Task instructions are in Chinese, which is well-supported by all multilingual LLMs evaluated in this work. To investigate whether instruction language affects model performance, we also constructed English-translated versions of all 200 tasks (including document content, scoring criteria, and font/style mappings) for a cross-language analysis (Section [7](https://arxiv.org/html/2606.10956#S7)). We provide English translations of example tasks in the appendix for reference. (Footnote 1: The appendix examples are drawn from official NCRE sample materials publicly posted by the National Education Examinations Authority: [https://ncre.neea.edu.cn/xhtml1/category/1507/848-1.htm](https://ncre.neea.edu.cn/xhtml1/category/1507/848-1.htm).)

#### Scope and data availability.

The NCRE practical examination tasks, input documents, and scoring configurations are copyrighted examination materials authored by the National Education Examinations Authority and exam-preparation publishers. We therefore do not redistribute the raw task statements, original input documents, or scoring scripts. To make the study reproducible, we provide in the appendix the full set of prompts we used, the experimental environment and settings, the evaluation procedure, and the criterion-level statistics, so that the same pipeline can be reproduced on independently obtained NCRE materials or applied to similar Office-proficiency data sources.

### 3.2 Task format

Each task in OfficeEval consists of three components:

1. Input document: The initial document file (Word .docx, Excel .xlsx, or PowerPoint .pptx), along with any supporting materials such as images, data files, or theme templates that may be referenced in the instructions.
2. Task instructions: Natural language descriptions of the required operations, typically 5–15 distinct sub-tasks of varying complexity. Instructions often include *reference images* that illustrate the target formatting, layout, or style (see Figure [1](https://arxiv.org/html/2606.10956#S1.F1) for representative examples). This is a natural consequence of Office work: complex formatting goals (table borders, chart styles, SmartArt layouts) are inherently visual and often easier to convey through an image than through text alone. This also mirrors real workplace practice; a coworker or designer may hand over only a style mock-up and expect the document to be produced to match. The ability to translate a visual style into concrete Office operations is itself a practical skill being assessed. The benchmark is therefore inherently multimodal: solving a task often requires interpreting visual references alongside textual descriptions.
3. Scoring configuration: A machine-readable XML file defining all evaluation criteria, including the properties to check, expected values, comparison operators, and point allocations.

The evaluation pipeline is independent of how the output document is produced: given the task instructions and input document, any method produces an output document, which is then scored automatically against the scoring configuration.

Figure [2](https://arxiv.org/html/2606.10956#S3.F2) illustrates the overall task format and evaluation pipeline.

![Refer to caption](https://arxiv.org/html/2606.10956v1/figures/overview.png)

Figure 2: Overview of the OfficeEval benchmark and evaluation pipeline. 200 exam tasks from NCRE span three applications and two difficulty levels. Tasks are processed by either a single-turn code generator or an autonomous coding agent, and output documents are scored by a deterministic evaluation engine against 7,118 fine-grained criteria.

### 3.3 Evaluation framework

Our evaluation engine parses Office Open XML documents standardized by ISO/IEC 29500 (ISO/IEC, [2016](https://arxiv.org/html/2606.10956#bib.bib14)) using the Open XML SDK (Microsoft, [2023](https://arxiv.org/html/2606.10956#bib.bib15)) and runs as a set of Windows executables (.NET, x86). Most scoring criteria are evaluated by parsing the Open XML document structure directly; however, a subset of checks (e.g., certain chart renderings and complex formatting validations) invoke Microsoft Office COM automation internally, so a Windows environment with Microsoft Office installed is required for full evaluation. The scoring logic is specified by configuration files and deterministic evaluator code. For a fixed compatible Windows, Office, font, and locale environment, the same input document yields the same score.

Each task’s scoring configuration defines a set of *scoring criteria*, each specifying a criterion type (e.g., PageSetup, Chart, PivotTable), one or more evaluation rules with expected values and comparison operators, and a point value (on a 100-point scale). Figure [3](https://arxiv.org/html/2606.10956#S3.F3) shows a representative criterion from one task’s configuration: it checks that the font of the text “指标体系构建” is set to “华文新魏” and awards 0.8 points if so. A complete task typically contains 20–70 such criteria covering every operation in the rubric.

```xml
<ScorePointDefine WepName="Paragraph find/replace" WepType="ParaFind">
  <MarkRule FullScore="0.8">
    <FindParams PropKey="Font" PropKeyDesc="Font" Find="指标体系构建"/>
    <StdAnswer>
      <Value>华文新魏</Value>
      <ValueDesc>Title font incorrect; should be 华文新魏.</ValueDesc>
    </StdAnswer>
    <OperatorParam ValueType="0">0</OperatorParam>
  </MarkRule>
</ScorePointDefine>
```

Figure 3: Representative scoring criterion from a Word task. The criterion locates the text “指标体系构建”, checks that its font equals “华文新魏”, and awards 0.8 points on a match. Each task contains 20–70 such criteria.

Each task is scored on a 100-point scale, with the total distributed across its scoring criteria in fine-grained increments. This rubric-based approach enables *partial credit scoring*: an agent that correctly completes 70% of the required operations receives roughly 70% of the points, providing a much more informative signal than binary success/failure metrics.
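To make the criterion format concrete, the following is a minimal Python sketch of how one such rule could be checked. It is not the authors' engine, which is a set of .NET executables. The `observed` dictionary is a hypothetical stand-in for whatever a real parser would read from the output document, and treating operator code `0` as exact equality is an assumption, since the operator semantics are not documented in the text.

```python
import xml.etree.ElementTree as ET

# The criterion from Figure 3, reproduced as a string for illustration.
CRITERION_XML = """
<ScorePointDefine WepName="Paragraph find/replace" WepType="ParaFind">
  <MarkRule FullScore="0.8">
    <FindParams PropKey="Font" PropKeyDesc="Font" Find="指标体系构建"/>
    <StdAnswer>
      <Value>华文新魏</Value>
      <ValueDesc>Title font incorrect; should be 华文新魏.</ValueDesc>
    </StdAnswer>
    <OperatorParam ValueType="0">0</OperatorParam>
  </MarkRule>
</ScorePointDefine>
"""


def score_criterion(xml_text, observed):
    """Return the points earned for one criterion.

    `observed` maps (target text, property key) -> value found in the output
    document. It is a hypothetical stand-in for a real document parser.
    ASSUMPTION: every rule is compared by exact equality.
    """
    root = ET.fromstring(xml_text)
    earned = 0.0
    for rule in root.iter("MarkRule"):
        find = rule.find("FindParams")
        expected = rule.findtext("StdAnswer/Value")
        actual = observed.get((find.get("Find"), find.get("PropKey")))
        if actual == expected:
            earned += float(rule.get("FullScore"))
    return earned


# Made-up observations: one matching font, one mismatching font.
print(score_criterion(CRITERION_XML, {("指标体系构建", "Font"): "华文新魏"}))
print(score_criterion(CRITERION_XML, {("指标体系构建", "Font"): "宋体"}))
```

Summing the earned points over all criteria of a task gives its score on the 100-point scale, which is how partial credit arises in a scheme like this.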

The evaluation engine is implemented as a command-line tool: users submit an output document and task ID, and receive a structured JSON result containing the total score and per-criterion pass/fail details. As a scoring sanity check, we collected community reference solutions from online NCRE study communities for all 200 tasks. These solutions, created informally by exam candidates with no guarantee of correctness, achieve an average score of 95.5% (89 out of 200 perfect scores). This indicates that high-scoring solutions exist for the tasks and that the scoring engine produces expected results on well-formed documents. Manual inspection of low-scoring community reference solutions suggests that much of the remaining gap comes from imperfections in the community answers themselves (see Appendix [D.5](https://arxiv.org/html/2606.10956#A4.SS5)).

### 3.4 Taxonomy of Office skills

Based on the criterion type annotations in the scoring configurations, we organize OfficeEval’s 7,118 scoring criteria into 8 skill categories that span all three applications. Table [1](https://arxiv.org/html/2606.10956#S3.T1) presents this taxonomy with the distribution of criteria across the full dataset.

Table 1: Taxonomy of Office skills in OfficeEval. All 7,118 scoring criteria across 200 tasks are organized into 8 categories based on their criterion types. The “Apps” column indicates which applications contribute criteria to each category.

This taxonomy enables per-skill capability analysis beyond aggregate scores (Section [5.2](https://arxiv.org/html/2606.10956#S5.SS2)).

### 3.5 Dataset statistics

Table [2](https://arxiv.org/html/2606.10956#S3.T2) presents the overall statistics of OfficeEval. The benchmark comprises 200 tasks across two difficulty levels with a total of 7,118 scoring criteria, providing dense evaluation coverage.

Table 2: Dataset statistics of OfficeEval. Each task is scored on a 100-point scale with fine-grained criteria. Level 1 tasks assess foundational skills; Level 2 tasks cover advanced operations.

Level 1 tasks have more criteria per task (41.7 avg.) with simpler checks, while Level 2 tasks have fewer (30.8 avg.) but more complex compound rules. With 7,118 independent evaluation signals across 200 tasks, the evaluation density is substantially higher than benchmarks using binary pass/fail.

## 4 Experimental setup

### 4.1 Models

We evaluate 7 frontier LLMs: four proprietary/API systems (Claude Opus 4.7 (Anthropic, [2026](https://arxiv.org/html/2606.10956#bib.bib8)), GPT-5.5 (OpenAI, [2026b](https://arxiv.org/html/2606.10956#bib.bib6)), Gemini 3.1 Pro (Google DeepMind, [2026](https://arxiv.org/html/2606.10956#bib.bib9)), Grok-4.1-fast (xAI, [2025](https://arxiv.org/html/2606.10956#bib.bib33))) and three open-weight models (Kimi-K2.6 (Moonshot AI, [2026](https://arxiv.org/html/2606.10956#bib.bib22)), Qwen3.5-397B-A17B (Qwen Team, Alibaba, [2026](https://arxiv.org/html/2606.10956#bib.bib21)), MiMo-V2.5 (Xiaomi AI Lab, [2026](https://arxiv.org/html/2606.10956#bib.bib32))). Because task instructions often contain reference images (Section [3.2](https://arxiv.org/html/2606.10956#S3.SS2)), all models must support multimodal input, which excludes competitive text-only LLMs such as DeepSeek-R1 (Guo et al., [2025](https://arxiv.org/html/2606.10956#bib.bib10)) and GPT-5.3-Codex-Spark (OpenAI, [2026a](https://arxiv.org/html/2606.10956#bib.bib7)). For the open-weight group, we selected three multimodal models available through our evaluation infrastructure at the time of evaluation to maximize the chance of observing high Office performance.

### 4.2 Agent architecture

We evaluate code-based methods in two settings: a *single-turn LLM baseline* that probes the underlying model directly with a single API call, and *autonomous coding agents* that exercise the model within an iterative scaffold (Wang et al., [2024a](https://arxiv.org/html/2606.10956#bib.bib13)). In both settings the model produces executable code that manipulates documents, rather than interacting with a GUI.

#### Single-turn LLM baseline.

This is a single API call with no agent loop. The model receives pre-processed task instructions, document screenshots, and the input document path, and produces Python code in one pass using standard Office libraries (python-docx, openpyxl, python-pptx). There is no execution feedback, no retry, and no tool use beyond emitting code. This setting probes single-turn code generation under a fixed library-constrained automation interface.

#### Coding agent.

We evaluate two autonomous agents: Claude Code (CC), powered by Claude Opus 4.7 (Anthropic, [2026](https://arxiv.org/html/2606.10956#bib.bib8)), and Codex, powered by GPT-5.5 (OpenAI, [2026b](https://arxiv.org/html/2606.10956#bib.bib6)). Each iteratively writes, executes, and debugs code with no turn limit and unrestricted tool access. They receive only the raw exam paper and input documents, the same materials given to a human candidate sitting the NCRE exam, without pre-processed descriptions, screenshots, or scoring rubrics. This setting is a stronger system comparison, not an ablation of a single factor. It differs from the single-turn baseline in execution feedback, repair budget, agent scaffolding, and access to Office automation tools such as COM.

### 4.3 Evaluation metrics

Our primary metric is Score Rate (SR): the unweighted macro-average score across the 200 extracted practical-operation tasks, expressed as a percentage of the 100-point maximum: $\text{SR} = \frac{1}{N}\sum_{i=1}^{N} s_i\;\%$, where $s_i$ is the score on task $i$ and $N = 200$ is the number of tasks. Since each task is scored on a 100-point scale and contributes equally to the average, SR is numerically equal to the mean task score. Application-, level-, and skill-specific scores are reported as analytical breakdowns; they are not reweighted to match the full NCRE exam composition, and SR remains the single metric used for overall model comparison.
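The SR definition above amounts to a plain mean of per-task scores. A minimal Python sketch, with a hypothetical function name and made-up scores:

```python
def score_rate(task_scores):
    """Unweighted macro-average of per-task scores, each on a 0-100 scale.

    Because every task has the same 100-point maximum, the mean score is
    already a percentage.
    """
    if not task_scores:
        raise ValueError("need at least one task score")
    for s in task_scores:
        if not 0.0 <= s <= 100.0:
            raise ValueError(f"task score out of range: {s}")
    return sum(task_scores) / len(task_scores)


# Made-up scores for four tasks (not data from the paper).
example_scores = [100.0, 40.0, 0.0, 62.5]
print(score_rate(example_scores))  # 50.625
```

Each task contributes equally regardless of how many criteria it contains, so a task with 70 criteria carries the same weight as one with 20.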

## 5 Results

### 5.1 Main results

Table [3](https://arxiv.org/html/2606.10956#S5.T3) presents our main results. The best single-turn model, Claude Opus 4.7, reaches only 36.6%, followed by GPT-5.5 at 36.2%. Both are far below the 60-point reference threshold used to contextualize this extracted practical subset and below the 95.5% community-reference score (Section [3.3](https://arxiv.org/html/2606.10956#S3.SS3)). All numbers are means across 3 independent runs (SD 0.4–2.1pp on overall SR). A paired $t$-test across 200 tasks confirms that the top two models, Claude Opus 4.7 and GPT-5.5, are not statistically distinguishable ($p = 0.82$; bootstrap 95% CI for the difference: $[-3.5, 4.4]$ pp).
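The text does not say how the bootstrap interval was computed. As one plausible reading, the sketch below shows a percentile bootstrap over paired per-task score differences, using only the Python standard library. The function name, the resample count, and the seed are assumptions made for illustration.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-task difference (a - b).

    Tasks are resampled with replacement, keeping each task's two scores
    paired. ASSUMPTION: this is a generic procedure, not necessarily the one
    used in the paper.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same tasks")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi


# Made-up per-task scores for two models on five tasks.
model_a = [80.0, 35.5, 0.0, 62.0, 41.0]
model_b = [75.0, 40.0, 0.0, 58.5, 47.0]
mean_diff, ci_lo, ci_hi = paired_bootstrap_ci(model_a, model_b)
print(mean_diff, ci_lo, ci_hi)
```

An interval that straddles zero, as the reported one does, is consistent with the two models being statistically indistinguishable on this task set.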

Table 3: Main results on OfficeEval. SR%: Score Rate per application and level (mean of 3 runs). Exec%: fraction of tasks where the generated code runs without error. L1 = Level 1; L2 = Level 2. Bold = best per column.

Key observations:

- All single-turn models remain weak on the extracted practical subset. Even on the strongest category (Excel L1, GPT-5.5: 61.7% SR), nearly 40% of task score is lost. Overall SR ranges from 2.8% to 36.6%, and the proprietary/API group contains both the top systems and the weakest system (Grok-4.1-fast). Most models degrade substantially on Level 2, with drops more pronounced for stronger models.
- Under the single-turn library-constrained setting, Excel is the most tractable and PowerPoint the hardest. Excel tasks are dominated by formulas and tabular data manipulation, where the operation maps cleanly to a small, well-known set of programmatic primitives. PowerPoint tasks more often hinge on animation effects, transitions, and theme/master constants whose exact names and parameters are not stated in the instruction and must be recalled from the model’s internal knowledge of the Office object model.
- Code reliability is a major bottleneck. The best model’s Exec% is only 61.5 (Table [3](https://arxiv.org/html/2606.10956#S5.T3)), meaning the remaining 38.5% of programs crash before producing output. Among the tasks that do run, the implied conditional SR is roughly 60% (overall SR divided by Exec%), showing that accuracy remains a major problem even when code executes.
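The "roughly 60%" figure in the last observation can be reproduced with a one-line calculation. The sketch below assumes, as the text implies, that crashed programs contribute zero points, so overall SR equals the execution rate times the SR among tasks that ran. The function name is hypothetical.

```python
def conditional_sr(overall_sr_pct, exec_rate_pct):
    """SR among tasks whose code ran, assuming crashed tasks score zero."""
    if not 0.0 < exec_rate_pct <= 100.0:
        raise ValueError("exec rate must be in (0, 100]")
    return overall_sr_pct / (exec_rate_pct / 100.0)


# Values quoted in the text: 36.6% overall SR and 61.5% Exec%.
print(round(conditional_sr(36.6, 61.5), 1))  # 59.5
```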

### 5.2 Skill-level analysis

Figure [4](https://arxiv.org/html/2606.10956#S5.F4) shows the per-skill criterion pass rate for all seven evaluated models. Models exhibit distinct capability profiles across the 8 skill categories.

![Refer to caption](https://arxiv.org/html/2606.10956v1/figures/radar_chart.png)

Figure 4: Per-skill criterion pass rate (%) for all seven evaluated models. Tables, Charts, and Data & Formulas contain the strongest peaks among top models; Animation and Graphics & Media are the weakest. See Table [10](https://arxiv.org/html/2606.10956#A3.T10) for full results.

Full per-skill results appear in Table [10](https://arxiv.org/html/2606.10956#A3.T10) (Appendix). Key patterns:

- Structured, explicitly specified operations are strongest. Tables (59.1% for Claude Opus 4.7), Charts (51.4% for GPT-5.5), and Data & Formulas (48.5% for GPT-5.5) reach the highest category-level pass rates. These categories rely more on explicit task parameters and well-scoped document structures, reducing the implementation knowledge gap.
- Animation and Graphics & Media are weakest (26.7% and 29.1% at best). These require exact enumeration names and visual property values absent from the task text; the implementation knowledge gap is widest here.
- Weaker models show flat profiles: open-weight models’ pass rates vary little across categories (e.g., Qwen3.5-397B-A17B: 2.3–11.1%), suggesting failure at a more basic level before category-specific challenges become relevant.

The skill profiles also help explain the per-application gains from coding agents (Table [4](https://arxiv.org/html/2606.10956#S5.T4)). PowerPoint sees the largest agent improvement (Codex: +43.8pp over single-turn GPT-5.5), which aligns with Animation and Graphics & Media being the weakest single-turn categories: these skills rely heavily on exact enumeration constants that python-pptx often lacks, but that COM automation exposes directly. Excel gains are also large (+26.3pp), consistent with Data & Formulas and Charts already being the strongest single-turn categories; the stronger agentic setting recovers from many crashes that suppress single-turn scores. Word gains are the smallest for Claude Code (+0.7pp) despite Word’s Text & Format category being reasonably strong (46.6%); this suggests that Word’s bottleneck is implementation knowledge (exact font names, style IDs, page-setup constants) rather than crashes, a bottleneck that iterative repair alone cannot address.

### 5.3 Coding agent results

To measure a stronger autonomous setting, we evaluate two coding agents: Claude Code (CC), powered by Claude Opus 4.7, and Codex, powered by GPT-5.5. This setting combines execution feedback, iterative repair, agent scaffolding, and unrestricted tool access, with a one-hour time limit per task. The agents receive only the raw exam paper and input documents, without pre-processed descriptions, screenshots, or scoring rubrics. The reported turn counts (Table [4](https://arxiv.org/html/2606.10956#S5.T4)) reflect natural convergence within the time limit, not an imposed turn cap.

Table 4: System-level comparison between the single-turn LLM baseline and autonomous coding agents. SR% is reported per application (averaged across both levels) and overall; Exec% is the code execution success rate; Turns is the average number of agent interaction rounds.

The best coding agent (Codex, GPT-5.5) achieves 68.8% overall SR, a 32.6pp improvement over its single-turn counterpart. Claude Code (Claude Opus 4.7) reaches 53.0%, a 16.4pp gain over its single-turn baseline. Both agents close a substantial portion of the gap to the community-reference score (95.5%), but neither closes it fully. In this stronger agentic setting, execution success rises to 98–99% for both agents. Both agents also exercise the broader tool access of this setting. For example, they drive Office through win32com COM automation in addition to the standard libraries, which is particularly impactful for PowerPoint operations.

These results should be read as a comparison between two system settings, not as proof that iterative debugging alone causes the improvement. The coding-agent setting changes several things at once: execution feedback, repair budget, agent scaffolding, and the range of tools the model is allowed to invoke. The single-turn baseline, by design, exercises only the libraries provided in its prompt; separating the effects of feedback, broader tool access, and agent scaffolding remains future work. Despite these gains, the best coding agent still remains below the community-reference score (95.5%). Once many code-execution failures are fixed, models still lose substantial credit on fine-grained Office details, such as exact style names, color values, chart layouts, animation constants, and XML properties.

## 6 Error analysis

We analyze failure modes of the best single-turn model (Claude Opus 4.7) and categorize common issues:

1. Code execution failure: hallucinated API methods or type errors crash the program.
2. Missing or misunderstood operation: required operations are skipped, misunderstood, or only partially completed (e.g., formatting only the first paragraph).
3. Implementation-knowledge error: the generated code targets the right high-level operation but uses the wrong API parameter, built-in style name, color value, chart layout, animation constant, or XML property.
4. Cascading failure: an upstream missing object or incorrect transformation causes downstream criteria to fail.
5. Library limitation: the intended operation is correct but unsupported by the selected library (e.g., certain PowerPoint animations, advanced mail merge).

Code execution failure accounts for most zero\-score outcomes \(23\.7% of tasks averaged across three runs\), disproportionately affecting PPT \(29\.5%\) and Excel \(26\.7%\) over Word \(14\.6%\)\. To understand the errors beyond crashes, we annotate failed scoring criteria for Claude Opus 4\.7 single\-turn \(run 1\) and the Codex coding agent\. For Claude Opus 4\.7, we use execution status and generated\-code keyword matching to separate operations that were attempted but wrong from operations that were missing\. For Codex, the classification relies mainly on scorer errors and final outputs\. When the target object exists but the exact property is wrong, we label the error as implementation knowledge\. This is a diagnostic labeling scheme, not a definitive manual audit; it may over\-count implementation\-knowledge errors for Codex and under\-count library limitations when unsupported operations are simply skipped\.

Table [5](https://arxiv.org/html/2606.10956#S6.T5) suggests a shift in the error profile under our diagnostic labeling scheme. The table uses weighted failure units from the error-taxonomy script, not the 200-task × 100-point SR denominator, so the percentages should be read only as a breakdown of annotated errors within each setting. In the single-turn setting, execution failures account for 51.8% of all weighted loss; among non-crash losses, implementation-knowledge errors are already the largest category (61.7%). In the coding-agent setting, execution failures drop to 7.9% of all weighted loss, while 97.4% of non-crash weighted loss is labeled as implementation knowledge. The larger weighted implementation-knowledge loss for the coding agent (1,336.8 vs. 859.5) should not be read as degradation: after crashes are fixed, many fine-grained property errors become visible instead of being hidden by execution failure.
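
The relationship between the two kinds of percentages can be illustrated with a back-of-envelope calculation. The sketch below backs out approximate totals from the shares and implementation-knowledge losses quoted above, assuming that non-crash loss equals all loss minus execution-failure loss. The derived totals are rounded estimates that the paper does not report, and the function name `implied_totals` is purely illustrative.

```python
def implied_totals(ik_loss, ik_share_noncrash, exec_share_all):
    """Back out approximate weighted-loss totals from reported shares.

    ik_loss            -- weighted implementation-knowledge (IK) loss
    ik_share_noncrash  -- IK share of non-crash weighted loss (0..1)
    exec_share_all     -- execution-failure share of all weighted loss (0..1)
    Assumes: all loss = execution-failure loss + non-crash loss.
    """
    noncrash = ik_loss / ik_share_noncrash
    total = noncrash / (1.0 - exec_share_all)
    return noncrash, total, total - noncrash


single = implied_totals(859.5, 0.617, 0.518)   # single-turn figures quoted above
agent = implied_totals(1336.8, 0.974, 0.079)   # coding-agent figures quoted above

for name, (noncrash, total, crash) in (("single-turn", single), ("coding agent", agent)):
    print(f"{name}: non-crash ~{noncrash:.0f}, crash ~{crash:.0f}, total ~{total:.0f}")
```

Under these assumptions the implied total weighted loss is lower for the coding agent even though its implementation-knowledge component is larger, which is consistent with the reading that repaired crashes expose property-level errors rather than create them.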

Table 5: Failure taxonomy over failed scoring criteria. W. loss denotes weighted loss computed from criterion-level annotation weights and is used only for relative error decomposition within this table; it is not the task-normalized SR denominator. “All loss” uses all failed criteria in the setting; “Non-crash loss” excludes execution-failure rows.

Table [6](https://arxiv.org/html/2606.10956#S6.T6) breaks down implementation-knowledge errors. The dominant subtypes are not semantic misunderstandings of the instructions, but low-level representation mismatches: Open XML property paths, enumeration constants, color/theme encodings, numeric units, SmartArt/shape properties, and chart layouts. Built-in template mismatches are a recurring example: tasks requiring named templates (e.g., “Blank (Three Columns)” footer) are reconstructed from scratch instead of invoked by name. Formula cascading failures and chart property imprecision further reduce scores on otherwise strong Excel tasks (see Figure [5](https://arxiv.org/html/2606.10956#S6.F5) and Appendix for detailed examples).

Table 6: Dominant implementation-knowledge subtypes. W. loss denotes weighted loss. Shares are within implementation-knowledge weighted loss for each setting; rows show the six largest subtypes by Codex-agent weighted loss.

![Refer to caption](https://arxiv.org/html/2606.10956v1/figures/error_cases.png)

Figure 5: Representative failure cases from OfficeEval, using Claude Opus 4.7 outputs. (a) A Word task where the model reconstructs a footer manually instead of using Word’s built-in template. (b) A zero-score PPT task where the model hallucinates a non-existent API, crashing the entire script. (c) An Excel task where all formulas are correct but chart properties are wrong, costing 13 points.

#### Coding agent failure modes.

The coding\-agent setting greatly reduces crashes \(98–99% execution success vs\. 57–62% single\-turn\), but introduces iterative regression: the best agent \(Codex\) scores worse than its single\-turn counterpart on 23 tasks \(11\.5%\), sometimes corrupting working documents through successive modifications\. This trade\-off is inherent to repair\-based workflows: execution\-time correction can recover crashed tasks but risks undoing correct work\.

## 7 Discussion and conclusion

#### Single\-turn vs\. agentic systems\.

The gap between single-turn (36.6%) and coding agents (68.8%) maps onto a design choice in real AI-assisted workflows. Single-turn code generation is a one-shot assistance paradigm: the AI produces output in one pass. Stronger agentic systems attempt autonomous task completion through execution feedback, repair budget, scaffolding, and broader tool access. The one-shot setting fails on long-horizon Office tasks; many generated programs crash without producing output (execution success is only 57–62% in this setting, versus 98–99% for the coding agents). The stronger agentic setting recovers from many of these crashes, but at the cost of regression on already-correct work. For deployment, reliable Office automation likely needs closed-loop execution with safeguards against regression and careful control over tool access.
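
One possible form of such a safeguard is a repair loop that keeps the best document seen so far and discards any candidate that passes fewer checks. The following is a minimal sketch, not the mechanism used by either evaluated agent. Both `repair` and `passed_checks` are hypothetical callables; since the agents never see the scoring rubric, `passed_checks` would in practice have to be a set of self-derived invariants rather than the benchmark's criteria.

```python
from typing import Callable, Optional, Tuple


def guarded_repair(
    initial: str,
    repair: Callable[[str], Optional[str]],
    passed_checks: Callable[[str], int],
    max_rounds: int = 5,
) -> Tuple[str, int]:
    """Iteratively repair a document, never keeping a candidate that passes fewer checks."""
    best, best_score = initial, passed_checks(initial)
    for _ in range(max_rounds):
        candidate = repair(best)
        if candidate is None:        # nothing more to try
            break
        score = passed_checks(candidate)
        if score >= best_score:      # accept only non-regressing edits
            best, best_score = candidate, score
        # otherwise roll back: `best` is left untouched
    return best, best_score


# Toy usage: a "document" is a string, and a check passes if a required token is present.
required = ("title", "footer", "chart")
checks = lambda doc: sum(tok in doc for tok in required)
proposals = iter(["title+footer", "footer", "title+footer+chart"])  # the second one regresses
best, score = guarded_repair("title", lambda doc: next(proposals, None), checks)
print(best, score)
```

The design choice here is to trade some repair freedom for monotonicity: a round that breaks previously correct work is simply dropped, at the cost of possibly rejecting an edit that is a necessary intermediate step.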

#### Code generation as a proxy\.

Programmatic automation is a powerful and scalable proxy for many Office operations, especially when COM exposes the relevant object\-model controls\. However, GUI\-based agents with visual feedback may access UI\-level affordances, galleries, theme assets, and layout heuristics that are difficult to reproduce through library calls or COM constants alone\. A direct comparison between programmatic and GUI\-based Office agents remains an important direction for future work\.

#### Why Office tasks are hard for AI\.

Our analysis shows qualitative differences between human and AI approaches: (1) Style galleries vs. API constants. Humans select built-in templates and effects from visual galleries in 2–3 clicks; AI agents must produce exact internal constants (color hex values, style IDs, enumeration names) that appear *nowhere in the task description*. Our criterion-level taxonomy shows that this implementation-knowledge gap remains a persistent bottleneck after many crashes are repaired. (2) Limited interactive visual feedback. Humans receive continuous visual feedback while editing. The evaluated systems receive either static screenshots in the single-turn setting or raw documents in the coding-agent setting, but they do not receive the same continuous, closed-loop visual confirmation after each operation. This helps explain why visually specified categories (Animation: 26.7%, Graphics & Media: 29.1%) lag far behind more structured categories with explicit parameters (Tables: 59.1%, Charts: 51.4%, Data & Formulas: 48.5%). (3) Cascading failures. In multi-step tasks, if an early operation fails, all downstream operations score zero. Humans avoid this through sequential visual confirmation. (4) Iterative regression. Coding agents can *score worse* than single-turn on some tasks: iterative debugging sometimes corrupts previously correct output, a failure mode absent in human workflows.

#### Skill\-augmented agents\.

Recent work on skill-based agent architectures (Li et al., [2026](https://arxiv.org/html/2606.10956#bib.bib34); Liu et al., [2026](https://arxiv.org/html/2606.10956#bib.bib36)) shows that equipping agents with curated, reusable skill libraries can substantially improve task completion. Our error analysis provides a concrete case for this direction in the Office domain. Table [5](https://arxiv.org/html/2606.10956#S6.T5) shows that, under this diagnostic labeling scheme, 97.4% of non-crash weighted loss for the best coding agent is labeled as implementation knowledge, and Table [6](https://arxiv.org/html/2606.10956#S6.T6) further breaks this down into actionable subtypes: OOXML property paths (25.8%), enumeration constants (22.4%), color/theme encodings (14.8%), and numeric units (11.8%). These are the kinds of errors that pre-defined skills may help address. For instance, a skill for “set slide transition” could encapsulate the exact ppTransitionType enumeration value, the effect-option mapping, and the duration unit conversion, knowledge that current models must guess from memory. Per-skill pass rates (Table [10](https://arxiv.org/html/2606.10956#A3.T10)) suggest where such skills would yield the highest return: Animation (best model: 26.7%) and Graphics & Media (29.1%) are the weakest categories, while Tables (59.1%) and Data & Formulas (48.5%) are already comparatively strong. Designing and evaluating a skill library for Office automation, and measuring how much of the implementation-knowledge bottleneck it can close, is a promising direction for future work.
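
To make the idea of a “set slide transition” skill concrete, here is a minimal sketch of how such a skill might resolve a UI-level request into exact parameters. Everything in it is an assumption for illustration: the `TransitionSpec` type, the lookup table, and the function name are hypothetical, the constant names are placeholders that would need to be verified against the Office object-model documentation, and the millisecond-to-second conversion assumes the automation layer expects seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransitionSpec:
    effect_constant: str      # name of the constant to hand to the automation layer
    duration_seconds: float   # assumed unit expected by the automation layer


# Hypothetical curated table: (UI transition name, effect option) -> constant name.
# These names are illustrative placeholders and must be checked against official docs.
_TRANSITIONS = {
    ("Blinds", "Horizontal"): "ppEffectBlindsHorizontal",
    ("Blinds", "Vertical"): "ppEffectBlindsVertical",
}


def set_slide_transition_skill(name: str, option: str, duration_ms: int = 1000) -> TransitionSpec:
    """Resolve a UI-level transition request into exact parameters, or fail loudly."""
    try:
        constant = _TRANSITIONS[(name, option)]
    except KeyError:
        raise ValueError(f"no curated mapping for transition {name!r} / {option!r}") from None
    return TransitionSpec(constant, duration_ms / 1000.0)


spec = set_slide_transition_skill("Blinds", "Horizontal", duration_ms=1500)
print(spec)
```

The point of the sketch is the failure behavior: an unknown combination raises an error instead of letting the model guess a constant, which is the kind of guess the error taxonomy labels as an implementation-knowledge error.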

#### Cross\-language analysis\.

A natural question is whether the performance gap stems primarily from the difficulty of the Office tasks themselves or from models’ weaker handling of Chinese-language task materials. To investigate, we construct an English-translated variant of all 200 tasks, including document content, scoring criteria, and font/style mappings, and re-evaluate 5 models under the same single-turn setting. Table [7](https://arxiv.org/html/2606.10956#S7.T7) reports the results.

Table 7: Cross-language comparison (single-turn SR%). EN = English-translated tasks; ZH = original Chinese tasks. Claude Opus 4.7 EN is the mean of 2 runs. Grok-4.1-fast and Kimi-K2.6 are omitted due to API unavailability at the time of the English experiment.

Three patterns emerge. (1) The translated variant does not remove the difficulty for strong models. Claude Opus 4.7 performs comparably in both languages (34.7% EN vs. 36.6% ZH, Δ = −1.9pp), and GPT-5.5 slightly improves (+3.8pp). This suggests that the best models’ failures are driven substantially by Office-specific implementation knowledge, not only by Chinese instruction comprehension. (2) Mid-tier and open-weight models benefit substantially from English. Gemini 3.1 Pro improves by +6.1pp, Qwen3.5 by +8.8pp, and MiMo by +8.3pp when switching to English, indicating that Chinese instruction understanding is a meaningful secondary bottleneck for these models. (3) PowerPoint shows the largest translated-variant gains. Nearly all models gain 10–17pp on PPT in English. PPT operations reference named constants (e.g., “Emphasis/Teeter,” “Medium Style 2”) that originate from English-language Office APIs; English translations and font/style mappings align more directly with these identifiers. Crucially, even on English tasks, the best single-turn model reaches only 40.0% (GPT-5.5), still well below the 60-point reference threshold for the extracted practical subset. This suggests that English translation can affect some models and applications, especially PowerPoint, but the task difficulty remains substantial in the translated variant.

#### Limitations\.

(1) The English experiment (Table [7](https://arxiv.org/html/2606.10956#S7.T7)) uses translated NCRE tasks rather than a natively English certification (e.g., MOS). Translation artifacts, such as font-name substitutions and style-constant mappings, may introduce confounds that a native English exam would avoid. (2) We evaluate code-generation agents only; GUI-based agents with visual feedback might perform differently, particularly on layout-heavy tasks. A direct comparison remains future work. (3) The coding-agent comparison changes several factors at once: execution feedback, repair budget, agent scaffolding, and COM access. Future ablations should isolate these factors. (4) Community reference solutions (95.5%) provide a useful scoring sanity check but are not a formal human baseline. NCRE materials may also appear in public training data; the direction and magnitude of possible contamination effects are difficult to determine. (5) NCRE materials remain copyrighted and are not redistributed by the authors. Exact replication therefore requires researchers to obtain the same materials independently. At the same time, this is what gives the study its exam anchor: the tasks were written and calibrated outside this work. We document the selection protocol, task statistics, scoring taxonomy, evaluation pipeline, prompts, and criterion-level aggregate results to make the study as inspectable as possible (Section [3.1](https://arxiv.org/html/2606.10956#S3.SS1)).

#### Conclusion\.

On this extracted practical-operation subset of a standardized Office proficiency exam, no single-turn LLM earns more than 36.6% of the officially allocated points. A stronger autonomous coding-agent setting reaches 53.0–68.8%, but remains below the 95.5% community-reference score. OfficeEval uses a real certification exam to track how close code-generating LLM and agent systems are to reliable, fine-grained Office document automation.

Authors. Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao, Yupan Huang, Wenshan Wu, Xiangyang Zhou, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Furu Wei

## References

- Anthropic (2026) Introducing Claude Opus 4.7. External Links: [Link](https://www.anthropic.com/news/claude-opus-4-7). Cited by: [§4.1](https://arxiv.org/html/2606.10956#S4.SS1.p1.1), [§4.2](https://arxiv.org/html/2606.10956#S4.SS2.SSS0.Px2.p1.1).
- R. Bonatti, D. Zhao, F. Bonacci, D. Dupont, S. Abdali, Y. Li, Y. Lu, J. Wagle, K. Koishida, A. Bucker, et al. (2024) Windows agent arena: evaluating multi-modal os agents at scale. arXiv preprint arXiv:2409.08264. Cited by: [§2](https://arxiv.org/html/2606.10956#S2.p1.1).
- Y. Chen, Y. Yuan, Z. Zhang, Y. Zheng, J. Liu, F. Ni, J. Hao, H. Mao, and F. Zhang (2025) SheetAgent: towards a generalist agent for spreadsheet reasoning and manipulation via large language models. In Proceedings of the ACM on Web Conference 2025, pp. 158–177. Cited by: [§2](https://arxiv.org/html/2606.10956#S2.p2.1).
- X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su (2023) Mind2web: towards a generalist agent for the web. Advances in Neural Information Processing Systems 36, pp. 28091–28114. Cited by: [§2](https://arxiv.org/html/2606.10956#S2.p1.1).
- Google DeepMind (2026) Gemini 3.1 Pro model card. External Links: [Link](https://deepmind.google/models/model-cards/gemini-3-1-pro/). Cited by: [§4.1](https://arxiv.org/html/2606.10956#S4.SS1.p1.1).
- D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: [§4.1](https://arxiv.org/html/2606.10956#S4.SS1.p1.1).
- Y. Guo, Z. Zhang, Y. Liang, D. Zhao, and N. Duan (2024) Pptc benchmark: evaluating large language models for powerpoint task completion. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 8682–8701. Cited by: [§2](https://arxiv.org/html/2606.10956#S2.p2.1).
- Z. Huang, X. Liu, T. Hu, K. Zhang, and Y. Liu (2025) PPTBench: towards holistic evaluation of large language models for powerpoint layout and design understanding. arXiv preprint arXiv:2512.02624. Cited by: [§2](https://arxiv.org/html/2606.10956#S2.p2.1).
- ISO/IEC (2016) ISO/IEC 29500-1:2016: information technology – document description and processing languages – office open XML file formats – part 1: fundamentals and markup language reference. ISO/IEC 29500-1:2016. External Links: [Link](https://www.iso.org/standard/71691.html). Cited by: [§D.3](https://arxiv.org/html/2606.10956#A4.SS3.p1.1), [§3.3](https://arxiv.org/html/2606.10956#S3.SS3.p1.1).
- C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan (2024) Swe-bench: can language models resolve real-world github issues? In International Conference on Learning Representations, Vol. 2024, pp. 54107–54157. Cited by: [§1](https://arxiv.org/html/2606.10956#S1.p1.1), [§2](https://arxiv.org/html/2606.10956#S2.p1.1).
- H. Li, J. Su, Y. Chen, Q. Li, and Z. Zhang (2023) Sheetcopilot: bringing software productivity to the next level through large language models. Advances in Neural Information Processing Systems 36, pp. 4952–4984. Cited by: [§2](https://arxiv.org/html/2606.10956#S2.p2.1).
- X. Li, W. Chen, Y. Liu, S. Zheng, X. Chen, Y. He, Y. Li, B. You, H. Shen, J. Sun, et al. (2026) SkillsBench: benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670. Cited by: [§7](https://arxiv.org/html/2606.10956#S7.SS0.SSS0.Px4.p1.1).
- X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, et al. (2024) Agentbench: evaluating llms as agents. In International Conference on Learning Representations, Vol. 2024, pp. 52989–53046. Cited by: [§2](https://arxiv.org/html/2606.10956#S2.p1.1).
- Y. Liu, J. Ji, L. An, T. Jaakkola, Y. Zhang, and S. Chang (2026) How well do agentic skills work in the wild: benchmarking llm skill usage in realistic settings. arXiv preprint arXiv:2604.04323. Cited by: [§7](https://arxiv.org/html/2606.10956#S7.SS0.SSS0.Px4.p1.1).
- Z. Ma, B. Zhang, J. Zhang, J. Yu, X. Zhang, X. Zhang, S. Luo, X. Wang, and J. Tang (2024) Spreadsheetbench: towards challenging real world spreadsheet manipulation. Advances in Neural Information Processing Systems 37, pp. 94871–94908. Cited by: [§2](https://arxiv.org/html/2606.10956#S2.p2.1).
- G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom (2024) Gaia: a benchmark for general ai assistants. In International Conference on Learning Representations, Vol. 2024, pp. 9025–9049. Cited by: [§1](https://arxiv.org/html/2606.10956#S1.p1.1), [§2](https://arxiv.org/html/2606.10956#S2.p1.1).
- Microsoft (2023) Welcome to the open XML SDK for office. External Links: [Link](https://learn.microsoft.com/en-us/office/open-xml/open-xml-sdk). Cited by: [§D.3](https://arxiv.org/html/2606.10956#A4.SS3.p1.1), [§3.3](https://arxiv.org/html/2606.10956#S3.SS3.p1.1).
- Moonshot AI (2026) Kimi K2.6: advancing open-source coding to agentic intelligence. External Links: [Link](https://www.kimi.com/blog/kimi-k2-6). Cited by: [§4.1](https://arxiv.org/html/2606.10956#S4.SS1.p1.1).
- J. Mu, C. Zhang, C. Ni, L. Wang, B. Qiao, K. Mathur, Q. Wu, Y. Xie, X. Ma, M. Zhou, et al. (2025) Gui-360: a comprehensive dataset and benchmark for computer-using agents. Cited by: [§2](https://arxiv.org/html/2606.10956#S2.p2.1).
- National Education Examinations Authority (2024a) National computer rank examination: frequently asked questions. External Links: [Link](https://ncre.neea.edu.cn/html1/category/1507/840-1.htm). Cited by: [§1](https://arxiv.org/html/2606.10956#S1.p2.1), [§3.1](https://arxiv.org/html/2606.10956#S3.SS1.p1.1).
- National Education Examinations Authority (2024b) National computer rank examination: results and certificates. External Links: [Link](https://ncre.neea.edu.cn/html1/category/1507/840-2.htm). Cited by: [§3.1](https://arxiv.org/html/2606.10956#S3.SS1.p2.1).
- National Education Examinations Authority (2024c) NCRE level 1 computer basics and MS Office application syllabus (2025 edition). External Links: [Link](https://ncre.neea.edu.cn/res/Home/2412/9ee6dd725a956e088c694df720528cbd.pdf). Cited by: [§3.1](https://arxiv.org/html/2606.10956#S3.SS1.p1.1).
- National Education Examinations Authority (2024d) NCRE level 2 MS Office advanced application and design syllabus (2025 edition). External Links: [Link](https://ncre.neea.edu.cn/res/Home/2412/007b7cbab11984a769c33fdc605b7f85.pdf). Cited by: [§3.1](https://arxiv.org/html/2606.10956#S3.SS1.p1.1).
- OpenAI (2026a) Introducing GPT-5.3-Codex-Spark. External Links: [Link](https://openai.com/index/introducing-gpt-5-3-codex-spark/). Cited by: [§4.1](https://arxiv.org/html/2606.10956#S4.SS1.p1.1).
- OpenAI (2026b) Introducing GPT-5.5. Note: [https://openai.com/index/introducing-gpt-5-5/](https://openai.com/index/introducing-gpt-5-5/). Accessed: 2026-05-18. Cited by: [§4.1](https://arxiv.org/html/2606.10956#S4.SS1.p1.1), [§4.2](https://arxiv.org/html/2606.10956#S4.SS2.SSS0.Px2.p1.1).
- Qwen Team, Alibaba (2026) Qwen3.5-397B-A17B. External Links: [Link](https://huggingface.co/Qwen/Qwen3.5-397B-A17B). Cited by: [§4.1](https://arxiv.org/html/2606.10956#S4.SS1.p1.1).
- W. Wang, D. Han, D. M. Diaz, J. Xu, V. Rühle, and S. Rajmohan (2025) Odysseybench: evaluating llm agents on long-horizon complex office application workflows. arXiv preprint arXiv:2508.09124. Cited by: [§2](https://arxiv.org/html/2606.10956#S2.p2.1).
- X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji (2024a) Executable code actions elicit better llm agents. In Forty-first International Conference on Machine Learning. Cited by: [§4.2](https://arxiv.org/html/2606.10956#S4.SS2.p1.1).
- Z. Wang, Y. Cui, L. Zhong, Z. Zhang, D. Yin, B. Y. Lin, and J. Shang (2024b) Officebench: benchmarking language agents across multiple applications for office automation. arXiv preprint arXiv:2407.19056. Cited by: [§2](https://arxiv.org/html/2606.10956#S2.p2.1).
- xAI (2025) Grok 4.1 Fast and agent tools API. External Links: [Link](https://x.ai/news/grok-4-1-fast). Cited by: [§4.1](https://arxiv.org/html/2606.10956#S4.SS1.p1.1).
- Xiaomi AI Lab (2026) MiMo-V2.5. External Links: [Link](https://huggingface.co/XiaomiMiMo/MiMo-V2.5). Cited by: [§4.1](https://arxiv.org/html/2606.10956#S4.SS1.p1.1).
- T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, et al. (2024) Osworld: benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems 37, pp. 52040–52094. Cited by: [§1](https://arxiv.org/html/2606.10956#S1.p1.1), [§2](https://arxiv.org/html/2606.10956#S2.p1.1), [§2](https://arxiv.org/html/2606.10956#S2.p2.1).
- F. F. Xu, Y. Song, B. Li, Y. Tang, K. Jain, M. Bao, Z. Wang, X. Zhou, Z. Guo, M. Cao, et al. (2026) Theagentcompany: benchmarking llm agents on consequential real world tasks. Advances in Neural Information Processing Systems 38. Cited by: [§2](https://arxiv.org/html/2606.10956#S2.p1.1).
- S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, et al. (2024) Webarena: a realistic web environment for building autonomous agents. In International Conference on Learning Representations, Vol. 2024, pp. 15585–15606. Cited by: [§1](https://arxiv.org/html/2606.10956#S1.p1.1), [§2](https://arxiv.org/html/2606.10956#S2.p1.1).

## Appendix A Task examples

We provide three complete task examples (one per application) illustrating the format, difficulty, and scoring granularity of OfficeEval tasks. All three are Level 1 (foundational) tasks. The original instructions are in Chinese; we provide English translations below each.

### A.1 Word task: First China Online Media Forum (71 criteria)

> 在考生文件夹下,打开文档WORD\.docx,按照要求完成下列操作并以该文件名(WORD\.docx)保存文档。 ⑴将文中所有错词"网罗"替换为"网络";将标题段文字("首届中国网络媒体论坛在青岛开幕")的字体格式设置为三号黑体、加粗、居中、字符宽度调整为17字符,文本渐变填充效果设为"预设渐变/径向渐变 \- 个性色2,类型/路径,颜色/红色(标准色)",文本映像效果设为"预设/映像变体/全映像,4pt偏移量:透明度50%、大小65%、距离2\.5磅"。 ⑵设置页面纸张大小为"16开(18\.4厘米 x 26厘米)";在页面底端插入"椭圆形"样式页码,页脚底端距离2厘米,设置页码编号格式为"壹,贰,叁,……"、起始页码为"贰";编辑文档属性信息:标题/首届中国网络媒体论坛在青岛开幕,作者/A考生,单位/NCRE;在页面顶端插入"空白"型页眉,页眉内容为该文档作者("文档部件/文档属性");将页面颜色的填充效果设置为"纹理/羊皮纸";为页面添加内容为"首届中国网络媒体论坛"文字型水印,水印内容的文本格式为:黑体、蓝色(标准色)。 ⑶将正文各段文字("6月22日,……评选办法等。")设置为12磅方正姚体;第一段首字下沉2行,距正文0\.2厘米;除第一段外的其余各段落(不包括表格)左、右各缩进1\.5字符,首行缩进2字符,段前间距1行;将正文第三段("论坛的主题是……管理和自律。")分为等宽两栏,设置栏宽为15字符,栏间加分隔线;将正文第四段("与会嘉宾……评选办法等。")的段落分页设置为"段中不分页"。 ⑷在表格顶端添加表标题"公众号关注量统计表",并将其设置为小二号华文彩云、加粗、居中、深红(标准色);在表格右侧插入一空列,在该列第一行的单元格中输入列标题"合计",其余各单元格中填入该行各单元格数据的总和(利用表格工具中的公式),按"合计"列依据"数字"类型对表格降序排序;设置表格居中,表格第一行和第一列内容居中、垂直对齐为水平居中,其余单元格内容右对齐;设置表格行高为0\.6厘米、第一列列宽为2\.5厘米、其余列列宽1\.8厘米。 ⑸设置表格第一行底纹颜色为主题颜色"水绿色,个性色5,淡色80%";设置表格外框线和第一行的下框线为红色(标准色)0\.75磅双窄线、其余内框线为红色(标准色)0\.5磅单实线。

English translation:

> Open WORD.docx and complete the following operations: 1. Replace all instances of the typo “网罗” with “网络”; format the title (“First China Online Media Forum Opens in Qingdao”): SimHei 16pt, bold, centered, character width 17; set text gradient fill to “Preset Gradient/Radial – Accent 2, Type/Path, Color/Red (standard)”; text reflection to “Full Reflection, 4pt offset: transparency 50%, size 65%, distance 2.5pt.” 2. Set page size to 16K (18.4cm × 26cm); insert “Oval” page number at bottom, footer distance 2cm, numbering format “壹, 贰, 叁, …” starting at “贰”; set document properties: title/First China Online Media Forum Opens in Qingdao, author/A考生, organization/NCRE; insert “Blank” header with document author (via Document Properties); set page color fill to “Texture/Parchment”; add text watermark “首届中国网络媒体论坛” in SimHei, blue (standard). 3. Format body paragraphs (“6月22日, …评选办法等.”): 12pt FangZheng Yao; first paragraph drop cap 2 lines (0.2cm from text); remaining paragraphs (excluding table): left/right indent 1.5 characters, first-line indent 2 characters, 1-line spacing before; split third paragraph into 2 equal columns (width 15 characters, separator line); set fourth paragraph to “Keep lines together.” 4. Add table caption “公众号关注量统计表” in size 18pt HuaWen CaiYun, bold, centered, dark red (standard); insert blank column on right labeled “合计,” fill with row-sum formulas; sort by “合计” column descending (numeric); center table, center first row and column content vertically and horizontally, right-align remaining cells; row height 0.6cm, first column width 2.5cm, other columns 1.8cm. 5. Shade first row “Turquoise, Accent 5, Lighter 80%”; set outer borders and first-row bottom border to red 0.75pt double line; other inner borders to red 0.5pt single line.

Scoring: 71 criteria spanning text formatting and find/replace (35), table structure and formatting (22), page setup (10), and document properties (4). Each criterion checks a specific document property (e.g., “title font is SimHei,” “page size is 18.4cm × 26cm,” “table row height is 0.6cm”).
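
Several of these criteria depend on unit conversions, which the error analysis lists among the implementation-knowledge subtypes ("numeric units"). As a minimal sketch, the helpers below convert the task's centimeter and point values into units commonly used in Office Open XML: twentieths of a point (twips) for WordprocessingML page sizes and margins, half-points for font sizes, and EMUs for DrawingML extents. The helper names are illustrative, and how the benchmark's scorer rounds or tolerates these values is not stated in the paper.

```python
CM_PER_INCH = 2.54
TWIPS_PER_INCH = 1_440    # 20 twips per point, 72 points per inch
EMU_PER_CM = 360_000      # 914,400 EMU per inch / 2.54 cm per inch


def cm_to_twips(cm: float) -> int:
    """Centimeters to twentieths of a point (page size, margins, row heights)."""
    return round(cm * TWIPS_PER_INCH / CM_PER_INCH)


def pt_to_half_points(pt: float) -> int:
    """Points to half-points (font sizes)."""
    return round(pt * 2)


def cm_to_emu(cm: float) -> int:
    """Centimeters to English Metric Units (drawing extents)."""
    return round(cm * EMU_PER_CM)


print(cm_to_twips(18.4), cm_to_twips(26))  # 16K page width and height
print(cm_to_twips(2), cm_to_twips(0.6))    # footer distance, table row height
print(pt_to_half_points(16))               # title font size
print(cm_to_emu(1))
```

Writing a centimeter value directly into an attribute that expects twips or half-points produces a document that opens without error but fails the corresponding criterion, which is why this class of mistake survives execution feedback.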

### A.2 Excel task: Three-Year Temperature Statistics (20 criteria)

> 打开考生文件夹下的电子表格Excel\.xlsx工作簿文件,按照下列要求完成对此表格的操作并保存。 1\.选择Sheet1工作表,将A1:G1单元格合并为一个单元格,文字居中对齐;利用AVERAGE函数计算近三年每月平均高温的平均值置于"平均高温平均值"列(H4:H15,保留小数点后1位),计算近三年每月平均低温的平均值置于"平均低温平均值"列(I4:I15,保留小数点后1位);利用MAX函数计算近三年每月平均高温的最高值置于"平均高温最高值"列(J4:J15),利用MIN函数计算近三年每月平均低温的最低值置于"平均低温最低值"列(K4:K15)。利用条件格式修饰B4:G15单元格区域,值大于28的单元格设置为"浅红填充色深红色文本",值小于12的单元格设置为"绿填充色深绿色文本";利用条件格式修饰H4:K15单元格区域,基于各自值设置所有单元格的格式为实心填充数据条,自定义颜色(RGB:216,194,246),条形图方向为从左到右。 2\.选取Sheet1工作表中的"月份"列(A3:A15)、"平均高温平均值"列(H3:H15)、"平均低温平均值"列(I3:I15)、"平均高温最高值"列(J3:J15)和"平均低温最低值"列(K3:K15)数据区域的内容建立"折线图",图表样式为"样式3",图表布局为"布局5",垂直坐标轴标题设置为"单位(度)",图表标题为"近三年气温统计图",将图表插入到当前工作表的"A17:H33"单元格区域内,将Sheet1工作表命名为"近三年气温统计表"。 3\.选择"产品销售情况表"工作表,对工作表内数据清单的内容按主要关键字"分公司"的升序和次要关键字"季度"的升序进行排序;对排序后的数据进行筛选,条件:产品名称为电视机、电冰箱、数码相机和空调,且销售额排名小于或等于30,工作表名不变,保存Excel\.xlsx工作簿。

English translation:

> Open Excel.xlsx and complete the following: 1. On Sheet1, merge A1:G1 and center; use AVERAGE to compute three-year mean high temperatures (H4:H15, 1 decimal) and mean low temperatures (I4:I15, 1 decimal); use MAX for highest high temperatures (J4:J15) and MIN for lowest low temperatures (K4:K15). Apply conditional formatting to B4:G15: values > 28 → “Light Red Fill with Dark Red Text,” values < 12 → “Green Fill with Dark Green Text.” Apply data-bar conditional formatting to H4:K15: solid fill, custom color (RGB: 216, 194, 246), left-to-right direction. 2. Create a line chart from columns A3:A15, H3:H15, I3:I15, J3:J15, K3:K15; chart style “Style 3,” layout “Layout 5”; vertical axis title “单位(度)” (“Unit (°C)”); chart title “近三年气温统计图” (“Three-Year Temperature Statistics”); place chart at A17:H33; rename sheet to “近三年气温统计表.” 3. On the “产品销售情况表” (Product Sales) sheet, sort by “分公司” (Branch) ascending then “季度” (Quarter) ascending; filter: product name = TV, refrigerator, digital camera, or air conditioner, and sales rank ≤ 30.

Scoring: 20 criteria covering formulas and cell content (5), conditional formatting (4), chart properties (6), worksheet tab name (1), and data sorting/filtering (4). Representative criteria include: “AVERAGE formula is correct in H4,” “chart type is line chart,” “sheet tab name is 近三年气温统计表.”
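
The custom data-bar color in this task (RGB: 216, 194, 246) is a small example of the color-encoding issue noted in the error analysis: the same color is written differently depending on the interface. The sketch below shows two common encodings, the RRGGBB hexadecimal string used in Open XML color attributes and the integer produced by VBA's `RGB()` function, which places red in the low byte. The helper names are illustrative, and which encoding a given library call expects should be checked against that library's documentation.

```python
def _check(*channels: int) -> None:
    """Reject channel values outside the 0..255 range."""
    for c in channels:
        if not 0 <= c <= 255:
            raise ValueError(f"channel out of range: {c}")


def rgb_to_hex(r: int, g: int, b: int) -> str:
    """RRGGBB hexadecimal string, red first."""
    _check(r, g, b)
    return f"{r:02X}{g:02X}{b:02X}"


def rgb_to_vba_int(r: int, g: int, b: int) -> int:
    """Integer layout of VBA's RGB(): red in the low byte, blue in the high byte."""
    _check(r, g, b)
    return r | (g << 8) | (b << 16)


print(rgb_to_hex(216, 194, 246))      # D8C2F6
print(rgb_to_vba_int(216, 194, 246))  # 16171736
```

Passing the hex-ordered value where the byte-swapped integer is expected (or vice versa) swaps the red and blue channels, which yields a plausible-looking but incorrect color.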

### A.3 PowerPoint task: High Temperature Warning (21 criteria)

> 打开考生文件夹下的演示文稿yswg\.pptx,按照下列要求完成对此文稿的修饰并保存。 1\.为整个演示文稿应用"离子"主题,设置幻灯片的大小为"全屏显示(16:9)",并在幻灯片大小变更时,按比例调整内容以适应新的幻灯片尺寸;放映方式为"观众自行浏览"。 2\.在第一张幻灯片前插入一张版式为"标题幻灯片"的新幻灯片,主标题为"北京河北山东陕西等地7月6日高气温将达40℃",副标题为"高温预警"。 3\.第二张幻灯片版式改为"两栏内容";标题为"高温黄色预警";将考生文件夹下图片文件PPT1\.PNG移到右侧内容区;左侧文本设置为"黑体"、23磅字;图片动画设置为"强调/陀螺旋",效果选项为"数量/半旋转"。 4\.在幻灯片的最后插入一张版式为"空白"的幻灯片,插入一个SmartArt图形,版式为"水平层次结构",SmartArt样式为"砖块场景",SmartArt图形中的所有文字从考生文件夹下的文本文件PPT2\.txt中获取。 5\.在幻灯片的最后插入一张版式为"标题和内容"的新幻灯片,标题为"高温防御指南";内容区插入5行2列的表格,表格样式为"中度样式2"。第1行的1、2列内容依次为"有关单位和人员"和"高温防御措施",其他单元格的内容根据考生文件夹下文本文件PPT3\.txt内容按顺序依次从上到下填写,例如第2行的1、2列内容依次为"媒体"和"应加强防暑降温保健知识的宣传;"。表格内文字均设置为22磅字,并在备注区插入文本"全社会动员起来防御高温"。 6\.全体幻灯片切换方式为"百叶窗",效果选项为"水平"。

English translation:

> Open yswg.pptx and complete the following: 1. Apply the “Ion” theme; set slide size to widescreen (16:9), scaling content to fit; set show type to “Browsed by an Individual.” 2. Insert a new “Title Slide” before slide 1; main title “北京河北山东陕西等地7月6日高气温将达40℃” (“Beijing, Hebei, Shandong, Shaanxi forecast 40°C on July 6”); subtitle “高温预警” (“High Temperature Warning”). 3. Change slide 2 to “Two Content” layout; title “高温黄色预警” (“Yellow Heat Warning”); move PPT1.PNG to right content area; set left text to SimHei 23pt; set image animation to “Emphasis/Spin,” effect option “Amount/Half Spin.” 4. Insert a blank slide at the end; add SmartArt “Horizontal Hierarchy” with “Brick Scene” style; populate text from PPT2.txt. 5. Insert a “Title and Content” slide at the end; title “高温防御指南” (“Heat Defense Guide”); insert a 5 × 2 table with “Medium Style 2”; fill header row with “有关单位和人员” and “高温防御措施,” remaining rows from PPT3.txt; set all table text to 22pt; add note “全社会动员起来防御高温.” 6. Apply “Blinds” transition to all slides with “Horizontal” effect option.

Scoring: 21 criteria covering transitions and animations (8), slide layout and content (7), tables (4), graphics/media (1), and document properties (1). These criteria verify properties such as “theme name is Ion,” “show type is Browsed by an Individual,” “SmartArt layout is Horizontal Hierarchy,” and “transition type is Blinds/Horizontal.”

## Appendix B Detailed scoring configuration

Table [8](https://arxiv.org/html/2606.10956#A2.T8) provides the complete mapping from criterion types (WepTypes) to our 8 skill categories.

Table 8:Complete mapping from evaluation criterion types \(WepTypes\) to skill categories\. Count indicates the number of ScorePointDefine elements across all 200 tasks\.Skill CategoryWepType\(s\)CountText & FormatParaFind2,160ParaFont109TextEffect136FormatParaGraph51Body288Layout & DesignPageSetup483SlideMasterEffect81SlideFind648SlideCustom41TablesTable185TableTestRowCol586ChartsChart362Data & FormulasPivotTable91FormatCondition63Subtotal14Border3Graphics & MediaShapePicExist237SmartArt293FormatPicture58Shape31TextFrameFind30AnimationAnimationSetting178Transition377Doc\. PropertiesDocument174Common220WorksheetTab / WorkSheetTab73Sheet18Workbook8Custom / CustShow / ParseAllProps120
## Appendix C Additional results

### C\.1 Per\-task score distribution

Table [9](https://arxiv.org/html/2606.10956#A3.T9) summarizes the per\-task score distribution for each model, computed from three\-run averages\. Scores are highly skewed: most models’ first quartile is 0%, reflecting the large fraction of tasks that receive zero credit \(typically due to code execution failure\)\. Claude Opus 4\.7 achieves a median of 32\.1%, meaning half of all tasks score below this mark, and is one of only two models to achieve a near\-perfect score on any task\. Weaker models are dominated by zero\-score outcomes: Qwen3\.5\-397B\-A17B receives exactly 0 on more than half of all tasks \(104 out of 200\), resulting in a median of 0%\. Standard deviations are large \(6\.8–25\.7 percentage points\), indicating high variance across tasks: some tasks are solvable by every model while others defeat all of them\.
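
The statistics described above can be computed from a list of per-task scores along the following lines. This is a minimal sketch, not the authors' analysis code: the function name and the sample scores are hypothetical, and the quartile method (inclusive) and the use of the population standard deviation are assumptions, since the paper does not state which variants it uses.

```python
import statistics

def summarize(task_scores):
    """Summarize per-task scores (0-100, already averaged over runs)."""
    # Quartile method is an assumption; the paper does not specify one.
    q25, q50, q75 = statistics.quantiles(task_scores, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(task_scores),
        "std": statistics.pstdev(task_scores),  # population std (assumption)
        "q25": q25,
        "q50": q50,
        "q75": q75,
        "max": max(task_scores),
        # "Zeros" counts tasks whose averaged score is exactly 0.
        "zeros": sum(1 for s in task_scores if s == 0),
    }

# Hypothetical, deliberately skewed scores for eight tasks.
scores = [0, 0, 0, 0, 10, 20, 40, 80]
summary = summarize(scores)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With half of the hypothetical tasks at zero, the first quartile is 0 and the median sits well below the mean, which is the same skew pattern the paragraph describes.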

Table 9: Per\-task score distribution \(%\) across models, computed from three\-run averages\. Q25/Q50/Q75 denote quartiles\. “Zeros” counts tasks receiving exactly 0/100 averaged across runs\.

### C\.2 Complete per\-skill results

Table [10](https://arxiv.org/html/2606.10956#A3.T10) presents per\-skill criterion pass rates for all seven evaluated models\. Tables, Charts, and Data & Formulas show the strongest peak rates for top models \(up to 59\.1%\), indicating that table structure, formula\-based operations, and chart operations are comparatively amenable to code generation\. Animation and Graphics & Media are consistently the weakest categories, reflecting the limited API coverage of python\-pptx for transitions, effects, and SmartArt\. The four weakest models \(Kimi\-K2\.6 through Grok\-4\.1\-fast\) show relatively uniform profiles with rates below 20% across all categories, while stronger models exhibit more pronounced peaks and valleys\. GPT\-5\.5 leads in criterion pass rate on Charts, Data & Formulas, Graphics & Media, and Animation, while Claude Opus 4\.7 leads on Text & Format, Layout & Design, Tables, and Doc\. Properties\. Despite comparable task\-level SR \(36\.2% vs\. 36\.6%\) and overall criterion pass rates \(40\.3% vs\. 40\.5%\), the two models exhibit complementary strengths: GPT\-5\.5 excels at data\-centric and chart operations, while Claude Opus 4\.7 is stronger at document structure and table formatting\.

Table 10: Per\-skill criterion pass rate \(%\) for all seven evaluated models\. Bold indicates the highest rate per skill\. Criteria are classified into 8 skill categories based on their evaluation type \(WepType\); see Table [8](https://arxiv.org/html/2606.10956#A2.T8) for the full mapping\.

### C\.3 Task difficulty analysis

To assess the discriminative power of individual tasks, we analyze the score distribution across all 7 models\. Of 200 tasks, 1 \(0\.5%\) defeats all models \(every model’s three\-run average is 0%\), while 67 \(33\.5%\) are partially solved by all models \(every model scores above 0%\)\. The remaining 132 tasks discriminate between model tiers: 34 tasks are solved only by the top models \(GPT\-5\.5 or Claude Opus 4\.7 scores \>50%\) while the three weakest models all score 0%\. The most discriminating tasks exhibit spreads of up to 100 percentage points between the best and worst models, confirming that OfficeEval provides meaningful separation across the full capability spectrum\. These statistics indicate a well\-calibrated difficulty distribution: the benchmark is neither trivially easy nor impossibly hard, with the majority of tasks providing useful signal for distinguishing model capabilities\.
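
The three-way split described above (defeats all models, partially solved by all, discriminating) can be expressed as a simple classification over a task-by-model score matrix. The sketch below is an illustration under assumptions: the function names, task IDs, model names, and scores are all hypothetical and are not taken from the benchmark data.

```python
def classify_tasks(scores):
    """Bucket tasks by how the evaluated models fare on them.

    scores: dict mapping task_id -> dict mapping model -> averaged score (0-100).
    """
    buckets = {"defeats_all": [], "partially_solved_by_all": [], "discriminating": []}
    for task, by_model in scores.items():
        values = list(by_model.values())
        if all(v == 0 for v in values):
            buckets["defeats_all"].append(task)
        elif all(v > 0 for v in values):
            buckets["partially_solved_by_all"].append(task)
        else:
            # At least one model scores 0 and at least one scores above 0.
            buckets["discriminating"].append(task)
    return buckets

def spread(by_model):
    """Gap in percentage points between the best and worst model on a task."""
    return max(by_model.values()) - min(by_model.values())

# Hypothetical scores for three tasks and three placeholder models.
scores = {
    "task_a": {"model_x": 0.0, "model_y": 0.0, "model_z": 0.0},
    "task_b": {"model_x": 85.0, "model_y": 40.0, "model_z": 12.5},
    "task_c": {"model_x": 100.0, "model_y": 30.0, "model_z": 0.0},
}
buckets = classify_tasks(scores)
print(buckets)
print({task: spread(by_model) for task, by_model in scores.items()})
```
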

## Appendix D Reproducibility details

### D\.1 Agent prompts

The single\-turn baseline uses the following system prompt for all models:

> You are an expert in Microsoft Office automation\. Your task is to write Python code to manipulate Office documents according to the user’s requirements\. Rules: 1\. Write complete, directly executable Python code based on the task requirements\. 2\. Use python\-docx for Word documents \(\.docx\), openpyxl for Excel \(\.xlsx\), and python\-pptx for PowerPoint \(\.pptx\)\. 3\. The document content and task instructions are in Chinese\. 4\. Output ONLY the Python code, wrapped in \`\`\`python and \`\`\`\. No explanations\.

The user message includes: \(1\) task instructions from the examination \(Chinese text with any embedded images\), \(2\) screenshots of input document pages/sheets/slides rendered as JPEG images \(max 1000px\), and \(3\) a file listing with code generation instructions\. No few\-shot examples are provided\.
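
Because the system prompt asks for the code to be wrapped in a python fence, a single-turn harness has to pull that code out of the reply before executing it. The paper does not describe this step, so the following is only a sketch under assumptions: the `extract_code` helper is hypothetical, and treating a reply without a fence as having no code is our choice, not a documented behavior.

```python
import re

# `{3} matches a literal triple backtick without writing one in this example.
FENCE = re.compile(r"`{3}python[ \t]*\r?\n(.*?)`{3}", re.DOTALL)

def extract_code(reply):
    """Return the body of the first python-fenced block in reply, or None."""
    match = FENCE.search(reply)
    return match.group(1).strip() if match else None

# Build a sample reply programmatically so the fence markers stay readable here.
tick = "`" * 3
reply = f"{tick}python\nprint('hi')\n{tick}"

code = extract_code(reply)
missing = extract_code("Sorry, I cannot help with that.")
print(repr(code))
print(missing)
```
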

For the coding agents \(Claude Code and Codex\), the following prompt is provided directly without a custom system prompt:

> Read the \.rtf file in this directory for the task requirements\. The input files to modify are in \./input/\. Complete all the operations described and save all changes back to the original files\.

The agent operates in an isolated temporary directory containing only the raw exam paper \(RTF file\) and the input document\(s\)\. No pre\-processed task text, document screenshots, library recommendations, or scoring rubrics are provided\. Claude Code runs with the claude CLI in non\-interactive mode; Codex runs with the codex CLI\. Both use no turn limit and a one\-hour timeout per task\.
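
The isolation and timeout policy above can be approximated with the Python standard library. This is a hedged sketch rather than the authors' harness: `run_with_timeout` is a hypothetical helper, the commands are stand-ins that invoke the Python interpreter (the actual agent CLI arguments are not given in the paper and are not guessed here), and mapping a timeout to a failed task is an assumption.

```python
import subprocess
import sys
import tempfile

ONE_HOUR_S = 3600  # per-task timeout stated in the paper

def run_with_timeout(cmd, timeout_s=ONE_HOUR_S):
    """Run cmd in a fresh temporary directory.

    Returns the process exit code, or None if the timeout expired.
    """
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(cmd, cwd=workdir, timeout=timeout_s,
                                  capture_output=True, text=True)
        except subprocess.TimeoutExpired:
            return None  # treated here as a failed task (assumption)
        return proc.returncode

# Stand-in commands; a real harness would pass the agent CLI invocation instead.
quick = run_with_timeout([sys.executable, "-c", "print('done')"])
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"],
                        timeout_s=0.5)
print(quick, slow)
```

A real harness would also copy the exam paper and input documents into the directory before the run and collect the modified files before the directory is deleted.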

### D\.2 Decoding settings

All models are called with temperature=1\.0\. GPT\-5\.5 uses an internal temperature=1\.0 setting that is not user\-configurable, so we set all other configurable models to the same value for consistency\. To account for the resulting stochasticity, all main\-paper results are means of 3 independent runs per model\.
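
The run-averaging step can be made concrete as follows. This sketch assumes that Score Rate (SR) for one run is the mean of the per-task scores, each already on a 0 to 100 scale as defined in the abstract, and that the reported figure is the mean of the per-run values; the function names and the scores are hypothetical.

```python
from statistics import fmean

def per_task_means(runs):
    """Average each task's score across runs (the three-run averages)."""
    return [fmean(task_scores) for task_scores in zip(*runs)]

def score_rate(runs):
    """Mean of the per-run Score Rates, in percent."""
    return fmean(fmean(run) for run in runs)

# Hypothetical scores: 3 independent runs over the same 3 tasks.
runs = [
    [0, 50, 100],
    [0, 60, 90],
    [30, 40, 80],
]
task_means = per_task_means(runs)
sr = score_rate(runs)
print(task_means)
print(sr)
```

When every run covers the same tasks, averaging over runs first or over tasks first gives the same SR, so the two helpers agree on the overall mean.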

### D\.3 Evaluation environment

The evaluation engine primarily parses ISO/IEC 29500 Office Open XML structures using the Open XML SDK \[ISO/IEC, [2016](https://arxiv.org/html/2606.10956#bib.bib14); Microsoft, [2023](https://arxiv.org/html/2606.10956#bib.bib15)\], with a subset of checks invoking Office COM automation, and evaluates each scoring criterion defined in the task’s XML configuration\. For each criterion, the engine locates the specified document property, compares it against the expected value using the declared operator, and emits a binary pass/fail result\. The per\-task score \(0–100\) is the sum of points from passing criteria\. The engine is implemented as a command\-line tool; a 60\-second timeout is applied per task\.
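
The locate, compare, and sum loop described above can be sketched in a few lines. This is not the authors' engine, which parses Office Open XML with the Open XML SDK: here the document is reduced to a flat dictionary of already-extracted properties, and the `Criterion` fields, operator names, property keys, and point values are all hypothetical.

```python
import operator
from dataclasses import dataclass

# Hypothetical operator vocabulary; the real XML configuration may differ.
OPERATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "ge": operator.ge,
    "le": operator.le,
    "contains": lambda actual, expected: expected in actual,
}

@dataclass
class Criterion:
    prop: str         # which document property to look up
    op: str           # key into OPERATORS
    expected: object  # value the rubric expects
    points: float     # rubric points awarded on a pass

def score_task(properties, criteria):
    """Sum the points of all passing criteria; each check is binary."""
    total = 0.0
    for c in criteria:
        actual = properties.get(c.prop)
        # A property that cannot be located counts as a fail.
        passed = actual is not None and OPERATORS[c.op](actual, c.expected)
        if passed:
            total += c.points
    return total

# Properties as they might be extracted from a submitted presentation.
properties = {"theme.name": "Ion", "transition.type": "Blinds", "slide.count": 4}
criteria = [
    Criterion("theme.name", "eq", "Ion", 5),
    Criterion("transition.type", "eq", "Blinds", 5),
    Criterion("slide.count", "ge", 5, 10),
    Criterion("show.type", "eq", "Browsed by an Individual", 5),
]
score = score_task(properties, criteria)
print(score)
```

In this example the theme and transition checks pass, the slide-count check fails, and the missing show-type property fails, so only two of the four criteria contribute points.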

### D\.4 Library versions

Table 11: Library versions used in experiments\.

### D\.5 Community solution error analysis

The community reference solutions achieve 95\.5% on average, leaving a small gap from perfect scores\. We manually inspected the lowest\-scoring community reference solutions and found that the deductions correspond to genuinely missing or incorrect operations \(e\.g\., omitted page setup steps, incorrect font sizes, or missing chart elements\)\. In all inspected cases, the scoring engine correctly identified the discrepancy between the submitted document and the rubric specification\. This supports the use of the community reference solutions as a sanity check for task solvability and scoring behavior, while not making them a formal human baseline\.
