AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

Summary

AGORA is a new benchmark for evaluating large language models on archive-grounded reasoning tasks across workplace documents, comprising 362 questions over 9,664 real documents. The strongest model achieves only 59.4% accuracy, highlighting substantial room for improvement.

arXiv:2606.24526v1 Announce Type: new Abstract: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of workplace files, reconciling inconsistent terminology, units, and time conventions, and computing an answer. Existing benchmarks address only parts of this setting and none jointly stresses archive-groundedness, agentic exploration, and cross-domain coverage. We introduce Agora, a benchmark pairing 362 questions with eight domain collections of 9,664 authentic documents and 372M tokens, far exceeding any model's context window, so agents must explore deliberately rather than scan exhaustively. Agora is built by an agentic pipeline combining cross-document task synthesis, leakage-preventing obfuscation, and difficulty filtering. Evaluating eight models, we find the task far from solved: even the strongest reaches only 59.4% accuracy, with notable variation across domains.

# An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning
Source: [https://arxiv.org/html/2606.24526](https://arxiv.org/html/2606.24526)
Honglin Guo¹\*, Qi Zhang¹\*, Yu Zhang², Weijie Li¹, Rui Zheng³, Zhikai Lei³, Qiyuan Peng¹, Zhiheng Xi¹, Tao Gui¹, Qi Zhang¹

¹Fudan University, ²Zhejiang University, ³Shanghai Qiji Zhifeng Co., Ltd.

hlguo24@m.fudan.edu.cn, {tgui,qz}@fudan.edu.cn

###### Abstract

Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study *archive-grounded reasoning*: locating sparse evidence across a large, messy collection of workplace files, reconciling inconsistent terminology, units, and time conventions, and computing an answer. Existing benchmarks address only parts of this setting, and none jointly stresses archive-groundedness, agentic exploration, and cross-domain coverage. We introduce Agora (short for Archive-Grounded Office Reasoning Assessment), a benchmark pairing 362 questions with eight domain collections of 9,664 authentic documents and 372M tokens, far exceeding any model's context window, so agents must explore deliberately rather than scan exhaustively. Agora is built by an agentic pipeline combining cross-document task synthesis, leakage-preventing obfuscation, and difficulty filtering. Evaluating eight models, we find the task far from solved: even the strongest reaches only 59.4% accuracy, with notable variation across domains.

\*Equal contribution.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.24526v1/x1.png)

Figure 1: Overview of Agora. The benchmark comprises 372M tokens across 9,664 documents and 362 questions, spanning eight professional domains. Even the strongest model reaches only 59.4% overall accuracy, leaving substantial headroom.

Table 1: Comparison with representative multi-hop QA, document QA, RAG, and agent benchmarks. Multi-file: whether answering a question may require evidence from multiple files. Archive-grounded: whether reasoning is restricted to a fixed corpus rather than live web search or an open environment. Agentic: whether answering requires active file navigation and code execution. Marks: ✓ = yes, △ = partial, ✗ = no, – = not applicable.

Large language models are increasingly being used as agents rather than standalone chatbots (Yang et al., [2024](https://arxiv.org/html/2606.24526#bib.bib20); Luo et al., [2025](https://arxiv.org/html/2606.24526#bib.bib7); DeepSeek-AI, [2026](https://arxiv.org/html/2606.24526#bib.bib36)). In enterprise settings, their value is not limited to producing fluent responses, but lies in helping users complete real knowledge-work tasks over internal archives (Han et al., [2025](https://arxiv.org/html/2606.24526#bib.bib5); Zhang et al., [2026](https://arxiv.org/html/2606.24526#bib.bib4)). For example, financial analysts need to verify figures across reports and spreadsheets, legal researchers need to trace relevant clauses and precedents, and policy teams need to synthesize evidence from scattered documents. These tasks are difficult because internal archives are often large, inconsistently organized, and full of conflicting terminology, units, dates, and assumptions (Islam et al., [2023](https://arxiv.org/html/2606.24526#bib.bib12); Zhao et al., [2022](https://arxiv.org/html/2606.24526#bib.bib3)). A capable agent must locate sparse evidence, reconcile inconsistencies, perform necessary calculations, and produce an answer that is accurate, verifiable, and directly useful for decision-making (Jin et al., [2025b](https://arxiv.org/html/2606.24526#bib.bib2); Zheng et al., [2025](https://arxiv.org/html/2606.24526#bib.bib1)). Evaluating language agents therefore requires assessing not their fluency in chat but their capacity for archive-grounded document reasoning, the capability central to whether they can become reliable workplace assistants.

As Table [1](https://arxiv.org/html/2606.24526#S1.T1) shows, existing benchmarks cover only part of the requirements outlined above. Multi-hop QA benchmarks (Yang et al., [2018](https://arxiv.org/html/2606.24526#bib.bib8); Trivedi et al., [2022](https://arxiv.org/html/2606.24526#bib.bib9); Krishna et al., [2025](https://arxiv.org/html/2606.24526#bib.bib10)) typically rely on homogeneous corpora such as Wikipedia, and therefore do not capture the file-format diversity and irregular organization of workplace archives. Document QA and table QA benchmarks (Zhu et al., [2021](https://arxiv.org/html/2606.24526#bib.bib11); Islam et al., [2023](https://arxiv.org/html/2606.24526#bib.bib12)) are usually framed as reading-comprehension tasks, so they do not require agents to navigate file systems or perform multi-step computation over heterogeneous documents. Agent and web-browsing benchmarks (Mialon et al., [2024](https://arxiv.org/html/2606.24526#bib.bib13); Zhou et al., [2024](https://arxiv.org/html/2606.24526#bib.bib14); Wei et al., [2025](https://arxiv.org/html/2606.24526#bib.bib15)) mostly operate on the open web or in simulated environments, rather than on a bounded internal archive. The closest prior work is OfficeQA Pro (Opsahl-Ong et al., [2026](https://arxiv.org/html/2606.24526#bib.bib16)), which combines retrieval with computation over a large enterprise-style corpus; however, it is built from a single external source, limiting its domain and file-format coverage. This gap motivates a benchmark that evaluates whether LLM agents can reason over realistic internal archives rather than over flat, homogeneous, or web-scale corpora.

![Refer to caption](https://arxiv.org/html/2606.24526v1/x2.png)

Figure 2: Overview of the Agora task setting. Given a natural-language query paired with one of eight professional domain collections, an agent explores files and runs computation through bash tool calls. After several turns of exploration and analysis, it submits a single numeric answer that is deterministically verified against the gold answer.

To bridge this gap, we introduce Agora, a benchmark designed to test whether LLM agents can perform archive-grounded reasoning in realistic workplace settings. Real archive-grounded reasoning is not a simple reading-comprehension task over a flat corpus; it is a search-and-verify process in which the relevant evidence is scarce. Agora contains 362 natural-language queries distributed across eight domain-specific collections, covering 9,664 real-world documents and 372M tokens. For each query, an agent is given access only to its paired collection and must navigate the file structure, locate sparse evidence, and execute the computations needed to derive the answer. Because each collection is far larger than the context window of current models, the agent cannot rely on exhaustive scanning or shallow keyword matching; it must plan its exploration and reason over evidence selected from the archive. Each query has a unique, verifiable numeric answer, enabling deterministic and reproducible evaluation without human or model-based judgment. Agora is constructed through an agentic pipeline that collects documents across multiple domains using deep search, synthesizes cross-document multi-hop questions, and enforces quality through leakage-preventing obfuscation, difficulty filtering, and human verification. As Figure [1](https://arxiv.org/html/2606.24526#S1.F1) shows, the resulting benchmark remains highly challenging: across eight evaluated models, even the strongest reaches only 59.4% accuracy.

We evaluate a mix of eight proprietary and open-weight models on Agora, running all of them in mini-swe-agent, a minimal harness that exposes only a bash tool. This keeps the agent interface fixed and simple, helping focus the comparison on model-level reasoning, evidence selection, and tool use rather than on complex agent scaffolding. Archive-grounded document reasoning remains far from solved: even the strongest model obtains only 59.4% accuracy, just below the 60% threshold. More importantly, performance varies substantially across domains. The results show that difficulty is often not a property of the benchmark alone, but emerges from the interaction between a specific model and a specific domain. Indeed, the per-domain leaderboards often differ from the aggregate ranking, revealing weaknesses that would be hidden by a single-source or single-domain benchmark.

Our contributions are threefold. First, we introduce Agora, a cross-domain benchmark for archive-grounded agentic document reasoning, consisting of 362 questions with verifiable numeric answers over eight workplace document collections, including 9,664 real-world documents and 372M tokens. Second, we develop an agentic construction pipeline that combines cross-document task synthesis, leakage-preventing obfuscation, difficulty filtering, and human verification. Third, we evaluate a mix of eight frontier proprietary and open-weight models on Agora, showing that archive-grounded document reasoning remains far from solved: even the strongest model reaches only 59.4% accuracy and exhibits systematic per-domain weaknesses.

## 2 Related Work

The shift from standalone chatbots to autonomous agents has been driven by methodological advances (Yao et al., [2023](https://arxiv.org/html/2606.24526#bib.bib17); Schick et al., [2023](https://arxiv.org/html/2606.24526#bib.bib18)), which established the paradigm of interleaving reasoning with tool invocation (Yang et al., [2024](https://arxiv.org/html/2606.24526#bib.bib20); Wang et al., [2025](https://arxiv.org/html/2606.24526#bib.bib21)). Building on these foundations, agents are now widely deployed in real productivity settings, spanning code repair, end-to-end office workflows, and deep research over document collections (Li et al., [2025a](https://arxiv.org/html/2606.24526#bib.bib24); Jin et al., [2025a](https://arxiv.org/html/2606.24526#bib.bib23)). As such agents proliferate, rigorous and scenario-faithful benchmarks become essential for measuring their real-world capability. A first thread of benchmarks evaluates agents in increasingly realistic environments. GAIA (Mialon et al., [2024](https://arxiv.org/html/2606.24526#bib.bib13)) and BrowseComp (Wei et al., [2025](https://arxiv.org/html/2606.24526#bib.bib15)) probe open-web reasoning with browsing and search tools, while WebArena (Zhou et al., [2024](https://arxiv.org/html/2606.24526#bib.bib14)) and OSWorld (Xie et al., [2024](https://arxiv.org/html/2606.24526#bib.bib26)) target simulated browser and desktop operating-system environments. Closer to office workflows, SpreadsheetBench (Ma et al., [2024](https://arxiv.org/html/2606.24526#bib.bib27)) evaluates cell-level manipulation and formula reasoning, MEBench (Lin et al., [2025](https://arxiv.org/html/2606.24526#bib.bib28)) measures multi-step tool use over office artifacts, and OfficeBench (Wang et al., [2024](https://arxiv.org/html/2606.24526#bib.bib29)) extends the setting to task completion across diverse office software. Across this thread, evaluation is grounded in interactive environments or concrete office artifacts, with documents typically serving as targets of manipulation rather than as a body of evidence to be reasoned over.

Closer to our setting, multi-document question answering over document archives has attracted substantial attention. HotpotQA (Yang et al., [2018](https://arxiv.org/html/2606.24526#bib.bib8)), 2WikiMultihopQA (Ho et al., [2020](https://arxiv.org/html/2606.24526#bib.bib30)), MuSiQue (Trivedi et al., [2022](https://arxiv.org/html/2606.24526#bib.bib9)), and FRAMES (Krishna et al., [2025](https://arxiv.org/html/2606.24526#bib.bib10)) require composing facts across multiple Wikipedia passages, often with explicit supporting-fact supervision; MultiHop-RAG (Tang and Yang, [2024](https://arxiv.org/html/2606.24526#bib.bib31)) extends the setting to news corpora, and M3SCIQA (Li et al., [2024](https://arxiv.org/html/2606.24526#bib.bib32)) pushes multi-hop reasoning into the scientific literature. Closer to professional settings, TAT-QA (Zhu et al., [2021](https://arxiv.org/html/2606.24526#bib.bib11)) and FinanceBench (Islam et al., [2023](https://arxiv.org/html/2606.24526#bib.bib12)) evaluate hybrid text-and-table reasoning over financial materials, GDPVal (Patwardhan et al., [2025](https://arxiv.org/html/2606.24526#bib.bib33)) measures economically valuable deliverables across occupations, and OfficeQA Pro (Opsahl-Ong et al., [2026](https://arxiv.org/html/2606.24526#bib.bib16)) couples retrieval with computation over a corpus drawn from a single source. None of these benchmarks, however, jointly stresses active retrieval over a large internal collection and cross-file reconciliation of units, time conventions, and terminology before computation. These are the conditions that define real productivity workflows and that motivate Agora.

## 3 The Agora Benchmark

Building a benchmark that jointly demands archive-groundedness, agentic exploration, and cross-domain coverage poses three challenges: specifying tasks that genuinely require reasoning over a fixed collection rather than parametric knowledge or single-file lookup; assembling authentic, messy documents at a scale that forces deliberate exploration; and synthesizing verifiable multi-hop questions while suppressing evidence leakage. We address these in turn: we state the design desiderata, formalize the task and benchmark composition, and describe the pipeline that constructs the benchmark.

### 3.1 Dataset Desiderata

We distill the design of Agora into four desiderata, derived from how document-grounded reasoning arises in real workplace settings and from our aim of a rigorously measurable benchmark.

#### Archive-groundedness.

Each task must be answerable using only a fixed source collection $\mathcal{C}$, without open-web access, so that a score depends on the collection rather than a model's prior knowledge. A frozen collection also makes the benchmark reproducible: the available evidence does not drift over time.

#### Cross-domain coverage.

The benchmark must span a broad range of professional domains. Real corpora differ sharply across domains in file formats, table structures, terminology, and reporting conventions, and a single-source benchmark risks rewarding agents that overfit to one source's idiosyncrasies. Drawing collections from many domains instead measures whether document-reasoning ability generalizes.

#### Agentic exploration and evidence integration.

A task must test an agent end to end: planning its exploration, gathering a long-range evidence chain across files, and reconciling that evidence into an answer. Evidence is sparsely distributed among a large volume of unrelated material, so an agent must navigate deliberately rather than scan exhaustively, which is a particular challenge under a limited context window. Moreover, retrieved evidence often fails to line up, differing in wording, definitions, or unit and time conventions, and these inconsistencies must be resolved before the final computation.

#### Verifiable evaluation.

Every task must admit automatic and deterministic verification. To this end, each query has a single numeric answer and a specified output format, so responses can be checked against ground truth after normalizing superficial formatting differences. This avoids human or model-based judging, along with the noise and cost such judging entails.
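To make the idea of deterministic checking concrete, the following is a minimal sketch of how a numeric answer might be extracted and compared. It is an illustration only, not the benchmark's actual scorer: the function names, the set of stripped formatting characters, and the tolerance are all assumptions.

```python
import math
import re
from typing import Optional

# The tag format follows the paper; everything else here is a hypothetical sketch.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(output: str) -> Optional[str]:
    """Return the text inside the first <answer>...</answer> tag, if any."""
    match = ANSWER_RE.search(output)
    return match.group(1).strip() if match else None


def normalize_number(text: str) -> Optional[float]:
    """Strip superficial formatting (assumed: commas, spaces, $ and %) and parse."""
    cleaned = re.sub(r"[,\s$%]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None


def is_correct(output: str, gold: str, rel_tol: float = 1e-6) -> bool:
    """Compare the submitted number with the gold number; tolerance is assumed."""
    raw = extract_answer(output)
    if raw is None:
        return False  # no valid tag means the episode is scored as incorrect
    pred, ref = normalize_number(raw), normalize_number(gold)
    if pred is None or ref is None:
        return False
    return math.isclose(pred, ref, rel_tol=rel_tol, abs_tol=1e-9)
```

Because the check is a pure function of the agent output and the gold value, repeated scoring of the same transcript always gives the same result.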

Together, these desiderata scope Agora deliberately: by pairing realistic, messy document environments with single numeric answers and automatic verification, it targets agentic exploration and reasoning that can be measured precisely, rather than the open-ended deliverable quality assessed by other benchmarks.

![Refer to caption](https://arxiv.org/html/2606.24526v1/x3.png)

Figure 3: Overview of the Agora construction pipeline. Phase 1 (Data Collection and Preprocessing) gathers and parses multi-domain documents, segmenting them into chunks indexed in a vector database. Phase 2 (Task Synthesis) drafts tasks from these chunks and progressively enhances them through refinement and obfuscation. Phase 3 (Quality Control) applies difficulty filtering and multi-check verification to yield the final QA set.

### 3\.2Task and Benchmark Composition

Table 2: Per-domain composition of Agora. Domains are abbreviated as Agri (Agriculture, Resources & Energy), Arch (Architecture, Construction, Real Estate & Facilities), Biz (Business, Management, Marketing & Sales), Edu (Education, Science & Academia), Fin (Finance & Economics), Health (Healthcare & Medicine), Law, and Tech (Technology, Software & Manufacturing).

#### Task formulation.

Each Agora task pairs a natural-language query with one domain collection, which is the only source the agent may use to answer it. A query is cross-document and multi-hop: its evidence is sparsely scattered across several files amid much unrelated material, so the agent must locate bridging facts, reconcile inconsistent terminology, units, and time conventions, and compute an answer. As illustrated in Figure [2](https://arxiv.org/html/2606.24526#S1.F2), answering such a query requires multiple rounds of tool calls to explore and analyze the corresponding document collection before submitting an answer in the specified format. Because each collection far exceeds any model's context window, the agent cannot scan it exhaustively but must explore deliberately. Every query admits a single verifiable numeric answer under a specified output format; the agent works through a bash tool and submits its answer in an `<answer>…</answer>` tag. Appendix [E](https://arxiv.org/html/2606.24526#A5) gives more task examples.

#### Benchmark composition.

Agora consists of eight source collections, one per domain, each paired with a set of natural-language queries answered over that collection. A collection is a flat directory of plain-text Markdown files converted from authentic workplace documents such as official reports, statistical yearbooks, and tabular records. Files are named after their originating data source and document title, and a single collection aggregates documents from several distinct data sources. The documents are authentic and may appear in languages other than English, while all queries are posed in English. In total, the benchmark comprises 362 questions. Table [2](https://arxiv.org/html/2606.24526#S3.T2) reports the per-domain composition of Agora.

### 3.3 Benchmark Construction

Figure [3](https://arxiv.org/html/2606.24526#S3.F3) illustrates the three-stage construction of Agora: document collection and preprocessing, task synthesis, and quality control. A Codex-based agent is employed throughout.

Table 3: Per-domain and overall accuracy (%) on Agora. Models are ordered by overall accuracy. The best result in each column is in bold and the second best is underlined. Domain abbreviations follow Table [2](https://arxiv.org/html/2606.24526#S3.T2).

#### Document collection and preprocessing.

To achieve broad domain coverage and a large document pool, we survey official occupational classification systems and distill eight major domains as *domain seeds*. A deep-search agent then retrieves semantically relevant web documents and produces a candidate list from which we verify and download the final set. The collected documents span four formats and are chunked for fine-grained retrieval by format-specific rules. Each PDF is converted to Markdown via dots.ocr (Li et al., [2025b](https://arxiv.org/html/2606.24526#bib.bib6)) and grouped into five-page windows that each form one chunk. Each Markdown file is tokenized into 8,000-token sliding windows (800-token overlap; 7,200-token stride) that each form one chunk. For Excel, every non-empty sheet becomes one chunk represented as a compact table profile of column names, data types, summary statistics, and sample rows. Each CSV file is parsed as a single sheet and mapped to one chunk under the same profile. All four formats are thus normalized into plain text that a language agent can read and reason over directly, and the resulting per-domain chunks are consolidated with their metadata into a single JSON file.
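The two windowing rules can be sketched as follows. This is a minimal illustration under stated assumptions: the tokenizer is not specified in the text, so the function works on any token sequence; the five-page windows are assumed to be non-overlapping; and the function names are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(
    tokens: Sequence[T], window: int = 8000, overlap: int = 800
) -> List[Sequence[T]]:
    """Split a token sequence into overlapping windows.

    With the defaults, the stride is 8000 - 800 = 7200 tokens, matching the
    figures quoted in the text.
    """
    stride = window - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than window")
    if not tokens:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the current window already reaches the end of the file
        start += stride
    return chunks


def page_windows(pages: Sequence[T], size: int = 5) -> List[Sequence[T]]:
    """Group PDF pages into consecutive windows (assumed non-overlapping)."""
    return [pages[i:i + size] for i in range(0, len(pages), size)]
```

For a 20,000-token file this yields three chunks starting at tokens 0, 7,200, and 14,400, with each adjacent pair sharing 800 tokens.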

We score each chunk by an information-density heuristic that prioritizes numerics, tables, and time-series content while filtering out directory listings and standalone titles (Appendix [B](https://arxiv.org/html/2606.24526#A2)), retaining the top-100 chunks of each domain as *seed chunks* that serve as entry points for task synthesis. Finally, we use GPT to analyze the chunks, generating summaries and tags at three hierarchical levels: *chunk*, *document*, and *domain*. Each summary is concatenated with its tags, encoded into a dense vector, and indexed in a vector database, supporting multi-granularity evidence retrieval during task synthesis.
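The exact heuristic is given in the paper's Appendix B, which is not reproduced here. The sketch below is therefore only a hypothetical stand-in that shows the general shape of such a scorer: the features, weights, and the minimum-length filter are all assumptions chosen for illustration.

```python
import heapq
import re
from typing import List

NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def density_score(chunk: str) -> float:
    """Hypothetical information-density score (weights are illustrative only)."""
    tokens = chunk.split()
    if len(tokens) < 5:
        return 0.0  # assumed filter for standalone titles and similar stubs
    numeric_ratio = len(NUMBER_RE.findall(chunk)) / len(tokens)
    # Count Markdown-style table rows as lines with at least two pipe characters.
    table_rows = sum(1 for line in chunk.splitlines() if line.count("|") >= 2)
    # Distinct years serve as a crude proxy for time-series content.
    distinct_years = len(set(YEAR_RE.findall(chunk)))
    return 1.0 * numeric_ratio + 0.1 * table_rows + 0.05 * distinct_years


def top_seed_chunks(chunks: List[str], k: int = 100) -> List[str]:
    """Keep the k highest-scoring chunks as seed chunks."""
    return heapq.nlargest(k, chunks, key=density_score)
```

Under these assumed weights, a small yearly revenue table outranks both a bare title and a purely narrative paragraph.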

#### Task synthesis.

To synthesize high-quality multi-hop questions at scale, an agent retrieves cross-document evidence and drafts questions grounded in it, which two further stages then refine and harden.

*Drafting.* Given a seed chunk, a semantic-search tool over the vector database, and a set of constraints (e.g., prohibited reasoning shortcuts, minimum hop counts, and answer-leakage criteria), the agent explores the domain corpus, identifies cross-document bridging facts, and produces a candidate task with its reference reasoning path and verification code, followed by a self-check.

*Refinement.* This stage enforces consistency, naturalness, and unambiguity while preserving intrinsic difficulty. The agent attempts the question to gauge answerability and leakage, checks the alignment among the question stem, reasoning chain, and reference answer, and audits properties such as cross-file coverage. Any question failing a check is rewritten.

*Obfuscation.* This stage removes residual leakage that survives refinement, of two kinds: *lexical leakage*, where stem terms retrieve the evidence within one or two searches, and *structural leakage*, where entities the solver should infer are stated outright. A suite of attack tests flags both, after which exposed terms and entity names are rewritten as business-role descriptions or equivalent expressions that preserve the original cross-file dependency structure; each rewrite is re-tested to confirm removal.
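A lexical-leakage attack test of the kind described above could look roughly like the sketch below. It is a toy illustration, not the paper's attack suite: the substring search stands in for whatever retrieval tool the pipeline uses, and the helper names and file names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set


def make_keyword_search(corpus: Dict[str, str]) -> Callable[[str], List[str]]:
    """Build a toy case-insensitive substring search over {file name: text}."""
    def search(query: str) -> List[str]:
        needle = query.lower()
        return [name for name, text in corpus.items() if needle in text.lower()]
    return search


def has_lexical_leak(
    stem_terms: Iterable[str],
    evidence_files: Set[str],
    search: Callable[[str], List[str]],
    max_searches: int = 2,
) -> bool:
    """Flag a question whose stem terms surface an evidence file quickly.

    Only the first max_searches terms are tried, mirroring the
    "one or two searches" criterion in the text.
    """
    for term in list(stem_terms)[:max_searches]:
        if evidence_files.intersection(search(term)):
            return True
    return False


# Hypothetical example: naming the entity leaks, a role description does not.
corpus = {
    "acme_2021.md": "Acme Corp reported revenue of 1,200 in 2021.",
    "misc.md": "Unrelated meeting notes.",
}
search = make_keyword_search(corpus)
leaky = has_lexical_leak(["Acme Corp"], {"acme_2021.md"}, search)
hardened = has_lexical_leak(["the regional distributor"], {"acme_2021.md"}, search)
```

Re-running the same check after each rewrite corresponds to the re-test step that confirms the leak has been removed.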

#### Quality control.

We further subject the synthesized tasks to a quality-control procedure targeting difficulty, correctness, and unambiguity. Each task is first presented to DeepSeek-V4-Pro in a closed-book setting, and any task solvable from parametric knowledge alone is discarded. The surviving tasks are then assessed by a three-model panel (GPT-5.5, DeepSeek-V4-Flash, and DeepSeek-V4-Pro), and those solved correctly by all three are eliminated to guarantee sufficient difficulty. Next, Codex reviews each task under two conditions: conditioned solely on the query, and conditioned on the query together with the reference reasoning path from synthesis. From the two resulting trajectories, it adjudicates whether the underlying reasoning chain is valid, establishing the task's solvability and unambiguity, and only tasks passing this verification are retained. Finally, we conduct a round of human annotation grounded in both the query and its reference reasoning path, yielding a curated set of 362 tasks.
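The filtering order described above can be summarized as a single predicate. This is a sketch only: the record type and field names are hypothetical, and each boolean stands for the outcome of a model run or review that happens elsewhere.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TaskChecks:
    """Hypothetical record of quality-control outcomes for one task."""
    closed_book_correct: bool        # solved without access to the collection
    panel_correct: Dict[str, bool]   # per-model outcome for the three-model panel
    chain_valid: bool                # verdict of the reasoning-chain review
    human_approved: bool             # outcome of the human annotation round


def keep_task(checks: TaskChecks) -> bool:
    """Apply the filters in the order the text describes."""
    if checks.closed_book_correct:
        return False  # answerable from parametric knowledge alone
    if checks.panel_correct and all(checks.panel_correct.values()):
        return False  # too easy: every panel model solved it
    return checks.chain_valid and checks.human_approved
```

Note that a task solved by only one or two panel models is kept, since the rule removes a task only when all three succeed.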

## 4 Experiments

### 4.1 Setup

#### Models.

We evaluate GPT-5.5 (OpenAI, [2026](https://arxiv.org/html/2606.24526#bib.bib34)), Gemini-3.1-Pro (Google, [2026b](https://arxiv.org/html/2606.24526#bib.bib35)), Gemini-3.1-Flash-Lite (Google, [2026a](https://arxiv.org/html/2606.24526#bib.bib40)), DeepSeek-V4-Flash and DeepSeek-V4-Pro (DeepSeek-AI, [2026](https://arxiv.org/html/2606.24526#bib.bib36)), GLM-5.1 (Zhipu, [2026](https://arxiv.org/html/2606.24526#bib.bib41)), and Qwen3.5-35B-A3B and Qwen3.5-9B (Qwen Team, [2026](https://arxiv.org/html/2606.24526#bib.bib38)). All models run at temperature 1.0 with their reasoning effort set to the maximum supported. We serve the Qwen models locally using SGLang (Zheng et al., [2024](https://arxiv.org/html/2606.24526#bib.bib22)), whereas the remaining models are accessed through their respective official API providers.

#### Agent harness.

We evaluate every model inside a minimal harness, mini-swe-agent (Yang et al., [2024](https://arxiv.org/html/2606.24526#bib.bib20)). Agents can explore the collection and run computation through bash commands, and they report the final answer by printing an `<answer>…</answer>` tag, upon which the harness terminates the episode. We show the system prompt in Appendix [C](https://arxiv.org/html/2606.24526#A3).

#### Runtime environment.

Each task runs in an isolated E2B sandbox with the document collection mounted as a local directory and no internet access. The full environment specification is given in Appendix [A](https://arxiv.org/html/2606.24526#A1).

#### Budget and termination.

Each episode is capped at 200 interaction turns and a 3,600-second timeout, and terminates when the agent emits an `<answer>` tag or exhausts its turn, time, or context budget. Any episode ending without a valid `<answer>` tag is scored as incorrect. We impose no uniform context-window limit; each model operates under its own native maximum context length.
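The termination rules can be sketched as a simple loop. This is an illustrative outline rather than the mini-swe-agent implementation: `step` is a hypothetical callable that stands for one model turn plus its bash execution, the injectable `clock` exists only to make the sketch testable, and context-budget exhaustion is omitted for brevity.

```python
import re
import time
from typing import Callable, Optional, Tuple

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_episode(
    step: Callable[[int], str],
    max_turns: int = 200,
    timeout_s: float = 3600.0,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[str], str]:
    """Run turns until an answer tag appears or a budget is exhausted.

    Returns (answer, reason). An answer of None is scored as incorrect.
    """
    start = clock()
    for turn in range(1, max_turns + 1):
        if clock() - start > timeout_s:
            return None, "timeout"
        output = step(turn)  # hypothetical: one model turn and its tool output
        match = ANSWER_RE.search(output)
        if match:
            return match.group(1).strip(), "answered"
    return None, "turn_limit"
```

An agent that prints its answer on the third turn ends the episode there, while one that never prints a tag runs to the turn cap and receives no credit.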

### 4.2 Main Results

#### Agora is far from saturated, and performance splits into two tiers.

Table [3](https://arxiv.org/html/2606.24526#S3.T3) reports per-domain and overall accuracy for all eight models, ordered by overall accuracy. No model exceeds 60%: even the strongest, Gemini-3.1-Pro, answers only 59.39% of queries correctly. Since every task admits a single verifiable numeric answer solvable from the mounted collection alone, this gap does not reflect formatting artifacts but a genuine capability deficit: archive-grounded agentic document reasoning remains unsolved for current models. The eight models further split into two sharply separated tiers. A frontier tier of five occupies a 40–60% band, while the remaining three fall well below it. The 28.73-point gap between the tiers exceeds any gap within either. Moreover, the lower tier does not trail uniformly but approaches the floor domain by domain. Qwen3.5-9B scores at or below 3% in five of eight domains, and Gemini-3.1-Flash-Lite at or below 7% in six, leaving these smaller, lower-cost models effectively non-functional on Agora. We examine their failure modes in Section [5.2](https://arxiv.org/html/2606.24526#S5.SS2).

![Refer to caption](https://arxiv.org/html/2606.24526v1/x4.png)

Figure 4: Per-domain accuracy relative to each model's overall accuracy on Agora. Each cell shows the signed residual (large) and raw accuracy (small); red is above the model's average, blue below. The vertical rule separates frontier-tier from near-floor models.

#### Per-domain accuracy varies and reorders the leaderboard.

Aggregate accuracy masks substantial per-domain variation in two ways. First, no model is uniformly strong: Gemini-3.1-Pro, the overall leader, still drops to 41.03% on Finance, and GPT-5.5 falls to 38.00% on Business. Strong aggregate performance can still mask unreliable behavior on a model's weak domains. Second, per-domain rankings diverge from the aggregate one: Gemini-3.1-Pro tops five of eight domains yet ranks only fourth on Finance, behind GLM-5.1, GPT-5.5, and DeepSeek-V4-Pro, while GPT-5.5 leads on Law and Tech. A leaderboard built on any single domain would therefore misrank these systems. These effects compound across domains, and we further analyze them in Section [5.1](https://arxiv.org/html/2606.24526#S5.SS1).

## 5 Analysis

### 5.1 Cross-Domain Performance Variance

Agora's multi-domain design lets us ask whether agentic multi-document reasoning transfers across domains, and it does not. This is why cross-domain coverage is necessary rather than incidental to Agora's design: a single-source benchmark would leave the blind spots and rank inversions below hidden. To see them, we center each model's per-domain accuracy on its overall accuracy, which isolates a domain-specific residual. As Figure [4](https://arxiv.org/html/2606.24526#S4.F4) shows, these residuals make difficulty largely a property of the model–domain pair rather than of the domain alone. Business is Gemini-3.1-Pro's strongest domain but the weakest for GPT-5.5, and the interaction is large enough to reorder the leaderboard: DeepSeek-V4-Pro trails GPT-5.5 by 8.8 points overall yet overtakes it on Business. This imbalance does not diminish with model scale. Among frontier models the strongest is also the least even. Gemini-3.1-Pro leads overall yet has the widest *spread*, the gap between its best and worst domain, at 30.97 points. Aggregate strength and domain balance are therefore distinct axes, and a single headline number conceals the latter. The converse does not hold, however: a small spread does not imply balance. The smaller models have small spreads simply because their accuracy is near the floor across all domains.
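The two quantities used in this analysis are simple to compute; the sketch below shows one way to do so. The function names are hypothetical, and the per-domain values in the spread example are made up for illustration. Only the Finance residual uses figures reported in the text (41.03% on Finance against 59.39% overall for Gemini-3.1-Pro, a residual of about -18.36 points).

```python
from typing import Dict


def domain_residuals(per_domain: Dict[str, float], overall: float) -> Dict[str, float]:
    """Center each per-domain accuracy on the model's overall accuracy."""
    return {domain: acc - overall for domain, acc in per_domain.items()}


def spread(per_domain: Dict[str, float]) -> float:
    """Gap between a model's best and worst domain, in percentage points."""
    return max(per_domain.values()) - min(per_domain.values())


# Residual from figures reported in the text.
finance_residual = domain_residuals({"Fin": 41.03}, overall=59.39)["Fin"]

# Spread on hypothetical accuracies (not taken from the paper's tables).
example_spread = spread({"A": 70.0, "B": 45.5, "C": 52.0})
```

The overall accuracy is passed in explicitly rather than recomputed as the mean of the per-domain values, because domains contain different numbers of questions and the two averages need not coincide.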

### 5.2 Failure Modes

![Refer to caption](https://arxiv.org/html/2606.24526v1/x5.png)

Figure 5: Failure-mode breakdown over five categories, as the share of Agora's 362 questions each category is implicated in. Frontier (left, ymax = 55%) and lower-tier (right, ymax = 70%) models use different radial scales. Failure modes are abbreviated as II (Incomplete Inspection), EM (Evidence Misidentification), RE (Resource Exhaustion), INF (Instruction Non-Following), and HAL (Hallucination).

We annotate every wrong trace and consolidate the labels into five categories: Incomplete Inspection, where the agent skips a required document; Evidence Misidentification, where it inspects the right files but extracts the wrong value; Resource Exhaustion, an exhausted context window, turn, time, or sandbox budget; Instruction Non-Following, where the agent ignores a stated requirement of the query; and Hallucination, fabricated answers or forgotten earlier findings. Figure [5](https://arxiv.org/html/2606.24526#S5.F5) reveals three patterns: (i) errors across nearly all models are dominated by the three evidence-grounded categories (II, EM, INF), indicating the bottleneck lies in locating and applying evidence rather than in computation or invention; (ii) Resource Exhaustion is model-specific: it is GPT-5.5's top failure at 24.59%, near-zero for the DeepSeek-V4 family (≤ 1.10%), and catastrophic for Gemini-3.1-Flash-Lite (69.61%); and (iii) Hallucination stays below 12% across the frontier tier but climbs to roughly 40% for the small models, suggesting the tier gap reflects evidence discipline rather than reasoning depth.

### 5\.3 Interaction\-Turn Distribution

![Refer to caption](https://arxiv.org/html/2606.24526v1/x6.png)

Figure 6: Distribution of interaction turns per model on Agora\. Episodes are stratified by the turn at which the agent emits its final <answer\> tag and partitioned by outcome; the parenthesized count denotes submitted episodes\. The axis is truncated at 100, as longer episodes are rare\.

To characterize how agents allocate their exploration budget, we examine the distribution of interaction turns at which a final answer is submitted\. Figure [6](https://arxiv.org/html/2606.24526#S5.F6) stratifies each model’s episodes by submission turn and partitions them by outcome\. Across models, correct outcomes are concentrated at low\-to\-moderate turn counts, whereas episodes extending deep into the budget are predominantly incorrect\. This skew suggests that prolonged trajectories more often reflect a loss of direction \(repeated, unproductive exploration\) than gradual convergence toward a solution, and that an agent failing to resolve a task early rarely recovers by searching longer\.
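The stratification behind such a plot can be sketched as a simple binning of submitted episodes by turn and outcome. The bin width, the `(turn, correct)` episode layout, and the function name are illustrative assumptions; only the truncation at 100 turns comes from the figure caption.

```python
from collections import defaultdict


def stratify_by_turn(episodes, bin_width=10, max_turn=100):
    """Count correct and incorrect submitted episodes per turn bin.

    episodes: iterable of (submission_turn, correct) pairs, where
    submission_turn is None for episodes that never submitted an answer.
    """
    table = defaultdict(lambda: {"correct": 0, "incorrect": 0})
    for turn, correct in episodes:
        if turn is None or turn > max_turn:
            continue  # unsubmitted, or beyond the truncated axis
        lo = ((turn - 1) // bin_width) * bin_width + 1
        outcome = "correct" if correct else "incorrect"
        table[(lo, lo + bin_width - 1)][outcome] += 1
    return dict(sorted(table.items()))


# Toy episodes, not data from the paper.
toy = [(3, True), (8, True), (12, False), (95, False), (None, False), (140, False)]
print(stratify_by_turn(toy))
```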

## 6 Conclusion

We presented Agora, an archive\-grounded, cross\-domain benchmark for agentic workplace document reasoning, pairing 362 verifiable numeric queries with eight domain collections of 9,664 documents and 372M tokens\. Built by an agentic pipeline combining cross\-document task synthesis, leakage\-preventing obfuscation, and difficulty filtering with human verification, Agora jointly stresses archive\-groundedness, agentic exploration, and cross\-domain coverage\. Evaluating eight models, we find the task far from solved: even the strongest reaches only 59\.4% accuracy, and per\-domain analysis reveals systematic blind spots and rank inversions a single\-source benchmark cannot surface\. We hope Agora serves as a rigorous, reproducible testbed for the next generation of document\-reasoning agents\.

## Limitations

Agora spans eight professional domains distilled from official occupational classification systems\. This is broader than prior single\-source benchmarks, though not exhaustive of all workplace settings\. We keep the query count \(362\) deliberately modest in favor of per\-task quality, with every query passing multi\-stage difficulty filtering, automated verification, and human annotation\.

We discard tasks solvable from parametric knowledge alone via a closed\-book filtering stage, ensuring current models must genuinely reason over the mounted collection\. As models scale, however, pretraining corpora may absorb more documents of this kind, eroding the guarantee; we therefore view our pipeline as a reusable instrument and hope it can refresh the benchmark with new collections as the frontier advances\. A related caveat concerns the difficulty\-filtering panel itself: three of its models are also evaluation subjects, so discarding tasks they jointly solve may slightly bias the benchmark against them\. We mitigate this by removing only tasks solved by *all three* panel models, a deliberately narrow criterion, but a fully unbiased construction would require a filtering panel disjoint from the evaluation set\.
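The narrow removal criterion can be made concrete with a short sketch. The task identifiers, model keys, and result layout below are hypothetical; the only rule taken from the text is that a task is dropped when every panel model solves it.

```python
def filter_by_panel(task_ids, panel_results):
    """Keep a task unless every panel model solved it.

    panel_results maps task id -> {model name: solved?}; the layout is an
    illustrative assumption.
    """
    kept = []
    for task_id in task_ids:
        solved_by = panel_results[task_id]
        if not all(solved_by.values()):
            kept.append(task_id)
    return kept


# Toy panel outcomes, not data from the paper.
results = {
    "t1": {"m1": True, "m2": True, "m3": True},    # solved by all three: dropped
    "t2": {"m1": True, "m2": False, "m3": True},   # one miss: kept
    "t3": {"m1": False, "m2": False, "m3": False}, # unsolved: kept
}
print(filter_by_panel(["t1", "t2", "t3"], results))
```

A task that two of the three panel models solve is still retained, which is what keeps the bias against the panel models small.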

Finally, all models are evaluated inside a single minimal harness exposing only a bash tool, a deliberate choice that isolates model capability from scaffolding engineering, which is not our focus\. Absolute accuracies may shift under heavier frameworks; since harness design matters substantially for real\-world agent deployment, we leave a systematic study of its effect to future work\.

## References

- DeepSeek\-AI \(2026\)DeepSeek\-V4 Technical Report\.External Links:[Link](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf)Cited by:[§1](https://arxiv.org/html/2606.24526#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.24526#S4.SS1.SSS0.Px1.p1.1)\.
- Google \(2026a\)Gemini\-3\.1 Flash\-Lite System Card\.External Links:[Link](https://deepmind.google/models/model-cards/gemini-3-1-flash-lite/)Cited by:[§4\.1](https://arxiv.org/html/2606.24526#S4.SS1.SSS0.Px1.p1.1)\.
- Google \(2026b\)Gemini\-3\.1 Pro System Card\.External Links:[Link](https://deepmind.google/models/model-cards/gemini-3-1-pro/)Cited by:[§4\.1](https://arxiv.org/html/2606.24526#S4.SS1.SSS0.Px1.p1.1)\.
- S\. Han, P\. Xia, R\. Zhang, T\. Sun, Y\. Li, H\. Zhu, and H\. Yao \(2025\)MDocAgent: A multi\-modal multi\-agent framework for document understanding\.CoRRabs/2503\.13964\.External Links:[Link](https://doi.org/10.48550/arXiv.2503.13964),[Document](https://dx.doi.org/10.48550/ARXIV.2503.13964),2503\.13964Cited by:[§1](https://arxiv.org/html/2606.24526#S1.p1.1)\.
- X\. Ho, A\. D\. Nguyen, S\. Sugawara, and A\. Aizawa \(2020\)Constructing A multi\-hop QA dataset for comprehensive evaluation of reasoning steps\.InProceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain \(Online\), December 8\-13, 2020,D\. Scott, N\. Bel, and C\. Zong \(Eds\.\),pp\. 6609–6625\.External Links:[Link](https://doi.org/10.18653/v1/2020.coling-main.580),[Document](https://dx.doi.org/10.18653/V1/2020.COLING-MAIN.580)Cited by:[§2](https://arxiv.org/html/2606.24526#S2.p2.1)\.
- P\. Islam, A\. Kannappan, D\. Kiela, R\. Qian, N\. Scherrer, and B\. Vidgen \(2023\)FinanceBench: A new benchmark for financial question answering\.CoRRabs/2311\.11944\.External Links:[Link](https://doi.org/10.48550/arXiv.2311.11944),[Document](https://dx.doi.org/10.48550/ARXIV.2311.11944),2311\.11944Cited by:[Table 1](https://arxiv.org/html/2606.24526#S1.T1.1.1.2),[§1](https://arxiv.org/html/2606.24526#S1.p1.1),[§1](https://arxiv.org/html/2606.24526#S1.p2.1),[§2](https://arxiv.org/html/2606.24526#S2.p2.1)\.
- B\. Jin, J\. Yoon, P\. Kargupta, S\. Ö\. Arik, and J\. Han \(2025a\)An empirical study on reinforcement learning for reasoning\-search interleaved LLM agents\.CoRRabs/2505\.15117\.External Links:[Link](https://doi.org/10.48550/arXiv.2505.15117),[Document](https://dx.doi.org/10.48550/ARXIV.2505.15117),2505\.15117Cited by:[§2](https://arxiv.org/html/2606.24526#S2.p1.1)\.
- B\. Jin, H\. Zeng, Z\. Yue, D\. Wang, H\. Zamani, and J\. Han \(2025b\)Search\-r1: training llms to reason and leverage search engines with reinforcement learning\.CoRRabs/2503\.09516\.External Links:[Link](https://doi.org/10.48550/arXiv.2503.09516),[Document](https://dx.doi.org/10.48550/ARXIV.2503.09516),2503\.09516Cited by:[§1](https://arxiv.org/html/2606.24526#S1.p1.1)\.
- S\. Krishna, K\. Krishna, A\. Mohananey, S\. Schwarcz, A\. Stambler, S\. Upadhyay, and M\. Faruqui \(2025\)Fact, fetch, and reason: A unified evaluation of retrieval\-augmented generation\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 \- Volume 1: Long Papers, Albuquerque, New Mexico, USA, April 29 \- May 4, 2025,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),pp\. 4745–4759\.External Links:[Link](https://doi.org/10.18653/v1/2025.naacl-long.243),[Document](https://dx.doi.org/10.18653/V1/2025.NAACL-LONG.243)Cited by:[Table 1](https://arxiv.org/html/2606.24526#S1.T1.1.6.4.1),[§1](https://arxiv.org/html/2606.24526#S1.p2.1),[§2](https://arxiv.org/html/2606.24526#S2.p2.1)\.
- C\. Li, Z\. Shangguan, Y\. Zhao, D\. Li, Y\. Liu, and A\. Cohan \(2024\)M3SciQA: A multi\-modal multi\-document scientific QA benchmark for evaluating foundation models\.InFindings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12\-16, 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Findings of ACL,pp\. 15419–15446\.External Links:[Link](https://doi.org/10.18653/v1/2024.findings-emnlp.904),[Document](https://dx.doi.org/10.18653/V1/2024.FINDINGS-EMNLP.904)Cited by:[§2](https://arxiv.org/html/2606.24526#S2.p2.1)\.
- X\. Li, Y\. Xiao, D\. Ng, H\. Ye, Y\. Deng, X\. Lin, B\. Wang, Z\. Mo, C\. Zhang, Y\. Zhang, Z\. Yang, R\. Li, L\. Lei, S\. Xu, H\. Zhao, W\. Chen, F\. Ji, and L\. Bing \(2025a\)MiroMind\-m1: an open\-source advancement in mathematical reasoning via context\-aware multi\-stage policy optimization\.CoRRabs/2507\.14683\.External Links:[Link](https://doi.org/10.48550/arXiv.2507.14683),[Document](https://dx.doi.org/10.48550/ARXIV.2507.14683),2507\.14683Cited by:[§2](https://arxiv.org/html/2606.24526#S2.p1.1)\.
- Y\. Li, G\. Yang, H\. Liu, B\. Wang, and C\. Zhang \(2025b\)Dots\.ocr: multilingual document layout parsing in a single vision\-language model\.External Links:2512\.02498,[Link](https://arxiv.org/abs/2512.02498)Cited by:[§3\.3](https://arxiv.org/html/2606.24526#S3.SS3.SSS0.Px1.p1.3)\.
- T\. Lin, Y\. Luo, H\. Zhang, J\. Zhang, C\. Liu, K\. Wu, and N\. Tang \(2025\)MEBench: benchmarking large language models for cross\-document multi\-entity question answering\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4\-9, 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),pp\. 1481–1494\.External Links:[Link](https://doi.org/10.18653/v1/2025.emnlp-main.77),[Document](https://dx.doi.org/10.18653/V1/2025.EMNLP-MAIN.77)Cited by:[§2](https://arxiv.org/html/2606.24526#S2.p1.1)\.
- J\. Luo, W\. Zhang, Y\. Yuan, Y\. Zhao, J\. Yang, Y\. Gu, B\. Wu, B\. Chen, Z\. Qiao, Q\. Long, R\. Tu, X\. Luo, W\. Ju, Z\. Xiao, Y\. Wang, M\. Xiao, C\. Liu, J\. Yuan, S\. Zhang, Y\. Jin, F\. Zhang, X\. Wu, H\. Zhao, D\. Tao, P\. S\. Yu, and M\. Zhang \(2025\)Large language model agent: A survey on methodology, applications and challenges\.CoRRabs/2503\.21460\.External Links:[Link](https://doi.org/10.48550/arXiv.2503.21460),[Document](https://dx.doi.org/10.48550/ARXIV.2503.21460),2503\.21460Cited by:[§1](https://arxiv.org/html/2606.24526#S1.p1.1)\.
- Z\. Ma, B\. Zhang, J\. Zhang, J\. Yu, X\. Zhang, X\. Zhang, S\. Luo, X\. Wang, and J\. Tang \(2024\)SpreadsheetBench: towards challenging real world spreadsheet manipulation\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/ac840df270ac537dd74530a15c332684-Abstract-Datasets%5C_and%5C_Benchmarks%5C_Track.html)Cited by:[§2](https://arxiv.org/html/2606.24526#S2.p1.1)\.
- G\. Mialon, C\. Fourrier, T\. Wolf, Y\. LeCun, and T\. Scialom \(2024\)GAIA: a benchmark for general AI assistants\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,External Links:[Link](https://openreview.net/forum?id=fibxvahvs3)Cited by:[Table 1](https://arxiv.org/html/2606.24526#S1.T1.1.8.6.1),[§1](https://arxiv.org/html/2606.24526#S1.p2.1),[§2](https://arxiv.org/html/2606.24526#S2.p1.1)\.
- OpenAI \(2026\)GPT\-5\.5 System Card\.External Links:[Link](https://deploymentsafety.openai.com/gpt-5-5)Cited by:[§4\.1](https://arxiv.org/html/2606.24526#S4.SS1.SSS0.Px1.p1.1)\.
- K\. Opsahl\-Ong, A\. Singhvi, J\. Collins, I\. Zhou, C\. Wang, A\. Baheti, O\. Oertell, J\. Portes, S\. Havens, E\. Elsen, M\. Bendersky, M\. Zaharia, and X\. Chen \(2026\)OfficeQA pro: an enterprise benchmark for end\-to\-end grounded reasoning\.CoRRabs/2603\.08655\.External Links:[Link](https://doi.org/10.48550/arXiv.2603.08655),[Document](https://dx.doi.org/10.48550/ARXIV.2603.08655),2603\.08655Cited by:[Table 1](https://arxiv.org/html/2606.24526#S1.T1.1.7.5.1),[§1](https://arxiv.org/html/2606.24526#S1.p2.1),[§2](https://arxiv.org/html/2606.24526#S2.p2.1)\.
- T\. Patwardhan, R\. Dias, E\. Proehl, G\. Kim, M\. Wang, O\. Watkins, S\. P\. Fishman, M\. Aljubeh, P\. Thacker, L\. Fauconnet, N\. S\. Kim, P\. Chao, S\. Miserendino, G\. Chabot, D\. Li, M\. Sharman, A\. Barr, A\. Glaese, and J\. Tworek \(2025\)GDPval: evaluating AI model performance on real\-world economically valuable tasks\.CoRRabs/2510\.04374\.External Links:[Link](https://doi.org/10.48550/arXiv.2510.04374),[Document](https://dx.doi.org/10.48550/ARXIV.2510.04374),2510\.04374Cited by:[§2](https://arxiv.org/html/2606.24526#S2.p2.1)\.
- Qwen Team \(2026\)Qwen3\.5: accelerating productivity with native multimodal agents\.External Links:[Link](https://qwen.ai/blog?id=qwen3.5)Cited by:[§4\.1](https://arxiv.org/html/2606.24526#S4.SS1.SSS0.Px1.p1.1)\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\)Toolformer: language models can teach themselves to use tools\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html)Cited by:[§2](https://arxiv.org/html/2606.24526#S2.p1.1)\.
- Y\. Tang and Y\. Yang \(2024\)MultiHop\-rag: benchmarking retrieval\-augmented generation for multi\-hop queries\.CoRRabs/2401\.15391\.External Links:[Link](https://doi.org/10.48550/arXiv.2401.15391),[Document](https://dx.doi.org/10.48550/ARXIV.2401.15391),2401\.15391Cited by:[§2](https://arxiv.org/html/2606.24526#S2.p2.1)\.
- H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal \(2022\)MuSiQue: multihop questions via single\-hop question composition\.Trans\. Assoc\. Comput\. Linguistics10,pp\. 539–554\.External Links:[Link](https://doi.org/10.1162/tacl%5C_a%5C_00475),[Document](https://dx.doi.org/10.1162/TACL%5FA%5F00475)Cited by:[Table 1](https://arxiv.org/html/2606.24526#S1.T1.1.4.2.1),[§1](https://arxiv.org/html/2606.24526#S1.p2.1),[§2](https://arxiv.org/html/2606.24526#S2.p2.1)\.
- X\. Wang, B\. Li, Y\. Song, F\. F\. Xu, X\. Tang, M\. Zhuge, J\. Pan, Y\. Song, B\. Li, J\. Singh, H\. H\. Tran, F\. Li, R\. Ma, M\. Zheng, B\. Qian, Y\. Shao, N\. Muennighoff, Y\. Zhang, B\. Hui, J\. Lin, and et al\. \(2025\)OpenHands: an open platform for AI software developers as generalist agents\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=OJd3ayDDoF)Cited by:[§2](https://arxiv.org/html/2606.24526#S2.p1.1)\.
- Z\. Wang, Y\. Cui, L\. Zhong, Z\. Zhang, D\. Yin, B\. Y\. Lin, and J\. Shang \(2024\)OfficeBench: benchmarking language agents across multiple applications for office automation\.CoRRabs/2407\.19056\.External Links:[Link](https://doi.org/10.48550/arXiv.2407.19056),[Document](https://dx.doi.org/10.48550/ARXIV.2407.19056),2407\.19056Cited by:[§2](https://arxiv.org/html/2606.24526#S2.p1.1)\.
- J\. Wei, Z\. Sun, S\. Papay, S\. McKinney, J\. Han, I\. Fulford, H\. W\. Chung, A\. T\. Passos, W\. Fedus, and A\. Glaese \(2025\)BrowseComp: A simple yet challenging benchmark for browsing agents\.CoRRabs/2504\.12516\.External Links:[Link](https://doi.org/10.48550/arXiv.2504.12516),[Document](https://dx.doi.org/10.48550/ARXIV.2504.12516),2504\.12516Cited by:[Table 1](https://arxiv.org/html/2606.24526#S1.T1.1.9.7.1),[§1](https://arxiv.org/html/2606.24526#S1.p2.1),[§2](https://arxiv.org/html/2606.24526#S2.p1.1)\.
- T\. Xie, D\. Zhang, J\. Chen, X\. Li, S\. Zhao, R\. Cao, T\. J\. Hua, Z\. Cheng, D\. Shin, F\. Lei, Y\. Liu, Y\. Xu, S\. Zhou, S\. Savarese, C\. Xiong, V\. Zhong, and T\. Yu \(2024\)OSWorld: benchmarking multimodal agents for open\-ended tasks in real computer environments\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/5d413e48f84dc61244b6be550f1cd8f5-Abstract-Datasets%5C_and%5C_Benchmarks%5C_Track.html)Cited by:[§2](https://arxiv.org/html/2606.24526#S2.p1.1)\.
- J\. Yang, C\. E\. Jimenez, A\. Wettig, K\. Lieret, S\. Yao, K\. Narasimhan, and O\. Press \(2024\)SWE\-agent: agent\-computer interfaces enable automated software engineering\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/5a7c947568c1b1328ccc5230172e1e7c-Abstract-Conference.html)Cited by:[§1](https://arxiv.org/html/2606.24526#S1.p1.1),[§2](https://arxiv.org/html/2606.24526#S2.p1.1),[§4\.1](https://arxiv.org/html/2606.24526#S4.SS1.SSS0.Px2.p1.1)\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\)HotpotQA: A dataset for diverse, explainable multi\-hop question answering\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 \- November 4, 2018,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),pp\. 2369–2380\.External Links:[Link](https://doi.org/10.18653/v1/d18-1259),[Document](https://dx.doi.org/10.18653/V1/D18-1259)Cited by:[Table 1](https://arxiv.org/html/2606.24526#S1.T1.1.3.1.1),[§1](https://arxiv.org/html/2606.24526#S1.p2.1),[§2](https://arxiv.org/html/2606.24526#S2.p2.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. R\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023,External Links:[Link](https://openreview.net/forum?id=WE%5C_vluYUL-X)Cited by:[§2](https://arxiv.org/html/2606.24526#S2.p1.1)\.
- Q\. Zhang, X\. Lv, J\. Wu, B\. Li, Z\. Tao, G\. Yan, H\. Zhang, B\. Wang, J\. Xu, H\. Mi, and W\. Zhang \(2026\)DocDancer: towards agentic document\-grounded information seeking\.CoRRabs/2601\.05163\.External Links:[Link](https://doi.org/10.48550/arXiv.2601.05163),[Document](https://dx.doi.org/10.48550/ARXIV.2601.05163),2601\.05163Cited by:[§1](https://arxiv.org/html/2606.24526#S1.p1.1)\.
- Y\. Zhao, Y\. Li, C\. Li, and R\. Zhang \(2022\)MultiHiertt: numerical reasoning over multi hierarchical tabular and textual data\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2022, Dublin, Ireland, May 22\-27, 2022,S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),pp\. 6588–6600\.External Links:[Link](https://doi.org/10.18653/v1/2022.acl-long.454),[Document](https://dx.doi.org/10.18653/V1/2022.ACL-LONG.454)Cited by:[§1](https://arxiv.org/html/2606.24526#S1.p1.1)\.
- L\. Zheng, L\. Yin, Z\. Xie, C\. Sun, J\. Huang, C\. H\. Yu, S\. Cao, C\. Kozyrakis, I\. Stoica, J\. E\. Gonzalez, C\. W\. Barrett, and Y\. Sheng \(2024\)SGLang: efficient execution of structured language model programs\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/724be4472168f31ba1c9ac630f15dec8-Abstract-Conference.html)Cited by:[§4\.1](https://arxiv.org/html/2606.24526#S4.SS1.SSS0.Px1.p1.1)\.
- Y\. Zheng, D\. Fu, X\. Hu, X\. Cai, L\. Ye, P\. Lu, and P\. Liu \(2025\)DeepResearcher: scaling deep research via reinforcement learning in real\-world environments\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4\-9, 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),pp\. 414–431\.External Links:[Link](https://doi.org/10.18653/v1/2025.emnlp-main.22),[Document](https://dx.doi.org/10.18653/V1/2025.EMNLP-MAIN.22)Cited by:[§1](https://arxiv.org/html/2606.24526#S1.p1.1)\.
- Zhipu \(2026\)GLM\-5\.1 System Card\.External Links:[Link](https://docs.bigmodel.cn/cn/guide/models/text/glm-5.1)Cited by:[§4\.1](https://arxiv.org/html/2606.24526#S4.SS1.SSS0.Px1.p1.1)\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, U\. Alon, and G\. Neubig \(2024\)WebArena: A realistic web environment for building autonomous agents\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,External Links:[Link](https://openreview.net/forum?id=oKn9c6ytLx)Cited by:[§1](https://arxiv.org/html/2606.24526#S1.p2.1),[§2](https://arxiv.org/html/2606.24526#S2.p1.1)\.
- F\. Zhu, W\. Lei, Y\. Huang, C\. Wang, S\. Zhang, J\. Lv, F\. Feng, and T\. Chua \(2021\)TAT\-QA: A question answering benchmark on a hybrid of tabular and textual content in finance\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, \(Volume 1: Long Papers\), Virtual Event, August 1\-6, 2021,C\. Zong, F\. Xia, W\. Li, and R\. Navigli \(Eds\.\),pp\. 3277–3287\.External Links:[Link](https://doi.org/10.18653/v1/2021.acl-long.254),[Document](https://dx.doi.org/10.18653/V1/2021.ACL-LONG.254)Cited by:[Table 1](https://arxiv.org/html/2606.24526#S1.T1.1.5.3.1),[§1](https://arxiv.org/html/2606.24526#S1.p2.1),[§2](https://arxiv.org/html/2606.24526#S2.p2.1)\.

## Appendix A Sandbox Environment

All Agora tasks run inside an isolated E2B \([https://e2b\.dev](https://e2b.dev/)\) sandbox with 2 vCPUs, 4 GB of memory, no network access, and a 3,600\-second timeout\. The sandbox is built from a fixed Docker template that extends the official `e2bdev/base` image with a Python 3 interpreter, a few command\-line utilities \(`jq`, `ripgrep`, `fd`, `tree`, `less`\), and a pinned set of scientific\-computing and document\-parsing libraries: `pandas`, `numpy`, `scipy`, `openpyxl`, `lxml`, `beautifulsoup4`, `pyyaml`, `pymupdf`, `pdfplumber`, and `tabulate`\. These cover the tabular computation and file parsing that Agora tasks require, without any network access for installing further dependencies\. The working directory is `/workspace`, with the document collection at `/workspace/documents` and `/workspace/run` as scratch space\. All package versions are pinned in the template so the environment is stable across evaluation runs\.

## Appendix B Heuristic for Information\-Density Scoring

We rank candidate chunks for downstream task synthesis with a lightweight additive heuristic, applied during preprocessing to direct the more expensive agentic synthesis toward chunks whose content can plausibly support multi\-hop computation\. Because synthesis is invoked once per seed chunk and incurs a full agent trajectory, even a coarse prefilter materially reduces wasted budget; we therefore favor an inexpensive rule\-based score over an LLM\-judged one\. The signals we accumulate track three properties typical of evidence in workplace reasoning tasks: numeric density, tabular structure, and temporal extent\.

#### Scoring formula\.

For a chunk with raw text $c$ and metadata $m$, we sum the contributions below, each clamped to its listed cap so that no single signal can dominate:

- Numeric density: $5 \cdot 1000 \cdot |N(c)| / |c|$, capped at $30$, where $N(c)$ is the set of numeric tokens in $c$\. This rewards chunks dense in figures such as financial values, counts, or measurements\.
- Length: $\mathrm{tokens}(c)/50$, capped at $20$\. Slightly larger chunks have more room to host the cross\-row computation that multi\-hop questions require\.
- For Excel and CSV chunks we further add: $\mathrm{rows}/100$ \(cap $15$\), $\mathrm{cols}$ \(cap $10$\), $2 \cdot |\textrm{numeric columns}|$, $8$ if any time column is present, and $2 \cdot (\mathrm{year}_{\max} - \mathrm{year}_{\min})$ \(cap $10$\)\. Together they reward wide, numerically typed, time\-spanning sheets\.
- For Markdown chunks we add $10$ if a table is present and $3$ if a list is present; for PDF chunks we add $12$ when the OCR’d output contains a table\. Tabular structure is a strong cue that the chunk admits cell\-level lookup\.
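The additive score listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the numeric-token regex, whitespace tokenization, reading $|c|$ as character length, counting numeric tokens per occurrence, and the metadata keys (`format`, `rows`, `cols`, `numeric_cols`, `has_time_col`, `year_min`, `year_max`, `has_table`, `has_list`) are all hypothetical choices.

```python
import re

# Assumption: a "numeric token" is a run of digits with optional , or . groups.
NUMERIC_TOKEN = re.compile(r"\d+(?:[.,]\d+)*")


def capped(value, cap):
    """Clamp one signal's contribution to its cap."""
    return min(value, cap)


def density_score(text, meta):
    """Additive information-density score for one chunk.

    `meta` is a hypothetical dict; its keys are illustrative names,
    not the paper's schema.
    """
    if not text:
        return 0.0
    n_numeric = len(NUMERIC_TOKEN.findall(text))  # |N(c)|, counted per occurrence
    n_tokens = len(text.split())                  # assumption: whitespace tokens
    score = capped(5 * 1000 * n_numeric / len(text), 30)  # |c| taken as characters
    score += capped(n_tokens / 50, 20)

    fmt = meta.get("format")
    if fmt in ("excel", "csv"):
        score += capped(meta.get("rows", 0) / 100, 15)
        score += capped(meta.get("cols", 0), 10)
        score += 2 * meta.get("numeric_cols", 0)  # no cap is listed for this term
        if meta.get("has_time_col"):
            score += 8
        y_min, y_max = meta.get("year_min"), meta.get("year_max")
        if y_min is not None and y_max is not None:
            score += capped(2 * (y_max - y_min), 10)
    elif fmt == "markdown":
        score += 10 if meta.get("has_table") else 0
        score += 3 if meta.get("has_list") else 0
    elif fmt == "pdf":
        score += 12 if meta.get("has_table") else 0
    return float(score)
```

Because every capped term saturates quickly, a short but number-heavy chunk already reaches the numeric-density cap, and the format-specific bonuses then decide the ranking among such chunks.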

#### Hard filters\.

Two conditions force a sentinel score that excludes the chunk regardless of its other signals\. *\(i\) Token floor\.* Chunks with fewer than $20$ tokens, typically standalone titles, section headers, or stray footer fragments, are too sparse to bridge across documents and are dropped\. *\(ii\) Directory listings\.* Chunks whose leading $200$ characters match a table\-of\-contents pattern \(table of contents, contents, index\) are structural artifacts of their parent document and contribute no extractable evidence\.
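A minimal sketch of the two hard filters follows. The sentinel value, whitespace tokenization, and the exact matching rule (case-insensitive, whole-word) are assumptions; the text gives only the thresholds and the three keywords.

```python
import re

# Hypothetical sentinel; the source does not state the actual value.
SENTINEL = float("-inf")

# Assumption: case-insensitive, whole-word match on the listed keywords.
TOC_PATTERN = re.compile(r"\b(?:table of contents|contents|index)\b", re.IGNORECASE)


def hard_filter_score(text, min_tokens=20, head_chars=200):
    """Return SENTINEL if the chunk must be excluded, otherwise None."""
    if len(text.split()) < min_tokens:          # (i) token floor
        return SENTINEL
    if TOC_PATTERN.search(text[:head_chars]):   # (ii) directory listing
        return SENTINEL
    return None
```

Using negative infinity as the sentinel means any later threshold or sort on the score drops the chunk without a separate exclusion flag.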

#### Selection\.

For each domain we score every chunk, discard those falling below a minimum\-score floor of $10$, sort the remainder by score, and retain the top $100$ as seed chunks for task synthesis \(Section [3\.3](https://arxiv.org/html/2606.24526#S3.SS3)\)\. The floor prunes a long tail of low\-information chunks; the per\-domain cap bounds the synthesis budget and prevents richer corpora from monopolizing the seed pool while ensuring sparser ones still contribute\. We deliberately keep the heuristic mechanical: it is a router for downstream agentic synthesis, not itself a quality judgment, and the subsequent refinement, obfuscation, and quality\-control stages remain responsible for filtering tasks that survive selection but fail later checks\.
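The per-domain selection step amounts to a threshold, a sort, and a cut. The sketch below assumes chunks arrive as `(chunk_id, score)` pairs for a single domain; the function name and input layout are illustrative.

```python
def select_seeds(scored_chunks, min_score=10, top_k=100):
    """Pick seed chunks for one domain.

    scored_chunks: iterable of (chunk_id, score) pairs. Chunks scoring
    below `min_score` are discarded, the rest are sorted by descending
    score, and at most `top_k` ids are returned.
    """
    kept = [(cid, score) for cid, score in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [cid for cid, _ in kept[:top_k]]


# Toy scores, not data from the paper.
scored = [("a", 42.0), ("b", 9.9), ("c", 10.0), ("d", float("-inf")), ("e", 77.5)]
print(select_seeds(scored))
```

Applying the function once per domain, rather than over a pooled list, is what enforces the per-domain cap described above.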

## Appendix C Prompts

- System Prompt for Evaluation
- Prompt for Chunk\-Level Summary
- Prompt for Task Reviewer

`Appendix D Responsible NLP Statements All models evaluated in this work are used strictly in accordance with their respective original licenses and intended\-use terms\. We confirm that our use of existing artifacts is consistent with their intended use\. The dataset we introduce is released strictly for academic research purposes; it is a derivative of data accessed for research purposes and must not be used in any commercial or non\-research context, in line with the original access conditions\. The data used in this work is derived from publicly available sources, which do not contain private personally identifying information\. We discuss the potential risks of this work\. The document collections in Agora are compiled from publicly available workplace files, and we release them strictly for research purposes; users should respect the terms of the original sources and avoid uses beyond academic evaluation\. More broadly, Agora is intended to drive progress in agents that can autonomously explore and reason over large document archives, and advances in such AI capabilities may carry wider societal implications that warrant ongoing attention\. Finally, as pretraining corpora continue to grow, benchmark documents may gradually be absorbed into training data, weakening the closed\-book filtering safeguard against knowledge leakage and causing evaluation results to drift over time\. We acknowledge the use of large language models, specifically Claude Opus 4\.7 and GPT\-5\.5, as writing aids during the preparation of this manuscript\. Their use was limited to language polishing, grammar correction, literature search, code completion, and improving the clarity of presentation\. Appendix E Task Examples Agriculture, Resources & Energy ⬇ A portfolio analyst is tying together Germany’s renewable\-support statistics with the ministry material on sectoral economic effects\. 
In the newest regulator observation in the collection, work within the electricity\-support block at the technology\-row level: keep rows whose later\-column growth in eligible delivered output is below the total row while the same row’s later\-column growth in the companion support\-cash table is above the total row\. From that reduced set, use the newest short ministry brief to select the row that is presented as the technology\-specific capital\-buildout exception, because its latest plant\-building spend is described as having exceeded its own earlier high point\.\\n\\nFor that selected technology, compare two arithmetic means\. On the regulator side, use every successive observation in the collection that reports this eligible\-output support table; treat each table’s later\-year column as that observation and its immediately preceding column in the same table as the base\. On the ministry side, form the comparable plant\-building investment series from exactly three ministry documents, in order: the comprehensive annual numbers compendium for the first investment year, the subsequent English\-language bridge note for the middle investment year, and the newer German short brief for the latest investment year\.\\n\\nCompute all rates from table cell values, not rounded percentage displays or prose, and put physical\-output and currency scales on consistent units before forming rates\. Report the regulator\-side mean minus the ministry\-side mean as signed percentage points rounded to two decimals, formatted as ‘\+x\.xx pp‘ or ‘\-x\.xx pp‘\. Answer: \-18\.03 pp Architecture, Construction, Real Estate & Facilities ⬇ An international housing dashboard needs one multiplier\. For the Japan\-side numerator, first use the survey report’s own definitions to determine the full set of dwelling\-related categories\. 
For each category, go to its repeated cross\-year table and choose the respondent\-choice item that records where the respondent found the relevant dwelling or contractor; do not use the nearby item about general digital activity when it lacks the full early\-to\-late series\. Within the chosen item, use the option for the web route, take values at the broadest geography that the survey design makes available for that category, align all resulting series to their shared local\-era fiscal\-year window, and keep the largest compound annual change\. For the England\-side denominator, use the valuation component report for a national housing survey: take the opening headline percent change for the arithmetic\-average dwelling value between its two valuation reference dates, annualized by exact elapsed days using a 365\.2425\-day year\. What is the numerator divided by the denominator? Return the quotient rounded to three decimal places, followed by ‘x‘\. Format: ‘<value with three decimal places\>x‘\. Answer: 5\.517x Business, Management, Marketing & Sales ⬇ An Italian connectivity retailer wants a single launch\-pricing premium that compares a present\-day access\-and\-demand basket with the older wholesale access signal that a retrospective infrastructure\-investment study treats as requiring entrants to commit assets inside the incumbent’s local exchange footprint, then scales the gap by the size of the top\-level requirement catalogue in the card\-acceptance security rulebook used by sales environments\. Use only the source documents’ own endpoints and displayed percentage values\. The historical side is the compound annual growth of that study’s retained proxy across its full observation window; any present\-day component that spans a multi\-year sequence uses the same compound annual\-growth convention\. 
Build the present\-day side as an equal\-weight average of four roles: the household reach measure for the end\-to\-premises fixed\-access form that the small\-firm white paper distinguishes from the cabinet\-plus\-copper alternative, carried from that paper’s first displayed national household point to the later annual\-report value for the same public benchmark; the matching reach measure for the address\-resolved small\-firm sample, carried from the white paper’s research\-sample result to the annual report’s later update for the same firm segment and computed as simple relative growth; the fixed\-service usage\-volume sequence in the statistics annex, using every year displayed there; and the relative uplift between the consumer\-study demand signal for people already in the top current\-speed class and the business\-study demand signal for the current\-speed class that the business report identifies as the one most inclined to spend more\. Report the present\-day\-minus\-historical premium in percentage points per top\-level requirement entry, rounded to three decimals\. Answer format: ‘x\.xxx pp/requirement‘\. Answer: 2\.463 pp/requirement Education, Science & Academia ⬇ An analyst is studying how Chinese universities turned research results into cash through external transfer deals\. First, use the recurring English executive\-summary series from the national policy\-research institute outside China in the corpus: take the consecutive releases in which the country that is the subject of each summary sits just behind the two larger national systems on both the money\-scale and workforce\-scale front\-door capacity measures, and stop with the release where that country’s authorship\-apportioned bibliometric\-output rank no longer matches the repeated position seen in the immediately preceding releases\. 
For the Chinese higher\-education S&T statistical books whose release years fall in that release span, retain only those editions whose prefatory note says the book is built from the normal province\-administered annual reporting process for the regular higher\-education S&T population, rather than from a special nationwide inventory of research\-development resources\. Respect each retained book’s own institutional and territorial scope\. From the national aggregate row in the broad activity\-overview table, take the cash received during the year for university result\-transfer deals and divide it by the count of such deals made in that year\. Let beta be the OLS slope of the natural log of that cash\-per\-deal series on the underlying statistical year recorded in the prefatory note\. Report exp\(beta\)\-1 as one signed percentage rounded to three decimals, using the template \[sign\]x\.xxx%\. Answer: \-9\.710%
Finance & Economics
Within the source, first identify the holdings review that ranks a primary\-production stock basket by the money committed by stock\-heavy fund products and says that, inside its highest\-ranked basket, the largest representation remains an upstream animal\-protein group\. For the companies in that role\-defined group whose standalone company note for the same reporting season both shows the same visible publication date as the holdings review and provides a full forward sequence for profit attributable to the parent, compute each firm’s CAGR from the sequence’s first forecast year to its last\. Treat the different RMB magnitude labels used across narrative and tables as the same scale before calculating\. What is the cross\-sectional median of those CAGRs? Return one percentage rounded to three decimals, in the format ‘x\.xxx%‘\. Answer: 23\.687%
Healthcare & Medicine
A public health analyst is comparing one admissions\-side category across several releases from the same substance\-treatment admissions data family\.
First identify the category from the older national report’s special\-interest section: it is the primary\-substance group used to discuss planned medication support for opioid treatment, in a context that distinguishes person\-entry admissions data from a facility\-day client measure used elsewhere\. Then gather the annual counts for that same group from the older report’s multi\-year primary\-substance count table, from the frequency tables in the two separate single\-year admission\-file metadata documents, and from the later report’s admissions\-side comparison of the leading primary\-substance groups\. Using only those observations, take the largest count as the base and the chronologically latest count as the endpoint\. What compound annual rate of change do they imply? Round the signed percentage to three decimals\. Answer format: ‘x\.xxx%‘ Answer: \-17\.051%
Law
Using only the Law corpus, compute one normalized dispersion score by joining two groups of materials\. For the England\-and\-Wales reform project, identify the project through its role in recommending changes to the rules governing the two principal ways a body is dealt with after death, including land\-capacity pressure affecting one of them; use its original call\-for\-views document to obtain the opening and closing dates of the response window, and its later summary document to obtain the total submissions received\. From the French statutory\-code extract pages, retain each distinct code whose territorial\-extension/adaptation provisions include a court\-label swap from the mainland general court to the locally named first\-tier forum\. For each retained code, compute the signed calendar\-day difference between the footer date tied to the page\-copy stamp and the footer date tied to the code\-revision stamp\.
Take the population standard deviation of those signed differences, multiply by the submission total, and divide by both the retained\-code count and the inclusive length of the response window\. Return only the result rounded to exactly two decimal places, in the form ‘xx\.xx‘\. Answer: 16\.87
Technology, Software & Manufacturing
A productivity benchmarking team wants one dispersion check for a source\-code\-line productivity measure in the Japanese development\-benchmark family present in this source set\. Use the main volume and every companion volume whose preface says it reuses the main volume’s analysis layout for business\-domain slices\. In each selected volume, go to the appendix productivity section where code size is divided by the five\-phase development effort\. From the row aggregating all size bands and the hourly unit, take the methodology\-defined middle statistic rather than the mean, once from the table for first\-build projects in the release’s bounded analysis window and once from the matching cumulative\-period table\. Keep the panel only if the number of values equals the length of the contiguous clause range that the toy\-safety comparison’s counterpart\-standard column places outside ordinary\-use testing as foreseeable\-misuse testing\. Apply the benchmark’s stated outlier\-handling rule, then compute the sample coefficient of variation for the completed panel\. Report the result as a percentage rounded to three decimal places\. Answer format: ‘x\.xxx%‘\. Answer: 26\.747%`
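
The example questions above rely on a few recurring numeric conventions: compound annual growth between two endpoints, annualization over exact elapsed days with a 365.2425-day year, exp(beta)-1 from an OLS fit of a log series, population standard deviation, and the sample coefficient of variation. The Python sketch below shows one way these could be computed. The function names and all input values are hypothetical illustrations, not part of Agora or its documents, and the compound form of the day-count annualization is an assumed reading of the question text rather than a confirmed rule of the benchmark.

```python
import math
import statistics
from datetime import date

DAYS_PER_YEAR = 365.2425  # year length named in the housing-survey question


def cagr(first, last, periods):
    """Compound annual growth rate between two endpoints, as a fraction."""
    return (last / first) ** (1.0 / periods) - 1.0


def annualize_by_days(pct_change, start, end):
    """Annualize a percent change over the exact elapsed days between dates.

    Assumption: compounding is used; the question only says "annualized".
    """
    years = (end - start).days / DAYS_PER_YEAR
    return ((1.0 + pct_change / 100.0) ** (1.0 / years) - 1.0) * 100.0


def log_linear_growth(years, values):
    """Return exp(beta) - 1, where beta is the OLS slope of ln(value) on year."""
    logs = [math.log(v) for v in values]
    x_bar = statistics.fmean(years)
    y_bar = statistics.fmean(logs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, logs))
    sxx = sum((x - x_bar) ** 2 for x in years)
    return math.exp(sxy / sxx) - 1.0


def sample_cv(values):
    """Sample coefficient of variation (n - 1 denominator), in percent."""
    return statistics.stdev(values) / statistics.fmean(values) * 100.0


if __name__ == "__main__":
    # Made-up inputs, chosen only to exercise each helper.
    print(f"{cagr(100.0, 121.0, 2) * 100:.3f}%")
    print(f"{annualize_by_days(10.0, date(2020, 1, 1), date(2021, 1, 1)):.3f}%")
    print(f"{log_linear_growth([2015, 2016, 2017], [100.0, 110.0, 121.0]) * 100:+.3f}%")
    print(f"{statistics.pstdev([2, 4, 4, 4, 5, 5, 7, 9]):.2f}")
    print(f"{sample_cv([10.0, 20.0, 30.0]):.3f}%")
```

Population and sample statistics are kept apart on purpose: `statistics.pstdev` divides by n, as the Law question requests, while `statistics.stdev` divides by n - 1, as the Technology question's sample coefficient of variation implies. The cross-sectional median in the Finance question maps directly to `statistics.median`.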
