QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples

arXiv cs.CL 06/04/26, 04:00 AM Papers
Summary
QO-Bench is a diagnostic benchmark for query-operator question answering over typed event tuples, covering 22,984 news articles and 614 corporate events across 18 query templates. It evaluates RAG, ReAct RAG, GraphRAG, and extraction-to-SQL systems, finding that operator execution—not just retrieval—is a core bottleneck that stronger models alone cannot resolve.
arXiv:2606.04646v1 Announce Type: new Abstract: Many real-world questions over business, legal, and scientific corpora are natural-language versions of database-style queries over records latent in text. Existing retrieval-augmented generation (RAG) systems are optimized primarily for semantic relevance, but retrieving plausible passages does not guarantee correct query execution. We introduce QO-Bench, a diagnostic benchmark for query-operator question answering over typed event tuples. The benchmark covers 22,984 news articles and 614 corporate events across 18 query templates, evaluated on 785 questions. Each gold answer is deterministically computed from typed event tuples and scored by recall, with answers matched to the gold tuples by exact match rather than an LLM judge. This design enables operator-level diagnosis such as joins and intersection. We evaluate RAG, ReAct RAG, GraphRAG, and information-extraction-to-SQL under matched conditions, with a long-context oracle ceiling to isolate retrieval failure. A two-axis framework -- index-time preservation versus query-time execution -- predicts where each paradigm fails, and the results bear it out: systems retrieve relevant text but discard the typed values operators need, and the deployable paradigm ranking inverts across operators, with similarity retrieval leading on filter/project and extraction-to-SQL on intersection and counting. Even given the gold evidence, a long-context oracle stays far from saturated, so operator execution -- not retrieval alone -- is a core bottleneck that a stronger answer model does not remove. QO-Bench reframes the goal from passage relevance to query-operator-preserving retrieval.
Cached at: 06/05/26, 02:16 AM
# QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples
Source: [https://arxiv.org/html/2606.04646](https://arxiv.org/html/2606.04646)
Mengao Zhang (corresponding author), Xiang Yang (work done during an internship at the Asian Institute of Digital Finance, National University of Singapore), Chang Liu, Tianhui Tan, Ke-Wei Huang. Asian Institute of Digital Finance, National University of Singapore. {mengaoz, chang_liu, tant, dishkw}@nus.edu.sg, e0556732@u.nus.edu

###### Abstract

Many real-world questions over business, legal, and scientific corpora are natural-language versions of database-style queries over records latent in text. Existing retrieval-augmented generation (RAG) systems are optimized primarily for semantic relevance, but retrieving plausible passages does not guarantee correct query execution. We introduce QO-Bench, a diagnostic benchmark for *query-operator question answering* over typed event tuples. The benchmark covers 22,984 news articles and 614 corporate events across 18 query templates, evaluated on 785 questions. Each gold answer is deterministically computed from typed event tuples and scored by recall, with answers matched to the gold tuples by exact match rather than an LLM judge. This design enables operator-level diagnosis such as joins and intersection. We evaluate RAG, ReAct RAG, GraphRAG, and information-extraction-to-SQL under matched conditions, with a long-context oracle ceiling to isolate retrieval failure. A two-axis framework (index-time preservation versus query-time execution) predicts where each paradigm fails, and the results bear it out: systems retrieve relevant text but discard the typed values operators need, and the deployable paradigm ranking inverts across operators, with similarity retrieval leading on filter/project and extraction-to-SQL on intersection and counting. Even given the gold evidence, a long-context oracle stays far from saturated, so operator execution, not retrieval alone, is a core bottleneck that a stronger answer model does not remove. QO-Bench reframes the goal from passage relevance to query-operator-preserving retrieval.


## 1 Introduction

Many high-value information needs in business, law, and policy are expressed in natural language but behave like database queries. A user may ask, *Which firms were buyers in M&A announcements in both 2018 and 2022?* or *Which firms had a CEO change within 30 days of an M&A announcement?* The records these questions range over are not given in a table; they lie *latent across a text corpus*, reported piecemeal across many articles. Because no single article pre-aggregates them for an arbitrary query, such questions cannot rely on one summarizing passage. They require selecting events that satisfy constraints, assigning roles, anchoring dates, joining evidence across conditions, and applying further aggregations. We call this setting *query-operator question answering* (QO-QA): natural-language questions specify database-style operators over records. Relational databases support such questions through typed attributes, query plans, and set semantics Codd ([1970](https://arxiv.org/html/2606.04646#bib.bib31), [1972](https://arxiv.org/html/2606.04646#bib.bib32)); Gray et al. ([1997](https://arxiv.org/html/2606.04646#bib.bib33)). However, many real corpora are not clean databases. They are news articles, filings, contracts, reports, emails, or analyst notes, where the relevant records must first be recovered from text. QO-QA therefore lies at the boundary between semantic parsing and retrieval-augmented generation: the question demands database-style execution, but the evidence resides in unstructured documents.

Retrieval-augmented generation (RAG) Lewis and others ([2020](https://arxiv.org/html/2606.04646#bib.bib1)); Karpukhin and others ([2020](https://arxiv.org/html/2606.04646#bib.bib2)) is typically built around a different contract: retrieve passages that are semantically relevant to a question, then ask a language model to synthesize an answer. This contract is powerful for passage lookup, but QO-QA exposes a structural mismatch. For *Which firms were buyers in M&A announcements in both 2018 and 2022?*, a system must identify M&A events, normalize firm names, distinguish buyers from targets, attach announcement dates, construct buyer sets for each year, and intersect them. Top-$k$ retrieval may return plausible passages, but semantic relevance alone does not guarantee role correctness, temporal correctness, set completeness, or count correctness.
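To make the operator chain in this example concrete, the following minimal Python sketch runs it over already-typed tuples. It is an illustration only, not the benchmark's code: the `Event` class, its field names, and the `FIRM_*` identifiers are assumptions introduced here, and the hard part of QO-QA (recovering such tuples from text) is deliberately skipped.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    event_type: str    # e.g. "ma_announcement" (hypothetical label)
    entity: str        # normalized firm identifier (hypothetical IDs below)
    role: str          # "buyer" or "target"
    anchor_date: date  # announcement date


def buyers_in_year(events, year):
    """Set of firms that appear as M&A buyers in the given year."""
    return {
        e.entity
        for e in events
        if e.event_type == "ma_announcement"
        and e.role == "buyer"
        and e.anchor_date.year == year
    }


def buyers_in_both_years(events, year_a, year_b):
    """Intersect the per-year buyer sets."""
    return buyers_in_year(events, year_a) & buyers_in_year(events, year_b)


# Toy data, invented for illustration only.
EVENTS = [
    Event("ma_announcement", "FIRM_A", "buyer", date(2018, 3, 1)),
    Event("ma_announcement", "FIRM_A", "buyer", date(2022, 6, 9)),
    Event("ma_announcement", "FIRM_B", "buyer", date(2018, 5, 2)),
    Event("ma_announcement", "FIRM_B", "target", date(2022, 1, 15)),
    Event("ceo_change", "FIRM_C", "subject", date(2022, 2, 2)),
]

result = buyers_in_both_years(EVENTS, 2018, 2022)
print(sorted(result))
```

In the toy data, `FIRM_B` is a buyer in 2018 but a target in 2022, so a passage mentioning it in both years would look relevant while the role filter excludes it from the intersection.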

The capability needed here is *operator-preserving retrieval*: retrieval that preserves the typed values required to execute the operators expressed in the question. This framing separates two sources of failure. On the corpus side, an index may fail to preserve operator-relevant attributes. On the query side, a system may fail to convert the natural-language question into an executable plan involving the operators it induces. Standard RAG and ReAct-style RAG Yao and others ([2023](https://arxiv.org/html/2606.04646#bib.bib8)) leave most operator execution to the generator. GraphRAG Edge and others ([2024](https://arxiv.org/html/2606.04646#bib.bib11)) introduces entity-relation structure and summaries, but summaries may compress away necessary attributes. Information-extraction-to-SQL systems Li and others ([2021](https://arxiv.org/html/2606.04646#bib.bib13)); Yu and others ([2018](https://arxiv.org/html/2606.04646#bib.bib14)); Scholak and others ([2021](https://arxiv.org/html/2606.04646#bib.bib15)) execute operators explicitly, but only after committing to a schema and extraction pipeline whose coverage and normalization may be incomplete. Thus, QO-QA is not simply "text-to-SQL over news" or "RAG with more documents": it asks whether retrieval architectures can bridge natural-language operator intent and latent event structure in text.

We introduce QO-Bench, a diagnostic benchmark for QO-QA over typed event tuples. (Code, benchmark, baselines, and evaluation scripts: [https://github.com/ZHANG-MENGAO/qo-bench](https://github.com/ZHANG-MENGAO/qo-bench).) The benchmark spans 18 query templates evaluated on a stratified sample of 785 questions. The questions are instantiated over 22,984 news articles covering 614 corporate events. (Corporate events provide a useful substrate because they naturally exhibit the operator surface we want to test. The same QO-QA structure appears beyond business news, including legal case aggregation, biomedical evidence synthesis, and policy monitoring.) Each gold answer is deterministically computed from event tuples with event type, entity, role, anchor date, counterparty, and provenance fields. This gives QO-Bench an explicit denotational gold standard. Systems return structured final answers (entity lists, event lists, ordered lists, counts, grouped results), which are canonicalized, compared with the gold denotation by exact match rather than by an LLM judge, and scored with template-specific recall. This design lets us attribute each failure to a specific operator rather than to answer phrasing. Our experiments compare RAG, ReAct RAG, GraphRAG local and global search, and information-extraction-to-SQL. Our goal is not to crown a winner, but to produce a failure profile: which operators each paradigm supports, which it approximates poorly, and which it cannot reliably execute.

Our experiments yield a consistent diagnosis. No paradigm dominates: the deployable paradigms' ranking *inverts* across operators, with similarity retrieval (RAG, ReAct RAG) strongest on filter/project and IE→SQL strongest on intersection and counting. The failures also separate along the framework's two axes: similarity retrieval loses *coverage*, GraphRAG preserves structure but not exact *values*, and IE→SQL covers events yet cannot *execute* cross-event joins beyond its schema. The ceiling is itself operator-bound: even when fed the gold evidence, a long-context oracle stays far from saturated (about 4% on set intersection), and a stronger or more heavily reasoning answer model does not lift it. Operator execution, not retrieval alone, is a core bottleneck.

#### Contributions\.

First, we formulate QO-QA, where questions specify database-style operators over records latent in text, and identify operator preservation as a central retrieval property. Second, we introduce QO-Bench, a benchmark with deterministic typed event-tuple gold answers and role-, date-, and counterparty-aware denotations over provenance-attested events. Third, we use the QO-QA framework to decompose representative paradigms along its two axes (index-time preservation and query-time execution) and, against a long-context oracle ceiling that separates retrieval from answer-synthesis failures, locate where each paradigm fails.

## 2 Related Work

QO-Bench builds on work in retrieval-augmented, multi-document, temporal, list, and aggregation question answering, but targets a different diagnostic unit. Dense, late-interaction, and retrieval-augmented models make large-corpus retrieval central to QA Karpukhin and others ([2020](https://arxiv.org/html/2606.04646#bib.bib2)); Khattab and Zaharia ([2020](https://arxiv.org/html/2606.04646#bib.bib3)); Lewis and others ([2020](https://arxiv.org/html/2606.04646#bib.bib1)); Izacard and Grave ([2021](https://arxiv.org/html/2606.04646#bib.bib4)), while multi-document and long-context benchmarks test whether models combine evidence across passages Shaham and others ([2023](https://arxiv.org/html/2606.04646#bib.bib16)); Bai and others ([2024](https://arxiv.org/html/2606.04646#bib.bib17)). Multi-hop and agentic retrieval methods further decompose questions into iterative retrieval steps Yang and others ([2018](https://arxiv.org/html/2606.04646#bib.bib5)); Trivedi and others ([2022](https://arxiv.org/html/2606.04646#bib.bib6)); Ho and others ([2020](https://arxiv.org/html/2606.04646#bib.bib7)); Yao and others ([2023](https://arxiv.org/html/2606.04646#bib.bib8)); Press and others ([2023](https://arxiv.org/html/2606.04646#bib.bib9)); Trivedi and others ([2023](https://arxiv.org/html/2606.04646#bib.bib10)). These settings are related, but they usually evaluate whether systems find and synthesize relevant evidence, rather than whether retrieval preserves the typed values needed to execute database-style operators.

Table 1: Comparison with related benchmarks. QO-Bench is, to our knowledge, the first benchmark to combine deterministic typed event-tuple gold (no LLM judge), per-operator failure attribution, and a matched multi-architecture comparison against an oracle ceiling.

Several benchmarks cover parts of this problem. FanOutQA evaluates fan-out multi-document questions Zhu et al. ([2024](https://arxiv.org/html/2606.04646#bib.bib24)); MEBench studies cross-document reasoning Lin and others ([2025](https://arxiv.org/html/2606.04646#bib.bib28)); TLQA evaluates time-referenced list construction Dumitru et al. ([2025](https://arxiv.org/html/2606.04646#bib.bib25)); ChronoQA studies temporal-sensitive RAG with absolute, aggregate, and relative temporal questions Chen and others ([2025](https://arxiv.org/html/2606.04646#bib.bib26)); and RAGBench evaluates general RAG behavior and attribution Friel et al. ([2024](https://arxiv.org/html/2606.04646#bib.bib29)). List-QA datasets such as LIQUID evaluate questions with multiple non-contiguous answers Lee et al. ([2023](https://arxiv.org/html/2606.04646#bib.bib30)), while financial QA benchmarks such as FinQA, TAT-QA, FinanceBench, and FAITH emphasize numerical reasoning, tabular reasoning, evidence-grounded answers, or tabular faithfulness over financial documents Chen and others ([2021](https://arxiv.org/html/2606.04646#bib.bib19)); Zhu and others ([2021](https://arxiv.org/html/2606.04646#bib.bib18)); Islam and others ([2023](https://arxiv.org/html/2606.04646#bib.bib20)); Zhang et al. ([2025](https://arxiv.org/html/2606.04646#bib.bib35)). AGGBench is closest because it studies aggregation over unstructured text as a completeness-oriented find-all task Zhu and others ([2026](https://arxiv.org/html/2606.04646#bib.bib27)). Unlike AGGBench's entity-level aggregation setting, QO-Bench evaluates deterministic query execution over typed event tuples.

QO-Bench is also related to structure-based retrieval and information extraction. GraphRAG organizes extracted entities and relations into graphs Edge and others ([2024](https://arxiv.org/html/2606.04646#bib.bib11)), which can improve enumeration and global sense-making but may lose exact operator-relevant values such as dates, roles, and counts. Event extraction, semantic parsing, and text-to-SQL make operators executable by converting text into structured records and questions into programs Wang and others ([2020](https://arxiv.org/html/2606.04646#bib.bib12)); Li and others ([2021](https://arxiv.org/html/2606.04646#bib.bib13)); Yu and others ([2018](https://arxiv.org/html/2606.04646#bib.bib14)); Scholak and others ([2021](https://arxiv.org/html/2606.04646#bib.bib15)), but depend on predefined schemas and complete normalized tuples. We therefore compare these paradigms under matched conditions (§[5](https://arxiv.org/html/2606.04646#S5)). The goal is not to claim that aggregation, temporal reasoning, list answers, or financial QA are new, but to test whether retrieval architectures can preserve and execute the typed event values required by natural-language query operators. Table [1](https://arxiv.org/html/2606.04646#S2.T1) summarizes this distinction.

## 3 A Framework for Query-Operator QA

### 3.1 QO-QA as Denotational Retrieval

We formalize QO\-QA as a derivation from corpus to answer\. A corpus entails events with attributes; a question selects events, projects fields, and applies an operator; and a system is judged by whether it recovers the denotation required by the question\.

Let $C=\{d_1,\ldots,d_n\}$ be a document corpus and let $E=\{e_1,\ldots,e_m\}$ be the latent event set entailed by $C$. Each event $e\in E$ has an attribute set $A(e)$ whose elements $a=(n,v)$ pair an attribute name with a value derivable from the corpus. Attribute names are open-ended: for an M&A event, for example, relevant names may include acquirer, target, seller, announcement date, completion date, deal value, jurisdiction, and regulatory status.

A question $q$ is a typed query specification

$$q=(k,\mathcal{D}_q,\phi,\pi,\alpha),$$

where $k\geq 1$ is the arity and $\mathcal{D}_q\subseteq E_{\tau_1}\times\cdots\times E_{\tau_k}$ is the typed tuple domain induced by the event types in the question. For $k=1$, we identify the tuple $(e)$ with the event $e$ when this causes no ambiguity. The predicate $\phi$ selects tuples from $\mathcal{D}_q$ using attribute constraints and, for $k\geq 2$, tuple-level conditions such as shared entities, temporal windows, or announce–complete deal identity. The projection $\pi$ returns structured records with component labels, and $\alpha$ is an operator such as list, count, or group.

The selected tuple relation and projected relation are

$$R_q=\{\vec{e}\in\mathcal{D}_q : \phi(A(\vec{e}))=1\}, \tag{1}$$

$$P_q=\langle \pi(A(\vec{e})) : \vec{e}\in R_q\rangle, \tag{2}$$

where $A(\vec{e})=(A(e_1),\ldots,A(e_k))$ and $\langle\cdot\rangle$ denotes a multiset that preserves tuple identity and multiplicity, as required by counting, grouping, and ordering.

The gold answer is the denotation

$$y^{*}(q)=\mathsf{Ans}_{E}(q)=\alpha(P_q). \tag{3}$$

Depending on $\alpha$, $y^{*}(q)$ may be a list or set of entities or events, an ordered list, a matched event-pair set, a count, or a grouped table; when $R_q=\emptyset$, it is the empty list, set, or table, or a count of $0$.
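The select, project, and aggregate pipeline of Equations (1) to (3) can be sketched in a few lines of Python. This is a hedged illustration rather than the benchmark's evaluator: the function and operator names (`denotation`, `op_list`, `op_count`, `op_group`), the dictionary-based events, and the toy domain are all assumptions made here, and only the single-event case $k=1$ is shown.

```python
from collections import Counter


def denotation(domain, phi, pi, alpha):
    """Compute alpha(P_q) for a tuple domain, predicate, projection, and operator."""
    selected = [t for t in domain if phi(t)]  # R_q: tuples satisfying phi
    projected = [pi(t) for t in selected]     # P_q: a list keeps multiplicity
    return alpha(projected)


# Three illustrative operators (names are assumptions, not the paper's API).
def op_list(records):
    return sorted(records)


def op_count(records):
    return len(records)


def op_group(records):
    return dict(Counter(records))


# Toy k=1 domain, invented for illustration only.
DOMAIN = [
    {"type": "ipo", "firm": "FIRM_A", "quarter": "2021Q1"},
    {"type": "ipo", "firm": "FIRM_B", "quarter": "2021Q1"},
    {"type": "ipo", "firm": "FIRM_C", "quarter": "2021Q3"},
    {"type": "stock_split", "firm": "FIRM_A", "quarter": "2021Q1"},
]


def is_ipo(event):
    return event["type"] == "ipo"


ipo_firms = denotation(DOMAIN, is_ipo, lambda e: e["firm"], op_list)
ipo_count = denotation(DOMAIN, is_ipo, lambda e: e["firm"], op_count)
ipos_by_quarter = denotation(DOMAIN, is_ipo, lambda e: e["quarter"], op_group)
empty_count = denotation(DOMAIN, lambda e: False, lambda e: e["firm"], op_count)
```

Keeping the projected relation as a list rather than a set mirrors the multiset in Equation (2): counting and grouping depend on multiplicity, and an empty selection yields an empty list or a count of 0.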

A document set $D\subseteq C$ is sufficient for $q$ if it attests the attributes needed to evaluate $\phi$ and the attributes returned by $\pi$ for the selected tuples. For tuple queries, this includes fields needed to evaluate joins, such as normalized entity IDs, roles, anchor dates, counterparties, and deal-linking identifiers. Thus, QO-QA stresses set-complete retrieval: a system must surface the selected tuples in $R_q$ and the fields needed to project and aggregate them. In our benchmark, many multi-document questions are multi-event and multi-article retrieval problems; in broader settings, a single event may also require evidence composed across multiple documents.

### 3.2 Preservation and Execution

QO\-QA imposes two architectural requirements\. First, the index must preserve operator\-relevant values\. Second, the runtime must execute the operator structure expressed by the question\.

#### Index\-time preservation\.

The relevant issue is not whether the original text is stored, but whether operator-relevant values are recoverable through the system's query-time interface in typed, executable form. Let $\iota(C)$ denote the index a system builds over $C$, that is, its stored, queryable representation of the corpus. We call an index $\mathcal{S}$-preserving for an attribute set $\mathcal{S}$ if, for every event $e$ and every attribute $a\in A(e)$ whose name lies in $\mathcal{S}$, the value of $a$ can be recovered from $\iota(C)$ by a well-defined query-time procedure under the system's *retrieval budget*: the bounded evidence (e.g., top-$k$ chunks or a fixed number of retrieval hops) it may surface per query (Appendix [F](https://arxiv.org/html/2606.04646#A6)). Full open-schema preservation, where $\mathcal{S}$ contains all possible attribute names in $A(e)$, is an idealized target.

#### Query\-time execution\.

At query time, the system must parse the question into an operator specification, retrieve tuples and fields conditioned on that specification, and apply $\alpha$ over the projected relation. Parsing errors produce the wrong plan; retrieval errors omit tuples or fields; aggregation errors occur when evidence is present but the system still miscounts, misgroups, misorders, or fails to intersect sets. For large selected relations, set-complete retrieval is often the bottleneck: once relevant tuples are outside the retrieved context, downstream generation cannot reconstruct them from available evidence.

We apply these two requirements as diagnostic axes for five representative paradigms: RAG, ReAct RAG, GraphRAG in local and global query modes, and IE→SQL, which builds on event extraction and text-to-SQL. The selection is not exhaustive, but it spans distinct points along both axes. Table [2](https://arxiv.org/html/2606.04646#S3.T2) summarizes the decomposition.

Table 2: Paradigm decomposition along the two requirements of §[3.2](https://arxiv.org/html/2606.04646#S3.SS2). *Indexing* (Method, Preservation scope) shows how the index is built and which part of $A(e)$ it preserves in queryable form; *Querying* (Parse, Retrieve, Aggregate) shows how the runtime dispatches the three query-time subtasks.

RAG and ReAct RAG preserve article text, embeddings, and chunk-level metadata, but leave role, stage, counterparty, and aggregation operators to the generator. GraphRAG preserves entity- and relation-level structure in a summary-driven graph, but exact operator-relevant values such as dates, stages, roles, and counts may be compressed into prose. IE→SQL materializes typed attributes and executes operators explicitly, but only for fields covered by its frozen schema $\mathcal{S}$. The unoccupied design point is corpus-side coverage with query-time typed execution beyond a fixed schema; QO-Bench diagnoses which parts of this pipeline current paradigms support and which remain missing.

### 3.3 A Tractable Subclass for Evaluation

The latent formulation above allows open\-ended queries over open\-ended attribute sets, which is difficult to evaluate deterministically\. We therefore evaluate a controlled subclass\.

*(1) Schema-bounded queries.* Let $\tau(e)$ denote the event type of event $e$. For each event type $t$, we fix a finite schema $S(t)$ containing the fields used by our templates. Every template restricts $\phi$ and $\pi$ to fields in the schemas of the component event types.

*(2) Single-document-attestable events.* An event $e\in E$ is admitted to the operational set $\widehat{E}\subseteq E$ iff some article attests every schema field in $S(\tau(e))$ for $e$, after canonicalization and conflict resolution.
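The admission rule in (2) can be written as a short check. The sketch below is an assumption-laden illustration: the data layout (`event_types`, `schema_by_type`, `attested_fields`), the field names, and the toy events are invented here, and canonicalization and conflict resolution are assumed to have already been applied.

```python
def attests_all_fields(article_fields, schema):
    """True if a single article supplies a value for every schema field."""
    return all(article_fields.get(field) is not None for field in schema)


def operational_set(event_types, schema_by_type, attested_fields):
    """Admit an event iff at least one article attests its full schema.

    event_types:     event_id -> event type
    schema_by_type:  event type -> tuple of required field names, i.e. S(t)
    attested_fields: event_id -> list of per-article {field: value} dicts
    """
    admitted = set()
    for event_id, event_type in event_types.items():
        schema = schema_by_type[event_type]
        articles = attested_fields.get(event_id, [])
        if any(attests_all_fields(article, schema) for article in articles):
            admitted.add(event_id)
    return admitted


# Toy inputs, invented for illustration only.
SCHEMA_BY_TYPE = {"ipo": ("entity", "anchor_date")}
EVENT_TYPES = {"e1": "ipo", "e2": "ipo", "e3": "ipo"}
ATTESTED = {
    # e1: one article carries both fields, so it is admitted.
    "e1": [{"entity": "FIRM_A", "anchor_date": "2021-02-03"}],
    # e2: the fields are split across two articles, so no single article suffices.
    "e2": [{"entity": "FIRM_B"}, {"anchor_date": "2021-04-05"}],
    # e3: no candidate article at all.
}

admitted = operational_set(EVENT_TYPES, SCHEMA_BY_TYPE, ATTESTED)
```

Event `e2` shows why the restriction is a simplification: its fields are jointly attested by the corpus, yet it is excluded because no one article attests all of them.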

These restrictions make the benchmark easier than the fully open\-ended setting: the relevant schema is known in advance, and each event is individually recoverable from a single article\. Failures on this restricted substrate therefore provide conservative evidence of architectural difficulty\. We do not claim a formal dominance relation between recall on this subclass and recall in the fully open\-ended setting, because the two settings may differ in query distribution, event distribution, and retrieval budget\.

## 4 The QO-Bench Benchmark

### 4.1 Corpus, Events, and Ground Truth

Instantiating the tractable subclass of §[3.3](https://arxiv.org/html/2606.04646#S3.SS3), QO-Bench pairs a financial-news corpus with structured corporate-event ground truth. The corpus is the NASDAQ subset of FNSPID Dong et al. ([2024](https://arxiv.org/html/2606.04646#bib.bib21)). Event candidates come from S&P Capital IQ Key Developments, a structured corporate-event stream recording event type, firms and roles, and dates, taken over 2010–2023. We align each S&P event to FNSPID articles by two filters, the event's firm ticker and an event-type-specific date window around its anchor date (Appendix [B](https://arxiv.org/html/2606.04646#A2)), producing a set of candidate articles per event. A 3-of-3 LLM judge then attests, per event, which candidates genuinely describe it (§[4.2](https://arxiv.org/html/2606.04646#S4.SS2)); an event enters the operational set $\widehat{E}$ only when an article attests its schema fields. The resulting benchmark contains 22,984 FNSPID articles and $|\widehat{E}|=614$ single-article-attestable events across eight types: M&A announcement, completion, cancellation, and rumor; CEO change; CFO change; IPO; and stock split. (The reported 22,984 counts distinct FNSPID articles; because one article can match several events, event–article links are more numerous, at 1.13 events per article on average.) Figure [1](https://arxiv.org/html/2606.04646#S4.F1) summarizes the construction pipeline.
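The two alignment filters can be illustrated with a small sketch. The actual window lengths are specified in Appendix B and are not reproduced here, so the values in `WINDOW_DAYS` below are placeholders, as are the field names, tickers, and article IDs; a symmetric window around the anchor date is also an assumption.

```python
from datetime import date, timedelta

# Placeholder window lengths; the real per-type values are in Appendix B.
WINDOW_DAYS = {"ipo": 3, "ceo_change": 7}


def candidate_articles(event, articles):
    """Return IDs of articles that pass both the ticker and date-window filters."""
    window = timedelta(days=WINDOW_DAYS[event["type"]])
    return [
        article["id"]
        for article in articles
        if article["ticker"] == event["ticker"]
        and abs(article["published"] - event["anchor_date"]) <= window
    ]


# Toy data, invented for illustration only.
EVENT = {"type": "ceo_change", "ticker": "AAA", "anchor_date": date(2020, 5, 10)}
ARTICLES = [
    {"id": "a1", "ticker": "AAA", "published": date(2020, 5, 12)},  # passes both
    {"id": "a2", "ticker": "AAA", "published": date(2020, 6, 30)},  # outside window
    {"id": "a3", "ticker": "BBB", "published": date(2020, 5, 10)},  # wrong ticker
]

candidates = candidate_articles(EVENT, ARTICLES)
```

The filters only produce candidates; whether a candidate actually describes the event is decided afterwards by the judge attestation step.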

Event-type definitions are anchored in public records, including SEC Form 8-K Items, Securities Act §5, and NYSE/NASDAQ listing rules, and are supplied verbatim to every paradigm (Appendix [D](https://arxiv.org/html/2606.04646#A4)). Ground truth is represented as typed event tuples keyed by public identifiers rather than as text spans. Table [3](https://arxiv.org/html/2606.04646#S4.T3) gives the common tuple fields. The queryable schema $S(t)$ for event type $t$ contains the fields used by templates. Provenance is retained as evidence metadata but is not itself a query predicate. This representation enables deterministic denotational evaluation while avoiding span-match ambiguity. Licensing details are in Appendix [A](https://arxiv.org/html/2606.04646#A1).

Figure 1: QO-Bench construction pipeline. S&P Capital IQ events time-window-filter the FNSPID Dong et al. ([2024](https://arxiv.org/html/2606.04646#bib.bib21)) corpus; 3-of-3 judge attestation aligns the two into the operational event set $\widehat{E}$ (614 single-article-attestable events), over which 18 templates instantiate the 785-question benchmark with deterministic gold denotations.

Table 3: Common event-tuple fields.

#### Deterministic evaluation.

For each question, the gold answer is the denotation $y^{*}(q)=\alpha(P_q)$ over the operational event set $\widehat{E}$, using the notation of §[3.1](https://arxiv.org/html/2606.04646#S3.SS1). Evaluation therefore compares the system's returned answer $f(q,C)$ (its answer for $q$ on corpus $C$) directly with $y^{*}(q)$, without an LLM-as-a-judge at scoring time: matching to the gold denotation is by exact match on canonicalized values, and answers are scored by recall. The template-specific recall metric is defined in §[5](https://arxiv.org/html/2606.04646#S5).

### 4.2 Validating the Judge Consensus

The operational set $\widehat{E}$ is constructed using unanimous 3-LLM attestation: an event–article pair is accepted only when all three judges agree that the article attests the event's schema fields. To validate this high-precision proxy, three expert annotators re-examined a stratified sample of 221 accepted event–article pairs (up to 30 per event type), each pair receiving two independent labels. Counting an accepted pair as correct only when both annotators confirm it, the 3-of-3 consensus reaches 94.1% precision, confirming that admitted pairs are reliable. Because the pool contains only accepted pairs, this is a precision-only check (recall is not estimable). Per-type precision, inter-annotator agreement, and the disagreement analysis are reported in Appendix [C](https://arxiv.org/html/2606.04646#A3).
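The acceptance rule and the strict precision count can be sketched as follows. The function names and the toy votes and labels are assumptions made for illustration; they do not reproduce the paper's 221 annotated pairs.

```python
def unanimous(votes):
    """Accept a pair only when exactly three judges voted and all said yes."""
    return len(votes) == 3 and all(votes)


def both_confirm_precision(labels):
    """Share of accepted pairs that BOTH annotators confirm.

    labels: list of (annotator_1_confirms, annotator_2_confirms) booleans,
    one entry per accepted event-article pair.
    """
    if not labels:
        raise ValueError("precision is undefined on an empty sample")
    confirmed = sum(1 for first, second in labels if first and second)
    return confirmed / len(labels)


# Toy judge votes and annotator labels, invented for illustration only.
JUDGE_VOTES = {
    "pair_1": (True, True, True),
    "pair_2": (True, False, True),   # one dissent, so rejected
    "pair_3": (True, True, True),
}
accepted = sorted(pair for pair, votes in JUDGE_VOTES.items() if unanimous(votes))

ANNOTATOR_LABELS = [(True, True), (True, False), (True, True), (True, True)]
toy_precision = both_confirm_precision(ANNOTATOR_LABELS)
```

Because only accepted pairs are sampled, this computation can estimate precision but says nothing about pairs the judges wrongly rejected.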

### 4.3 Template Taxonomy

QO-Bench contains 18 primary templates in two capability classes. Primary evaluation uses a stratified sample capped at 50 questions per template, totaling 785 questions: 200 from four Capability A templates and 585 from fourteen Capability B templates.

Capability A (filtered retrieval) questions require selection and projection under typed predicates. Examples include listing a firm's M&A announcements within a time window or returning firms with IPOs in a year. These templates test type-matched event recall, entity disambiguation, and role-aware retrieval.

Capability B (compositional operations) questions add an operator over filtered events. Examples include finding firms with both a CEO change and an M&A announcement within 30 days, identifying M&A buyers in two different years, selecting the earliest IPO in a quarter, counting firms above an event-frequency threshold, grouping events by quarter, and returning type-labeled unions. These templates test temporal joins, intersections, ordering, counting, grouping, and multi-type aggregation. Appendix [E](https://arxiv.org/html/2606.04646#A5) lists every template with its signature, sample size, example, and diagnostic target.
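As an illustration of a Capability B temporal join, the sketch below finds firms with a CEO change within 30 days of an M&A announcement. It assumes the window is symmetric (either event may come first) and that firm identifiers are already normalized; the function name and toy data are invented here and are not the benchmark's template implementation.

```python
from datetime import date


def firms_with_events_within(left, right, window_days=30):
    """Firms having one event from each list whose dates differ by <= window_days.

    left, right: iterables of (firm_id, event_date) pairs.
    The join key is the firm; the tuple-level condition is the date gap.
    """
    return {
        firm_left
        for firm_left, date_left in left
        for firm_right, date_right in right
        if firm_left == firm_right
        and abs((date_left - date_right).days) <= window_days
    }


# Toy data, invented for illustration only.
CEO_CHANGES = [
    ("FIRM_A", date(2019, 3, 1)),
    ("FIRM_B", date(2019, 3, 1)),
]
MA_ANNOUNCEMENTS = [
    ("FIRM_A", date(2019, 3, 20)),  # 19 days after the CEO change
    ("FIRM_B", date(2019, 6, 1)),   # 92 days after the CEO change
]

joined = firms_with_events_within(CEO_CHANGES, MA_ANNOUNCEMENTS)
```

The join needs both a normalized entity and an exact date for each event, which is why it is sensitive to any index that keeps the events but drops or blurs those values.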

## 5 Experiments and Analysis

We compare five deployable paradigms against a retrieval-free long-context oracle (LC-oracle) ceiling that feeds each question the 3-of-3 attested documents supporting its gold answer. Holding the answer LLM (Qwen3.6-27B) fixed, the gap from each deployable paradigm to this ceiling isolates its retrieval contribution.

### 5.1 Setup

Templates are scored by recall with a ±7-day date tolerance (absorbing news-vs-anchor date drift). Recall is computed on the *covered subset*: each gold denotation is restricted to events attested in the corpus ($\widehat{E}$), so a system is not penalized for gold events whose evidence the corpus lacks. This recall is our main metric. For leakage control, no paradigm sees question templates at index time and all use fixed decoding; in particular, IE→SQL's schema is generated from event definitions alone, frozen, and used verbatim with no manual corrections (full configurations in Appendix [F](https://arxiv.org/html/2606.04646#A6)).
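A date-tolerant recall of this kind might be computed as in the sketch below. This is not the released scorer: the greedy one-to-one matching, the `(entity, date)` answer format, and the handling of an empty gold set are all assumptions made here, and the template-specific details live in the benchmark's evaluation scripts.

```python
from datetime import date


def matches(predicted, gold, tolerance_days=7):
    """Exact match on the entity, tolerant match on the date."""
    return (
        predicted[0] == gold[0]
        and abs((predicted[1] - gold[1]).days) <= tolerance_days
    )


def tolerant_recall(predicted, gold, tolerance_days=7):
    """Fraction of gold (entity, date) items matched by a distinct prediction."""
    if not gold:
        # Assumption: an empty gold set is scored 1.0 only for an empty answer.
        return 1.0 if not predicted else 0.0
    used = set()  # indices of predictions already consumed by a gold item
    hits = 0
    for gold_item in gold:
        for index, predicted_item in enumerate(predicted):
            if index not in used and matches(predicted_item, gold_item, tolerance_days):
                used.add(index)
                hits += 1
                break
    return hits / len(gold)


# Toy answers, invented for illustration only.
GOLD = [
    ("FIRM_A", date(2020, 1, 10)),
    ("FIRM_B", date(2020, 2, 1)),
    ("FIRM_C", date(2020, 3, 1)),
]
PREDICTED = [
    ("FIRM_A", date(2020, 1, 14)),  # 4 days off: counts as a hit
    ("FIRM_B", date(2020, 2, 20)),  # 19 days off: outside the tolerance
    ("FIRM_D", date(2020, 3, 1)),   # wrong entity
]

recall = tolerant_recall(PREDICTED, GOLD)
```

Since the metric is recall, the spurious `FIRM_D` prediction does not lower the score; only missing gold items do.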

### 5.2 Main Results

Overall, the LC-oracle ceiling reaches 52.2%; among deployable paradigms, IE→SQL is strongest at 37.9%, ahead of RAG (25.2%) and ReAct RAG (23.9%), with GraphRAG global search (3.8%) and local search (0.9%) near the no-context floor (0.6%; Table [11](https://arxiv.org/html/2606.04646#A7.T11)). Table [4](https://arxiv.org/html/2606.04646#S5.T4) reports recall by operator family and capability; per-template recall is deferred to Appendix [G](https://arxiv.org/html/2606.04646#A7) (Table [9](https://arxiv.org/html/2606.04646#A7.T9)). The headline is not the ranking but its *operator-dependence*: no paradigm dominates across families, and the deployable ranking inverts from one operator to the next.

Table 4: Per-operator-family and capability recall (%), ±7-day tolerant, question-weighted (micro).

Three patterns organize the results.

(1) Even the ceiling is operator-bound. With gold evidence in context, LC-oracle is far from saturated (52.2%): 77.6% on filtering but 3.9% on intersection, so operator *execution*, not only retrieval, is a bottleneck.

(2) Filtering is broadly solved. On filter/project, RAG, ReAct RAG, and IE→SQL each recover about half the ceiling (53.6, 55.4, 50.6; GR-local is the exception, near zero).

(3) The ranking inverts on composition. IE→SQL dominates intersection (50.9 vs. RAG/ReAct 6.0/9.5) and counting/grouping (51.2 vs. ReAct 3.4): SQL executes these natively, while a generator must reconstruct them from prose. On intersection, IE→SQL even *exceeds* the LC-oracle, whose LLM cannot reliably intersect from gold chunks: the oracle bounds retrieval and generation, not operator execution, so an explicit SQL executor can legitimately surpass it.
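To show what "SQL executes these natively" means in practice, the sketch below runs an intersection and a thresholded count with Python's standard `sqlite3` module. The `events` table, its columns, and the rows are assumptions invented for illustration; they are not the frozen schema that the IE→SQL baseline generates.

```python
import sqlite3

# Toy rows, invented for illustration only.
ROWS = [
    ("ma_announcement", "FIRM_A", "buyer", "2018-03-01"),
    ("ma_announcement", "FIRM_A", "buyer", "2022-06-09"),
    ("ma_announcement", "FIRM_A", "buyer", "2022-11-30"),
    ("ma_announcement", "FIRM_B", "buyer", "2018-05-02"),
    ("ma_announcement", "FIRM_B", "target", "2022-01-15"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_type TEXT, entity TEXT, role TEXT, anchor_date TEXT)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", ROWS)

# Set intersection: buyers in 2018 AND buyers in 2022.
INTERSECTION_SQL = """
SELECT entity FROM events
WHERE event_type = 'ma_announcement' AND role = 'buyer'
  AND strftime('%Y', anchor_date) = '2018'
INTERSECT
SELECT entity FROM events
WHERE event_type = 'ma_announcement' AND role = 'buyer'
  AND strftime('%Y', anchor_date) = '2022'
ORDER BY entity
"""

# Counting with a frequency threshold: firms with at least two buyer events.
COUNT_SQL = """
SELECT entity, COUNT(*) AS n_events FROM events
WHERE event_type = 'ma_announcement' AND role = 'buyer'
GROUP BY entity
HAVING COUNT(*) >= 2
ORDER BY entity
"""

both_years = [row[0] for row in conn.execute(INTERSECTION_SQL)]
frequent_buyers = conn.execute(COUNT_SQL).fetchall()
conn.close()
```

Once the typed rows exist, both operators are a single declarative statement; the open question in the paper's setting is whether extraction fills the table completely and correctly.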

Re-scoring the LC-oracle ceiling with two further answer models (from open-weight to frontier scale and light to heavy reasoning) leaves the hardest operators just as hard (Table [11](https://arxiv.org/html/2606.04646#A7.T11)): intersection stays near 4% for all three. A stronger or more heavily reasoning answer model does not lift the ceiling: operator execution is ill-suited to free-form generation, which is an architectural mismatch rather than a capacity limitation.

### 5.3 Localizing Failure: Preservation and Execution

The framework’s two axes—index\-time*preservation*and query\-time*execution*—imply that paradigms fail in different places\.

Preservation fails in two ways. The first is *coverage*: gold-article retrieval recall (Table [10](https://arxiv.org/html/2606.04646#A7.T10)) measures whether the supporting evidence is surfaced at all, and RAG and ReAct RAG reach just 32.4% and 38.4% of gold articles, with Cap B coverage (∼27–30%) far below Cap A: for similarity retrieval the corpus-side axis is lost *before any operator runs*. The cause is not only relevance ranking but the fixed top-$k$ cutoff itself: it hard-caps the candidate set, so once a query’s gold answer spans more distinct events than the top-$k$ surfaces, set-completeness is unattainable by construction, which is why coverage collapses on the many-event Cap B aggregations.
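The hard cap can be stated as a simple bound. The sketch below assumes, for illustration only, that each retrieved chunk attests at most one gold event; under that assumption a top-$k$ retriever can surface at most $k$ of the gold events, whatever its ranking quality. The function name is hypothetical.

```python
def max_recall_under_top_k(num_gold_events: int, k: int) -> float:
    """Upper bound on set recall when at most k chunks reach the answer model.

    Assumes (for this sketch) that one chunk attests at most one gold event.
    """
    if num_gold_events <= 0:
        raise ValueError("num_gold_events must be positive")
    return min(1.0, k / num_gold_events)
```

For example, with the top-30 cutoff used by the RAG baseline, a question whose gold answer spans 60 distinct events could not exceed 50% recall under this assumption, even with perfect ranking.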

The second is *value-fidelity*: surfacing the evidence is not enough if the index does not also keep the exact values an operator consumes. GraphRAG fails here. Its index preserves entity–relation *structure* and community summaries, but the LLM-generated reports compress away the precise dates, roles, and counts: GR-local scores ≤3.5% on every value-dependent family despite organizing the whole corpus. Organization is not preservation. GR-global makes the point sharply: its map-reduce forwards essentially the entire corpus (article-level coverage ≈100% by construction; Table [10](https://arxiv.org/html/2606.04646#A7.T10)), yet still reaches only 3.8% overall, losing the same values in summarization rather than in retrieval. With the corpus-side index held fixed, the local-to-global gap (0.9→3.8%) isolates the query side: global’s fixed map-reduce surfaces more community reports than local’s entity-match neighborhood, lifting most families a few points (e.g. filtering 3.5→7.3); only cross-event temporal joins stay near 0% in both modes, since the values that would link events were never preserved. GraphRAG thus fails the *value-fidelity* facet of preservation, distinct from the coverage facet that bounds RAG.

Execution is the query-side axis: even the LC-oracle, fed the gold documents directly, scores only 3.9% on intersection; here execution, not retrieval, is the limiting factor. IE→SQL covers 64.8% of gold documents (near-uniform across Cap A/B), so its residual losses lie not on coverage but on this same axis, specifically on cross-event temporal joins. It scores 0% on the announce-to-complete cross-stage join (B.1.4), where linking an announcement to its completion needs a cross-stage linkage identity that the extractor populates on 93% of M&A announcement records but only ∼4% of completion records, capping announce-to-complete pairing at ∼4%. It scores equally 0% on the cross-type CEO-change/M&A join (B.1.2), whose join key it likewise never materializes. Decomposing IE→SQL bounds the ceiling: extraction recall tops out at 75.8% (event level, above end-to-end SQL recall), and 50% of residual misses are the extractor declining to emit a record from a thin in-article mention, which is paradigm-honest conservatism, not SQL error. This is the structural ceiling of schema-bounded extraction, and it sits on the query axis, not on coverage.
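The join-key cap can be reproduced on toy data. In the sketch below, `deal_id` stands in for the cross-stage linkage identity; the table names, column names, and rows are all hypothetical. Because SQL equality never matches a `NULL`, the join can pair only those completions whose key was actually populated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE announcements (firm TEXT, deal_id TEXT);
    CREATE TABLE completions  (firm TEXT, deal_id TEXT);
    """
)
# Announcements carry the linkage key; most completions do not (NULL).
conn.executemany(
    "INSERT INTO announcements VALUES (?, ?)",
    [("AAA", "d1"), ("BBB", "d2"), ("CCC", "d3")],
)
conn.executemany(
    "INSERT INTO completions VALUES (?, ?)",
    [("AAA", "d1"), ("BBB", None), ("CCC", None)],
)

paired_firms = [
    row[0]
    for row in conn.execute(
        """
        SELECT a.firm
        FROM announcements AS a
        JOIN completions AS c ON a.deal_id = c.deal_id
        ORDER BY a.firm
        """
    )
]
```

All three deals were extracted at both stages, yet only one pair survives the join: the loss comes from the missing key, not from missing records.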

The three deployable families thus fail at three distinct points—similarity retrieval on*coverage*, GraphRAG on*value\-fidelity*, and schemaful execution on the*query*axis—the first two both facets of index\-time preservation, exactly as the framework predicts\.

### 5.4 Implications

The results bear out the operator\-specific failure the framework implies: no paradigm both preserves operator\-relevant values across the corpus and executes typed operators over them\. Similarity retrieval optimizes relevance over completeness, summary\-based graphs discard the exact roles, stages, dates, and counts they index, and schema\-bound extraction caps what can be queried\. The design target is therefore not larger context or better generation, but broad corpus\-side coverage combined with query\-time typed execution beyond a fixed schema\.

This diagnosis suggests several architectural directions\. One is to treat plan\-and\-execute retrieval as an IR primitive: a query planner would infer the operator structure of the question, retrieve candidate events for coverage, materialize the fields needed for a query\-specific schema, and execute aggregation over the resulting temporary relation\. A second direction is schema\-with\-residual coverage, where a fixed event schema handles common operators while residual text retrieval captures attributes outside the schema\. A third is late\-binding entity and event resolution, triggered only when the query requires cross\-event joins or intersections\. Finally, retrievers should be trained not only for relevance but also for set completeness\.

More broadly, QO\-QA isolates capabilities that single\-document reading comprehension and open\-domain QA do not stress: exact filtering, attribute preservation, cross\-document composition, and aggregation\.

## 6 Conclusion

This paper argues that query-operator QA is not naive QA with more documents: a natural-language question specifies a computation over latent event records. To study this setting, we introduced QO-Bench, a diagnostic benchmark with 18 typed query templates over 22,984 financial-news articles and 785 questions.

We also proposed a two-axis framework that separates index-time preservation from query-time execution. In matched experiments, RAG, ReAct RAG, GraphRAG, and IE→SQL exhibit various operator-specific failures. The LC-oracle gap shows that many failures arise before generation, when retrieval fails to expose the records over which the answer must be computed.

QO-Bench reframes cross-document QA as *schemaful information retrieval with natural-language input and output*. We hope it supports future work on operator-preserving retrieval systems that combine broad corpus coverage with query-time typed execution beyond fixed schemas.

## Limitations

QO-Bench uses corporate events as operational ground truth. The tested operators (filtering, joining, intersection, counting, grouping, and ordering) are domain-general, but absolute results may not transfer to domains with weaker public records, less standardized event definitions, or noisier entity resolution. The benchmark also uses template-generated questions, which provide controlled operator probes but underrepresent the linguistic diversity of natural user queries. Although answers are computed deterministically from normalized event tuples, the tuples may still contain errors from source ambiguity, date conventions, event-stage boundaries, or entity normalization; we mitigate these risks through public identifiers, provenance, unanimous attestation, and human validation. Finally, each paradigm admits many implementations and scores depend on the answer LLM, so our results should be interpreted as matched-condition architecture diagnostics rather than exhaustive or model-invariant performance estimates.

## Ethics and Responsible Benchmarking

QO-Bench is built from public corporate news and public corporate-event records. It does not use non-public personal data; any executive names or firm affiliations come from public disclosures or news reports. Corporate news coverage is uneven and tends to overrepresent large publicly listed firms, English-language markets, and media-salient events. This skew should be considered when interpreting coverage or event-frequency statistics.

The benchmark is intended for retrieval-architecture evaluation, not for investment, legal, compliance, or employment decisions. System outputs should not be treated as verified financial facts without human review and source checking. To respect article-text licensing, we release article identifiers, metadata, public-identifier event tuples, questions, prompts, and evaluation scripts; redistribution of full article text remains governed by the originating publishers’ licenses. The benchmark, baselines, and evaluation code are available at [https://github.com/ZHANG-MENGAO/qo-bench](https://github.com/ZHANG-MENGAO/qo-bench) (code under MIT; benchmark questions and derived data under CC-BY-4.0).

## References

- LongBench: a bilingual, multitask benchmark for long context understanding. In ACL. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p1.1).
- Z. Chen et al. (2021) FinQA: a dataset of numerical reasoning over financial data. In EMNLP. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p2.1).
- Z. Chen et al. (2025) A question answering dataset for temporal-sensitive retrieval-augmented generation. Scientific Data. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p2.1).
- E. F. Codd (1970) A relational model of data for large shared data banks. Communications of the ACM 13(6), pp. 377–387. Cited by: [§1](https://arxiv.org/html/2606.04646#S1.p1.1).
- E. F. Codd (1972) Relational completeness of data base sublanguages. In Data Base Systems: Courant Computer Science Symposia Series 6, R. Rustin (Ed.), pp. 65–98. Cited by: [§1](https://arxiv.org/html/2606.04646#S1.p1.1).
- Z. Dong, X. Fan, and Z. Peng (2024) FNSPID: a comprehensive financial news dataset in time series. In KDD. Cited by: [Appendix A](https://arxiv.org/html/2606.04646#A1.p1.1), [Appendix B](https://arxiv.org/html/2606.04646#A2.SS0.SSS0.Px1.p1.2), [Figure 1](https://arxiv.org/html/2606.04646#S4.F1), [§4.1](https://arxiv.org/html/2606.04646#S4.SS1.p1.3).
- A. Dumitru, V. V, A. Jatowt, and A. Anand (2025) Evaluating list construction and temporal understanding capabilities of large language models. In ICTIR. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p2.1).
- D. Edge et al. (2024) From local to global: a graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130. Cited by: [Appendix F](https://arxiv.org/html/2606.04646#A6.p2.2), [§1](https://arxiv.org/html/2606.04646#S1.p3.1), [§2](https://arxiv.org/html/2606.04646#S2.p3.1).
- R. Friel, M. Belyi, and A. Sanyal (2024) RAGBench: explainable benchmark for retrieval-augmented generation systems. arXiv preprint arXiv:2407.11005. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p2.1).
- J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh (1997) Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery 1(1), pp. 29–53. Cited by: [§1](https://arxiv.org/html/2606.04646#S1.p1.1).
- X. Ho et al. (2020) Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In COLING. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p1.1).
- P. Islam et al. (2023) FinanceBench: a new benchmark for financial question answering. arXiv preprint arXiv:2311.11944. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p2.1).
- G. Izacard and E. Grave (2021) Leveraging passage retrieval with generative models for open domain question answering. In EACL. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p1.1).
- V. Karpukhin et al. (2020) Dense passage retrieval for open-domain question answering. In EMNLP. Cited by: [§1](https://arxiv.org/html/2606.04646#S1.p2.1), [§2](https://arxiv.org/html/2606.04646#S2.p1.1).
- O. Khattab and M. Zaharia (2020) ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In SIGIR. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p1.1).
- S. Lee, H. Kim, and J. Kang (2023) LIQUID: a framework for list question answering dataset generation. In AAAI. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p2.1).
- P. Lewis et al. (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS. Cited by: [§1](https://arxiv.org/html/2606.04646#S1.p2.1), [§2](https://arxiv.org/html/2606.04646#S2.p1.1).
- Q. Li et al. (2021) A survey on deep learning event extraction: approaches and applications. arXiv preprint arXiv:2107.02126. Cited by: [§1](https://arxiv.org/html/2606.04646#S1.p3.1), [§2](https://arxiv.org/html/2606.04646#S2.p3.1).
- T. Lin et al. (2025) MEBench: benchmarking large language models for cross-document multi-entity question answering. In EMNLP. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p2.1).
- O. Press et al. (2023) Measuring and narrowing the compositionality gap in language models. In Findings of EMNLP. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p1.1).
- T. Scholak et al. (2021) PICARD: parsing incrementally for constrained auto-regressive decoding from language models. In EMNLP. Cited by: [§1](https://arxiv.org/html/2606.04646#S1.p3.1), [§2](https://arxiv.org/html/2606.04646#S2.p3.1).
- U. Shaham et al. (2023) ZeroSCROLLS: a zero-shot benchmark for long text understanding. In Findings of EMNLP. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p1.1).
- H. Trivedi et al. (2022) MuSiQue: multi-hop questions via single-hop question composition. Transactions of the Association for Computational Linguistics (TACL). Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p1.1).
- H. Trivedi et al. (2023) Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In ACL. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p1.1).
- X. Wang et al. (2020) MAVEN: a massive general domain event detection dataset. In EMNLP. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p3.1).
- Z. Yang et al. (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In EMNLP. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p1.1).
- S. Yao et al. (2023) ReAct: synergizing reasoning and acting in language models. In ICLR. Cited by: [§1](https://arxiv.org/html/2606.04646#S1.p3.1), [§2](https://arxiv.org/html/2606.04646#S2.p1.1).
- T. Yu et al. (2018) Spider: a large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In EMNLP. Cited by: [§1](https://arxiv.org/html/2606.04646#S1.p3.1), [§2](https://arxiv.org/html/2606.04646#S2.p3.1).
- M. Zhang, J. Fu, T. Warrier, Y. Wang, T. Tan, and K. Huang (2025) FAITH: a framework for assessing intrinsic tabular hallucinations in finance. In Proceedings of the 6th ACM International Conference on AI in Finance (ICAIF). Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p2.1).
- A. Zhu, A. Hwang, L. Dugan, and C. Callison-Burch (2024) FanOutQA: a multi-hop, multi-document question answering benchmark for large language models. In ACL. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p2.1).
- F. Zhu et al. (2021) TAT-QA: a question answering benchmark on a hybrid of tabular and textual content in finance. In ACL. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p2.1).
- H. Zhu et al. (2026) Aggregation queries over unstructured text: benchmark and agentic method. arXiv preprint arXiv:2602.01355. Cited by: [§2](https://arxiv.org/html/2606.04646#S2.p2.1).

## Appendix A Data Sources, Licensing, and Public-Identifier Release

QO-Bench draws news articles from FNSPID (Dong et al., [2024](https://arxiv.org/html/2606.04646#bib.bib21)), a public dataset of NASDAQ financial news, and corporate events from S&P Capital IQ Key Developments, a proprietary structured-event feed accessed under license.

#### Disclaimer\.

The S&P Capital IQ Key Developments feed is proprietary and may not be redistributed. The released benchmark therefore contains no S&P content: neither raw event records, vendor record identifiers (e.g., `keyDevId`, internal firm IDs), nor S&P headlines or situation summaries. Each ground-truth event is published instead as a *public-identifier tuple*: the participating firm keyed by a public identifier (stock ticker, SEC CIK, or LEI), the event type drawn from our eight public-record-anchored definitions, the anchor date, role, and counterparty, together with provenance article IDs into FNSPID. These tuples are recoverable from public records and carry no proprietary text, so the benchmark is vendor-independent and freely redistributable while remaining a faithful denotational gold standard. Reconstructing the original vendor records requires a separate S&P Capital IQ license.

## Appendix B Corpus Construction

#### FNSPID subset\.

We use the NASDAQ subset of FNSPID (Dong et al., [2024](https://arxiv.org/html/2606.04646#bib.bib21)): 2,953,850 URL-deduplicated articles pre-tagged with 8,553 distinct stock symbols. We apply no explicit date filter; although FNSPID is nominally dated 1999–2023, its NASDAQ coverage before 2010 is sparse, so in practice the corpus spans 2010–2023, matching the period of the S&P events. Date selection within this span happens per event below.

#### Article matching\.

Each S&P event is matched to candidate FNSPID articles by exact ticker (FNSPID articles carry a stock symbol; no alias or fuzzy normalization is applied), restricted to a per-event-type asymmetric date window around the anchor date (Table [5](https://arxiv.org/html/2606.04646#A2.T5)). Articles with empty bodies are dropped, and duplicates are removed by URL, keeping the occurrence closest to the anchor. The anchor is the legal/announcement date for each type; because news typically trails legal completion by several weeks, windows are widened after the anchor rather than symmetric.
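The matching rules above can be sketched as a small filter. The dictionary keys (`ticker`, `anchor`, `date`, `body`, `url`) and the function name are assumptions for this example, and the window sizes are passed in as parameters because the per-type values belong to Table 5.

```python
from datetime import date, timedelta


def match_articles(event, articles, days_before, days_after):
    """Candidate articles for one event: exact ticker, asymmetric date window,
    non-empty body, one article per URL (the one closest to the anchor date).
    """
    window_start = event["anchor"] - timedelta(days=days_before)
    window_end = event["anchor"] + timedelta(days=days_after)
    closest = {}  # url -> (gap_in_days, article)
    for article in articles:
        if article["ticker"] != event["ticker"]:
            continue  # exact ticker match only, no alias or fuzzy matching
        if not article["body"].strip():
            continue  # drop empty bodies
        if not (window_start <= article["date"] <= window_end):
            continue  # outside the per-type window
        gap = abs((article["date"] - event["anchor"]).days)
        kept = closest.get(article["url"])
        if kept is None or gap < kept[0]:
            closest[article["url"]] = (gap, article)
    return [article for _, article in closest.values()]
```

Passing a larger `days_after` than `days_before` reproduces the asymmetry the text describes.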

#### Attestation funnel\.

Three judges (Gemma-4-31B-IT, Qwen3.6-27B, and gpt-oss-120B) independently label each (event, article) pair, and a pair is attested only on unanimous 3-of-3 confirmation. From the 16,414 S&P events across the eight types in 2010–2023, we select a 1,376-event subset to bound judging cost, producing 25,888 candidate (event, article) pairs; 1,591 pairs (6.1%) are attested 3-of-3, yielding $|\widehat{E}| = 614$ distinct single-article-attestable events.
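A minimal sketch of the unanimity rule, assuming judge votes arrive as a mapping from `(event_id, article_id)` to three booleans; the data layout and function names are hypothetical.

```python
def attested_pairs(votes_by_pair):
    """Pairs confirmed by all three judges (unanimous 3-of-3)."""
    return {
        pair
        for pair, votes in votes_by_pair.items()
        if len(votes) == 3 and all(votes)
    }


def attested_events(votes_by_pair):
    """Distinct events with at least one unanimously attested article."""
    return {event_id for event_id, _article_id in attested_pairs(votes_by_pair)}
```

An event enters the operational set as soon as one of its candidate articles passes, which is why the number of attested events is smaller than the number of attested pairs.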

Table 5: Per-event-type date windows around the S&P anchor date used to collect candidate FNSPID articles, in days before/after the anchor. Windows are asymmetric because news typically trails the legal/announcement date.

## Appendix C Judge-Consensus Validation

The operational set $\widehat{E}$ relies on a unanimous 3-of-3 LLM-judge consensus. To validate it, three expert annotators (using the same event definitions as the judges) labeled a stratified sample of 221 accepted (event, article) pairs (up to 30 per type); every pair received two independent labels.

#### Precision\.

Because the pool contains only accepted pairs, the check is precision-only (recall is not estimable without a separately drawn negative pool). We count an accepted pair as correct only when both of its annotators confirm it; any single rejection, including a split vote, counts as incorrect, which is a conservative criterion. Under it the 3-of-3 consensus reaches 94.1% precision (208/221). Precision is at least 90% on seven of the eight event types; only M&A completions are lower, at 73.3%, where an article often references the closing only with vague timing.
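The conservative criterion amounts to a logical AND over the two annotator labels. In the sketch below the function name and input layout are assumptions; the text reports only the totals (208 of 221), so how the 13 unconfirmed pairs divide between split votes and double rejections is a placeholder here.

```python
def conservative_precision(label_pairs):
    """Share of accepted pairs that both annotators confirm.

    label_pairs: list of (annotator_1_confirms, annotator_2_confirms) booleans.
    """
    confirmed = sum(1 for first, second in label_pairs if first and second)
    return confirmed / len(label_pairs)


# 208 doubly confirmed pairs out of 221; the make-up of the other 13 is a placeholder.
sample = [(True, True)] * 208 + [(True, False)] * 13
precision = conservative_precision(sample)
```

Counting split votes as errors can only lower the estimate, so the reported precision is a floor relative to a majority-style rule.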

#### Inter\-annotator agreement\.

Agreement is computed on the 209 pairs co-labeled by one annotator pair (the remaining 12, co-labeled by a different pair, are unanimously positive and leave $\kappa$ undefined). On these pairs raw agreement is 96.2% and Cohen’s $\kappa = 0.538$. The moderate $\kappa$ despite high raw agreement is a prevalence effect: the accepted-only pool is dominated by positive labels, which inflates chance agreement and deflates $\kappa$; it does not indicate low reliability.
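The prevalence effect is easy to see numerically. The sketch below computes Cohen's $\kappa$ from a 2×2 agreement table and compares two tables with identical raw agreement. Both tables are illustrative counts chosen for this example; they are not the annotators' actual confusion matrix, which the text does not report.

```python
def cohen_kappa(both_pos, only_a, only_b, both_neg):
    """Return (raw agreement, Cohen's kappa) for two annotators and binary labels.

    Raises ZeroDivisionError when expected agreement is 1, i.e. when kappa
    is undefined (for example, both annotators label every pair positive).
    """
    n = both_pos + only_a + only_b + both_neg
    observed = (both_pos + both_neg) / n
    a_pos = (both_pos + only_a) / n  # annotator A's positive rate
    b_pos = (both_pos + only_b) / n  # annotator B's positive rate
    expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    return observed, (observed - expected) / (1 - expected)


# Same raw agreement (201 of 209), very different label prevalence.
skewed = cohen_kappa(both_pos=196, only_a=4, only_b=4, both_neg=5)
balanced = cohen_kappa(both_pos=100, only_a=4, only_b=4, both_neg=101)
```

With positives dominating, chance agreement is already above 0.9, so the same eight disagreements pull $\kappa$ down to the moderate range, while the balanced table stays above 0.9.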

#### Disagreement\.

Boundary errors concentrate on whether an article truly *attests* the event rather than merely *mentions* it: (i) M&A completions referenced only with vague timing inside earnings-call transcripts or analyst recaps; (ii) IPO articles dated before the offering, using forward-looking language (“plans to raise,” “set to price”); and (iii) tangential mentions of CEO/CFO transitions in earnings-call acknowledgements. These cases affect boundary decisions in $\widehat{E}$; once an event tuple is admitted, gold answers are computed deterministically from normalized tuple fields.

## Appendix D Event Definitions

Each event type is defined by reference to a public-record disclosure class (Table [6](https://arxiv.org/html/2606.04646#A4.T6)), keeping the ontology vendor-independent and identical across paradigms. This document is supplied verbatim to every paradigm as shared task supervision; full operational scope and edge cases accompany the released code.

Table 6: The eight event types, each anchored to a public-record disclosure class (SEC 8-K items, the Securities Act, or exchange listing rules). The anchor date is the canonical date per type; *role* applies to M&A (rumor records only the target) and is null otherwise.

## Appendix E Template Catalog

Tables [7](https://arxiv.org/html/2606.04646#A5.T7) and [8](https://arxiv.org/html/2606.04646#A5.T8) list all 18 templates with their signatures, output types, per-template sample sizes, an example question, and the operator each diagnoses.

Table 7: Capability A (filtered retrieval) templates. $N$ is the per-template count in the stratified evaluation sample (at most 50 per template).

Table 8: Capability B (compositional operations) templates. $N$ is the per-template count in the stratified evaluation sample (at most 50 per template).

## Appendix F Paradigm Configurations

All deployable paradigms share the corpus, the event\-definitions document, and the same answer LLM—Qwen3\.6\-27B \(vLLM, reasoning mode enabled\), decoding held fixed; only the retrieval architecture differs\.

RAG retrieves with a hybrid stack (dense retrieval with Qwen3-Embedding-4B plus BM25, then a reranker, Qwen3-Reranker-4B), feeding the top-30 chunks to the answer LLM. ReAct RAG uses the LangChain `create_react_agent` recipe with light prompt adaptation: the agent issues up to five retrieval calls over the same hybrid stack, accumulating evidence across rounds before answering. GraphRAG (Edge et al., [2024](https://arxiv.org/html/2606.04646#bib.bib11)) extracts entities and relations, clusters them into Leiden communities, and produces LLM-generated community reports, queried in two modes: *local* (entity match → neighborhood → related-community fetch) and *global* (fixed map-reduce over community reports). We run the `graphrag` package with unmodified source and generous (not minimal) index settings; the only prompt change, shared with all paradigms, prepends the event-definitions document. Local and global share the corpus-side index but differ on the query side, isolating the query-side contribution.

IE→SQL proceeds in three stages: (i) the database schema is *generated by an LLM (GPT-5.5, $T=0$) from the event definitions alone*, never exposed to question templates, and frozen verbatim; a pre-specified escape hatch for documented manual corrections was not invoked (zero corrections); (ii) a separate LLM (Qwen3.6-27B, $T=0$, reasoning mode) extracts event tuples from each article into the frozen schema with provenance; (iii) per-template SQL skeletons translate questions to SQL executed against the events database, with no LLM at query time. Stage (i) is the schema-leakage control: IE→SQL is evaluated as an intrinsic test of schemaful retrieval, not as a hand-tuned upper bound.

LC-oracle bypasses retrieval and feeds each question its gold-supporting chunks linked via provenance. It is a ceiling, not a deployable paradigm.

Full prompts, chunking parameters, embedding model, retrieval $k$, hop budgets, GraphRAG version and indexing settings, the IE→SQL schema-generation prompt and raw output, extraction prompts, and SQL skeletons accompany the released code.

## Appendix G Additional Results

#### Per\-template recall\.

Table [9](https://arxiv.org/html/2606.04646#A7.T9) reports per-template recall for all paradigms, the resolution behind the operator-family aggregate of Table [4](https://arxiv.org/html/2606.04646#S5.T4).

Table 9: Per-template recall (%) under ±7-day date tolerance (no provenance); 785-question stratified sample, Qwen3.6-27B answer model.

#### Gold-article retrieval recall.

Table [10](https://arxiv.org/html/2606.04646#A7.T10) measures whether the gold-attesting evidence reaches a paradigm at all, decoupled from whether the operator is then executed: for each question we compute the fraction of its gold-attesting articles whose content the paradigm surfaces into its answering context (for IE→SQL, whose event the extraction stage recovers), averaged by template. The LC-oracle is 100% by construction. For IE→SQL, the extraction stage recovers roughly two-thirds of gold articles, well above its end-task recall (Table [9](https://arxiv.org/html/2606.04646#A7.T9)), localizing the dominant loss to the query/composition layer rather than to coverage.
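The coverage measure described above can be sketched as follows. The field names (`template`, `gold_articles`, `surfaced`) are hypothetical, and "averaged by template" is read here as the mean of per-question fractions within each template; the released scripts may aggregate differently.

```python
from collections import defaultdict


def gold_article_coverage(questions):
    """Per-template mean of the fraction of gold-attesting articles surfaced.

    questions: iterable of dicts with a template id, the set of gold article
    ids, and the set of article ids that reached the answering context.
    """
    fractions = defaultdict(list)
    for question in questions:
        gold = question["gold_articles"]
        if not gold:
            continue  # no gold articles, so coverage is not defined
        surfaced_gold = gold & question["surfaced"]
        fractions[question["template"]].append(len(surfaced_gold) / len(gold))
    return {
        template: sum(values) / len(values)
        for template, values in fractions.items()
    }
```

Because this measure ignores the final answer, a paradigm can score high here and still fail the end task, which is the gap the text attributes to the query/composition layer.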

Table 10: Gold-article retrieval recall (%) under ±7-day matching: per-question coverage of gold-attesting articles; capability columns are question-weighted (micro) means over GT-applicable templates. For IE→SQL this is extraction-stage coverage; for the retrieval paradigms it is the fraction of gold articles surfaced into the answer context. LC-oracle is 100% by construction (fed gold chunks). GR-local gold-article retrieval recall is not well-defined: its entity-graph neighborhood answers from community summaries that expose no clean mapping back to source articles, so per-article gold coverage cannot be computed (N/A). † GR-global retrieval is non-discriminative: it forwards essentially the entire corpus (all community reports) to the LLM map step, so article-level retrieval recall is ≈100% by construction; its information loss occurs in community-report summarization, not in retrieved-set membership.

#### Answer-model robustness and the no-context floor.

Table [11](https://arxiv.org/html/2606.04646#A7.T11) reports the per-operator profile of the LC-oracle ceiling across three answer models (Qwen3.6-27B; DeepSeek v4-flash; v4-pro Think Max) alongside the no-context floor (Qwen, parametric only, zero articles). The cross-model invariance of the ceiling is discussed in §[5](https://arxiv.org/html/2606.04646#S5); beyond it, the floor isolates retrieval’s per-family contribution. No-context recovers near-zero recall on every family: retrieval and context account for nearly all of the ceiling, from 77.6% on filtering down to its lower operator-bound limits.

Table 11: Answer-model robustness and the no-context floor (%, ±7-day tolerant, question-weighted micro). The LC-oracle ceiling is re-scored with three answer models: Qwen3.6-27B (open-weight, ∼5K reasoning tokens/question), DeepSeek v4-flash (no reasoning), and v4-pro (Think Max, ∼24K reasoning tokens/question). The hardest operators are model-invariant (intersection collapses to ∼4% for all three even with gold articles supplied), so the ceiling is operator-bound, not capacity-bound; models differ mainly on Cap B (temporal joins). The No-ctx column (Qwen, parametric only, zero articles) is the floor: retrieval contributes nearly all recall on every family.