Beyond Semantic Similarity: A Two-Phase Non-Parametric Retrieval Workflow for Corporate Credit Underwriting

arXiv cs.CL Papers

Summary

Proposes a two-phase non-parametric retrieval workflow for corporate credit underwriting that separates high-recall retrieval from utility ranking, using on-premise open-source models for compliance. The system addresses the similarity-utility gap in standard RAG pipelines for financial document analysis.

arXiv:2605.20684v1 Announce Type: new Abstract: Corporate credit underwriting requires analysts to extract actionable evidence from long, heterogeneous financial documents spanning hundreds of pages and multiple languages. Standard Retrieval-Augmented Generation (RAG) pipelines optimize for semantic similarity, which frequently surfaces passages that are topically related but lack decision utility, a problem we term the similarity-utility gap. We propose a two-phase non-parametric retrieval architecture that separates high-recall candidate retrieval from high-precision utility ranking. The first phase combines lexical and dense multilingual retrieval to construct a broad candidate pool. The second phase applies an adaptive retrieval controller that filters candidates using query intent and document structure signals, followed by an LLM-as-a-Judge utility scoring mechanism that ranks passages by analytical usefulness rather than semantic proximity. A context-aware extraction module preserves structural fidelity across narrative text and complex financial tables. The system is deployed entirely on-premise to satisfy enterprise data governance requirements. Evaluated on a multilingual corpus of proprietary financial documents with analyst-curated relevance labels, the system significantly outperforms naive retrieval baselines. In production deployment across more than 800 credit analysts, document review time was reduced from several hours to approximately three minutes, demonstrating the practical value of utility-aware RAG architectures for document-intensive decision-support workflows.

# Beyond Semantic Similarity: A Two-Phase Non-Parametric Retrieval Workflow for Corporate Credit Underwriting
Source: [https://arxiv.org/html/2605.20684](https://arxiv.org/html/2605.20684)
Linus Ng Junjia \(OCBC, presenting\), Ezekiel Tee Kongquan \(OCBC; Georgia Institute of Technology\), Kelvin Heng \(OCBC\), Kenneth Zhu Ke \(OCBC\), Zhao Jing Yuan \(OCBC\)

Affiliations: OCBC, Singapore; Georgia Institute of Technology

Correspondence: linus\.ng@ocbc\.com, ezekieltee@ocbc\.com, kelvinheng@ocbc\.com, kennethzhu@ocbc\.com, jingyuanzhao@ocbc\.com

## 1 Introduction

Corporate credit underwriting relies heavily on the analysis of long\-form financial documents such as annual reports and industry reports\[[5](https://arxiv.org/html/2605.20684#bib.bib2)\]\. Analysts must extract relevant financial indicators, assess risk disclosures, and synthesize insights from documents that can span hundreds of pages and multiple languages\[[5](https://arxiv.org/html/2605.20684#bib.bib2)\]\.

Retrieval\-Augmented Generation \(RAG\) systems have emerged as a promising approach to assist document\-intensive workflows\[[3](https://arxiv.org/html/2605.20684#bib.bib3),[2](https://arxiv.org/html/2605.20684#bib.bib4)\]\. By retrieving supporting passages from external corpora, RAG systems can improve factual grounding and reduce hallucination in language model outputs\[[3](https://arxiv.org/html/2605.20684#bib.bib3),[2](https://arxiv.org/html/2605.20684#bib.bib4)\]\. However, standard RAG pipelines typically prioritize semantic similarity between queries and document passages\[[3](https://arxiv.org/html/2605.20684#bib.bib3)\]\. In financial analysis tasks, this objective often fails to align with the needs of analysts\[[5](https://arxiv.org/html/2605.20684#bib.bib2)\]\.

Financial documents often contain narrative descriptions, regulatory disclosures, and repetitive boilerplate language\[[5](https://arxiv.org/html/2605.20684#bib.bib2)\]\. As a result, similarity\-based retrieval systems may surface passages that share terminology with the query but lack actionable analytical value\[[6](https://arxiv.org/html/2605.20684#bib.bib1)\]\. We refer to this issue as the similarity–utility gap\[[6](https://arxiv.org/html/2605.20684#bib.bib1)\]\.

To address this challenge, we propose a retrieval architecture designed specifically for enterprise decision\-support workflows\[[6](https://arxiv.org/html/2605.20684#bib.bib1)\]\. Our system introduces a two\-phase pipeline that separates high\-recall retrieval from high\-precision utility ranking\[[6](https://arxiv.org/html/2605.20684#bib.bib1)\]\. The architecture incorporates hybrid lexical\-semantic retrieval, adaptive candidate refinement, and a utility\-grounded ranking framework in which a language model evaluates passages according to their analytical usefulness\[[5](https://arxiv.org/html/2605.20684#bib.bib2),[6](https://arxiv.org/html/2605.20684#bib.bib1)\]\.

The system is designed for deployment in regulated financial environments and operates entirely on\-premise using self\-hosted open\-source models\. This ensures compliance with strict data governance policies while maintaining high retrieval performance\.

Our main contributions are:

- A utility\-grounded retrieval framework that aligns passage ranking with decision usefulness in credit underwriting tasks\.
- An adaptive retrieval controller that filters candidate passages using query intent and document structure signals\.
- A context\-aware extraction module that preserves structural information in narrative and tabular financial content\.
- An enterprise deployment demonstrating large productivity gains for analysts\.

## 2 Related Work

Retrieval\-Augmented Generation has become a widely adopted approach for grounding large language models in external knowledge sources\[[3](https://arxiv.org/html/2605.20684#bib.bib3),[2](https://arxiv.org/html/2605.20684#bib.bib4)\]\. Early work introduced dense retrieval methods using neural embeddings to retrieve semantically relevant passages from large corpora\[[3](https://arxiv.org/html/2605.20684#bib.bib3)\]\. Hybrid retrieval approaches combining lexical search with dense embeddings have further improved recall in heterogeneous document collections\[[4](https://arxiv.org/html/2605.20684#bib.bib5)\]\.

Recent research has explored the use of language models as evaluators for ranking retrieved content\[[7](https://arxiv.org/html/2605.20684#bib.bib6)\]\. These LLM\-as\-a\-Judge approaches leverage the reasoning capabilities of language models to assess the quality and relevance of candidate passages\[[7](https://arxiv.org/html/2605.20684#bib.bib6)\]\. Such methods have been applied in question answering, information retrieval evaluation, and ranking tasks\.

In enterprise settings, RAG systems must also address constraints related to data governance, privacy, and auditability\. Financial and legal institutions often require on\-premise deployment and traceable source attribution, which introduces additional design considerations beyond model performance\.

Our work contributes to this literature by proposing a retrieval architecture that explicitly optimizes for analytical utility in enterprise workflows\. By combining hybrid retrieval, adaptive candidate control, and utility\-based ranking, the system prioritizes passages that contain actionable financial evidence rather than merely semantically similar text\.

## 3 Methodology

### 3\.1 Problem Setting

Corporate credit underwriting requires analysts to justify reported financial statements derived from long, heterogeneous financial documents such as annual reports and industry reports\. These documents often contain dense narrative sections, multilingual commentary, and unstructured financial tables\. Traditional RAG pipelines retrieve passages based primarily on semantic similarity, which frequently results in content that is topically related but not useful for decision\-making\.

We formalize the task as retrieving and ranking a set of document segments that maximize decision utility rather than semantic similarity\. Given a user query $q$ and a corpus of long financial documents $D = \{d_1, d_2, \ldots, d_n\}$, the objective is to identify a set of passages $P^{*} \subset D$ that contain verifiable evidence relevant to credit underwriting decisions\.

Our system introduces a two\-phase retrieval and re\-ranking architecture designed to bridge the gap between semantic similarity and decision utility while operating entirely within an on\-premise environment\.

### 3\.2 System Overview

The proposed architecture consists of five main components: document ingestion, hybrid candidate retrieval, adaptive retrieval controller, utility\-grounded re\-ranking, and context\-aware evidence extraction\.

The overall pipeline is illustrated conceptually as:

$$q_{\text{statement}} \rightarrow R_{\text{hybrid}}(D) \rightarrow C_{\text{adaptive}} \rightarrow J_{\text{utility}} \rightarrow E_{\text{context}}$$

where $q_{\text{statement}}$ is a query supplemented with its corresponding financial statement, $R_{\text{hybrid}}$ retrieves an initial candidate pool, $C_{\text{adaptive}}$ filters candidates using query\-aware reasoning, $J_{\text{utility}}$ ranks candidates by decision usefulness, and $E_{\text{context}}$ extracts the final evidence\.
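The chain above can be read as plain function composition. The following is a minimal sketch of that reading only: `run_pipeline` and the four `toy_*` stages are hypothetical names, and the toy stages are simple rules standing in for the retrievers, the controller model, the judge model, and the extraction module, none of which are specified at code level in the paper.

```python
from typing import Callable, List

Passage = str


def run_pipeline(
    statement_query: str,
    retrieve: Callable[[str], List[Passage]],
    control: Callable[[str, List[Passage]], List[Passage]],
    judge: Callable[[str, List[Passage]], List[Passage]],
    extract: Callable[[List[Passage]], List[str]],
) -> List[str]:
    """Chain the four query-time stages in the order given by the pipeline formula."""
    c0 = retrieve(statement_query)      # R_hybrid(D): high-recall candidate pool
    c1 = control(statement_query, c0)   # C_adaptive: relevance/support filtering
    j1 = judge(statement_query, c1)     # J_utility: utility-based selection
    return extract(j1)                  # E_context: final evidence


# Toy stages (assumptions for illustration, not the deployed components).
def toy_retrieve(query: str) -> List[Passage]:
    return [
        "Net debt rose 12% year on year.",
        "This report contains forward-looking statements.",
        "Revenue by segment is shown in Note 4.",
    ]


def toy_control(query: str, passages: List[Passage]) -> List[Passage]:
    return [p for p in passages if "forward-looking" not in p]


def toy_judge(query: str, passages: List[Passage]) -> List[Passage]:
    return [p for p in passages if any(ch.isdigit() for ch in p)]


def toy_extract(passages: List[Passage]) -> List[str]:
    return [p.strip() for p in passages]


evidence = run_pipeline(
    "net debt trend", toy_retrieve, toy_control, toy_judge, toy_extract
)
```

Passing each stage in as a callable keeps the stages independently replaceable, which matches the separation of recall-oriented and precision-oriented phases described in this section.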

### 3\.3 Document Ingestion

Corporate financial documents are segmented into structured sections using document layout cues\. Each segment is indexed with metadata that include the source of the document, the title of the section, and the page references\. This preprocessing step ensures that downstream components can leverage structural information during retrieval and extraction\.
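As an illustration of the metadata described above, the sketch below splits page texts into segments and attaches source, section title, and page range to each one. It assumes, purely for the example, that the layout cue is a line starting with `#`; the paper does not say which layout cues are used, and `Segment` and `segment_by_headings` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Segment:
    source: str              # document the segment came from
    section_title: str       # title of the enclosing section
    pages: Tuple[int, int]   # first and last page covered (1-based)
    text: str


def _close(source: str, current: list) -> Segment:
    title, first_page, last_page, lines = current
    return Segment(source, title, (first_page, last_page), "\n".join(lines).strip())


def segment_by_headings(source: str, pages: List[str]) -> List[Segment]:
    """Start a new segment at every heading line; text before the first heading is skipped."""
    segments: List[Segment] = []
    current: Optional[list] = None  # [title, first_page, last_page, lines]
    for page_no, page_text in enumerate(pages, start=1):
        for line in page_text.splitlines():
            if line.startswith("#"):
                if current is not None:
                    segments.append(_close(source, current))
                current = [line.lstrip("# ").strip(), page_no, page_no, []]
            elif current is not None:
                current[2] = page_no      # segment now extends onto this page
                current[3].append(line)
    if current is not None:
        segments.append(_close(source, current))
    return segments


pages = [
    "# Risk Factors\nLiquidity risk is elevated.",
    "It depends on refinancing.\n# Outlook\nDemand is stable.",
]
segments = segment_by_headings("annual_report.pdf", pages)
```

Carrying the page range on every segment is what later allows an extracted passage or table to be cited back to its location in the source document.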

### 3\.4 Phase 1: Hybrid Candidate Retrieval

The first phase performs broad retrieval to maximize recall across multilingual and heterogeneous financial documents\.

Given a query $q$, we retrieve an initial candidate set $C_0$ using a hybrid retrieval strategy that combines keyword retrieval and dense semantic retrieval using multilingual embeddings:

$$C_0 = \text{TopK}_{\text{kw}}(q, D) \cup \text{TopK}_{\text{embed}}(q, D)$$

In hybrid retrieval, keyword retrieval preserves precision for financial terminology, while semantic retrieval captures paraphrased or contextual references\.

The choice of the TopK value represents a critical hyperparameter at this stage: setting it too low risks restricting the candidate pool passed to the subsequent phases, while setting it too high diminishes the value of initial retrieval\. With a TopK value set to 50, the result is a high\-recall candidate pool of passages for further processing\.
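The union of the two top-K lists can be sketched as follows. This is a toy illustration under stated assumptions: term overlap stands in for the lexical retriever, hand-written two-dimensional vectors stand in for multilingual embeddings, and `k` is set to 1 so that the effect of the union is visible on three passages (the paper uses a TopK value of 50). All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores; ties keep their original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def keyword_score(query: str, passage: str) -> float:
    """Toy lexical score: number of query terms that also occur in the passage."""
    q_terms = Counter(query.lower().split())
    p_terms = Counter(passage.lower().split())
    return float(sum(min(q_terms[t], p_terms[t]) for t in q_terms))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_candidates(
    query: str,
    passages: List[str],
    query_vec: Sequence[float],
    passage_vecs: List[Sequence[float]],
    k: int,
) -> List[int]:
    """C_0 = TopK_kw(q, D) united with TopK_embed(q, D), as passage indices."""
    kw = top_k([keyword_score(query, p) for p in passages], k)
    emb = top_k([cosine(query_vec, v) for v in passage_vecs], k)
    seen, c0 = set(), []
    for idx in kw + emb:          # union: each passage appears at most once
        if idx not in seen:
            seen.add(idx)
            c0.append(idx)
    return c0


passages = [
    "total borrowings increased to 4.2 billion",
    "the group's indebtedness grew during the year",
    "our values guide everything we do",
]
# Hand-written stand-ins for embedding vectors (assumption for illustration).
c0 = hybrid_candidates(
    "total borrowings",
    passages,
    query_vec=[1.0, 0.0],
    passage_vecs=[[0.6, 0.8], [0.9, 0.1], [0.0, 1.0]],
    k=1,
)
```

In this toy case the lexical side contributes the passage that repeats the query terms, and the dense side contributes the paraphrase about indebtedness, which is the complementary behaviour the paragraph above describes.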

### 3\.5 Phase 2: Adaptive Retrieval Controller & Utility\-Grounded Re\-ranking

Not all retrieved passages are useful for credit analysis\. Financial reports contain boilerplate disclosures, legal notes, and narrative sections that may be semantically related to a query, but irrelevant to underwriting decisions\.

To address this issue, we introduce an adaptive retrieval controller that evaluates candidate passages using query intent and document structure\.

Given candidate passages $C_0 = \{p_1, \ldots, p_k\}$, the controller predicts relevance and support for a given passage:

$$Rel_i = f(q_{\text{statement}}, p_i, m_i)$$

$$S_i = f(q_{\text{statement}}, p_i, m_i)$$

$$U_i = f(q_{\text{statement}}, p_i, m_i)$$

where:

- $q_{\text{statement}}$ is a query supplemented with its corresponding financial statement
- $p_i$ is the candidate passage
- $m_i$ represents structural metadata
- $Rel_i$ represents relevancy \(boolean\) of the candidate passage
- $S_i$ represents evidence support \(boolean\) of the candidate passage
- $U_i$ represents the utility score \(numerical score\) of the candidate passage

The controller is implemented using a lightweight language model that evaluates whether a passage is likely to contain relevant and supportive information\.

This stage produces a refined candidate set:

$$C_1 = \{p_i \in C_0 \mid (S_i) \cdot [Rel_i]\}$$

where $S_i \cdot [Rel_i]$ represents the logical condition that $S_i$ is effectively nullified unless $Rel_i$ succeeds\.

This mechanism is conceptually related to adaptive retrieval strategies that condition retrieval on model reasoning rather than fixed pipelines\[[1](https://arxiv.org/html/2605.20684#bib.bib8)\]\.
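The gating condition that defines the refined set $C_1$ can be sketched as a filter in which a passage survives only when both flags are true. In this sketch a rule-based classifier stands in for the lightweight language model; the rules, the section names, and all identifiers are assumptions made for the example rather than details from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Candidate:
    text: str                  # p_i
    metadata: Dict[str, str]   # m_i, e.g. the section title


@dataclass(frozen=True)
class Verdict:
    relevant: bool    # Rel_i
    supported: bool   # S_i


def refine_candidates(
    statement_query: str,
    candidates: List[Candidate],
    classify: Callable[[str, Candidate], Verdict],
) -> List[Candidate]:
    """Return C_1: candidates whose support flag survives the relevance gate."""
    c1 = []
    for cand in candidates:
        verdict = classify(statement_query, cand)
        # S_i * [Rel_i]: support only counts when relevance also holds.
        if verdict.relevant and verdict.supported:
            c1.append(cand)
    return c1


# Hypothetical rules standing in for the lightweight language model.
BOILERPLATE_SECTIONS = {"Legal Notice", "Forward-Looking Statements"}


def rule_based_classifier(statement_query: str, cand: Candidate) -> Verdict:
    relevant = cand.metadata.get("section") not in BOILERPLATE_SECTIONS
    supported = any(ch.isdigit() for ch in cand.text)
    return Verdict(relevant, supported)


c0 = [
    Candidate("Total borrowings rose to 4.2 billion.",
              {"section": "Notes to Accounts"}),
    Candidate("Actual results may differ materially from 2025 targets.",
              {"section": "Forward-Looking Statements"}),
    Candidate("Management remains committed to prudent leverage.",
              {"section": "Chairman's Letter"}),
]
c1 = refine_candidates("total borrowings; balance sheet", c0, rule_based_classifier)
```

The second candidate contains a figure but sits in a boilerplate section, and the third is relevant but carries no evidence, so only the first reaches the utility-ranking stage.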

The remaining candidates are ranked according to decision utility using an LLM\-as\-a\-Judge framework\. Rather than measuring similarity to the query, the judge evaluates each passage based on its usefulness for underwriting decisions\. Given a passage $p_i$, the judge produces a utility score:

$$J_1 = \{p_i \in C_1 \mid U_i \geq U_{\text{threshold}}\}$$

where $U_i$ is the utility score and $U_{\text{threshold}}$ is the tunable utility score threshold\. $U_{\text{threshold}}$ serves as a critical hyperparameter in the architecture, balancing precision and recall in the final evidence set\. A lower $U_{\text{threshold}}$ increases recall by admitting a larger set of passages, which is advantageous for exploratory queries or scenarios where analysts require broad and comprehensive coverage\. Conversely, a higher $U_{\text{threshold}}$ prioritizes precision, returning only passages with the highest decision utility, which is ideal for targeted queries where concise evidence is needed\.
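The effect of the threshold can be seen in a small sketch. The scores below are hand-assigned placeholders for judge outputs, and the 0 to 1 scale is an assumption made for the example; the paper does not state the scale of $U_i$. `select_by_utility` is a hypothetical helper name.

```python
from typing import List, Tuple

ScoredPassage = Tuple[str, float]  # (passage text, utility score U_i)


def select_by_utility(
    scored: List[ScoredPassage], u_threshold: float
) -> List[ScoredPassage]:
    """Return J_1: passages with U_i >= threshold, highest utility first."""
    kept = [(passage, u) for passage, u in scored if u >= u_threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Placeholder judge scores on an assumed 0-1 scale.
scored_c1 = [
    ("Segment revenue breakdown", 0.6),
    ("Debt maturity schedule by year", 0.9),
    ("General industry commentary", 0.3),
]

broad = select_by_utility(scored_c1, u_threshold=0.25)     # exploratory query
targeted = select_by_utility(scored_c1, u_threshold=0.8)   # targeted query
```

With the lower threshold all three passages are returned in descending utility order, while the higher threshold keeps only the debt maturity schedule, mirroring the recall and precision trade-off described above.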

The utility\-grounded re\-ranking mechanism enables the system to prioritize passages containing financial indicators and industry signals\.

### 3\.6 Context\-Aware Evidence Extraction

Financial documents frequently contain complex tables, footnotes, and structured subsections\. Simple chunk extraction can distort meaning, break structural relationships, or cause loss of attribution\. We therefore introduce a context\-aware extraction module that dynamically selects the appropriate extraction strategy based on document structure\.

Two extraction modes are employed depending on the structure of the source content\. For narrative sections, relevant text spans are extracted using markdown\-aware segmentation, which preserves structural elements such as section headers, bullet lists, and paragraph boundaries\.

When information appears in tables or structured financial statements, the system distinguishes between complex and non\-complex tables\. Non\-complex tables \(tables with single\-level headers and regular grid structures\) are parsed to extract the relevant rows or cells\. In contrast, complex tables containing multi\-level headers, hierarchical indices, merged cells, or irregular layouts are preserved along with source metadata, including the document name and page reference, to support manual verification and accurate source attribution\.

Two extraction modes are used:

- Localized Passage Extraction: For narrative sections, relevant text spans are extracted using markdown\-aware segmentation\. This preserves structural markers such as:
  - section headers
  - bullet lists
  - paragraph boundaries
- High\-Fidelity Table Citation: When information resides within tables or structured financial statements, the system classifies tables as either complex tables or non\-complex tables:
  - Complex tables: Multi\-level headers, hierarchical row indices, irregular structures, or merged cells that require specialized handling
  - Non\-complex tables: Single\-level headers and row indices, well\-structured grids with regular formatting

For non\-complex tables, the system performs structured table parsing to extract the relevant rows or cells\. For complex tables, rather than attempting to parse the structure, the system preserves the full table context together with source metadata such as document name and page reference\. This metadata is appended as supplementary information to support manual verification and referencing\.

This approach ensures that extracted financial metrics remain interpretable and can be reliably traced back to their original document context, while accommodating the diversity and structural complexity of tables across different reports\.
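The routing between the two table modes can be sketched as below. The complexity test here (more than one header row, merged cells, or rows of unequal length) is an assumed simplification of the criteria listed above, and the `Table` fields, function names, and output dictionary shape are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Table:
    header_rows: int           # number of header levels
    has_merged_cells: bool
    rows: List[List[str]]      # raw grid, header row(s) first
    document: str
    page: int


def is_complex(table: Table) -> bool:
    """Assumed heuristic: multi-level headers, merged cells, or a ragged grid."""
    ragged = len({len(row) for row in table.rows}) > 1
    return table.header_rows > 1 or table.has_merged_cells or ragged


def extract_table(table: Table, row_label: str) -> Dict[str, Any]:
    source = {"document": table.document, "page": table.page}
    if is_complex(table):
        # Complex table: keep the full grid and cite it instead of parsing it.
        return {"mode": "citation", "table": table.rows, **source}
    header, body = table.rows[0], table.rows[1:]
    matches = [dict(zip(header, row)) for row in body if row and row[0] == row_label]
    return {"mode": "parsed", "rows": matches, **source}


simple = Table(
    header_rows=1,
    has_merged_cells=False,
    rows=[["Item", "FY2024", "FY2025"],
          ["Revenue", "100", "120"],
          ["Net income", "10", "14"]],
    document="annual_report.pdf",
    page=57,
)
merged = Table(
    header_rows=2,
    has_merged_cells=True,
    rows=[["", "Group", ""], ["Item", "FY2024", "FY2025"], ["Revenue", "100", "120"]],
    document="annual_report.pdf",
    page=58,
)

parsed = extract_table(simple, "Revenue")
cited = extract_table(merged, "Revenue")
```

Both branches return the document name and page reference, so a parsed cell and a cited table can each be traced back to the source in the same way.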

## 4 System Architecture

The proposed system consists of five main components: document ingestion, hybrid retrieval, adaptive candidate control, utility\-grounded re\-ranking, and context\-aware extraction\.

Figure [1](https://arxiv.org/html/2605.20684#S4.F1) illustrates the overall pipeline\.

![Refer to caption](https://arxiv.org/html/2605.20684v1/pics/architecture.png)

Figure 1: Utility\-grounded retrieval architecture for long\-document financial analysis\.

To satisfy regulatory and data governance constraints in corporate credit underwriting, the proposed system is deployed entirely within an on\-premise environment, with all components executed under enterprise\-controlled infrastructure\. This ensures that sensitive financial data remains local across all stages of the pipeline\.

The deployment architecture is structurally aligned with the proposed formulation defined in Section 3\.2 and is realized through a collection of self\-hosted modules that implement each stage without reliance on external services\.

- Retrieval Infrastructure: Hybrid retrieval is implemented via the combination of lexical and dense retrieval mechanisms over locally indexed document collections\. Keyword\-based retrieval operates on structured indices, while dense retrieval is enabled through multilingual embedding models deployed on\-premise\. Document representations are stored and queried through an internal vector database, facilitating efficient construction and propagation of the candidate set $C_0$\.
- Agentic Reasoning Modules: The adaptive retrieval controller and utility\-grounded re\-ranking stages are instantiated using self\-hosted language models\. Lightweight language models are utilized for relevance and support classification to enable efficient pruning of candidates, while higher\-capacity models perform utility scoring under the LLM\-as\-a\-Judge framework and support final response generation\. This separation reflects the staged evaluation functions defined in Section 3\.5, while maintaining computational tractability within an on\-premise setting\.
- Context\-Aware Extraction: The evidence extraction stage operates directly on the filtered and ranked candidate set $J_1$, applying structure\-aware processing to heterogeneous document segments\. Narrative text and tabular content are handled through distinct extraction strategies executed locally, preserving structural fidelity and source attribution without external transformation\.
- Data Flow and Persistence: System components are coordinated through internal messaging pipelines that support scalable ingestion and processing of financial documents\. Intermediate artifacts \(candidate passages, filtering decisions, utility scores, and associated metadata\) are persistently maintained, enabling reproducibility, traceability, and auditability of all outputs\.

This design ensures that all decision\-relevant computations remain verifiable, controllable, and compliant with enterprise governance requirements, and that sensitive financial data never leaves the enterprise infrastructure\.
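One way to picture the persisted intermediate artifacts is an append-only log with one record per pipeline stage. The sketch below is an assumption-laden illustration: the paper does not describe the storage format, so JSON lines, the record fields, and the `log_stage` and `replay` helpers are hypothetical, and an in-memory buffer stands in for the persistent store.

```python
import io
import json
from typing import Any, Dict, List, TextIO


def log_stage(sink: TextIO, query_id: str, stage: str, payload: Dict[str, Any]) -> None:
    """Append one JSON line describing what a stage produced for a query."""
    record = {"query_id": query_id, "stage": stage, "payload": payload}
    sink.write(json.dumps(record, sort_keys=True) + "\n")


def replay(log_text: str, query_id: str) -> List[Dict[str, Any]]:
    """Recover the ordered trail of stage records for a single query."""
    records = [json.loads(line) for line in log_text.splitlines() if line]
    return [r for r in records if r["query_id"] == query_id]


audit = io.StringIO()  # stand-in for a persistent, on-premise store
log_stage(audit, "q-001", "hybrid_retrieval", {"candidate_ids": [3, 17, 42]})
log_stage(audit, "q-001", "controller", {"kept_ids": [3, 42]})
log_stage(audit, "q-001", "judge", {"scores": {"3": 0.9, "42": 0.4}})
log_stage(audit, "q-002", "hybrid_retrieval", {"candidate_ids": [8]})

trail = replay(audit.getvalue(), "q-001")
```

Replaying the trail for a query shows which candidates were retrieved, which were pruned, and how the survivors were scored, which is the kind of traceability the data flow description calls for.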

## 5 Results

We evaluated the system on a multilingual corpus of financial documents with relevance labels curated by credit analysts\. Compared to traditional naive retrieval systems, the proposed approach significantly improves retrieval performance\.

Due to the proprietary nature of the financial documents used in this study, the evaluation was conducted in a restricted enterprise environment\. The dataset consists of internal corporate credit documents including financial reports, industry analyses, and related underwriting materials used in production workflows\. These documents contain confidential financial information and cannot be publicly released\. As a result, the annotated dataset and detailed case studies used for evaluation are not included in the public version of this work\.

To ensure meaningful evaluation despite these constraints, financial statements and additional contexts were curated by senior credit analysts based on real underwriting tasks\. Queries were constructed to reflect analytical questions commonly encountered in credit assessment workflows, such as identifying components in balance sheets, income statements, and industry outlook signals\.

In real\-world deployment across more than 800 analysts, the system reduced document review time from several hours to approximately three minutes\.

## 6 Summary

The results highlight the limitations of similarity\-based retrieval for analytical tasks\. By incorporating utility signals into the ranking process, the proposed system surfaces passages containing actionable financial evidence\.

Adaptive candidate filtering further improves computational efficiency by reducing noise prior to the utility\-based ranking stage\.

## 7 Conclusion & Future Work

We present a utility\-grounded retrieval architecture for corporate credit underwriting workflows involving long financial documents\. The proposed system combines hybrid retrieval, adaptive candidate filtering, and LLM\-based utility ranking to prioritize decision\-relevant evidence\.

Evaluation results demonstrate significant improvements in retrieval accuracy and analyst productivity\. The system has been successfully deployed in a large enterprise environment, highlighting the potential of utility\-aware RAG architectures for document\-intensive decision\-support applications\.

Future work will explore enhanced structured data extraction from financial tables\. The current system preserves traceability when detecting and retrieving tabular content, including document location and structural context\. This capability opens the possibility of integrating specialized Optical Character Recognition \(OCR\) and Vision Language Model \(VLM\) pipelines to extract structured numerical information directly from table regions\.

By combining table detection that leverages OCR extraction together with VLMs, the system could capture additional financial metrics that are often embedded within complex document layouts\. This structured information could then be incorporated into downstream retrieval and reasoning processes, enabling richer analytical queries that combine narrative explanations with precise financial figures\.

## Acknowledgments

This work was conducted at OCBC AI Lab\. The authors would like to express their sincere gratitude to Kok Ker Ern Kovan, Lee Sheng Kiat, and Germaine Goh Yanshan for their valuable contributions and support throughout the project\. Their insights and domain expertise were instrumental in the successful outcomes presented in this paper\. The authors would also like to specifically thank Ren Xuezhe, Lisa Tan, Weiyang Song, Qishuai Zhong, and Kenneth Loh Zhen Xiang for their significant contributions to the successful delivery and shipment of the product\. Their dedication and collaboration greatly enhanced the impact and practical application of this work\.

## References

- \[1\] A\. Asai, Z\. Wu, Y\. Wang, A\. Sil, and H\. Hajishirzi \(2023\) Self\-RAG: learning to retrieve, generate, and critique through self\-reflection\. External Links: 2310\.11511, [Link](https://arxiv.org/abs/2310.11511) Cited by: [§3\.5](https://arxiv.org/html/2605.20684#S3.SS5.p9.1)\.
- \[2\] Y\. Gao, Y\. Xiong, X\. Gao, K\. Jia, J\. Pan, Y\. Bi, Y\. Dai, J\. Sun, M\. Wang, and H\. Wang \(2024\) Retrieval\-augmented generation for large language models: a survey\. External Links: 2312\.10997, [Link](https://arxiv.org/abs/2312.10997) Cited by: [§1](https://arxiv.org/html/2605.20684#S1.p2.1), [§2](https://arxiv.org/html/2605.20684#S2.p1.1)\.
- \[3\] P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, et al\. \(2020\) Retrieval\-augmented generation for knowledge\-intensive NLP tasks\. In Advances in Neural Information Processing Systems \(NeurIPS\), External Links: [Link](https://arxiv.org/abs/2005.11401) Cited by: [§1](https://arxiv.org/html/2605.20684#S1.p2.1), [§2](https://arxiv.org/html/2605.20684#S2.p1.1)\.
- \[4\] K\. Santhanam, O\. Khattab, J\. Saad\-Falcon, C\. Potts, and M\. Zaharia \(2022\) ColBERTv2: effective and efficient retrieval via lightweight late interaction\. External Links: 2112\.01488, [Link](https://arxiv.org/abs/2112.01488) Cited by: [§2](https://arxiv.org/html/2605.20684#S2.p1.1)\.
- \[5\] B\. Sarmah, B\. Hall, R\. Rao, S\. Patel, S\. Pasquali, and D\. Mehta \(2024\) HybridRAG: integrating knowledge graphs and vector retrieval augmented generation for efficient information extraction\. External Links: 2408\.04948, [Link](https://arxiv.org/abs/2408.04948) Cited by: [§1](https://arxiv.org/html/2605.20684#S1.p1.1), [§1](https://arxiv.org/html/2605.20684#S1.p2.1), [§1](https://arxiv.org/html/2605.20684#S1.p3.1), [§1](https://arxiv.org/html/2605.20684#S1.p4.1)\.
- \[6\] H\. Zhang, M\. Tang, K\. Bi, J\. Guo, S\. Liu, D\. Shi, D\. Yin, and X\. Cheng \(2025\) Utility\-focused LLM annotation for retrieval and retrieval\-augmented generation\. External Links: 2504\.05220, [Link](https://arxiv.org/abs/2504.05220) Cited by: [§1](https://arxiv.org/html/2605.20684#S1.p3.1), [§1](https://arxiv.org/html/2605.20684#S1.p4.1)\.
- \[7\] L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\) Judging LLM\-as\-a\-judge with MT\-Bench and Chatbot Arena\. External Links: 2306\.05685, [Link](https://arxiv.org/abs/2306.05685) Cited by: [§2](https://arxiv.org/html/2605.20684#S2.p2.1)\.
