FAB-Bench: A Framework for Adaptive RAG Benchmarking in Semiconductor Manufacturing

arXiv cs.CL Papers

Summary

FAB-Bench is a benchmark framework for evaluating Retrieval-Augmented Generation (RAG) systems in semiconductor manufacturing, with six diagnostic metrics and analysis across context windows. It provides 200 curated query-answer pairs and reveals context-scaling behaviors and attention dilution issues.

arXiv:2605.26476v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) has become critical for knowledge-intensive applications, yet evaluating its performance in vertical domains remains difficult due to domain complexity, diverse context scales, and heavy reliance on expert assessments that are costly, inconsistent, and non-scalable. We introduce FAB-Bench, an end-to-end framework for adaptive benchmarking of RAG systems in semiconductor manufacturing. FAB-Bench defines six diagnostic metrics measuring factual accuracy, contextual utilization, completeness, retrieval relevance, technical depth, and reasoning consistency. The framework couples retriever diagnostics with generator-level reasoning analysis across context windows of 4K-32K tokens, quantifying how retrieval precision and generative fidelity co-evolve as contextual scope expands. From over 1,300 generated candidates, we curated a high-quality benchmark of 200 query-answer pairs spanning three synthesis strategies: needle-in-haystack, intra-document multi-topic, and cross-document multi-hop. Systematic evaluation across four LLMs and four RAG frameworks reveals three distinct context-scaling behaviors: logarithmic growth, early saturation, and cold-start dynamics, and identifies attention dilution as the primary mechanism behind performance degradation at extreme context lengths. Cross-framework validation on three additional production RAG systems confirms evaluation portability.

Cached at: 05/27/26, 09:06 AM

# FAB-Bench: A Framework for Adaptive RAG Benchmarking in Semiconductor Manufacturing
Source: [https://arxiv.org/html/2605.26476](https://arxiv.org/html/2605.26476)
Jingbin Qian, Congwen Yi, Min Xia, Wen Wu, Jun Zhu, Jian Guan\* (FutureFab.AI; \*andrewg@futurefab.ai)

###### Abstract

Retrieval-Augmented Generation (RAG) has become critical for knowledge-intensive applications, yet evaluating its performance in vertical domains remains difficult due to domain complexity, diverse context scales, and heavy reliance on expert assessments that are costly, inconsistent, and non-scalable. We introduce FAB-Bench, an end-to-end framework for adaptive benchmarking of RAG systems in semiconductor manufacturing. FAB-Bench defines six diagnostic metrics measuring factual accuracy, contextual utilization, completeness, retrieval relevance, technical depth, and reasoning consistency. The framework couples retriever diagnostics with generator-level reasoning analysis across context windows of 4K–32K tokens, quantifying how retrieval precision and generative fidelity co-evolve as contextual scope expands. From over 1,300 generated candidates, we curated a high-quality benchmark of 200 query–answer pairs spanning three synthesis strategies: needle-in-haystack, intra-document multi-topic, and cross-document multi-hop. Systematic evaluation across four LLMs and four RAG frameworks reveals three distinct context-scaling behaviors (logarithmic growth, early saturation, and cold-start dynamics) and identifies attention dilution as the primary mechanism behind performance degradation at extreme context lengths. Cross-framework validation on three additional production RAG systems confirms evaluation portability.

Benchmark dataset available at: [https://github.com/FuturefabAI/FAB-Bench](https://github.com/FuturefabAI/FAB-Bench)

*Keywords:* RAG Evaluation · Vertical Domain Benchmark · LLM-as-Judge · Context Window Scaling · Semiconductor Manufacturing

## 1 Introduction

Large language models (LLMs) have demonstrated remarkable capabilities in various tasks (Achiam et al., [2023](https://arxiv.org/html/2605.26476#bib.bib1); Brown et al., [2020](https://arxiv.org/html/2605.26476#bib.bib2)), motivating the development of diverse evaluation benchmarks. Early benchmarks such as GLUE (Wang et al., [2018](https://arxiv.org/html/2605.26476#bib.bib4)) and SuperGLUE (Wang et al., [2019](https://arxiv.org/html/2605.26476#bib.bib5)) focused on natural language understanding tasks, including sentiment analysis, textual understanding, and question answering. MMLU (Hendrycks et al., [2021a](https://arxiv.org/html/2605.26476#bib.bib3)) expands assessment to broad knowledge coverage across 57 subjects using 15,908 multiple-choice questions spanning elementary to professional levels, with an emphasis on zero-shot and few-shot evaluation of pre-trained knowledge. However, rapid benchmark saturation has significantly reduced its discriminative power: while GPT-3 achieved only 43.9% accuracy (Brown et al., [2020](https://arxiv.org/html/2605.26476#bib.bib2)), GPT-4 exceeded 86% (OpenAI, [2023](https://arxiv.org/html/2605.26476#bib.bib42)). To improve linguistic coverage, C-Eval (Huang et al., [2023](https://arxiv.org/html/2605.26476#bib.bib6)) extends this paradigm to Chinese with 13,948 questions across 52 subjects. Despite their value, these benchmarks share fundamental limitations: reliance on static public knowledge and the absence of retrieval contexts, making them poorly suited for evaluating RAG systems in vertical domains.

Beyond generic language benchmarks, more task- or capability-specific evaluations have been proposed. ARC (Clark et al., [2018](https://arxiv.org/html/2605.26476#bib.bib7)) evaluates scientific reasoning using grade-school-level questions; TruthfulQA (Lin et al., [2022](https://arxiv.org/html/2605.26476#bib.bib8)) measures factual reliability; GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.26476#bib.bib9)) and MATH (Hendrycks et al., [2021b](https://arxiv.org/html/2605.26476#bib.bib10)) assess mathematical reasoning from grade-school to competition level; and HumanEval (Chen et al., [2021](https://arxiv.org/html/2605.26476#bib.bib11)) evaluates code generation through 164 programming tasks.

Domain-specific benchmarks further address specialized requirements. In healthcare, MedQA (Jin et al., [2021](https://arxiv.org/html/2605.26476#bib.bib12)), MedMCQA (Pal et al., [2022](https://arxiv.org/html/2605.26476#bib.bib13)), and MultiMedQA (Singhal et al., [2023a](https://arxiv.org/html/2605.26476#bib.bib14)) evaluate medical reasoning under safety-critical constraints, with Med-PaLM 2 achieving 85% accuracy on USMLE-style questions (Singhal et al., [2023b](https://arxiv.org/html/2605.26476#bib.bib15)). Legal benchmarks such as LegalBench (Guha et al., [2023](https://arxiv.org/html/2605.26476#bib.bib17)) and LawBench (Fei et al., [2023](https://arxiv.org/html/2605.26476#bib.bib18)) assess regulatory reasoning across jurisdictions, while financial benchmarks including FinanceBench (Islam et al., [2023](https://arxiv.org/html/2605.26476#bib.bib19)), FinBen (Xie et al., [2024](https://arxiv.org/html/2605.26476#bib.bib20)), and BloombergGPT (Wu et al., [2023](https://arxiv.org/html/2605.26476#bib.bib21)) evaluate financial analysis and decision-making. In semiconductor design, ChipNeMo (Liu et al., [2023a](https://arxiv.org/html/2605.26476#bib.bib22)) adapts LLMs for chip design but focuses on model training rather than RAG evaluation. Although these domain benchmarks improve specialization, they largely rely on public datasets and expert curation, and provide limited visibility into how effectively RAG systems retrieve, integrate, and reason over proprietary, multi-document corpora.

RAG has become the dominant paradigm for deploying LLMs in knowledge-intensive applications, particularly in enterprise and industrial settings (Lewis et al., [2020](https://arxiv.org/html/2605.26476#bib.bib23)), where models must reason and answer questions over proprietary documents not seen during training. Despite the progress of domain-specific benchmarks, many remain poorly suited for enterprise RAG use cases: they typically evaluate over public knowledge sources (e.g., medical licensing exams, legal case law, published financial reports) rather than the proprietary documentation enterprises actually deploy (Chen et al., [2024b](https://arxiv.org/html/2605.26476#bib.bib25)). Moreover, heavy dependence on manual expert curation fundamentally limits scalability (Zheng et al., [2024](https://arxiv.org/html/2605.26476#bib.bib33)). Existing evaluation metrics are often accuracy-based, offering little diagnostic insight into whether failures arise from incomplete retrieval, faulty reasoning, or inadequate multi-document synthesis. They also lack mechanisms for systematic deployment optimization: they cannot evaluate RAG effectiveness on proprietary corpora, quantify knowledge augmentation value, or guide configuration decisions such as context window allocation. Thus, these benchmarks emphasize knowledge recall or task-specific reasoning while providing limited assessment of workflow-level integration or nuanced judgment in real-world settings (Budler et al., [2025](https://arxiv.org/html/2605.26476#bib.bib16); Gao et al., [2023](https://arxiv.org/html/2605.26476#bib.bib24)).
While recent work has begun to address RAG-specific evaluation, important gaps remain: RGB (Chen et al., [2024a](https://arxiv.org/html/2605.26476#bib.bib28)) and RECALL (Liu et al., [2023c](https://arxiv.org/html/2605.26476#bib.bib29)) focus on general-domain QA rather than specialized knowledge, ARES (Saad-Falcon et al., [2024](https://arxiv.org/html/2605.26476#bib.bib27)) requires substantial human calibration, and RAGAS (Es et al., [2024](https://arxiv.org/html/2605.26476#bib.bib26)) lacks domain-specific customization. As a result, enterprises still lack quantitative guidance for deployment-critical decisions such as model selection and context window allocation, relying instead on ad-hoc qualitative feedback.

### 1.1 Contributions

In this work, we introduce FAB-Bench, an end-to-end evaluation methodology for vertical-domain RAG under realistic enterprise reasoning conditions. Our main contributions are summarized as follows:

- Methodology for evaluating vertical-domain RAG via cross-document synthesis under adaptive benchmark generation. We formulate vertical-domain RAG evaluation as evidence-based synthesis over long and heterogeneous private corpora, and design benchmarks that require explicit multi-document integration, including needle-in-haystack grounding, intra-document multi-topic reasoning, and cross-document multi-hop composition. To improve robustness and coverage of synthesized queries, we employ an adaptive generation mechanism that adjusts the sampling temperature in response to quality and consistency signals to obtain diverse and stable benchmark instances.
- A diagnostic measurement protocol that attributes failures across retrieval and generation. We introduce a six-dimensional evaluation rubric (Completeness, Factuality, Context Utilization, Technical Depth, Relevance, and Support Quality) that separates missing evidence, irrelevant retrieval, shallow synthesis, and unsupported generation, enabling fine-grained localization of performance bottlenecks.
- An empirical characterization of context-window scaling regimes for vertical RAG. By measuring performance from 4K to 32K tokens across four LLMs, we identify three distinct scaling behaviors and characterize attention dilution through metric-level decomposition, offering actionable guidance for configuration decisions.

## 2 Related Work

### 2.1 LLM Evaluation Benchmarks

General-purpose benchmarks have evolved from task-specific assessments like GLUE (Wang et al., [2018](https://arxiv.org/html/2605.26476#bib.bib4)) and SuperGLUE (Wang et al., [2019](https://arxiv.org/html/2605.26476#bib.bib5)) to broad knowledge evaluations. MMLU (Hendrycks et al., [2021a](https://arxiv.org/html/2605.26476#bib.bib3)) covers 57 subjects with 15,908 questions, though rapid saturation (GPT-3: 43.9% (Brown et al., [2020](https://arxiv.org/html/2605.26476#bib.bib2)) to GPT-4: 86%+ (OpenAI, [2023](https://arxiv.org/html/2605.26476#bib.bib42))) reduces its discriminative power. C-Eval (Huang et al., [2023](https://arxiv.org/html/2605.26476#bib.bib6)) extends coverage to Chinese. Capability-specific benchmarks target scientific reasoning (ARC (Clark et al., [2018](https://arxiv.org/html/2605.26476#bib.bib7))), factual reliability (TruthfulQA (Lin et al., [2022](https://arxiv.org/html/2605.26476#bib.bib8))), mathematical reasoning (GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.26476#bib.bib9)), MATH (Hendrycks et al., [2021b](https://arxiv.org/html/2605.26476#bib.bib10))), and code generation (HumanEval (Chen et al., [2021](https://arxiv.org/html/2605.26476#bib.bib11))). These benchmarks share a fundamental limitation: reliance on static, public knowledge without retrieval contexts, making them poorly suited for RAG evaluation.

### 2.2 Domain-Specific Evaluation

Domain benchmarks address specialized requirements but inherit similar limitations. In healthcare, MedQA (Jin et al., [2021](https://arxiv.org/html/2605.26476#bib.bib12)), MedMCQA (Pal et al., [2022](https://arxiv.org/html/2605.26476#bib.bib13)), and MultiMedQA (Singhal et al., [2023a](https://arxiv.org/html/2605.26476#bib.bib14)) evaluate medical reasoning, with Med-PaLM 2 reaching 85% on USMLE-style questions (Singhal et al., [2023b](https://arxiv.org/html/2605.26476#bib.bib15)). Legal benchmarks (LegalBench (Guha et al., [2023](https://arxiv.org/html/2605.26476#bib.bib17)), LawBench (Fei et al., [2023](https://arxiv.org/html/2605.26476#bib.bib18))) assess regulatory reasoning, while financial benchmarks (FinanceBench (Islam et al., [2023](https://arxiv.org/html/2605.26476#bib.bib19)), FinBen (Xie et al., [2024](https://arxiv.org/html/2605.26476#bib.bib20)), BloombergGPT (Wu et al., [2023](https://arxiv.org/html/2605.26476#bib.bib21))) evaluate financial analysis. In semiconductor design, ChipNeMo (Liu et al., [2023a](https://arxiv.org/html/2605.26476#bib.bib22)) adapts LLMs for chip design but focuses on model training rather than RAG evaluation. These benchmarks primarily evaluate parametric knowledge over public datasets, providing limited visibility into how RAG systems retrieve, integrate, and reason over proprietary multi-document corpora (Budler et al., [2025](https://arxiv.org/html/2605.26476#bib.bib16)).

### 2.3 RAG Evaluation Frameworks

RAG-specific evaluation has received increasing attention. RAGAS (Es et al., [2024](https://arxiv.org/html/2605.26476#bib.bib26)) provides multi-dimensional metrics (faithfulness, relevance, context precision/recall) and supports test generation from user-provided corpora, but does not address context-window scaling or domain-specific metric customization. ARES (Saad-Falcon et al., [2024](https://arxiv.org/html/2605.26476#bib.bib27)) automates RAG evaluation through prediction-powered inference with multi-dimensional scoring, but requires $\sim$150 human-annotated samples for calibration. RGB (Chen et al., [2024a](https://arxiv.org/html/2605.26476#bib.bib28)) evaluates four RAG robustness abilities including information integration across documents, but uses a fixed general-domain dataset. RECALL (Liu et al., [2023c](https://arxiv.org/html/2605.26476#bib.bib29)) evaluates robustness against counterfactual knowledge. CRAG (Yang et al., [2024](https://arxiv.org/html/2605.26476#bib.bib30)) provides a comprehensive benchmark with multi-hop and aggregation questions requiring cross-document synthesis, but operates on a fixed dataset without vertical-domain customization. MultiHop-RAG (Tang and Yi, [2024](https://arxiv.org/html/2605.26476#bib.bib31)) specifically targets multi-hop reasoning with evidence distributed across 2–4 documents, but is limited to a fixed English news corpus. SCARF (Rengo et al., [2025](https://arxiv.org/html/2605.26476#bib.bib32)) proposes a system-level assessment framework but does not include benchmark generation.

FAB-Bench complements these efforts by addressing two gaps that none of the above frameworks cover simultaneously: (1) systematic context-window scaling analysis that characterizes how RAG performance evolves from 4K to 32K tokens, and (2) domain-specific evaluation with a structured knowledge base (431 semiconductor terms across 7 weighted categories) that enables precision-aware benchmark generation and domain-grounded scoring.

### 2.4 LLM-as-Judge Methodology

Using LLMs as evaluation judges has become widespread following MT-Bench (Zheng et al., [2024](https://arxiv.org/html/2605.26476#bib.bib33)), which demonstrated strong correlation between LLM judgments and human preferences. G-Eval (Liu et al., [2023b](https://arxiv.org/html/2605.26476#bib.bib34)) formalizes this through chain-of-thought prompting with probability-weighted scoring. However, LLM judges exhibit known biases including position bias, verbosity bias, and self-enhancement bias (Zheng et al., [2024](https://arxiv.org/html/2605.26476#bib.bib33); Wang et al., [2023](https://arxiv.org/html/2605.26476#bib.bib35)). Recent work on calibrating LLM judges (Liu et al., [2024b](https://arxiv.org/html/2605.26476#bib.bib36)) suggests that structured rubrics with explicit scoring criteria mitigate these biases. Our approach addresses reliability through: (1) structured rubrics with separate objective/subjective variants per metric; (2) chain-of-thought reasoning via G-Eval; and (3) empirical validation of metric independence (Section [5.5](https://arxiv.org/html/2605.26476#S5.SS5)).

## 3 FAB-Bench Framework

![Refer to caption](https://arxiv.org/html/2605.26476v1/x1.png)

Figure 1: FAB-Bench system overview. The framework comprises two components: an adaptive benchmark generation system that produces domain-specific QA pairs from proprietary corpora, and a multi-dimensional evaluation platform that scores RAG system responses across six diagnostic metrics.

Evaluating RAG systems in vertical domains requires benchmarks satisfying four criteria: authenticity (questions reflecting real-world complexity), contamination resistance (avoiding train-test overlap), diagnostic granularity (isolating specific failure modes), and discriminative power (meaningfully differentiating models). FAB-Bench addresses these through two design principles: evaluation mirrors deployment by generating questions from user-provided corpora, and multi-dimensional assessment enables precise failure attribution (Figure [1](https://arxiv.org/html/2605.26476#S3.F1)).

### 3.1 Knowledge Corpus and Domain Encoding

Our benchmark draws from three source types: academic literature comprising 150\+ peer\-reviewed papers from IEDM, ISSCC, and VLSI symposia; patent documents with 70\+ filings containing proprietary fabrication details; and industry standards from SEMI specifications\. The corpus totals approximately 347 million tokens across 188 topics\.

To enable domain-aware processing, we constructed a hierarchical knowledge base $\mathcal{K}$ encoding 431 technical terms organized into seven semantic categories (Table [1](https://arxiv.org/html/2605.26476#S3.T1)), each with precision weights $w_i$ reflecting technical rigor requirements. The knowledge base serves three functions: (1) computing technical density $\rho(d)$ for adaptive generation control, (2) classifying document precision levels for temperature scheduling, and (3) weighting domain terminology in similarity computations.

Table 1: Hierarchical knowledge base: seven semantic categories with term counts and precision weights. Higher weights indicate categories requiring greater quantitative precision.

Questions span two balanced formats. Objective questions (50%) assess factual accuracy through mathematical calculations, fill-in-blank, true/false, and multiple choice items. Subjective questions (50%) evaluate reasoning through mechanism explanations, causal reasoning, comparative analysis, and problem diagnosis. Both formats require context-dependent answers that cannot be resolved from parametric knowledge alone.

### 3.2 Cross-Document Context Synthesis

A defining RAG capability is cross\-document synthesis—integrating information across multiple sources\. Single\-document benchmarks cannot distinguish retrieval\-based reasoning from parametric knowledge reliance\. We design three synthesis strategies targeting complementary objectives\.

The corpus is segmented into chunks $C=\bigcup_{i=1}^{m}\text{Chunk}(D_i)$ via a sliding window (512 tokens, 128-token overlap). TF-IDF vectors $\vec{v}(c)$ enable semantic comparison between chunks.
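The chunking and vectorization step can be illustrated with a minimal pure-Python sketch. The window parameters (512 tokens, 128-token overlap) come from the text; the tokenization, the smoothed IDF formula, and the function names `sliding_window_chunks`, `tfidf_vectors`, and `cosine` are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def sliding_window_chunks(tokens, size=512, overlap=128):
    """Split a token list into windows of `size` tokens overlapping by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

def tfidf_vectors(chunks):
    """Return one sparse TF-IDF vector (dict: term -> weight) per chunk.

    Uses a smoothed IDF, log((1 + n) / (1 + df)) + 1; this variant is an
    assumption, since the paper does not state which TF-IDF form it uses.
    """
    n = len(chunks)
    df = Counter(term for chunk in chunks for term in set(chunk))
    vectors = []
    for chunk in chunks:
        tf = Counter(chunk)
        vectors.append({
            term: (count / len(chunk)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

With these defaults, consecutive windows start 384 tokens apart, so each chunk shares its first 128 tokens with the end of the previous one.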

#### Strategy 1: Needle\-in\-Haystack\.

Critical information is embedded within topically dissimilar distractors. Given a target chunk $c_{\text{target}}$, we select distractor chunks $\{c_d\}$ that minimize cosine similarity: $c_d=\arg\min_{c\in C}\cos(\vec{v}(c_{\text{target}}),\vec{v}(c))$. This tests precise fact location amid irrelevant content.
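A small sketch of the distractor selection, assuming pairwise cosine similarities between chunks are already available as a matrix `sim`. The function name `select_distractors` and the number of distractors `k` are hypothetical; the text specifies only the arg-min criterion.

```python
def select_distractors(target, sim, k=3):
    """Return indices of the k chunks least similar to the target chunk.

    sim[i][j] is the cosine similarity between chunks i and j.
    k is an illustrative parameter, not a value taken from the paper.
    """
    candidates = [i for i in range(len(sim)) if i != target]
    candidates.sort(key=lambda i: sim[target][i])  # ascending similarity
    return candidates[:k]
```

Taking the k lowest-similarity chunks generalizes the single arg-min in the formula to a set of distractors.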

#### Strategy 2: Intra\-Document Multi\-Topic\.

Chunks from different topic clusters within the same document are combined, requiring integration of dispersed information. We cluster chunks within each document using TF-IDF cosine similarity and select chunks from at least two distinct clusters ($\cos(\vec{v}(c_i),\vec{v}(c_j))<0.3$ for cluster separation).

#### Strategy 3: Cross\-Document Multi\-Hop\.

The most challenging strategy constructs contexts requiring cross\-source reasoning:

1. Select a seed chunk $c_{\text{seed}}$ from document $D_i$.
2. Identify a semantically related chunk from a different document: $c_{\text{link}}=\arg\max_{c\in D_j,\,j\neq i}\cos(\vec{v}(c_{\text{seed}}),\vec{v}(c))$.
3. Validate connection strength: $\cos(\vec{v}(c_{\text{seed}}),\vec{v}(c_{\text{link}}))>\theta_{\text{link}}$, where $\theta_{\text{link}}=0.1$.
4. Construct the final context by combining seed, link, and distractor chunks.

Questions are generated after context construction, guaranteeing that correct answers depend on synthesizing linked information from multiple documents.
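The four steps of the cross-document strategy can be sketched as follows, again assuming a precomputed cosine-similarity matrix. The names `find_link` and `build_context` and the ordering of chunks in the final context are assumptions; only the arg-max rule and the threshold of 0.1 come from the text.

```python
def find_link(seed, doc_ids, sim, theta_link=0.1):
    """Return the index of the most similar chunk from another document.

    doc_ids[i] is the document that chunk i belongs to; sim[i][j] is the
    cosine similarity between chunks i and j. Returns None when no chunk
    from a different document exceeds theta_link.
    """
    others = [i for i in range(len(doc_ids)) if doc_ids[i] != doc_ids[seed]]
    if not others:
        return None
    link = max(others, key=lambda i: sim[seed][i])
    return link if sim[seed][link] > theta_link else None

def build_context(seed, link, distractors):
    """Combine seed, link, and distractor chunk indices into one context.

    The ordering here is illustrative; the paper does not specify it.
    """
    return [seed, link, *distractors]
```

A seed with no sufficiently related chunk in another document yields `None`, in which case a different seed would be tried.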

### 3.3 Adaptive Generation Control

Generating high\-quality test cases requires balancing precision and diversity\. We adapt generation parameters to content characteristics through two mechanisms\.

#### Technical Density and Precision Classification\.

We compute technical density $\rho(d)$ as the ratio of domain term occurrences (weighted by category) to total words. Documents are classified into precision levels:

$$
p(d)=\begin{cases}
\text{high} & \text{if } \rho(d)>0.20 \text{ or } \omega_h(d)>8\\
\text{medium} & \text{if } \rho(d)>0.12 \text{ or } \omega_h(d)>4\\
\text{low} & \text{otherwise}
\end{cases}
\tag{1}
$$

where $\omega_h(d)=\sum_{c_i\in\{\text{parameters, processes}\}} w_i\cdot\left|\{t\in T_{c_i}: n(t,d)>0\}\right|$ counts high-weight category term occurrences.
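A minimal sketch of the density computation and the classification rule in Eq. 1. The thresholds are those stated in the equation; the function names, the single-word term matching, and the example weights used in the usage note are hypothetical, since the values in Table 1 are not reproduced here.

```python
def technical_density(words, term_category, weights):
    """Category-weighted domain-term occurrences divided by total words."""
    if not words:
        return 0.0
    weighted = sum(weights[term_category[w]] for w in words if w in term_category)
    return weighted / len(words)

def high_weight_score(words, term_category, weights,
                      high_categories=("parameters", "processes")):
    """omega_h: for each high-weight category, weight times distinct terms present."""
    present = set(words)
    return sum(weights[cat] for term, cat in term_category.items()
               if cat in high_categories and term in present)

def precision_level(rho, omega_h):
    """Classify a document following the thresholds of Eq. 1."""
    if rho > 0.20 or omega_h > 8:
        return "high"
    if rho > 0.12 or omega_h > 4:
        return "medium"
    return "low"
```

For example, with illustrative weights `{"processes": 1.5, "parameters": 2.0, "materials": 1.0}`, a ten-word passage containing "etch" twice, "cd" once, and "wafer" once has density 6.0 / 10 = 0.6 and is classified as high precision.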

#### Adaptive Temperature\.

Temperature $\tau$ is computed as:

$$
\tau=\text{clip}\left(\tau_{\min}(p)+\tau_{\text{prog}}(k)+\Delta\tau_c(c^{*})+\Delta\tau_{\text{fail}}(a,s),\;0.1,\;1.0\right)
\tag{2}
$$

where base ranges $[\tau_{\min},\tau_{\max}]$ are $[0.4,0.8]$ for high-precision, $[0.5,0.9]$ for medium, and $[0.6,1.0]$ for low; $\tau_{\text{prog}}(k)=(\tau_{\max}-\tau_{\min})\times\min(k/20,0.8)$ increases diversity as successful generations $k$ accumulate; $\Delta\tau_c$ adjusts for the dominant content category ($-0.10$ for parameters, $+0.05$ for applications); and $\Delta\tau_{\text{fail}}$ boosts temperature after generation failures. These mechanisms follow established temperature-diversity trade-offs (Holtzman et al., [2020](https://arxiv.org/html/2605.26476#bib.bib39); Renze, [2024](https://arxiv.org/html/2605.26476#bib.bib40)).
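Eq. 2 can be sketched in a few lines. The base ranges, the progression term, the two category shifts, and the clipping bounds are taken from the text. The text does not give the form of the failure boost or the shifts for other categories, so this sketch treats the boost as a caller-supplied number `fail_boost` and assumes a shift of zero for unlisted categories.

```python
BASE_RANGES = {"high": (0.4, 0.8), "medium": (0.5, 0.9), "low": (0.6, 1.0)}
# Only these two category shifts are stated in the text; others default to 0.0 (assumption).
CATEGORY_SHIFT = {"parameters": -0.10, "applications": 0.05}

def adaptive_temperature(precision, k, category=None, fail_boost=0.0):
    """Sampling temperature following Eq. 2.

    precision:  "high", "medium", or "low" (from Eq. 1)
    k:          number of successful generations so far
    category:   dominant content category of the context, if known
    fail_boost: value of the failure term, supplied by the caller (hypothetical)
    """
    t_min, t_max = BASE_RANGES[precision]
    t_prog = (t_max - t_min) * min(k / 20, 0.8)
    tau = t_min + t_prog + CATEGORY_SHIFT.get(category, 0.0) + fail_boost
    return min(max(tau, 0.1), 1.0)  # clip to [0.1, 1.0]
```

Because the progression factor is capped at 0.8, the progression term alone never reaches the top of the base range: for a high-precision document it plateaus at 0.4 + 0.4 × 0.8 = 0.72 once k reaches 16.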

#### Adaptive Similarity Threshold\.

To prevent duplicate generation while allowing necessary terminology overlap in technical content, the similarity threshold relaxes progressively:

$$
\theta_{\text{sim}}(r)=\max\left(0.50,\;\theta_{\text{base}}(p)-0.05\times r\right)
\tag{3}
$$

where $\theta_{\text{base}}$ is 0.70 for high-precision, 0.75 for medium, and 0.80 for low-precision content, and $r$ counts retry attempts. Similarity combines weighted Jaccard overlap of all tokens and domain terms with TF-IDF cosine similarity for corpora exceeding five questions.
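The threshold schedule of Eq. 3 is simple enough to show directly. The base values and the floor of 0.50 are from the text; the function name `adaptive_threshold` is an assumption, and the sketch covers only the schedule, not the combined Jaccard and TF-IDF similarity whose weights are not given.

```python
BASE_THRESHOLD = {"high": 0.70, "medium": 0.75, "low": 0.80}

def adaptive_threshold(precision, retries):
    """Similarity threshold after `retries` retry attempts (Eq. 3)."""
    return max(0.50, BASE_THRESHOLD[precision] - 0.05 * retries)
```

For high-precision content the threshold steps through 0.70, 0.65, 0.60, 0.55 and reaches the 0.50 floor at the fourth retry; low-precision content reaches the floor at the sixth.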

#### Generation Pipeline\.

![Refer to caption](https://arxiv.org/html/2605.26476v1/x2.png)

Figure 2: Benchmark generation workflow. The pipeline iterates through synthesized contexts, adaptively adjusting generation parameters based on precision classification and failure feedback.

Algorithm [1](https://arxiv.org/html/2605.26476#alg1) integrates these components into a fully automated pipeline (Figure [2](https://arxiv.org/html/2605.26476#S3.F2)).

Algorithm 1: Adaptive QA Pair Generation

Input: corpus $D$, test type $t$, target count $N$

Output: QA pairs $\mathcal{Q}=\{(q_i,d_i,a_i)\}_{i=1}^{N}$

1. $C\leftarrow\textsc{Chunk}(D)$; compute TF-IDF vectors $\vec{v}(c)$ for all $c\in C$
2. $\mathcal{D}\leftarrow\textsc{SynthesizeContexts}(C,t)$ // Strategy 1, 2, or 3
3. $\mathcal{Q}\leftarrow\emptyset$
4. for each context $d\in\mathcal{D}$ while $|\mathcal{Q}|<N$ do
5. Classify precision $p(d)$ via Eq. [1](https://arxiv.org/html/2605.26476#S3.E1); compute density $\rho(d)$
6. $\tau\leftarrow\textsc{AdaptiveTemperature}(p,|\mathcal{Q}|,\text{failures})$ // Eq. [2](https://arxiv.org/html/2605.26476#S3.E2)
7. $\theta\leftarrow\textsc{AdaptiveThreshold}(p,\text{retries})$ // Eq. [3](https://arxiv.org/html/2605.26476#S3.E3)
8. for retry $r=1$ to $R_{\max}$ do
9. $(q,a)\leftarrow\textsc{LLMGenerate}(d,t,\tau)$
10. if $\textsc{Valid}(q,a)$ and $\textsc{Unique}(q,\mathcal{Q},\theta)$ then
11. $\mathcal{Q}\leftarrow\mathcal{Q}\cup\{(q,d,a)\}$; break
12. end if
13. end for
14. end for
15. return $\mathcal{Q}$
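The control flow of Algorithm 1 can be sketched in Python with the model call and the checks injected as callables. Everything named here (`generate_qa_pairs`, the callable signatures, and the default `max_retries=3`) is hypothetical; the paper does not state $R_{\max}$. The test type $t$ is assumed to be captured inside `generate`, and recomputing the threshold on each retry is one reading of Eq. 3, not something the algorithm listing states.

```python
def generate_qa_pairs(contexts, target_count, generate, is_valid, is_unique,
                      temperature_for, threshold_for, max_retries=3):
    """Collect up to target_count (question, context, answer) triples.

    generate(context, tau)                -> (question, answer)
    is_valid(question, answer)            -> bool
    is_unique(question, pairs, theta)     -> bool
    temperature_for(context, n_ok, fails) -> float  (Eq. 2)
    threshold_for(context, retry)         -> float  (Eq. 3)
    """
    pairs = []
    failures = 0
    for context in contexts:
        if len(pairs) >= target_count:
            break
        tau = temperature_for(context, len(pairs), failures)
        for retry in range(1, max_retries + 1):
            theta = threshold_for(context, retry)  # assumption: updated per retry
            question, answer = generate(context, tau)
            if is_valid(question, answer) and is_unique(question, pairs, theta):
                pairs.append((question, context, answer))
                break
            failures += 1  # feeds the failure term of the next temperature
    return pairs
```

Injecting the callables keeps the loop testable without a live model: a stub `generate` that returns a fixed question per context is enough to exercise the retry and deduplication paths.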

#### Expert Validation\.

A controlled ablation study validated the generation mechanism. Domain experts rated QA pairs (18 per condition, spanning ROB/MULTI/GEN types equally) on four dimensions (1–5 scale): accuracy, relevance, difficulty, and diversity. The full system combining adaptive parameters with enhanced prompts achieved the highest scores across all dimensions (Accuracy: 5.00, Relevance: 5.00, Difficulty: 4.60, Diversity: 4.53) with zero retries, while adaptive parameters alone yielded mixed results (Accuracy: 4.33, Difficulty: 3.28), indicating that prompt engineering is essential for quality generation (Appendix [D](https://arxiv.org/html/2605.26476#A4)).

Per-question-type analysis reveals an appropriate difficulty gradient in the generated benchmark (Table [2](https://arxiv.org/html/2605.26476#S3.T2)), confirming that the three synthesis strategies produce questions of distinct and meaningful complexity.

Table 2: Expert-rated quality by question type (averaged across all ablation conditions). The difficulty gradient (ROB < MULTI < GEN) confirms the three synthesis strategies produce appropriately graded complexity.

### 3.4 Six-Dimensional Evaluation Metrics

We define six diagnostic metrics, each evaluated on a 10-point scale with separate rubrics for objective and subjective questions (Table [3](https://arxiv.org/html/2605.26476#S3.T3)). Full rubric definitions are provided in Appendix [E](https://arxiv.org/html/2605.26476#A5).

Table 3: Six-dimensional evaluation metrics with failure mode attribution.

These metrics are designed to capture orthogonal failure modes. A response may exhibit high relevance but low factuality (on-topic hallucination), or strong technical depth but poor context utilization (ignoring retrieved documents in favor of parametric knowledge). This design enables diagnostic attribution: retrieval failures manifest as low context utilization despite high factuality, while generation failures appear as uniformly low scores.

We empirically assess metric independence in Section [5.5](https://arxiv.org/html/2605.26476#S5.SS5) through correlation analysis across all experimental configurations.

### 3.5 Evaluation Platform Architecture

Our platform implements a three-layer architecture inspired by SCARF (Rengo et al., [2025](https://arxiv.org/html/2605.26476#bib.bib32)):

- Orchestration Layer: Manages test distribution, parallel execution across models and context configurations, and result aggregation.
- Adapter Layer: Normalizes heterogeneous RAG framework interfaces (AnythingLLM, RAGFlow, MaxKB, Metaso) into unified query-response protocols, enabling fair cross-framework comparison. For frameworks without source exposure (e.g., MaxKB), the adapter employs heuristic context detection using domain-specific indicators.
- Evaluation Layer: Applies the six-dimensional metrics via G-Eval (Liu et al., [2023b](https://arxiv.org/html/2605.26476#bib.bib34)) through DeepEval (Confident AI, [2025](https://arxiv.org/html/2605.26476#bib.bib41)). G-Eval employs LLMs as judges with chain-of-thought reasoning: each metric is defined through evaluation objectives, reasoning procedures, and scoring rubrics. The evaluator extracts token-level probabilities over score candidates and computes normalized weighted scores.
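The probability-weighted scoring described for the evaluation layer can be sketched as an expectation over score candidates. This is an illustration of the general G-Eval idea, not DeepEval's internal code: the function name `weighted_score`, the input format (a mapping from integer score to log-probability), and the final division by the maximum score to obtain a value in [0, 1] are all assumptions.

```python
import math

def weighted_score(score_logprobs, max_score=10):
    """Expected judge score, normalized to [0, 1].

    score_logprobs maps each integer score candidate to the log-probability
    the judge model assigned to that score token. Probabilities are
    renormalized over the candidates that were actually observed.
    """
    probs = {score: math.exp(lp) for score, lp in score_logprobs.items()}
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no probability mass on any score candidate")
    expected = sum(score * p for score, p in probs.items()) / total
    return expected / max_score
```

A judge that splits its probability evenly between 8 and 9 yields 0.85 rather than a hard 0.8 or 0.9, which is what makes the weighted score finer-grained than taking the single most likely token.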

The platform supports two complementary evaluation paradigms: cross-model comparison (multiple LLMs within a fixed RAG framework, isolating model capabilities) and cross-framework comparison (multiple RAG architectures with a fixed model, isolating retrieval and system design choices).

## 4 Experiments: Context Window Scaling Analysis

![Refer to caption](https://arxiv.org/html/2605.26476v1/x3.png)

Figure 3: Aggregate performance trajectories across context windows (4K–32K). Three distinct scaling behaviors emerge: logarithmic growth (DeepSeek), early saturation (Qwen-Plus), and cold-start dynamics (Gemini).

### 4.1 Experimental Configuration

We conducted context\-scaling analysis across four LLMs: DeepSeek\-v3\.2\-Exp \(2025\-09\-29\), Qwen\-Plus \(2025\-09\-11\), Gemini\-2\.5\-Flash, and Qwen\-2\.5\-72B\-Instruct\. All experiments used AnythingLLM as the unified RAG framework with a fixed, pre\-processed segmented JSON corpus to minimize confounding variables\.

#### Context Window and Output Settings\.

Context window size was varied via AnythingLLM's workspace configuration through OpenAI-compatible APIs. Output token limits scaled with context: 1K for 4K context, 2K for 8K–10K, and 4K for $\geq$12K. We evaluated 11 configurations from (4K, 1K) to (32K, 4K), with finer granularity in the 10K–20K range where preliminary results indicated performance transitions. For Qwen-2.5-72B, we additionally tested extended contexts at 64K and 128K tokens.

#### Evaluation Protocol\.

Each model–context configuration was evaluated on the full 200\-question benchmark\. The benchmark comprises 59 robustness questions \(needle\-in\-haystack\), 90 multi\-hop reasoning questions \(cross\-document synthesis\), and 51 generation quality questions \(intra\-document multi\-topic\)\. Responses were scored using GPT\-4\.1\-mini via DeepEval’s G\-Eval implementation\.

#### Statistical Scale\.

The evaluation encompasses 200 questions × 4 models × 11+ configurations × 6 metrics = over 52,800 individual metric evaluations, providing sufficient statistical power for the reported comparisons.
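
The count above follows from simple multiplication. The short check below reproduces the 52,800 lower bound; the `extended` term is an assumption for illustration, counting only the two extra Qwen-2.5-72B contexts (64K and 128K) on the full question set.

```python
questions, models, configs, metrics = 200, 4, 11, 6

# Lower bound: every model evaluated at the 11 shared configurations.
baseline = questions * models * configs * metrics

# Assumption: one model (Qwen-2.5-72B) adds two extended contexts.
extended = questions * 1 * 2 * metrics

print(baseline)             # 52800
print(baseline + extended)  # 55200
```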

### 4.2 Overall Scaling Trends

Figure [3](https://arxiv.org/html/2605.26476#S4.F3) presents aggregate performance trajectories for the three primary models.

#### Convergent Performance at Scale\.

At 4K context, model differences are pronounced: Gemini achieves only 0\.474, while Qwen\-Plus leads at 0\.689 and DeepSeek reaches 0\.619—a 45% gap between best and worst\. This spread narrows substantially as context increases: by 20K, all models cluster between 0\.80–0\.85, and at 28K, performance converges within a 1% band \(0\.868–0\.876\)\.

#### Divergent Saturation Patterns\.

Beyond 28K, models exhibit divergent behaviors. Gemini and Qwen-Plus both decline: Gemini drops from 0.876 to 0.836 (−4.6%), and Qwen-Plus from 0.874 to 0.853 (−2.4%). In contrast, DeepSeek continues improving from 0.868 to 0.883 (+1.7%), demonstrating robust noise tolerance.
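
The percentages in this paragraph are relative changes with respect to the 28K score, not differences in score points. A short check using only the scores quoted above (the helper name `relative_change` is ours):

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# (score at 28K, score at 32K), as quoted in the text.
scores_28k_32k = {
    "Gemini": (0.876, 0.836),
    "Qwen-Plus": (0.874, 0.853),
    "DeepSeek": (0.868, 0.883),
}

changes = {
    model: round(relative_change(before, after), 1)
    for model, (before, after) in scores_28k_32k.items()
}
print(changes)  # {'Gemini': -4.6, 'Qwen-Plus': -2.4, 'DeepSeek': 1.7}
```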

#### Architecture\-Specific Inflection Points\.

Gemini shows minimal gains from 4K–8K before accelerating rapidly in the 12K–20K range, indicating a critical context threshold of ∼12K tokens. Qwen-Plus reaches 90% of peak performance at approximately 16K tokens. DeepSeek requires 20K tokens for equivalent relative performance but continues scaling where others plateau.

### 4.3 Three Scaling Behaviors

Our analysis reveals three distinct scaling behaviors, presented in order of increasing context efficiency\.

#### Cold\-Start Dynamics\.

Gemini exhibits S-curve behavior: poor initial performance (0.474 at 4K) indicates limited parametric knowledge for this domain, with a critical mass requirement of ∼12K tokens before effective reasoning activates. Performance peaks at 28K (0.876) then declines at 32K (0.836).

#### Early Saturation\.

Qwen-Plus achieves the highest initial performance (0.689 at 4K), indicating strong parametric knowledge. However, performance peaks at 28K (0.874) and declines at 32K: extended sequences lead to attention dispersion, where the model struggles to maintain focus amid noise. This makes Qwen-Plus best suited to short-to-medium contexts (4K–20K).

#### Logarithmic Growth\.

DeepSeek exhibits consistent logarithmic scaling (R² ≈ 0.91), improving from 0.619 (4K) to 0.883 (32K), the highest final performance, with no decline at extended lengths. This indicates robust attention filtering suited for complex multi-document reasoning.
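
A logarithmic fit of this kind can be obtained with ordinary least squares on the log of the context size. The sketch below is one plausible way to compute such a fit and its R²; it is not taken from the paper, and the function name `fit_log_curve` and the model form y = a + b·ln(x) are assumptions.

```python
import math
from typing import Sequence, Tuple


def fit_log_curve(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of y = a + b * ln(x); returns (a, b, r_squared).

    Assumes the xs are positive and not all equal, and the ys are not all
    equal (otherwise the slope or R^2 is undefined).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    ts = [math.log(x) for x in xs]  # regress on ln(context size)
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    b = sxy / sxx
    a = y_mean - b * t_mean
    # R^2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (a + b * t)) ** 2 for t, y in zip(ts, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot
```

To apply it, one would pass the evaluated context sizes as `xs` and the corresponding aggregate scores as `ys`; the per-configuration scores are reported in the paper's appendix rather than in this section.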

#### Extended Context Validation\.

Qwen-2.5-72B-Instruct extends our analysis to 128K tokens. Performance improves from 0.594 (4K) to 0.802 (32K), then plateaus at 64K (0.795) and 128K (0.805), confirming that marginal returns diminish beyond 32K. Complete results for all models are provided in Appendix [G](https://arxiv.org/html/2605.26476#A7).

## 5 Diagnostic Analysis: Metric-Level Attribution

The scaling curves reveal what happens; our six-dimensional metrics diagnose why.

### 5.1 4K Context Window

![Refer to caption](https://arxiv.org/html/2605.26476v1/x4.png)

Figure 4: Metric breakdown at 4K context window, the most resource-constrained scenario.

Figure [4](https://arxiv.org/html/2605.26476#S5.F4) reveals dimension-level performance at 4K:

Gemini (0.474) shows catastrophically low Depth (0.374), Completeness (0.399), and Support Quality (0.350), with only Context Utilization (0.669) approaching acceptable levels, confirming an inability to generate domain-specific content without extensive grounding material.

Qwen-Plus (0.689) leads through balanced scores: Factuality (0.663), Depth (0.662), Completeness (0.686), and notably high Context Utilization (0.805), indicating strong parametric knowledge compensating for limited retrieval.

DeepSeek (0.619) shows moderate performance with high Context Utilization (0.783) but lower Depth (0.554) and Support Quality (0.534), suggesting conservative evidence extraction without speculation.

### 5.2 28K Context Window

![Refer to caption](https://arxiv.org/html/2605.26476v1/x5.png)

Figure 5: Metric breakdown at 28K context window (peak performance for most models).

At 28K (Figure [5](https://arxiv.org/html/2605.26476#S5.F5)), the performance landscape transforms:

Gemini (0.876) achieves remarkable recovery, with Factuality (0.901) and Completeness (0.886) leading all models, confirming that its short-context weakness stems from grounding dependency rather than fundamental incapability.

Qwen-Plus (0.874) reaches its peak with Context Utilization (0.916) and Completeness (0.894) as strongest dimensions.

DeepSeek (0.868) demonstrates balanced excellence with all dimensions exceeding 0.81 and no discernible weak points.

### 5.3 Mechanism Attribution

Three mechanisms explain the observed behaviors:

- Noise Tolerance (DeepSeek): Monotonic improvement through 32K indicates effective attention filtering: the model benefits from additional context without information overload.
- Parametric Compensation (Qwen-Plus): Strong 4K performance reflects internal knowledge compensating for limited retrieval. The 32K decline suggests an optimal context threshold beyond which noise degrades performance.
- Critical Mass Activation (Gemini): S-curve behavior with 4K–8K stagnation followed by 12K–28K acceleration indicates a ∼12K token threshold for reasoning activation in this domain.

### 5.4 Attention Dilution at Extreme Context (32K)

![Refer to caption](https://arxiv.org/html/2605.26476v1/x6.png)

Figure 6: Metric-level performance changes from 28K to 32K context window. Green: improvement; red: decline. Gemini shows severe attention dilution with Relevance (−5.4%) and Support Quality (−5.1%) most affected.

Figure [6](https://arxiv.org/html/2605.26476#S5.F6) presents metric-level changes from 28K to 32K, revealing a consistent attention dilution signature in declining models.

#### Attention Dilution Signature.

For Gemini, the largest drops occur in Relevance (−6.1%) and Support Quality (−6.0%), indicating failure to identify query-relevant information. The Depth decline (−5.3%) suggests scattered attention prevents deep analysis. Critically, Context Utilization shows relatively smaller decline (Gemini −2.0%, Qwen-Plus −2.9%), indicating these models still attempt to leverage extended context but extract less value per token. This distinguishes attention dilution from context truncation.

#### Contrast with Noise-Tolerant Architecture.

DeepSeek continues improving at 32K (+1.7%), with Support Quality improvement (+3.6%) paired with slight Context Utilization decline (−0.4%), indicating selective utilization where the model processes more context but references only high-quality evidence.

### 5.5 Metric Independence Analysis

To validate that our six metrics capture distinct failure modes rather than collapsing into a single quality factor, we analyze inter-metric correlations across all experimental configurations (∼33 model-context combinations × 6 dimensions).

Table 4: Inter-metric Pearson correlation coefficients computed across all model-context configurations. Values below 0.70 indicate meaningful independence; values above 0.85 suggest potential redundancy.

Table [4](https://arxiv.org/html/2605.26476#S5.T4) reveals high aggregate correlations (>0.90) across all metric pairs when computed over mean scores per model-context configuration. This is expected: as context increases, all metrics improve together because models receive more relevant information. However, this aggregate correlation masks the diagnostic value that emerges at specific operating points.
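
The aggregate analysis described here amounts to a pairwise Pearson correlation over per-configuration mean scores. A minimal sketch follows; the function names and the input layout (one list of configuration means per metric) are assumptions for illustration, not the authors' code.

```python
import math
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(scores: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise Pearson correlations between metrics.

    scores maps a metric name to its mean score in each model-context
    configuration, with configurations in the same order for every metric.
    """
    names = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}
```

Because every configuration contributes a single mean per metric, any trend shared across metrics (such as all scores rising with context size) pushes these coefficients toward 1, which is the effect the paragraph above describes.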

#### Diagnostic Independence at Fixed Operating Points\.

The metrics’ diagnostic utility manifests when comparing models at identical context windows, where inter-metric profiles diverge meaningfully:

- At 4K, Gemini’s Context Utilization (0.669) is about 91% higher than its Support Quality (0.350), a ratio of roughly 1.9×, indicating the model reads context but cannot cite it accurately.
- At 28K, Gemini achieves the highest Factuality (0.901) but lower Depth (0.818), while DeepSeek shows the opposite pattern (Factuality: 0.892, but highest balanced profile). These cross-metric divergences would be invisible to a single composite score.
- In the case study (Section [7](https://arxiv.org/html/2605.26476#S7)), Gemini-2.5-Flash achieves high Context Utilization (0.90) but catastrophic Factuality (0.17), a failure mode uniquely identifiable through multi-dimensional evaluation.

#### Distinct Sensitivity to Context Scaling\.

Different metrics respond differently to context expansion: Context Utilization saturates earliest (reaching >0.80 by 8K for all models), while Technical Depth and Support Quality show the steepest improvement trajectories and the largest model-specific variation. The 28K-to-32K attention dilution (Section [5.4](https://arxiv.org/html/2605.26476#S5.SS4)) disproportionately affects Relevance and Support Quality while largely preserving Context Utilization, a diagnostic pattern only visible through multi-dimensional evaluation.

We acknowledge that the high aggregate correlations limit the metrics’ discriminative power for ranking models at a single context point. The primary diagnostic value lies in profile analysis (comparing metric patterns across models, context windows, and failure cases) rather than individual metric rankings.

## 6 Cross-Framework Evaluation

To validate evaluation portability, we deployed our benchmark on three additional production RAG frameworks: RAGFlow, MaxKB, and Metaso.

![Refer to caption](https://arxiv.org/html/2605.26476v1/x7.png)

Figure 7: Performance comparison across three external RAG frameworks.

Table 5: Cross-framework performance breakdown. All frameworks evaluated on the same 200-question benchmark. Retrieval strategies differ: RAGFlow uses visual document parsing with hybrid search; MaxKB uses DeepSeek-V3.2 (non-thinking mode) with chunk-based retrieval; Metaso uses a proprietary model with web-augmented retrieval.

#### Framework-Specific Failure Modes.

Each framework exhibits distinct metric profiles reflecting different architectural choices. Metaso (0.67 avg) leads in Context Utilization (0.711) and Completeness (0.692), suggesting effective information synthesis. MaxKB (0.57 avg) shows a notable gap between Context Utilization (0.662) and Depth (0.531), indicating surface-level extraction without deep reasoning, consistent with its use of DeepSeek-V3.2 in non-thinking mode. RAGFlow (0.55 avg) scores lowest in Depth (0.485) and Support Quality (0.498), suggesting chunk fragmentation issues despite its advanced visual parsing capabilities.

#### Consistent Difficulty Ordering\.

Depth and Support Quality consistently score lower than Factuality across all frameworks, confirming the benchmark captures intrinsic task difficulty independent of platform architecture\.

#### Deployment Portability\.

Adapting our benchmark to each new framework required minimal engineering: API integration for query submission, response parsing, and source citation extraction\. The core evaluation pipeline remained unchanged, suggesting our methodology can serve as a reusable evaluation layer for heterogeneous enterprise RAG deployments\.

## 7 Case Study: Pulsed Atomic Layer Etching

To illustrate how multi-dimensional evaluation reveals failure modes invisible to single-score metrics, we present a representative case from technical parameter extraction.

### 7.1 Task Description

Test case MULTI_069 requires extracting precise process parameters from patent documentation describing pulsed atomic layer etching (ALE) for ruthenium removal, a critical BEOL process in advanced semiconductor manufacturing. The fill-in-the-blank question requires five specific values: optimal bias voltage range, comparison direction, etch rate comparison, and synergy comparison relative to continuous ALE.

The ground truth specifies: 600V–1200V bias window, higher bias voltages (vs. 60–100V for continuous ALE), higher etch rates (5–6 Å/cycle vs. 2–3 Å/cycle), and higher synergy.

### 7.2 Model Responses and Error Analysis

Table 6: Model responses for pulsed ALE parameter extraction (MULTI_069, 18K context).

Both Qwen-Plus and Gemini exhibit process variant confusion: the retrieved context interleaves continuous ALE (60–100V) and pulsed ALE (600–1200V) specifications. Qwen-Plus extracts values near the continuous range (50–150V), while Gemini hallucinates 10–50V. DeepSeek correctly disambiguates by reasoning about the pulsed duty cycle mechanism.

### 7.3 Diagnostic Attribution via Multi-Dimensional Metrics

![Refer to caption](https://arxiv.org/html/2605.26476v1/x8.png)

Figure 8: Six-dimensional comparison for MULTI_069 (18K context). The divergence between Context Utilization and Factuality for Gemini (0.90 vs. 0.17) reveals confident misattribution, a failure mode invisible to single-score evaluation.

The radar chart (Figure [8](https://arxiv.org/html/2605.26476#S7.F8)) reveals a critical diagnostic pattern: Gemini achieves high Context Utilization (0.90) but catastrophic Factuality (0.17). This combination indicates the model confidently cites incorrect portions of retrieved context, confusing continuous and pulsed ALE parameters. A single composite score would obscure this specific failure mode, which has direct implications for safety-critical semiconductor process specification.

This case exemplifies disambiguation under structural similarity: when retrieved contexts contain multiple process variants with similar descriptive patterns but different quantitative specifications, models must parse document structure, maintain attention to qualifying terms across long spans, and cross-validate extracted values.

## 8 Discussion

### 8.1 Interpreting Scaling Behaviors

We connect our three scaling behaviors to prior findings. Qwen-Plus’s strong 4K performance and early saturation align with Mallen et al. ([2023](https://arxiv.org/html/2605.26476#bib.bib38)), who found models with stronger parametric knowledge show diminishing returns from retrieval. DeepSeek’s sustained 32K improvement contrasts with the “lost in the middle” phenomenon (Liu et al., [2024a](https://arxiv.org/html/2605.26476#bib.bib37)), suggesting effective attention distribution. Gemini’s S-curve resembles multi-document QA patterns requiring sufficient context density for coherent reasoning (Liu et al., [2024a](https://arxiv.org/html/2605.26476#bib.bib37)). These interpretations remain hypotheses; definitive mechanistic attribution would require access to model internals unavailable through commercial APIs.

### 8.2 Dynamic Routing Strategy

The crossover point at ∼15K–16K tokens motivates context-aware model routing:

- Short contexts (<14K): Route to Qwen-Plus for maximum efficiency.
- Complex reasoning (>16K): Route to DeepSeek for stable multi-document synthesis.
- Batch summarization (20K–28K): Deploy Gemini for massive context ingestion, capped at 28K.
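
The routing rules above can be sketched as a small dispatch function. This is an illustration under stated assumptions, not a component of FAB-Bench: the function name `route_model` and the `task` label are hypothetical, "K" is taken as 1,000 tokens, and the behavior inside the 14K–16K band, which the rules leave open, is a choice made for this example.

```python
def route_model(context_tokens: int, task: str = "qa") -> str:
    """Pick a model name from the context size and a coarse task label.

    task is a hypothetical label ("qa" or "batch_summarization") that does
    not appear in the source; it only separates the third rule from the
    first two.
    """
    K = 1000  # assumption: "K" means 1,000 tokens
    if task == "batch_summarization" and 20 * K <= context_tokens <= 28 * K:
        return "Gemini"
    if context_tokens < 14 * K:
        return "Qwen-Plus"
    if context_tokens > 16 * K:
        return "DeepSeek"
    # 14K-16K is the crossover band, where the stated rules give no answer.
    # Assumption: default to DeepSeek, which keeps improving as context grows.
    return "DeepSeek"
```

Checking the batch rule first keeps the 20K–28K summarization case from being absorbed by the more general rule for contexts above 16K.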

### 8.3 Limitations and Validity Threats

#### LLM-as-Judge Reliability.

Our evaluation relies on GPT-4.1-mini as the judge model via G-Eval with structured rubrics (Liu et al., [2023b](https://arxiv.org/html/2605.26476#bib.bib34); Zheng et al., [2024](https://arxiv.org/html/2605.26476#bib.bib33)). While the QA generation pipeline has been validated through expert evaluation (Appendix [D](https://arxiv.org/html/2605.26476#A4)), full human–judge correlation studies for the six downstream evaluation metrics (using calibrated annotator pools with inter-annotator agreement metrics) remain an important direction for future work. The structured G-Eval rubrics with separate objective/subjective scoring criteria are designed to mitigate known LLM judge biases, and the high consistency of our results across four LLMs and four RAG frameworks suggests the evaluation signal is robust.

#### Benchmark Scale\.

Our 200 curated questions \(from 1,300\+ generated\) span three synthesis strategies across 188 topics\. While sufficient for identifying significant performance differences \(52,800\+ metric evaluations across all configurations\), larger benchmarks would enable finer\-grained edge\-case analysis\.

#### Retrieval Configuration\.

Experiments use AnythingLLM’s default retrieval (max context snippets = 4, similarity threshold = 0.25). The cross-framework evaluation (Section [6](https://arxiv.org/html/2605.26476#S6)) provides implicit retrieval ablation: RAGFlow, MaxKB, and Metaso employ different retrieval strategies (visual parsing, chunk-based, web-augmented), yet our metrics consistently diagnose framework-specific failure modes. Explicit retriever ablations (e.g., TF-IDF vs. dense retrieval) remain future work.
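
To make the two retrieval settings concrete, the sketch below shows how a snippet cap and a similarity floor typically interact. It is a generic illustration, not AnythingLLM's code: the function name `select_snippets`, the input layout, and the choice to treat the threshold as inclusive are all assumptions.

```python
from typing import List, Tuple


def select_snippets(
    scored: List[Tuple[str, float]],
    max_snippets: int = 4,
    similarity_threshold: float = 0.25,
) -> List[str]:
    """Keep the highest-scoring snippets that clear the similarity threshold.

    scored holds (snippet_text, similarity) pairs from any retriever.
    """
    # Drop snippets below the similarity floor (inclusive by assumption).
    kept = [(text, sim) for text, sim in scored if sim >= similarity_threshold]
    # Rank the survivors from most to least similar.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    # Cap the number of snippets passed to the generator.
    return [text for text, _ in kept[:max_snippets]]
```

With a cap of four snippets, the amount of retrieved evidence stays bounded, which is one reason explicit retriever ablations would be a useful complement to the context-window analysis.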

#### Domain Specificity\.

Semiconductor manufacturing represents an extreme case of specialized knowledge\. The generation pipeline is domain\-agnostic \(requiring only a corpus and knowledge base\), but optimal context thresholds may require recalibration for other domains\.

#### Model Selection and Reproducibility\.

We evaluate three primary models and one extended model, pinning exact version strings and dates\. The rapidly evolving LLM landscape means specific findings may not transfer to newer releases, though the methodology and benchmark remain applicable\.

## 9 Conclusion

We present FAB-Bench, an automated framework for evaluating RAG systems on proprietary domain knowledge. Through systematic experiments on semiconductor manufacturing documentation (150+ papers, 70+ patents), we make three contributions:

First, we establish a reproducible methodology for private-domain RAG evaluation with adaptive benchmark generation and six diagnostic metrics enabling precise failure attribution. The curated 200-question benchmark dataset is released at [https://github.com/FuturefabAI/FAB-Bench](https://github.com/FuturefabAI/FAB-Bench) and is directly applicable to other vertical domains, requiring only a domain-specific corpus and knowledge base.

Second, context-window scaling analysis (4K–32K) reveals three distinct behaviors (logarithmic growth, early saturation, and cold-start dynamics), with metric-level decomposition identifying attention dilution as the mechanism behind performance degradation at extreme context lengths.

Third, cross\-framework validation on four production RAG systems confirms evaluation portability and demonstrates consistent diagnostic capability across heterogeneous architectures\.

#### Future Work\.

Priority extensions include: \(1\) human correlation studies validating LLM judge reliability with inter\-annotator agreement metrics; \(2\) explicit retriever ablations comparing sparse, dense, and hybrid retrieval strategies; \(3\) per\-category scaling analysis disaggregating performance across robustness, multi\-hop, and generation quality question types; and \(4\) extension to additional vertical domains with domain\-specific knowledge bases\.

## References

- GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, et al. (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems 33, pp. 1877–1901.
- L. Budler, L. Gosak, and G. Štiglic (2025) A brief review on benchmarking for large language models evaluation in healthcare. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.
- J. Chen, H. Lin, X. Han, and L. Sun (2024a) Benchmarking large language models in retrieval-augmented generation. In Proceedings of AAAI, Vol. 38, pp. 17754–17762.
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, et al. (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- R. Chen, J. Wu, J. Wang, et al. (2024b) Rethinking domain-specific llm benchmark construction: a comprehensiveness-compactness approach. arXiv preprint arXiv:2508.07353.
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? try arc, the ai2 reasoning challenge. In Proceedings of NeurIPS.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- Confident AI (2025) DeepEval: open-source llm evaluation framework. Note: [https://docs.confident-ai.com](https://docs.confident-ai.com/) Version 3.2.1.
- S. Es, J. James, L. Espinosa-Anke, and S. Schockaert (2024) RAGAs: automated evaluation of retrieval augmented generation. In Proceedings of EACL: System Demonstrations, pp. 150–158.
- Z. Fei, X. Shen, D. Zhu, F. Zhou, Z. Han, S. Zhang, K. Chen, Z. Shen, and J. Ge (2023) LawBench: benchmarking legal knowledge of large language models. arXiv preprint arXiv:2309.16289.
- Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and H. Wang (2023) Retrieval-augmented generation for large language models: a survey. arXiv preprint arXiv:2312.10997.
- N. Guha et al. (2023) LegalBench: a collaboratively built benchmark for measuring legal reasoning in large language models. arXiv preprint arXiv:2308.11462.
- D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021a) Measuring massive multitask language understanding. Proceedings of ICLR.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021b) Measuring mathematical problem solving with the math dataset. Proceedings of NeurIPS.
- A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi (2020) The curious case of neural text degeneration. In International Conference on Learning Representations.
- Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, et al. (2023) C-eval: a multi-level multi-discipline chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems.
- P. Islam et al. (2023) FinanceBench: a new benchmark for financial question answering. arXiv preprint.
- D. Jin, E. Pan, N. Oufattole, W. Weng, H. Fang, and P. Szolovits (2021) What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences.
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, et al. (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33, pp. 9459–9474.
- S. Lin, J. Hilton, and O. Evans (2022) TruthfulQA: measuring how models mimic human falsehoods. Proceedings of ACL.
- M. Liu, T. Ene, R. Kirber, C. Cheng, N. Tiber, T. Greaves, et al. (2023a) ChipNeMo: domain-adapted llms for chip design. arXiv preprint arXiv:2311.00176.
- N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2024a) Lost in the middle: how language models use long contexts. Transactions of the Association for Computational Linguistics 12, pp. 157–173.
- Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu (2023b) G-eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of EMNLP, pp. 2511–2522.
- Y. Liu, L. Huang, S. Li, S. Chen, H. Zhou, F. Meng, J. Zhou, and X. Sun (2023c) RECALL: a benchmark for llms robustness against external counterfactual knowledge. arXiv preprint arXiv:2311.08147.
- Y. Liu, H. Zhou, Z. Guo, E. Shareghi, I. Vulic, A. Korhonen, and N. Collier (2024b) Aligning with human judgement: the role of pairwise preference in large language model evaluators. arXiv preprint arXiv:2403.16950.
- A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi (2023) When not to trust language models: investigating effectiveness of parametric and non-parametric memories. In Proceedings of ACL, pp. 9802–9822.
- OpenAI (2023) GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- A. Pal, L. K. Umapathi, and M. Sankarasubbu (2022) MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. Proceedings of CHIL.
- M. Rengo, S. Beadini, D. Alfano, and R. Abbruzzese (2025) A system for comprehensive assessment of rag frameworks. arXiv preprint arXiv:2504.07803.
- M. Renze (2024) The effect of sampling temperature on problem solving in large language models. In Findings of EMNLP, pp. 7346–7356.
- J. Saad-Falcon, O. Khattab, C. Potts, and M. Zaharia (2024) ARES: an automated evaluation framework for retrieval-augmented generation systems. In Proceedings of NAACL.
- K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, et al. (2023a) Large language models encode clinical knowledge. Nature.
- K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou, et al. (2023b) Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617.
- Y. Tang and Y. Yi (2024) MultiHop-rag: benchmarking retrieval-augmented generation for multi-hop queries. arXiv preprint arXiv:2401.15391.
- A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2019) SuperGLUE: a stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems.
- A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of EMNLP Workshop BlackboxNLP.
- P. Wang, L. Li, L. Chen, Z. Cai, D. Zhu, B. Lin, Y. Cao, Q. Liu, T. Liu, and Z. Sui (2023) Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926.
- S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, and G. Mann (2023) BloombergGPT: a large language model for finance. arXiv preprint arXiv:2303.17564.
- Q. Xie, W. Han, Z. Zhang, et al. (2024) FinBen: a holistic financial benchmark for large language models. arXiv preprint arXiv:2402.12659.
- X. Yang, K. Sun, H. Xin, Y. Sun, N. Bhalla, X. Chen, et al. (2024) CRAG – comprehensive rag benchmark. arXiv preprint arXiv:2406.04744.
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, et al. (2024) Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems.

## Appendix A Semiconductor Domain Knowledge Base

This appendix provides the complete listing of 431 technical terms in our semiconductor domain knowledge base, organized into seven categories with associated precision weights.

### A.1 Manufacturing Processes (60 terms, weight = 1.5)

Core fabrication techniques: wafer, lithography, etching, CVD, PVD, ALD, ion implantation, annealing, CMP, cleaning, inspection, metrology, mask, exposure, develop, strip, deposition, diffusion, oxidation, epitaxy, packaging, wire bonding, flip chip, WLP, SiP, BGA, CSP

Advanced lithography: immersion lithography, multi-patterning, self-aligned, spacer, hard mask, anti-reflective coating, photoresist, reticle

Etching techniques: reactive ion etching, RIE, plasma etching, dry etching, wet etching, isotropic etching, anisotropic etching

Deposition methods: PECVD, LPCVD, MOCVD, MBE, electroplating, sputtering, atomic layer deposition, physical vapor deposition

CMP and cleaning: chemical mechanical planarization, post-CMP cleaning, megasonic cleaning, RCA clean, HF dip

Metrology: overlay metrology, critical dimension measurement, ellipsometry, optical CD, scatterometry

### A\.2 Materials Science \(51 terms, weight = 1\.3\)

Substrate materials: silicon, GaAs, GaN, SiC, InP, germanium, polysilicon, monocrystalline, amorphous

Dielectric materials: high\-k, low\-k, SiO2, Si3N4, Al2O3, HfO2, silicon dioxide, silicon nitride, hafnium oxide, aluminum oxide, zirconium oxide, tantalum pentoxide, ultra\-low\-k, porous silica, carbon\-doped oxide

Metal materials: copper, aluminum, tungsten, titanium, tantalum, indium, silver, gold, platinum, nickel, cobalt, ruthenium, molybdenum

Compound semiconductors: gallium arsenide, gallium nitride, silicon carbide, indium phosphide, aluminum nitride, zinc oxide

2D materials: graphene, MoS2, transition metal dichalcogenides

Gate stack materials: metal gate, work function metal, barrier layer, capping layer, etch stop layer

### A\.3 Device Physics \(62 terms, weight = 1\.4\)

Transistor types: MOSFET, FinFET, GAAFET, CMOS, BJT, JFET, IGBT, diode, transistor

Advanced transistors: gate\-all\-around, nanosheet, nanowire, CFET, complementary FET, vertical transistor, TFET, tunnel FET, FDSOI, fully depleted SOI

Memory devices: memory, DRAM, SRAM, NAND, NOR, FRAM, ReRAM, MRAM, flash memory, 3D NAND, PCM, phase change memory, RRAM, STT\-MRAM, emerging memory

Passive components: resistor, capacitor, inductor

Integrated circuits: IC, integrated circuit, ASIC, SoC, system on chip, logic circuit, analog circuit, mixed\-signal

Sensors and MEMS: MEMS, sensor, actuator, image sensor, CMOS sensor, pressure sensor, accelerometer, gyroscope

Specialty devices: RF, power device, optoelectronic, LED, laser diode, photodiode, solar cell, TFT, thin film transistor

### A\.4 Process Nodes \(65 terms, weight = 1\.6\)

Process nodes: 14nm, 10nm, 7nm, 5nm, 3nm, 2nm, 1nm, 28nm, 22nm, 16nm, 12nm, 8nm, 6nm, 4nm, process node, technology node

Lithography technologies: EUV, DUV, 193nm, 248nm, 365nm, ArF, KrF, i\-line, g\-line, extreme ultraviolet, deep ultraviolet, 13\.5nm, high\-NA EUV

Process integration: FinFET process, SOI, FD\-SOI, bulk, bulk silicon, strained silicon, strain engineering, SiGe, silicon germanium, HEMT, high electron mobility

Process stages: FEOL, BEOL, MEOL, front end of line, back end of line, middle of line

Advanced techniques: gate\-first, gate\-last, replacement gate, high\-k metal gate, HKMG, multiple patterning, SADP, SAQP, self\-aligned double patterning, self\-aligned quadruple patterning, LELE, LFLE

Interconnect: damascene, dual damascene, copper interconnect, via, through silicon via, TSV, backside power delivery

### A\.5 Testing Methodologies \(51 terms, weight = 1\.2\)

Reliability testing: reliability, yield, failure analysis, electrical test, functional test, burn\-in, temperature cycling, thermal shock, humidity test

Failure mechanisms: electromigration, hot carrier, NBTI, TDDB, negative bias temperature instability, time dependent dielectric breakdown, stress migration, void formation

Quality control: SPC, statistical process control, parameter drift, defect density, process capability, Cpk, design of experiments, DOE

Inspection techniques: critical dimension, CD\-SEM, AFM, SEM, TEM, X\-ray, atomic force microscopy, scanning electron microscopy, transmission electron microscopy, optical inspection, e\-beam inspection

Electrical characterization: I\-V curve, C\-V measurement, DLTS, deep level transient spectroscopy, Hall effect, mobility measurement, sheet resistance, four\-point probe, contact resistance

Wafer\-level testing: probe test, wafer sort, parametric test, functional test, final test, package test

### A\.6 Applications \(62 terms, weight = 1\.1\)

Computing processors: AI chip, GPU, CPU, NPU, TPU, FPGA, ASIC, graphics processing unit, neural processing unit, tensor processing unit, application specific IC, DSP, digital signal processor, MCU, microcontroller

Computing systems: edge computing, cloud computing, data center, server, supercomputer, quantum computing, neuromorphic computing, in\-memory computing

Consumer electronics: smartphone, tablet, laptop, wearable, smartwatch, AR glasses, VR headset

Automotive: automotive electronics, ADAS, autonomous driving, LiDAR, radar, infotainment, powertrain

Communication: 5G, 6G, IoT, internet of things, wireless, RF transceiver, baseband, modem

AI/ML: neural network, machine learning, deep learning, inference, training, transformer, large language model

Emerging applications: blockchain, cryptocurrency, mining, virtual reality, VR, augmented reality, AR, metaverse, HPC, high performance computing

### A\.7 Performance Parameters \(80 terms, weight = 1\.7\)

Voltage parameters: threshold voltage, Vth, supply voltage, VDD, VSS, voltage, operating voltage, breakdown voltage, junction temperature

Current parameters: leakage current, drive current, Ion, Ioff, current, saturation current, subthreshold current, gate leakage, junction leakage

Power parameters: power consumption, static power, dynamic power, power dissipation, TDP, thermal design power, power density, energy efficiency

Timing parameters: switching speed, delay, propagation delay, rise time, fall time, setup time, hold time, clock skew, clock frequency, access time

Frequency parameters: frequency, bandwidth, cutoff frequency, fT, fmax, operating frequency, clock speed

Resistance and capacitance: resistance, capacitance, Ron, on\-resistance, gate capacitance, parasitic capacitance, RC delay, interconnect resistance

Temperature parameters: operating temperature, junction temperature, thermal resistance, temperature coefficient

Noise and linearity: noise, noise figure, SNR, signal to noise ratio, linearity, THD, total harmonic distortion

RF parameters: gain, phase, S\-parameters, impedance matching, insertion loss, return loss

Performance metrics: DIBL, drain induced barrier lowering, subthreshold swing, SS, transconductance, gm, output conductance, GIDL, gate induced drain leakage, mobility, carrier mobility, saturation velocity

### A\.8 Knowledge Base Design Rationale

#### Coverage breadth:

The 431 terms span the full semiconductor technology stack from materials science to system applications\.

#### Precision weighting:

Category weights \($w_i \in [1.1, 1.7]$\) reflect technical rigor requirements, with performance parameters \(1\.7\) and process nodes \(1\.6\) weighted highest due to their quantitative precision requirements\.

#### Term granularity:

Includes both high\-level concepts \(e\.g\., “transistor”\) and specific implementations \(e\.g\., “FinFET”, “GAAFET”, “nanosheet”\)\.

#### Temporal coverage:

Terms encompass mature technologies, current leading\-edge, and emerging concepts to ensure benchmark relevance as technology advances\.

## Appendix B Generation System Implementation

### B\.1 Precision Classification

Documents are classified based on technical density $\rho(d)$ and weighted high\-precision category presence \(Eq\. [1](https://arxiv.org/html/2605.26476#S3.E1)\)\. The disjunctive logic ensures documents containing critical parameters are classified as high\-precision even with low overall density\.

### B\.2 Adaptive Temperature Control

#### Base Temperature Ranges\.

Temperature ranges vary by precision level to balance accuracy with diversity:

$$[\tau_{\min}(p), \tau_{\max}(p)] = \begin{cases} [0.4, 0.8] & p = \text{high} \\ [0.5, 0.9] & p = \text{medium} \\ [0.6, 1.0] & p = \text{low} \end{cases} \tag{4}$$

#### Progressive Temperature\.

Base temperature progresses with successful generations to encourage diversity:

$$\tau_{\text{progress}}(k) = (\tau_{\max} - \tau_{\min}) \times \min\left(\frac{k}{20}, 0.8\right) \tag{5}$$

where $k$ is the count of successfully generated questions\.

#### Category\-Specific Adjustments\.

Fine\-grained adjustments based on dominant content category:

$$\Delta\tau_c(c^{*}) = \begin{cases} -0.10 & c^{*} = \text{parameters} \\ -0.08 & c^{*} = \text{processes} \\ -0.05 & c^{*} = \text{devices} \\ -0.03 & c^{*} = \text{testing} \\ \phantom{-}0.00 & c^{*} = \text{materials} \\ +0.02 & c^{*} = \text{manufacturing} \\ +0.05 & c^{*} = \text{applications} \end{cases} \tag{6}$$

#### Failure Recovery Boosts\.

Temperature increases after generation failures:

$$\Delta\tau_{\text{attempt}}(a) = \min(0.25,\ 0.08 \times a) \tag{7}$$

$$\Delta\tau_{\text{similarity}}(s) = \min(0.15,\ 0.05 \times s) \tag{8}$$

where $a$ counts all failed attempts and $s$ tracks consecutive similarity failures\.

#### Final Temperature Computation\.

$$\tau = \text{clip}\left(\tau_{\min} + \tau_{\text{progress}} + \Delta\tau_c + \Delta\tau_{\text{attempt}} + \Delta\tau_{\text{similarity}},\ 0.1,\ 1.0\right) \tag{9}$$
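The pieces in Eqs. 4 to 9 can be combined as in the following minimal sketch. The function and dictionary names are illustrative assumptions, not taken from the FAB-Bench code, and treating an unlisted category as a zero adjustment is also an assumption.

```python
from typing import Dict, Tuple

# Base temperature ranges per precision level (Eq. 4).
TEMP_RANGES: Dict[str, Tuple[float, float]] = {
    "high": (0.4, 0.8),
    "medium": (0.5, 0.9),
    "low": (0.6, 1.0),
}

# Category-specific adjustments (Eq. 6).
CATEGORY_DELTA: Dict[str, float] = {
    "parameters": -0.10,
    "processes": -0.08,
    "devices": -0.05,
    "testing": -0.03,
    "materials": 0.00,
    "manufacturing": 0.02,
    "applications": 0.05,
}


def adaptive_temperature(precision: str, category: str, k: int, a: int, s: int) -> float:
    """Return the sampling temperature for one generation attempt.

    k: successfully generated questions so far
    a: all failed attempts
    s: consecutive similarity failures
    """
    t_min, t_max = TEMP_RANGES[precision]
    progress = (t_max - t_min) * min(k / 20, 0.8)        # Eq. 5
    d_cat = CATEGORY_DELTA.get(category, 0.0)            # Eq. 6 (unknown -> 0.0 is an assumption)
    d_attempt = min(0.25, 0.08 * a)                      # Eq. 7
    d_sim = min(0.15, 0.05 * s)                          # Eq. 8
    raw = t_min + progress + d_cat + d_attempt + d_sim   # Eq. 9, before clipping
    return max(0.1, min(1.0, raw))                       # clip to [0.1, 1.0]
```

Note that the progress term is capped at 80% of the range, so the base temperature alone never reaches $\tau_{\max}$; only the failure boosts can push it to the upper clip.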

#### Complementary Nucleus Sampling\.

Adaptive top\-$p$ maintains coherence during high\-temperature exploration:

$$p_{\text{nucleus}}(\tau) = \begin{cases} 0.95 & \tau \leq 0.4 \\ 0.90 & 0.4 < \tau \leq 0.7 \\ 0.85 & \tau > 0.7 \end{cases} \tag{10}$$
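As a small sketch, the step function in Eq. 10 maps directly to a lookup. The function name is an illustrative assumption.

```python
def nucleus_p(tau: float) -> float:
    """Top-p value paired with a sampling temperature (Eq. 10)."""
    if tau <= 0.4:
        return 0.95
    if tau <= 0.7:
        return 0.90
    return 0.85
```

Higher temperatures receive a tighter nucleus, which trims the low-probability tail when the distribution is flattest.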

### B\.3 Adaptive Similarity Thresholds

#### Base Thresholds by Precision Level\.

$$\theta_{\text{base}}(p) = \begin{cases} 0.70 & p = \text{high} \\ 0.75 & p = \text{medium} \\ 0.80 & p = \text{low} \end{cases} \tag{11}$$

Lower thresholds for high\-precision content acknowledge that questions about specialized topics necessarily share technical terminology while remaining substantively different\.

#### Progressive Relaxation\.

$$\theta_{\text{sim}}(r) = \max\left(0.50,\ \theta_{\text{base}}(p) - 0.05 \times r\right) \tag{12}$$
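A minimal sketch of Eqs. 11 and 12 follows. The text does not define $r$; the code assumes it is a non-negative count of relaxation rounds, and the names are illustrative.

```python
# Base similarity thresholds per precision level (Eq. 11).
BASE_THRESHOLD = {"high": 0.70, "medium": 0.75, "low": 0.80}


def similarity_threshold(precision: str, r: int) -> float:
    """Relaxed threshold after r rounds, floored at 0.50 (Eq. 12).

    Assumption: r is a non-negative relaxation-round counter.
    """
    return max(0.50, BASE_THRESHOLD[precision] - 0.05 * r)
```

Under this reading, a high-precision document reaches the 0.50 floor after four rounds and a low-precision one after six.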

#### Weighted Similarity Computation\.

$$\text{sim}_{\text{weighted}}(q, q') = \text{Jaccard}(W_q, W_{q'}) + \alpha \cdot \text{Jaccard}(T_q, T_{q'}) \tag{13}$$

where $W$ denotes all word tokens, $T \subseteq W$ denotes technical terms, and $\alpha = 0.05$\.
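The weighted similarity in Eq. 13 can be sketched as below. The tokenizer is an assumption (the paper does not specify one), and this simplified version only matches single-token technical terms.

```python
import re
from typing import Iterable, Set

ALPHA = 0.05  # weight on the technical-term overlap (Eq. 13)


def tokenize(text: str) -> Set[str]:
    """Lowercased word tokens; a simple regex tokenizer chosen for illustration."""
    return set(re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index; two empty sets are treated as having zero overlap."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def weighted_similarity(q1: str, q2: str, technical_terms: Iterable[str]) -> float:
    """Word-level Jaccard plus ALPHA times Jaccard over technical terms."""
    w1, w2 = tokenize(q1), tokenize(q2)
    terms = {t.lower() for t in technical_terms}
    t1, t2 = w1 & terms, w2 & terms  # T is a subset of W by construction
    return jaccard(w1, w2) + ALPHA * jaccard(t1, t2)
```

Because the technical-term component is additive, two identical questions that contain technical terms score 1.05 rather than 1.0.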

For corpora with $\geq 5$ existing questions, TF\-IDF semantic similarity provides additional validation:

$$\text{sim}_{\text{semantic}}(q, q') = \cos\left(\vec{v}_{\text{TF-IDF}}(q), \vec{v}_{\text{TF-IDF}}(q')\right) \tag{14}$$

Questions are rejected if $\max_{q' \in \mathcal{Q}} \text{sim}_{\text{semantic}}(q, q') > \theta_{\text{sim}} + 0.05$\.
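The rejection rule around Eq. 14 can be sketched in pure Python as follows. The smoothed IDF variant and whitespace tokenization are assumptions, since the paper does not state which TF-IDF weighting it uses; function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors with a smoothed IDF (one common variant, assumed here)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_semantic_duplicate(candidate: str, existing: List[str], theta_sim: float) -> bool:
    """Reject when the maximum TF-IDF cosine exceeds theta_sim + 0.05."""
    if len(existing) < 5:
        return False  # the check applies only with at least 5 existing questions
    docs = [q.lower().split() for q in existing + [candidate]]
    vectors = tfidf_vectors(docs)
    best = max(cosine(vectors[-1], v) for v in vectors[:-1])
    return best > theta_sim + 0.05
```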

### B\.4 Document Chunking

Documents are segmented using a sliding\-window approach with a chunk size of 512 tokens and an overlap of 128 tokens \(25%\)\. The cross\-document linking threshold is $\theta_{\text{link}} = 0.1$\.
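A sliding-window chunker with these settings might look like the sketch below. It operates on an already tokenized sequence; which tokenizer produces the tokens is not stated in the paper, and the function name is an assumption.

```python
from typing import List, Sequence


def sliding_window_chunks(
    tokens: Sequence[str], chunk_size: int = 512, overlap: int = 128
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # 384 with the defaults
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end of the document
        start += stride
    return chunks
```

With the default settings, a 1,000-token document yields three chunks starting at offsets 0, 384, and 768, the last one being shorter than 512 tokens.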

## Appendix C Dynamic Prompt Templates

This section describes the core logic of our dynamic prompt engineering system\.

### C\.1 Prompt Architecture

All prompts consist of four modular components that combine based on generation context:

1. Base Template: Defines role \(AI Benchmark Scientist\), injects context sections, specifies JSON output format
2. Test\-Type Module: Scenario\-specific instructions based on evaluation target
3. Format Specification: Question type requirements with balance tracking
4. Diversity Directive: Activated after repeated similarity failures

### C\.2 Test\-Type Configurations

Three test types target different RAG capabilities:

Table 7: Test\-type configurations

### C\.3 Question Format Balance

We maintain a target distribution of 40% objective and 60% subjective questions\. The system tracks:

$$r_{\text{obj}} = \frac{n_{\text{obj}}}{n_{\text{obj}} + n_{\text{subj}}} \tag{15}$$

When $r_{\text{obj}} < 0.5$, the system forces objective question generation\.
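The balance tracking in Eq. 15 can be sketched as a small counter. The class name is hypothetical, and returning a ratio of 0.0 before any question exists (so that the first question is forced to be objective) is an assumption; the 0.5 cutoff is the value stated above.

```python
from dataclasses import dataclass


@dataclass
class FormatBalance:
    """Tracks objective vs. subjective question counts (Eq. 15)."""

    n_obj: int = 0
    n_subj: int = 0

    def record(self, is_objective: bool) -> None:
        """Register one accepted question."""
        if is_objective:
            self.n_obj += 1
        else:
            self.n_subj += 1

    def objective_ratio(self) -> float:
        """r_obj; defined as 0.0 when no questions exist yet (assumption)."""
        total = self.n_obj + self.n_subj
        return self.n_obj / total if total else 0.0

    def must_force_objective(self, cutoff: float = 0.5) -> bool:
        """True when r_obj falls below the cutoff stated in the text."""
        return self.objective_ratio() < cutoff
```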

#### Objective Questions \(40% target\)\.

Four subtypes with specific requirements:

- Mathematical Calculation \(30%\): Must extract ALL parameters from context; use context\-specific formulas; show step\-by\-step with units
- Fill\-in\-Blank \(25%\): Extract EXACT values with conditions and units as stated
- True/False \(25%\): Statements about specific mechanisms requiring multi\-step understanding; create both true and false statements
- Multiple Choice \(20%\): All options in one line; only context provides distinguishing details

#### Subjective Questions \(60% target\)\.

Five archetypes rotate via $(\text{question\_count}) \bmod 5$:

1. Definition/Specification: specific definitions or specification values
2. Process Explanation: process sequences or event chains
3. Causal Reasoning: reasons or purposes behind specifications
4. Comparative Analysis: compare/contrast related concepts
5. Problem Identification: potential problems, defects, or limitations
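The rotation above reduces to a modulo lookup, sketched here. Mapping a count of 0 to the first archetype (zero-based indexing) is an assumption, as the paper does not say where the rotation starts.

```python
ARCHETYPES = [
    "Definition/Specification",
    "Process Explanation",
    "Causal Reasoning",
    "Comparative Analysis",
    "Problem Identification",
]


def current_archetype(question_count: int) -> str:
    """Pick the subjective archetype for the next question via count mod 5."""
    return ARCHETYPES[question_count % len(ARCHETYPES)]
```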

### C\.4 Diversity Enhancement

When similarity failures exceed the threshold \($s > 2$\), a lightweight directive is injected:

> DIVERSITY FOCUS: Previous $s$ uniqueness attempts suggest trying a new approach
>
> RECOMMENDED STYLE: Focus on “\[current archetype from rotation\]”

This provides guidance to explore unexplored patterns without over\-constraining generation\.
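A minimal sketch of the injection logic is shown below. The function name and the two-line layout of the directive are assumptions; the wording follows the template quoted above.

```python
from typing import Optional


def diversity_directive(s: int, archetype: str, threshold: int = 2) -> Optional[str]:
    """Return the directive text once consecutive similarity failures exceed the threshold."""
    if s <= threshold:
        return None  # no directive until s > 2
    return (
        f"DIVERSITY FOCUS: Previous {s} uniqueness attempts suggest trying a new approach\n"
        f'RECOMMENDED STYLE: Focus on "{archetype}"'
    )
```

Returning `None` below the threshold lets the caller skip the fourth prompt component entirely rather than appending an empty section.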

## Appendix D Expert Validation of Generated Benchmarks

To assess generation quality, domain experts rated QA pairs from a multi\-stage ablation study on four dimensions \(1–5 scale\): accuracy \(factual correctness of answer\), relevance \(alignment with context\), difficulty \(cognitive challenge level\), and diversity \(novelty relative to other questions\)\. Each condition generated 18 QA pairs \(6 per question type: ROB, MULTI, GEN\), with precision levels \(high/medium/low\) annotated per question based on the domain knowledge base classification\.

### D\.1 Ablation Results

As shown in Table [8](https://arxiv.org/html/2605.26476#A4.T8), the full adaptive system with enhanced prompts \(Group E\) produces the highest\-quality outputs across all evaluated dimensions, with zero retry failures\.

Table 8: Expert evaluation of QA generation quality across ablation conditions \(18 QA pairs per group, 1–5 scale\)\.

Notably, adaptive temperature and threshold alone \(Group D\) do not outperform the baseline in every dimension: Difficulty \(3\.28 vs\. 4\.00\) and Diversity \(3\.17 vs\. 3\.67\) decrease, suggesting that parameter adaptation without prompt enhancement leads to conservative generation\. The full system \(Group E\) combines adaptive parameters with enhanced prompt templates \(Appendix [C](https://arxiv.org/html/2605.26476#A3)\), achieving consistent improvements across all dimensions\.

### D\.2 Per\-Question\-Type Quality Analysis

Table [9](https://arxiv.org/html/2605.26476#A4.T9) breaks down expert ratings by question type, revealing systematic quality patterns across synthesis strategies\.

Table 9: Expert\-rated quality by question type \(averaged across all ablation groups\)\.

The difficulty gradient \(ROB: 2\.38 < MULTI: 4\.12 < GEN: 4\.79\) confirms the three synthesis strategies produce appropriately graded complexity: needle\-in\-haystack questions are factual and straightforward, cross\-document multi\-hop questions require moderate reasoning, and generation quality questions demand deep analytical reasoning\. Accuracy remains high \($> 4.5$\) across all types, indicating factual correctness is maintained regardless of difficulty level\.

## Appendix E Evaluation Rubrics

All metrics use GPT\-4\.1\-mini via DeepEval’s G\-Eval with chain\-of\-thought reasoning, scoring 0–10 \(internally normalized to 0–1\)\. Rubrics adapt between objective and subjective questions\.

### E\.1 Completeness

Table 10: Completeness rubric

### E\.2 Technical Depth

Table 11: Technical depth rubric

### E\.3 Factuality

Table 12: Factuality rubric

### E\.4 Relevance

Table 13: Relevance rubric

### E\.5 Context Utilization

Table 14: Context utilization rubric

### E\.6 Support Quality

Table 15: Support quality rubric

## Appendix F Evaluation Platform Details

### F\.1 Framework Capabilities

Table 16: RAG framework integration capabilities\.

#### Key Differences\.

AnythingLLM provides full retrieval transparency through its sources field, returning document chunks with relevance scores\.

RAGFlow uses an OpenAI\-compatible API endpoint, returning responses in standard format\. Source attribution requires inference from response content\.

MaxKB does not expose retrieval context through its API; sources are visible only in the web interface\. Our adapter implements heuristic detection using domain\-specific indicators \(e\.g\., “according to”, “standard specifies”, “SEMI”\) and professional terminology presence\.

Metaso returns structured references but uses a proprietary model that cannot be reconfigured, limiting participation to cross\-framework evaluation\.

### F\.2 Adaptive Evaluation Strategy

The evaluation layer adapts metric computation based on framework capabilities\. For frameworks without source exposure, Context Utilization switches from direct source comparison to heuristic inference based on response characteristics, ensuring fair evaluation across frameworks with different transparency levels\.
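One way such heuristic inference could work is sketched below, using the three indicators named in F.1. This is an illustrative assumption, not the FAB-Bench adapter: the function name, the returned fields, and the rule that a response counts as likely grounded when it contains at least one indicator and one domain term are all hypothetical.

```python
import re
from typing import Dict, Iterable

# Indicator phrases named in the text; matched case-insensitively.
PHRASE_INDICATORS = ("according to", "standard specifies")
# "SEMI" is matched as an uppercase whole word so that "semiconductor" does not trigger it.
SEMI_PATTERN = re.compile(r"\bSEMI\b")


def heuristic_context_signal(response: str, domain_terms: Iterable[str]) -> Dict[str, object]:
    """Collect surface cues that a response drew on retrieved material."""
    lowered = response.lower()
    indicators = [p for p in PHRASE_INDICATORS if p in lowered]
    if SEMI_PATTERN.search(response):
        indicators.append("SEMI")
    terms = sorted(
        t for t in domain_terms
        if re.search(r"\b" + re.escape(t.lower()) + r"\b", lowered)
    )
    return {
        "indicators": indicators,
        "terms": terms,
        # Hypothetical decision rule, for illustration only.
        "likely_grounded": bool(indicators) and bool(terms),
    }
```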

### F\.3 Dual\-Mode Assessment

- Mode A \(with\_kb\): Standard RAG query through framework’s native retrieval pipeline
- Mode B \(without\_kb\): Gold context injected into prompt, bypassing retrieval to isolate generation capabilities

Cross\-mode score differences \($\Delta = \text{Mode B} - \text{Mode A}$\) enable failure attribution: a large positive $\Delta$ indicates retrieval failures; consistently low scores in both modes indicate generation\-stage weaknesses\.
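The attribution rule can be sketched as a simple decision function. The two cutoffs (`gap_threshold` and `low_score`) are hypothetical values chosen for illustration; the paper does not state what counts as a "large" gap or a "low" score. Scores are assumed to be on the normalized 0 to 1 scale.

```python
def attribute_failure(
    mode_a: float,
    mode_b: float,
    gap_threshold: float = 0.2,  # hypothetical cutoff for a "large positive" delta
    low_score: float = 0.5,      # hypothetical cutoff for "consistently low"
) -> str:
    """Classify the likely failure stage from Mode A (with_kb) and Mode B (without_kb) scores."""
    delta = mode_b - mode_a
    if delta > gap_threshold:
        return "retrieval"   # gold context helps a lot, so retrieval is the bottleneck
    if mode_a < low_score and mode_b < low_score:
        return "generation"  # weak even with gold context
    return "none"
```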

## Appendix G Complete Experimental Results

Table 17: DeepSeek\-v3\.2\-Exp: complete results across all context configurations\.

Table 18: Qwen\-Plus: complete results across all context configurations\.

Table 19: Gemini\-2\.5\-Flash: complete results across all context configurations\.

Table 20: Qwen\-2\.5\-72B\-Instruct: results with extended context windows up to 128K\.

Table 21: Cross\-model comparison at key context configurations\.
