How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation

arXiv cs.CL Papers

Summary

This paper introduces HieraRAG, a hierarchical framework for determining optimal granularity in RAG benchmarks. It generates 5,872 synthetic QA pairs across three dimensions and finds that ideal granularity varies by dimension, offering a portable procedure for practitioners.

arXiv:2606.12789v1 Announce Type: new

# How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation
Source: [https://arxiv.org/html/2606.12789](https://arxiv.org/html/2606.12789)
Kaustubh Dhole (kaustubh.dhole@emory.edu), Jason Fan (jason.fan@emory.edu), Eugene Agichtein (eugene.agichtein@emory.edu), and Joyce C. Ho (joyce.c.ho@emory.edu)

Department of Computer Science, Emory University, Atlanta, USA

###### Abstract\.

Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity. We present HieraRAG, a hierarchical framework for studying granularity in RAG benchmark construction, defining optimal granularity as the level that maximizes *discriminative power* (the standard deviation of generation quality across categories) within a given RAG configuration. As a case study, we generate 5,872 synthetic question–answer (QA) pairs from FineWeb-10BT across 3 dimensions (Question Complexity, Answer Type, Linguistic Variation) at 3 granularity levels (2, 4, and 8 categories). With a BM25+Falcon-3-10B pipeline, optimal granularity varies by dimension: complexity benefits from fine-grained distinctions (discriminative power: 0.053) while answer type and linguistic variation peak at medium granularity. We introduce a Coherence Ratio metric to quantify whether fine-grained splits cleanly subdivide parent categories, revealing structural differences across dimensions (Question Complexity: 0.40 vs. Answer Type: 1.44). Human evaluation of 110 stratified QA pairs confirms synthetic quality. While these specific findings reflect a single configuration, HieraRAG provides a portable procedure and validation metric for practitioners to determine evaluation granularity within their own RAG settings.

Keywords: retrieval-augmented generation, RAG evaluation, synthetic question generation, question answering

## 1. Introduction

Retrieval-augmented generation (RAG) has become the dominant approach for question-answering (QA) systems, grounding language model outputs in retrieved evidence to improve factual accuracy (Lewis et al., [2020](https://arxiv.org/html/2606.12789#bib.bib14)). As organizations deploy RAG over proprietary corpora, from enterprise documentation to scientific literature, evaluating system performance requires benchmarks that capture diverse question characteristics. Recent work demonstrates that diversity matters for RAG: diverse retrieved content improves answer quality (Wang et al., [2025](https://arxiv.org/html/2606.12789#bib.bib49)), and diverse instruction data enhances model capabilities (Liu et al., [2025](https://arxiv.org/html/2606.12789#bib.bib50)). But as synthetic QA generation becomes more popular for RAG benchmarking (Filice et al., [2025](https://arxiv.org/html/2606.12789#bib.bib1); Ip and Vongthongsri, [2025](https://arxiv.org/html/2606.12789#bib.bib42)), a fundamental question remains: at what *level of granularity should question characteristics be varied*?

QA evaluation has evolved from factoid extraction (Voorhees and Tice, [2000](https://arxiv.org/html/2606.12789#bib.bib48)) toward multi-hop reasoning (Yang et al., [2018](https://arxiv.org/html/2606.12789#bib.bib45)), diverse answer types (Yona et al., [2024](https://arxiv.org/html/2606.12789#bib.bib41)), and varied linguistic formulations (Bolotova et al., [2022](https://arxiv.org/html/2606.12789#bib.bib44)). This reveals natural dimensions along which questions vary: complexity (single- vs. multi-hop reasoning), answer type (factoid vs. abstractive), and linguistic variation (vocabulary alignment, phrasing diversity). Yet existing benchmarks treat these dimensions with inconsistent granularity. HotpotQA categorizes by answer type (Person, Date, Yes/No) but provides limited complexity control (Yang et al., [2018](https://arxiv.org/html/2606.12789#bib.bib45)). KILT spans task formats but lacks systematic linguistic variation (Petroni et al., [2021](https://arxiv.org/html/2606.12789#bib.bib47)). Should we distinguish only between “simple” and “complex” questions, or use finer categories like “factoid,” “multi-hop,” “reasoning,” and “comparative”? Finer distinctions may reveal nuanced performance differences, but they increase generation cost and risk redundant categories. Without empirical guidance, designers risk under-sampling failure modes or diluting evaluation signal with redundant distinctions.

We address this gap through HieraRAG, a hierarchical framework for synthetic RAG QA benchmark construction that systematically varies questions along 3 illustrative dimensions: (1) Question Complexity (QC), (2) Answer Type (AT), and (3) Linguistic Variation (LV). Each dimension is evaluated at 3 granularity levels: coarse (2 categories), medium (4), and fine (8). Unlike prior work that varies answer granularity while keeping questions fixed (Yona et al., [2024](https://arxiv.org/html/2606.12789#bib.bib41)), we vary question characteristics themselves to probe how RAG systems respond to realistic query diversity. To validate whether fine-grained splits are well-structured (i.e., cleanly subdividing parent categories rather than introducing unrelated constraints), we introduce a coherence ratio metric analogous to the clustering silhouette coefficient (Rousseeuw, [1987](https://arxiv.org/html/2606.12789#bib.bib40)). We generate 5,872 synthetic questions from the FineWeb-10BT corpus (Penedo et al., [2024](https://arxiv.org/html/2606.12789#bib.bib17)) and evaluate them on a `BM25+Falcon-3-10B` pipeline. To ensure synthetic question quality and validate category assignments, 2 annotators independently evaluate 110 QA pairs for correctness, answerability, and category alignment. Through our experiments, we investigate:

- RQ1: Which dimensions most differentiate RAG performance?
- RQ2: Does finer granularity reveal insights or add noise?
- RQ3: Are fine-grained splits well-structured?

Our findings reveal that optimal granularity varies by dimension: the complexity dimension benefits from fine-grained distinctions (8 categories; *discriminative power*, the standard deviation of generation quality across categories at a given granularity level, reaches 0.053) while answer type and linguistic variation peak at medium granularity (4 categories). The Coherence Ratio reveals structural differences across dimensions (QC: 0.40 vs. AT: 1.44), explaining why some dimensions benefit from finer splits while others plateau. Human evaluation confirms high synthetic quality (98% acceptable), while low consensus on fine-grained categories (29% agreement) suggests semantic ambiguity. Preliminary correlation analysis ($r_{pb}=0.24$, $p=0.21$) tentatively suggests that our data-driven Coherence Ratio aligns with this human perception, offering a potential automated proxy for separating categories.

We contribute: (1) a hierarchical framework for determining optimal evaluation granularity in RAG benchmarks, (2) a Coherence Ratio metric for validating hierarchical question category structures, (3) empirical evidence, within a single `BM25+Falcon-3-10B` configuration, that different question characteristics may require different granularity levels for diagnostic evaluation, and (4) a dataset of 5,872 hierarchically-organized questions with code for replication, available at [https://github.com/fensorechase/rag-diverse-benchmarks-synthetic-qa](https://github.com/fensorechase/rag-diverse-benchmarks-synthetic-qa).

## 2. Methods

![Refer to caption](https://arxiv.org/html/2606.12789v1/x1.png)

Figure 1. Hierarchical structure of three question dimensions (QC, AT, LV) across three granularity levels. Each dimension subdivides from coarse (2 categories) to medium (4) to fine (8). User Expertise set to “Novice” for example QAs shown.

### 2.1. Question Categorization Framework

To systematically evaluate RAG systems across diverse single-turn question settings with a predefined corpus, we adopted a hierarchical categorization framework capturing 3 key dimensions of question variation (Figure [1](https://arxiv.org/html/2606.12789#S2.F1)):

1. Question Complexity (QC) measures the cognitive demand required to answer questions, ranging from simple fact extraction to multi-hop reasoning.
2. Answer Type (AT) indicates the expected response format, distinguishing between questions requiring direct extraction of information versus those requiring synthesis or generation of new formulations. This dimension builds upon prior work exploring answer type and granularity (Yona et al., [2024](https://arxiv.org/html/2606.12789#bib.bib41)).
3. Linguistic Variation (LV) quantifies the lexical alignment between question phrasing and document content, ranging from verbatim terminology to paraphrased concepts requiring semantic understanding.

For each dimension, we define categories at 3 granularity levels: coarse \(2 categories\), medium \(4\), and fine \(8\)\. Each fine\-grained category specializes its parent medium category, which subdivides a coarse category\. Categories within each level are mutually exclusive for generation, though our coherence analysis reveals semantic overlap in practice\. This structure lets us assess whether finer distinctions reveal additional performance variations, helping practitioners choose appropriate evaluation granularity\.
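As a minimal sketch of the nested structure described above, the snippet below builds a binary hierarchy with 2 coarse, 4 medium, and 8 fine categories and enumerates its parent-children groupings. The class `Category`, the helper names, and the placeholder category labels (such as `QC.0.1`) are hypothetical and are not taken from the released code; the real category names are defined by the authors.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Category:
    """One node in the hierarchy (hypothetical structure, for illustration)."""
    name: str
    children: list[Category] = field(default_factory=list)


def build_binary_hierarchy(dimension: str, depth: int = 3) -> list[Category]:
    """Return 2 coarse roots; every non-leaf node gets 2 placeholder children."""

    def make(name: str, level: int) -> Category:
        node = Category(name)
        if level < depth:
            node.children = [make(f"{name}.{i}", level + 1) for i in (0, 1)]
        return node

    return [make(f"{dimension}.{i}", 1) for i in (0, 1)]


def categories_at_level(roots: list[Category], level: int) -> list[Category]:
    """Level 1 = coarse, 2 = medium, 3 = fine."""
    nodes = list(roots)
    for _ in range(level - 1):
        nodes = [child for node in nodes for child in node.children]
    return nodes


def splits(roots: list[Category]) -> list[tuple[str, list[str]]]:
    """Collect every (parent, children) grouping in the hierarchy."""
    found = []
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node.children:
            found.append((node.name, [c.name for c in node.children]))
            stack.extend(node.children)
    return found


roots = build_binary_hierarchy("QC")
level_sizes = [len(categories_at_level(roots, lvl)) for lvl in (1, 2, 3)]
n_splits = len(splits(roots))
print(level_sizes, n_splits)
```

Under the assumption that every parent has exactly two children, the level sizes come out as 2, 4, and 8, and the number of parent-children groupings matches the 6 splits per dimension counted in Section 2.4.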

Single-dimension assignment as a methodological choice. For RQ1 and RQ2, each generated question is assigned to a single dimension during synthesis. We adopt this design to isolate the effect of one dimension at a time, but we acknowledge two consequences. First, real questions can be characterized along all three dimensions simultaneously; the single-dimension assignment is a simplification, not a property of natural questions. Second, when generating along one dimension, the other two dimensions become uncontrolled confounders that can shift retrieval and generation performance independently of the dimension under study, which could compress or amplify the discriminative power reported (e.g., a QC-controlled batch may incidentally contain more distant-vocabulary questions, lowering its mean MAP). RQ3’s factorial design (§[3.3](https://arxiv.org/html/2606.12789#S3.SS3)) probes one such interaction. Post-hoc multi-label tagging via a few-shot classifier is a natural extension.

### 2.2. Synthetic Question Generation

We generated 5,872 questions using DataMorgana (Filice et al., [2025](https://arxiv.org/html/2606.12789#bib.bib1)), a tool for creating diverse synthetic QA benchmarks by leveraging Claude 3.5 Sonnet. (Footnote 2: See our supplemental code and Filice et al. ([2025](https://arxiv.org/html/2606.12789#bib.bib1)) for details on synthetic QA generation; we note our framework may also leverage other pipelines for synthetic QA generation (Ip and Vongthongsri, [2025](https://arxiv.org/html/2606.12789#bib.bib42)).) Questions were generated from documents randomly sampled from FineWeb-10BT (Penedo et al., [2024](https://arxiv.org/html/2606.12789#bib.bib17)), a web corpus with 10B tokens and ~15M documents. Each question includes a DataMorgana-generated reference answer and source document ID to evaluate retrieval and generation quality.

User expertise as a held variable. Across all three research questions, user expertise was randomly specified at generation time (50% novice, 50% expert) so that observed performance differences reflect question characteristics rather than user-level variation. Studying interactions between user persona and question dimensions (QC, AT, LV) is a natural extension for future work.

RQ1: We generated 1,600 questions (400 per dimension) using coarse-level categories from 4 dimensions: QC, AT, LV, and Question Phrasing (4 coarse categories only) (Wang et al., [2005](https://arxiv.org/html/2606.12789#bib.bib36); Gupta and Bendersky, [2015](https://arxiv.org/html/2606.12789#bib.bib37)). Each question belongs to one dimension, enabling independent evaluation. Our dimensions were informed by established QA datasets (Bolotova et al., [2022](https://arxiv.org/html/2606.12789#bib.bib44); Yang et al., [2018](https://arxiv.org/html/2606.12789#bib.bib45); Kwiatkowski et al., [2019](https://arxiv.org/html/2606.12789#bib.bib46); Petroni et al., [2021](https://arxiv.org/html/2606.12789#bib.bib47)) and RAG evaluation frameworks (Chen et al., [2024](https://arxiv.org/html/2606.12789#bib.bib32)).

RQ2: We generated 3,272 questions across 3 dimensions (QC, AT, LV) at 3 granularity levels (coarse/medium/fine). We focus on these three because they exhibit hierarchical structure suitable for multi-level subdivision. Sample sizes per category vary due to generation stochasticity but provide sufficient power ($n \geq 38$ per fine category).

RQ3: We generated 1,000 questions in a 2×2 factorial design, crossing LV (similar/distant) with QC (simple/complex), yielding ~250 questions per cell.

### 2.3. RAG System Configuration

We evaluate a standard two-stage RAG pipeline on an NVIDIA H100. Retrieval: We indexed 512-token chunks of FineWeb-10BT (Penedo et al., [2024](https://arxiv.org/html/2606.12789#bib.bib17)) using PyTerrier (Macdonald et al., [2021](https://arxiv.org/html/2606.12789#bib.bib2)) and retrieved $k=10$ documents via `BM25`. Generation: We employed `Falcon-3-10B-Instruct` (Team, [2024](https://arxiv.org/html/2606.12789#bib.bib18)) (temperature = 0.6) to generate answers from retrieved contexts, instructing the model to refuse if information was insufficient.

### 2.4. Evaluation Metrics

We evaluate the RAG system along two dimensions: retrieval quality and generation quality (Es et al., [2024](https://arxiv.org/html/2606.12789#bib.bib35); Chen et al., [2024](https://arxiv.org/html/2606.12789#bib.bib32)).

Retrieval: Mean Average Precision (MAP, the primary metric), nDCG@10, and Recall@10 measure whether ground-truth documents appear in the top-10 results.
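The retrieval metrics can be sketched as follows, under the assumption that relevance is binary and that each question's ground truth is the set of source document IDs attached at generation time. The function names and the example IDs are hypothetical; how chunk-level hits are mapped back to document IDs is not specified in this section and is left out here.

```python
import math


def average_precision(ranked_ids, relevant_ids):
    """AP over one ranked list with binary relevance."""
    if not relevant_ids:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank  # precision at this relevant rank
    return total / len(relevant_ids)


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents found in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG@k with binary gains and a log2 rank discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mean_average_precision(runs):
    """MAP: mean of per-question AP; `runs` is a list of (ranked, relevant)."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical example: the single ground-truth document is ranked second.
ranked = ["d7", "d3", "d9"]
relevant = {"d3"}
ap = average_precision(ranked, relevant)
rec = recall_at_k(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant)
```

With a single relevant document per question, AP reduces to the reciprocal rank of that document, which is why a hit at rank 2 yields 0.5 here.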

Generation: Cosine similarity (CS, the primary metric, computed on MiniLM-L6-v2 embeddings) (Reimers and Gurevych, [2019](https://arxiv.org/html/2606.12789#bib.bib34)), ROUGE-1 (Lin, [2004](https://arxiv.org/html/2606.12789#bib.bib13)), and BLEU (Papineni et al., [2002](https://arxiv.org/html/2606.12789#bib.bib12)) assess answer quality against the reference answer.
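For illustration, the two simplest generation metrics can be sketched in plain Python. The cosine function operates on embedding vectors that are assumed to have been produced elsewhere (the embedding model is not invoked here), and the ROUGE-1 variant is a simplified one that assumes lowercased whitespace tokens with no stemming, so its scores may differ from those of the official ROUGE package.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy inputs.
same = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
r1 = rouge1_f1("the cat sat", "the cat slept")
```

Embedding-based CS rewards semantically equivalent paraphrases, whereas the unigram overlap only rewards shared surface tokens, which is one plausible reason to treat CS as the primary generation metric.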

Discriminative Power: For each granularity level, we compute the standard deviation of cosine similarity scores across categories. Higher standard deviation indicates that categories reveal meaningful performance distinctions, justifying finer granularity.
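A minimal sketch of this computation, assuming that the standard deviation is taken over per-category mean CS and that the population form is used (the text does not say whether the sample or population estimator applies). The function names and the toy scores are hypothetical. The range statistic used for the coarse-level comparison in Table 1 is included for contrast.

```python
from statistics import mean, pstdev


def category_means(scores_by_category):
    """Mean cosine similarity per category."""
    return {cat: mean(scores) for cat, scores in scores_by_category.items()}


def discriminative_power(scores_by_category):
    """Std. dev. of per-category mean CS (population form is an assumption)."""
    return pstdev(category_means(scores_by_category).values())


def cs_range(scores_by_category):
    """max(CS) - min(CS) across categories."""
    means = category_means(scores_by_category).values()
    return max(means) - min(means)


# Hypothetical per-question CS scores for a coarse (2-category) level.
coarse = {"simple": [0.6, 0.8], "complex": [0.5, 0.5]}
disc_pow = discriminative_power(coarse)
rng = cs_range(coarse)
```

With only two categories the population standard deviation is exactly half the range, so the two statistics diverge only at medium and fine granularity.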

Information Content: We compute normalized mutual information (MI) between category assignments and performance bins. Higher MI indicates category assignments are more informative about system performance (Vinh et al., [2010](https://arxiv.org/html/2606.12789#bib.bib39)).
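One way to sketch this measure is shown below. The binning scheme (equal-width bins) and the normalization (MI divided by the arithmetic mean of the two entropies) are assumptions, since this section specifies neither; several normalizations exist and they can give different values. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """MI (in nats) between two aligned label sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log((c * n) / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )


def normalized_mi(a, b):
    """MI normalized by the mean of the two entropies (an assumed choice)."""
    denom = (entropy(a) + entropy(b)) / 2
    return mutual_information(a, b) / denom if denom > 0 else 0.0


def equal_width_bins(scores, n_bins=3):
    """Discretize scores into equal-width performance bins (assumed scheme)."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all scores tie
    return [min(int((s - lo) / width), n_bins - 1) for s in scores]


# Hypothetical toy data: categories that fully determine the bin.
categories = ["x", "x", "y", "y"]
bins = equal_width_bins([0.1, 0.1, 0.9, 0.9], n_bins=2)
nmi_informative = normalized_mi(categories, bins)
nmi_uninformative = normalized_mi(["x", "y", "x", "y"], bins)
```

The two toy cases bracket the scale: categories that fully determine the performance bin score 1, and categories that are independent of it score 0.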

Hierarchical Calibration: Within a dimension, to validate that child categories meaningfully subdivide their parents, we introduce a novel *Coherence Ratio* inspired by the silhouette coefficient in clustering (Rousseeuw, [1987](https://arxiv.org/html/2606.12789#bib.bib40)). A *split* is one parent-children grouping in the hierarchy (e.g., the medium category `summary_or_explanation` splitting into the fine categories `condensed_summary` and `sentence_extraction`). Each dimension contains 6 splits in total: 2 coarse-to-medium splits (each coarse parent has 2 medium children) and 4 medium-to-fine splits (each medium parent has 2 fine children). For a given corpus and set of questions across nested categories, this metric quantifies whether sibling categories provide non-redundant answer evaluation signals while remaining vertically consistent, and is defined as:

$$\rho_{\text{coherence}} = \frac{\sigma_{\text{sib}}}{\delta_{\text{vert}} + \epsilon}, \tag{1}$$

where $\sigma_{\text{sib}}$ is the standard deviation of CS across sibling categories (horizontal discrimination), $\delta_{\text{vert}}$ is the absolute difference between parent and mean child CS (vertical consistency), and $\epsilon = 0.001$ prevents division by zero. High $\rho$ ($\rho > 2.0$) indicates discriminative but aligned siblings (preferred); low $\rho$ ($\rho < 1.0$) suggests poor hierarchical structure (Figure [2](https://arxiv.org/html/2606.12789#S3.F2)).
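Equation (1) can be sketched directly in code. The value of `EPSILON` comes from the text; everything else is an assumption made for illustration: the function name, the use of the population standard deviation for the sibling spread, the reading of "parent CS" as the mean CS of questions generated at the parent level, and the numeric inputs, which are invented and are not results from the paper.

```python
from statistics import mean, pstdev

EPSILON = 0.001  # from the text: prevents division by zero


def coherence_ratio(parent_cs, child_cs, epsilon=EPSILON):
    """rho = sigma_sib / (delta_vert + epsilon) for one parent-children split."""
    sigma_sib = pstdev(child_cs)                 # horizontal discrimination
    delta_vert = abs(parent_cs - mean(child_cs)) # vertical consistency
    return sigma_sib / (delta_vert + epsilon)


# Hypothetical split A: children differ from each other, mean stays near parent.
rho_aligned = coherence_ratio(parent_cs=0.69, child_cs=[0.66, 0.74])

# Hypothetical split B: children nearly identical, both far from the parent.
rho_drifting = coherence_ratio(parent_cs=0.64, child_cs=[0.722, 0.729])

print(round(rho_aligned, 2), round(rho_drifting, 2))
```

Split A lands above the 2.0 threshold because the siblings spread apart while their mean tracks the parent; split B falls below 1.0 because the siblings barely differ and both drift away from the parent.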

### 2.5. Human Validation

Two annotators independently evaluated 110 QA pairs stratified across dimensions and granularity levels. Annotators rated answer correctness (0-3), hallucination severity (0-2), and question answerability (0-3) for all questions. For a subset of 60 questions, annotators additionally validated fine-grained category assignments. Inter-annotator agreement for answer correctness was moderate (Fleiss’ $\kappa = 0.47$). Reference answers proved highly reliable: 98.2% were acceptable (score $\geq 2$) and 90.9% were hallucination-free.
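For readers who want to reproduce the agreement statistic on their own annotations, Fleiss' kappa can be sketched as below. The input layout (one row per item, one column per rating category, each cell holding the number of annotators who chose that category) and the toy ratings are assumptions for illustration, not the study's data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from an items x categories matrix of rater counts.

    Every row must sum to the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Overall proportion of assignments falling in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item, averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)

    if p_expected == 1.0:
        return float("nan")  # undefined when every rating is the same category
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical ratings from 2 annotators over 2 categories.
perfect = fleiss_kappa([[2, 0], [0, 2]])
chance_level = fleiss_kappa([[2, 0], [0, 2], [1, 1], [1, 1]])
```

Kappa corrects raw agreement for the agreement expected by chance, so the second toy case scores 0 even though the annotators agree on half of the items.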

For category validation, annotators achieved 28.6% combined agreement (exact or hierarchical parent-match), significantly above the 12.5% random baseline, but reflecting the inherent difficulty of consistently applying fine-grained definitions manually. Filice et al. ([2025](https://arxiv.org/html/2606.12789#bib.bib1)) note similar challenges during DataMorgana validation and rely primarily on automated faithfulness filtering rather than fine-grained human category agreement, suggesting that low inter-annotator consensus on fine categories is a known property of synthetic QA benchmarks rather than a flaw specific to our framework. Practically, low agreement implies that fine-grained per-category performance estimates carry larger effective error bars than coarse-level estimates, and should be interpreted as relative orderings rather than absolute values.

## 3. Experimental Results

### 3.1. RQ1: Most Discriminative QA Dimensions

To determine which aspects of question variation most strongly differentiate RAG performance, we evaluated 4 dimensions independently using coarse-level categories (Table [1](https://arxiv.org/html/2606.12789#S3.T1)).

Table 1. RQ1: Coarse-level dimension comparison. Each dimension uses 2-4 coarse categories. Range = max(CS) − min(CS) across those categories within a dimension. Note that Range ≠ DiscPow in Table [2](https://arxiv.org/html/2606.12789#S3.T2); CS = Cosine Similarity.

LV emerged as the most discriminative dimension (range=0.077) despite achieving the worst absolute performance (MAP=0.369, CS=0.695), while AT achieved the best performance (MAP=0.506, CS=0.711) yet showed lower discrimination (range=0.059).

### 3.2. RQ2: Impact of Categorization Granularity

We evaluated whether increasing categorical granularity reveals additional performance distinctions (Table [2](https://arxiv.org/html/2606.12789#S3.T2)). Discriminative power (DiscPow) is measured as the standard deviation of cosine similarity scores across categories within a granularity level; higher DiscPow indicates that member categories provide greater diagnostic resolution for identifying system capabilities and limitations.

Table 2. RQ2: Discriminative power and performance over granularity levels. QC=Question Complexity; AT=Answer Type; LV=Linguistic Variation; C=Coarse (2 cat.); M=Medium (4 cat.); F=Fine (8 cat.). DiscPow = $\sigma$(CS) across categories; NMI = norm. mutual info. $\Delta$: change from parent level.

QC showed monotonic improvement across all three levels (DiscPow from 0.007 at coarse to 0.053 at fine), with a 4.8× gain from coarse to medium and an additional 1.5× gain at fine granularity. This indicates that fine-grained complexity distinctions (e.g., single-hop inference vs. multi-hop reasoning) capture meaningful performance variation beyond coarse simple/complex categorization. In contrast, AT and LV peaked at medium granularity (0.044 and 0.047 respectively) and declined at fine (0.037 and 0.031). This suggests diminishing returns beyond 4 categories for these dimensions. Here, finer distinctions add noise rather than signal. The normalized MI analysis confirms these patterns: QC increases monotonically but LV peaks at medium.
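The selection rule implied by this analysis (pick the granularity level with the highest DiscPow) can be sketched in a few lines. The dictionary below contains only the DiscPow values quoted in the prose; levels whose values are not stated in the text are deliberately omitted rather than filled in, and the function name is hypothetical.

```python
def best_granularity(disc_pow_by_level):
    """Return the granularity level with the highest discriminative power."""
    return max(disc_pow_by_level, key=disc_pow_by_level.get)


# DiscPow values quoted in the text; unstated levels are omitted.
reported = {
    "QC": {"coarse": 0.007, "fine": 0.053},
    "AT": {"medium": 0.044, "fine": 0.037},
    "LV": {"medium": 0.047, "fine": 0.031},
}

choices = {dim: best_granularity(levels) for dim, levels in reported.items()}
print(choices)
```

Practitioners applying the procedure to their own pipeline would supply all three levels per dimension, computed on their own retriever, generator, and corpus.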

Hierarchical calibration analysis of the 6 splits per dimension reveals that fine-grained splits are not uniformly well-structured across dimensions (mean $\rho_{\text{coherence}} = 0.73 \pm 0.89$). AT shows the best-calibrated split for `summary_or_explanation` (coherence=3.31), where children (`condensed_summary`, `sentence_extraction`) discriminate strongly while aligning with their parent. Conversely, QC shows poor calibration (mean coherence=0.40) despite achieving the highest discriminative power at fine granularity (DiscPow=0.053). While the 8 fine QC categories successfully partition performance space, they do not cleanly subdivide their medium-level parents.

This also explains why AT’s discriminative power decreases at fine granularity (from 0.044 to 0.037). Though there is one high-quality split (`summary_or_explanation`), the other three AT medium-to-fine splits have low mean coherence (0.82), creating fine categories that cluster together rather than expanding performance coverage. LV’s `conceptual_rephrase` split exhibits the poorest calibration, with high vertical deviation ($\delta_{\text{vert}} = 0.086$) indicating that its fine children (`domain_shift_terminology`, `abstraction_level_shift`) stray substantially from their parent’s performance profile. These poorly calibrated splits introduce constraints beyond simple subdivision, consistent with the hierarchy not being strictly taxonomic. To validate $\rho_{\text{coherence}}$, we measured the correlation between a split’s Coherence Ratio and human agreement and observed a positive but non-significant relationship ($r_{pb}=0.24$, $p=0.21$), suggesting that $\rho_{\text{coherence}}$ is a promising proxy for semantic separation.
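The point-biserial coefficient used in this validation is the Pearson correlation between a continuous variable and a binary one. A minimal sketch follows, assuming the unit of analysis is one annotated question paired with the Coherence Ratio of its split and a 0/1 agreement flag; that pairing, the function name, and the toy numbers are assumptions for illustration and do not reproduce the study's data.

```python
import math
from statistics import mean, pstdev


def point_biserial(values, flags):
    """Point-biserial correlation between continuous values and 0/1 flags."""
    group1 = [v for v, f in zip(values, flags) if f]
    group0 = [v for v, f in zip(values, flags) if not f]
    if not group1 or not group0:
        raise ValueError("both groups must be non-empty")
    spread = pstdev(values)  # population std. dev. over all observations
    if spread == 0:
        raise ValueError("values must not all be identical")
    n, n1, n0 = len(values), len(group1), len(group0)
    return (mean(group1) - mean(group0)) / spread * math.sqrt(n1 * n0 / (n * n))


# Hypothetical toy data: higher ratios coincide with annotator agreement.
ratios = [1.0, 2.0, 3.0, 4.0]
agreed = [0, 0, 1, 1]
r_pb = point_biserial(ratios, agreed)
```

A significance test for the coefficient (such as the $p$ value reported in the text) is a separate step that this sketch does not cover.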

#### 3.2.1. Which Categories Are Hardest?

Within fine granularity, we observe substantial performance variation (Table [3](https://arxiv.org/html/2606.12789#S3.T3)). For example, in QC, `extractive_span` questions (CS=0.619) underperform `comparative_synthesis_concepts` (CS=0.766) by 24%.

Table 3. RQ2 Fine-grained: Top-2 and bottom-2 performing categories per dimension by generation quality (CS).

| Dimension | Category | n | MAP | CS |
|---|---|---|---|---|
| QC | comparative_synthesis_concepts | 50 | 0.566 | 0.766 |
| QC | single_hop_inference | 59 | 0.495 | 0.755 |
| QC | entity_extraction | 62 | 0.606 | 0.636 |
| QC | extractive_span | 55 | 0.523 | 0.619 |
| AT | explanatory_synthesis | 48 | 0.395 | 0.738 |
| AT | condensed_summary | 44 | 0.349 | 0.725 |
| AT | phrase_extraction | 46 | 0.513 | 0.647 |
| AT | ordered_sequence | 38 | 0.547 | 0.625 |
| LV | abstraction_level_shift | 57 | 0.451 | 0.729 |
| LV | domain_shift_terminology | 60 | 0.333 | 0.722 |
| LV | synonym_based_rephrase | 59 | 0.293 | 0.683 |
| LV | low_lexical_overlap | 61 | 0.225 | 0.624 |

Notably, several “complex” children yield higher RAG CS than “simple” children (e.g., `comparative_synthesis_concepts` CS=0.766 vs. `extractive_span` CS=0.619), and distant-vocabulary categories achieve high generation scores (CS=0.722–0.729) despite poor retrieval (MAP=0.225–0.333).

### 3.3. RQ3: Interaction Between LV and QC

To assess whether vocabulary mismatch amplifies complexity effects, we conducted a 2×2 factorial experiment on LV×QC (Table [4](https://arxiv.org/html/2606.12789#S3.T4)).

Table 4. RQ3: LV×QC interaction. Gap shows difference (simple to complex) within LV level.

| LV | QC | n | MAP | CS | Gap |
|---|---|---|---|---|---|
| Similar | Simple | 252 | 0.547 | 0.637 | – |
| Similar | Complex | 265 | 0.566 | 0.672 | +0.035 |
| Distant | Simple | 265 | 0.237 | 0.578 | – |
| Distant | Complex | 218 | 0.164 | 0.578 | +0.000 |
| Main effects: | | | | | |
| Similar (pooled) | – | 517 | 0.557 | 0.655 | – |
| Distant (pooled) | – | 483 | 0.201 | 0.578 | −0.077 |

LV shows a strong main effect ($\Delta$=0.076, 13% relative), while complexity shows a weak effect ($\Delta$=0.018, 3%). The vocabulary gap is consistent across complexity levels: similar vocabulary helps complex questions by +0.035 CS, while distant vocabulary shows no complexity effect (+0.0). This suggests additive rather than interactive effects. Here, LV and QC are largely independent.
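The main effects and the gap contrast can be recomputed from the four cell-level CS values in Table 4. The sketch below assumes unweighted averages of cell means for the marginals; that is an assumption, chosen because it approximately reproduces the $\Delta$ values quoted in the prose (the pooled rows of the table appear to be weighted by cell size instead). The helper name is hypothetical.

```python
# Cell-level CS values from Table 4, keyed by (LV level, QC level).
cells = {
    ("similar", "simple"): 0.637,
    ("similar", "complex"): 0.672,
    ("distant", "simple"): 0.578,
    ("distant", "complex"): 0.578,
}


def marginal_mean(cells, axis, level):
    """Unweighted mean of the cells at one level of one factor."""
    values = [cs for key, cs in cells.items() if key[axis] == level]
    return sum(values) / len(values)


# Main effect of vocabulary: similar minus distant.
lv_effect = marginal_mean(cells, 0, "similar") - marginal_mean(cells, 0, "distant")

# Main effect of complexity: complex minus simple.
qc_effect = marginal_mean(cells, 1, "complex") - marginal_mean(cells, 1, "simple")

# Difference between the two simple-to-complex gaps.
gap_similar = cells[("similar", "complex")] - cells[("similar", "simple")]
gap_distant = cells[("distant", "complex")] - cells[("distant", "simple")]
gap_contrast = gap_similar - gap_distant

print(round(lv_effect, 4), round(qc_effect, 4), round(gap_contrast, 4))
```

Under this assumption the vocabulary effect is about 0.0765 and the complexity effect about 0.0175, in line with the rounded figures in the text, while the gap contrast equals the +0.035 reported for the similar-vocabulary row.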

The interaction pattern reveals that when vocabulary is mismatched, complexity becomes irrelevant because retrieval has already failed \(MAP=0\.164–0\.237 for distant vocabulary vs\. 0\.547–0\.566 for similar\)\. Complexity distinctions only matter when the system successfully retrieves relevant documents\.

![Refer to caption](https://arxiv.org/html/2606.12789v1/x2.png)

Figure 2. Demonstrated Coherence Ratio ($\rho$) calculation for two medium-to-fine splits within the AT and LV dimensions. $\rho > 2.0$ indicates discriminative-yet-aligned children (preferred); $\rho < 1.0$ suggests poor hierarchical structure.

## 4. Discussion and Conclusion

Our hierarchical framework reveals that optimal evaluation granularity varies by dimension\. QC benefits from fine\-grained distinctions \(8 categories\) while AT and LV peak at medium \(4 categories\)\. This challenges one\-size\-fits\-all benchmark design and demonstrates how single\-level evaluation can obscure critical patterns, as seen where LV dominated at coarse granularity \(RQ1\), yet QC showed strongest discriminative power hierarchically \(RQ2\)\. The additive effects of vocabulary and complexity \(RQ3\) also support independent dimension design\.

Calibration for Validation. We introduce the Coherence Ratio to validate whether fine splits are well-structured. High coherence (e.g., AT’s `summary_or_explanation`: 3.31) indicates discriminative yet aligned children, whereas low coherence (e.g., QC mean: 0.40) reveals redundant distinctions. The positive but non-significant correlation between Coherence Ratio and human agreement ($r_{pb}=0.24$, $p=0.21$) is consistent with the metric capturing semantic split boundaries, pending further validation. Practitioners can use $\rho < 1.0$ as a signal to refine category definitions or reduce granularity before large-scale QA benchmark generation.

Generalization and Limitations. Several limitations should be kept in mind when interpreting our results. First, all experiments use a single retriever (`BM25`), generator (`Falcon-3-10B`), and corpus (FineWeb-10BT). The specific granularity findings should be read as illustrative for this configuration rather than as general properties of RAG benchmarking; `BM25`’s lexical sensitivity likely amplifies the LV signal at coarse level (Table [2](https://arxiv.org/html/2606.12789#S3.T2)), and a dense or hybrid retriever may shift which dimensions are most discriminative. The portable contribution is HieraRAG’s procedure (hierarchical design, discriminative power, Coherence Ratio), not the category-level outcomes. Second, single-dimension assignment in RQ1/RQ2 leaves the other two dimensions as uncontrolled confounders that can inflate or deflate estimated discriminative power; this could be addressed using full-factorial coverage with post-hoc multi-label tagging via few-shot LLM classifiers. Third, correlation of Coherence Ratio with human agreement is positive but non-significant ($r_{pb}=0.24$, $p=0.21$, $n=60$); we treat it as a useful diagnostic that requires further validation. Finally, because properties of synthetic queries differ from human ones (Zendel et al., [2025](https://arxiv.org/html/2606.12789#bib.bib51)), future work should evaluate granularity findings against human-written benchmarks.

###### Acknowledgements\.

This work is supported by the National Science Foundation \(NSF\) grant IIS\-2145411 and CISE Graduate Fellowships under Grant No\. 2313998\. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author\(s\) and do not necessarily reflect the views of the National Science Foundation\.

## References

- V\. Bolotova, V\. Blinov, F\. Scholer, W\. B\. Croft, and M\. Sanderson \(2022\) A Non\-Factoid Question\-Answering Taxonomy\. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, New York, NY, USA, pp\. 1196–1207\. External Links: ISBN 978\-1\-4503\-8732\-3, [Link](https://doi.org/10.1145/3477495.3531926), [Document](https://dx.doi.org/10.1145/3477495.3531926)\.
- J\. Chen, H\. Lin, X\. Han, and L\. Sun \(2024\) Benchmarking large language models in retrieval\-augmented generation\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 38, pp\. 17754–17762\.
- S\. Es, J\. James, L\. E\. Anke, and S\. Schockaert \(2024\) RAGAs: automated evaluation of retrieval augmented generation\. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pp\. 150–158\.
- S\. Filice, G\. Horowitz, D\. Carmel, Z\. Karnin, L\. Lewin\-Eytan, and Y\. Maarek \(2025\) Generating Diverse Q&A Benchmarks for RAG Evaluation with DataMorgana\. arXiv\. External Links: 2501\.12789, [Document](https://dx.doi.org/10.48550/arXiv.2501.12789)\.
- M\. Gupta and M\. Bendersky \(2015\) Information retrieval with verbose queries\. Foundations and Trends in Information Retrieval 9\(3\-4\), pp\. 209–354\.
- J\. Ip and K\. Vongthongsri \(2025\) DeepEval\. Note: original\-date: 2023\-08\-10T05:35:04Z\. External Links: [Link](https://github.com/confident-ai/deepeval)\.
- T\. Kwiatkowski, J\. Palomaki, O\. Redfield, M\. Collins, A\. Parikh, C\. Alberti, D\. Epstein, I\. Polosukhin, J\. Devlin, K\. Lee, K\. Toutanova, L\. Jones, M\. Kelcey, M\. Chang, A\. M\. Dai, J\. Uszkoreit, Q\. Le, and S\. Petrov \(2019\) Natural Questions: A Benchmark for Question Answering Research\. Transactions of the Association for Computational Linguistics 7, pp\. 452–466\. Note: Place: Cambridge, MA; Publisher: MIT Press\. External Links: [Link](https://aclanthology.org/Q19-1026/), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00276)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, et al\. \(2020\) Retrieval\-augmented generation for knowledge\-intensive NLP tasks\. Advances in Neural Information Processing Systems 33, pp\. 9459–9474\.
- C\. Lin \(2004\) ROUGE: a package for automatic evaluation of summaries\. In Text Summarization Branches Out, pp\. 74–81\.
- W\. Liu, J\. Chen, K\. Ji, L\. Zhou, W\. Chen, and B\. Wang \(2025\) RAG\-Instruct: boosting LLMs with diverse retrieval\-augmented instructions\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp\. 3865–3888\.
- C\. Macdonald, N\. Tonellotto, S\. MacAvaney, and I\. Ounis \(2021\) PyTerrier: declarative experimentation in Python from BM25 to dense retrieval\. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp\. 4526–4533\.
- K\. Papineni, S\. Roukos, T\. Ward, and W\. Zhu \(2002\) BLEU: a method for automatic evaluation of machine translation\. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp\. 311–318\.
- G\. Penedo, H\. Kydlíček, A\. Lozhkov, M\. Mitchell, C\. A\. Raffel, L\. Von Werra, T\. Wolf, et al\. \(2024\) The FineWeb datasets: decanting the web for the finest text data at scale\. Advances in Neural Information Processing Systems 37, pp\. 30811–30849\.
- F\. Petroni, A\. Piktus, A\. Fan, P\. Lewis, M\. Yazdani, N\. D\. Cao, J\. Thorne, Y\. Jernite, V\. Karpukhin, J\. Maillard, V\. Plachouras, T\. Rocktäschel, and S\. Riedel \(2021\) KILT: a Benchmark for Knowledge Intensive Language Tasks\. arXiv\. Note: arXiv:2009\.02252 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2009.02252), [Document](https://dx.doi.org/10.48550/arXiv.2009.02252)\.
- N\. Reimers and I\. Gurevych \(2019\) Sentence\-BERT: sentence embeddings using Siamese BERT\-networks\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\), pp\. 3982\.
- P\. J\. Rousseeuw \(1987\) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis\. Journal of Computational and Applied Mathematics 20, pp\. 53–65\.
- F\. Team \(2024\) The Falcon 3 family of open models\. External Links: [Link](https://huggingface.co/blog/falcon3)\.
- N\. X\. Vinh, J\. Epps, and J\. Bailey \(2010\) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance\. Journal of Machine Learning Research 11, pp\. 2837–2854\.
- E\. M\. Voorhees and D\. M\. Tice \(2000\) The TREC\-8 Question Answering Track Evaluation\. NIST 3 \(en\)\. Note: Last Modified: 2017\-02\-17T13:34\-05:00; Publisher: Ellen M\. Voorhees, D M\. Tice\. External Links: [Link](https://www.nist.gov/publications/trec-8-question-answering-track-evaluation)\.
- Q\. Wang, C\. Nass, and J\. Hu \(2005\) Natural language query vs\. keyword search: effects of task complexity on search performance, participant perceptions, and preferences\. In IFIP Conference on Human\-Computer Interaction, pp\. 106–116\.
- Z\. Wang, B\. Bi, Y\. Luo, S\. Asur, and C\. N\. Cheng \(2025\) Diversity enhances an LLM’s performance in RAG and long\-context task\. arXiv preprint arXiv:2502\.09017\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\) HotpotQA: A Dataset for Diverse, Explainable Multi\-hop Question Answering\. arXiv\. Note: arXiv:1809\.09600 \[cs\]\. External Links: [Link](http://arxiv.org/abs/1809.09600), [Document](https://dx.doi.org/10.48550/arXiv.1809.09600)\.
- G\. Yona, R\. Aharoni, and M\. Geva \(2024\) Narrowing the Knowledge Evaluation Gap: Open\-Domain Question Answering with Multi\-Granularity Answers\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 6737–6751\. External Links: [Link](https://aclanthology.org/2024.acl-long.365/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.365)\.
- O\. Zendel, S\. F\. D\. Al Lawati, L\. Rashidi, F\. Scholer, and M\. Sanderson \(2025\) A comparative analysis of linguistic and retrieval diversity in LLM\-generated search queries\. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pp\. 4014–4023\.
